❓ Motivation
- MapReduce is not enough because it is too slow (every map task writes its output to disk before the reduce tasks can read it)
- Iterative workloads (ex: training jobs for ML) and interactive workloads (ex: ad hoc queries against an already-loaded dataset) are a poor fit for MapReduce, since each pass pays the full disk I/O cost again
Spark can be used under the hood of all these applications!

💽 Resilient Distributed Datasets (RDDs)
- The core abstraction proposed by Apache Spark
- An immutable, partitioned collection of records
- Cannot modify individual records; any transformation you apply runs over the whole dataset and produces a NEW RDD
- RDDs can be cached in memory, so they are reused efficiently across computations

- RDDs support two kinds of operations: transformations (filter, join, map, etc.) that lazily build new RDDs, and actions (count, collect, etc.) that actually run the computation and return a result (see the sketch after this list)
- FASTER THAN MAPREDUCE BECAUSE persisting to disk is chosen by you (opt-in via cache/persist), whereas MapReduce forces a disk write after every stage
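
A minimal sketch of the transformation/action split (assuming a local Spark install; the object name `RddDemo` and the data are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddDemo {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark on all local cores, just for illustration
    val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

    val nums    = sc.parallelize(1 to 1000)     // base RDD
    val evens   = nums.filter(_ % 2 == 0)       // transformation: lazy, builds a new RDD
    val squares = evens.map(n => n * n)         // another lazy transformation
    squares.cache()                             // opt-in persistence, nothing is forced to disk

    println(squares.count())                    // action: triggers the actual computation
    println(squares.take(5).mkString(", "))     // second action reuses the cached RDD

    sc.stop()
  }
}
```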
😵 Fault Tolerance
- Log the coarse-grained lineage of each RDD (the sequence of transformations that built it), not the data itself
- Upon failure, recompute only the lost partitions by replaying that lineage
- IF NO FAILURES OCCUR, YOU INCUR NO OVERHEAD OF CHECKPOINTING (unlike Pregel)
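
A rough sketch of what that lineage looks like (the log path is hypothetical; `toDebugString` is the real RDD method that prints the lineage graph):

```scala
// assuming `sc` is a live SparkContext, as in the sketch above
val lines  = sc.textFile("hdfs://namenode/logs/events.log")  // hypothetical input path
val errors = lines.filter(_.contains("ERROR"))               // lineage step 1
val codes  = errors.map(line => line.split(" ")(1))          // lineage step 2

// Spark records only this coarse-grained chain of transformations, not the data.
// If a partition of `codes` is lost, just that partition is rebuilt by
// re-running filter + map on the matching input partition.
println(codes.toDebugString)
```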
📝 GraphX on Spark example
- Graph data is stored in the form of triplets: (source vertex attribute, edge attribute, destination vertex attribute) (first image)
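
A tiny sketch of the triplet view (the vertex/edge values are made up; `Graph` and `triplets` are the real GraphX API):

```scala
import org.apache.spark.graphx.{Edge, Graph}

// assuming `sc` is a live SparkContext, as in the sketches above
val vertices = sc.parallelize(Seq(
  (1L, "alice"), (2L, "bob"), (3L, "carol")))        // (VertexId, vertex attribute)
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"))) // (src, dst, edge attribute)

val graph = Graph(vertices, edges)

// each triplet bundles source attribute, edge attribute, and destination attribute
graph.triplets.collect().foreach { t =>
  println(s"${t.srcAttr} --${t.attr}--> ${t.dstAttr}")
}
```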