❓ Motivation
- MapReduce is not enough because it is too slow (every map task writes its output to disk before the reduce tasks can read it)
- Iterative workloads (ex: training jobs for ML) and interactive workloads (ex: ad hoc queries against an already-loaded dataset) are a poor fit for MapReduce, since each pass pays the full disk I/O cost again
Spark can be used under the hood of all these applications!

💽 Resilient Distributed Datasets (RDDs)
- The core abstraction proposed by Apache Spark
- An immutable, partitioned collection of records
- Cannot modify individual records; any transformation you apply runs over the whole dataset and produces a NEW RDD
- RDDs can be cached in memory, so they are reused efficiently across computations

- RDDs support two kinds of operations: transformations (filter, join, map, etc.) that lazily build new RDDs, and actions (count, collect, etc.) that actually run the computation and return a result (see the sketch after this list)
- FASTER THAN MAPREDUCE BECAUSE persisting to disk is chosen by you (opt-in via cache/persist), whereas MapReduce forces a disk write after every stage
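
A minimal sketch of the transformation/action split (assuming a local Spark install; the object name `RddDemo` and the data are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddDemo {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark on all local cores, just for illustration
    val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

    val nums    = sc.parallelize(1 to 1000)     // base RDD
    val evens   = nums.filter(_ % 2 == 0)       // transformation: lazy, builds a new RDD
    val squares = evens.map(n => n * n)         // another lazy transformation
    squares.cache()                             // opt-in persistence, nothing is forced to disk

    println(squares.count())                    // action: triggers the actual computation
    println(squares.take(5).mkString(", "))     // second action reuses the cached RDD

    sc.stop()
  }
}
```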
😵 Fault Tolerance
- Log the coarse-grained lineage of each RDD (the sequence of transformations that built it), not the data itself
- Upon failure, recompute only the lost partitions by replaying that lineage
- IF NO FAILURES OCCUR, YOU INCUR NO OVERHEAD OF CHECKPOINTING (unlike Pregel)
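
A rough sketch of what that lineage looks like (the log path is hypothetical; `toDebugString` is the real RDD method that prints the lineage graph):

```scala
// assuming `sc` is a live SparkContext, as in the sketch above
val lines  = sc.textFile("hdfs://namenode/logs/events.log")  // hypothetical input path
val errors = lines.filter(_.contains("ERROR"))               // lineage step 1
val codes  = errors.map(line => line.split(" ")(1))          // lineage step 2

// Spark records only this coarse-grained chain of transformations, not the data.
// If a partition of `codes` is lost, just that partition is rebuilt by
// re-running filter + map on the matching input partition.
println(codes.toDebugString)
```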
📝 GraphX on Spark example
- Graph data is stored in the form of triplets: (source vertex attribute, edge attribute, destination vertex attribute) (first image)
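
A tiny sketch of the triplet view (the vertex/edge values are made up; `Graph` and `triplets` are the real GraphX API):

```scala
import org.apache.spark.graphx.{Edge, Graph}

// assuming `sc` is a live SparkContext, as in the sketches above
val vertices = sc.parallelize(Seq(
  (1L, "alice"), (2L, "bob"), (3L, "carol")))        // (VertexId, vertex attribute)
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"))) // (src, dst, edge attribute)

val graph = Graph(vertices, edges)

// each triplet bundles source attribute, edge attribute, and destination attribute
graph.triplets.collect().foreach { t =>
  println(s"${t.srcAttr} --${t.attr}--> ${t.dstAttr}")
}
```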