❓Motivation
- Batch processing: MapReduce is slow; you have to wait for the entire computation to finish before seeing any results
- Stream processing: allows you to see results in real-time
- Ex: calculating surge prices, aggregating LinkedIn updates into an auto-email, understanding Netflix user behavior
- Apache Storm is a framework for stream processing
- Graph processing: use case is for data in the form of a graph (social media, etc.) and when need to do tasks related to graph algorithms (PageRank, shortest path, etc.)
- Google’s Pregel is a framework for graph processing
Stream Processing
- Stream is a constant, unbounded sequence of incoming tuples, where each tuple is an ordered list of elements
<user1, tweet> <user2, tweet> ...
- Spout is the source of stream(s)
- Data is generated from a crawler or DB
- The same spout can be the source of multiple streams
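A minimal sketch of a spout in Java, in the style of the Storm 2.x API (the class name TweetSpout and the placeholder tuple contents are made up; package and method signatures can vary slightly by Storm version):

```java
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// Hypothetical spout that emits <user, tweet> tuples into the topology.
public class TweetSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;  // handle used to emit tuples
    }

    @Override
    public void nextTuple() {
        // A real spout would pull data from a crawler or a DB;
        // here we just emit a placeholder tuple.
        collector.emit(new Values("user1", "some tweet text"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // A tuple is an ordered list of elements; this names its fields.
        declarer.declare(new Fields("user", "tweet"));
    }
}
```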
- Topology is a directed graph representation of a Storm application

- Bolt takes input stream → performs some processing on it → produces a new stream
- FILTER on condition
- JOIN on condition to output pairs (A, B) → NOTE that joins are done within a window, because the input is an unbounded stream so we cannot possibly join all pairs across the entire duration of processing
- APPLY or TRANSFORM and many other bolt flavors available …
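A sketch of a FILTER-flavored bolt in the same Storm 2.x style (KeywordFilterBolt and its keyword parameter are hypothetical; BaseBasicBolt is a convenience base class that acks tuples automatically):

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical FILTER bolt: passes through only tweets containing a keyword.
public class KeywordFilterBolt extends BaseBasicBolt {
    private final String keyword;

    public KeywordFilterBolt(String keyword) {
        this.keyword = keyword;
    }

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String user = input.getStringByField("user");
        String tweet = input.getStringByField("tweet");
        if (tweet.contains(keyword)) {
            // Emit onto the new (output) stream; filtered-out tuples are simply dropped.
            collector.emit(new Values(user, tweet));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("user", "tweet"));
    }
}
```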
- Grouping strategies for parallelizing
- Shuffle Grouping: distributes tuples evenly across tasks in round-robin fashion
- Fields Grouping: partitions the stream by the value of chosen field(s), e.g., range-based on the key (Bolt 1 handles [A-H], Bolt 2 handles [I-Q], Bolt 3 handles [R-Z])
- All Grouping: all bolt tasks receive all inputs (ex: useful for a JOIN where every task gets all of Stream A while Stream B is partitioned across the tasks, so each task does its chunk of the join); the groupings are wired up in the topology sketch below
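How these groupings might be declared when wiring a topology, sketched with Storm's TopologyBuilder (it reuses TweetSpout and KeywordFilterBolt from the sketches above as stand-ins; the component names and parallelism hints are arbitrary):

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class TweetTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout feeding the topology (TweetSpout from the earlier sketch).
        builder.setSpout("tweets", new TweetSpout(), 2);

        // Shuffle grouping: tuples are spread evenly across the 4 filter tasks.
        builder.setBolt("filter", new KeywordFilterBolt("storm"), 4)
               .shuffleGrouping("tweets");

        // Fields grouping: tuples with the same "user" value always go to the same task.
        // (KeywordFilterBolt is just a stand-in for a per-user aggregation bolt.)
        builder.setBolt("per-user", new KeywordFilterBolt("storm"), 4)
               .fieldsGrouping("filter", new Fields("user"));

        // All grouping: every task of this bolt sees every tuple of this stream
        // (e.g., one side of a JOIN is replicated while the other is partitioned).
        builder.setBolt("join-side-a", new KeywordFilterBolt("storm"), 4)
               .allGrouping("filter");

        // Submits the topology to the Nimbus master for scheduling on the cluster.
        StormSubmitter.submitTopology("tweet-topology", new Config(),
                                      builder.createTopology());
    }
}
```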
⛈️ Storm Cluster
- Master Node (AKA coordinator AKA leader)
- Runs Nimbus (daemon) that distributes code across cluster, assigns tasks to machines, monitors failures across machines
- Worker Node
- Runs on each machine
- Runs Supervisor (daemon) that listens for work assigned to it
- Runs Executors that contain a group of tasks
- Zookeeper
- Coordinates between Nimbus and Supervisors
- Maintains state for Nimbus and the Supervisors

😵 Fault Tolerance in Storm
- Anchoring: every output tuple is anchored to the input tuple(s) it came from; the framework tracks all these anchors
EMIT(input_tuple, output) → emits an output tuple anchored to the input tuple
ACK(input_tuple) → sent from bolt indicating that tuple was processed
FAIL(input_tuple) → sent from a bolt if there was an exception or error → the input tuple is replayed (sent again from the source)
- MUST ACK or FAIL every anchored tuple; otherwise the framework hogs memory keeping track of all the outstanding anchors
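A sketch of what EMIT, ACK, and FAIL look like inside a bolt that manages acking manually (Storm 2.x BaseRichBolt style; AnchoredUppercaseBolt and its output field are made up for illustration):

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt showing manual anchoring, ack, and fail.
public class AnchoredUppercaseBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            String tweet = input.getStringByField("tweet");
            // EMIT(input, output): the output tuple is anchored to the input tuple,
            // so the framework can trace failures back through the tuple tree.
            collector.emit(input, new Values(tweet.toUpperCase()));
            // ACK(input): tell Storm this input tuple was fully processed.
            collector.ack(input);
        } catch (Exception e) {
            // FAIL(input): Storm arranges for the input tuple to be replayed.
            collector.fail(input);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("tweet_upper"));
    }
}
```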
Graph Processing
📚 Basics
- Each vertex has a value; computation proceeds in iterations
- GATHER STEP: get the values of immediate neighbors
- APPLY STEP: each vertex does some local computation
- SCATTER STEP: each vertex updates its value and sends it out to its neighbors
- Graph processing terminates after a fixed number of iterations or once all vertex values converge (stop changing)
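A toy, framework-free Java sketch of this gather → apply → scatter loop, where each vertex repeatedly adopts the minimum label among itself and its neighbors; the small example graph, the min-label rule, and the iteration cap are all made up for illustration (this is not the Pregel API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each vertex keeps a value (its label); every iteration it adopts the minimum
// label seen among itself and its immediate neighbors until nothing changes.
public class GatherApplyScatter {
    public static void main(String[] args) {
        // Undirected toy graph as adjacency lists: edges 0-1, 1-2, 3-4.
        Map<Integer, List<Integer>> neighbors = new HashMap<>();
        neighbors.put(0, List.of(1));
        neighbors.put(1, List.of(0, 2));
        neighbors.put(2, List.of(1));
        neighbors.put(3, List.of(4));
        neighbors.put(4, List.of(3));

        // Each vertex's value starts as its own id.
        Map<Integer, Integer> value = new HashMap<>();
        for (int v : neighbors.keySet()) {
            value.put(v, v);
        }

        int maxIterations = 10;
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            Map<Integer, Integer> next = new HashMap<>(value);
            for (int v : neighbors.keySet()) {
                // GATHER: read the current values of immediate neighbors.
                int best = value.get(v);
                for (int u : neighbors.get(v)) {
                    best = Math.min(best, value.get(u));
                }
                // APPLY: local computation updates this vertex's own value.
                if (best != next.get(v)) {
                    next.put(v, best);
                    changed = true;
                }
            }
            // SCATTER: new values become visible to neighbors in the next iteration.
            value = next;
            if (!changed) break;  // all vertex values converged (stopped changing)
        }
        System.out.println(value);  // {0=0, 1=0, 2=0, 3=3, 4=3}
    }
}
```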