❓Motivation
- Batch processing: MapReduce is slow; you have to wait for the entire computation to finish before seeing any results
- Stream processing: allows you to see results in real-time
- Ex: calculating surge prices, aggregating LinkedIn updates into an auto-email, understanding Netflix user behavior
- Apache Storm is a framework for stream processing
- Graph processing: use case is for data in the form of a graph (social media, etc.) and when need to do tasks related to graph algorithms (PageRank, shortest path, etc.)
- Google’s Pregel is a framework for graph processing
Stream Processing
- Stream is a constant, unbounded sequence of incoming tuples, where each tuple is an ordered list of elements
<user1, tweet> <user2, tweet> ...
- Spout is the source of stream(s)
- Data is generated from a crawler or DB
- The same spout can be the source of multiple streams
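A minimal sketch of a spout in Java, in the style of the Storm 2.x API (the class name TweetSpout and the placeholder tuple contents are made up; package and method signatures can vary slightly by Storm version):

```java
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// Hypothetical spout that emits <user, tweet> tuples into the topology.
public class TweetSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;  // handle used to emit tuples
    }

    @Override
    public void nextTuple() {
        // A real spout would pull data from a crawler or a DB;
        // here we just emit a placeholder tuple.
        collector.emit(new Values("user1", "some tweet text"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // A tuple is an ordered list of elements; this names its fields.
        declarer.declare(new Fields("user", "tweet"));
    }
}
```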
- Topology is a directed graph representation of a Storm application

- Bolt takes input stream → performs some processing on it → produces a new stream
- FILTER on condition
- JOIN on condition to output pairs (A, B) → NOTE that joins are done within a window, because the input is an unbounded stream so we cannot possibly join all pairs across the entire duration of processing
- APPLY or TRANSFORM and many other bolt flavors available …
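A sketch of a FILTER-flavored bolt in the same Storm 2.x style (KeywordFilterBolt and its keyword parameter are hypothetical; BaseBasicBolt is a convenience base class that acks tuples automatically):

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical FILTER bolt: passes through only tweets containing a keyword.
public class KeywordFilterBolt extends BaseBasicBolt {
    private final String keyword;

    public KeywordFilterBolt(String keyword) {
        this.keyword = keyword;
    }

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String user = input.getStringByField("user");
        String tweet = input.getStringByField("tweet");
        if (tweet.contains(keyword)) {
            // Emit onto the new (output) stream; filtered-out tuples are simply dropped.
            collector.emit(new Values(user, tweet));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("user", "tweet"));
    }
}
```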
- Grouping strategies for parallelizing
- Shuffle Grouping: distributes tuples evenly across tasks in round-robin fashion
- Fields Grouping: partitions the stream by the value of chosen field(s), e.g., range-based on the key (Bolt 1 handles [A-H], Bolt 2 handles [I-Q], Bolt 3 handles [R-Z])
- All Grouping: all bolt tasks receive all inputs (ex: useful for a JOIN where every task gets all of Stream A while Stream B is partitioned across the tasks, so each task does its chunk of the join); the groupings are wired up in the topology sketch below
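How these groupings might be declared when wiring a topology, sketched with Storm's TopologyBuilder (it reuses TweetSpout and KeywordFilterBolt from the sketches above as stand-ins; the component names and parallelism hints are arbitrary):

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class TweetTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout feeding the topology (TweetSpout from the earlier sketch).
        builder.setSpout("tweets", new TweetSpout(), 2);

        // Shuffle grouping: tuples are spread evenly across the 4 filter tasks.
        builder.setBolt("filter", new KeywordFilterBolt("storm"), 4)
               .shuffleGrouping("tweets");

        // Fields grouping: tuples with the same "user" value always go to the same task.
        // (KeywordFilterBolt is just a stand-in for a per-user aggregation bolt.)
        builder.setBolt("per-user", new KeywordFilterBolt("storm"), 4)
               .fieldsGrouping("filter", new Fields("user"));

        // All grouping: every task of this bolt sees every tuple of this stream
        // (e.g., one side of a JOIN is replicated while the other is partitioned).
        builder.setBolt("join-side-a", new KeywordFilterBolt("storm"), 4)
               .allGrouping("filter");

        // Submits the topology to the Nimbus master for scheduling on the cluster.
        StormSubmitter.submitTopology("tweet-topology", new Config(),
                                      builder.createTopology());
    }
}
```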
⛈️ Storm Cluster
- Master Node (AKA coordinator AKA leader)
- Runs Nimbus (daemon) that distributes code across cluster, assigns tasks to machines, monitors failures across machines
- Worker Node
- Runs on each machine
- Runs Supervisor (daemon) that listens for work assigned to it
- Runs Executors that contain a group of tasks
- Zookeeper
- Coordinates between Nimbus and Supervisors
- Maintains state for Nimbus and the Supervisors

😵 Fault Tolerance in Storm
- Anchoring: every output tuple is anchored to the input tuple(s) it came from; the framework tracks all these anchors
EMIT(input_tuple, output) → emits an output tuple anchored to the input tuple
ACK(input_tuple) → sent from bolt indicating that tuple was processed
FAIL(input_tuple) → sent from a bolt if there was an exception or error → the input tuple is replayed (sent again from the source)
- MUST ACK or FAIL every anchored tuple; otherwise the framework hogs memory keeping track of all the outstanding anchors
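A sketch of what EMIT, ACK, and FAIL look like inside a bolt that manages acking manually (Storm 2.x BaseRichBolt style; AnchoredUppercaseBolt and its output field are made up for illustration):

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt showing manual anchoring, ack, and fail.
public class AnchoredUppercaseBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            String tweet = input.getStringByField("tweet");
            // EMIT(input, output): the output tuple is anchored to the input tuple,
            // so the framework can trace failures back through the tuple tree.
            collector.emit(input, new Values(tweet.toUpperCase()));
            // ACK(input): tell Storm this input tuple was fully processed.
            collector.ack(input);
        } catch (Exception e) {
            // FAIL(input): Storm arranges for the input tuple to be replayed.
            collector.fail(input);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("tweet_upper"));
    }
}
```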
Graph Processing
📚 Basics
- Each vertex has a value; computation proceeds in iterations
- GATHER STEP: get the values of immediate neighbors
- APPLY STEP: each vertex does some local computation
- SCATTER STEP: each vertex updates its value and sends it out to its neighbors
- Graph processing terminates after a fixed number of iterations or once all vertex values converge (stop changing)
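A toy, framework-free Java sketch of this gather → apply → scatter loop, where each vertex repeatedly adopts the minimum label among itself and its neighbors; the small example graph, the min-label rule, and the iteration cap are all made up for illustration (this is not the Pregel API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each vertex keeps a value (its label); every iteration it adopts the minimum
// label seen among itself and its immediate neighbors until nothing changes.
public class GatherApplyScatter {
    public static void main(String[] args) {
        // Undirected toy graph as adjacency lists: edges 0-1, 1-2, 3-4.
        Map<Integer, List<Integer>> neighbors = new HashMap<>();
        neighbors.put(0, List.of(1));
        neighbors.put(1, List.of(0, 2));
        neighbors.put(2, List.of(1));
        neighbors.put(3, List.of(4));
        neighbors.put(4, List.of(3));

        // Each vertex's value starts as its own id.
        Map<Integer, Integer> value = new HashMap<>();
        for (int v : neighbors.keySet()) {
            value.put(v, v);
        }

        int maxIterations = 10;
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            Map<Integer, Integer> next = new HashMap<>(value);
            for (int v : neighbors.keySet()) {
                // GATHER: read the current values of immediate neighbors.
                int best = value.get(v);
                for (int u : neighbors.get(v)) {
                    best = Math.min(best, value.get(u));
                }
                // APPLY: local computation updates this vertex's own value.
                if (best != next.get(v)) {
                    next.put(v, best);
                    changed = true;
                }
            }
            // SCATTER: new values become visible to neighbors in the next iteration.
            value = next;
            if (!changed) break;  // all vertex values converged (stopped changing)
        }
        System.out.println(value);  // {0=0, 1=0, 2=0, 3=3, 4=3}
    }
}
```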