Apache Storm for stream processing
Rupesh Choudhary
Experienced Database Professional and Data Engineer | PostgreSQL |SRE| | Snowflake | DevOps | Data Engineering| Vice President
This is my first post after a long time. I have been thinking for a while about starting to blog on things I have been playing around with in my leisure time. In the next few posts in this series I will walk you through the Apache Storm architecture and data model. In this post I would like to share some of the key features of Apache Storm and how it helps process streaming data from multiple sources such as Twitter feeds, ZeroMQ, Kafka, etc.
To get started with Apache Storm, understanding its framework is vital. At its core, Apache Storm is a streaming-data framework capable of very high ingestion rates.
- It is a distributed, fault-tolerant, real-time big data processing system. Though it is stateless, it manages the distributed environment and cluster state via Apache ZooKeeper.
- It is simple and allows all kinds of manipulations on real-time data in parallel.
- While Hadoop processes data via MapReduce jobs, a Storm cluster runs topologies. The major difference between the two is that a MapReduce job starts, processes, and eventually ends, while a topology, once started, is intended to keep processing live data forever unless it is killed or terminated.
- There are two kinds of nodes in a Storm cluster, similar to Hadoop: a master node and worker nodes. The master runs a daemon called "Nimbus", which is similar to Hadoop's ResourceManager. Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
- Each worker node runs a daemon called the Supervisor. The Supervisor listens for work assigned by Nimbus and starts and stops worker processes as necessary; each worker process is a separate JVM process on its node, and a Supervisor can run one or more of them.
- Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.
- These worker processes can execute one or more tasks (spouts/bolts) in parallel.
- A spout is a source of a data stream and can be "unreliable" (fire-and-forget) or "reliable" (can replay failed tuples).
- A bolt consumes one or more incoming streams, from spouts or other bolts, and produces new streams. A bolt can do anything: run functions, filter tuples, perform joins, talk to databases, etc. A complex stream transformation requires multiple steps and thus multiple bolts.
The above diagram shows a sample Storm topology in which the spout acts as the receiver of data from an external source and the creator of streams for the bolts to do the actual processing. Bolts can be chained serially or in parallel depending on what kind of processing we want to do.
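To make the spout-to-bolt flow concrete, here is a minimal plain-Java simulation of a spout feeding a two-bolt chain. This is not Storm's API; the class and method names are invented for this sketch, and real topologies would wire Spout and Bolt classes together with a TopologyBuilder.

```java
import java.util.ArrayList;
import java.util.List;

// A toy pipeline: a "spout" emits sentence tuples, a first "bolt" splits
// them into word tuples, and a second "bolt" uppercases each word.
public class MiniTopology {

    // Spout stand-in: the source of the stream.
    public static List<String> spout() {
        return List.of("storm processes streams", "bolts transform tuples");
    }

    // Bolt stand-in: split each sentence tuple into word tuples.
    public static List<String> splitBolt(List<String> sentences) {
        List<String> words = new ArrayList<>();
        for (String s : sentences) {
            for (String w : s.split(" ")) {
                words.add(w);
            }
        }
        return words;
    }

    // Bolt stand-in: a simple per-tuple function, chained after the splitter.
    public static List<String> upperBolt(List<String> words) {
        List<String> out = new ArrayList<>();
        for (String w : words) {
            out.add(w.toUpperCase());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(upperBolt(splitBolt(spout())));
    }
}
```

The key idea carried over from Storm is that each stage only sees a stream of tuples and emits a new stream; the chaining order, not the stages themselves, defines the topology.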
A Storm topology commonly uses the following patterns:
- Streaming joins
- Batching
- BasicBolt
- In-memory caching + Fields grouping combo
- Streaming top N
- TimeCacheMap for efficiently keeping a cache of things that have been recently updated.
- CoordinatedBolt and KeyedFairBolt for distributed RPC.
Streaming Joins
- Streaming joins combine two or more streams based on some common field. Unlike a database join, which has finite input and clear semantics, a streaming join has infinite input and no obvious semantics for what a join should mean.
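A minimal sketch of one possible semantics, a keyed join over in-memory buffers, assuming both streams share a key field (all names here are illustrative, not Storm API). Because the input is infinite, a real implementation would also need eviction, e.g. a time window, which is omitted here:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Tuples from two streams are buffered by key; whenever both sides have a
// value for a key, a joined result is emitted.
public class StreamJoin {
    private final Map<String, String> left = new HashMap<>();
    private final Map<String, String> right = new HashMap<>();
    private final List<String> joined = new ArrayList<>();

    public void onLeft(String key, String value) {
        left.put(key, value);
        tryJoin(key);
    }

    public void onRight(String key, String value) {
        right.put(key, value);
        tryJoin(key);
    }

    // Emit a joined tuple once both sides are present for the key.
    private void tryJoin(String key) {
        if (left.containsKey(key) && right.containsKey(key)) {
            joined.add(key + ":" + left.get(key) + "/" + right.get(key));
        }
    }

    public List<String> results() {
        return joined;
    }
}
```

In a topology, a fields grouping on the join key on both input streams would ensure that tuples with the same key arrive at the same join task.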
Batching
- Oftentimes, for efficiency or other reasons, you may want to process a group of tuples together rather than individually. If you want reliability in your data processing, the right way to do this is to hold on to the tuples in an instance variable while the bolt waits to do the batching.
- Once you perform the batch operation, you then ack all the tuples you were holding on to.
- If the bolt emits tuples downstream, you may want to use multi-anchoring to ensure reliability. It all depends on the specific application.
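The hold-then-ack batching pattern can be sketched like this in plain Java, with a list standing in for Storm's collector acks (all names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Tuples are held in an instance variable until the batch size is reached,
// the batch is processed in one go, and only then are the held tuples acked.
public class BatchingBolt {
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();
    private final List<String> acked = new ArrayList<>();
    private final List<List<String>> flushedBatches = new ArrayList<>();

    public BatchingBolt(int batchSize) {
        this.batchSize = batchSize;
    }

    public void execute(String tuple) {
        pending.add(tuple); // hold on to the tuple; do not ack yet
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    private void flush() {
        flushedBatches.add(new ArrayList<>(pending)); // one batch operation
        acked.addAll(pending);                        // now ack everything held
        pending.clear();
    }

    public List<String> ackedTuples() {
        return acked;
    }

    public List<List<String>> batches() {
        return flushedBatches;
    }
}
```

The point of deferring the ack is that if the worker dies before the batch is processed, the unacked tuples will be replayed by a reliable spout.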
BasicBolt
- Many bolts follow a similar pattern: read an input tuple, emit zero or more tuples based on it, and then ack the input tuple immediately at the end of the execute method.
- Bolts that match this pattern are things like functions and filters.
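A sketch of that contract, with a "framework" loop doing the automatic ack after each execute call (plain Java, names invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// A filter bolt in the BasicBolt style: execute() emits zero or one tuples
// per input, and the surrounding loop acks each input automatically.
public class BasicFilterBolt {

    // Emit the tuple only if it passes the filter (here: even numbers).
    public static List<Integer> execute(int tuple) {
        if (tuple % 2 == 0) {
            return List.of(tuple);
        }
        return List.of();
    }

    // Stand-in for the framework: call execute, then auto-ack the input.
    public static List<Integer> run(List<Integer> input, List<Integer> ackLog) {
        List<Integer> emitted = new ArrayList<>();
        for (int t : input) {
            emitted.addAll(execute(t));
            ackLog.add(t); // acked right after execute, the BasicBolt pattern
        }
        return emitted;
    }
}
```

Note that every input is acked, including filtered-out ones; dropping a tuple still counts as successfully processing it.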
In-memory caching + fields grouping combo
- Caching becomes much more powerful when you combine it with a fields grouping.
For example, suppose you have a bolt that expands short URLs (like bit.ly, t.co, etc.) into long URLs. You can increase performance by keeping an LRU cache of short URL to long URL expansions to avoid doing the same HTTP requests over and over.
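A sketch of such an LRU cache using the JDK's LinkedHashMap in access order. The expansion itself is stubbed out here (a real bolt would make an HTTP request); in a topology, a fields grouping on the short URL would send the same URL to the same task, so its cache entry actually gets reused:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache of short URL -> long URL expansions. Once the cache exceeds its
// capacity, the least recently used entry is evicted.
public class UrlCache {
    private final Map<String, String> cache;
    private int misses = 0;

    public UrlCache(final int capacity) {
        // Access-order LinkedHashMap plus removeEldestEntry gives LRU eviction.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity;
            }
        };
    }

    public String expand(String shortUrl) {
        String cached = cache.get(shortUrl);
        if (cached != null) {
            return cached; // cache hit: no HTTP request needed
        }
        misses++; // stand-in for the real HTTP expansion call
        String longUrl = "https://example.com/" + shortUrl;
        cache.put(shortUrl, longUrl);
        return longUrl;
    }

    public int misses() {
        return misses;
    }
}
```

Each miss here is where the expensive HTTP request would happen, so the hit rate of the cache translates directly into saved requests.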
Streaming Top N
- A common continuous computation done on Storm is a "streaming top N" of some sort.
- Suppose you have a bolt that emits tuples of the form ["value", "count"] and you want a bolt that emits the top N tuples based on count.
- The simplest way to do this is to have a bolt that does a global grouping on the stream and maintains a list in memory of the top N items.
- This approach obviously doesn't scale to large streams since the entire stream has to go through one task.
- A better way to do the computation is to do many top N's in parallel across partitions of the stream, and then merge those top N's together to get the global top N.
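The merge step can be sketched as follows, assuming each partition has already produced its local top-N counts (plain Java; the names are invented for this sketch):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Two-phase streaming top-N: each partition keeps its own top-N counts, and
// this final step combines the partial lists into a global top-N.
public class TopNMerge {
    public static List<Map.Entry<String, Integer>> mergeTopN(
            List<Map<String, Integer>> partitionTopNs, int n) {
        // Sum partial counts for values that appear in several partitions.
        Map<String, Integer> combined = new HashMap<>();
        for (Map<String, Integer> partial : partitionTopNs) {
            partial.forEach((value, count) -> combined.merge(value, count, Integer::sum));
        }
        // Keep the n entries with the highest combined counts.
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(combined.entrySet());
        sorted.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return sorted.subList(0, Math.min(n, sorted.size()));
    }
}
```

One caveat worth noting: if a value is trimmed from some partition's local top-N, its global count is undercounted, so in practice each partition keeps a local list somewhat larger than the global N.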