Apache Flume
Neelam Pawar
Gen-AI Ambassador with Specialization in LLM Evaluation, 11/11 GCP Certified, CKA, CKS, Ethical Hacker | Ex-Microsoft
Apache Flume is a tool that handles the ingestion of unstructured data, such as log files or streaming data.
Features of Flume that make it useful:
- It stores data in a buffer, which prevents the HDFS cluster from being overloaded by the write servers.
- The Flume configuration can be extended to the locations where our write servers run, so there is very little latency.
- Fault tolerance: because data is buffered in channels, data loss can be prevented.
Components of Flume :
Source
This component of a Flume agent receives data from any application that produces it. The data can arrive in Avro, HTTP, or other supported formats, and is received by opening ports on the agent host. Each source is connected to one or more channels, and data is transferred into the channels once an event is triggered.
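As a minimal sketch, an Avro source that listens on a port can be declared like this (the agent name `a1`, source name `r1`, channel name `c1`, and port are hypothetical):

```properties
# Hypothetical agent "a1" with an Avro source listening on port 4141
a1.sources = r1
a1.channels = c1

a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

# Connect the source to its channel
a1.sources.r1.channels = c1
```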
Event
- Data flows in the form of events. An event has two parts: a header and a body.
- Header: contains key-value pairs used for routing information and for unique identification and tracking of events.
- Body: contains the actual data, as an array of bytes.
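Headers are typically set by interceptors. As an illustrative sketch (agent `a1`, source `r1`, and the header key/value are hypothetical), Flume's static interceptor attaches a fixed header to every event, which routing logic can use later:

```properties
# Hypothetical: attach a static header to every event from source r1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = us-east
```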
Channel
It is a buffer that holds events until a sink writes them to the storage target. Multiple sources can write to a channel, and multiple sinks can read events from the same channel. Flume supports several channel types, including JDBC and Kafka channels.
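A memory channel, for example, can be declared as follows (agent name `a1`, channel name `c1`, and the capacity values are hypothetical):

```properties
# Hypothetical: a memory channel for agent "a1"
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000          # max events held in the channel
a1.channels.c1.transactionCapacity = 100 # max events per transaction
```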
Sinks
It is the part of a Flume agent that delivers data to the final destination. A sink works on a pull model, and once data is written to the destination, it informs the channel to remove those events. Flume uses a transactional approach to guarantee reliable delivery of events: the sources and sinks encapsulate, in a transaction provided by the channel, the storage and retrieval, respectively, of the events.
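A common destination is HDFS. The sketch below shows a hypothetical HDFS sink (agent `a1`, sink `k1`, channel `c1`, and the path are illustrative, not a recommendation):

```properties
# Hypothetical: an HDFS sink for agent "a1" pulling from channel c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.batchSize = 100
```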
Use cases:
Recommendation systems and sentiment analysis using Twitter data.
The generic template of a Flume configuration file
```properties
# list sources, sinks and channels in the agent <Agent>
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>

# define the flow
<Agent>.sources.<Source>.channels = <Channel1> <Channel2>
<Agent>.sinks.<Sink>.channel = <Channel1>

# source properties
<Agent>.sources.<Source>.<someProperty> = <someValue>

# sink properties
<Agent>.sinks.<Sink>.<someProperty> = <someValue>

# channel properties
<Agent>.channels.<Channel>.<someProperty> = <someValue>
```
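Filling in the template gives a complete minimal agent. This sketch (agent `a1`, component names, and the port are hypothetical) wires a netcat source through a memory channel to a logger sink:

```properties
# Hypothetical agent "a1": netcat source -> memory channel -> logger sink
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Start the agent with (assuming the config is saved as example.conf):
# flume-ng agent --name a1 --conf conf --conf-file example.conf
```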
Flume flow
It represents the path taken by the data from its source to reach the target destination using Flume agents.
a) Multi-agent flow: In these flows, more than one Flume agent is used. They are preferred when the rate at which data is generated is high. Multiple agents can be connected in series, where the sink of one agent is connected to the source of another agent.
b) Fan-out flow: In this type of flow, multiple channels are connected to the same source. These flows can be either multiplexing or replicating in nature, and are configured through the channel selector property.
c) Tiered data collection flow: These flows configure multiple Flume agents so that they receive data from the initial sources and consolidate it onto fewer agents, which finally write the data to the final sink. This pattern is commonly used in log collection.
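A fan-out flow is configured via the channel selector. The sketch below (agent `a1`, source `r1`, channels `c1`/`c2`, and the header value are hypothetical) shows the replicating variant, with the multiplexing variant commented out:

```properties
# Hypothetical fan-out: source r1 replicates every event to channels c1 and c2
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating

# Multiplexing variant: route on the value of an event header (e.g. "datacenter")
# a1.sources.r1.selector.type = multiplexing
# a1.sources.r1.selector.header = datacenter
# a1.sources.r1.selector.mapping.us-east = c1
# a1.sources.r1.selector.default = c2
```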
Tuning parameters to consider:
a) Tiered data collection flows: they help distribute load and give a scalable architecture.
b) Type of channel: file or memory. The file channel is durable, as events are stored on disk. In the memory channel, events are buffered in RAM, so it is volatile in nature but fast. For better performance, we can use a mix of these two channel types.
c) Batch size: the maximum number of events that will be consumed from a channel in a single transaction. It impacts throughput, latency, and duplication under failure.
d) Channel capacity: the number of events a channel can hold at a time.
e) Channel transaction capacity: the number of events a channel accepts or sends in one transaction.
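These parameters map directly onto config properties. The sketch below (agent `a1`, channel `c1`, sink `k1`, paths, and all values are illustrative, not recommendations) shows a durable file channel with its capacities, and a sink batch size that stays within the channel's transaction capacity:

```properties
# Hypothetical tuning for agent "a1"
# File channel: durable; capacity bounds buffered events,
# transactionCapacity bounds events per put/take transaction
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000

# Sink batch size should not exceed the channel's transactionCapacity
a1.sinks.k1.hdfs.batchSize = 1000
```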