Apache Flume
Neelam Pawar
Gen-AI Ambassador with Specialization in LLM Evaluation, 11/11 GCP Certified, CKA, CKS, Ethical Hacker | Ex-Microsoft
Apache Flume is a tool that handles the ingestion of unstructured data, such as log files or streaming data.
Features of Flume that make it useful:
- It stores data in a buffer, which prevents the HDFS cluster from being overloaded by the write servers.
- The Flume configuration can be extended to the locations where our write servers run, so there is very little latency.
- Fault tolerance: because data is buffered in channels, data loss can be prevented.
Components of Flume :
Source
This component of a Flume agent receives data from any application that produces it. The data can arrive in Avro, HTTP, or other supported formats, and is received by opening ports on the agent host. Each source is connected to one or more channels, and data is transferred into the channels once an event is triggered.
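As a minimal sketch, an Avro source that listens on a port can be declared like this (the agent name `a1`, source name `r1`, channel name `c1`, and port are hypothetical):

```properties
# Hypothetical agent "a1" with an Avro source listening on port 4141
a1.sources = r1
a1.channels = c1

a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

# Connect the source to its channel
a1.sources.r1.channels = c1
```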
Event
- Data flows in the form of events. An event has two parts: a header and a body.
- Header: contains key-value pairs used for routing information and for unique identification and tracking of events.
- Body: contains the actual data, as an array of bytes.
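Headers are typically set by interceptors. As an illustrative sketch (agent `a1`, source `r1`, and the header key/value are hypothetical), Flume's static interceptor attaches a fixed header to every event, which routing logic can use later:

```properties
# Hypothetical: attach a static header to every event from source r1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = us-east
```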
Channel
It is a buffer that holds events until a sink writes them to the storage target. Multiple sources can write to a channel, and multiple sinks can read events from the same channel. Flume supports several channel types, including JDBC and Kafka channels.
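A memory channel, for example, can be declared as follows (agent name `a1`, channel name `c1`, and the capacity values are hypothetical):

```properties
# Hypothetical: a memory channel for agent "a1"
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000          # max events held in the channel
a1.channels.c1.transactionCapacity = 100 # max events per transaction
```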
Sinks
It is the part of a Flume agent that delivers data to the final destination. A sink works on a pull model, and once data is written to the destination, it informs the channel to remove those events. Flume uses a transactional approach to guarantee reliable delivery of events: the sources and sinks encapsulate, in a transaction provided by the channel, the storage and retrieval, respectively, of the events.
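A common destination is HDFS. The sketch below shows a hypothetical HDFS sink (agent `a1`, sink `k1`, channel `c1`, and the path are illustrative, not a recommendation):

```properties
# Hypothetical: an HDFS sink for agent "a1" pulling from channel c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.batchSize = 100
```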
Use cases:
Recommendation systems and sentiment analysis using Twitter data.
The generic template of a Flume configuration file
```properties
# list sources, sinks and channels in the agent <Agent>
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>

# define the flow
<Agent>.sources.<Source>.channels = <Channel1> <Channel2>
<Agent>.sinks.<Sink>.channel = <Channel1>

# source properties
<Agent>.sources.<Source>.<someProperty> = <someValue>

# sink properties
<Agent>.sinks.<Sink>.<someProperty> = <someValue>

# channel properties
<Agent>.channels.<Channel>.<someProperty> = <someValue>
```
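Filling in the template gives a complete minimal agent. This sketch (agent `a1`, component names, and the port are hypothetical) wires a netcat source through a memory channel to a logger sink:

```properties
# Hypothetical agent "a1": netcat source -> memory channel -> logger sink
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Start the agent with (assuming the config is saved as example.conf):
# flume-ng agent --name a1 --conf conf --conf-file example.conf
```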
Flume flow
It represents the path taken by the data from its source to reach the target destination using Flume agents.
a) Multi-agent flow: In these flows, more than one Flume agent is used. They are preferred when the rate at which data is generated is high. Multiple agents can be connected in series, where the sink of one agent is connected to the source of another agent.
b) Fan-out flow: In this type of flow, multiple channels are connected to the same source. These flows can be either multiplexing or replicating in nature, and are configured through the channel selector property.
c) Tiered data collection flow: These flows configure multiple Flume agents so that they receive data from the initial sources and consolidate it onto fewer agents, which finally write the data to the final sink. This pattern is commonly used in log collection.
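A fan-out flow is configured via the channel selector. The sketch below (agent `a1`, source `r1`, channels `c1`/`c2`, and the header value are hypothetical) shows the replicating variant, with the multiplexing variant commented out:

```properties
# Hypothetical fan-out: source r1 replicates every event to channels c1 and c2
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating

# Multiplexing variant: route on the value of an event header (e.g. "datacenter")
# a1.sources.r1.selector.type = multiplexing
# a1.sources.r1.selector.header = datacenter
# a1.sources.r1.selector.mapping.us-east = c1
# a1.sources.r1.selector.default = c2
```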
Tuning parameters to consider:
a) Tiered data collection flows: they help distribute load and give a scalable architecture.
b) Type of channel: file or memory. The file channel is durable, as events are stored on disk. In the memory channel, events are buffered in RAM, so it is volatile in nature but fast. For better performance, we can use a mix of these two channel types.
c) Batch size: the maximum number of events that will be consumed from a channel in a single transaction. It impacts throughput, latency, and duplication under failure.
d) Channel capacity: the number of events a channel can hold at a time.
e) Channel transaction capacity: the number of events a channel accepts or sends in one transaction.
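These parameters map directly onto config properties. The sketch below (agent `a1`, channel `c1`, sink `k1`, paths, and all values are illustrative, not recommendations) shows a durable file channel with its capacities, and a sink batch size that stays within the channel's transaction capacity:

```properties
# Hypothetical tuning for agent "a1"
# File channel: durable; capacity bounds buffered events,
# transactionCapacity bounds events per put/take transaction
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000

# Sink batch size should not exceed the channel's transactionCapacity
a1.sinks.k1.hdfs.batchSize = 1000
```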