Apache Flume
Picture Source: Upgrad

Apache Flume is a tool for ingesting unstructured data, such as log files or streaming data.

Features that make Flume useful:

  1. It stores data in a buffer, which prevents bursts of writes from overloading HDFS.
  2. A Flume deployment can be extended to the locations where our write servers run, so latency stays very low.
  3. Fault tolerance: because events stay in the buffer until they are delivered, data loss can be prevented, provided the buffer itself is durable.

Components of Flume:

Source

This component of a Flume agent receives data from any application that produces it. The data can arrive as Avro or in another form, such as HTTP. Data is received by opening ports. Each source is connected to at least one channel, and data is transferred into the channels as events arrive.
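As a minimal sketch of the description above, an Avro source that listens on a port and feeds one channel could be configured as follows. The agent name a1, the component names r1 and c1, and the bind address and port are placeholder assumptions, not values from a real deployment.

```properties
# Hypothetical agent "a1" with one source "r1" and one channel "c1"
a1.sources = r1
a1.channels = c1

# Avro source: opens a port and receives events from Avro clients
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

# Connect the source to its channel (note the plural "channels")
a1.sources.r1.channels = c1

# In-memory channel that buffers the received events
a1.channels.c1.type = memory
```

For data posted over HTTP, the source type would be http instead of avro. A sink is still needed to drain c1 before this fragment becomes a complete agent.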

Event

  • Data flows through Flume in the form of events. An event has two parts, a header and a body:
  1. Header: contains key-value pairs used for routing information, unique identification, and tracking of events.
  2. Body: contains the actual data, which is an array of bytes.
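One way event headers get populated is through interceptors attached to a source. The fragment below is only a sketch: the agent and component names (a1, r1, i1 to i3) and the static key and value are illustrative assumptions.

```properties
# Hypothetical source "r1" on agent "a1"; interceptors run in the listed order
a1.sources.r1.interceptors = i1 i2 i3

# Adds a "timestamp" header with the time the event was processed
a1.sources.r1.interceptors.i1.type = timestamp

# Adds a "host" header with the agent's host name or IP address
a1.sources.r1.interceptors.i2.type = host

# Adds a fixed header (key and value are illustrative) that can drive routing
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.key = datacenter
a1.sources.r1.interceptors.i3.value = NEW_YORK
```

The event body is left untouched by these interceptors; only the header map grows.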

Channel

It is a buffer that holds events until a sink writes them to the storage target. Multiple sources can write to one channel, and multiple sinks can read events from the same channel. It supports JDBC and Kafka channels, and priority tracking of events.
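The many-to-one wiring described above might look like the following fragment. The names a1, r1, r2, k1, k2 and c1 are placeholders, and the type settings of each component are omitted for brevity.

```properties
a1.sources = r1 r2
a1.sinks = k1 k2
a1.channels = c1

# Both sources write into the same channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1

# Both sinks read from the same channel; each event is taken by only one of them
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
```

Note the asymmetry in the property names: a source lists its channels (plural), while a sink names exactly one channel (singular).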

Sinks

It is the part of a Flume agent that delivers data to the final destination. It works on a pull model: the sink takes events from the channel and, once the data is written to the destination, informs the channel to remove those events. Flume uses a transactional approach to guarantee reliable delivery of events: sources and sinks encapsulate the storage and retrieval of events, respectively, in a transaction provided by the channel.
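As an illustration, a sink that delivers events to HDFS could be sketched as below. The names a1, k1 and c1 and the HDFS path are assumptions made for this example.

```properties
a1.sinks = k1
a1.channels = c1

# HDFS sink that pulls events from channel c1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1

# Illustrative target path; the date escapes need a timestamp to resolve
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Write the raw event bodies instead of the default SequenceFile format
a1.sinks.k1.hdfs.fileType = DataStream

# Durable channel feeding the sink
a1.channels.c1.type = file
```

Events are removed from c1 only after the sink's transaction commits, so a failed write leaves them in the channel for a retry.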

Use cases:

Recommendation systems and sentiment analysis using Twitter data.

The generic template of a Flume configuration file

# list sources, sinks and channels in the agent
<Agent>.sources = <Source> 
<Agent>.sinks = <Sink> 
<Agent>.channels = <Channel1> <Channel2>
 
# define the flow 
<Agent>.sources.<Source>.channels = <Channel1> <Channel2>
<Agent>.sinks.<Sink>.channel = <Channel1>

# source properties 
<Agent>.sources.<Source>.<someProperty> = <someValue> 

# sink properties 
<Agent>.sinks.<Sink>.<someProperty> = <someValue>

# channel properties 
<Agent>.channels.<Channel>.<someProperty> = <someValue>
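Filled in with concrete values, the template could look like the single-node sketch below. The agent name a1, the component names r1, k1 and c1, and the port are placeholders chosen for illustration.

```properties
# list sources, sinks and channels in the agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# define the flow
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# source properties: turn each line received on a TCP port into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# sink properties: write events to the agent's log
a1.sinks.k1.type = logger

# channel properties: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
```

Assuming the file is saved as example.conf (a hypothetical name), the agent can be started with: bin/flume-ng agent --conf conf --conf-file example.conf --name a1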

Flume flow 

It represents the path taken by the data from its source to reach the target destination using Flume agents.

a) Multi-agent flow: In these flows, more than one Flume agent is used. They are preferred when data is generated at a high rate. Multiple agents can be connected in series, with the sink of one agent connected to the source of the next.

b) Fan-out flow: In this type of flow, multiple channels are connected to the same source. The flow can be either multiplexing or replicating in nature. The channel selector property is used to configure it.

c) Tiered data collection flow: These flows configure multiple Flume agents so that they receive data from the initial sources and consolidate it onto fewer agents, which finally write the data to the final sink. This pattern is used in log collection.
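The flow types above can be combined in one sketch: an agent named edge forwards events over Avro to an agent named collector, which fans out to two channels. All agent names, host names, ports and the "type" header are assumptions made for this example.

```properties
# --- Agent "edge": runs next to the application ---
edge.sources = r1
edge.channels = c1
edge.sinks = k1

edge.sources.r1.type = netcat
edge.sources.r1.bind = localhost
edge.sources.r1.port = 44444
edge.sources.r1.channels = c1

edge.channels.c1.type = memory

# Avro sink: forwards events to the Avro source of the next agent
edge.sinks.k1.type = avro
edge.sinks.k1.hostname = collector.example.com
edge.sinks.k1.port = 4545
edge.sinks.k1.channel = c1

# --- Agent "collector": consolidates events from the edge agents ---
collector.sources = r1
collector.channels = c1 c2
collector.sinks = k1 k2

collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1 c2

# Fan-out: route on the value of the (hypothetical) "type" header
collector.sources.r1.selector.type = multiplexing
collector.sources.r1.selector.header = type
collector.sources.r1.selector.mapping.error = c1
collector.sources.r1.selector.default = c2

collector.channels.c1.type = memory
collector.channels.c2.type = memory

collector.sinks.k1.type = logger
collector.sinks.k1.channel = c1
collector.sinks.k2.type = logger
collector.sinks.k2.channel = c2
```

Setting selector.type to replicating instead would copy every event to both channels. Pointing several edge agents at the same collector port gives the tiered layout; in practice each agent would run on its own host and read only the properties that start with its own name.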

Tuning parameters to consider:

a) Tiered data collection flows: They help distribute load and give a scalable architecture.

b) Type of channel: file or memory. The file channel is durable because events are stored on disk. In the memory channel, events are buffered in RAM, so it is volatile but fast. For better performance we can use a mix of these two channels.

c) Batch size: The maximum number of events consumed from a channel in a single transaction. It affects throughput, latency, and duplication under failure.

d) Channel capacity: The number of events a channel can hold at a time.

e) Channel transaction capacity: The number of events a channel accepts or sends in one transaction.
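A sketch of how these parameters appear in a configuration file is shown below. The names a1, c1, c2 and k1, the directories, and all numeric values are illustrative assumptions rather than recommendations.

```properties
a1.channels = c1 c2
a1.sinks = k1

# Durable file channel: events survive an agent restart
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000

# Fast memory channel: events are lost if the agent stops
a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 1000

# Sink batch size: events taken from the channel per transaction
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.hdfs.batchSize = 1000
```

The sizes have to nest: a sink's batch size should not exceed the transaction capacity of its channel, which in turn should not exceed the channel capacity.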

