Big-Data Ingestion

Data Ingestion

Data ingestion is the transportation of data from assorted sources to a storage medium where it can be accessed, used, and analyzed by an organization.

Batch processing: The ingestion layer periodically collects and groups source data and sends it to the destination system. Groups may be processed based on any logical ordering, the activation of certain conditions, or a simple schedule. A commonly used tool is MapReduce. Examples: payroll, billing.
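The grouping step described above can be sketched as follows. This is a minimal illustration, not a production ingestion job: the `batch_records` helper, the payroll rows, and the batch size of 3 are all hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batch_records(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield successive lists holding at most batch_size records each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical payroll rows; a real job would read them from a source system.
payroll = [{"employee_id": i, "hours": 40} for i in range(7)]
batches = list(batch_records(payroll, batch_size=3))
batch_sizes = [len(batch) for batch in batches]
```

Each yielded batch would then be sent to the destination as one unit, for example once per scheduled run.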

Real-time processing: Data is sourced, manipulated, and loaded as soon as it is created or recognized by the data ingestion layer. This kind of ingestion is more expensive, since it requires systems to constantly monitor sources and accept new information. Examples: bank ATMs, radar systems.
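By contrast with batching, a real-time pipeline handles each event as it arrives. The sketch below uses a generator to show that idea. The ATM events, the `ingest_stream` and `flag_large_withdrawal` names, and the threshold of 1000 are assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator


def ingest_stream(events: Iterable[dict],
                  transform: Callable[[dict], dict]) -> Iterator[dict]:
    """Transform and emit each event as soon as it is read from the source."""
    for event in events:
        yield transform(event)


def flag_large_withdrawal(event: dict) -> dict:
    # Hypothetical rule: the threshold of 1000 is illustrative only.
    return {**event, "flagged": event["amount"] > 1000}


# Hypothetical ATM events standing in for a live source.
atm_events = iter([
    {"atm": "A1", "amount": 200},
    {"atm": "B7", "amount": 5000},
])
stream = ingest_stream(atm_events, flag_large_withdrawal)
first = next(stream)   # produced before the second event has been read
second = next(stream)
```

Because the generator is lazy, no event waits for a batch to fill. That is also why the source has to be monitored continuously.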

Data Ingestion Parameters

  • Data Velocity – The speed at which data flows in from different sources such as machines, networks, human interaction, media sites, and social media. The flow can be massive or continuous.
  • Data Size – The volume of data to be ingested, which can be enormous. Data is generated by many different sources, and the volume may grow over time.
  • Data Frequency (Batch, Real-Time) – Data can be processed in real time or in batches. In real-time processing, data is passed on as soon as it is received; in batch processing, data is stored in batches at a fixed time interval and then moved on.
  • Data Format (Structured, Semi-Structured, Unstructured) – Data can arrive in different formats: structured (tabular), unstructured (images, audio, video), or semi-structured (JSON files, XML files, etc.).
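To make the format distinction concrete, the following sketch reads a small structured (CSV) sample and a small semi-structured (JSON) sample with Python's standard library. The sample data is invented for illustration.

```python
import csv
import io
import json

# Structured: every row has the same columns, declared in the header.
csv_text = "id,name\n1,Asha\n2,Ravi\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: each document carries its own keys, which may differ.
json_text = '[{"id": 1, "name": "Asha", "tags": ["new"]}, {"id": 2}]'
docs = json.loads(json_text)
doc_keys = [sorted(doc.keys()) for doc in docs]
```

The CSV rows all share one fixed shape, while the two JSON documents have different sets of keys. An ingestion layer has to tolerate that variation for semi-structured sources.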

Tools used

Apache Sqoop: Sqoop is short for 'SQL to Hadoop'. It is used to import data from a relational database system or a mainframe into HDFS. The import process is performed in parallel.
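A parallel import of this kind generally works by dividing the range of a key column among several workers, each of which reads its own slice of the table. The sketch below illustrates only that range-splitting idea. It is not Sqoop's code, and the `split_key_range` helper and the key range 1 to 10 are hypothetical.

```python
from typing import List, Tuple


def split_key_range(min_id: int, max_id: int, num_splits: int) -> List[Tuple[int, int]]:
    """Divide the inclusive range [min_id, max_id] into contiguous inclusive sub-ranges."""
    if num_splits <= 0:
        raise ValueError("num_splits must be positive")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")
    total = max_id - min_id + 1
    num_splits = min(num_splits, total)  # never create empty splits
    base, extra = divmod(total, num_splits)
    ranges = []
    start = min_id
    for index in range(num_splits):
        size = base + (1 if index < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Each tuple could become one worker's "WHERE id BETWEEN lo AND hi" query.
splits = split_key_range(1, 10, 4)
```

The sub-ranges do not overlap and together cover the whole key range, so the workers can run independently without reading any row twice.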

Apache Flume: Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. It can also be used to ingest real-time data.

Apache Kafka: Kafka is a distributed streaming platform used to ingest real-time streaming data.

Apache Gobblin: Gobblin is an open-source data ingestion framework for extracting, transforming and loading large volumes of data from different data sources. It supports both streaming and batch data ecosystems.

File Formats

Text/CSV Files: They are human-readable and can be parsed almost anywhere. They come in handy when doing a dump from a database or bulk loading data from Hadoop into an analytic database. However, CSV files do not support block compression.

XML and JSON: XML defines a set of rules for encoding documents in a format that is both machine- and human-readable. It typically takes more bandwidth than JSON. JSON is an open-standard file format consisting of key-value pairs. Neither format supports block compression or splitting.
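The bandwidth difference can be checked on a small scale by encoding the same record both ways and comparing the byte counts. The record below is invented, and the result holds for this sample only; actual sizes depend on the data and on how each document is structured.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record used only for this comparison.
record = {"id": "42", "name": "Asha", "city": "Pune"}

# Compact JSON: no extra whitespace between tokens.
json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Equivalent XML: one child element per key, no XML declaration.
root = ET.Element("record")
for key, value in record.items():
    ET.SubElement(root, key).text = value
xml_bytes = ET.tostring(root, encoding="unicode").encode("utf-8")

json_is_smaller = len(json_bytes) < len(xml_bytes)
```

XML repeats every field name in both an opening and a closing tag, which accounts for most of the extra bytes here.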

Avro: Avro files store metadata with the data but also allow specification of an independent schema for reading the file. These files are splittable and support block compression. Because the encoding is compact, Avro saves a lot of bandwidth over the wire.
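Avro schemas are written in JSON. The sketch below shows what reading with an independent schema means in practice: a record written under one schema is projected onto a different reader schema. It is a toy illustration in plain Python, not the Avro library. The `resolve` helper and both `User` schemas are hypothetical, and the sketch covers only added and dropped fields.

```python
import json

# Schema the data was written with (stored alongside the data in a real Avro file).
writer_schema = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "id", "type": "long"},
            {"name": "email", "type": "string"}]}
""")

# Independent schema supplied by the reader: drops "email", adds "country".
reader_schema = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "id", "type": "long"},
            {"name": "country", "type": "string", "default": "unknown"}]}
""")


def resolve(record: dict, reader: dict) -> dict:
    """Toy projection of a decoded record onto the reader's fields."""
    resolved = {}
    for field in reader["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]  # field unknown to the writer
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


written = {"id": 7, "email": "a@example.com"}  # conforms to writer_schema
read_back = resolve(written, reader_schema)
```

Fields the reader does not ask for are ignored, and fields the writer never stored are filled from the reader's defaults. This lets producers and consumers change their schemas at different times.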

