Big-Data Ingestion
Neelam Pawar
Gen-AI Ambassador with Specialization in LLM Evaluation, 11/11 GCP Certified, CKA, CKS, Ethical Hacker | Ex-Microsoft
Data Ingestion
Data ingestion is the transportation of data from assorted sources to a storage medium where it can be accessed, used, and analyzed by an organization.
Batch processing: The ingestion layer periodically collects and groups source data and sends it to the destination system. Groups may be processed based on some logical ordering, the activation of certain conditions, or a simple schedule. A commonly used tool is MapReduce. Examples: payroll, billing.
Real-time processing: Data is sourced, manipulated, and loaded as soon as it is created or recognized by the data ingestion layer. This kind of ingestion is more expensive, since it requires systems to constantly monitor sources and accept new information. Examples: bank ATMs, radar systems. A rough sketch below contrasts the two modes.
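The following minimal Python sketch illustrates the difference in handling, independent of any particular tool; the record source, sink, and batch size are made up for illustration.

```python
import time

def sink(records):
    """Stand-in for the destination system (e.g. HDFS or a warehouse)."""
    print(f"loaded {len(records)} record(s)")

def batch_ingest(source, batch_size=100):
    """Batch: buffer records and load them as a group."""
    buffer = []
    for record in source:
        buffer.append(record)
        if len(buffer) >= batch_size:   # or: a schedule / trigger condition
            sink(buffer)
            buffer = []
    if buffer:                          # flush the final partial group
        sink(buffer)

def realtime_ingest(source):
    """Real-time: load each record as soon as it arrives."""
    for record in source:
        sink([record])

# Toy source: 250 records with timestamps.
records = ({"id": i, "ts": time.time()} for i in range(250))
batch_ingest(records, batch_size=100)
```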
Data Ingestion Parameters
- Data Velocity – the speed at which data flows in from different sources such as machines, networks, human interaction, media sites, and social media. The flow can arrive in large bursts or as a continuous stream.
- Data Size – the volume of data to be ingested. Data is generated from many different sources, and its volume can grow over time.
- Data Frequency (Batch, Real-Time) – how often data is processed. In real-time processing, data is handled as soon as it is received; in batch processing, data is accumulated over a fixed time interval and then moved as a group.
- Data Format (Structured, Semi-Structured, Unstructured) – data can arrive in different formats: structured (tabular, e.g. relational tables or CSV), unstructured (images, audio, video), or semi-structured (e.g. JSON or XML files). A short parsing sketch follows this list.
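As a small illustration of the format distinction, the Python snippet below reads the same kind of record from structured CSV and from semi-structured JSON using only the standard library; the field names are invented for the example.

```python
import csv
import io
import json

# Structured: tabular CSV with a fixed set of columns per row.
csv_text = "id,name,amount\n1,alice,10.5\n2,bob,7.25\n"
for row in csv.DictReader(io.StringIO(csv_text)):
    print(row["id"], row["name"], float(row["amount"]))

# Semi-structured: JSON key-value pairs; nesting is allowed and fields may vary.
json_text = '{"id": 3, "name": "carol", "tags": ["vip", "trial"]}'
record = json.loads(json_text)
print(record["name"], record.get("tags", []))
```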
Tools used
Apache Sqoop: Sqoop is short for 'SQL to Hadoop'. It is used to import data from a relational database system or a mainframe into HDFS. The import process is performed in parallel.
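Sqoop itself is driven from the command line; as a rough sketch, a scheduler script could wrap a Sqoop import like the Python snippet below. The connection string, table, and target directory are placeholders, and credentials would normally be supplied via --password-file or a prompt rather than hard-coded.

```python
import subprocess

# Placeholder connection details; replace with real values.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host/sales",   # source RDBMS (placeholder)
    "--username", "etl_user",                    # credentials via --password-file in practice
    "--table", "orders",                         # table to import (placeholder)
    "--target-dir", "/data/raw/orders",          # HDFS destination (placeholder)
    "--num-mappers", "4",                        # number of parallel map tasks
]
subprocess.run(cmd, check=True)
```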
Apache Flume: Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. It can be used to ingest real-time data as well.
Apache Kafka: Kafka is a distributed streaming platform used to ingest real-time streaming data.
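A minimal producer sketch is shown below, assuming the third-party kafka-python package; the broker address, topic name, and event fields are placeholders for illustration.

```python
import json
from kafka import KafkaProducer  # third-party package: kafka-python

# Broker address and topic name are placeholders.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"atm_id": "ATM-042", "amount": 200.0, "currency": "USD"}
producer.send("atm-transactions", value=event)  # publish one event to the topic
producer.flush()                                # block until the send completes
```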
Apache Gobblin: Gobblin is an open-source data ingestion framework for extracting, transforming and loading large volumes of data from different data sources. It supports both streaming and batch data ecosystems.
File Formats
Text/CSV Files :They are readable and ubiquitously parsable. They come in handy when doing a dump from a database or bulk loading data from Hadoop into an analytic database. However, CSV files do not support block compression.
XML and JSON: XML defines a set of rules for encoding documents in a format that is both machine-readable and human-readable, but it consumes more bandwidth than JSON. JSON is an open-standard file format consisting of key-value pairs. Neither format supports block compression or splitting.
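To make the bandwidth point concrete, the standard-library sketch below encodes the same record both ways and compares the serialized sizes; the field names are invented for the example.

```python
import json
import xml.etree.ElementTree as ET

# The same record encoded in both formats.
record = {"id": 7, "name": "alice", "amount": 10.5}

json_bytes = json.dumps(record).encode("utf-8")

root = ET.Element("record")
for key, value in record.items():
    ET.SubElement(root, key).text = str(value)
xml_bytes = ET.tostring(root)

print(len(json_bytes), "bytes as JSON")  # typically smaller
print(len(xml_bytes), "bytes as XML")    # tag names repeat, so usually larger
```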
Avro: Avro files store the schema as metadata alongside the data, but also allow an independent schema to be specified when the file is read. These files are splittable and support block compression, which saves a lot of bandwidth over the wire.
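A short write/read sketch follows, assuming the third-party fastavro package; the schema, record values, and file name are illustrative only.

```python
from fastavro import parse_schema, reader, writer  # third-party package: fastavro

# Illustrative schema; the writer schema is embedded in the file's metadata.
schema = parse_schema({
    "name": "Transaction",
    "type": "record",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "amount", "type": "double"},
    ],
})

records = [{"id": 1, "amount": 10.5}, {"id": 2, "amount": 7.25}]

# Write with block compression ("deflate"); block boundaries keep the file splittable.
with open("transactions.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")

# Read the records back; the schema travels with the file.
with open("transactions.avro", "rb") as fo:
    for rec in reader(fo):
        print(rec)
```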