Big-Data Ingestion
Neelam Pawar
Gen-AI Ambassador with Specialization in LLM Evaluation, 11/11 GCP Certified, CKA, CKS, Ethical Hacker | Ex-Microsoft
Data Ingestion
Data ingestion is the transportation of data from assorted sources to a storage medium where it can be accessed, used, and analyzed by an organization.
Batch processing: The ingestion layer periodically collects and groups source data and sends it to the destination system. Groups may be processed based on some logical ordering, the activation of certain conditions, or a simple schedule. A commonly used tool is MapReduce. Examples: payroll, billing.
Real-time processing: Data is sourced, manipulated, and loaded as soon as it is created or recognized by the data ingestion layer. This kind of ingestion is more expensive, since it requires systems to constantly monitor sources and accept new information. Examples: bank ATMs, radar systems. A rough sketch below contrasts the two modes.
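The following minimal Python sketch illustrates the difference in handling, independent of any particular tool; the record source, sink, and batch size are made up for illustration.

```python
import time

def sink(records):
    """Stand-in for the destination system (e.g. HDFS or a warehouse)."""
    print(f"loaded {len(records)} record(s)")

def batch_ingest(source, batch_size=100):
    """Batch: buffer records and load them as a group."""
    buffer = []
    for record in source:
        buffer.append(record)
        if len(buffer) >= batch_size:   # or: a schedule / trigger condition
            sink(buffer)
            buffer = []
    if buffer:                          # flush the final partial group
        sink(buffer)

def realtime_ingest(source):
    """Real-time: load each record as soon as it arrives."""
    for record in source:
        sink([record])

# Toy source: 250 records with timestamps.
records = ({"id": i, "ts": time.time()} for i in range(250))
batch_ingest(records, batch_size=100)
```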
Data Ingestion Parameters
- Data Velocity – the speed at which data flows in from different sources such as machines, networks, human interaction, media sites, and social media. The flow can arrive in large bursts or as a continuous stream.
- Data Size – the volume of data to be ingested. Data is generated from many different sources, and its volume can grow over time.
- Data Frequency (Batch, Real-Time) – how often data is processed. In real-time processing, data is handled as soon as it is received; in batch processing, data is accumulated over a fixed time interval and then moved as a group.
- Data Format (Structured, Semi-Structured, Unstructured) – data can arrive in different formats: structured (tabular, e.g. relational tables or CSV), unstructured (images, audio, video), or semi-structured (e.g. JSON or XML files). A short parsing sketch follows this list.
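As a small illustration of the format distinction, the Python snippet below reads the same kind of record from structured CSV and from semi-structured JSON using only the standard library; the field names are invented for the example.

```python
import csv
import io
import json

# Structured: tabular CSV with a fixed set of columns per row.
csv_text = "id,name,amount\n1,alice,10.5\n2,bob,7.25\n"
for row in csv.DictReader(io.StringIO(csv_text)):
    print(row["id"], row["name"], float(row["amount"]))

# Semi-structured: JSON key-value pairs; nesting is allowed and fields may vary.
json_text = '{"id": 3, "name": "carol", "tags": ["vip", "trial"]}'
record = json.loads(json_text)
print(record["name"], record.get("tags", []))
```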
Tools used
Apache Sqoop: Sqoop is short for 'SQL to Hadoop'. It is used to import data from a relational database system or a mainframe into HDFS. The import process is performed in parallel.
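Sqoop itself is driven from the command line; as a rough sketch, a scheduler script could wrap a Sqoop import like the Python snippet below. The connection string, table, and target directory are placeholders, and credentials would normally be supplied via --password-file or a prompt rather than hard-coded.

```python
import subprocess

# Placeholder connection details; replace with real values.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host/sales",   # source RDBMS (placeholder)
    "--username", "etl_user",                    # credentials via --password-file in practice
    "--table", "orders",                         # table to import (placeholder)
    "--target-dir", "/data/raw/orders",          # HDFS destination (placeholder)
    "--num-mappers", "4",                        # number of parallel map tasks
]
subprocess.run(cmd, check=True)
```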
Apache Flume: Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. It can be used to ingest real-time data as well.
Apache Kafka: Kafka is a distributed streaming platform used to ingest real-time streaming data.
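A minimal producer sketch is shown below, assuming the third-party kafka-python package; the broker address, topic name, and event fields are placeholders for illustration.

```python
import json
from kafka import KafkaProducer  # third-party package: kafka-python

# Broker address and topic name are placeholders.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"atm_id": "ATM-042", "amount": 200.0, "currency": "USD"}
producer.send("atm-transactions", value=event)  # publish one event to the topic
producer.flush()                                # block until the send completes
```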
Apache Gobblin: Gobblin is an open-source data ingestion framework for extracting, transforming and loading large volumes of data from different data sources. It supports both streaming and batch data ecosystems.
File Formats
Text/CSV Files :They are readable and ubiquitously parsable. They come in handy when doing a dump from a database or bulk loading data from Hadoop into an analytic database. However, CSV files do not support block compression.
XML and JSON: XML defines a set of rules for encoding documents in a format that is both machine-readable and human-readable, but it consumes more bandwidth than JSON. JSON is an open-standard file format consisting of key-value pairs. Neither format supports block compression or splitting.
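To make the bandwidth point concrete, the standard-library sketch below encodes the same record both ways and compares the serialized sizes; the field names are invented for the example.

```python
import json
import xml.etree.ElementTree as ET

# The same record encoded in both formats.
record = {"id": 7, "name": "alice", "amount": 10.5}

json_bytes = json.dumps(record).encode("utf-8")

root = ET.Element("record")
for key, value in record.items():
    ET.SubElement(root, key).text = str(value)
xml_bytes = ET.tostring(root)

print(len(json_bytes), "bytes as JSON")  # typically smaller
print(len(xml_bytes), "bytes as XML")    # tag names repeat, so usually larger
```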
Avro: Avro files store the schema as metadata alongside the data, but also allow an independent schema to be specified when the file is read. These files are splittable and support block compression, which saves a lot of bandwidth over the wire.
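A short write/read sketch follows, assuming the third-party fastavro package; the schema, record values, and file name are illustrative only.

```python
from fastavro import parse_schema, reader, writer  # third-party package: fastavro

# Illustrative schema; the writer schema is embedded in the file's metadata.
schema = parse_schema({
    "name": "Transaction",
    "type": "record",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "amount", "type": "double"},
    ],
})

records = [{"id": 1, "amount": 10.5}, {"id": 2, "amount": 7.25}]

# Write with block compression ("deflate"); block boundaries keep the file splittable.
with open("transactions.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")

# Read the records back; the schema travels with the file.
with open("transactions.avro", "rb") as fo:
    for rec in reader(fo):
        print(rec)
```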