Data Science: Common file formats and their usage
Paras Nigam
VP, Engineering | AI & Cybersecurity | Generative AI Expert | GCC & GVC | IIM Calcutta | MIT USA | 3AI Thought Leader | Entrepreneur |
A few common data file formats used in textual data science projects
CSV
CSV stands for 'Comma Separated Values'. It stores data in a tabular format, very similar to Excel, and is the format most commonly processed with pandas. It is used when data needs to be stored in tabular form and contents such as text, numbers, and dates are to be processed. It is a row-oriented format.
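A minimal sketch of the typical round trip with pandas (the file name and columns here are illustrative):

```python
import pandas as pd

# A small tabular dataset mixing text, numbers, and dates
df = pd.DataFrame({
    "name": ["Asha", "Ravi"],
    "score": [91, 85],
    "joined": ["2021-01-15", "2021-03-02"],
})

# Write to CSV, then read it back; parse_dates restores the date column
df.to_csv("scores.csv", index=False)
loaded = pd.read_csv("scores.csv", parse_dates=["joined"])
print(loaded.dtypes)
```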
Parquet
Parquet is a column-oriented data format. It enables efficient column-based processing of complex data, and it supports efficient compression and encoding schemes, which lowers storage costs for data files. It also increases the effectiveness of querying with serverless technologies. Compared to CSV, we can get a significant reduction in storage and processing costs.
JSON
JSON stands for JavaScript Object Notation. You might wonder where JSON is used in data science, but the fact is JSON is everywhere: any time there is a need for data interchange, we turn to JSON. In data science we need to ingest, digest, and emit data; while the formats above help in the digest phase, JSON still plays a significant role in ingesting and emitting, in the form of streaming data, generated reports, etc.
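The ingest-and-emit round trip is just the standard library's `json` module. A sketch (the event fields are hypothetical):

```python
import json

# A record as it might arrive from a streaming source
payload = '{"user": "a123", "event": "click", "ts": "2022-05-01T10:00:00Z"}'

record = json.loads(payload)            # ingest: JSON text -> Python dict
record["processed"] = True              # digest: enrich the record
report = json.dumps(record, indent=2)   # emit: dict -> JSON text for a report
print(report)
```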
AVRO
Avro is a row-oriented storage format for Hadoop. Avro stores the schema in JSON format, making it easy to read, although the data itself is not human-readable since it is stored in binary. Avro's advantage is its schema, which is much richer than Parquet's. It is a good choice when records have to be read as a whole and processed, whereas Parquet is better when data is read in groups based on column operations.
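Since the schema is plain JSON, it is easy to inspect even though the records are binary. A minimal illustrative schema (the record and field names here are made up for the example):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```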
Python Pickle
Pickle is a way of serializing Python objects. We can use pickle to serialize a trained machine learning model and save the serialized form to a file. Pickle converts an object to a binary stream that can be written to a file; the binary file can later be de-serialized to load the Python object back.
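A minimal sketch of that save-and-restore cycle; a plain dict stands in here for a trained model object, since pickle handles arbitrary Python objects the same way:

```python
import pickle

# Stand-in for a trained model: any Python object can be pickled
model = {"weights": [0.4, 0.6], "bias": 0.1}

# Serialize the object to a binary file...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and de-serialize it back into an equivalent object
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored)
```

Note that pickle files should only be loaded from trusted sources, since de-serialization can execute arbitrary code.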
Text files
Last but not least, text files are heavily used to ingest unstructured data. If we are not reading from ingestion streams, we end up reading data from files, and most unstructured data still lives in text files that need to be read, transformed, and made ready for the algorithms. I think I don't need to go any further on our favourite text files.
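The read-and-transform step is usually just a few lines. A sketch (the file `notes.txt` is a hypothetical input, created here so the example is self-contained):

```python
# Create a small unstructured text file for the demo
with open("notes.txt", "w", encoding="utf-8") as f:
    f.write("Text files hold unstructured data.\nThey must be cleaned first.\n")

# Read it back, applying a minimal transformation per line
with open("notes.txt", encoding="utf-8") as f:
    lines = [line.strip().lower() for line in f]

print(lines)
```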