Data Science: Common file formats and their usage



A few common data file formats used in textual data science projects.

CSV

CSV stands for 'Comma Separated Values'. It stores data in a tabular format, much like an Excel sheet, and is the format most commonly processed with pandas. It is used when data needs to be stored in tabular form and contents such as text, numbers, and dates are to be processed. It is a row-oriented format.
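As a quick illustration (not from the original article), here is how a small table might be written and read back with pandas; the file and column names are placeholders:

    import pandas as pd

    # Build a tiny tabular dataset and write it to CSV (file name is illustrative)
    df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [91, 84]})
    df.to_csv("scores.csv", index=False)

    # Read it back; pandas parses each row and infers column types
    df2 = pd.read_csv("scores.csv")
    print(df2.dtypes)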

Parquet

Parquet is a column-oriented data format. It enables efficient column-based processing of complex data and supports efficient compression and encoding schemes, which means lower storage costs for data files. It also makes querying with serverless technologies more effective. Compared to CSV, it can deliver a significant reduction in storage and processing costs.
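A minimal sketch of the same round trip with Parquet, assuming a Parquet engine such as pyarrow is installed; note how a single column can be read without loading the whole file:

    import pandas as pd

    # Writing a DataFrame to Parquet requires a Parquet engine (e.g. pyarrow)
    df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [91, 84]})
    df.to_parquet("scores.parquet")

    # Column-oriented storage lets us read just the columns we need
    scores = pd.read_parquet("scores.parquet", columns=["score"])
    print(scores)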

JSON

JSON stands for JavaScript Object Notation. You might be wondering where JSON is used in data science, but the fact is JSON is everywhere. Any time there is a need for data interchange we reach for JSON. In data science we need to ingest, digest, and spit out data; while other formats help in the digest phase, JSON still plays a significant role in the ingest and output phases, for example when ingesting streaming data or generating reports.
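For example, a streaming event or an API payload is typically a JSON document that the standard library can parse straight into Python objects; the payload below is made up for illustration:

    import json

    # A made-up event payload, as it might arrive from a stream or an API
    payload = '{"event": "page_view", "user_id": 42, "tags": ["home", "mobile"]}'

    # Deserialize into a Python dict for the ingest step
    event = json.loads(payload)

    # Serialize a report-style summary back to JSON for the output step
    report = json.dumps({"user_id": event["user_id"], "event_count": 1}, indent=2)
    print(report)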

AVRO

Avro is a row-oriented storage format for Hadoop. Avro stores the schema in JSON format, making the schema easy to read, although the data itself is not human-readable since it is stored in binary. Avro's advantage is its schema, which is much richer than Parquet's. It is a good choice when records have to be read as a whole and processed, whereas Parquet is better when data is read in groups for column-based operations.
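The article does not show code, but as a hedged sketch, Avro files can be written and read from Python with the third-party fastavro package; the User schema and records below are invented for illustration:

    from fastavro import parse_schema, reader, writer

    # The schema is plain JSON-style metadata, readable even though the data is binary
    schema = parse_schema({
        "name": "User",
        "type": "record",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"},
        ],
    })

    records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]

    # Rows are written, and later read back, as whole records
    with open("users.avro", "wb") as out:
        writer(out, schema, records)

    with open("users.avro", "rb") as avro_file:
        for record in reader(avro_file):
            print(record)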

Python Pickle

Pickle is a way of serializing Python objects. We can use pickle to serialize a trained machine learning model and save the serialized form to a file. Pickle essentially converts an object into a binary stream that can be written to a file. The binary file can later be de-serialized to load the data back into a Python object.
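A small sketch of that round trip; a plain dictionary stands in for a trained model here, but the same calls work for scikit-learn estimators and similar objects:

    import pickle

    # Any Python object can stand in for a trained model here
    model = {"weights": [0.4, 1.7, -0.2], "bias": 0.1}

    # Serialize the object to a binary file
    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)

    # De-serialize it back into an equivalent Python object
    with open("model.pkl", "rb") as f:
        restored = pickle.load(f)

    print(restored == model)  # True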

Text files

Last but not least, text files are heavily used to ingest unstructured data. When the source is not an ingestion stream, we end up reading data from files, and most unstructured data still lives in text files that need to be read, transformed, and made ready for the algorithms. I don't think I need to go any further on our favourite text files.
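A typical first step, sketched here with a placeholder file name, is simply to read the raw text line by line and do some light clean-up before handing it to the rest of the pipeline:

    # Read an unstructured text file (name is a placeholder) and normalise it lightly
    with open("notes.txt", encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]

    # 'lines' is now a list of non-empty strings ready for tokenisation or other preprocessing
    print(len(lines), "lines read")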






Arati Satpute


One quick question: since Word and PDF file formats are so extensively used in the industry for textual data, shouldn't we consider these file formats as well?


