Data Science: Common file formats and their usage
Paras Nigam
VP, Engineering | AI & Cybersecurity | Generative AI Expert | GCC & GVC | IIM Calcutta | MIT USA | 3AI Thought Leader | Entrepreneur |
A few common data file formats used in textual data science projects
CSV
CSV stands for 'Comma Separated Values'. It stores data in a tabular format, very similar to Excel, and is the format most commonly processed with pandas. It is used when data needs to be stored in tabular form and contents such as text, numbers, and dates are to be processed. It is a row-oriented format.
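A minimal sketch of the typical round trip with pandas (the file name and columns here are illustrative):

```python
import pandas as pd

# A small tabular dataset mixing text, numbers, and dates
df = pd.DataFrame({
    "name": ["Asha", "Ravi"],
    "score": [91, 85],
    "joined": ["2021-01-15", "2021-03-02"],
})

# Write to CSV, then read it back; parse_dates restores the date column
df.to_csv("scores.csv", index=False)
loaded = pd.read_csv("scores.csv", parse_dates=["joined"])
print(loaded.dtypes)
```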
Parquet
Parquet is a column-oriented data format. It enables efficient column-based processing of complex data, and it supports efficient compression and encoding schemes, which lowers storage costs for data files. It also increases the effectiveness of querying with serverless technologies. Compared to CSV, we can get a significant reduction in storage and processing costs.
JSON
JSON stands for JavaScript Object Notation. You might wonder where JSON is used in data science, but the fact is JSON is everywhere: any time there is a need for data interchange, we turn to JSON. In data science we need to ingest, digest, and emit data; while the formats above help in the digest phase, JSON still plays a significant role in ingesting and emitting, in the form of streaming data, generated reports, etc.
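The ingest-and-emit round trip is just the standard library's `json` module. A sketch (the event fields are hypothetical):

```python
import json

# A record as it might arrive from a streaming source
payload = '{"user": "a123", "event": "click", "ts": "2022-05-01T10:00:00Z"}'

record = json.loads(payload)            # ingest: JSON text -> Python dict
record["processed"] = True              # digest: enrich the record
report = json.dumps(record, indent=2)   # emit: dict -> JSON text for a report
print(report)
```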
AVRO
Avro is a row-oriented storage format for Hadoop. Avro stores the schema in JSON format, making it easy to read, although the data itself is not human-readable since it is stored in binary. Avro's advantage is its schema, which is much richer than Parquet's. It is a good choice when records have to be read as a whole and processed, whereas Parquet is better when data is read in groups based on column operations.
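Since the schema is plain JSON, it is easy to inspect even though the records are binary. A minimal illustrative schema (the record and field names here are made up for the example):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```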
Python Pickle
Pickle is a way of serializing Python objects. We can use pickle to serialize a trained machine learning model and save the serialized form to a file. Pickle converts an object to a binary stream that can be written to a file; the binary file can later be de-serialized to load the Python object back.
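A minimal sketch of that save-and-restore cycle; a plain dict stands in here for a trained model object, since pickle handles arbitrary Python objects the same way:

```python
import pickle

# Stand-in for a trained model: any Python object can be pickled
model = {"weights": [0.4, 0.6], "bias": 0.1}

# Serialize the object to a binary file...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and de-serialize it back into an equivalent object
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored)
```

Note that pickle files should only be loaded from trusted sources, since de-serialization can execute arbitrary code.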
Text files
Last but not least, text files are heavily used to ingest unstructured data. If we are not reading from ingestion streams, we end up reading data from files, and most unstructured data still lives in text files that need to be read, transformed, and made ready for the algorithms. I think I don't need to go any further on our favourite text files.
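The read-and-transform step is usually just a few lines. A sketch (the file `notes.txt` is a hypothetical input, created here so the example is self-contained):

```python
# Create a small unstructured text file for the demo
with open("notes.txt", "w", encoding="utf-8") as f:
    f.write("Text files hold unstructured data.\nThey must be cleaned first.\n")

# Read it back, applying a minimal transformation per line
with open("notes.txt", encoding="utf-8") as f:
    lines = [line.strip().lower() for line in f]

print(lines)
```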