Is Your Data Schema AI Ready

Happy New Year to all. 2018 is going to deliver some fantastic advancements in AI, but is your data ready for this new paradigm? How do you build a data model that scales, given today's vast amounts of structured and unstructured data, without running into performance issues?

How can we design a data model that scales? For years I have worked with CSV, databases and various other formats. In our daily work as data scientists, we deal with a lot of tabular data, also known as dataframes, whether in R, Python or elsewhere. As analysts, SQL queries are very important both for BAU work and for day-to-day statistics and analysis.

Our main challenge is integrating data between various systems, e.g. simply passing data into a pandas DataFrame.

What's the simplest thing that could possibly work?

A useful exercise is to take a list of your requirements and ask, “What's the simplest thing that could possibly work?”

Here's a list of requirements I have observed over the years.

  1. Distributed and accessible simultaneously.
  2. Compact wire format, as these datasets will likely be transferred quite often.
  3. Support for data conversion to, at a minimum, Spark, Python, R and various other tools.
  4. Efficient serialisation and deserialisation of datasets across supported systems.
  5. Capacity to perform common SQL-like joins on existing datasets without requiring heroic data manipulation.
  6. The ability to power a SQL interface.
  7. The ability to power machine learning algorithms easily on vast amounts of data, and to share data between pandas DataFrames and R.

Simplicity 

When it comes to data storage, the simplest and most fundamental building blocks are files. To scale out, your goal is to use a simple, file-based design for your system. Bear with me and I will tell you why shortly.

And because we are going to access these files from multiple systems, all we require is a system that can map a file to its contents: an object store.

For the files themselves, the right format will be key. The CSV format is a decent start, but we know that it cannot encode schema information in the file itself. While we could compress CSV before sending, that's true of any file and thus not a real solution. And while we are discussing CSV, it is about the worst format for serialisation and deserialisation.

Ideally our file format should be self-describing, giving us the freedom to use a “schema-on-read” approach where we simply dump the files somewhere (without first specifying their schema, as we would in a DBMS) and decode the schema only when accessing them.

That would allow us to power a SQL interface, as many systems support creating SQL interfaces over file formats of this type. Perhaps the most restrictive requirement is the ability to make joining datasets and adding new columns to existing datasets easy. Since most file formats store data row by row, this seems like a non-starter. After all, how would we add new columns to an existing dataset? Short of reading the data, jamming new column values in row by row and then writing it out to a new file, there is no obvious simple solution.

So we know the kind of system we want, but we are still stuck on the file format. Luckily, through the use of Apache projects we can build our system rather easily.

The perfect storm that can scale from a few megabytes to terabytes - sounds exciting.

Having built schemas to support anything from a few gigabytes to petabytes, Apache Parquet is what's required. Wait for it - it is a file format. But not just any file format; it's a columnar format.

In a columnar format, rather than storing data as a list of independent rows, each file contains the values of one or more columns of data. Parquet in particular also stores the schema alongside the data itself, at the end of the file. Columnar formats and systems are not new.
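
Because the schema travels with the file, you can inspect it without reading any data. Here is a minimal sketch using pyarrow (covered in more detail below; the file name is hypothetical):

import pyarrow.parquet as pq

# Read only the footer metadata: column names and types, no data pages
schema = pq.read_schema('events.parquet')
print(schema)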

Let's look at the benefits of Parquet.

Data for a single column is stored contiguously and all values share the same data type, allowing the data to be compressed using simple and well-known compression tricks. Parquet also supports applying actual compression algorithms to the data, and even different algorithms for different columns.
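
As a sketch of what that looks like with pyarrow, write_table accepts a compression argument, either a single codec for the whole file or a per-column mapping (the file and column names here are illustrative):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A small table to demonstrate compression options
table = pa.Table.from_pandas(pd.DataFrame({'one': [1.0, 2.5, 3.0],
                                           'two': ['foo', 'bar', 'baz']}))

# One codec for the whole file ...
pq.write_table(table, 'one_codec.parquet', compression='snappy')

# ... or a different codec per column
pq.write_table(table, 'per_column.parquet',
               compression={'one': 'gzip', 'two': 'snappy'})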

When doing analysis on large datasets, it's likely that only a subset of columns is needed at a given time, especially after applying principal component analysis. Arranging data by columns means that columns unused in a given query never need to be read from disk, which is a huge performance boost.

A full dataset rarely fits in memory, so disk access patterns have become an extremely important differentiator of storage systems. Serialisation and deserialisation of data written in a columnar format is usually much faster because a given column's data is stored contiguously. Note that Parquet is based on the Google paper describing Dremel.
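
As a rough illustration (a sketch only; the exact numbers depend on your data, library versions and hardware), you can time a CSV round trip against a Parquet round trip with pandas and pyarrow:

import time
import numpy as np
import pandas as pd

# A reasonably sized numeric DataFrame so the timing difference is visible
df = pd.DataFrame(np.random.randn(1_000_000, 10),
                  columns=['c%d' % i for i in range(10)])

start = time.time()
df.to_csv('demo.csv', index=False)
pd.read_csv('demo.csv')
print('CSV round trip: %.1fs' % (time.time() - start))

start = time.time()
df.to_parquet('demo.parquet')  # uses the pyarrow engine when it is installed
pd.read_parquet('demo.parquet')
print('Parquet round trip: %.1fs' % (time.time() - start))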

So Parquet, especially on S3, actually satisfies most of the requirements:

  1. Its columnar format makes adding new columns to existing data easy.
  2. Files are compressed by the encoding scheme, resulting in small Parquet files compared to the same data in CSV.
  3. All major systems provide support for Parquet as a file format.
  4. Spark natively supports Parquet.
  5. S3 handles all the distributed system requirements.
  6. And you can execute SQL queries on vast amounts of data (see the PySpark sketch after this list).
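
For that last point, here is a minimal sketch with PySpark (assuming a Spark installation with S3 access configured; the bucket, path and column names are hypothetical):

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName('parquet-sql').getOrCreate()

# Register Parquet files stored on S3 as a SQL-queryable view
events = spark.read.parquet('s3a://my-bucket/events/')
events.createOrReplaceTempView('events')

# Run plain SQL over the columnar data
spark.sql('SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id').show()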

So you may ask: what about the data scientist, how can we use this system?

Take a look at Apache Arrow. Arrow is a columnar in-memory data format with an accompanying series of libraries. Apart from Python, it provides libraries for a growing number of programming languages. This allows you to read and write the Parquet format quite easily; Parquet support is a first-class citizen in the Arrow project.

So how do we convert data to Parquet?

import numpy as np
import pandas as pd
import pyarrow as pa

# Build a small pandas DataFrame with mixed column types
df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]})

# Convert it to an Arrow Table
table = pa.Table.from_pandas(df)

Write Table 

import pyarrow.parquet as pq

# Write the Arrow Table to a Parquet file
pq.write_table(table, 'example.parquet')

Read Table 

table2 = pq.read_table('example.parquet')
table2.to_pandas()

which gives back the original DataFrame:

   one  three  two
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz

And you can do fine-grained reads and writes, for example by reading only specific columns, as in the sketch below.
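
Continuing the example above, reading just two of the three columns looks like this:

import pyarrow.parquet as pq

# Only the 'one' and 'three' columns are read from disk
subset = pq.read_table('example.parquet', columns=['one', 'three'])
print(subset.to_pandas())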

Conclusion 

Parquet is a very powerful file format and system for reading and writing data. Uber today uses Presto and Parquet to access over five petabytes of data, completing more than 90 percent of queries within 60 seconds (via SQL queries). As described, it all comes down to the way data is stored and accessed.

Leo Mao
Architecture & Platform Practice Lead - Data Intelligence Platform

Nice work Gavin! I totally agree with your point: if there is a simpler way to achieve the same result, always choose the simple one for readability, supportability and easier enhancement in the future.

Gavin, well written article. One question: what's your recommendation for the underlying database infrastructure? Do you recommend going all-in with Redshift? Also, what's your opinion on NoSQL vs SQL? What's the best application area of NoSQL in machine learning?
