Is Your Data Schema AI Ready

Happy New Year to all. 2018 is going to deliver some fantastic advancements in AI, but is your data ready for this new paradigm? How do you build a data model that scales, given today's vast amounts of structured and unstructured data, without running into performance issues?

How can we design a data model that scales? For years I have worked with CSV, databases and various other formats. In our daily work as data scientists, we deal with a lot of tabular data, also known as dataframes, whether in R, Python or elsewhere. As analysts, SQL queries are very important both for BAU work and for day-to-day statistics and analysis.

Our main challenge is integrating data between various systems, e.g. simply passing data into a pandas DataFrame.

What's the simplest thing that could possibly work?

A useful exercise is to take a list of your requirements and ask, “What's the simplest thing that could possibly work?”

Here's a list of requirements I have observed over the years.

  1. Distributed and accessible simultaneously.
  2. Compact wire format, as these datasets will likely be transferred quite often.
  3. Support for data conversion to, at a minimum, Spark, Python, R and various other tools.
  4. Efficient serialisation and deserialisation of datasets across supported systems.
  5. Capacity to perform common SQL-like joins on existing datasets without requiring heroic data manipulation.
  6. The ability to power a SQL interface.
  7. The ability to power machine learning algorithms easily on vast amounts of data, and to share data between pandas DataFrames and R.

Simplicity 

When it comes to data storage, the simplest and most fundamental building blocks are files. To scale out, your goal is to use a simple, file-based design for your system. Bear with me and I will tell you why shortly.

And because we are going to access these files from multiple systems, all we require is a system that can map a file to its contents: an object store.

For the files themselves, the right format will be key. The CSV format is a decent start, but we know that it cannot encode schema information in the file itself. While we could compress CSV before sending, that's true of any file and thus not a real solution. And while we are discussing CSV, it is about the worst format for serialisation and deserialisation.

Ideally our file format should be self-describing, giving us the freedom to use a “schema-on-read” approach where we simply dump the files somewhere (without first specifying their schema, as we would in a DBMS) and decode the schema only when accessing them.

That would allow us to power a SQL interface, as many systems support creating SQL interfaces over file formats of this type. Perhaps the most restrictive requirement is the ability to make joining datasets and adding new columns to existing datasets easy. Since most file formats store data row by row, this seems like a non-starter. After all, how would we add new columns to an existing dataset? Short of reading the data, jamming new column values in row by row and then writing it out to a new file, there is no obvious simple solution.

So we know the kind of system we want, but we are still stuck on the file format. Luckily, through the use of Apache projects we can build our system rather easily.

The perfect storm that can scale from a few megabytes to terabytes - sounds exciting.

Having built schemas to support anything from a few gigabytes to petabytes, Apache Parquet is what's required. Wait for it - it is a file format. But not just any file format; it's a columnar format.

In a columnar format, rather than storing data as a list of independent rows, each file contains the values of one or more columns of data. Parquet in particular also stores the schema alongside the data itself, at the end of the file. Columnar formats and systems are not new.
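
Because the schema travels with the file, you can inspect it without reading any data. Here is a minimal sketch using pyarrow (covered in more detail below; the file name is hypothetical):

import pyarrow.parquet as pq

# Read only the footer metadata: column names and types, no data pages
schema = pq.read_schema('events.parquet')
print(schema)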

Let's look at the benefits of Parquet.

Data for a single column is stored contiguously and all values share the same data type, allowing the data to be compressed using simple and well-known compression tricks. Parquet also supports applying actual compression algorithms to the data, and even different algorithms for different columns.
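
As a sketch of what that looks like with pyarrow, write_table accepts a compression argument, either a single codec for the whole file or a per-column mapping (the file and column names here are illustrative):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A small table to demonstrate compression options
table = pa.Table.from_pandas(pd.DataFrame({'one': [1.0, 2.5, 3.0],
                                           'two': ['foo', 'bar', 'baz']}))

# One codec for the whole file ...
pq.write_table(table, 'one_codec.parquet', compression='snappy')

# ... or a different codec per column
pq.write_table(table, 'per_column.parquet',
               compression={'one': 'gzip', 'two': 'snappy'})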

When doing analysis on large datasets, it's likely that only a subset of columns is needed at a given time, especially after applying principal component analysis. Arranging data by columns means that columns unused in a given query never need to be read from disk, which is a huge performance boost.

A full dataset rarely fits in memory, so disk access patterns have become an extremely important differentiator of storage systems. Serialisation and deserialisation of data written in a columnar format is usually much faster because a given column's data is stored contiguously. Note that Parquet is based on the Google paper describing Dremel.
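
As a rough illustration (a sketch only; the exact numbers depend on your data, library versions and hardware), you can time a CSV round trip against a Parquet round trip with pandas and pyarrow:

import time
import numpy as np
import pandas as pd

# A reasonably sized numeric DataFrame so the timing difference is visible
df = pd.DataFrame(np.random.randn(1_000_000, 10),
                  columns=['c%d' % i for i in range(10)])

start = time.time()
df.to_csv('demo.csv', index=False)
pd.read_csv('demo.csv')
print('CSV round trip: %.1fs' % (time.time() - start))

start = time.time()
df.to_parquet('demo.parquet')  # uses the pyarrow engine when it is installed
pd.read_parquet('demo.parquet')
print('Parquet round trip: %.1fs' % (time.time() - start))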

So Parquet, especially on S3, actually satisfies most of the requirements:

  1. Its columnar format makes adding new columns to existing data easy.
  2. Files are compressed by the encoding scheme, resulting in small Parquet files compared to the same data in CSV.
  3. All major systems provide support for Parquet as a file format.
  4. Spark natively supports Parquet.
  5. S3 handles all the distributed system requirements.
  6. And you can execute SQL queries on vast amounts of data (see the PySpark sketch after this list).
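
For that last point, here is a minimal sketch with PySpark (assuming a Spark installation with S3 access configured; the bucket, path and column names are hypothetical):

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName('parquet-sql').getOrCreate()

# Register Parquet files stored on S3 as a SQL-queryable view
events = spark.read.parquet('s3a://my-bucket/events/')
events.createOrReplaceTempView('events')

# Run plain SQL over the columnar data
spark.sql('SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id').show()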

So you may ask: what about the data scientist, how can we use this system?

Take a look at Apache Arrow. Arrow is a columnar in-memory data format with an accompanying series of libraries. Apart from Python, it provides libraries for a growing number of programming languages. This allows you to read and write the Parquet format quite easily; Parquet support is a first-class citizen in the Arrow project.

So how do we convert data to Parquet?

import numpy as np
import pandas as pd
import pyarrow as pa

# Build a small pandas DataFrame with mixed column types
df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]})

# Convert it to an Arrow Table
table = pa.Table.from_pandas(df)

Write Table 

import pyarrow.parquet as pq

# Write the Arrow Table to a Parquet file
pq.write_table(table, 'example.parquet')

Read Table 

table2 = pq.read_table('example.parquet')
table2.to_pandas()

which gives back the original DataFrame:

   one  three  two
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz

And you can do fine-grained reads and writes, for example by reading only specific columns, as in the sketch below.
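
Continuing the example above, reading just two of the three columns looks like this:

import pyarrow.parquet as pq

# Only the 'one' and 'three' columns are read from disk
subset = pq.read_table('example.parquet', columns=['one', 'three'])
print(subset.to_pandas())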

Conclusion 

Parquet is a very powerful file format and system for reading and writing data. Uber today uses Presto and Parquet to access over five petabytes of data, completing more than 90 percent of queries within 60 seconds (via SQL queries). As described, it all comes down to the way data is stored and accessed.

Leo Mao
Architecture & Platform Practice Lead - Data Intelligence Platform

Nice work Gavin! I totally agree with your point: if there is a simpler way to achieve the same result, always choose the simple one for readability, supportability and easier enhancement in the future.

Gavin, well written article. One question: what's your recommendation for the underlying database infrastructure? Do you recommend going all-in with Redshift? Also, what's your opinion on NoSQL vs SQL? What's the best application area of NoSQL in machine learning?
