Next Gen Data Analytics - Open Data Architecture

The Rise of the Data Warehouse

Here is a classic example of how data warehouses have traditionally been built.

We have all seen solutions like this for collecting, processing and consuming data. The model works fine as long as you have structured data of a reasonable size, in the hundreds of gigabytes. Once the data grows to terabytes or petabytes, you have to invest much more time in understanding how to partition the data and precalculate cubes and BI extracts.


THINGS HAVE CHANGED!

  1. The data. Data is the new oil and has become far more varied. Relational data no longer makes up the majority of the total. Instead, semi-structured and unstructured data such as JSON, Parquet, voice, images and video dominate. To integrate that data, modern storage such as HDFS and object stores has arisen to solve data integration challenges. It is now very easy to store all your data in a lake.
  2. The rise of self-service discovery and analytics tools. Tableau, Power BI and Jupyter notebooks give analysts and data scientists the freedom to explore data on their own.

Unfortunately, access to the data lake is not easy. The amount of raw data is massive, and retrieval from the data lake is usually less performant. To keep control of compute resource usage, only a select group of people is allowed to access the data lake directly. The next problem is keeping the data secure and closely governed: because of the nature of data lakes, there is no easy way to control access to data or to align with enterprise-wide master data management.

The next question is: how can the data in the lake be consumed with acceptable performance while remaining governed?

First and foremost, the natural tendency is to reuse methods that have been in place for years: let’s build or extend a data warehouse. This gives us the required, aligned structure and, finally, a fast layer on which we can build data marts or cubes.

However, the old pattern cannot satisfy the capabilities required today. Easy ingestion and the desire of analysts to access data in a self-service manner add stress to the data warehouse, which is, by its nature, built as a monolith. Its inflexibility conflicts with the agile nature of the self-service layer. As a result, data engineers are unable to keep up with the highly volatile requirements coming from analysts.

What we need is:

●     Fast data access without complex ETL processes or cubes

●     An easy way to access the data lake without duplicating the data

●     Governed, secured and audited data access

●     An easily searchable semantic layer

Dremio’s Data Lake Engine delivers lightning-fast query speed and a self-service semantic layer that operates directly against your data lake storage.

Dremio provides connections to S3, ADLS, Hadoop, or wherever your data is. Apache Arrow, Data Reflections, C3 and other Dremio technologies work together to speed up queries by up to 1,000x. An abstraction layer enables IT to apply security and business meaning, while enabling analysts and data scientists to explore data and derive new virtual datasets. Dremio works directly with your data lake storage: you don't have to send your data to Dremio or have it stored in proprietary formats that lock you in. Dremio is built on open source technologies such as Apache Arrow and can run in any cloud or data centre. Alongside this, powerful join capabilities mean that your data lake can benefit from other sources too.

Let’s build an Open Data Lake with a Query Engine

A real-life implementation of a self-hosted data lake

Components: Scality Zenko CloudServer (S3 interface), Dremio CE, Docker, Python 3

On a node with Docker pre-installed, run Scality:

docker run -d --name cloudserver -p 8000:8000 -e SCALITY_ACCESS_KEY_ID=AccessKey -e SCALITY_SECRET_ACCESS_KEY=SecretKey -e REMOTE_MANAGEMENT_DISABLE=1 zenko/cloudserver

Now let’s import some data into Scality via the S3 interface and Python. We will upload airports (CSV) and arrival delays (Parquet) data.

import boto3
import os

path = '/Users/viktor/Dremio/data/'
bucket = 'flights'

# Zenko CloudServer listens on plain HTTP on port 8000 by default
client = boto3.client(
    's3',
    aws_access_key_id='AccessKey',
    aws_secret_access_key='SecretKey',
    endpoint_url='http://127.0.0.1:8000'
)


def loadData():
    # Create the target bucket and upload the airport lookup file
    client.create_bucket(Bucket=bucket)

    with open(path + 'airport.csv', 'rb') as fb:
        client.upload_fileobj(fb, bucket, 'airport.csv')

    # Upload every Parquet file from the AirportDelays folder
    for f in os.listdir(path + 'AirportDelays/'):
        with open(path + 'AirportDelays/' + f, 'rb') as fb:
            client.upload_fileobj(fb, bucket, 'AirportDelays/' + f)


if __name__ == "__main__":
    loadData()

As the next step, we run Dremio:

docker run --quiet -p 9047:9047 -p 31010:31010 -p 45678:45678 dremio/dremio-oss

Dremio is up and running. Because both containers are, for now, deployed on the same node, we need to allow them to communicate with each other. Therefore, we create a Docker network and add both containers to it:

docker network create scality-dremio

docker network connect scality-dremio [Dremio’s container ID]

docker network connect scality-dremio [Scality’s container ID]

docker network inspect scality-dremio

In the output, locate the entry under "Containers" whose "Name" is "cloudserver" and copy its IPv4Address:

"Containers": {

          "9a90436bf9...": {

                "Name": "cloudserver",

                "EndpointID": "46f50cab08726d4989...",

                "MacAddress": "02:42:ac:13:00:03",

                "IPv4Address": "172.19.0.3/16",

                "IPv6Address": ""

            },

Now we are ready to switch to the Dremio UI. Open a browser and go to localhost:9047. Initially, Dremio will ask you to create a new user. Fill in all required fields to create one.

After registration, the Dremio UI will appear.

Let’s add our Scality source to Dremio so that we can analyze the data. Click “Add Source”. A new window will pop up, where we choose Amazon S3.

Now enter a name for the source plus the AccessKey and SecretKey that you defined when starting Scality. Do not save yet; instead, choose Advanced Options in the menu on the left.

Check “Enable compatibility mode (experimental)” and add a new connection property: fs.s3a.endpoint = [your Scality IP address]:8000

Finally, save the source settings. In the Dremio UI you will see that a new source has been added. Click the source and you will see the bucket we created earlier.

Now select the bucket. Inside it we see one CSV file and one folder, AirportDelays, which contains Parquet files.

We want to make the dataset more understandable for our business users. To import the metadata, click the format symbol below Actions.

A new window will appear in which Dremio recognises the Parquet format. Click the Save button. Dremio formats this folder as a dataset, which is only a metadata representation of the data; Dremio does not copy the data.

Dremio will open a preview window of the AirportDelays data.

Now repeat the same action for airport.csv. Go to Datasets, choose the scality-s3 source and the flights bucket, and format the CSV file as a dataset.

As a result, we have two physical datasets (PDS) in our bucket.

Now we can start analysing the data. First, click on AirportDelays. In the SQL Editor we can write arbitrary statements. Let’s count the number of rows: we have approximately 48 million.
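
For reference, the count statement looks roughly like this, assuming the source was named scality-s3 as above:

SELECT COUNT(*) FROM "scality-s3".flights.AirportDelays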

Then run a “select * from AirportDelays” statement. We only have the airport codes and are missing the real city names. The city name information resides in the airport.csv file, so we can easily join that information from the CSV onto our Parquet-based AirportDelays data.

For this join we can either write the SQL statement directly or use Dremio’s UI wizard. Click Join, just above the OP_CARRIER column header. In the new window, search for airport.csv.

Now define the type of join (Inner Join) and drag and drop the join condition: ORIGIN and field E (which contains the airport code).

The result is joined from two different sources. To make the joined dataset more readable, we can rename column C to City in the UI.
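
The UI steps above correspond roughly to the following statement (a sketch; the single-letter column names are Dremio’s defaults for the header-less CSV, and scality-s3 is the source name we chose earlier):

SELECT d.*, a.C AS City
FROM "scality-s3".flights.AirportDelays AS d
INNER JOIN "scality-s3".flights."airport.csv" AS a
  ON d.ORIGIN = a.E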

Next we will analyze delay information by date and city. Click the Group By button, which sits to the left of the Join button we clicked previously.

Drag and drop FL_DATE and City into the Dimensions area, and add DEP_DELAY to the Metrics area twice, aggregated as Count and as Sum.
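
As a sketch, the aggregation Dremio builds here is roughly equivalent to the following query (the exact aliases Dremio generates may differ):

SELECT FL_DATE,
       City,
       COUNT(DEP_DELAY) AS Count_DEP_DELAY,
       SUM(DEP_DELAY) AS Sum_DEP_DELAY
FROM (
  SELECT d.FL_DATE, d.DEP_DELAY, a.C AS City
  FROM "scality-s3".flights.AirportDelays AS d
  INNER JOIN "scality-s3".flights."airport.csv" AS a
    ON d.ORIGIN = a.E
) AS joined
GROUP BY FL_DATE, City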

Now that we are happy with our dataset, we save the result as a virtual dataset (VDS). Still, no data has actually been copied from our data lake.
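
The save step can also be written in SQL by wrapping the query from the previous sketch in a CREATE VDS statement; a minimal sketch, assuming a space called yourspace (a hypothetical name; newer Dremio versions use CREATE VIEW instead):

CREATE VDS yourspace.delays AS
SELECT FL_DATE,
       City,
       COUNT(DEP_DELAY) AS Count_DEP_DELAY,
       SUM(DEP_DELAY) AS Sum_DEP_DELAY
FROM (
  SELECT d.FL_DATE, d.DEP_DELAY, a.C AS City
  FROM "scality-s3".flights.AirportDelays AS d
  INNER JOIN "scality-s3".flights."airport.csv" AS a ON d.ORIGIN = a.E
) AS joined
GROUP BY FL_DATE, City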

Go to Datasets at the top of the page and in your space you will see the new “delays” dataset.

Select the dataset. Now we want to analyze it directly in Tableau, so click the Tableau button.

A small file (.tds) with connection information has been downloaded. Open that file. Tableau will launch and prompt for your login credentials. These are the credentials that we set up earlier.

After successful login we can see our dataset. Tableau can now run directly on the data lake storage.

What we can see is that reading roughly 50 million rows from S3-like storage took about 12 seconds.

If the response time is still not sufficient and needs to be faster, we can enable acceleration and build a Dremio Reflection.
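
In this walkthrough the Reflection is enabled via the UI toggle on the dataset. For reference, a raw reflection on the hypothetical yourspace.delays VDS could also be defined in SQL roughly as follows (a sketch; syntax varies between Dremio versions):

ALTER DATASET yourspace.delays
CREATE RAW REFLECTION delays_raw
USING DISPLAY (FL_DATE, City, Count_DEP_DELAY, Sum_DEP_DELAY)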

If you rerun your Tableau query, you will see that it now takes less than a second.

Conclusion

We have built a data lake on an S3-compatible object store, Scality, and given data engineers the ability to easily wrangle and curate data. Tableau users can now run queries directly on the data lake.

 

To learn more, visit university.dremio.com


