Next Gen Data Analytics - Open Data Architecture
The Rise of the Data Warehouse
Here is a classic example of how data warehouses have traditionally been built.
We have all seen such solutions to collect, process and consume data. The model works fine as long as the data is structured and of a reasonable size, in the hundreds of gigabytes. Once the data grows to terabytes and petabytes, you need to invest far more time in working out how to partition the data and precalculate cubes and BI extracts.
THINGS HAVE CHANGED!
- The data. Data is the new oil and has become far more varied. Relational data no longer makes up the majority of the total; instead, unstructured and semi-structured data such as JSON, Parquet, voice, images and video dominate. To integrate that data, modern storage such as HDFS and object stores has arisen to solve data integration challenges. It is now very easy to store all your data in a lake.
- The rise of self-service discovery and analytics tools. Tableau, Power BI and Jupyter notebooks give analysts and data scientists the freedom to explore the data on their own.
Unfortunately, access to the data lake is not easy. The amount of raw data is massive, and retrieval from the data lake is usually far less performant. To keep control of compute resource usage, only a small group of people is allowed to access the data lake directly. The next problem is keeping the data secure and closely governed: because of the nature of data lakes, there is no easy way to control access to data or to align with enterprise-wide master data management.
The next question is: how can the data in the lake be consumed with acceptable performance whilst also being governed?
First and foremost, the natural tendency is to reuse methods that have been in place for years: build or extend a data warehouse. This gives us the required, aligned structure and, finally, a fast layer on which we can build data marts or cubes.
However, the old pattern cannot satisfy the capabilities required today. Easy ingestion, and analysts’ desire to access data in a self-service manner, put stress on the data warehouse, which is by its nature built as a monolith. The inflexibility of the data warehouse conflicts with the agile nature of the self-service layer. As a result, data engineers are unable to keep up with the very volatile requirements coming from analysts.
What we need is:
● Fast data access without complex ETL processes or cubes
● An easy way to access the data lake without duplicating the data
● Governed, secured and audited data access
● An easily searchable semantic layer
Dremio’s Data Lake Engine delivers lightning fast query speed and a self-service semantic layer operating directly against your data lake storage.
Dremio provides connections to S3, ADLS, Hadoop, or wherever your data is. Apache Arrow, Data Reflections, C3 and other Dremio technologies work together to speed up queries by up to 1,000x. An abstraction layer enables IT to apply security and business meaning, while enabling analysts and data scientists to explore data and derive new virtual datasets. Dremio works directly with your data lake storage: you don't have to send your data to Dremio or have it stored in proprietary formats that lock you in. Dremio is built on open source technologies such as Apache Arrow, and can run in any cloud or data centre. Alongside this, powerful join capabilities mean that your data lake can benefit from other sources too.
Let’s build an Open Data Lake with a Query Engine
A real-life implementation of a self-hosted data lake
Components: Scality Zenko CloudServer (S3 interface), Dremio CE, Docker, Python 3
On a node with Docker pre-installed, run the Scality CloudServer:
docker run -d --name cloudserver -p 8000:8000 -e SCALITY_ACCESS_KEY_ID=AccessKey -e SCALITY_SECRET_ACCESS_KEY=SecretKey -e REMOTE_MANAGEMENT_DISABLE=1 zenko/cloudserver
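Before loading any data, it is worth checking that CloudServer is reachable from the host. Here is a minimal sketch using boto3, assuming the access keys and port from the command above (CloudServer serves plain HTTP by default):

import boto3

# Sanity check: connect to the local CloudServer and list buckets
client = boto3.client(
    's3',
    aws_access_key_id='AccessKey',
    aws_secret_access_key='SecretKey',
    endpoint_url='http://127.0.0.1:8000'
)
print(client.list_buckets().get('Buckets', []))  # a fresh install returns an empty list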
Now let’s import some data into Scality via the S3 interface and Python. We will upload airports (CSV) and arrival delays (Parquet) data.
import boto3
import os

path = '/Users/viktor/Dremio/data/'
bucket = 'flights'

# CloudServer speaks plain HTTP on port 8000 by default
client = boto3.client(
    's3',
    aws_access_key_id='AccessKey',
    aws_secret_access_key='SecretKey',
    endpoint_url='http://127.0.0.1:8000'
)

def loadData():
    client.create_bucket(Bucket=bucket)
    # Upload the airports lookup file (CSV)
    with open(path + 'airport.csv', "rb") as fb:
        client.upload_fileobj(fb, bucket, 'airport.csv')
    # Upload every Parquet file in the AirportDelays folder
    files = os.listdir(path + 'AirportDelays/')
    for f in files:
        with open(path + 'AirportDelays/' + f, "rb") as fb:
            client.upload_fileobj(fb, bucket, 'AirportDelays/' + f)

if __name__ == "__main__":
    loadData()
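To confirm that the upload worked, we can list the objects that landed in the bucket, for example by appending the following to the end of the loader script above:

# Verify what was uploaded to the 'flights' bucket
for obj in client.list_objects_v2(Bucket=bucket).get('Contents', []):
    print(obj['Key'], obj['Size'])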
As the next step, we will run Dremio:
docker run --quiet -p 9047:9047 -p 31010:31010 -p 45678:45678 dremio/dremio-oss
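Dremio can take a minute or so to initialise. A small sketch (assuming the requests package is installed) that waits until the UI answers on port 9047 before you continue:

import time
import requests

# Poll the Dremio UI until it responds on port 9047
for _ in range(60):
    try:
        if requests.get('http://localhost:9047', timeout=2).status_code == 200:
            print('Dremio UI is up')
            break
    except requests.exceptions.RequestException:
        pass
    time.sleep(5)
else:
    print('Dremio did not come up in time')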
Dremio is now up and running. We need to allow the two containers to communicate with each other, since both are deployed on the same node for now. Therefore, we will create a Docker network and connect both containers to it.
docker network create scality-dremio
docker network connect scality-dremio [Dremio’s container ID]
docker network connect scality-dremio [Scality’s container ID]
docker network inspect scality-dremio
In the output, under “Containers”, find the entry whose “Name” is “cloudserver” and copy its IPv4Address.
"Containers": { "9a90436bf9...": { "Name": "cloudserver", "EndpointID": "46f50cab08726d4989...", "MacAddress": "02:42:ac:13:00:03", "IPv4Address": "172.19.0.3/16", "IPv6Address": ""
},
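Alternatively, you can extract the address with a short script rather than reading the JSON by eye. This sketch assumes the network name used above and that the container is still named cloudserver:

import json
import subprocess

# 'docker network inspect' prints a JSON array with one entry per network
out = subprocess.check_output(['docker', 'network', 'inspect', 'scality-dremio'])
network = json.loads(out)[0]

# Find the cloudserver container and print its IPv4 address without the subnet suffix
for container in network['Containers'].values():
    if container['Name'] == 'cloudserver':
        print(container['IPv4Address'].split('/')[0])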
Now we are ready to switch to the Dremio UI. Open a browser and go to localhost:9047. Initially, Dremio will ask you to create a new user; fill in all required fields to create one.
After registration, the Dremio UI will appear.
Let’s add our Scality source to Dremio so that we can analyze the data. Click “Add Source”. A new window will pop up, where we choose Amazon S3.
Now enter a name for the source, along with the AccessKey and SecretKey that you defined when running Scality. Do not save yet; instead, choose Advanced Options in the menu bar on the left.
Check “Enable compatibility mode (experimental)” and add a new connection property: fs.s3a.endpoint : [your Scality IP address]:8000
Finally, save the source settings. In the Dremio UI you will see that a new source has been added. Click the source and you will see the bucket we created previously.
Now select the bucket. Inside the bucket we see one CSV file and one folder, AirportDelays, which contains Parquet files.
We want to make the dataset more understandable to our business users. To import the metadata, click on the symbol below Actions.
A new window will appear in which Dremio recognises the Parquet format. Click the Save button. Dremio will format this data as a dataset, which is only a metadata representation of the data; Dremio does not copy the data.
Dremio will open a preview window of the AirportDelays data.
Now repeat the same action with airport.csv. Go to Datasets, choose the scality-s3 source and the flights bucket, and format the CSV file as a dataset.
As a result, we will have two physical datasets (PDS) in our bucket.
Now we can start analysing the data. First click on AirportDelays. In the SQL Editor we can write various statements; let’s start by counting the number of rows, for example with SELECT COUNT(*) FROM AirportDelays. We have approximately 48 million rows.
Then run a “select * from AirportDelays” statement. We only have the airport codes and are missing the real city names. The city name information resides in the airport.csv file, so we can easily join that information from the CSV onto our Parquet-based AirportDelays data.
For this join we can either write a SQL statement directly or use Dremio’s UI wizard. Click Join, just above the OP_CARRIER column header. In the new window, search for airport.csv.
Now define the type of join (Inner Join) and drag and drop the join condition: ORIGIN and field E (which contains the airport code).
The result is a join across two different sources. To make the joined dataset more readable, we can rename column C to City in the UI.
Next we will analyze the delay information by date and city. Click the Group By button, which is to the left of the Join button we used previously.
Drag and drop FL_DATE and City into the Dimensions area, and add DEP_DELAY as a metric with both Count and Sum.
Now that we are happy with our dataset, we will save the result as a virtual dataset (VDS). Still, no data has actually been copied from our data lake.
Go to Datasets at the top of the page and in your space you will see the new “delays” dataset.
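The saved VDS is not only available in the UI; it can also be queried programmatically. The following is a rough sketch against Dremio’s REST API (endpoint paths can vary between versions, and the space name analytics and the credentials below are placeholders you need to adjust to your setup):

import time
import requests

DREMIO = 'http://localhost:9047'
USER = 'your_dremio_user'          # the account created in the Dremio UI
PASSWORD = 'your_dremio_password'

# Log in and build the Authorization header Dremio expects
token = requests.post(f'{DREMIO}/apiv2/login',
                      json={'userName': USER, 'password': PASSWORD}).json()['token']
headers = {'Authorization': f'_dremio{token}'}

# Submit a query against the saved VDS; adjust the path to your own space
job_id = requests.post(f'{DREMIO}/api/v3/sql', headers=headers,
                       json={'sql': 'SELECT * FROM analytics.delays LIMIT 10'}).json()['id']

# Poll until the job finishes, then fetch the first rows
while True:
    state = requests.get(f'{DREMIO}/api/v3/job/{job_id}', headers=headers).json()['jobState']
    if state in ('COMPLETED', 'FAILED', 'CANCELED'):
        break
    time.sleep(1)

if state == 'COMPLETED':
    print(requests.get(f'{DREMIO}/api/v3/job/{job_id}/results', headers=headers).json()['rows'])
else:
    print('Query did not complete:', state)

Because the VDS is just a view, this query still executes directly against the Parquet and CSV files in Scality.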
Select the dataset. Now we want to analyze it directly in Tableau, so click the Tableau button.
A small file (.tds) containing the connection information will be downloaded. Open that file: Tableau will launch and prompt for login credentials; these are the Dremio credentials we set up earlier.
After a successful login we can see our dataset. Tableau now runs its queries directly against the data lake storage.
What we can see is that reading roughly 50 million rows from S3-like storage took about 12 seconds.
If the response time is still not sufficient and needs to be faster, we can enable acceleration and build a Dremio Reflection.
If you rerun your Tableau query, you will see that it now takes less than a second.
Conclusion
We have built a data lake on Scality, an S3-compatible object store, and given data engineers the ability to easily wrangle and curate data. Tableau users can now run queries directly on the data lake.
To learn more, visit university.dremio.com