Proposal for a Management Architecture for Large Volumes of Data
Article written by Pedro Bonillo, ML and Big Data consultant at INNOVANT
Introduction
Before describing the components that make up the Data Architecture, it is necessary to establish a conceptual framework about them. The four high-level components that make up this architecture are: (1) Databases; (2) Big Data Analytics (BI); (3) Data Science; and (4) Governance.
Databases
The first aspect addresses databases. The relational model presented by Edgar Frank Codd stores data as a matrix of rows and columns: for each new attribute a new column is created, even though many rows have no value for that attribute. According to research by the Gartner Group, this model has allowed organizations to manage only about 30% of their data.
This is why a new concept for storing and managing data has emerged: key-value. In this model, different rows can have different attributes. There are mainly three types of key-value databases in Big Data: (1) transactional with high availability; (2) document-oriented; and (3) georeferenced.
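To illustrate the idea with invented records, in a key-value or document store each record carries only the attributes it actually has, instead of forcing a sparse column for every possible attribute:

```python
# Relational style: every row carries every column, even when the value is unknown
relational_rows = [
    {"id": 1, "name": "Ana",  "email": "ana@example.com", "phone": None, "geo": None},
    {"id": 2, "name": "Luis", "email": None, "phone": "+58 212 555 0100", "geo": None},
]

# Key-value / document style: each record stores only the attributes it has
key_value_records = {
    "1": {"name": "Ana", "email": "ana@example.com"},
    "2": {"name": "Luis", "phone": "+58 212 555 0100", "geo": {"lat": 10.48, "lon": -66.90}},
}
```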
Big Data Analytics
The term Business Intelligence has evolved, in the context of Big Data, into Big Data Analytics, where you do not necessarily work with the database concept but rather with a distributed file system. This file system is still populated through data extraction, transformation, cleaning and loading tools, but these tools now write in the new key-value format, with the data compressed, into a zone of the distributed file system called raw data.
The forerunner of these distributed file systems was the Google File System (GFS), which evolved into what is now known as the Hadoop Distributed File System (HDFS), the storage layer of Apache Hadoop.
This new type of file system uses distributed computing in order to process large amounts of data (more than a million records, with high variability, and with queries answered in less than 5 seconds) across several servers. Spark is an optimized engine for this kind of distributed processing, able to run on cluster managers such as Apache Mesos. Thus, most organizations currently use Spark to schedule programs and divide tasks of summarization, grouping and selection over the data stored in the distributed file system. In AWS the distributed file system is S3, in Google Cloud Platform it is Cloud Storage (GCS), and in Microsoft Azure it is Azure Storage.
Spark then allows you to manage MapReduce-style tasks on the distributed file system. However, to use Spark it is necessary to write a program in Java, Python, Scala or R. In AWS the managed version of Spark is EMR (Elastic MapReduce), in Google Cloud it is Dataproc, and in Microsoft Azure it is HDInsight with Databricks. Such a program commonly loads the data in HDFS (the distributed file system) into a distributed data structure spread across data nodes, such as an RDD or a DataFrame (a distributed collection of data rows), then performs queries using the standard SQL query language, and finally writes the result back to HDFS (in an area called curated data), to a relational database, or to a comma-separated file, for example to be consumed by a report visualization tool.
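A minimal PySpark sketch of this flow, assuming a hypothetical S3 bucket and prefixes (the same code works against HDFS paths):

```python
from pyspark.sql import SparkSession

# Create (or reuse) the Spark session that coordinates the distributed job
spark = SparkSession.builder.appName("raw-to-curated").getOrCreate()

# Load compressed key-value (Parquet) data from the raw zone into a distributed DataFrame
raw = spark.read.parquet("s3://my-datalake/raw/sales/")  # hypothetical path

# Register the DataFrame as a temporary view and summarize it with standard SQL
raw.createOrReplaceTempView("sales")
summary = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount, COUNT(*) AS n_transactions
    FROM sales
    GROUP BY customer_id
""")

# Write the aggregated result to the curated zone, still as compressed Parquet
summary.write.mode("overwrite").parquet("s3://my-datalake/curated/sales_by_customer/")
```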
Taking into account that querying the data in HDFS would require writing a program, another mechanism was needed so that database administrators and functional users could keep making SQL queries; this mechanism is Hive. Hive is a translator and interpreter of SQL statements into MapReduce algorithms that are scheduled with YARN or Spark over the data in HDFS, so that direct queries can be made on the data in Hadoop. Hive has a metastore, which can reside in an in-memory relational database or in an object-relational database such as Microsoft SQL Server or PostgreSQL. This metastore stores the link between the metadata with which the tables are created and the data in Hadoop, which can populate these tables through partitions, although the data actually resides in Hadoop. The AWS equivalent of Hive is Athena; in Google Cloud it is BigQuery, and in Microsoft Azure it is Azure Synapse.
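A hedged sketch of how such a Hive table might be declared through Spark's SQL interface (table, column and path names are hypothetical); the metastore keeps only the schema and partition locations, while the data stays in the distributed file system:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark use a Hive metastore for table metadata
spark = (SparkSession.builder
         .appName("hive-external-table")
         .enableHiveSupport()
         .getOrCreate())

# Declare an external, partitioned table whose data already lives in the lake
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales (
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (ingestion_date STRING)
    STORED AS PARQUET
    LOCATION 's3://my-datalake/curated/sales/'   -- hypothetical location
""")

# Ask the metastore to discover the partitions that already exist on storage
spark.sql("MSCK REPAIR TABLE sales")

# From here on, administrators and functional users can query it with plain SQL
spark.sql("SELECT ingestion_date, SUM(amount) FROM sales GROUP BY ingestion_date").show()
```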
In the event that it is necessary to automate a series of queries (in the style of a stored procedure), Pig can also be used, a tool with a procedural language for automating queries in Hadoop.
It is important to mention that the data in Hadoop is not indexed, so the Lucene API is also often used, through ElasticSearch, to place indexed data in information cubes that can later be queried through web services, by reporting applications or by other applications within the organization.
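As a rough illustration, and assuming an Elasticsearch cluster reachable at a placeholder URL and version 8.x of the official Python client, a curated record can be indexed and queried back; the index and field names are invented for the example:

```python
from elasticsearch import Elasticsearch

# Connect to the cluster (URL and credentials are hypothetical)
es = Elasticsearch("http://localhost:9200")

# Index a curated record so it becomes searchable through the Lucene engine
es.index(
    index="sales-by-customer",
    id="customer-123",
    document={"customer_id": "customer-123", "total_amount": 842.5},
)

# Other applications or web services can now query the indexed data
hits = es.search(
    index="sales-by-customer",
    query={"range": {"total_amount": {"gte": 500}}},
)
print(hits["hits"]["total"])
```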
Finally, Big Data Analytics requires Data Warehouse tools (Snowflake, Redshift) and tools for the presentation and analysis of multidimensional reports, such as Tableau, QlikView, Microsoft Power BI, Looker and Hue (Hadoop User Experience).
Data Science
Data Science is the identification of patterns in large volumes of data through statistical models, either to establish the probability of an event occurring (regression model) or to establish a characterization (classification model). To perform Data Science, the R language is commonly used through the RStudio IDE; R is an evolution of the S language, used in the 1990s in most computer-assisted statistical studies.
RStudio allows performing structural analysis and transformations on a data set; quantitative analysis using descriptive statistics; identifying the variables to predict and the type of model (regression or classification); separating the data set into train (data for model training) and test (data for testing the model); performing analysis with decision trees and regression to indicate the importance of attributes; performing attribute engineering (for instance, extracting information from text, categorizing numerical attributes, adding attributes, or reducing the dimensionality of attributes); and applying supervised models (boosting, random forest, GLM, logistic regression, depending on the type of problem).
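The same workflow can be sketched in Python with scikit-learn (the CSV path, the target column and the assumption of numeric predictors are all hypothetical; the article describes the equivalent steps in R):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a (hypothetical) prepared data set with numeric predictors and a binary target
df = pd.read_csv("customers.csv")
X = df.drop(columns=["churned"])
y = df["churned"]

# Separate into train and test sets, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a supervised classification model (random forest)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Attribute importance, analogous to the decision-tree/regression analysis mentioned above
print(dict(zip(X.columns, model.feature_importances_)))

# Evaluate on the held-out test data
print(accuracy_score(y_test, model.predict(X_test)))
```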
When a model must be applied to a large amount of data, it is necessary to store the data in the distributed file system and write a program to execute it with Spark, which provides this functionality through the MLlib library.
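A minimal sketch of the distributed equivalent, using Spark's DataFrame-based ML API (the path, the feature columns and the 0/1 label column are hypothetical assumptions):

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Load the training data from the distributed file system (hypothetical path)
df = spark.read.parquet("s3://my-datalake/curated/customers/")

# Assemble the predictor columns into a single feature vector
assembler = VectorAssembler(inputCols=["age", "balance", "n_transactions"], outputCol="features")

# Supervised classification model, trained in a distributed fashion ("churned" assumed 0/1)
lr = LogisticRegression(featuresCol="features", labelCol="churned")

# Split, fit and evaluate, mirroring the local workflow above
train, test = df.randomSplit([0.7, 0.3], seed=42)
model = Pipeline(stages=[assembler, lr]).fit(train)
model.transform(test).select("churned", "prediction").show(5)
```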
To code in R from a web server it is necessary to implement an RStudio Server.
JupyterHub is a server with a multi-user proxy that interconnects several instances of Jupyter Notebook, a client-server application that allows you to create and share web documents in JSON format and execute code in the different programming languages installed on the server. Jupyter Notebook has proven to be a very valuable tool for programmers, who only need to connect via the web to JupyterHub to access their notebooks (workbooks) and execute the corresponding code.
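As an illustration only, a minimal `jupyterhub_config.py` under assumed settings; the bind address, the dummy authenticator and the user list are placeholders for a demo, not a recommended production setup:

```python
# jupyterhub_config.py -- minimal multi-user sketch
c = get_config()  # noqa: F821  (injected by JupyterHub at startup)

# Where the multi-user proxy listens
c.JupyterHub.bind_url = "http://0.0.0.0:8000"

# Placeholder authentication for a demo; real deployments use LDAP, OAuth, etc.
c.JupyterHub.authenticator_class = "dummy"
c.Authenticator.allowed_users = {"data_scientist_1", "data_scientist_2"}

# Each authenticated user gets their own single-user server, opening JupyterLab by default
c.Spawner.default_url = "/lab"
```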
Governance
In Big Data, Governance is achieved through the modeling and implementation of eight processes: (1) Data operations management; (2) Master and reference data management; (3) Document and content management; (4) Data security management; (5) Data development; (6) Data quality management; (7) Metadata management; and (8) Data architecture management, together with Data Warehousing & BI management.
Data Warehouse and Data Lake
Data Warehouses store large volumes of data in defined, static structures that determine the type of analysis that can be done on the data. Therefore, their analytical capacity (reports, dashboards, etc.) is limited by the very structures of the sources.
A Data Lake is a repository of all the data that an organization has or can obtain. The data is “ingested raw” and can be saved “raw”, without any predefined schema or structure, or in a structured way. This provides an unlimited window for the use of the data, in such a way that any query, navigation, analysis, etc. can be carried out on it.
Proposed data architecture
The proposed Data Architecture has the Data Lake as an integrating component through the AWS S3 and Lake Formation tools.
In this way, all the data interacts with the Data Lake: relational databases (AWS RDS); NoSQL databases (AWS DynamoDB, DocumentDB); machine learning algorithms (AWS SageMaker); Data Warehouse fact and dimension tables (AWS Redshift); transaction log files processed by microservices (AWS Kinesis Data Streams, Kinesis Firehose and Kinesis Data Analytics); and the data from all aggregations, calculations and rules (AWS Glue and EMR). (See Figure 1.)
The data lake is a single site where all structured (rows and columns), semi-structured (XML, JSON) and unstructured (images) data are stored.
The data lake stores the data in a key-value format (Parquet) and compressed (through the gzip and Snappy algorithms), in such a way as to take advantage of distributed computing frameworks for calculating aggregations and presenting data in the analytical visualization tools.
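For instance, a hedged sketch of landing a small batch in the raw zone as Snappy-compressed Parquet with pandas/pyarrow (the bucket and prefix are hypothetical, and writing to `s3://` paths assumes the `s3fs` package is installed):

```python
import pandas as pd

# A small batch of records as they arrive from the source system
batch = pd.DataFrame(
    [
        {"customer_id": "123", "amount": 19.9, "ingestion_date": "2024-01-15"},
        {"customer_id": "456", "amount": 5.0, "ingestion_date": "2024-01-15"},
    ]
)

# Write to the raw zone as Parquet compressed with Snappy (gzip is also supported)
batch.to_parquet(
    "s3://my-datalake/raw/sales/2024-01-15.parquet",  # hypothetical path
    engine="pyarrow",
    compression="snappy",
    index=False,
)
```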
The data lake provides the possibility of having three data zones:
1. Raw data (raw layer): the data as it comes from the sources.
2. Curated data (curated layer or trusted data): contains the data with aggregations, transformations, anonymization of personally identifiable information (PII) and hashing of payment card data (PCI); see the sketch after this list.
3. Accessible data (analytical layer, refined data or presentation data): allows visualization tools and business units to access business objects (people, accounts, transactions, etc.) without having to worry about understanding tables and databases, thus uniting various data models based on rules to present reliable, democratized data to the business.
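A minimal sketch of the raw-to-curated step, assuming hypothetical paths and column names: the PII column is hashed so that the curated zone keeps a stable identifier without exposing the original value.

```python
import hashlib

import pandas as pd

# Read a raw-zone file (hypothetical path; at scale this step runs in Spark or Glue)
raw = pd.read_parquet("s3://my-datalake/raw/sales/2024-01-15.parquet")

# Hash the PII column so the curated zone never stores the clear value
raw["customer_id"] = raw["customer_id"].map(
    lambda v: hashlib.sha256(v.encode("utf-8")).hexdigest()
)

# Add an aggregation/transformation and land the result in the curated zone
curated = raw.groupby(["customer_id", "ingestion_date"], as_index=False)["amount"].sum()
curated.to_parquet(
    "s3://my-datalake/curated/sales/2024-01-15.parquet",
    compression="snappy",
    index=False,
)
```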
Through the Data Lake, three data transfer models can be implemented:
A. Inside-out data transfer: a subset of the data in the data lake is sometimes moved to a data warehouse, or to a data store such as RDS that feeds a visualization tool like Superset, to support specialized analytics such as analytics search, knowledge graph creation, or both.
B. Outside-in data transfer: most companies use AWS RDS as the data store for their microservices and then move that data into the data lake. For example, to maintain a customer account statement, customer data, session history and transaction tables are stored in AWS RDS, DynamoDB and DocumentDB; this data is then exported to the data lake through AWS Glue and AWS EMR for further analysis, which improves the user experience of customer accounts (see the Glue sketch after these patterns).
C. Around the perimeter: in addition to the two patterns above, there are scenarios where data moves from one specialized data store to another. For example, the data warehouse fact and dimension tables in AWS Redshift can send data to the data lake or read data from the data lake.
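As an illustration of the outside-in pattern, a hedged AWS Glue job sketch that reads a table already registered in the Glue Data Catalog (the database, table and bucket names are hypothetical) and lands it in the lake as Parquet:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Glue wraps a Spark context; this script would run as a Glue job
glue_context = GlueContext(SparkContext.getOrCreate())

# Read the microservice table through the Glue Data Catalog (hypothetical names)
customers = glue_context.create_dynamic_frame.from_catalog(
    database="microservices_db",
    table_name="customers",
)

# Export it to the data lake as Parquet for further analysis
glue_context.write_dynamic_frame.from_options(
    frame=customers,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake/raw/customers/"},
    format="parquet",
)
```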
The main characteristics of the data lake are:
The proposed data architecture is based on the reference model powered by AWS (see Figure 2). In this reference model, the following layers are presented from bottom to top:
4. Catalog management layer (Lake House Catalog): the catalog of all the data in the data lake is managed through AWS Glue, with its schemas and the different versions of each data structure, while data access permissions are managed with AWS Lake Formation by schema, table and even column, according to the roles defined in AWS IAM (a sketch of this appears after this list).
5. Processing layer:?this layer is divided into three sublayers:
6. Consumption layer: in this layer Amazon Athena is used to perform SQL queries against the Glue catalog tables connected to the data lake and other data sources; Athena allows workgroups to be defined with quotas for queries, which are executed using distributed computing over S3. Amazon Redshift can also be used to query the Data Warehouse tables, Amazon QuickSight to build analytical reports, and Amazon SageMaker to build machine learning models.
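A hedged boto3 sketch of the catalog and consumption layers referenced above: granting column-level access with Lake Formation and launching an Athena query in a workgroup. The account ARN, database, table, workgroup and bucket names are all hypothetical placeholders.

```python
import boto3

lakeformation = boto3.client("lakeformation")
athena = boto3.client("athena")

# Catalog layer: grant SELECT on specific columns of a catalog table to an IAM role
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analysts"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "curated_db",
            "Name": "sales",
            "ColumnNames": ["customer_id", "total_amount"],
        }
    },
    Permissions=["SELECT"],
)

# Consumption layer: run a SQL query with Athena inside a governed workgroup
response = athena.start_query_execution(
    QueryString="SELECT customer_id, total_amount FROM sales LIMIT 10",
    QueryExecutionContext={"Database": "curated_db"},
    WorkGroup="analytics-team",
    ResultConfiguration={"OutputLocation": "s3://my-datalake/athena-results/"},
)
print(response["QueryExecutionId"])
```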