Proposal for a Management Architecture for Large Volumes of Data

Article written by Pedro Bonillo, ML and Big Data consultant at INNOVANT

Introduction

Before describing the components that make up the Data Architecture, it is necessary to establish a conceptual framework for them. The four high-level components of this architecture are: (1) Databases; (2) Big Data Analytics or BI; (3) Data Science; and (4) Governance.

Databases

The first aspect addresses databases. The relational model presented by Edgar Frank Codd stores data as a matrix of rows and columns, where a new column is created for each new attribute even though many rows have no value for it. According to research cited by the Gartner Group, this model has allowed organizations to manage only about 30% of their data:

https://s3-ap-southeast-1.amazonaws.com/mktg-apac/Big+Data+Refresh+Q4+Campaign/Aberdeen+Research+-+Angling+for+Insights+in+Today's+Data+Lake.pdf.

This is why a new concept for storing and managing data has emerged: the key-value model. In this way, different rows can have different attributes. There are mainly three types of key-value databases in Big Data: (1) transactional with high availability; (2) document management; and (3) georeferenced.

Big Data Analytics

In the context of Big Data, the term Business Intelligence has evolved into Big Data Analytics, where you do not necessarily work with the database concept, but rather with a distributed file system. This file system continues to be populated through data extraction, transformation, cleaning and loading tools, but these tools now write in the new key-value format, with the data compressed, into a zone of the distributed file system called raw data.

The forerunner of these distributed file systems was the Google File System (GFS), which inspired what is now known as the Hadoop Distributed File System (HDFS), part of Apache Hadoop.

This new type of file system uses the concept of distributed computing in order to process large amounts of data (more than a million records, with high variability, and with queries in less than 5 seconds) across several servers. Spark is an optimized engine for this kind of distributed computing, able to run on cluster managers such as Apache Mesos. Thus, currently, most organizations use Spark to schedule programs and divide tasks of summarization, grouping and selection over the data stored in the distributed file system. In the case of AWS, the distributed file system is S3; in Google Cloud Platform it is Cloud Storage (GCS); and in Microsoft Azure it is Azure Storage.

Spark then allows you to run MapReduce-style tasks on the distributed file system. However, to use Spark it is necessary to write a program in Java, Python, Scala or R. In AWS the managed Spark service is EMR (Elastic MapReduce), in Google it is Dataproc, and in Microsoft Azure it is HDInsight with Databricks. Such a program commonly loads the data in HDFS (the distributed file system) into a distributed data structure spread across data nodes, such as an RDD or a DataFrame (a distributed array of data rows), then performs queries using standard SQL, and finally writes the result back to HDFS (in an area called curated data), to a relational database, or to a comma-separated file, for example to be consumed by a report visualization tool.
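
As an illustration, here is a minimal PySpark sketch of the kind of program described above: it loads raw data from the distributed file system into a DataFrame, runs a standard SQL aggregation, and writes the result to the curated area. The bucket paths, table name and column names are illustrative assumptions, not part of the original architecture.

```python
# Minimal PySpark sketch: load raw data from the distributed file system,
# run a standard SQL query, and write the result to the curated zone.
# Paths and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-curated").getOrCreate()

# Load raw data (e.g. Parquet files landed by the ingestion tools) into a DataFrame
raw_df = spark.read.parquet("s3://my-data-lake/raw/transactions/")

# Expose the DataFrame as a temporary view so it can be queried with standard SQL
raw_df.createOrReplaceTempView("transactions")

# Summarize / group / select with plain SQL
curated_df = spark.sql("""
    SELECT customer_id,
           COUNT(*)    AS tx_count,
           SUM(amount) AS total_amount
    FROM transactions
    GROUP BY customer_id
""")

# Write the result back to the curated zone (or export it for a reporting tool)
curated_df.write.mode("overwrite").parquet("s3://my-data-lake/curated/customer_summary/")
```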

Considering that querying the data in HDFS would require writing a program, it was necessary to establish another mechanism by which database administrators and functional users could keep making SQL queries; this mechanism is Hive. Hive is a translator and interpreter of SQL statements into MapReduce algorithms that are scheduled with YARN or Spark over the data in HDFS, in such a way that direct queries can be made on the data in Hadoop. Hive has a metastore, which can reside in an in-memory relational database or in an object-relational database such as Microsoft SQL Server or PostgreSQL. This metastore stores the link between the metadata with which the tables are created and the data in Hadoop, which populates these tables through partitions, but the data itself actually resides in Hadoop. The equivalent of Hive is Athena at AWS, BigQuery at Google, and Azure Synapse at Microsoft.
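
To make the pattern concrete, here is a minimal sketch, assuming a Spark session with Hive support (as on EMR), of registering an external Hive table whose data stays in the data lake while only the definition lives in the metastore. The database, columns and S3 location are illustrative assumptions.

```python
# Minimal sketch of the Hive pattern: the metastore holds only the table
# definition, while the data itself remains in the distributed file system.
# Database, columns and S3 location are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-external-table")
         .enableHiveSupport()   # connect to the Hive metastore
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS sales")

# Register an external table over data that already lives in the data lake
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales.transactions (
        customer_id STRING,
        amount      DOUBLE,
        tx_date     DATE
    )
    STORED AS PARQUET
    LOCATION 's3://my-data-lake/raw/transactions/'
""")

# From here on, analysts can query the data in Hadoop/S3 with plain SQL
spark.sql("""
    SELECT tx_date, SUM(amount) AS daily_total
    FROM sales.transactions
    GROUP BY tx_date
""").show()
```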

In the event that it is necessary to automate a series of queries (in the style of a stored procedure), Pig can also be used, a tool that provides a procedural language for automating queries in Hadoop.

It is important to mention that the data in Hadoop is not indexed, so the Lucene API is also often used, through ElasticSearch, to place the indexed data in information cubes that can later be consulted through web services, by reporting applications or by other applications within the organization.
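
As a hedged illustration of that indexing step, the sketch below uses the official Elasticsearch Python client (8.x API) to index a curated record and query it back; the host, index name and document fields are assumptions for the example.

```python
# Minimal sketch: index a curated record into Elasticsearch (Lucene builds the
# inverted index underneath) and query it back. Host, index name and fields
# are illustrative assumptions; assumes the 8.x Python client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a document produced by the analytics pipeline
es.index(
    index="customer-summary",
    id="42",
    document={"customer_id": "42", "tx_count": 17, "total_amount": 1530.5},
)

# Search the indexed data, e.g. from a web service used by reporting apps
result = es.search(index="customer-summary",
                   query={"match": {"customer_id": "42"}})
print(result["hits"]["hits"])
```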

Finally, Big Data Analytics requires Data Warehouse tools (Snowflake, Redshift) and tools for the presentation and analysis of multidimensional reports, such as Tableau, QlikView, Microsoft Power BI, Looker and Hue (Hadoop User Experience).

Data Science

Data Science is the identification of patterns in large volumes of data through statistical models, either to establish the probability of an event occurring (regression model) or to establish a characterization (classification model). To perform Data Science, the RStudio tool is commonly used with the R language, an evolution of the S language used in the 1990s in most computer-assisted statistical studies.

RStudio allows performing structural analysis and transformations on a data set; quantitative analysis using descriptive statistics; identifying the variables to predict and the type of model (regression or classification); separating the data set into train (data for model training) and test (data for testing the model); performing analysis with decision trees and regression to indicate the importance of attributes; performing attribute engineering (for instance, extracting information from a text, categorizing numerical attributes, adding attributes, or reducing the dimensionality of attributes); and applying supervised models (boosting, random forest, GLM, logistic regression, depending on the type of problem).
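
Although the article describes this workflow in RStudio, the same steps can be sketched in Python with scikit-learn, which may help map the concepts; the input file, the "churn" target column and the numeric-only predictors are illustrative assumptions.

```python
# Python (scikit-learn) sketch of the workflow described above: train/test
# split, a supervised tree-ensemble model, and attribute-importance analysis.
# The CSV file, the "churn" target and numeric-only predictors are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")        # assumed input data set
X = df.drop(columns=["churn"])           # predictor attributes (assumed numeric)
y = df["churn"]                          # variable to predict (classification)

# Separate the data set into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Supervised model (random forest, one of the options mentioned above)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Importance of attributes, as reported by the tree ensemble
print(pd.Series(model.feature_importances_, index=X.columns)
        .sort_values(ascending=False))

# Evaluate on the held-out test data
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```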

When a model must be applied to a large amount of data, it is necessary to store the data in the distributed file system and write a program to execute it with Spark, which offers equivalent machine learning functionality through the MLlib library.
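
A minimal sketch of that idea with Spark's MLlib (the DataFrame-based pyspark.ml API) is shown below; the S3 paths, feature columns and label column are illustrative assumptions.

```python
# Sketch of training and applying a model at scale with Spark MLlib, reading
# from and writing to the distributed file system. Paths and columns assumed.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-training").getOrCreate()

data = spark.read.parquet("s3://my-data-lake/curated/training_set/")

# MLlib expects the predictors assembled into a single vector column
assembler = VectorAssembler(inputCols=["tx_count", "total_amount"],
                            outputCol="features")
train_df = assembler.transform(data)

# Distributed training of a logistic regression model
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_df)

# Apply the model and persist predictions back to the data lake
model.transform(train_df).select("label", "prediction") \
     .write.mode("overwrite") \
     .parquet("s3://my-data-lake/curated/predictions/")
```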

To code in R from a web server it is necessary to implement RStudio Server.

JupyterHub is a server with a multi-user proxy that interconnects several instances of Jupyter Notebook, a client-server application that allows you to create and share web documents in JSON format and execute code in the different programming languages installed on the server. Jupyter Notebook has proven to be a very valuable tool for programmers, who only need to connect via the web to JupyterHub to access their notebooks (workbooks) and execute the corresponding code.

Governance

In Big Data, Governance is achieved through the modeling and implementation of eight processes: (1) Data operations management; (2) Master data and reference management; (3) Documentation and content management; (4) Data security management; (5) Data development; (6) Data quality management; (7) Metadata management; and (8) Data architecture management (Data Warehousing & BI management).

Data Warehouse and Data Lake

Data Warehouses store large volumes of data in defined and static structures that determine the type of analysis that can be done on the data. Therefore, its analytical capacity (reports, dashboards, etc.) is limited by the very structures of the sources.

A Data Lake is a repository of all the data that an organization has or can obtain. The data is “ingested raw” and can be saved “raw”, without any predefined schema or structure, or in a structured way. This provides an unlimited window for the use of the data, in such a way that any query, navigation, analysis, etc. can be carried out on it.

Proposed data architecture

The proposed Data Architecture has the Data Lake as an integrating component, through the AWS S3 and Lake Formation tools.

In such a way that all the data interacts with the Data Lake: the relational databases (AWS RDS); the NoSQL databases (AWS DynamoDB, DocumentDB); the machine learning algorithms (AWS SageMaker); the Data Warehouse fact and dimension tables (AWS Redshift); the transaction log files processed by microservices (AWS Kinesis Data Streams, Kinesis Firehose and Kinesis Data Analytics); and the data from all aggregations, calculations and rules (AWS Glue and EMR). (See Figure 1.)

Figure 1: Proposed data architecture

The data lake is a single site where all structured (rows and columns), semi-structured (xml, json) and unstructured (images) data are stored.

The data lake stores the data in a key-value format (Parquet) and compressed (through the gzip and Snappy algorithms), in such a way as to take advantage of distributed computing frameworks for the calculation of aggregations and the presentation of data in the analytical visualization tools.
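
The sketch below illustrates that storage convention with PySpark: semi-structured input is rewritten as Snappy-compressed Parquet, partitioned so distributed engines can prune what they read. The paths and partition columns are illustrative assumptions.

```python
# Sketch of writing data-lake files as compressed, partitioned Parquet.
# Paths are assumptions; the input is assumed to carry year/month/day columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-writer").getOrCreate()

events = spark.read.json("s3://my-data-lake/raw/events/")   # semi-structured input

(events.write
       .mode("append")
       .option("compression", "snappy")       # gzip is also supported
       .partitionBy("year", "month", "day")   # typical ingestion partitions
       .parquet("s3://my-data-lake/curated/events/"))
```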

The data lake provides the possibility of having three data zones:

1. Raw data (raw layer): the data as it comes from the origins.

2. Curated data (curated layer or trusted data): contains the data with aggregations, transformations, data anonymization (PII) and data hashing (PCI).

3. Accessible data (analytical layer, refined data or presentation data): allows visualization tools and business units to access business objects (people, accounts, transactions, etc.) without having to worry about understanding tables and databases, thus uniting various data models based on rules to present reliable and democratized data to the business.

Through the Data Lake, three data transfer models can be implemented:

A. Inside-out data transfer: A subset of the data in the data lake is sometimes moved to a data store such as a data warehouse or Amazon RDS (for example, to hold the data that feeds a visualization tool such as Superset), to support specialized analytics such as search analytics, knowledge graph creation, or both.

B. Outside-in data transfer: Most companies use AWS RDS as the data stores for their microservices, and then move that data into the data lake. For example, to maintain a customer account statement, customer data, session history, and transaction tables are stored in AWS RDS, DynamoDB, and DocumentDB. This data is then exported to the data lake through AWS Glue and AWS EMR for further analysis, which improves the user experience of customer accounts.

C. Around the perimeter: In addition to the two patterns above, there are scenarios where data moves from one specialized data store to another. For example, data warehouse fact tables and dimensions in AWS Redshift can send data to the data lake or read data from the data lake.

The main characteristics of the data lake are:

  1. Scalability: A data lake must be able to easily scale to petabytes and exabytes as data grows, using a persistent and scalable data store that offers the best performance at the lowest cost, with multiple ways to transfer the data.
  2. Data diversity: Applications generate data in many formats. A data lake must support various types of data: structured, semi-structured, and unstructured.
  3. Schema management: The choice of storage structure, schema, ingestion frequency, and data quality should be left to the data producer. A data lake must be able to accommodate changes to the structure of incoming data, something known as schema evolution. In addition, schema management helps ensure data quality by preventing writes that do not match the schema.
  4. Metadata management: Data must be self-discoverable, with the ability to trace lineage as it flows through the different tiers (raw, curated, analytical) within the data lake. The AWS Glue data catalog is used, which captures the metadata and provides a query interface for all data assets.
  5. Unified governance: The data lake design should have a strong mechanism for centralized authorization and auditing of data. Configuring access policies across the data lake and across all data stores can be extremely complex and error-prone. Having a centralized way to define policies and enforce them is critical to a data architecture.
  6. Transactional semantics: In a data lake, data is often ingested almost continuously from multiple sources and queried simultaneously by multiple analytic engines. Having atomic, consistent, isolated, durable (ACID) transactions is critical to maintaining data consistency.

The proposed data architecture is based on the reference model powered by AWS (see Figure 2). In this reference model, the following layers are presented from bottom to top:

  1. Data sources: including software as a service (SaaS) cloud applications, Online Transaction Processing (OLTP) databases, Enterprise Resource Planning (ERP) databases, Customer Relationship Management (CRM) databases, the databases of each line of business, shared files, mobile devices, web pages, sensors and social networks.
  2. Ingestion layer: In this layer AWS suggests the use of DMS (Database Migration Service), which performs an initial load of the data, marks a point of change (change data capture, CDC) and keeps the extractions updated from that point. AWS DMS can send the data to AWS Kinesis for writing to the S3 distributed file system through AWS Kinesis Firehose, or stream it for further processing to AWS Kinesis Data Streams. Flow control of ingestion processes is orchestrated through Amazon AppFlow and kept up to date using AWS DataSync (a minimal ingestion sketch is shown after the zone list below).
  3. Storage layer (Lake House Storage): This is the layer where the data lake is made up of different S3 buckets (the AWS distributed file system). These buckets can be in different accounts and different regions; the important thing is that AWS proposes to classify the data written to the buckets into:

  • Landing Zone: the zone where the clean, anonymized, hashed data is found, structured in a data model so that it can be interpreted by a business person rather than a technical person; this data can come from different user AWS accounts.
  • Raw Zone: in this zone the data is placed just as it comes from the different sources, without any treatment. The data in this zone normally resides in a secure environment.
  • Trusted Zone: the zone where transformations are applied to the data, for the aggregations required by the different analytical engines and for its subsequent security treatment.
  • Curated Zone: in this zone all the mechanisms to comply with PII (personally identifiable information) and with the Payment Card Industry Data Security Standard (PCI DSS) are applied.
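
As promised in the ingestion layer above, here is a minimal sketch of pushing a record to a Kinesis Data Firehose delivery stream that lands it in S3, using boto3; the region, stream name and payload are illustrative assumptions.

```python
# Sketch of the ingestion path: send a record to a Kinesis Data Firehose
# delivery stream configured to write to S3. Names and payload are assumptions.
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

record = {"customer_id": "42", "amount": 19.90, "event": "purchase"}

firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",   # assumed delivery stream name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```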

Figure 2: AWS Reference Model for Data Architecture

4. Catalog management layer (Lake House Catalog): The catalog of all the data in the data lake is managed through AWS Glue, with its schemas and the different versions of each data structure. With AWS Lake Formation, data access permissions are managed by schema, table and even column, according to the roles defined in AWS IAM.
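
A minimal sketch, assuming an existing catalog database and table, of how that metadata can be read programmatically through the Glue API (boto3) is shown below; the region, database and table names are illustrative.

```python
# Sketch: read a table definition from the AWS Glue data catalog, the same
# metadata that Athena, EMR and Lake Formation permissions resolve against.
# Region, database and table names are illustrative assumptions.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

table = glue.get_table(DatabaseName="sales", Name="transactions")

# Columns and partition keys as registered in the catalog
for col in table["Table"]["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])
print("partitions:", [k["Name"] for k in table["Table"].get("PartitionKeys", [])])
```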

5. Processing layer: This layer is divided into three sublayers:

  • Data warehouse (SQL-based ETL): the sublayer where the fact tables (facts) and dimensions (dims) are stored, updated through queries made on the different databases and even on the data lake. This layer is implemented using Amazon Redshift and is the source for the analytics engines (Superset, Amazon QuickSight, Tableau, Power BI, etc.).
  • Big Data Processing: in this sublayer all the aggregation operations are carried out through the map and reduce algorithms of the Apache Spark distributed computing framework. The use of servers that can be turned on and off for each calculation is called AWS Elastic MapReduce (Amazon EMR), and the particular way to use Apache Spark with specific benefits for the AWS platform is called AWS Glue. In Glue, workflows are declared that invoke jobs containing the programs in Python or Python for Apache Spark (PySpark) that perform the distributed computing, as well as crawlers that read the data in one or several directories, automatically detect the structure and type of the data, and generate tables in the Glue catalog, even recognizing partitions by year/month/day/hour (the format that AWS Kinesis Firehose uses by default to write data to S3). A minimal Glue job skeleton is sketched after this list.
  • Near-real-time ETL: through the implementation of Apache Spark Streaming and Kinesis Data Streams, the data can be read from the source in near real time and either processed in memory or written to S3. If the processing is done in memory, AWS Kinesis Data Analytics can be used to perform summarization and aggregation and to send the data, for example, to the feature store that feeds the machine learning models.
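
The Glue job skeleton mentioned above is sketched here: it reads a table that a crawler registered in the catalog, performs a distributed aggregation with PySpark, and writes the result to the trusted zone. The database, table and S3 path are illustrative assumptions, and the awsglue library is available only inside the Glue job runtime.

```python
# Sketch of an AWS Glue PySpark job: read from the Glue catalog, aggregate with
# Spark, write back to the data lake. Names and paths are assumptions; the
# awsglue library exists only inside the Glue job runtime.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that a crawler registered in the Glue catalog
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales", table_name="transactions")

# Distributed aggregation with plain Spark
summary = dyf.toDF().groupBy("customer_id").sum("amount")

# Persist the result to the trusted zone of the data lake
summary.write.mode("overwrite").parquet("s3://my-data-lake/trusted/customer_totals/")

job.commit()
```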

6. Consumption layer: In this layer, Amazon Athena is used as a tool to perform SQL queries against the Glue catalog tables connected to the data lake and other data sources. Athena allows establishing workgroups that carry quotas for queries, which are executed using distributed computing over S3. Amazon Redshift can also be used to query the data warehouse tables, Amazon QuickSight to build analytical reports, and Amazon SageMaker to build machine learning models.
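
To illustrate the consumption layer, the sketch below submits a SQL query to Athena with boto3 against a Glue catalog table, inside a workgroup and with results written to S3; the region, database, workgroup and bucket names are illustrative assumptions.

```python
# Sketch of the consumption layer: run a SQL query with Amazon Athena against
# a Glue catalog table. Region, database, workgroup and bucket are assumptions.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString=("SELECT customer_id, SUM(amount) AS total "
                 "FROM transactions GROUP BY customer_id"),
    QueryExecutionContext={"Database": "sales"},
    WorkGroup="analytics",   # workgroups can carry query quotas
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
print("query execution id:", response["QueryExecutionId"])
```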

Figure 3: Proposed architecture with corresponding products

