Data Technology Trend #2: Strategic
Lakshmi Venkatesh
Head of Enterprise Data, Analytics & Transformation Architecture at GIC | SG Digital Leader 2024 | Accredited Board Director (SID) | MBA, MTech NUS, MCA, MCOM | Blogger & International Keynote Speaker
Simplification of Data
Make simple things simple, and complex things possible - Alan Kay
Trend #2.1: Data Warehouse makes a big comeback on the Data Lake
With complex and ever-growing data sets, thousands of columns had to be curated and normalized to build a Data Warehouse. This is the very reason why Data Warehouses fail.
What does Data Lake do:
Ideally, a Data Lake stores all the data from the upstream systems, of every data type, without throwing anything away. Users can access this initial store of raw data on a need basis, depending on the type of analysis they want to perform. Unlike a Data Warehouse, which has an impeccable structure and where adding any new data source takes time to curate and normalize, a staging-like Data Lake does not take much time to integrate. Once the data is readily available, performing calculations on the fly and bringing out actionable insights from the available data points is not complicated. The cycle to realize value from the data can begin within a few weeks instead of the months or years typical of a traditional Data Warehouse. Beyond storing the data in the Data Lake, Data Marts can be built directly out of it using federated queries, RDS, or Redshift / Snowflake, depending on the budget and requirements of the firm.
How does it work:
Structured, semi-structured, and unstructured data can be stored as is in the Data Lake. After processing, this data can be used directly to build reports, BI, analytics, etc. The same data can also be stored in a database for further use.
Example: Redshift & Data Lake (Glue and S3) in AWS. Cloud is a given for Data & Analytics
Best practices in Lake Formation to avoid creating a Swamp:
For the Data Lake to be successful, it is important to construct a “healthy Lake architecture and governance” in the first place. Being Agile and having open data lake access does not mean that all the data can simply be dumped into one place. There has to be a clear architecture and structure, with defined data pipelines and flows, to enable a healthy Lake Formation. Many organizations switch to a new model every now and then because their current Data Lake does not work or is already turning into a Data Swamp, mainly due to a lack of governance and because every business unit builds a small Data Lake of its own. Building a healthy Data Lake is a central, organization-level design exercise, even if it must include a multi-cloud and multi-data-lake design. This keeps the data lake open, so that on-demand queries or analytics can be run at any time on any subset of the data or on the whole. If the Data Lake is the epicenter of an organization’s architecture, building a healthy one is imperative to building a scalable, sustainable Data Platform / Product out of it.
Data from different lines of business within the organization is segregated into a place in its original form or natural state. Data flows in as files/streams from different business units and is captured as is. The required set of users can be given permission to access and analyze the data based on need. All the data gets loaded from the source system and no data is thrown away. This data stored in the Data Lake is termed “Bronze”: raw/original data.
The data then gets filtered, cleaned, and augmented based on need and is moved to different functional buckets. Data is transformed, and the required schema can be applied at this phase to make the data in the files queryable or ready to load into modern cloud databases. This data stored in the Data Lake is termed “Silver”: filtered, cleaned, and augmented data.
Further business summaries and explanations are then generated from the processed data to enable smoother decision-making for the business. This data stored in the Data Lake is termed “Gold”: business summary.
If the data can be categorized into Bronze, Silver, and Gold, building Delta Lake in the future on top of this becomes easier.
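The Bronze, Silver, and Gold stages described above can be sketched in a few lines of Python. This is only an illustration, not a Lake Formation or Delta Lake implementation: the record fields (`desk`, `amount`), the function names, and the in-memory lists standing in for lake storage are all assumptions made for the example.

```python
from collections import defaultdict

# Bronze: raw records captured as is from upstream (hypothetical sample data).
bronze = [
    {"desk": "fx", "amount": "100.5"},
    {"desk": "FX ", "amount": "49.5"},
    {"desk": "rates", "amount": "not-a-number"},  # bad record, still kept in Bronze
    {"desk": "rates", "amount": "200"},
]


def to_silver(raw_records):
    """Filter and clean, applying a schema: desk as lowercase str, amount as float."""
    silver = []
    for rec in raw_records:
        try:
            amount = float(rec["amount"])
        except (KeyError, ValueError):
            continue  # dropped from Silver only; the original stays in Bronze
        silver.append({"desk": rec["desk"].strip().lower(), "amount": amount})
    return silver


def to_gold(silver_records):
    """Business summary: total amount per desk."""
    totals = defaultdict(float)
    for rec in silver_records:
        totals[rec["desk"]] += rec["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Note that nothing is ever deleted from `bronze`: the unparseable record is simply excluded from Silver, which mirrors the "no data is thrown away" rule for the raw layer.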
When multiple Data Lakes across different cloud providers are involved due to regulatory restrictions, try to keep a similar structure and naming conventions across the Lake formations so that there is synergy and future expansion is simpler.
What problem it tries to solve:
The main purpose of building a Data Warehouse is to be able to generate analytics from a single place. However, many Data Warehousing applications fail for two reasons:
(1) Upstream to downstream data curation: Data Warehouses store curated data from upstream systems, as Data Warehouses are immutable. Will all the upstream systems be doing this curation? Not really. There will be data duplication across the organization and process overlaps built up over several years. Identifying the redundant processes and building a unified system in which upstream systems send all relevant information is no cakewalk (who said walking in cake is an easy task in the first place!). Building a Data Warehouse has many prerequisites that need to be satisfied, and these are often overlooked or not addressed before the Data Warehouse project gets started. So, more often than not, the Data Warehouse goes back to the drawing board. You may have seen 150+ Data Warehouses in a single firm. If the Data Warehouse is the single version of the truth and is “Central” to the firm, why are there so many Central Data Warehouses?
To implement a Data Warehousing solution in a firm, the business faces a constant trade-off between urgency and importance. Data Warehousing is not low-hanging fruit, and its benefits are realized over time. Any new feature-rich project will always be given priority over Data Warehousing.
(2) Factor of time: Building a Data Warehouse takes time. While answering the essential questions of why Data Warehousing and which approach to take (Inmon vs Kimball, or a combination based on organization needs, etc.), a clear message should be delivered that building a Data Warehouse takes substantial time and effort and that the benefits are seen only in the long run. Many firms set the deadline at 1–2 years, mark the Data Warehouse as a failure, and move on to building a new one within a few years. Data Warehousing is a function of resources, time, quality, cost, technology, and, most importantly, data. Reading, understanding, curating, and normalizing thousands and thousands of data points across several systems into a single warehouse is not an easy task!
The problems that exist in Data Warehousing do not vanish just by introducing a Data Lake or Delta Lake. However, unifying data into a single place without the need to curate thousands of data points becomes quicker and easier. Also, unlike Data Warehousing, unification of data sees the light of day sooner, as all the data is in a single place in its original form without any fancy modifications (with the ability to process structured and unstructured data and without the need to stick to proprietary file types). Building a single version of the truth from this massive, unified data set by talking to one business at a time and bringing in only the essential fields becomes easier. Building Data Services for quick business use on top of this massive staging data becomes a lot easier.
Use cases:
- Data Lake for modern batch data warehouses
- Data Lake as the base for Delta Lake
- Multiple Data Lakes across several public clouds due to Regulatory restrictions
Trend #2.2: Data Hub
Data hubs are data stores that act as an integration point in a hub-and-spoke architecture. They physically move and integrate multi-structured data and store it in an underlying database.
What does Data Hub do:
In the constant debate over whether data should be centralized or decentralized, the Data Hub tilts towards the “Data should be Centralized” approach. The Data Hub sits at the epicenter and connects all the IT systems, such as web applications, internal and external backend and frontend applications, ERP / CRM, Data Warehousing applications, Analytics applications, etc. Concept-wise, this is much like a Data Lake, yet more. The Data Hub orchestrates the connections between these systems and enables data flow between them. It eliminates point-to-point integration, which would otherwise build an iron wall and make all future modern data platform migrations nearly impossible.
As per Gartner “A data hub strategy completes governance and sharing architecture and drives integration. Data and analytics leaders should develop such a strategy to determine effective mediation of semantics, and to identify data sharing requirements across applications, IoT infrastructure, and ecosystems.”
This Data Hub is a concept/approach and not a technology by itself. The technology portion of this concept is creating the integration points or connectors.
How does it work:
Data Hub functions on the hub-spoke model.
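A quick way to see why the hub-and-spoke model scales better than point-to-point integration is to count connections. The sketch below assumes the worst case for point-to-point, where every system exchanges data with every other system, and assumes one connector per system for the hub; the function names are made up for this illustration.

```python
def point_to_point_links(n_systems: int) -> int:
    """Connections needed if every system integrates directly with every other."""
    return n_systems * (n_systems - 1) // 2


def hub_and_spoke_links(n_systems: int) -> int:
    """Connections needed if every system only connects to the central hub."""
    return n_systems


for n in (5, 20, 50):
    print(n, point_to_point_links(n), hub_and_spoke_links(n))
# 5 10 5
# 20 190 20
# 50 1225 50
```

Under these assumptions, point-to-point connections grow quadratically with the number of systems, while hub connections grow linearly.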
Sample technology: Actian DataConnect
It is a platform for hybrid data integration. Actian DataConnect provides a strong integration architecture for the organization regardless of its size. It provides powerful design tools to quickly design integrations on-premises or in the cloud.
Reference: Source
Data Hub vs Data Warehouse Vs Data Lake:
Reference: DataHub by Semarchy
What problem it solves:
It eliminates silos and point-to-point integration. The Data Hub is a central integration point for all the IT systems within the organization. It enables the creation of a hybrid integration platform.
Trend #2.3: Database as a Service (DBaaS)
With the growth of modern Cloud Data platforms, using Database as a Service, or fully managed databases, has become simpler and easier. More and more organizations are able to focus on creating value out of data and improving business performance rather than spending loads of time maintaining databases. It is a boon especially for start-ups and smaller organizations.
DBaaS providers:
Database as a Service is mostly a cloud database offering. Very few providers offer fully managed databases on-premises as well (e.g., AWS RDS PostgreSQL). These database services can integrate with other services within or outside the public domain, such as querying (federated), report generation, Business Intelligence, and Analytics using Machine Learning (managed or otherwise).
What problem it solves or Features of Database as a Service:
1. Security: encryption at rest and in transit.
2. Automated maintenance and Database administration
3. Scalability (horizontal especially)
4. Self-healing
5. Load balancing
6. Auto Replication and Backup
7. Pay-as-you-go model
Refer to the comparison between different DBaaS offerings.
Trend #2.4: Multi-model Database for cross needs
As organizations generate loads and loads of data, gaining value from data through a single access point becomes extremely important. To achieve this, having all data sources in a single place makes the data not only easier to access but also easier to ringfence from a security perspective, and it can be highly performant. There are several disadvantages on the same grounds too. My aim is just to put up a list of technologies/solutions that are promising and working towards this area of single data solutioning for cross-company needs.
1. Data Lake
Discussed earlier in this section.
2. HeatWave
HeatWave is Oracle’s brand new integrated, high-performance analytics engine for the MySQL Database Service. It reduces the distinction between OLAP and OLTP and enables running both workloads directly from the MySQL database, eliminating the need to perform complex data moves, re-computation, etc. There is no need to have a separate analytics database. This service has been introduced in Oracle Cloud Infrastructure (OCI).
While Oracle Cloud platform, MongoDB Atlas, Cosmos DB by Microsoft, AWS Data and Big Data platform, Cloudera Data Platform, etc., are without question best of breed, I see Databricks Delta Lake and Snowflake as the most promising next-gen Cloud Data platforms.
3. CosmosDB
To manage high responsiveness, low latency, and high availability and to build performant data sources, it is important that a single database reads and stores all structured, semi-structured, and unstructured data in a single source so that cross joins across the different data types and sources become easier and more efficient. Azure CosmosDB is a fully managed NoSQL database that tries to fill this gap and provides single-digit millisecond latency. App development and analytics applications thus become much faster from anywhere in the world.
4. Delta Lake by Databricks:
Customers are increasingly migrating to modern cloud data platforms such as Databricks and are experiencing up to 50% performance improvement in runtime, 40% lower infrastructure costs, 200% data processing throughput, a much more secure environment, etc. Source.
It is also increasingly easy to manage the Big Data and Analytics change management process. Delta Lake by Databricks: reliable Data Lakes at scale. It is built on the lakehouse architecture and is growing into one unified platform for Data and AI. Spark shifted the focus away from Hadoop and HDFS and made the use of Big Data mainstream. The next apparent step is the integration of Big Data and AI, for which Databricks already provides a unified solution, and it is only a matter of time before it becomes mainstream. Delta Lake’s promise is to combine the performance of a Data Warehouse with the flexibility of a Data Lake. More on this in Trend #8.
Trend #2.5: Data management (preparation & integration) tools
Data Management:
According to the Gartner Hype Cycle (2019), below are the top Data Management technologies.
The technologies below are worth mentioning:
Data Hub Strategy
Already discussed.
Data Catalog
Already discussed.
Data Classification
By default, all the data points for the organization must be classified as sensitive and confidential.
Public
Confidential
Sensitive
Personal
Classification levels:
C1: Contact information or PII, including name, address, telephone number, and e-mail address.
C2: Identity data, including gender and date of birth.
C3: Communication data between you and us, including recordings of calls to our service centers, e-mail communication, online chats, comments, and reviews collected through surveys or posted on our channels and on social media platforms.
C4: Digital information data collected when you visit our websites, applications, or other digital platforms, including IP addresses, browser data, traffic data, social media behavior, and user patterns. If you subscribe to our newsletters, we may collect data regarding which newsletters you open, your location when opening them and whether you access any links inserted in the newsletters.
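The C1 to C4 levels above lend themselves to a simple field-to-level lookup. The sketch below is one possible way to tag the fields of a record; the field names, the `SENSITIVE` fallback for unmapped fields, and the helper functions are assumptions for illustration, not part of any standard.

```python
# Hypothetical mapping of field names to the C1-C4 levels described above.
CLASSIFICATION = {
    "name": "C1", "address": "C1", "telephone": "C1", "email": "C1",
    "gender": "C2", "date_of_birth": "C2",
    "call_recording": "C3", "chat_transcript": "C3", "survey_comment": "C3",
    "ip_address": "C4", "browser": "C4", "newsletter_opened": "C4",
}

# Assumption: anything not yet classified is treated as sensitive by default.
DEFAULT_LEVEL = "SENSITIVE"


def classify(record: dict) -> dict:
    """Return {field name: classification level} for every field in the record."""
    return {field: CLASSIFICATION.get(field, DEFAULT_LEVEL) for field in record}


def fields_at_level(record: dict, level: str) -> list:
    """List the fields of a record that fall under one classification level."""
    return sorted(f for f, lvl in classify(record).items() if lvl == level)


customer = {
    "email": "a@example.com",
    "date_of_birth": "1990-01-01",
    "ip_address": "203.0.113.7",
    "loyalty_tier": "gold",  # not in the mapping, falls back to the default
}
print(classify(customer))
```

Defaulting unknown fields to the restrictive level follows the rule stated earlier that data points are treated as sensitive and confidential unless classified otherwise.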
DataOps
Will be discussed in the Democratization of data.
Data Fabric
This will be discussed in the Decentralization of data.
Augmented Data Management
Data Preparation
1. gather data
2. discover and assess data
3. cleanse and validate data
4. transform and enrich data
5. store data
Front runners: Alteryx APA platform, Power BI, Tableau Server, etc.
Though these are BI solutions, they are also predominantly used for Data preparation.
Metadata Management solutions
Metadata management solutions deliver insights from data that is stored in the enterprise environment. They make it possible to search, locate, and easily manage the information the organization needs. This in turn leads to better data governance and creates better opportunities for advanced and enhanced analytics. Metadata management solutions include data catalogues, tables, and other visual tools for processing information.
A sample of leading metadata management software:
10. Informatica
9. IBM
8. Alation
7. ASG Technologies
6. Collibra
5. Infogix
4. Octopai
3. Alex Solutions
2. Smartlogic
1. Erwin
Multimodel DBMS
1. A multi-model database is a database that can store, index, and query data in more than one model.
2. For the most part, databases support only one model, such as RDBMS, document, graph, or triplestore. A database that combines many of these is multi-model.
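To make "more than one model" concrete, here is a toy in-memory store that exposes the same data through key-value, document, and graph access. It is a sketch only: the class and method names are invented for this example and do not correspond to the API of any product listed in this section.

```python
class MultiModelStore:
    """Toy store: one set of records, three ways to query it."""

    def __init__(self):
        self._docs = {}   # key -> document (a dict)
        self._edges = {}  # key -> set of keys it links to

    def put(self, key, doc):
        self._docs[key] = doc
        self._edges.setdefault(key, set())

    def get(self, key):
        """Key-value model: fetch a record by its key."""
        return self._docs.get(key)

    def find(self, **criteria):
        """Document model: keys of documents whose fields match all criteria."""
        return sorted(
            key for key, doc in self._docs.items()
            if all(doc.get(field) == value for field, value in criteria.items())
        )

    def link(self, src, dst):
        """Graph model: add a directed relationship between two stored records."""
        if src not in self._docs or dst not in self._docs:
            raise KeyError("both ends of a link must be stored records")
        self._edges[src].add(dst)

    def neighbours(self, key):
        """Graph model: keys directly reachable from the given record."""
        return sorted(self._edges.get(key, ()))


store = MultiModelStore()
store.put("ann", {"type": "person", "city": "Singapore"})
store.put("bo", {"type": "person", "city": "Singapore"})
store.put("acme", {"type": "company", "city": "Singapore"})
store.link("ann", "acme")
store.link("ann", "bo")
print(store.get("ann"), store.find(type="person"), store.neighbours("ann"))
```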
Multi-model databases include, but are not limited to:
AllegroGraph: document (JSON, JSON-LD), graph
ArangoDB: document (JSON), graph, key-value
Cosmos DB: document (JSON), graph, key-value, SQL
Couchbase: document (JSON), key-value, N1QL
Datastax: key-value, tabular, graph
EnterpriseDB: document (XML and JSON), key-value
MarkLogic: document (XML and JSON), graph triplestore, binary, SQL
MongoDB: document (XML and JSON), graph, key-value, time-series
Oracle Database: relational, document (JSON and XML), graph triplestore, property graph, key-value, objects
OrientDB: document (JSON), graph, key-value, reactive, SQL
Redis: key-value, document (JSON), property graph, streaming, time-series
SAP HANA: relational, document (JSON), graph, streaming
Virtuoso Universal Server: relational, document (XML), RDF graphs
CosmosDB, Data Lake & Delta Lake.
Graph DBMS
1. A graph database is designed to treat the networks and relationships between data as equally important as the data itself.
2. The intention is to hold data without constraining it to a pre-defined model.
E.g., Neo4j, Neptune, etc.
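The idea that relationships are as important as the data can be illustrated with plain Python: the relationships are stored explicitly as (source, relationship, target) triples and then traversed directly. This is a minimal sketch with made-up sample data; it does not use Neo4j or Neptune and is not how those engines store data internally.

```python
from collections import deque

# Relationships are first-class records: (source, relationship type, target).
edges = [
    ("alice", "WORKS_AT", "acme"),
    ("bob", "WORKS_AT", "acme"),
    ("bob", "KNOWS", "carol"),
    ("carol", "WORKS_AT", "globex"),
]


def build_adjacency(edge_list):
    """Index the edges by node so that neighbours can be looked up directly."""
    adjacency = {}
    for src, rel, dst in edge_list:
        adjacency.setdefault(src, []).append((rel, dst))
        adjacency.setdefault(dst, []).append((rel, src))  # allow traversal both ways
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for _rel, nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


graph = build_adjacency(edges)
print(shortest_path(graph, "alice", "globex"))
```

No schema was declared up front: adding a new relationship type only means appending another triple.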
Application Data Management
1. Application Data Management (ADM) is a technology-enabled discipline designed to help users govern and manage data in the business applications such as ERP, Financial applications.
2. ADM is critical for digital transformation and other modernization initiatives.
3. Today ADM has emerged as a way to move beyond master data management to standardize and govern broader application data, and Winshuttle is paving the way forward in the digital era.
Blockchain
Discussed
Data Lakes
Already discussed
Master Data Management
1. Master Data Management (MDM) is a technology-enabled discipline that ensures uniformity, accuracy, stewardship, semantic consistency, and accountability of the enterprise’s official shared master data assets.
2. “Single Version of Truth” is the core concept and is difficult to achieve without proper architecture and stakeholder acceptance.
In-DBMS Analytics
1. An in-DBMS Analytics system contains an EDW (Enterprise Data Warehouse) integrated with an Analytic database platform.
2. It is mainly used for applications that require intensive processing. Once the datasets are effectively gathered in data marts, this technology facilitates and secures data analysis, processing, and retrieval.
3. Key benefits: it streamlines the identification of future business opportunities and risks, improves organizational predictive analytics capability, and provides ad-hoc analytics reporting.
Logical Data Warehouse
1. An LDW (Logical Data Warehouse) is an architectural layer that sits on top of the data warehouse and allows viewing data without transformation or movement.
2. It allows analysts and other business users to access data without formatting and eliminates the need to transform and consolidate data from disparate sources in order to view it.
3. It provides a more holistic view of an organization’s data at any point in time, regardless of where that data may reside.
Wide-Column DBMSs
1. A wide-column store is a NoSQL database.
2. Wide-column stores vs columnar databases:
a. Wide-column stores such as Bigtable and Apache Cassandra are not column stores, as they do not use columnar storage.
b. In a genuine column store, each column is stored separately on disk.
c. Wide-column stores often support the notion of column families that are stored separately.
d. Within a given column family, all data is stored in a row-by-row fashion.
- Amazon DynamoDB
- Apache Accumulo
- Apache Cassandra
- Apache HBase
- DataStax Enterprise
- DataStax Luna
- DataStax Astra
- Azure Tables
- Bigtable
- Hypertable
- MapR-DB
- ScyllaDB
- Sqrrl
- ClickHouse
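The difference between a wide-column store and a columnar database is easiest to see by laying the same rows out both ways. The sketch below uses nested Python dicts and lists as a stand-in for on-disk layout; the sample rows and the `profile` / `activity` column families are hypothetical.

```python
rows = [
    {"id": 1, "name": "Ann", "city": "SG", "clicks": 10},
    {"id": 2, "name": "Bo", "city": "KL", "clicks": 7},
]

# Hypothetical column families grouping columns that are used together.
COLUMN_FAMILIES = {"profile": ["name", "city"], "activity": ["clicks"]}


def wide_column_layout(rows, families):
    """Families are stored separately; inside a family, data is kept row by row."""
    return {
        family: {row["id"]: {col: row[col] for col in cols} for row in rows}
        for family, cols in families.items()
    }


def columnar_layout(rows):
    """Every column is stored separately as its own contiguous list of values."""
    return {col: [row[col] for row in rows] for col in rows[0]}


wide = wide_column_layout(rows, COLUMN_FAMILIES)
columnar = columnar_layout(rows)
print(wide["profile"][1])   # one row's values within a family, read together
print(columnar["clicks"])   # one column's values across all rows, read together
```

Reading a whole row of a family is a single lookup in the first layout, while scanning or aggregating one column across all rows is a single lookup in the second.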
Document Store DBMSs
1. A document store is a type of non-relational database that is designed to store and query data as JSON-like documents.
2. It makes it easier for developers to store and query data in a database using the same document-model format they use in their application code.
3. Documents are flexible, semi-structured, and hierarchical in nature.
4. This allows them to evolve with applications.
5. Document stores enable flexible indexing, powerful ad-hoc queries, and analytics over a collection of documents.
E.g., MongoDB.
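The points above about flexible, hierarchical documents can be shown with plain Python dicts standing in for JSON documents. This is a sketch of the document model only; the collection, the dotted-path convention, and the `find` helper are assumptions for illustration and are not MongoDB's API.

```python
# Documents in one collection need not share the same fields.
collection = [
    {"_id": 1, "name": "Ann", "address": {"city": "Singapore"}, "tags": ["vip"]},
    {"_id": 2, "name": "Bo", "address": {"city": "Kuala Lumpur"}},
    {"_id": 3, "name": "Cy"},  # no address at all
]


def get_path(doc, dotted_path):
    """Walk a nested document along a dotted path such as 'address.city'."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(docs, dotted_path, value):
    """Ad-hoc query: documents whose nested field equals the given value."""
    return [doc for doc in docs if get_path(doc, dotted_path) == value]


matches = find(collection, "address.city", "Singapore")
print([doc["name"] for doc in matches])
```

Adding a new field to future documents requires no migration of the existing ones, which is what lets the data model evolve with the application.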
Operational In-Memory DBMS
1. An in-memory database management system (IMDBMS) is a database management system (DBMS) that predominantly relies on main memory for data storage, management and manipulation.
2. This eliminates the latency and overhead of hard disk storage and reduces the instruction set that’s required to access data.
Data Integration Tool
1. Data integration is the process of bringing data from different sources into a single destination; a data integration tool is the software that does this.
2. Once the data is gathered in a single destination, meaningful insights can be drawn from it. The tool should integrate the collected data such that it is comprehensive, reliable, correct, and current.
3. Organizations should be able to readily rely on it for business analysis and reporting.
4. Types of Data integration tools:
a. On-premise data integration tools
b. Cloud-based data integration tools
c. Open-source data integration tools
d. Proprietary data integration tools
Examples: Pentaho, Informatica PowerCenter, Talend, Hevo Data, etc.
Analytical In-Memory DBMS
Refer whitepaper
Data Encryption
Encryption at rest: KMS
Encryption in transit: TLS / SSL
TDE: Transparent Data Encryption
Symmetric and Asymmetric encryption
Data Virtualization
- Virtual views of the data
- No data is physically moved
- often the same as federated data
- Data virtualization is an approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data
- Unlike the traditional extract, transform, load (ETL) process, the data remains in place, and real-time access is given to the source system for the data.
- This reduces the risk of data errors and the workload of moving around data that may never be used. This concept and software is a subset of data integration and is commonly used within business intelligence, service-oriented architecture data services, cloud computing, enterprise search, and master data management.
- Some enterprise landscapes are filled with disparate data sources including multiple data warehouses, data marts, and/or data lakes, even though a Data Warehouse, if implemented correctly, should be unique and a single source of truth.
- Data virtualization can efficiently bridge data across data warehouses, data marts, and data lakes without having to create a whole new integrated physical data platform.
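A small-scale analogy for the "virtual view, no data physically moved" idea can be built with Python's standard `sqlite3` module: two separate databases stand in for a warehouse and a lake, and a temporary view joins them at query time without copying rows into a new store. The database aliases, table names, and sample rows are assumptions for this sketch; real data virtualization products work across heterogeneous remote systems, which this does not attempt.

```python
import sqlite3

# One connection, two independent in-memory databases attached under aliases
# that stand in for separate systems. isolation_level=None means autocommit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

conn.execute("CREATE TABLE warehouse.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE lake.events (customer_id INTEGER, event TEXT)")
conn.executemany("INSERT INTO warehouse.customers VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bo")])
conn.executemany("INSERT INTO lake.events VALUES (?, ?)",
                 [(1, "login"), (1, "trade"), (2, "login")])

# The virtual view: only the query is stored, never the joined rows.
# A TEMP view is used because it may reference tables in attached databases.
conn.execute("""
    CREATE TEMP VIEW customer_activity AS
    SELECT c.name AS name, COUNT(e.event) AS events
    FROM warehouse.customers AS c
    LEFT JOIN lake.events AS e ON e.customer_id = c.id
    GROUP BY c.id, c.name
""")


def activity():
    """Query the virtual view; results always reflect the current source data."""
    return conn.execute(
        "SELECT name, events FROM customer_activity ORDER BY name"
    ).fetchall()


print(activity())
```

Because the view is evaluated on every query, a row added to `lake.events` shows up in the next call to `activity()` with no reload or ETL step.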
In-Memory Data Grids
1. An in-memory data grid (IMDG) is a set of networked/clustered computers that pool together their random access memory (RAM) to let applications share data with other applications running in the cluster.
2. Though IMDGs are sometimes generically described as a distributed in-memory data store, IMDGs offer more than just storage.
3. IMDGs are built for data processing at extremely high speeds. They are designed for building and running large-scale applications that need more RAM than is typically available in a single computer server.
4. This enables the highest application performance by using RAM along with the processing power of multiple computers that run tasks in parallel. IMDGs are especially valuable for applications that do extensive parallel processing on large data sets.
5. IMDGs are performant: moving from a traditional store to in-memory can make applications 100 or even 1000x faster.
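The core mechanics of an in-memory data grid, pooling the memory of several nodes and processing each node's share of the data locally, can be sketched in a single process. In this illustration each "node" is just a Python dict, keys are routed by a CRC32 hash, and the partitions are processed one after another rather than in parallel on separate machines; the class and method names are hypothetical.

```python
import zlib


class InMemoryDataGrid:
    """Toy grid: keys are partitioned across several in-memory 'nodes'."""

    def __init__(self, node_count: int):
        if node_count < 1:
            raise ValueError("a grid needs at least one node")
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key: str) -> int:
        # Deterministic hash so that a key always routes to the same node.
        return zlib.crc32(key.encode("utf-8")) % len(self.nodes)

    def put(self, key: str, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key: str, default=None):
        return self.nodes[self._node_for(key)].get(key, default)

    def map_reduce(self, mapper, reducer, initial):
        """Apply mapper to each node's local partition, then combine the results."""
        result = initial
        for partition in self.nodes:
            result = reducer(result, mapper(partition))
        return result


grid = InMemoryDataGrid(node_count=3)
for i in range(100):
    grid.put(f"k{i}", i)

total = grid.map_reduce(
    mapper=lambda partition: sum(partition.values()),  # runs per node
    reducer=lambda acc, partial: acc + partial,        # combines partial sums
    initial=0,
)
print(total)
```

In a real grid the mapper would run on each machine against its own RAM, so only the small partial results travel over the network.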
References: multiple sources from the internet.
As part of the strategic trends, I would also like to include Data Fabric and Data Mesh. As these technologies, in my opinion, hold a more meaningful position in the decentralization trend, I have included them there.