A robust Data Fabric for post-pandemic Digital Transformation
Data is everywhere: from cell towers to aircraft, from heartbeat monitors to systems monitoring. However, the challenge of making sense of data has only increased. IBM leads in addressing this challenge with Cloud Pak for Data and its built-in data fabric.
At its core, the data fabric concept is about providing everyone in an organization with the ability to find, explore, and ask questions against all available data.
A data fabric can be thought of as a tapestry that connects data from multiple locations (edge, core, cloud), types, and sources, with methods for accessing that data. For data consumers, whether users or applications and systems, it abstracts away the complexities associated with the underlying storage, movement, transformation, securing, governing, and processing of data.
It is important to note that a data fabric is not a replacement for more traditional data management architectures such as data lakes, data warehouses, data hubs, and databases. Rather, a data fabric involves those systems as active participants in a unified approach, overlaying them with a layer of search, query, and governance capabilities.
The data fabric aims to simplify data complexity by automating data integration, data governance, and data processing. Tools for data fabric design and management include data pipelines with various integration styles; workflow management, orchestration, and policy management; dynamically updated metadata and machine learning (ML)-augmented data management; augmented data cataloging; and Data Virtualization.
Through Cloud Pak for Data, IBM provides organizations with an intelligent data fabric.
What IBM has learned from countless AI projects is that every step of the journey is critical. AI is not magic and requires a thoughtful and well-architected approach. For example, the majority of AI failures are due to problems in data preparation and data organization, not the AI models themselves. Success with AI models is dependent on achieving success first with how you collect and organize data.
The AI Ladder represents a prescriptive approach to help clients overcome data challenges and accelerate their journey to AI, no matter where they are starting from. It allows for simplification and automation that turns data into insights by unifying the collection, organization, and analysis of data, regardless of where it lives. By climbing the ladder to AI, enterprises can build a governed, efficient, agile, and future-proof approach to AI.
The AI Ladder has four steps (often referred to as “rungs”):
Collect: Make data simple and accessible.
Collect data of every type, regardless of where it lives, enabling flexibility in the face of ever-changing data sources. Note that collecting does not mean putting all the data in one place. In fact, quite the opposite: virtualizing the data allows access to it wherever it lives, as if it were consolidated.
Organize: Create a business-ready analytics foundation.
Organize collected data into a trusted, business-ready foundation with built-in governance, protection, and compliance.
Analyze: Build and scale AI with trust and transparency.
Analyze data in automated ways and benefit from AI models that empower teams to gain new insights and make better, smarter decisions.
Infuse: Operationalize AI throughout the business.
Infuse AI across the business - across multiple departments and within various processes - drawing on predictions, automation, and optimization.
Supporting the AI Ladder is the concept of modernization: customers can simplify and automate how they turn data into insights by unifying the collection, organization, and analysis of data, regardless of where it lives, within a hybrid cloud platform. All of this is secured to help manage risk and intelligently defend against threats.
IBM uniquely delivers the capabilities for all rungs of the AI Ladder in IBM Cloud Pak for Data: one platform, supporting a hybrid cloud environment (through Red Hat OpenShift, so it can run anywhere), that brings all of a customer’s data and AI capabilities into one set of collaborative workflows and governance capabilities, all secured to manage risk and intelligently defend against threats.
In order to manage all their data, IBM recognizes that its customers leverage not only IBM offerings but also those provided by third parties, with many embracing open-source technologies.
IBM has pre-integrated OEM software into Cloud Pak for Data. IBM’s OEM partners are market leaders and expand IBM’s capabilities to support organizations as they climb the AI Ladder.
What is Cloud Pak for Data? At a high level, it’s an Enterprise Insights Platform (EIP) that runs on any vendor’s cloud and any infrastructure. If EIP is a new term to you, know that many industry analysts and consultants like Forrester and PricewaterhouseCoopers (PwC) have recently started using this term as a category for integrated sets of data management, analytics, and development tools.
The first core tenet of Cloud Pak for Data is that you can run it anywhere, co-located with your infrastructure investments. This means you can deploy Cloud Pak for Data on every major cloud vendor’s platform, including Azure, Amazon Web Services (AWS), Google Cloud Platform (GCP), and IBM Cloud. You can also deploy it on-premises if you are developing a hybrid cloud approach. Finally, on IBM Cloud, you can subscribe to Cloud Pak for Data as a Service if you need a fully managed option, where you only pay for what you use. With Cloud Pak for Data, your organization has the deployment flexibility to run anywhere.
Cloud Pak for Data is built on the foundation of Red Hat OpenShift. This provides the flexibility for customers to scale across any infrastructure using the world’s leading open-source steward: Red Hat. Red Hat OpenShift is a Kubernetes-based platform that allows IBM to deploy software through a container-based model, delivering greater agility, control, and portability.
IBM’s Cloud Pak offerings all share a common control plane, which makes administration and integration of diverse services easy.
Cloud Pak for Data includes a set of pre-integrated data services that allow you to collect information from any repository (databases, data lakes, data warehouses, you name it). The design point is for customers to leave the data in all the places where it already resides, while to its users the enterprise data appears to be in one spot.
Once all your enterprise data has been connected, industry-leading data organization services can be deployed to develop an enterprise data catalog. This capability enables a “shop for data” experience and enforces governance across all data sources, giving data consumers a single place to go for all their data needs.
With your enterprise data connected and cataloged, Cloud Pak for Data presents a wide variety of data analysis tools out of the box. For example, there is a wealth of data science capabilities that cater to all skill levels (meaning no-code, low-code, and all code). Users can quickly grab data from the catalog and instantly start working towards generating insights in a common workflow built around the “project” concept.
For additional capabilities, there are a large set of extended services available for Cloud Pak for Data that present more specialized data management and analytics capabilities. These range from powerful IBM solutions, like Planning Analytics to solutions from IBM partners, like Palantir (creating a business ontology) and DataStax (open-source database).
Cloud Pak for Data includes many options for storing and querying data. For enterprise-ready capabilities and high scalability, Db2 and Netezza are available. For organizations that are committed to open-source software, EDB, MongoDB, DataStax, and Cloudera can address a wide array of data storage needs.
Also, most of these data services are available in both the installable (or hosted) software form of Cloud Pak for Data and its managed Software-as-a-Service form on IBM Cloud.
Aside from these databases, Cloud Pak for Data includes a query service called Data Virtualization, which enables users to create virtual views spanning multiple databases, even if they use a different format (for example, NoSQL vs. relational), or a different deployment model (for example, on-premises or as-a-Service on a cloud).
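To make this concrete, here is a minimal Python sketch of what querying a virtual view can look like, assuming a Data Virtualization instance reachable through the standard ibm_db driver (the service exposes a Db2-compatible SQL interface); the host, credentials, and view name are illustrative placeholders, not actual defaults.

import ibm_db

# Connect to the Data Virtualization endpoint (illustrative connection string).
conn = ibm_db.connect(
    "DATABASE=bigsql;HOSTNAME=cpd-host.example.com;PORT=32051;"
    "PROTOCOL=TCPIP;UID=user1;PWD=secret;SECURITY=SSL;", "", "")

# One SELECT can span tables that physically live in different databases;
# the engine decides what to push down to each underlying source.
stmt = ibm_db.exec_immediate(
    conn,
    "SELECT region, SUM(amount) AS total "
    "FROM FOLDING.SALES_ALL GROUP BY region")

row = ibm_db.fetch_assoc(stmt)
while row:
    print(row["REGION"], row["TOTAL"])
    row = ibm_db.fetch_assoc(stmt)
ibm_db.close(conn)

The point of the sketch: the application issues ordinary SQL against one view, even though the rows behind it may live in a NoSQL store, a warehouse, and a cloud database.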
Data access and processing have been a challenge for a long time. You can look back multiple decades and see issues with data silos. One past solution was to copy the data from the silos into data warehouses. When data warehouse solutions were too difficult to implement, another solution was to create subject-specific data marts. These data silos continue to grow and multiply, which leads to many copies of the same data. In fact, the IDC Global DataSphere Forecast predicts that the volume of copied (replicated) data will be ten times as much as unique (original) data by 2024 [1].
Another challenge is organizational complexity and fluidity. Organizations are not monolithic: they include multiple groups that define and manage their data in their own way. Adding to this complexity, organizations change as business goals shift and acquisitions bring new data, along with new processes that need to be integrated.
All this makes access to different internal data sets complex. In the current business world, we also have to contend with external data, such as social media for brand protection. Many companies also need external data such as weather data, accident data, and more. All those data sets, internal and external, require their own access methods, making life hard for application developers, especially when data from multiple data sets must be combined.
Having to manipulate data coming from multiple data sources raises issues of data latency. These issues get compounded as the number of data sources and amount of data increase. We have to find a way to efficiently access these data sources to limit the latency.
Finally, there is the issue of data security. When we talk about data security, we mean potential attacks from the outside but also attacks from the inside. Only people with a business reason to access specific datasets should have permission to do so, and we must consider the type of operations performed: read, write, update. It does not stop there; we must also protect personally identifiable information (PII).
Data Virtualization addresses each of the major data management challenges listed here:
- Data Virtualization presents a single view of data across the organization.
- This single view provides a consistent interface to all the data, which helps overcome organizational complexity. Any new data can also be virtualized to maintain that single view.
- The single view not only extends across the enterprise but can also cover external data. To make this viable, intelligent views can be created to limit data movement, and other optimizations (like caching) can be applied to reduce latency. This enables real-time analytics without having to move data to a common repository.
- Finally, Watson Knowledge Catalog, a key element in IBM’s data fabric architecture, integrates with Data Virtualization to enable access control for sensitive data.
As a quick proof point, the Forrester New Tech Total Economic Impact (TEI) report concluded that with Data Virtualization, customers can expect a reduction in Extract, Transform, and Load (ETL) requests of 25 to 64% [2]. This speaks to the business impact of how Data Virtualization meets the first three challenges listed above. Instead of requiring the development of expensive ETL pipelines, data analysis teams can rely on virtual views.
It is easy to take a simplistic view of what Data Virtualization is. For example, we can look at the processing architecture as a hub and spoke. With this approach, a coordinator process knows which data sources to access and retrieves the required data based on simple conditions from each data source. The coordinator then needs to sift through all the data and figure out how to assemble the multiple pieces together to get to a final result.
The hub and spoke approach relies mainly on data shipping to get to an answer. This does not scale well for two main reasons:
1. Shipping data over a network is the slowest part of the processing. As the data size increases, it becomes impractical. This means it is not scalable.
2. As the number of data sources and the amount of data increase, we need more processing power and memory for the coordinator process. Eventually we can hit the limits of the resources available to the coordinator, and even before that limit, the amount of data to be processed can make getting to an answer very slow, basically making the approach impractical.
For these reasons, a hub and spoke approach is not a viable solution over time as the amount of data and data sources increase. A better approach is to push as much of the processing to the data sources as possible.
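The difference can be illustrated with a deliberately tiny Python sketch (a toy model, not IBM's implementation): in the data-shipping style every raw row crosses the network to the coordinator, while in the pushdown style each source computes a partial aggregate and the coordinator merely merges small partial answers.

from collections import Counter

# Two pretend data sources, each holding (product, quantity) rows locally.
sources = {
    "db_east": [("widgets", 10), ("gears", 4), ("widgets", 7)],
    "db_west": [("gears", 12), ("widgets", 1)],
}

def data_shipping():
    # Hub-and-spoke: every raw row is shipped; the coordinator does all the work.
    totals = Counter()
    for rows in sources.values():
        for product, qty in rows:          # full data transfer
            totals[product] += qty
    return totals

def pushdown():
    # Constellation-style: each source aggregates locally ("at the source")
    # and ships only a tiny partial answer for the coordinator to merge.
    partials = []
    for rows in sources.values():
        local = Counter()
        for product, qty in rows:
            local[product] += qty
        partials.append(local)             # small result set shipped
    totals = Counter()
    for local in partials:
        totals.update(local)               # cheap merge step
    return totals

assert data_shipping() == pushdown() == Counter({"widgets": 18, "gears": 16})

Both strategies return the same answer; what changes is how much data crosses the network and how much work lands on the coordinator.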
The IBM approach is to organize the data nodes so they can collaborate by each contributing partial answers to a user’s query. IBM refers to the organization of these data nodes as a constellation. A constellation is defined for each incoming query, as selected nodes are grouped together. These groups minimize data transfer and mostly provide a final partial answer. The coordinator then has much less data to process and its main task is to assemble the partial answers together.
The push down of the processing into the constellation and the parallel execution in each subgroup within the constellation provides unlimited scalability. The result is that the computational mesh provides an efficient processing model that limits data latency. Results are obtained faster, and that speed is more consistent as the amount of data and data sources increase. Since the constellation mesh is dynamic, it is easy to add new data sources. Query performance is also improved by the gathering of statistics as input for IBM’s query optimizer, which is backed with decades of research and development.
The coordination of the mesh is controlled through a central node that can scale as data demand grows. It can also be implemented as a high-availability (HA) cluster to provide robustness and reliability. This type of enterprise feature gives the enterprise the ability to quickly adapt to increasing business demands.
A key component in Data Virtualization is a common SQL engine that includes a rich set of SQL dialects that provides application portability. This SQL engine also takes advantage of decades of IBM’s experience in query optimization, which allows the Data Virtualization engine to push as much processing as possible to the data sources.
Accessing the data is one part of the Data Virtualization problem. With many data sources, it is important to have automated processes to discover the data instead of having to spend a lot of time entering the information. With automated data discovery, the data users can take a self-service approach to data so they can start on their solution faster. Of course, this self-service approach is still under the governance control, preserving data security.
Here is a high-level architecture of Data Virtualization in Cloud Pak for Data. Above, we covered the architectural differentiation of the virtual layer, represented in the middle of the diagram. To communicate with the data sources, the virtual platform has a comprehensive set of adapters, which we return to later. Suffice it to say that it can communicate with a variety of data sources, even including applications and web services.
The different user applications and tools communicate with the virtual layer through a consumer layer that provides that single view across the business data.
The virtualization platform provides access control over the different views of the data, and it can be augmented with a governance catalog that adds business term definitions and data privacy and protection rules. This includes, for example, the ability to mask data by redacting, substituting, or obfuscating asset columns.
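As a rough illustration of what redact, substitute, and obfuscate mean in practice, here is a hedged Python sketch of the three masking styles; in Cloud Pak for Data, enforcement happens through platform governance policies rather than application code like this.

import hashlib

def redact(value: str) -> str:
    # Redaction: replace the value entirely with filler characters.
    return "X" * len(value)

def substitute(value: str) -> str:
    # Substitution: replace with a stable surrogate so the same input
    # always maps to the same fake value (hash-derived here).
    return "user_" + hashlib.sha256(value.encode()).hexdigest()[:8]

def obfuscate(value: str) -> str:
    # Obfuscation: preserve the shape but hide the middle characters
    # (assumes the value has at least four characters).
    return value[:2] + "*" * (len(value) - 4) + value[-2:]

print(redact("555-12-9999"))           # XXXXXXXXXXX
print(substitute("[email protected]"))  # user_ plus a hash-derived suffix
print(obfuscate("4111111111111111"))   # 41************16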
Data Virtualization can cache data from virtual views, enabling queries to perform faster. This is not just the ability to cache, but the ability to monitor data access activity and recommend what should be cached so the overall system performs faster for all users. It identifies commonly executed queries so that, when such a query is requested, the virtualization platform can avoid accessing the sources directly and serve results from the cache at a much higher level of performance. The cache also reduces the overall processing requirement: the code that does not need to be executed runs the fastest!
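The idea behind those recommendations can be sketched in a few lines: count how often each normalized query shape runs and flag the frequent ones as caching candidates. The real service bases its advice on far richer workload monitoring; this is only a toy model.

from collections import Counter

history = Counter()

def record(query: str):
    # Normalize whitespace and case so equivalent queries count together.
    history[" ".join(query.lower().split())] += 1

def cache_candidates(threshold: int = 3):
    # Recommend caching any query shape seen at least `threshold` times.
    return [q for q, n in history.items() if n >= threshold]

for _ in range(5):
    record("SELECT region, SUM(amount) FROM sales_all GROUP BY region")
record("SELECT * FROM orders WHERE id = 42")

print(cache_candidates())  # only the frequently executed aggregate appears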
Here are a few examples of important use-cases:
Over the last decade or so, many enterprises have created big data lakes and warehouses using Hadoop and NoSQL databases, adding data silos to their organizations. These multiple silos now need to be drawn on together to get better insights. Data Virtualization can provide a single view across these big data repositories and help organizations gain insight into their overall business.
The same can be said about enterprise data warehouses (EDWs) and data marts. In the case of mergers and acquisitions, for example, we can end up with additional data warehouses and data marts, each providing only a partial view of the business. Access to these repositories requires different APIs and credentials. Even after getting over this hurdle, there is still the issue of joining the data together, which is not an easy task and can introduce complexity, latency, and even errors due to bad code (bugs). Data Virtualization can eliminate this complexity.
Many enterprises have useful data warehouses but, with changing business requirements, the data they contain is no longer sufficient or complete. With Data Virtualization, IBM can augment the EDW with additional data that completes the information required to answer changing business needs.
In some cases, a subject-specific data mart may be needed for a specific project. Instead of going through the large effort of allocating computing and storage resources to create a new data mart and then duplicating data from other parts of the enterprise, it can be much more efficient to dynamically define a data mart using Data Virtualization. In addition to removing all the effort of creating a data mart from scratch, this also eliminates data replication. This way, the project is always certain to have a single version of the truth as it concerns the state of the business.
There are always new business requirements. Having the ability to quickly assemble data from different sources allows teams to test “what if” scenarios and quickly find possible competitive advantages that can be implemented.
Finally, as more and more projects take advantage of Data Virtualization, they can get additional query performance benefits since they can share cached data from frequently executed queries.
Data Virtualization provides huge benefits with its efficient single view of the enterprise data. It even provides some coarse security on the data. But enterprises need more. As mentioned in the architecture discussion earlier, Data Virtualization integrates with the Watson Knowledge Catalog service in Cloud Pak for Data and the overall data fabric.
With the catalog integration, you can represent columns and tables using business terms. The catalog service can apply fine-grained policies so that only users with the appropriate business requirements can see the data. When it comes to personally identifiable information (PII), the data can be masked for protection from external or even internal hacking. Policy rules can be implemented, adding to the data protection. Finally, data lineage can be followed to ensure data quality.
The Data Virtualization service in Cloud Pak for Data supports nearly 40 data sources, ranging from on-premises database systems to cloud-based database services. The list is not limited to relational databases; it also includes big data and NoSQL stores, like Apache Hive and MongoDB.
While the world’s leading data stores are already supported, IBM development is continually adding support for additional sources.
Case Study 1:
ING is a large, complex bank with a mix of disparate data silos with both legacy and modern capabilities. Adherence to numerous industry regulatory requirements made accessing and querying data difficult and complex; data lineage was a large factor. Data insight initiatives were often slowed or delayed.
ING sought to move to a single corporate operating model, in anticipation of General Data Protection Regulation (GDPR) and other cross-border regulatory requirements.
ING partnered with IBM to streamline data management and applications across all operational countries by developing a single operating model strategy and platform that leveraged the IBM Data Virtualization, IBM Watson Knowledge Catalog, and IBM DataStage services in Cloud Pak for Data to ensure proper data governance, while leveraging data from across the bank.
The following resources provide further details on the ING data fabric built on IBM Cloud Pak for Data:
Audiogram: ING’s Ferd Scheepers shares his vision of using data fabric in a hybrid cloud environment.
Tooling:
To support fast insights, you need a flexible data management foundation powered by modern technologies that are agile enough to deliver data anywhere it’s needed. Db2 Database on IBM Cloud Pak for Data combines a proven, AI-infused, enterprise-ready data management system with an integrated data and AI platform built on the security-rich, scalable Red Hat OpenShift foundation.
Db2 delivers industry leading query performance. Db2 has a world-class query optimizer, which recently became even more robust by using machine learning to reduce tuning requirements.
Db2 uses advanced authorization, encryption at rest and in transit, and comprehensive security controls for managing GDPR compliance.
Database availability is a paramount concern for most organizations, whether during day-to-day activities or in the event of a disaster. Db2 helps provide this availability in multiple ways.
Db2 offers a flexible environment: it integrates with multiple platforms and supports NoSQL, pureXML, graph, and JSON data, with Java, .NET, Ruby, Python, Perl, and more for building robust applications.
Db2 elastically scales up to 128 machines in multi-cloud and hybrid environments to reduce storage costs, and its data federation eliminates data silos.
Db2 is also available as a managed cloud service on IBM Cloud in two forms: as Db2 on Cloud, which can be used as a transactional or general-purpose database; and as Db2 Warehouse on Cloud, which can be used as an analytical database (for data warehouses or data marts).
Both of these offerings are fully managed, high-performance, elastic cloud services. They come with all the features you need to securely run your enterprise data workloads (a minimal connection sketch follows this list):
- Role-Based and Row/Column Access Control
- Auditing
- Query Federation
- In-Database ML
- Geospatial Analytics
- JSON support
- Graph Query engine
- Workload Management
- Oracle Compatibility
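For a small taste of these capabilities, here is a hedged Python sketch of connecting to a Db2 on Cloud instance with ibm_db and using the ISO SQL/JSON functions found in recent Db2 versions; the connection details and the orders table with a JSON doc column are illustrative assumptions.

import ibm_db

# Illustrative Db2 on Cloud connection string (host, port, and credentials
# are placeholders, not real defaults).
conn = ibm_db.connect(
    "DATABASE=bludb;HOSTNAME=db2-host.example.com;PORT=50001;"
    "PROTOCOL=TCPIP;UID=user1;PWD=secret;SECURITY=SSL;", "", "")

# JSON support: pull a scalar out of a JSON document column with JSON_VALUE
# (assumes an ORDERS table whose DOC column holds JSON documents).
stmt = ibm_db.exec_immediate(
    conn,
    "SELECT JSON_VALUE(doc, '$.customer.name') AS name "
    "FROM orders FETCH FIRST 5 ROWS ONLY")

row = ibm_db.fetch_assoc(stmt)
while row:
    print(row["NAME"])
    row = ibm_db.fetch_assoc(stmt)
ibm_db.close(conn)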
Netezza systems have always combined simplicity and performance. The next generation of Netezza warehouse, Netezza Performance Server (NPS) for IBM Cloud Pak for Data, continues to improve the performance of the Netezza engine while retaining 100% compatibility with IBM PureData System for Analytics and its TwinFin, Striper and Mako models.
The advanced technology in NPS fuses data warehousing and in-database analytics into a scalable, high-performance, massively parallel advanced analytic system that is designed to crunch through petabyte-scale data volumes with linear performance scalability. On the Cloud Pak for Data platform, NPS benefits from a containerized architecture and full integration with IBM’s data fabric and solutions for business intelligence, machine learning, and AI.
Netezza Performance Server is architected for high-performance analytics. As such, it is engineered to remove the complexities commonly found in relational database systems.
The best example of Netezza’s simple administration requirements is that there is no need to plan and maintain indexes. This isn’t even an option, because there are no indexes in Netezza. Because of this, and other efficiencies in Netezza’s architecture, there is no need for performance tuning work. Netezza also doesn’t require any physical data modeling or storage administration. Even tasks such as scaling database resources in or out are done with a graphical slider, where the administrator simply moves a dial out to increase processing or storage and in to scale it back. Altogether, this means there is simply not much work for database administrators to do with Netezza data warehouses.
Netezza doesn’t need special drivers or connectors to enable integration. The most common data integration, dashboarding, and data science platforms include out-of-the-box support for Netezza databases.
With Netezza, the value proposition has always been: load your data and start querying the same day. Even today, this is not the case for other cloud-based enterprise data warehousing services.
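That load-and-query simplicity carries into client code. Below is a minimal sketch using nzpy, the open-source, DB-API style Python driver for Netezza; the host, credentials, and sales table are illustrative.

import nzpy

# Connect to a Netezza Performance Server instance (placeholders throughout;
# 5480 is the customary Netezza port).
conn = nzpy.connect(user="admin", password="secret",
                    host="nps-host.example.com", port=5480,
                    database="testdb")

cur = conn.cursor()
# No indexes to design or tune: load the data and query it directly.
cur.execute("SELECT region, COUNT(*) FROM sales GROUP BY region")
for region, n in cur.fetchall():
    print(region, n)
conn.close()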
Netezza includes an extensive set of built-in data science and machine learning functions. Furthermore, Netezza is integrated with the data science and analytics tooling in Cloud Pak for Data.
Netezza Performance Server (NPS) is available as part of Cloud Pak for Data System, which is an all-in-one cloud-native data and AI platform in a box. It enables the collection, organization, and analysis of data with unprecedented simplicity and agility, within a pre-configured and governed environment.
By combining storage, compute, networking, and software into plug-and-play nodes, this hyper-converged infrastructure helps organizations accelerate private cloud deployment down to a matter of hours and easily and elastically scales to suit your changing data and AI needs.
With Cloud Pak for Data System, you benefit from the ability to:
- Rapidly stand up new private cloud data and AI platforms
- Easily and elastically scale to changing needs
- Simplify and unify software and systems management through a single intuitive dashboard
- Speed time-to-value with simplicity, and lower TCO with a flexible pay-as-you-go capacity model
As the newest “Netezza appliance”, NPS running on Cloud Pak for Data System has all the capabilities from the original Netezza appliances:
- 100% compatible with legacy Netezza appliances: NPS is based on the Netezza platform’s software, now upgraded with 64-bit and Non-Volatile Memory Express (NVMe) support. You can use NPS in a cross-generational environment alongside TwinFin/Striper/Mako Netezza appliances.
- Massively parallel processing (MPP) hardware acceleration using Field Programmable Gate Array (FPGA) chips optimized for instructions from the Netezza query engine. Processing is even faster now with the use of solid-state drive (SSD) technology. On average, NPS is at least 3x faster than Mako (the previous generation of the Netezza appliance) from a rack-to-rack perspective.
- Simple appliance form factor: Cloud Pak for Data System acts as a host server, with modular add-on NPS racks. This modular architecture allows for the addition of expansion units.
- Deep integration with third-party tools: With continued compatibility with the legacy Netezza appliances, NPS supports integration with a wide range of data analytics tooling, for example: Informatica, Tableau, SAS, MicroStrategy, Oracle, Microsoft, SAP.
Like its predecessors, Netezza Performance Server is available as a hyper-converged system, where the Netezza software is installed on an appliance with pre-configured, optimized hardware. Now, for the first time, Netezza is available on the cloud. This is made possible through the containerization of the Netezza software, which now runs on Red Hat OpenShift.
There are many benefits of running Netezza in a cloud setting:
- Flexible deployment options: Netezza Performance Server on Cloud Pak for Data can be deployed in both private and public cloud settings. There is support for public cloud deployments of Netezza with Cloud Pak for Data on IBM Cloud, AWS, and Azure.
- 100% Netezza compatibility: Having the same database engine on-premises and on the cloud makes many scenarios easy, like seamless expansion from ground to cloud.
- Scalable and elastic: Netezza on Cloud Pak for Data can be configured to scale compute and storage independently of one another. This is an easy task thanks to a simple interface where administrators scale compute and storage by dragging a slider. A key principle of cloud economics, pay for what you use, applies with Netezza: if there is little or no need for a warehouse, administrators can shut down its compute resources and resume when needed. This presents significant cost savings, while ensuring the data is preserved for later use.
- Highly available: Netezza Performance Server and the Cloud Pak for Data platform are designed for mission-critical enterprise use, which means that this offering includes high availability capabilities.
- Reliable: On public cloud, Netezza backups can be kept in object storage repositories, which can then be seamlessly replicated to multiple availability zones. Object storage is among the most reliable data storage alternatives available today.
- Data fabric integration: By being part of Cloud Pak for Data, Netezza is deeply integrated with the Data Virtualization and Watson Knowledge Catalog services, which make up IBM’s data fabric.
With public and private clouds making up a significant portion of the technology solutions for today’s enterprises, it is important that Netezza support both on-premises and cloud-based deployment options. Even when an enterprise has good reasons for maintaining data warehouses on a Cloud Pak for Data System appliance in its data center, cloud support provides the ability to easily and inexpensively extend and complement the on-premises system.
Here are a few interesting use cases where running Netezza on cloud can add value to an on-premises deployment:
- Seamless expansion from ground-to-cloud: Within minutes you can lift and shift on-premises Netezza databases with either the nz_migrate or nzbackup/restore utilities.
- Development and test environment: Instead of purchasing a separate appliance for testing work, development teams can back databases up to cloud object storage and restore onto Cloud Pak for Data on the cloud. Netezza running in the cloud environment is the same code as what’s running on the Cloud Pak for Data System hyperconverged appliance.
- Create cloud data marts: Quickly spin up localized or isolated data sets for specific lines of business without complicating work in the on-premises Netezza appliance.
Informix has a loyal following for many reasons, including its ease of use and management. In addition to all the features expected in an extensible relational database, Informix stands out through capabilities including:
Distributed Transactional Processing
Informix helps power transactional workloads easily in a wide range of environments to enable analytics-driven insights quickly. It can process over two million transactions per second with full consistency and can seamlessly scale from 700 to 75K concurrent users within 30 seconds.
Business Continuity
Informix is frequently deployed in mission-critical environments. To support such environments, Informix includes multiple capabilities. High-Availability Data Replication (HADR) supports multiple servers working in tandem, kept in sync so that no transactions are lost in the case of a host failure. Another capability, Remote Secondary Standby Database Servers, keeps a secondary server on standby that can take over in the event of a primary host failure. Informix also supports Shared-Disk Secondary Servers as another way to provide failure resilience. Finally, the flexible grid feature enables rolling upgrades with no outages.
Embedded Integration
With a silent installation and small footprint that requires just 100 MB of memory, Informix is simple, non-disruptive, and can run in the smallest capacity devices and gateways. Self-managing capabilities make it ideal as an embedded data management solution and it is the proven enterprise database for edge computing.
The purpose of a database is to record information, and there are many circumstances where one of the data points that needs to be captured is the date and time: this is time series data. There are many examples: bank deposits, temperature readings, telco call detail records (CDRs), and many more.
The Informix TimeSeries solution stores time series data in a special format within a relational database, in a way that takes advantage of the benefits of both non-relational and standard relational implementations of time series data. The Informix TimeSeries solution is more flexible than non-relational time series implementations because it is not specific to any industry, is easily customizable, and can combine time series data with information in relational databases.
The Informix TimeSeries solution loads and queries time-stamped data faster, requires less storage space, and provides more analytical capability than a standard relational table implementation. It saves disk space by not storing duplicate information from the columns that do not contain the time-based data. It loads and queries time series data quickly because the data is stored on disk in order by time stamp and by source.
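For a flavor of how this looks in SQL, here is a hedged sketch of TimeSeries DDL issued through the OpenInformix IfxPy driver; the connection string and names are illustrative, and the calendar and container setup that a real deployment also needs is omitted.

import IfxPy

# Illustrative Informix connection string.
conn = IfxPy.connect(
    "SERVER=ids1;DATABASE=sensordb;HOST=ifx-host.example.com;"
    "SERVICE=9088;PROTOCOL=onsoctcp;UID=informix;PWD=secret;", "", "")

# A row type describes one timestamped reading; the first field must be
# a DATETIME YEAR TO FRACTION(5) timestamp.
IfxPy.exec_immediate(conn, """
    CREATE ROW TYPE reading_t (
        tstamp DATETIME YEAR TO FRACTION(5),
        value  DECIMAL(10,3)
    )""")

# Each meter then holds all of its readings in one TIMESERIES column,
# stored on disk in order by time stamp and by source.
IfxPy.exec_immediate(conn, """
    CREATE TABLE meter_readings (
        meter_id INTEGER PRIMARY KEY,
        readings TIMESERIES(reading_t)
    )""")

IfxPy.close(conn)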
Advantages of Informix TimeSeries:
- Less storage: Due to the small index size and the specific way of storing the timestamp value, the storage requirement is drastically reduced. Storage reduction can be up to 80% in the case of regular time series and up to 30% in the case of irregular time series.
- Greater performance: Because the index size is small and the data is ordered physically on disk, much less I/O is required to select data. This results in extremely high performance, in both uploading and retrieving data. Also, since less disk space is required, administration tasks like backing up and purging data are faster.
- Lower total cost of ownership: TCO is much lower since there is no additional administration, no extra license cost, and no long learning curve. On the contrary, the results show huge disk space savings and extremely high performance.
- Interoperability: The Informix TimeSeries data type, being a native data type, provides interoperability with database features as well as other products with no extra tweaking.
- Faster application development: Developers can rely on existing built-in SQL routines, C APIs, and Java APIs for accurate and faster development.
Use-Cases with Cloud Pak for Data (CP4D):
- DevOps/server/container monitoring. The system typically collects metrics about different servers or containers: CPU usage, free/used memory, network tx/rx, disk IOPS, and so on. Each set of metrics is associated with a timestamp, a unique server name/ID, and a set of tags that describe an attribute of what is being collected.
- IoT sensor data. Each IoT device may report multiple sensor readings for each time period. For environmental and air quality monitoring, for example, this could include temperature, humidity, barometric pressure, sound levels, and measurements of nitrogen dioxide, carbon monoxide, and particulate matter. Each set of readings is associated with a timestamp and a unique device ID and may contain other metadata.
- Financial data. Financial tick data may include streams with a timestamp, the name of the security, and its current price and/or price change. Another type of financial data is payment transactions, which would include a unique account ID, timestamp, and transaction amount, as well as any other metadata. (Note that this differs from OLTP data: here we record every transaction, whereas an OLTP system reflects only the current state of the system.)
- Fleet/asset management. Data may include a vehicle/asset ID, a timestamp, GPS coordinates at that timestamp, and any metadata.
Internet of Things (IoT) use cases include:
- Connected appliances (including cars)
- Smart home security systems
- Autonomous farming equipment
- Wearable health monitors
- Smart factory equipment
- Wireless inventory trackers
- Ultra-high speed wireless internet
- Biometric cybersecurity scanners
- Shipping container and logistics tracking
And more.
IoT use cases typically involve collecting data from sensors or devices in a scheduled way, possibly involving the establishment of a small database on the gateway device networked back to some corporate data server. There could be a few devices or several tens of thousands of devices and sensors, with or without gateways. Embedding a database in a device means a very small initial footprint: a hundred megabytes or so of storage and a small amount of memory. It must be self-administering and able to replicate its data back to the corporate server. The data on the device should be encrypted at rest, and so should its network connections, for obvious security reasons. Provisioning of the devices must be easy and programmatic, with devices shipped to locations ready to run either on bare metal or containerized via Docker or Kubernetes. Informix supports both approaches.
With the Informix Warehouse Accelerator feature, all the data can be immediately digested with analytics on top of in-memory persistent storage for nanosecond, speed-of-thought analysis, either on-premises with Informix Advanced Enterprise Edition or containerized via the Informix Cloud Pak for Data Advanced Enterprise Edition. All of this can be done on-premises, on-premises containerized, in the cloud, or in the cloud containerized.
We can utilize the unique capabilities of the Informix TimeSeries solution to store and analyze device-generated data more efficiently, which makes the solution very quick on inserts, updates, deletes, and selects compared to relational solutions. TimeSeries also has many analytical functions to aid in processing time series data. This method has been competitively benchmarked as 5 times faster than the competition at processing and generating billing records for 3 million electrical meters, a solution now in use by a Texas-based electricity utility. TimeSeries has also been used by a Wall Street brokerage house to process trading data.
Informix provides the most efficient solution for the Internet of Things. It can be implemented standalone or supplemented with Cloud Pak for Data. The result is the most comprehensive and cost-effective solution on the market.
Informix brings many value propositions to Cloud Pak for Data as a database service. In addition, modernizing existing Informix deployments by porting them to Cloud Pak for Data presents many new possibilities for organizations to get more value out of their data.
A good initial incentive to modernize an Informix deployment is cost. Moving Informix databases to a containerized deployment on Cloud Pak for Data provides efficiencies that result in the reduction of infrastructure and storage needs. This containerized architecture also provides operational efficiencies, where an Informix deployment can be elastically scaled in or out on an on-demand basis.
For existing consumers of data in Informix systems, Cloud Pak for Data enables tight integration of Informix with IBM’s data fabric technologies, which enables automated data governance and self-service consumption.
The four main open-source data offerings IBM supports for use within Cloud Pak for Data are Cloudera, EDB PostgreSQL, MongoDB, and DataStax Cassandra.
- Cloudera is known for its role in pioneering Big Data analytics and still has the top distribution in this space.
- EDB Postgres is the market leader for an enterprise distribution of PostgreSQL. EDB makes Postgres easier to run everywhere, providing high availability and Oracle compatibility.
- MongoDB is the market leader for NoSQL document databases. MongoDB is known as an easy entry point for developers to store document data.
- DataStax is the market leader for NoSQL columnar databases. It is highly scalable and, in a distributed mode, can be configured for zero downtime.
For customers needing open-source software solutions, IBM offers several benefits.
IBM has a distinct market advantage as an established industry leader for enterprise open-source hybrid-cloud procurement, services, and support needs.
IBM is the only provider that offers access to best-in-class open-source offerings with significant market penetration and rapid growth.
IBM provides a single source of support, allowing businesses to easily and quickly access high-quality support for technology and integration challenges, all while having access to best-in-class knowledge catalogs across various technologies.
IBM provides customers with multiple deployment options, whether it’s on-premises/private cloud, or public cloud with enterprise scale, governance and resiliency.
AI developers depend on IBM and open-source technologies to design, build, and deploy their models. An enterprise can now leverage all of its intellectual property and improve its insights, keeping it one step ahead of the competition.
Since 2018, IBM and Cloudera have had a strategic partnership in which Cloudera Data Platform integrates with Cloud Pak for Data to provide users with a best-of-breed analytics platform for big data.
Cloudera Data Platform (CDP) is a next-generation, hybrid data platform with cloud-native, self-service analytic experiences. On the data lake side, Cloudera Data Platform provides data hub clusters, storage, a streaming data service, as well as data governance for an organization’s big data.
Cloud Pak for Data provides a data fabric layer to complement the large-scale data processing and data storage capabilities on CDP, including Watson Knowledge Catalog and Data Virtualization. Also included is Watson Studio, which provides data science tooling, where the data processing is pushed down to the CDP cluster.
Cloud Pak for Data runs alongside CDP, as both sets of infrastructure run in containers on Red Hat OpenShift.
When Hadoop and the Big Data movement started making headlines in the IT industry just over a decade ago, many people were excited about the potential of a lower cost platform for storing and processing large volumes of data. The use case that proved to deliver the most value is still highly relevant today: offloading staging data and cold data from more expensive data warehouse systems to a data lake. Many of the data lakes from the past decade are themselves due for modernization, and there is an easy upgrade path to the Cloudera Data Platform (CDP) and its containerized architecture.
Combined with Cloud Pak for Data, data lakes on the CDP can execute the data processing from DataStage as it performs ETL on the staging data. And IBM’s Data Virtualization service can orchestrate queries that need to poll both current hot data from the data warehouse and the older cold data now in the data lake.
CDP includes Cloudera DataFlow (CDF), which is a real-time streaming data platform for the collection, curation, and analysis of data. Analysis of real-time data, while pulling in insights from existing data sets can deliver significant value in the form of real-time dashboards and use cases like predictive maintenance, asset tracking and monitoring, patient monitoring, and many more.
EDB PostgreSQL offers an open-source SQL relational database solution that is built for enterprise-scale data needs. It supports a variety of data types including JSON documents, geospatial data, key-value pairs, and traditional relational tables. It is a great fit for a wide range of customer use cases, from online transaction processing (OLTP) and reporting to analytics, web and mobile applications, and new business processes.
Where EDB PostgreSQL stands out as an open-source relational database solution is in its support of relational database management system standards. It has a robust and flexible SQL query engine, can support high transaction volumes, supports ACID (Atomic, Consistent, Isolated, Durable) transactions, and provides other enterprise-critical capabilities, for example referential integrity, server-side programs (like stored procedures), and fine-grained access controls.
All these advanced capabilities position EDB PostgreSQL as an inexpensive open-source alternative to expensive Oracle or Teradata installations.
EDB PostgreSQL is widely recognized as an open-source database that does not compromise on enterprise-ready capabilities. Meanwhile, it is less expensive than most commercial relational database management offerings, while offering comparable performance and capability. In an era where rising data volumes are a significant concern for IT departments, the lower cost of EDB PostgreSQL makes it practical to set up data warehouses for the purpose of new analytics projects. The flexible architecture of PostgreSQL also enables it to be deployed for multiple purposes, ranging from operational storage to analytical processing.
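Those transactional guarantees are easy to demonstrate. Here is a minimal sketch of an ACID transfer against a PostgreSQL-compatible instance using psycopg2; the connection details and accounts table are illustrative.

import psycopg2

conn = psycopg2.connect(host="edb-host.example.com", port=5432,
                        dbname="appdb", user="app", password="secret")
try:
    # The connection context manager commits on success and rolls back on
    # error, so both updates apply atomically or not at all.
    with conn:
        with conn.cursor() as cur:
            cur.execute("UPDATE accounts SET balance = balance - 100 "
                        "WHERE id = %s", (1,))
            cur.execute("UPDATE accounts SET balance = balance + 100 "
                        "WHERE id = %s", (2,))
finally:
    conn.close()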
Cloud Pak for Data and EDB PostgreSQL are a great fit together. Wherever EDB PostgreSQL is used as a data warehouse, analysts can use the data fabric capabilities in Cloud Pak for Data to easily find, understand, and start analyzing the data.
MongoDB is widely recognized as the industry’s leading document database. MongoDB Enterprise Advanced is fully featured with enterprise capabilities, is high-performing and can scale. For many customers, whose applications need a document-based data model, and where there is a software deployment strategy oriented towards open-source tools, MongoDB is a common choice.
To help customers manage and gain insights from many data sources or a hybrid, multi-cloud environment, IBM has integrated Cloud Pak for Data with MongoDB. This enables users to explore and use JSON data with MongoDB’s powerful tools, natively on Cloud Pak for Data. These capabilities help users better cope with rapidly changing data models or schemas and support modern-day web and mobile apps. When customers provision MongoDB Enterprise Advanced, they have the option to integrate the database automatically with the governance capabilities of the Cloud Pak for Data platform, allowing analytics users access to self-service data stored on MongoDB.
MongoDB is the most popular choice for developers as a data store for their web and mobile apps. By integrating MongoDB and Cloud Pak for Data, an organization can take a far more data-driven approach to how its apps are being consumed. MongoDB is also a highly flexible data store that can hold a wide variety of data formats and support a variety of data storage architectures, like systems of record and engagement and edge computing.
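The flexibility of the document model is easy to see in code. Here is a minimal pymongo sketch in which documents in the same collection carry different shapes; the connection string and collection names are illustrative.

from pymongo import MongoClient

client = MongoClient("mongodb://mongo-host.example.com:27017/")
events = client["appdb"]["events"]

# Documents in one collection need not share a rigid schema.
events.insert_one({"user": "u1", "action": "login", "device": "mobile"})
events.insert_one({"user": "u2", "action": "purchase",
                   "items": [{"sku": "A1", "qty": 2}]})

# Query by any field; nested structures come back as plain dicts/lists.
for doc in events.find({"action": "purchase"}):
    print(doc["user"], doc.get("items"))

client.close()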
DataStax is the Cassandra company.
Cassandra is an open-source NoSQL database project with many unique features:
- Best-in-class support for replicating across multiple datacenters, providing lower latency for users and the ability to survive regional outages. Failed nodes can be replaced with no downtime.
- At its core, Cassandra is a wide-column store, which means that it uses tables, rows, and columns like a relational database, but unlike a relational database, the individual columns can differ from row to row in the same table.
- It has a simple SQL-like language called CQL, which gives developers a very familiar query experience (see the sketch after this list).
- Storage-Attached Indexing (SAI) allows flexible queries on columns other than the primary key.
- Multi-datacenter replication allows for flexible deployments, letting a database be replicated in real time in hybrid or multi-cloud architectures.
- Zero lock-in to any cloud vendor or platform. Moving from on-premises to cloud, or from one cloud to another, is seamless, requires no downtime, and can be done in hours.
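As promised above, here is a minimal sketch of CQL through the DataStax Python driver; the contact point, keyspace, and table are illustrative.

from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-host.example.com"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings (
        sensor_id text, ts timestamp, value double,
        PRIMARY KEY (sensor_id, ts))
""")

# CQL reads like SQL: insert a row, then query by partition key.
session.execute(
    "INSERT INTO demo.readings (sensor_id, ts, value) "
    "VALUES (%s, toTimestamp(now()), %s)", ("s1", 21.5))

for row in session.execute(
        "SELECT ts, value FROM demo.readings WHERE sensor_id = %s", ("s1",)):
    print(row.ts, row.value)

cluster.shutdown()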
DataStax builds on everything that Apache Cassandra does by default. It is a pure enterprise platform with enterprise security, advanced performance, and monitoring. It supports advanced workloads like Search, Analytics, and Graph, and integrates with Apache Spark and Kafka.
Cloud Pak for Data has a wealth of capabilities across the rungs of the AI Ladder, where organizations can find, understand, and analyze their data. One particular benefit of DataStax is that it can store large volumes of data quickly, making data available for querying very soon after the underlying events are recorded. This quick-storage ability makes DataStax well suited to use cases involving real-time data, where an immediate understanding of the state of the business is needed.
Such use cases share a common theme: the need for real-time data.
To conclude, IBM has a few incredible ways to continue the conversation and provide immediate value with this opportunity. IBM can deliver a brief Business Value Assessment (BVA) that will provide a custom report on the value and cost savings of Cloud Pak for Data. This will provide an idea of the cost of not acting and recommendations depending on where an enterprise falls on the AI Ladder.
You can access the Cloud Pak for Data BVA at the URL ibm.biz/icpdbva
IBM also offers a free trial of Cloud Pak for Data as a Service, with access to all Lite plan services and many new services from the catalog. There is no commitment when test-driving Cloud Pak for Data as a Service. IBM encourages clients to see for themselves the power of this product, perhaps with a small pilot project within a particular line of business.
Using the links in the references below, clients can quickly get started today with Cloud Pak for Data as a Service.
CTA:
Let me know how I can help with your Cloud Pak for Data Implementation within your company.
References:
1. IDC article: IDC's Global DataSphere Forecast Shows Continued Steady Growth in the Creation and Consumption of Data – Source: https://www.idc.com/getdoc.jsp?containerId=prUS46286020
2. Forrester New Tech Total Economic Impact: https://www.ibm.com/downloads/cas/V5GNQKGE
3. To view all currently available services, visit IBM’s online documentation: https://dataplatform.cloud.ibm.com/docs?context=cpdaas
4. To view the latest services and what’s new in Cloud Pak for Data as a Service, view the “What’s New” section of the online documentation: https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/whats-new.html?context=cpdaas&audience=wdp