Reliability Engineering Applied to Data Management Systems
by Ines Paternina


Abstract


Data Reliability Engineering (DRE) is increasingly pivotal in the data management landscape, addressing the mounting challenges posed by evolving technology, regulations, and data trust. As data's significance burgeons in steering strategic decisions across sectors, the need for dependable, resilient data becomes imperative.

This paper navigates through the integration of reliability engineering concepts within Data Management Systems, illuminating the path from understanding data movement to the implementation of proactive reliability measures. Employing a systemic approach, it delves into the classifications of reliability, the design prerequisites, and the deployment of robust controls across various stages—sources, inputs, processes, and outcomes.

The role of Data Reliability Engineers is examined in various industries, such as healthcare and cloud computing, addressing challenges related to data integrity, security, and timely delivery. Through critical insights and tools encompassing system design, proactive monitoring, fault tolerance, incident response, and data contracts, this research unravels the intersection of reliability engineering principles and data management, culminating in fortified data strategies. It also discusses the integration of reliability tools and the application of redundancy and recoverability policies as part of a Data Governance strategy.

Furthermore, the research ventures into the application of reliability metrics, including mean time to failure and mean time to repair, as a conduit to convey operational conditions and pinpoint system vulnerabilities. In conclusion, it deliberates upon emerging trends steering the application of Reliability Engineering in Data Management, acknowledging budget constraints and exploring the potential evolution of Data Management Systems into invaluable data products.


Keywords: Data Reliability Engineering, Data Management Systems, Reliability Engineering, System Requirements, Fault Tolerance, Redundancy, Recoverability, Data Governance, Reliability Metrics, Data Products, Technology Evolution.


Data Reliability Engineering (DRE)


In your organization, when was the last time you looked at a report and it appeared askew? When was the last time you were expecting a report that didn't arrive on time? Or perhaps you vividly recall a system crash causing ripples across your organization, or the last news story about sensitive data being hacked? Whether experienced as a customer, owner, or observer, these instances likely left indelible memories. Notably, they signify potential setbacks in terms of finances, time, security, or customer perception.


The rapid evolution of technology brings a continuous influx of new players, amplifying the complexities involved in managing companies, tools, software, coding languages, and data management systems. With each passing day, fresh regulations emerge, aiming to ensure transparency and safeguard data, exemplified by acts like the Sarbanes-Oxley Act of 2002 and regulations like the General Data Protection Regulation (GDPR) of 2018 in the European Union, among others. This surge in regulatory measures is a cause for concern among data consumers, given the pivotal role of data in driving strategic decisions and impacting lives. Notably, Momota & Morshed (2022) [1] underscored the criticality of dependable data in healthcare choices, highlighting the real-time constraints faced in evaluating the reliability of data sourced from wearable devices before using it in healthcare settings.


The perceived intangibility of data as it moves through different processing stages, coupled with the diminishing involvement of key players, raises valid questions about its trustworthiness. This issue of "data trustability" has become a paramount concern for data managers. As the data industry seeks solutions to these challenges, companies like Netflix and Google are spearheading the hiring of Data Reliability Engineers, signaling a recognition of the opportunities in addressing data reliability concerns.


Gartner (2021)[2] reinforces this trend, advocating for the integration of IT resilience roles (referring to the engineers implementing Reliability Engineering) in projects and planning for the upcoming years. According to Gartner, IT resilience is composed of Reliability (the degree to which systems can remain performant, secure, and meet service level objectives), Tolerability (the degree to which overt adverse consequences of IT hazards can be managed within levels that can be tolerated), and Recoverability (the degree to which systems and data can be confidently restored given the organization's stated risk appetite for known and unknown IT hazards). This paper aims to unravel how experts have applied "Reliability Engineering" principles to the entire spectrum of "Data Management Systems," utilizing a systemic approach.


Definitions and Methodology


Before going deeper, let's synchronize concepts. Let's use IBM's[3] definition of Data Management as "the practice of ingesting, processing, securing and storing an organization's data, where it is then utilized for strategic decision-making to improve business outcomes". This implies that Data Managers need to care for data quality, governance, compliance, cost management, access control, data stability, and input and output pipelines, along with queries, dashboards, on-time deliveries, automation, data discovery, data observability, intra-organization politics, and more. A Data Management System, then, is the hardware, software, network, query language, procedures, and strategy working together to provide business outcomes, following data processing, storage, governance, security, and retrieval requirements.


This paper delves into the realm of Data Reliability Engineering, exploring the application of reliability engineering concepts within data management systems. Therefore, let's clarify the most relevant concepts of reliability engineering as explained by Ebeling (2019)[4]:

- Availability as "the probability that a system or component is performing its required function at a given point in time or over a stated period when operated and maintained in a prescribed manner",

- Maintainability as "the probability that a failed system or component will be restored or repaired to a specified condition within a time when maintenance is performed following prescribed procedures" and

- Reliability as "the ability of a system or component to perform its required functions under stated conditions for a specified period".


Data managers have long cared about data quality, but the concept of "reliability" was never implemented as formally as it was in the engineering world. The movement started with Software and Site Reliability Engineering, and it is in these areas where we find the largest research volume. However, some key researchers have attempted to implement reliability concepts in different areas of Data Management, and their work is prioritized in this paper. Therefore, this paper explores how experts have been applying "Reliability Engineering" concepts to "Data Management Systems" as a whole, using a systemic approach.


According to Talend (n.d.)[5], Data Reliability is defined as the completeness and accuracy of data, serving as a crucial foundation for building trust across the organization. They propose evaluating the data's "Validity (format and storage), Completeness (all fields required exist), and Uniqueness (free of duplicates and test entries)" to assess possible reliability risks. They also say, "Reliable data, on the other hand, refers to data that can be a trusted basis for analysis and decision-making." Data Reliability is an emerging concept in the data world.


Consequently, to start talking about the reliability of a Data Management System, we need to begin by understanding and defining the system's required function and the performance requirements related to reliability, availability, and maintenance goals. This will allow us to better understand the right points of control and the tools to implement to measure reliability and availability, and to identify the maintenance needs and resource allocation requirements.


System Design and Requirements


The beginning of any system is its conceptualization, design, and the identification of its requirements, so when analyzing Data Management Systems, a 'top-down' approach will help to visualize the system as a whole, its intended function, and its desired outcomes, instead of concentrating on the technology and its components. These systems could be closed (like IBM mainframes used in utility companies, where the data analyzed is steady and the final function of the business is stable) or open systems, which are programmed to receive information and produce outputs with the intent of a business outcome.


This analysis should include an understanding of the business areas of interest, the type of data or business domain (Customer, Provider, Operations, Network, etc.), the critical data elements (business key performance metrics, numerators and denominators, dimensions, data segments, etc.), the inputs, outputs, and data-processing nodes, and the most frequent issues the organization deals with in each of these layers.


Understanding where the data products and data services align with the business operations and critical transactions will help us identify the availability requirements. These will consequently help us prioritize the production and maintenance pipeline and their respective resources. This analysis should provide both an understanding of the business services and/or products that drive the operational and financial transactions as well as capture the life cycle of the data products and services supported. Once that knowledge is gathered, then we could create correlation points to understand where they align and the critical path and baseline for our maintenance and reliability planning.


Olesen-Bagneux (2023)[6] proposed that we identify the data domains to start the design, following the path established by Evans (2003)[7] and Dehghani (2021)[8] but based on information science instead of software design, where Smiraglia (2014)[9] defined Domain as "a group of people who work together if they share knowledge, goals, methods of operations and communication, it can be a community of hobbyists, a scholarly discipline, an academic department and so on". This will help us identify the data types we are dealing with and the probable "intension" and "extension" of the data consumers.


Identifying the intension and extension can also guide us in identifying the data transactions: the processes a data set passes through within the system, from being an input until it is used as an output or as part of an output. Subsequently, it helps us know how many quality rules are needed and the probable risks introduced by extraction, transformation, and load processes. Olesen-Bagneux (2023) defines intension as "how deep a domain goes in terms of the level of expert knowledge" and extension as "the level of breadth in the domain". See Graphic 1 below.

Graphic 1. Sample of Domain, Subdomain, Intension, and Extension


Bauer & Adams (2012)[10] remind us that "The best reliability and availability requirements include quantitative targets for maximum acceptable service disruption latency, service availability, service reliability, latency, and related behaviors for the target solution." We can identify these quantitative targets from client and compliance contractual service level agreements, which is why it is so important to involve business, compliance, and client managers during the design of the system. Bauer & Adams (2012) classify the reliability requirements into service availability and reliability, disaster recovery, and elasticity requirements, and they invite us to think about these requirements regardless of whether we are referring to virtualized or physical hardware.


Once this information is collected, the Reliability Engineer will have a full picture of the system and its business purpose, requirements, and resources. This analysis will drive the direction for the data manager or reliability team regarding how and where to set control processes to confirm validity, completeness, and uniqueness, the risks of redundant processes or systems, and which tool to utilize. The final goal of the Data Management System is to help business users find the information they need when they need it. The Data Management team must be capable of setting the right controls to allow self-regulation and create the capability and capacity to adapt it to the continuous changes and demands of the business. This will then guide us to have a qualitative assessment of where the “reliability” controls should be added.


Understanding the data management system requires the comprehension that it is formed by multiple components that need to be orchestrated in synchrony (hardware, software, network, payloads, electrical power, humans, policies, internal and external environments, etc.). Each component has different potential risks and could have different reliability tools implemented at each stage of the process. Most of these issues or risks are caused by unplanned changes in source, input, processing capacity or speed, unoptimized code, or non-indexed objects. Therefore, the reliability of the Data Management System should be studied based on input, process, and output, per data domain.


Therefore, before continuing to present the results of this research, let's set a framework to better appreciate the big picture of the Data Management System and organize the data reliability research found. In this report we are going to use the SIPOC (Supplier, Input, Process, Output, Customer) diagram, with a slight twist to adapt it better to Data Management Systems: the S will apply to Suppliers and Sources, the O will be for Outputs and Outcomes, and I'd like to use the word Consumers (instead of Customers). See below an example of using a value map, SIPOC, and failure modes to sketch the most relevant inputs and processes of a data management system:


Graphic 2. Sample of how data flows through the system


Here are some of the tools that could be used in each area of the diagram; the most common and innovative ones are presented throughout this research (their position in the diagram below does not restrict where they can be used, but it does reflect where this research finds their best application):


Table 1. DRE Tools through SIPOC



Reliability of Data Sources and Inputs


The first step to implementing a good Data Reliability strategy for Data Management Systems is to understand the business requirements and the contractual, regulatory, and technical system constraints, and incorporate them into the data management system design. Data Reliability Engineers then set up proactive monitoring of the behavior of the sources and system components and of the input or service received from them. Site and Software Reliability Engineering (SRE) have led the pack in generating system health monitoring tools during the load process at the data center level. In fact, Mikey Dickerson of Google's SRE organization declared monitoring the base of the SRE activities, where Reliability Engineers should invest most of their time.

- Belok (2022)[11] said an SRE team cannot perform their job and take on responsibilities if they cannot map the business requirements to technical metrics. Some service level and reliability metrics are SLI (number of completed requests for a pipeline), SLO (based on the success expected by the user against its experience), SLA (contractual agreement with the user), MTTR (mean time to resolution), MTTF (mean time to failure), and MTBF (mean time between failures). A minimal sketch computing several of these metrics follows this list.

- Most data inputs and outputs in data management systems are automated and operate with little to no supervision; otherwise, they would be costly. For these unsupervised settings, I'd recommend the research by Momota & Morshed (2022)[12], where they proposed the "Data Reliability Metric" (DReM). They explain this measure as a value between 0.0 and 1.0 that indicates the reliability of data collected from unsupervised settings: a lower DReM value represents less reliable data, whereas a higher value represents more reliable data.
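As a minimal illustration of how these service level and reliability metrics can be derived in practice, the Python sketch below computes MTTR, MTTF, MTBF, and availability from a hypothetical incident log; the timestamps, the 90-day observation window, and the field layout are assumptions made only for the example.

from datetime import datetime

# Hypothetical incident log for one data pipeline: (failure detected, service restored)
incidents = [
    (datetime(2024, 1, 3, 2, 0), datetime(2024, 1, 3, 3, 30)),
    (datetime(2024, 2, 10, 8, 0), datetime(2024, 2, 10, 8, 45)),
    (datetime(2024, 3, 22, 1, 0), datetime(2024, 3, 22, 5, 0)),
]
observation_hours = 24 * 90  # assumed 90-day observation window

repair_hours = [(restored - failed).total_seconds() / 3600 for failed, restored in incidents]
downtime = sum(repair_hours)
uptime = observation_hours - downtime

mttr = downtime / len(incidents)           # mean time to repair/resolution
mttf = uptime / len(incidents)             # mean operating time per failure
mtbf = mttf + mttr                         # mean time between failures
availability = uptime / observation_hours  # share of the window the pipeline was usable

print(f"MTTR={mttr:.2f}h MTTF={mttf:.2f}h MTBF={mtbf:.2f}h availability={availability:.4%}")

In a real Data Management System these incident records would come from the monitoring or incident-response tooling rather than being hard-coded.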


However, the complexity of a Data Management System increases with the number of interchange nodes, and generating these metrics for all of them is not feasible; therefore, it is important to classify and prioritize them according to impact and required outcomes. The second step, then, is to document the data movement from source to destination and the nodes through which it passes. Following the path of the data, and the reliability research applicable to achieving the observability goal, the most relevant research work found was:

- DREs or Data Managers need to be able to communicate the conditions of operation and the reliability of the different nodes. These nodes receive data, process it (clean, transform, or standardize it), and output data to a subsequent process. The higher the volume of nodes we identify, the higher the possibility of failure within the system components and processes based on those conditions.

- Olesen-Bagneux (2023) reminded us that these data sources should be registered in the data catalog, aligned to their domain, and, whenever possible, documented in alignment with the expected output or outcome. Therefore, the domain and organization topology will be very important to identify each of these per data asset and data pipeline.

- Huang, Huang, & Lin (2022)[13] studied the reliability of a cloud-based network. They explained system reliability in the context of traditional stochastic flow computer networks as "the probability that the required flow (i.e., demand) can successfully be sent from one source to one sink", whereas for cloud-based networks they defined system reliability as "the probability that the demand and the processed demand can be satisfied under the edge server capacity and budget constraints." In cloud computing, edge refers to the computing infrastructure located closer to the end users or devices, rather than in a centralized data center. Their study includes transmission costs and processing costs, a very important topic for today's Data Managers, and proposes an algorithm in terms of minimal paths to find all lower system-state vectors (LSV) for calculating system reliability. See Graphic 3 below, where the minimal path has been highlighted, and the path-enumeration sketch that follows it.


Graphic 3. Minimal Path identified in example.
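To make the minimal-path idea concrete, the toy Python sketch below enumerates the simple source-to-sink paths of a small network and estimates system reliability by Monte Carlo simulation, assuming independent edge availabilities. The topology and probabilities are invented for illustration, and this is not the lower system-state vector algorithm proposed by the authors.

import random

# Toy network: edge -> probability that the link is up (assumed values)
edges = {
    ("source", "edge_server_1"): 0.99,
    ("source", "edge_server_2"): 0.97,
    ("edge_server_1", "sink"): 0.98,
    ("edge_server_2", "sink"): 0.95,
}

def simple_paths(node, sink, visited=()):
    # Depth-first enumeration of simple (minimal) paths from node to sink
    if node == sink:
        yield visited + (node,)
        return
    for (a, b) in edges:
        if a == node and b not in visited:
            yield from simple_paths(b, sink, visited + (node,))

paths = list(simple_paths("source", "sink"))
print("Minimal paths:", paths)

def system_up(state):
    # The system works if at least one path has all of its edges up
    return any(all(state[(p[i], p[i + 1])] for i in range(len(p) - 1)) for p in paths)

random.seed(42)
trials = 100_000
hits = sum(system_up({e: random.random() < pr for e, pr in edges.items()}) for _ in range(trials))
print(f"Estimated system reliability: {hits / trials:.4f}")

Beyond toy sizes the number of paths and state vectors grows quickly, which is why dedicated algorithms such as the one the authors propose are needed for realistic networks.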



- Similarly, Shi, Shi, Ying, & Yan (2023)[14] presented blockchains as "a promising technology to drive business processes transparency and traceability, providing consensus and agreement between business partners to reduce information asymmetry and uncertainty along business processes." They can help to ensure data origin integrity, which refers to the accuracy and reliability of data from the point of creation to the point of usage on the blockchain[15]. However, blockchain does not assure data quality in business processes, nor does it solve human errors; its implementation should be combined with a review of the business process and data-cleaning initiatives when needed. In Graphic 4 below we can see three blocks of data for the ledger, followed by a minimal hash-chain sketch:


Graphic 4. Blockchain applied to example
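The minimal Python sketch below illustrates the data origin integrity property described above with a simple hash chain: each block records the hash of the previous block, so any later alteration of recorded data breaks verification. It is a deliberately simplified illustration of the concept, not a consortium blockchain implementation, and the payload fields are invented.

import hashlib, json

def block_hash(block):
    # Deterministic SHA-256 hash of a block's contents
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain, payload):
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "prev_hash": prev, "payload": payload})

def chain_is_valid(chain):
    # Verify that every block still points to the unaltered previous block
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1]) for i in range(1, len(chain)))

ledger = []
append_block(ledger, {"member id": 1, "event": "claim_submitted"})
append_block(ledger, {"member id": 1, "event": "claim_adjudicated"})
append_block(ledger, {"member id": 1, "event": "claim_paid"})

print(chain_is_valid(ledger))                    # True
ledger[1]["payload"]["event"] = "claim_denied"   # tamper with recorded data
print(chain_is_valid(ledger))                    # False: origin integrity check fails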


The third step, once the data flow path is identified, is for the DRE to identify the critical points of failure. This requires understanding and documenting how data could fail at each stage of the process and how to react to these failures through an incident matrix. The Data Management System needs to promote and guarantee data observability by documenting the flow of the data from source to final user interface, and it requires operational awareness to understand the impact on operations and consumer perception. See the examples in the graphics above.

- Analyzing the system's technical constraints will light the path to identify whether load balancing will be needed to improve performance, reliability, and availability. Some load balancing strategies are implemented through schedulers, multiple servers or virtual machines, and code optimization (the least expensive option and normally the quickest to provide gains).

- The most outstanding application of reliability tools and concepts found so far was in the book Reliability and Availability of Cloud Computing by Bauer, E., & Adams, R. (2012). They discuss the application of load balancing to distribute traffic through multiple instances, ensuring service remains available even if one instance fails. In addition, they propose redundancy, data sharding, and geo-redundancy for critical services and data elements, reminding us to "design for failure" so that we are prepared most of the time.

- DREs must set up an Incident Response Matrix with clear actions and responsibilities. It is recommended to apply Failure Mode and Effects Analysis (FMEA) to the reliability of the pipeline as a preventive measure before setting the data strategy planning, to ensure the allotment of time for preventive actions and their priority (see the scoring sketch below).
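As a sketch of how FMEA can feed that prioritization, the Python snippet below ranks hypothetical pipeline failure modes by the classic risk priority number (severity x occurrence x detection, each scored 1-10); the failure modes and scores are assumptions chosen only to show the mechanics.

# Hypothetical failure modes for a data pipeline, scored 1-10 (assumed values)
failure_modes = [
    {"mode": "source file arrives late",      "severity": 7, "occurrence": 6, "detection": 3},
    {"mode": "schema change breaks ingestion", "severity": 9, "occurrence": 3, "detection": 5},
    {"mode": "duplicate records loaded",       "severity": 5, "occurrence": 4, "detection": 2},
]

for fm in failure_modes:
    fm["rpn"] = fm["severity"] * fm["occurrence"] * fm["detection"]  # risk priority number

for fm in sorted(failure_modes, key=lambda f: f["rpn"], reverse=True):
    print(f'{fm["rpn"]:>4}  {fm["mode"]}')

The highest-RPN failure modes become the natural candidates for preventive controls and for allotted remediation time in the data strategy planning.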


The next step is to align all these findings with the Data Strategy, allowing capacity to set automated or manual controls where possible. One of the control measures found during this research was proposed by Jones (2022): Data Contracts, which control input sources to minimize data degradation caused during ingestion, or stale-data scenarios caused by extraction jobs running when data wasn't available for pickup, arrived late (straggler data sets or tailing behavior), or was incomplete or corrupted when picked up for loading.

- Jones (2022) defines Data Contracts as an "agreed interface between the generators of the data and its consumers. It sets the expectations around that data, defines how it should be governed and facilitates the explicit generation of quality data that meets the business requirements."

- A Data Contract[16] should be defined to enforce the schema and meaning of the data being produced by a service so that it can be reliably leveraged and understood by data consumers.

- Here is an example of a data contract that defines an entity in the system with three properties: member id, member name, and state. It also specifies the minimum uptime required and the governance policy to which it belongs:

name: data_contract_example
version: 2.0
description: Example of data contract
schema:
  - name: member id
    type: integer
    description: Unique identifier
  - name: member name
    type: string
    description: Name of the entity
  - name: state
    type: integer
    description: State of residence
semantics:
  - name: entity
    description: An entity in the system
    properties:
      - member id
      - member name
      - state
sla:
  - name: uptime
    description: Minimum uptime required
    value: 99.9%
governance:
  - name: exception_control
    description: Exception Control Policy
    value: role-based
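To show how a contract like the one above could be enforced at ingestion time, here is a minimal Python sketch that validates an incoming record against the schema section; the type mapping and the sample records are assumptions for illustration, not a specific contract-enforcement product.

# Schema portion of the contract above, expressed as a simple type mapping (assumed)
contract_schema = {
    "member id": int,
    "member name": str,
    "state": int,
}

def validate(record, schema):
    # Return a list of contract violations for one incoming record
    violations = []
    for field, expected_type in schema.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return violations

good = {"member id": 42, "member name": "Ada", "state": 12}
bad = {"member id": "42", "member name": "Ada"}

print(validate(good, contract_schema))  # []
print(validate(bad, contract_schema))   # missing-field and wrong-type violations

In practice, checks like these would run inside the ingestion pipeline or a schema registry, rejecting or quarantining records that violate the contract before they degrade downstream data.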


Reliability of Data Processes and Outcomes


As we pass from input to processing, we need to talk about the load into Enterprise Data Management (EDM) systems. Garraghan et al. (2016)[17] demonstrate that 5% of task stragglers (processes that take abnormally long to complete, causing longer processing times than expected or resulting in timeout failures) impact more than half of the jobs in a data center. They become high-impact issues when they affect the service level agreements of subsequent processes. Therefore, detecting them, mitigating their impact, and identifying their root causes are some of the most time-consuming activities in Data Management.
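One simple, generic way to surface stragglers is to flag task runs whose duration is far above the typical behavior of the same job. The Python sketch below uses a median-plus-MAD threshold on hypothetical durations; the multiplier and the numbers are assumptions, and this is not the root-cause analysis method of Garraghan et al.

import statistics

# Hypothetical task durations (minutes) for one job across recent runs
durations = [12, 14, 13, 15, 12, 13, 14, 58, 13, 12, 61, 14]

median = statistics.median(durations)
mad = statistics.median(abs(d - median) for d in durations)  # median absolute deviation
threshold = median + 5 * mad  # multiplier is an assumption; tune per workload

stragglers = [d for d in durations if d > threshold]
print(f"median={median} threshold={threshold} stragglers={stragglers}")

Flagged runs can then be retried, rescheduled, or investigated for root causes before they breach the service level agreements of subsequent processes.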


Perhaps this idea of "design for failure" is what Madsen (2014) tried to tell us when she proposed RISE (Reduce the unknowns, Identify alternatives, Streamline the standards, and Evaluate the activities), based on W. Edwards Deming's PDCA cycle. Reliability Engineering provides us with the tools to reduce the unknowns and to design our Data Management System knowing that some of its components or processes will eventually fail. Besides handling input data, a Data Management System also needs to organize and store the data, guaranteeing availability, recovery, and the ability to search for and find the data. If we know to expect these failure types in our Data Management Systems, it makes sense to implement FMEA, especially on the critical nodes or the minimal path of the critical measures and critical supported reports.


Data management systems must be able to execute kill-restart actions, perform system upgrades or updates to release new data products, migrate servers seamlessly, upload new data sets, and recover to a prior valid point when data becomes corrupted. Barroso, Hölzle, & Ranganathan (2019) used four categories to classify service-level failures, which apply to Data Management Systems with a slight adaptation:

- Corrupted (committed data is impossible to regenerate; it is lost or corrupted),

- Unreachable (users cannot access the data; this happens when the server is down, object names were changed, or access was removed),

- Degraded (service is available but in some degraded mode),

- Masked (faults exist but are not perceived by users because fault-tolerant software, hardware, or code handles them).

Masking takes us to "coding", another area of Data Management, in the data transformation and manipulation stage. Much can go wrong here: layer upon layer of old code modified multiple times by different developers, original design requirements that changed over time with little to no documentation of their original intent, or poor translation from one language to another.


Another common "code" issue impacting data reliability is what software reliability calls "code cloning"; in Data Management we call them "duplicated data assets". This is normally a non-intentional redundancy (you may have experienced this when saving backup copies of your last presentation and ending up with too many copies). It can happen because initial requirements changed or because of a lack of awareness of the existence of the original data asset. On on-prem SQL servers this caused wasted space, so sooner or later it would be found during a space-optimization audit.


In cloud computing, however, space is not something we care about as much, as it is not directly associated with cost, so the duplication probably won't be caught unless the data management teams are notified by the front-end analysts or the final users. This can generate situations where two users work on two separate reports using two different sources and produce different results if the two sources are not synchronized in timeliness, volume, and dimensions.


Therefore, we need to design our Data Management System and its supporting architecture and governance to withstand failure, and we need to ensure we have clear redundancy and recoverability policies established when generating development, testing, quality/UAT (user acceptance testing), and production environments. Google engineers Barroso, Hölzle, & Ranganathan (2019)[18] explained that fault-tolerant software is more complex than fault-free software. A data management system, particularly one relying on external sources for status, readiness, and changes (as is the case for healthcare data management systems), must be designed as a fault-tolerant system.
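A small, generic pattern for this kind of fault tolerance around an unreliable external source is to retry with backoff and, when retries are exhausted, fall back to the last known-good snapshot so the fault is masked as a degraded-but-available state. The Python sketch below illustrates the idea; the function names, retry policy, and sample data are assumptions, not a design prescribed by the cited authors.

import time

def with_retry_and_fallback(fetch, fallback, attempts=3, backoff_seconds=0.5):
    # Call fetch(); on repeated failure, mask the fault by returning fallback()
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:
            print(f"attempt {attempt} failed: {exc}")
            if attempt < attempts:
                time.sleep(backoff_seconds * attempt)  # simple linear backoff
    return fallback()  # degraded but available: serve the last known-good snapshot

# Hypothetical usage
def fetch_member_extract():
    raise ConnectionError("source system unreachable")  # simulate an external-source fault

def last_known_good_snapshot():
    return [{"member id": 42, "state": 12}]  # previously validated copy

data = with_retry_and_fallback(fetch_member_extract, last_known_good_snapshot)
print(data)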


After implementing Reliability Engineering metrics, Data Managers will be able to tell which are the weakest parts of their system, or those that take longest to remediate when failing, because, as Peter Drucker said, "you cannot improve what you cannot measure". Jones (2022) reminds us of the three most useful measures of data performance and dependability: completeness (does it contain all required data elements), timeliness (when it was last updated), and availability (reachability and usability). However, each of these could vary for the same data asset depending on the intension and extension, which takes us back to the clarification of requirements and the reliability of the inputs discussed above.
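A minimal sketch of how those three measures could be computed for a single data asset is shown below; the sample rows, the required columns, and the 24-hour freshness window are assumptions, and a real implementation would read these from the catalog and the pipeline metadata.

from datetime import datetime, timedelta

# Hypothetical snapshot of one data asset
rows = [
    {"member id": 1, "member name": "Ada", "state": 12, "loaded_at": datetime(2024, 1, 5, 6, 0)},
    {"member id": 2, "member name": None, "state": 8, "loaded_at": datetime(2024, 1, 5, 6, 0)},
]
required = ["member id", "member name", "state"]
now = datetime(2024, 1, 5, 9, 0)

# Completeness: share of required cells that are populated
cells = [row.get(col) for row in rows for col in required]
completeness = sum(v is not None for v in cells) / len(cells)

# Timeliness: was the asset refreshed within the agreed window (assumed 24 hours)?
last_update = max(row["loaded_at"] for row in rows)
timely = (now - last_update) <= timedelta(hours=24)

# Availability: can the asset actually be reached and read (here, simply non-empty)?
available = len(rows) > 0

print(f"completeness={completeness:.0%} timely={timely} available={available}")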


Most importantly, we need to make sure that all the analysis done through the implementation of Data Reliability in the Data Management System feeds back into the Data Management and Data Governance strategy, and that resources are allocated according to the defined Reliability, Availability, and Maintainability goals, at least through the minimal and priority path of the critical data elements. This will require identifying all the possible minimal paths to success and possible failure modes, their mean time to failure, their mean time to repair, the mean time between failures, the rate of occurrence, and other reliability metrics.


Cost and Gains


The continuous growth of the complexity of data flows is not stopping. There is no stopping the technology either. With this growth, liability risks and opportunities will also grow. Prioritization and optimization will be the key to leading our data strategy, including data management and data reliability.


As Data Strategists, we have to lean into the opportunities, showing the business and the products we support how accurate and timely data can improve decision-making, driving increased market penetration, revenue, cost reduction, and more.


It is our responsibility to build our teams and processes with the governance and data quality controls needed to guarantee transparency and increase trust. Building trust is the only way we can improve and maintain a healthy customer relationship with our data assets. It is our responsibility as well to prove and demonstrate the value of implementing a sound data strategy and the return on the investment.


Investing resources in the implementation of automated monitoring of the system health and an alert strategy will be more productive than investing double or triple those resources in cleaning up, correcting, and reconciling data. Following Pareto's rule, automating quality controls on 20% of the data flows we have will drive 80% of the business gain.


Therefore, identifying the MVP services, products, processes, and data sets needed to manage them is of utmost importance for this road map to achieve the desired goal. Equally important is a good relationship with the business leaders and product experts, to collect the information necessary to determine which ones these MVPs are, to produce a value-based analysis and prioritization of data queues, and to identify the service level required.


Consequently, implementing Data Reliability as part of our Data Strategy will help us promote the reputation of our data infrastructure, reporting, analytics, and business intelligence teams. We have to think of it as a seal on our brand, like a certification seal, not only on paper but proven with, and driven by, metrics and statistics.


All of this should then drive a reduction in the financial impact of errors, rework, and missed opportunities. It will also provide a framework to support compliance and security regulations. Moreover, a healthy data environment should generate a healthy work environment, with empowered personnel, and promote ownership without blame, following a 'fail forward, learn fast' model instead.


Applying Reliability Engineering to Data Strategy should help us unlock the true potential of the organizational data, and pave the way for successful decision-making, increased efficiency, and a competitive advantage.


Conclusion


Data Reliability Engineering (DRE) emerges as an essential linchpin in fortifying the integrity and trustworthiness of data within management systems. As highlighted by Google engineers Barroso, Clidaras, & Hölzle (2018)[19], the accelerating speed and complexity demand a delicate equilibrium between service quality, data integrity, and cost-effective solutions.

By intertwining reliability engineering concepts with Data Management Systems, this paper delineates a roadmap for leveraging reliability metrics and controls across the data lifecycle. Beginning with a comprehensive understanding of business needs and data domains, it advocates for a top-down approach to system design, emphasizing the importance of aligning reliability goals with business objectives. The meticulous scrutiny of data sources, inputs, processes, and outcomes underscores the necessity of proactive monitoring, incident response, and the implementation of reliability controls.

The integration of reliability measures into the DNA of Data Management Systems serves as a preemptive strategy, preparing organizations to navigate failures, mitigate risks, and ensure robustness in the face of evolving complexities. It aligns with the ethos of 'designing for failure,' heralding a proactive stance to anticipate, detect, and address vulnerabilities within the data ecosystem. Moreover, this strategic integration not only fosters data trust but also underpins compliance adherence, cost optimization, and operational efficiency.

Investing resources into DRE as a core tenet of Data Strategy isn't merely a trend; it's an imperative move toward unlocking the true potential of organizational data assets. Organizations can fortify decision-making, bolster efficiency, and gain a competitive edge by championing a 'fail forward, learn fast' culture and leveraging reliability engineering principles. Ultimately, the implementation of DRE within data strategies becomes a testament to an organization's commitment to data integrity, resilience, and strategic innovation. This path will help to reframe the Data Systems as contributors to business value. Viewing data systems and outputs as products prompts parallels with reliability practices seen in manufacturing and service industries, providing a framework for implementation.

For the continued growth of reliability engineering in Data Management, a dual focus on elevating awareness and enhancing capabilities is imperative. Any solution or model developed should maintain a generic, brand-agnostic approach, ensuring longevity and independence from specific products seeking promotion. In essence, fortifying reliability engineering's foothold in Data Management necessitates a strategic shift towards perceiving data systems as assets that not only sustain operations but also drive business outcomes, fostering an environment conducive to reliability-driven innovation and sustainable solutions.



References

[1] Momota, M. R., & Morshed, B. I. (2022). ML algorithms to estimate data reliability metric of ECG from inter-patient data for trustable AI-based cardiac monitors. Smart Health, 26, 100350. https://doi.org/10.1016/j.smhl.2022.100350

[2] Blair, R., Wilson, B., Bangera, M., & Chessman, J. (2021, June 24). IT Resilience — 7 Tips for Improving Reliability, Tolerability and Disaster Recovery. Gartner. Retrieved from Boost Your IT Resilience Strategy With 7 Pragmatic Tips (gartner.com)

[3] IBM. (n.d.). Data Management. IBM. https://www.ibm.com/topics/data-management

[4] Ebeling, C. E. (2019). An Introduction to Reliability and Maintainability Engineering (3rd ed.). Waveland Press.

[5] Talend. (n.d.). What is Data Reliability? Definition & Assessment Guide. Talend. https://www.talend.com/resources/what-is-data-reliability/

[6] Olesen-Bagneux, O. (2023). The Enterprise Data Catalog. O'Reilly Media.

[7] Evans, E. (2003). Domain-Driven Design: Tackling Complexity in the Heart of Software. Upper Saddle River, NJ: Addison-Wesley.

[8] Dehghani, Z. (2021). Data Mesh: Delivering Data-Driven Value at Scale. ThoughtWorks.

[9] Smiraglia, R. P. (2014). The Elements of Knowledge Organization. Cham: Springer, 86.

[10] Bauer, E., & Adams, R. (2012). Reliability and Availability of Cloud Computing. John Wiley & Sons.

[11] Huete Belok, U. (2022). The Art of Site Reliability Engineering (SRE) with Azure: Building and Deploying Applications That Endure. Apress. ISBN-13 (pbk): 978-1-4842-8703-3; ISBN-13 (electronic): 978-1-4842-8704-0. https://doi.org/10.1007/978-1-4842-8704-0

[12] Momota, M. R., & Morshed, B. I. (2022). ML algorithms to estimate data reliability metric of ECG from inter-patient data for trustable AI-based cardiac monitors. Smart Health, 26, 100350. https://doi.org/10.1016/j.smhl.2022.100350

[13] Huang, C.-F., Huang, D.-H., & Lin, Y.-K. (2022). System reliability analysis for a cloud-based network under edge server capacity and budget constraints. Annals of Operations Research, 312(1), 217-234.

[14] Shi, Y., Shi, D., Ying, J., & Yan, J. (2023). Ontology Modeling for Data Reliability Assessment in Consortium Blockchains. Journal of Global Information Management, 31(7). https://doi.org/10.4018/JGIM.333237

[15] World Economic Forum. (n.d.). Data integrity. https://widgets.weforum.org/blockchain-toolkit/data-integrity/index.html

[16] Monte Carlo Data. (n.d.). Data Contracts. https://www.montecarlodata.com/blog-data-contracts/

[17] Garraghan, P., et al. (2016). Straggler Root-Cause and Impact Analysis for Massive-Scale Virtualized Cloud Datacenters. IEEE Transactions on Services Computing.

[18] Barroso, L. A., Hölzle, U., & Ranganathan, P. (2019). The Datacenter as a Computer: Designing Warehouse-Scale Machines (3rd ed.). Synthesis Lectures on Computer Architecture, 14(1), 1-189. https://doi.org/10.2200/S00850ED3V01Y201902CAC045

[19] Barroso, L. A., Clidaras, J., & Hölzle, U. (2018). The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (2nd ed.). Synthesis Lectures on Computer Architecture. Springer.

