Data Warehousing & Data Analytics
What is Data Warehousing?
In computing, a data warehouse, also known as an enterprise data warehouse, is a system used for reporting and data analysis and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources.
Data warehouse architecture
Data warehouse architecture is typically divided into three categories:
Single-tier architecture: This type of architecture focuses on reducing the amount of data stored in order to remove data redundancy. This architecture is rarely used nowadays.
Two-tier architecture: In this type of architecture, the physically available data sources and the data warehouse are kept in two separate layers. This type, however, does not support large numbers of end users well because it is difficult to expand, and network limitations and connectivity problems have also been reported with this architecture.
Three-tier architecture: Conventional data warehouses were developed using three-tier architecture, and it continues to be the most widely used architecture for data warehouse modernization. It is divided into three tiers: bottom, middle, and top.
Data warehouse components
A typical data warehouse consists of the following elements:
Database: The database is a vital element of the data warehousing environment and is implemented on RDBMS (Relational Database Management System) technology. This traditional approach is often constrained, so newer techniques such as parallel relational database designs, new index structures, and multidimensional databases (MDDBs) are being used to improve database management.
Data marts: Data marts can mean different things depending on the need they serve. It can be misleading to generalize them as an alternative to a data warehouse or to assume that they take less time and effort to build. Data marts are dependent if they source data from a data warehouse, and independent if they act as a fragmented point solution to a business issue. Independent data marts miss the central aspect of data warehousing, which is data integration, giving rise to the challenge of overlapping data.
Metadata: Metadata, in a nutshell, is the data that describes a data warehouse. It is used for building, managing, and maintaining the data warehouse, and consists of two components: technical metadata and business metadata. Metadata also gives end users interactive access by helping them find data and understand its content.
Access tools: The primary objective of data warehousing is to provide businesses with information for streamlining and improving the decision-making process. Users interact with the data warehouse through front-end tools, which include query and reporting, application development, online analytical processing, and data mining tools, collectively known as access tools.
Other components: Apart from the elements discussed above, a data warehouse also consists of data warehouse administration and management, and information delivery systems that provide back-end support and ensure the warehousing process is facilitated adequately.
Benefits of Data Warehousing
· Saves Time
· Improves Data Quality
· Improves Business Intelligence
· Leads to Data Consistency
· Enhances Return on Investment (ROI)
· Stores Historical Data
· Increases Data Security
Traditional Data Warehousing
A traditional data warehouse is located on-site at your offices: you purchase the hardware and the server rooms and hire the staff to run it. These are also called on-premises, on-prem or (grammatically incorrect) on-premise data warehouses.
Advanced Data Warehousing
An advanced data warehouse, also known as an enterprise data warehouse, serves as a data hub for business intelligence. It is a support system that stores data from across the organization, processes it, and enables it to be utilized for various business purposes, including reporting, business analysis, and dashboards. A data warehouse system stores structured data from multiple sources such as Online Transaction Processing (OLTP), Customer Relationship Management (CRM), and Enterprise Resource Planning (ERP) systems.
A related, cloud-based variant is the autonomous data warehouse (for example, Oracle Autonomous Data Warehouse), which automates routine administration so that organizations can gain more value from their data faster.
Data Integration / Engineering
Data integration involves combining data residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations, both commercial (such as when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories, for example). Data integration appears with increasing frequency as the volume of data (that is, big data) and the need to share existing data explode. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. Data integration encourages collaboration between internal as well as external users. The data being integrated must be received from heterogeneous database systems and transformed into a single coherent data store that provides synchronous data across a network of files for clients. A common use of data integration is in data mining, when analyzing and extracting information from existing databases that can be useful for business information.
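As a concrete illustration of turning disparate sources into a unified view, here is a minimal sketch in Python using pandas; the two in-memory "sources", their column names, and the merge key are illustrative assumptions rather than any particular tool's API.

```python
# Minimal data-integration sketch: harmonize two hypothetical source extracts
# (a CRM table and a billing table) into one unified customer view.
import pandas as pd

# Source 1: CRM export with its own column naming
crm = pd.DataFrame({
    "cust_id": [101, 102],
    "full_name": ["Ada Lovelace", "Alan Turing"],
    "country": ["UK", "UK"],
})

# Source 2: billing system with a different schema for the same entity
billing = pd.DataFrame({
    "customer_number": [101, 103],
    "name": ["Ada Lovelace", "Grace Hopper"],
    "total_spend": [1200.0, 800.0],
})

# Transform both sources to a common schema, then merge on the shared keys
crm_std = crm.rename(columns={"cust_id": "customer_id", "full_name": "name"})
billing_std = billing.rename(columns={"customer_number": "customer_id"})

unified = crm_std.merge(billing_std, on=["customer_id", "name"], how="outer")
print(unified)   # one coherent view spanning both sources
```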
Data Profiling
Data profiling refers to the process of examining, analyzing, reviewing, and summarizing data sets to gain insight into the quality of data. Data quality is a measure of the condition of data based on factors such as its accuracy, completeness, consistency, timeliness, and accessibility. Additionally, data profiling involves a review of source data to understand the data's structure, content and interrelationships.
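A minimal profiling sketch, assuming a small illustrative pandas DataFrame; real profiling tools report the same kinds of measures (types, completeness, distinct values, basic statistics) at much larger scale.

```python
# Minimal data-profiling sketch: summarize structure, completeness and
# distinct values of a small illustrative dataset.
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [250.0, None, 99.9, 120.0],
    "country": ["DE", "DE", "US", None],
})

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),            # structure
    "non_null": df.notna().sum(),              # completeness
    "null_pct": (df.isna().mean() * 100).round(1),
    "distinct": df.nunique(),                  # content variety
})
print(profile)
print(df.describe())   # basic statistics for numeric columns
```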
Data Ingestion
Data ingestion is the process of obtaining and importing data for immediate use or storage in a database. To ingest something is to take something in or absorb something. Data can be streamed in real time or ingested in batches. In real-time data ingestion, each data item is imported as the source emits it.
The two main types of data ingestion are:
· Batch data ingestion, in which data is collected and transferred in batches at regular intervals.
· Streaming data ingestion, in which data is collected in real time (or nearly so) and loaded into the target location almost immediately.
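A minimal sketch contrasting the two styles above; the source, targets, and timing are plain-Python stand-ins, not a specific ingestion tool's API.

```python
# Minimal sketch contrasting batch and streaming ingestion.
import time

source = [{"id": i, "value": i * 10} for i in range(10)]

def ingest_batch(records, target, batch_size=5):
    """Batch ingestion: collect records and load them at regular intervals."""
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        target.extend(batch)                  # one bulk load per interval
        print(f"loaded batch of {len(batch)} records")

def ingest_stream(records, target):
    """Streaming ingestion: load each record as the source emits it."""
    for record in records:
        target.append(record)                 # near-real-time, per-record load
        time.sleep(0.01)                      # simulate the source's emit rate

batch_target, stream_target = [], []
ingest_batch(source, batch_target)
ingest_stream(source, stream_target)
print(len(batch_target), len(stream_target))  # both end with the same 10 records
```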
Data Quality
Data quality refers to the state of qualitative or quantitative pieces of information. There are many definitions of data quality, but data is generally considered high quality if it is "fit for [its] intended uses in operations, decision making and planning". Moreover, data is deemed of high quality if it correctly represents the real-world construct to which it refers. Furthermore, apart from these definitions, as the number of data sources increases, the question of internal data consistency becomes significant, regardless of fitness for use for any particular external purpose. People's views on data quality can often be in disagreement, even when discussing the same set of data used for the same purpose. When this is the case, data governance is used to form agreed-upon definitions and standards for data quality. In such cases, data cleansing, including standardization, may be required in order to ensure data quality.
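A minimal cleansing and standardization sketch under illustrative rules (trim and lower-case emails, harmonize country values, flag incomplete or invalid rows); any real rule set would come from the data governance agreements mentioned above.

```python
# Minimal data-cleansing sketch: standardize formats and flag records that
# fail simple quality rules (completeness, validity). Rules are illustrative.
import pandas as pd

customers = pd.DataFrame({
    "email": ["ADA@EXAMPLE.COM ", "alan@example.com", None],
    "country": ["uk", "United Kingdom", "US"],
})

# Standardization: trim, lower-case, map country variants to one code
customers["email"] = customers["email"].str.strip().str.lower()
customers["country"] = customers["country"].str.upper().replace(
    {"UNITED KINGDOM": "UK"}
)

# Simple quality rules
customers["complete"] = customers["email"].notna()
customers["valid_email"] = customers["email"].str.contains("@", na=False)
print(customers)
```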
Business Intelligence
Business intelligence comprises the strategies and technologies used by enterprises for the data analysis of business information. BI technologies provide historical, current, and predictive views of business operations.
The most important types of business intelligence tool features and functionality include:
· Dashboards
· Visualizations
· Reporting
· Predictive Analytics
· Data Mining
· ETL
· OLAP
· Drill-Down (a minimal roll-up/drill-down sketch follows this list)
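The OLAP and drill-down entries above boil down to aggregating the same facts at different levels of detail. A minimal sketch, assuming a hypothetical sales table with illustrative regions, months, and revenue figures:

```python
# Minimal OLAP-style sketch: roll sales up to region level, then drill down
# into one region by month.
import pandas as pd

sales = pd.DataFrame({
    "region": ["EMEA", "EMEA", "APAC", "APAC"],
    "month":  ["2024-01", "2024-02", "2024-01", "2024-02"],
    "revenue": [100.0, 120.0, 80.0, 95.0],
})

rollup = sales.groupby("region")["revenue"].sum()          # summary level
drilldown = (sales[sales["region"] == "EMEA"]
             .groupby("month")["revenue"].sum())           # one region, by month
print(rollup)
print(drilldown)
```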
Why is business intelligence important?
Great BI helps businesses and organizations ask and answer questions of their data. Business intelligence can help companies make better decisions by showing present and historical data within their business context.
Business intelligence greatly enhances how a company approaches its decision-making by using data to answer questions about the company's past and present. It can be used by teams across an organization to track key metrics and organize around goals.
Self Service BI / Analytics
Self-service business intelligence (BI) is an approach to data analytics that enables business users to access and explore data sets even if they don't have a background in BI or related functions like data mining and statistical analysis.
Self-Service Analytics is a form of business intelligence (BI) in which line-of-business professionals are enabled and encouraged to perform queries and generate reports on their own, with nominal IT support.
Self-service dashboards allow analysis based on existing data sources while minimizing the query language needed to pull the data. This lets employees focus more on modeling and business decisions than on data gathering.
Mobile BI
Mobile business intelligence (Mobile BI or Mobile Intelligence) has been defined as “a system comprising both technical and organizational elements that present historical and/or real-time information to its users for analysis on mobile devices such as smartphones and tablets (not laptops), to enable effective decision-making and management support, for the overall purpose of increasing firm performance” (Peters et al., 2016). Business intelligence (BI) refers to computer-based techniques used in spotting, digging out, and analyzing business data, such as sales revenue by products and/or departments or associated costs and incomes.
Data Virtualization
Data virtualization is an approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source or where it is physically located, and can provide a single customer view (or single view of any other entity) of the overall data.
Unlike the traditional extract, transform, load ("ETL") process, the data remains in place, and real-time access is given to the source system for the data. This reduces the risk of data errors and the workload of moving data around that may never be used, and it does not attempt to impose a single data model on the data (an example of heterogeneous data is a federated database system). The technology also supports the writing of transaction data updates back to the source systems. To resolve differences in source and consumer formats and semantics, various abstraction and transformation techniques are used. This concept and software is a subset of data integration and is commonly used within business intelligence, service-oriented architecture data services, cloud computing, enterprise search, and master data management.
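A minimal sketch of the idea, not of any vendor's product: a virtual view resolves a request against the source systems at access time instead of copying their data into a warehouse first. The two source functions are hypothetical stand-ins for real systems.

```python
# Minimal data-virtualization sketch: a virtual view that fetches from the
# source systems only when queried, rather than copying data up front.

def crm_source():
    return [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Alan"}]

def billing_source():
    return {1: 1200.0, 2: 0.0}

def virtual_customer_view(customer_id):
    """Resolve a single-customer view at query time; the data stays in place."""
    crm_row = next(r for r in crm_source() if r["customer_id"] == customer_id)
    spend = billing_source().get(customer_id, 0.0)
    return {**crm_row, "total_spend": spend}

print(virtual_customer_view(1))   # {'customer_id': 1, 'name': 'Ada', 'total_spend': 1200.0}
```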
Data Visualization
Data and information visualization (data viz or info viz) is an interdisciplinary field that deals with the graphic representation of data and information. It is a particularly efficient way of communicating when the data or information is voluminous, as for example a time series.
It is also the study of visual representations of abstract data to reinforce human cognition. The abstract data include both numerical and non-numerical data, such as text and geographic information. It is related to infographics and scientific visualization. One distinction is that it is information visualization when the spatial representation (e.g., the page layout of a graphic design) is chosen, whereas it is scientific visualization when the spatial representation is given.
From an academic point of view, this representation can be considered as a mapping between the original data (usually numerical) and graphic elements (for example, lines or points in a chart). The mapping determines how the attributes of these elements vary according to the data. In this light, a bar chart is a mapping of the length of a bar to a magnitude of a variable. Since the graphic design of the mapping can adversely affect the readability of a chart, mapping is a core competency of data visualization.
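A minimal sketch of that mapping using matplotlib: the revenue values (illustrative numbers) are mapped to bar lengths.

```python
# Minimal visualization sketch: a bar chart maps a variable's magnitude
# to bar length, as described above.
import matplotlib.pyplot as plt

categories = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 150, 90, 180]

fig, ax = plt.subplots()
ax.bar(categories, revenue)          # mapping: revenue -> bar length
ax.set_xlabel("Quarter")
ax.set_ylabel("Revenue")
ax.set_title("Bar chart as a mapping from magnitude to length")
plt.show()
```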
Advanced Analytics & Big Data
Advanced Analytics is the autonomous or semi-autonomous examination of data or content using sophisticated techniques and tools, typically beyond those of traditional business intelligence (BI), to discover deeper insights, make predictions, or generate recommendations.
Big data analytics is the use of advanced analytic techniques against very large, diverse big data sets that include structured, semi-structured and unstructured data, from different sources, and in different sizes from terabytes to zettabytes.
Descriptive Analytics
Descriptive analytics is the interpretation of historical data to better understand changes that have occurred in a business. Descriptive analytics describes the use of a range of historic data to draw comparisons. Most commonly reported financial metrics are a product of descriptive analytics, for example, year-over-year pricing changes, month-over-month sales growth, the number of users, or the total revenue per subscriber. These measures all describe what has occurred in a business during a set period.
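For example, month-over-month sales growth is a straightforward calculation over historical figures; a minimal sketch with illustrative numbers:

```python
# Minimal descriptive-analytics sketch: month-over-month sales growth
# computed from historical figures.
import pandas as pd

monthly_sales = pd.Series(
    [100_000, 110_000, 104_500, 120_175],
    index=["2024-01", "2024-02", "2024-03", "2024-04"],
)

mom_growth = monthly_sales.pct_change() * 100   # % change vs previous month
print(mom_growth.round(1))
```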
Predictive Analytics
Predictive analytics encompasses a variety of statistical techniques from data mining, predictive modeling, and machine learning that analyze current and historical facts to make predictions about future or otherwise unknown events.
In business, predictive models exploit patterns found in historical and transactional data to identify risks and opportunities. Models capture relationships among many factors to allow assessment of risk or potential associated with a particular set of conditions, guiding decision-making for candidate transactions.
The defining functional effect of these technical approaches is that predictive analytics provides a predictive score (probability) for each individual (customer, employee, healthcare patient, product SKU, vehicle, component, machine, or other organizational unit) in order to determine, inform, or influence organizational processes that pertain across large numbers of individuals, such as in marketing, credit risk assessment, fraud detection, manufacturing, healthcare, and government operations including law enforcement.
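A minimal sketch of producing such a per-individual score, assuming a tiny synthetic churn data set and scikit-learn's logistic regression; the features and labels are illustrative only.

```python
# Minimal predictive-analytics sketch: fit a model on historical outcomes and
# produce a predictive score (probability) per customer.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Historical facts: [tenure_months, past_purchases] and whether the customer churned
X_hist = np.array([[1, 0], [3, 1], [24, 10], [36, 15], [2, 0], [30, 12]])
y_hist = np.array([1, 1, 0, 0, 1, 0])   # 1 = churned

model = LogisticRegression().fit(X_hist, y_hist)

# Score new individuals: the churn probability guides the business decision
X_new = np.array([[2, 1], [28, 9]])
scores = model.predict_proba(X_new)[:, 1]
print(scores.round(2))
```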
Diagnostic Analytics
Diagnostic analytics is a form of advanced analytics that examines data or content to answer the question, “Why did it happen?” It is characterized by techniques such as drill-down, data discovery, data mining and correlations.
The purpose of diagnostic analytics is to determine the root cause of an occurrence or trend. Often, a trend is identified using an earlier descriptive analysis step. The company can then apply diagnostic analytics to understand why the trend occurred.
Diagnostic analytics is usually performed using such techniques as data discovery, drill-down, data mining, and correlations. In the discovery process, analysts identify the data sources that will help them interpret the results. Drilling down involves focusing on a certain facet of the data or a particular widget.
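A minimal sketch combining drill-down and correlation on synthetic data: a conversion trend is broken out by channel and checked against page load time.

```python
# Minimal diagnostic-analytics sketch: drill a conversion drop down by channel
# and test for correlation with page load time. Data is synthetic.
import pandas as pd

visits = pd.DataFrame({
    "channel":     ["ads", "ads", "email", "email", "organic", "organic"],
    "load_time_s": [4.2, 3.9, 1.1, 1.3, 1.8, 2.0],
    "conversions": [10, 12, 40, 38, 30, 28],
})

# Drill-down: which facet of the data accounts for the trend?
print(visits.groupby("channel")["conversions"].mean())

# Correlation: is slow load time associated with fewer conversions?
print(visits["load_time_s"].corr(visits["conversions"]).round(2))
```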
Prescriptive Analytics
Prescriptive analytics is the fourth and final phase of business analytics, which also includes descriptive and predictive analytics.
Referred to as the "final frontier of analytic capabilities," prescriptive analytics entails the application of mathematical and computational sciences and suggests decision options to take advantage of the results of descriptive and predictive analytics. The first stage of business analytics is descriptive analytics, which still accounts for the majority of all business analytics today. Descriptive analytics looks at past performance and understands that performance by mining historical data to look for the reasons behind past success or failure. Most management reporting – such as sales, marketing, operations, and finance – uses this type of post-mortem analysis.
Prescriptive analytics not only anticipates what will happen and when it will happen, but also why it will happen. Further, prescriptive analytics suggests decision options on how to take advantage of a future opportunity or mitigate a future risk and shows the implication of each decision option. Prescriptive analytics can continually take in new data to re-predict and re-prescribe, thus automatically improving prediction accuracy and prescribing better decision options. Prescriptive analytics ingests hybrid data, a combination of structured (numbers, categories) and unstructured data (videos, images, sounds, texts), and business rules to predict what lies ahead and to prescribe how to take advantage of this predicted future without compromising other priorities.
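One common way to turn predictions into a prescribed decision option is constrained optimization. A minimal sketch with SciPy's linear programming solver, where the per-channel returns and budget limits are illustrative assumptions:

```python
# Minimal prescriptive-analytics sketch: given predicted returns per channel,
# suggest a budget split that maximizes expected return under constraints.
from scipy.optimize import linprog

# Predicted return per dollar for channels A and B (from a predictive step)
returns = [1.4, 1.1]

# linprog minimizes, so negate the returns to maximize expected return
c = [-r for r in returns]

# Constraints: total budget <= 100k; channel A capped at 60k
A_ub = [[1, 1], [1, 0]]
b_ub = [100_000, 60_000]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x)          # prescribed spend per channel
print(-result.fun)       # expected return of the prescribed option
```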
Data Science
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.
Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. However, data science is different from computer science and information science. Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational, and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.
What exactly does data science do?
In simple terms, a data scientist's job is to analyze data for actionable insights. Specific tasks include identifying the data-analytics problems that offer the greatest opportunities to the organization and determining the correct data sets and variables.
Data Science Process
The data science process is a systematic approach to solving a data problem. It provides a structured framework for articulating your problem as a question, deciding how to solve it, and then presenting the solution to stakeholders.
Machine Learning
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.
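A minimal sketch of the email-filtering example: a classifier learns from a few labelled sample messages (illustrative data) instead of hand-coded rules, using scikit-learn.

```python
# Minimal machine-learning sketch: build a model from labelled training data
# for the email-filtering use case, without explicit rule programming.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["win a free prize now", "cheap meds free offer",
               "meeting agenda for monday", "project status update"]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)           # learn from sample data

print(model.predict(["free prize offer", "monday project meeting"]))
```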
Artificial Intelligence
Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by animals and humans. AI research has been defined as the field of study of intelligent agents, which refers to any system that perceives its environment and takes actions that maximize its chance of achieving its goals.
The term "artificial intelligence" had previously been used to describe machines that mimic and display "human" cognitive skills that are associated with the human mind, such as "learning" and "problem-solving". This definition has since been rejected by major AI researchers who now describe AI in terms of rationality and acting rationally, which does not limit how intelligence can be articulated.
Deep Learning
Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised.
Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks, convolutional neural networks and Transformers have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, climate science, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.
Artificial neural networks (ANNs) were inspired by information processing and distributed communication nodes in biological systems. ANNs have various differences from biological brains. Specifically, artificial neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analogue.
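A minimal sketch of a small artificial neural network trained by supervised learning, using PyTorch on a toy XOR-style task; the architecture and data are illustrative and far smaller than the deep architectures listed above.

```python
# Minimal sketch: a small feed-forward neural network trained by supervised
# learning (gradient descent) on a toy XOR-style task.
import torch
import torch.nn as nn

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

for _ in range(500):                 # learn the representation from the data
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print(model(X).detach().round())     # should approximate the XOR targets
```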
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.
Data Governance
Data governance is a term used on both a macro and a micro level. The former is a political concept and forms part of international relations and Internet governance; the latter is a data management concept and forms part of corporate data governance.
Data governance encompasses the people, processes, and information technology required to create a consistent and proper handling of an organization's data across the business enterprise. It provides all data management practices with the necessary foundation, strategy, and structure needed to ensure that data is managed as an asset and transformed into meaningful information. Goals may be defined at all levels of the enterprise and doing so may aid in acceptance of processes by those who will use them. Some goals include:
· Increasing consistency and confidence in decision making
· Decreasing the risk of regulatory fines
· Improving data security, and defining and verifying the requirements for data distribution policies
· Maximizing the income generation potential of data
· Designating accountability for information quality
· Enabling better planning by supervisory staff
· Minimizing or eliminating re-work
· Optimizing staff effectiveness
· Establishing process performance baselines to enable improvement efforts
· Acknowledging and holding all gains
Data Security
Data security means protecting digital data, such as those in a database, from destructive forces and from the unwanted actions of unauthorized users, such as a cyberattack or a data breach.
Backups
Backups are used to ensure data that is lost can be recovered from another source. It is considered essential to keep a backup of any data in most industries and the process is recommended for any files of importance to a user.
Data masking
Data masking of structured data is the process of obscuring (masking) specific data within a database table or cell to ensure that data security is maintained, and sensitive information is not exposed to unauthorized personnel. This may include masking the data from users (for example so banking customer representatives can only see the last four digits of a customer's national identity number), developers (who need real production data to test new software releases but should not be able to see sensitive financial data), outsourcing vendors, etc.
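A minimal sketch of the banking example above: mask a national identity number so that only the last four digits are visible. The function name and number format are illustrative.

```python
# Minimal data-masking sketch: expose only the last four digits of a
# national identity number to a restricted view.
def mask_national_id(national_id: str) -> str:
    """Replace all but the last four characters with '*'."""
    visible = national_id[-4:]
    return "*" * (len(national_id) - 4) + visible

print(mask_national_id("123456789"))   # *****6789
```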
Data erasure
Data erasure is a method of software-based overwriting that completely wipes all electronic data residing on a hard drive or other digital media to ensure that no sensitive data is lost when an asset is retired or reused.
Reference Data Management
Reference data management is the process of managing classifications and hierarchies across systems and business lines. This may include performing analytics on reference data, tracking changes to reference data, distributing reference data, and more. For effective reference data management, companies must set policies, frameworks, and standards to govern and manage both internal and external reference data.
After coming to widespread prominence in 2012, Reference Data Management (RDM) has become a key element in Master Data Management (MDM). RDM provides the processes and technologies for recognizing, harmonizing and sharing coded, relatively static data sets for “reference” by multiple constituencies (people, systems, and other master data domains). Such a system provides governance, process, security, and audit control around the mastering of reference data. In addition, RDM systems also manage complex mappings between different reference data representations and different data domains across the enterprise. Most contemporary RDM systems also provide connectivity, typically a service-oriented architecture (SOA) service layer (a.k.a. microservices), for sharing of reference data with enterprise applications, analytical/data science, and governance applications.
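A minimal sketch of a governed reference set and a mapping from one system's local representation onto it; the legacy codes are illustrative assumptions, while DE/US/GB are standard ISO country codes.

```python
# Minimal reference-data sketch: one shared reference code set, plus a mapping
# from a legacy system's local codes onto it.
REFERENCE_COUNTRIES = {"DE": "Germany", "US": "United States", "GB": "United Kingdom"}

# Mapping from a source system's local representation to the reference set
LEGACY_TO_REFERENCE = {"GER": "DE", "USA": "US", "UK": "GB"}

def harmonize_country(legacy_code: str) -> str:
    """Translate a legacy code to the shared reference representation."""
    ref = LEGACY_TO_REFERENCE.get(legacy_code)
    if ref is None or ref not in REFERENCE_COUNTRIES:
        raise ValueError(f"unmapped reference value: {legacy_code}")
    return ref

print(harmonize_country("GER"))   # DE
```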
Master data management?(MDM)
Master data management (MDM) is?a process that creates a uniform set of data on customers, products, suppliers, and other business entities from different IT systems.
Master Data Management (MDM) is the technology, tools and processes that?ensure master data is coordinated across the enterprise. MDM provides a unified master data service that provides accurate, consistent, and complete master data across the enterprise and to business partners.
The most commonly found categories of master data are parties (individuals and organizations, and their roles, such as customers, suppliers, employees), products, financial structures (such as ledgers and cost centers) and locational concepts.
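A minimal sketch of consolidating the same party from two systems into a single "golden record" using a simple field-by-field survivorship rule; the records and the rule are illustrative.

```python
# Minimal master-data sketch: merge the same customer from two systems into
# one golden record, taking the first non-empty value per field.
crm_record     = {"customer_id": 42, "name": "Ada Lovelace", "phone": None,
                  "email": "ada@example.com"}
billing_record = {"customer_id": 42, "name": "A. Lovelace", "phone": "+44 20 7946 0000",
                  "email": None}

def golden_record(*records):
    """Field-by-field survivorship: keep the first non-empty value seen."""
    merged = {}
    for record in records:
        for field, value in record.items():
            if merged.get(field) in (None, "") and value not in (None, ""):
                merged[field] = value
    return merged

print(golden_record(crm_record, billing_record))
```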
Enterprise Performance Management
What is EPM (Enterprise Performance Management) Software? EPM refers to the processes designed to help organizations plan, budget, forecast, and report on business performance as well as consolidate and finalize financial results (often referred to as closing the books).
Enterprise Performance Management (EPM) is the process of monitoring performance across the enterprise with the goal of improving business performance.
EPM helps improve financial planning and analysis to reduce cost and risk and to drive profitability with decisions based on facts. It also helps coordinate resources, activities, and tasks across multiple departments and teams.