Data Modeling: Driven by the Data
As a data modeler, I will tell you that all of my data models are driven by the data. After all, the data is what modeling is all about. Over the last thirty years I have created hundreds of data models that resulted in production databases for transaction systems, data warehouses, metadata repositories, and more. With this level of experience, I am considered an industry expert in several of the leading data modeling products. These tools provide tremendous value for modeling enterprise data, especially for traditional databases.
That said, the traditional data modeling tools provide no value for modeling semi-structured and unstructured data beyond storing the values in a database. To be clear, we are not talking about creating a target for storing semi-structured and unstructured data; we are talking about identifying, extracting, and modeling the business data contained within the semi-structured and unstructured values.
Data Classification
Data classification is the process of identifying the type of data that a value represents. Most data classification engines are intended to identify sensitive data in structured data sources to help meet privacy laws and regulations. This is an important use of data classification, but it represents only a fraction of the business value that data classification is capable of delivering.
Complete Data Quality evolves the use of data classification to model structured, semi-structured, and unstructured data for all business data, not just sensitive data. The primary differentiator is that traditional data classification engines operate at the value level, whereas Clean Cloud’s data classification engine operates at the content level. Following KISS (“Keep It Super Simple”): value-level classification handles structured data only, while content-level classification handles structured, semi-structured, and unstructured data.
Phone Number Example:
· 847-975-0217
· Phone Number: 847-975-0217
· Contact the inventor at 847-975-0217 for more information.
In the phone number example above, the first bullet contains a structured value, the second bullet contains a semi-structured value, and the third bullet contains an unstructured value. Clean Cloud’s data classification engine will identify and extract the phone number from all three examples, whereas traditional classification engines will only work on the first example.
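The difference between the two levels can be sketched in a few lines of Python. This is only an illustration, not Clean Cloud’s implementation: the regular expression covers a single assumed format (NNN-NNN-NNNN), and the names PHONE, classify_value, and extract_content are hypothetical.

```python
import re

# Hypothetical pattern for one phone format (NNN-NNN-NNNN); an assumption
# made for illustration, not the rule any real engine uses.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def classify_value(value: str) -> bool:
    """Value level: the entire value must be a phone number."""
    return PHONE.fullmatch(value.strip()) is not None


def extract_content(value: str) -> list[str]:
    """Content level: find phone numbers anywhere inside the value."""
    return PHONE.findall(value)


EXAMPLES = [
    "847-975-0217",                                                # structured
    "Phone Number: 847-975-0217",                                  # semi-structured
    "Contact the inventor at 847-975-0217 for more information.",  # unstructured
]

if __name__ == "__main__":
    for example in EXAMPLES:
        print(classify_value(example), extract_content(example))
```

Under these assumptions, the value-level check accepts only the first example, while the content-level search finds the same phone number in all three.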
Clean Cloud’s data classification engine is capable of classifying every word in this article using the base classes provided with the modeling software product. The entire article would be treated as a single value by the classification engine, and all of the base data types would be applied to classify the data content. This would work, but it would produce an overwhelming number of classified values.
Instead, Clean Cloud provides the data modeler with the ability to control what data is classified by using domains. The domains identify the data classes to be used by the classification engine for modeling the data. This allows the modeler to focus the classification engine to identify and extract only pertinent business values from the unstructured data. This capability is why the modeling software within Clean Cloud is referred to as Domain Search.
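One way to picture a domain is as a named list of data classes that narrows what the engine looks for. The sketch below is an assumption-laden illustration: the BASE_CLASSES patterns, the "contact_info" domain, and the functions classify_all and domain_search are all hypothetical and do not reflect the product’s actual interface.

```python
import re

# Hypothetical base classes; the patterns are illustrative only.
BASE_CLASSES = {
    "phone_number": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "word": re.compile(r"\b[A-Za-z]+\b"),
}

# A domain (hypothetical name) lists the data classes the engine should apply.
DOMAINS = {"contact_info": ["phone_number", "email"]}


def classify(text: str, class_names) -> dict[str, list[str]]:
    """Apply the named classes to the text and keep only classes that matched."""
    results = {}
    for name in class_names:
        matches = BASE_CLASSES[name].findall(text)
        if matches:
            results[name] = matches
    return results


def classify_all(text: str) -> dict[str, list[str]]:
    """No domain: every base class is applied, so every word is classified."""
    return classify(text, BASE_CLASSES)


def domain_search(text: str, domain: str) -> dict[str, list[str]]:
    """With a domain: only the classes the modeler selected are applied."""
    return classify(text, DOMAINS[domain])


if __name__ == "__main__":
    text = "Contact the inventor at 847-975-0217 for more information."
    print(classify_all(text))
    print(domain_search(text, "contact_info"))
```

Without a domain, even this one sentence yields a classified value for every word; with the domain applied, only the phone number comes back.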
Data Driven
The data types are the most granular classification component and align directly to the data content. The data modeler creates custom data types to extract and classify specific business data content from unstructured values.
Continuing with the phone number example, the data modeler creates the phone number data type by literally feeding valid phone numbers to the data classification engine. The classification engine profiles the submitted values to infer all of the metadata necessary to classify a value as a phone number. The inferred metadata is stored as the phone number classification data type.
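A simplified way to imagine this profiling step is to reduce each submitted value to a "shape" and store the set of shapes as the data type’s metadata. The sketch below assumes that approach purely for illustration; the article does not say how the engine infers its metadata, and the functions mask, infer_data_type, and extract are hypothetical.

```python
import re


def mask(value: str) -> str:
    """Reduce a value to its shape: digits become 9, letters become A,
    and every other character is kept as-is."""
    return "".join(
        "9" if c.isdigit() else "A" if c.isalpha() else c for c in value
    )


def infer_data_type(name: str, samples: list[str]) -> dict:
    """Profile valid samples and store the inferred metadata
    (in this sketch, just the sorted set of shapes)."""
    return {"name": name, "masks": sorted({mask(s) for s in samples})}


def mask_to_regex(shape: str) -> str:
    """Turn a stored shape back into a regular expression fragment."""
    return "".join(
        r"\d" if c == "9" else "[A-Za-z]" if c == "A" else re.escape(c)
        for c in shape
    )


def extract(data_type: dict, text: str) -> list[str]:
    """Find every substring of the text that fits one of the stored shapes."""
    alternatives = "|".join(mask_to_regex(m) for m in data_type["masks"])
    return re.findall(rf"(?<!\w)(?:{alternatives})(?!\w)", text)


if __name__ == "__main__":
    # The modeler "feeds" valid phone numbers in two formats.
    phone = infer_data_type("phone_number", ["847-975-0217", "(847) 975-0217"])
    print(phone["masks"])
    print(extract(phone, "Contact the inventor at 847-975-0217 for more information."))
```

Feeding both formats to a single call produces one all-inclusive data type; calling infer_data_type once per format would instead produce format-specific data types.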
In this example, the modeler could submit valid phone numbers with different formats to create an all-inclusive phone number data type, or the modeler could create multiple phone number data types specific to each format. This allows for a great deal of flexibility for creating data rules specific to a set of data, but creates a many-to-many relationship with the data classes.
Data class types are used to break the many-to-many relationship between the data types and the data classes. The data class types provide the ability to create rules specific to one data type or many data types that all conform to the same data class. The data modeler has complete control to determine the correct level of granularity needed in order to model the data properly for the organization.
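In traditional modeling terms, this resembles an associative entity that sits between two tables. The sketch below is one hypothetical reading of that structure: the class and data type names, the example rules, and the rules_for function are assumptions made for illustration, not Clean Cloud’s actual object model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataClassType:
    """Associative record: one data type conforming to one data class."""
    data_class: str
    data_type: str


# Hypothetical links. "phone_dashed" belongs to two classes and
# "Phone Number" has two data types, which is the many-to-many case.
LINKS = {
    DataClassType("Phone Number", "phone_dashed"),
    DataClassType("Phone Number", "phone_parenthesized"),
    DataClassType("Contact Identifier", "phone_dashed"),
}

# Rules shared by every data type that conforms to a data class.
CLASS_RULES = {"Phone Number": ["must contain 10 digits"]}

# Rules specific to one data type within one data class.
LINK_RULES = {
    DataClassType("Phone Number", "phone_parenthesized"): [
        "normalize to dashed format"
    ],
}


def rules_for(data_class: str, data_type: str) -> list[str]:
    """Return the shared class rules followed by the type-specific rules."""
    link = DataClassType(data_class, data_type)
    if link not in LINKS:
        raise KeyError(f"{data_type} is not linked to {data_class}")
    return CLASS_RULES.get(data_class, []) + LINK_RULES.get(link, [])


if __name__ == "__main__":
    print(rules_for("Phone Number", "phone_dashed"))
    print(rules_for("Phone Number", "phone_parenthesized"))
```

Attaching rules to the link record is what lets the modeler choose the granularity: a rule can cover every data type in a class or only one of them.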
Modeling Data
Clean Cloud evolves data modeling by introducing powerful data classification technology to unlock the business data within unstructured text data sources. Clean Cloud provides the data modeler with the modeling tools to create the classification objects, and with Domain Search to use those classification objects to extract business data from unstructured values. The next several articles will review the steps involved in a Domain Search project.