Modeling Data Classes

Modernizing data classification to address all types of data requires a more advanced data structure for the data classes, because the same business information can appear as many different variations of values. Country values are a simple example: they may arrive as country names, two-character abbreviations, or three-character abbreviations. The country data class needs to address all of these variations.

To achieve this goal, the data structure is expanded to define data types. Our country data class would have three country data types, one for each variation: country names, two-character abbreviations, and three-character abbreviations.

However, this data structure has a problem: in some circumstances a data type may align to multiple data classes, creating a many-to-many situation. For example, U.S. state names can also be first names, like Alabama, or even last names, like Montana. A class type is introduced to resolve the many-to-many situation.


The diagram above depicts the relationships as viewed from the data class perspective, but keep in mind that a data type may be associated with multiple class types. The class types manage the relationship between a data class and the data types. In other words, the class type identifies the data types that are valid for the data class.
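To make the relationship concrete, here is a minimal sketch that treats the class type as an association record between a data class and a data type. The record layout, the helper functions, and all of the names in the sample data are assumptions made for illustration; the article does not describe how the tool stores these relationships, and whether it would model the state-name example exactly this way is a guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassType:
    """Association record: data_type is valid for data_class."""
    data_class: str
    data_type: str


# Hypothetical sample data, for illustration only.
CLASS_TYPES = [
    ClassType("Country", "Country Name"),
    ClassType("Country", "Country Abbreviation (2)"),
    ClassType("Country", "Country Abbreviation (3)"),
    ClassType("U.S. State", "U.S. State Name"),
    ClassType("First Name", "U.S. State Name"),  # same data type, second class
]


def data_types_for(data_class: str) -> list:
    """Data types that are valid for one data class."""
    return [ct.data_type for ct in CLASS_TYPES if ct.data_class == data_class]


def data_classes_for(data_type: str) -> list:
    """Data classes that one data type is associated with."""
    return [ct.data_class for ct in CLASS_TYPES if ct.data_type == data_type]


print(data_types_for("Country"))
print(data_classes_for("U.S. State Name"))
```

Each side of the many-to-many situation becomes a simple lookup over the class type records, which is the usual reason for introducing an association entity.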

When creating a new data class, the data modeler first creates the data types. A data type identifies the data constraints specific to a set of data. However, a classification data type requires more data constraints than the traditional data type and length defined for a column in a relational database. The data constraints must be more specific in order to classify values properly.

Instead of a data type like Boolean, character, decimal, integer, etc., the classification data type defines allowed and required characters. The characters can be alpha, numeric, punctuation, symbols, or spaces. The allowed and required characters can be one of the character types or a combination of character types. If the allowed and required characters include punctuation or symbols, then the modeler defines which punctuation or symbols are allowed as well. The length data constraint is a length range, rather than a fixed length.
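The basic data constraints described above can be sketched as a small Python structure. Everything here is an assumption made for illustration: the class name, the field names, the split between punctuation and symbols, and the sample length ranges are not taken from the modeling tool.

```python
from dataclasses import dataclass

# Assumed split: these marks count as punctuation, every other
# non-alphanumeric, non-space character counts as a symbol.
PUNCTUATION = set(".,;:!?'\"-()")


def char_category(ch: str) -> str:
    """Map one character to a character type."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "numeric"
    if ch.isspace():
        return "space"
    if ch in PUNCTUATION:
        return "punctuation"
    return "symbol"


@dataclass(frozen=True)
class ClassificationDataType:
    name: str
    allowed: frozenset      # character types that may appear
    required: frozenset     # character types that must appear
    min_length: int         # length is a range, not a fixed length
    max_length: int
    allowed_marks: frozenset = frozenset()  # specific punctuation/symbols

    def accepts(self, value: str) -> bool:
        if not (self.min_length <= len(value) <= self.max_length):
            return False
        seen = set()
        for ch in value:
            category = char_category(ch)
            if category not in self.allowed:
                return False
            if category in ("punctuation", "symbol") and ch not in self.allowed_marks:
                return False
            seen.add(category)
        return self.required <= seen


# Hypothetical country data types.
country_name = ClassificationDataType(
    name="Country Name",
    allowed=frozenset({"alpha", "space", "punctuation"}),
    required=frozenset({"alpha"}),
    min_length=4,
    max_length=60,
    allowed_marks=frozenset("'-.,"),
)
country_code_2 = ClassificationDataType(
    name="Country Abbreviation (2)",
    allowed=frozenset({"alpha"}),
    required=frozenset({"alpha"}),
    min_length=2,
    max_length=2,
)

print(country_name.accepts("Côte d'Ivoire"))
print(country_code_2.accepts("U5"))
```

Note that a fixed length is simply a range whose minimum and maximum are equal, as in the two-character abbreviation.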

These are the basic data constraints that are required for the data type, whether the data type is intended for structured or unstructured data. The data modeler may define additional data constraints such as patterns, value patterns, ranges, valid values, and more. Be aware that there are a handful of data constraints specifically used for classifying values in unstructured data that are not available or used for classifying structured data.

For either structured or unstructured data, the basic data constraints need to be as accurate as possible in order to classify data properly. To ensure accuracy, the data modeler feeds valid values to the modeling tool, which infers the basic data constraints and the patterns constraint. This approach accelerates the modeling process and increases the accuracy of the data constraints.
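One way such inference could work is sketched below. This is not the modeling tool's algorithm, which the article does not describe. It assumes that allowed character types are the union across the sample values, required character types are the intersection, and a pattern is written with `9` for a digit, `A` for a letter, and every other character kept literally. The function names and the punctuation set are hypothetical.

```python
# Assumed set of punctuation marks; anything else non-alphanumeric is a symbol.
PUNCTUATION = set(".,;:!?'\"-()")


def char_category(ch: str) -> str:
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "numeric"
    if ch.isspace():
        return "space"
    if ch in PUNCTUATION:
        return "punctuation"
    return "symbol"


def to_pattern(value: str) -> str:
    """Generalize a value: letters become A, digits become 9."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch for ch in value
    )


def infer_constraints(values: list) -> dict:
    """Infer basic constraints and patterns from known-valid values."""
    if not values:
        raise ValueError("at least one valid value is required")
    per_value = [{char_category(ch) for ch in v} for v in values]
    return {
        "allowed": set.union(*per_value),          # seen in any value
        "required": set.intersection(*per_value),  # seen in every value
        "min_length": min(len(v) for v in values),
        "max_length": max(len(v) for v in values),
        "patterns": sorted({to_pattern(v) for v in values}),
    }


inferred = infer_constraints(["212-555-0100", "(212) 555-0100", "2125550100"])
print(inferred)
```

Feeding more valid values widens the allowed set and the length range and narrows the required set, which is why adjusting the sample is a natural way to tune the constraints.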

The data modeler can manually customize the inferred data constraints, but the recommended way to make adjustments is to feed more values to the modeling tool. The data modeler may also need to add constraints in order to classify a specific set of values correctly. For example, pattern constraints suit phone numbers, social security numbers, and other values that have a specific format. For the country data types, valid values are the better fit because the country names and abbreviations are clearly defined and change very slowly over time.
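The two kinds of additional constraint mentioned here can be sketched as follows. The pattern notation (`9` for a digit, `A` for a letter, anything else literal), the function names, the case-insensitive lookup, and the tiny sample of country abbreviations are all assumptions for illustration, not the tool's actual syntax or reference data.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Compile an assumed 9/A pattern notation into a regular expression."""
    parts = []
    for ch in pattern:
        if ch == "9":
            parts.append("[0-9]")
        elif ch == "A":
            parts.append("[A-Za-z]")
        else:
            parts.append(re.escape(ch))  # literal separator such as "-"
    return re.compile("".join(parts))


# Pattern constraint: suits values with a fixed format.
ssn_pattern = pattern_to_regex("999-99-9999")


def matches_ssn_format(value: str) -> bool:
    return ssn_pattern.fullmatch(value) is not None


# Valid values constraint: suits a closed, slowly changing list.
# Deliberately incomplete sample, not a full reference list.
COUNTRY_ABBREVIATIONS_2 = {"US", "CA", "MX", "DE"}


def is_country_abbreviation(value: str) -> bool:
    return value.upper() in COUNTRY_ABBREVIATIONS_2


print(matches_ssn_format("123-45-6789"))
print(is_country_abbreviation("ca"))
```

A pattern only checks the shape of a value, so it accepts any well-formed string, while a valid values list accepts only known members. That difference is the trade-off behind choosing one or the other.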

Data classification modeling requires intimate knowledge of the data in order to create the proper data constraints. To gain that knowledge, the data modeler can use the classification engine to expose what the data actually contains before modeling the data classes.

How is this achieved? Base system data classes are provided to ensure no values can fall through the classification process. The base system data classes provide very basic classifications such as alpha, alphanumeric, Boolean, character, decimal, integer, and money. The intention is to ensure that all forms of data are covered. The character base system data class provides coverage for values that contain punctuation and/or symbols with alpha, alphanumeric, or numeric characters. It also provides coverage for unexpected garbage characters.

The data modeler simply runs the classification engine against a data source before modeling the new data classes. Any values that are not classified by the modeled data classes are classified by the system data classes instead. The data modeler can then analyze these values and use them to automatically forward-engineer new classification data types.
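The fallback behavior can be sketched as a two-stage lookup: modeled data classes first, then the base system data classes, with character as the catch-all. The regular expressions, their ordering, the Boolean and money conventions, and the function names below are assumptions for illustration; the article names the base classes but does not define them.

```python
import re
from collections import defaultdict

# Assumed definitions, checked in order from most to least specific.
BASE_SYSTEM_CLASSES = [
    ("Boolean", re.compile(r"true|false|yes|no|y|n", re.IGNORECASE)),
    ("integer", re.compile(r"[+-]?[0-9]+")),
    ("decimal", re.compile(r"[+-]?[0-9]*\.[0-9]+")),
    ("money", re.compile(r"[$€£] ?[0-9]+(?:,[0-9]{3})*(?:\.[0-9]{2})?")),
    ("alpha", re.compile(r"[A-Za-z]+")),
    ("alphanumeric", re.compile(r"[A-Za-z0-9]+")),
]


def classify(value: str, modeled_classes: list) -> str:
    """Return the first modeled class that accepts the value, else a base class."""
    for name, accepts in modeled_classes:
        if accepts(value):
            return name
    for name, regex in BASE_SYSTEM_CLASSES:
        if regex.fullmatch(value):
            return name
    return "character"  # catch-all, so no value falls through


def residue_by_base_class(values: list, modeled_classes: list) -> dict:
    """Group the values that only a base system class could classify."""
    modeled_names = {name for name, _ in modeled_classes}
    groups = defaultdict(list)
    for value in values:
        label = classify(value, modeled_classes)
        if label not in modeled_names:
            groups[label].append(value)
    return dict(groups)


# One hypothetical modeled class; everything else falls to the base classes.
modeled = [("Country Abbreviation (2)", lambda v: v in {"US", "CA"})]
sample = ["US", "Montana", "42", "3.14", "$1,234.50", "A1B2", "yes", "a@b.c#"]
residue = residue_by_base_class(sample, modeled)
print(residue)
```

The grouped residue is the material the data modeler would review when deciding which new data types to model next.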

The goal is to provide the data modeler with the ability to analyze the data in order to understand what values are present for modeling new data classes. This approach provides the data modeler with the intimate knowledge necessary to create new data classes.

This is the second article in a series describing the evolution of data classification to address all types of data, not just sensitive data. The next article will continue to explain the modeling process and dig deeper into the details.

All Rights Reserved by Complete Data Quality, Inc. ©2025

