Modeling Data Classes

Antonio Amorin

President & Data System Programmer at Complete Data Quality

Published: February 10, 2025

Modernizing data classification to address all types of data requires a more advanced data structure for the data classes. A data class must cover every variation of values that represent the same business information. A simple example is country values, which may arrive as country names, two-character abbreviations, or three-character abbreviations. The country data class needs to address all of these variations.

To achieve this goal, the data structure is expanded so that a data class defines one or more data types. The country data class would have three data types, one for each variation: country names, two-character abbreviations, and three-character abbreviations.

However, this data structure has a problem: in some circumstances a data type may align to multiple data classes, creating a many-to-many relationship. For example, U.S. state names can also be first names, like Alabama, or even last names, like Montana. A class type is introduced to resolve the many-to-many relationship.

The diagram above depicts the relationships as viewed from the data class perspective, but keep in mind that a data type may be associated with multiple class types. The class types manage the relationship between a data class and its data types. In other words, the class type identifies the data types that are valid for the data class.
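The relationship between data classes, class types, and data types can be sketched in a few lines of Python. This is only an illustration of the idea, not the product's actual schema: the `Catalog` and `ClassType` names, the string identifiers, and the shared `capitalized_word` data type are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ClassType:
    """One valid pairing of a data class with a data type (hypothetical shape)."""
    data_class: str
    data_type: str


@dataclass
class Catalog:
    """Holds the class types, i.e. the link rows that resolve the many-to-many."""
    class_types: set = field(default_factory=set)

    def link(self, data_class: str, data_type: str) -> None:
        self.class_types.add(ClassType(data_class, data_type))

    def types_for(self, data_class: str) -> set:
        # Data types that are valid for one data class.
        return {ct.data_type for ct in self.class_types
                if ct.data_class == data_class}

    def classes_for(self, data_type: str) -> set:
        # Data classes that share one data type.
        return {ct.data_class for ct in self.class_types
                if ct.data_type == data_type}


catalog = Catalog()

# The country data class has three data types, one per variation.
for dt in ("country_name", "country_alpha2", "country_alpha3"):
    catalog.link("country", dt)

# An assumed shared data type that aligns to several data classes.
for dc in ("us_state", "first_name", "last_name"):
    catalog.link(dc, "capitalized_word")

print(sorted(catalog.types_for("country")))
print(sorted(catalog.classes_for("capitalized_word")))
```

Because each class type is its own record, a data type can be attached to a second data class without copying or changing the data type itself.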

When creating a new data class, the data modeler first creates the data types. A data type identifies the data constraints specific to a set of data. However, a classification data type requires more data constraints than the traditional data type and length defined for a column in a relational database. The data constraints must be more specific in order to classify values properly.

Instead of a data type like Boolean, character, decimal, integer, etc., the classification data type defines allowed and required characters. The characters can be alpha, numeric, punctuation, symbols, or spaces. The allowed and required characters can be one of the character types or a combination of character types. If the allowed and required characters include punctuation or symbols, then the modeler defines which punctuation or symbols are allowed as well. The length data constraint is a length range, rather than a fixed length.
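As a rough sketch of how these basic data constraints might be checked against a value, consider the Python below. The `BasicConstraints` class, the `char_kind` helper, the split between punctuation and symbols, and the phone number settings are assumptions chosen for illustration; the article does not specify how the engine represents them.

```python
from dataclasses import dataclass

# Assumed split: these marks count as punctuation, any other mark is a symbol.
PUNCTUATION = frozenset(".,;:!?'\"-()")


def char_kind(ch: str) -> str:
    """Map one character to alpha, numeric, space, punctuation, or symbol."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "numeric"
    if ch == " ":
        return "space"
    return "punctuation" if ch in PUNCTUATION else "symbol"


@dataclass(frozen=True)
class BasicConstraints:
    allowed: frozenset        # character kinds that may appear
    required: frozenset       # character kinds that must appear
    allowed_marks: frozenset  # the specific punctuation/symbols permitted
    min_len: int              # length is a range, not a fixed length
    max_len: int

    def accepts(self, value: str) -> bool:
        if not self.min_len <= len(value) <= self.max_len:
            return False
        seen = set()
        for ch in value:
            kind = char_kind(ch)
            if kind not in self.allowed:
                return False
            if kind in ("punctuation", "symbol") and ch not in self.allowed_marks:
                return False
            seen.add(kind)
        # Every required kind must be present at least once.
        return self.required <= seen


# Hypothetical data type for U.S. phone numbers.
phone = BasicConstraints(
    allowed=frozenset({"numeric", "punctuation", "space"}),
    required=frozenset({"numeric"}),
    allowed_marks=frozenset("()-"),
    min_len=10,
    max_len=14,
)

print(phone.accepts("(555) 123-4567"))  # True
print(phone.accepts("555.123.4567"))    # False: "." is not an allowed mark
```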

These are the basic data constraints that are required for the data type, whether the data type is intended for structured or unstructured data. The data modeler may define additional data constraints such as patterns, value patterns, ranges, valid values, and more. Be aware that there are a handful of data constraints specifically used for classifying values in unstructured data that are not available or used for classifying structured data.

For either structured or unstructured data, the basic data constraints need to be as accurate as possible in order to classify data properly. To ensure accuracy, the data modeler feeds valid values to the modeling tool, which infers the basic data constraints and the patterns constraint from them. This approach accelerates the modeling process and increases the accuracy of the data constraints.
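The inference step could look something like the following Python sketch. It is a simplified guess at the technique, not the modeling tool's algorithm: the `infer_constraints` function, the pattern notation (`A` for alpha, `9` for numeric, marks kept literally), and the single `mark` kind that lumps punctuation and symbols together are all assumptions.

```python
def char_kind(ch: str) -> str:
    """Coarse character kind; punctuation and symbols are lumped as 'mark'."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "numeric"
    if ch == " ":
        return "space"
    return "mark"


def to_pattern(value: str) -> str:
    """Replace letters with 'A' and digits with '9'; keep other characters."""
    out = []
    for ch in value:
        kind = char_kind(ch)
        out.append("A" if kind == "alpha" else "9" if kind == "numeric" else ch)
    return "".join(out)


def infer_constraints(values: list) -> dict:
    """Infer basic constraints and patterns from a list of valid values."""
    if not values:
        raise ValueError("at least one valid value is required")
    kind_sets = [{char_kind(ch) for ch in v} for v in values]
    lengths = [len(v) for v in values]
    return {
        # Allowed: any kind seen in at least one value.
        "allowed": set.union(*kind_sets),
        # Required: kinds seen in every value.
        "required": set.intersection(*kind_sets),
        "marks": {ch for v in values for ch in v if char_kind(ch) == "mark"},
        "length_range": (min(lengths), max(lengths)),
        "patterns": {to_pattern(v) for v in values},
    }


inferred = infer_constraints(["123-45-6789", "987-65-4321"])
print(inferred["patterns"])      # {'999-99-9999'}
print(inferred["length_range"])  # (11, 11)
```

Feeding more representative values widens the allowed kinds and the length range and narrows the required kinds, which is one reason a larger sample of valid values tends to produce more accurate constraints.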

The data modeler can manually customize the inferred data constraints, but the recommended way to make adjustments is to feed further values to the modeling tool. The data modeler may also need to add constraints in order to classify a specific set of values correctly. For example, pattern constraints suit phone numbers, social security numbers, and other values that have a specific format. For the country data types, valid values are the better fit because country names and abbreviations are clearly defined and change very slowly over time.
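The two kinds of additional constraint mentioned here can be contrasted in a short Python sketch. The regular expression, the function names, and the three-entry country list are illustrative assumptions only; a real valid-values constraint would hold the complete list of codes.

```python
import re

# Pattern constraint: suits values with a fixed format, such as a
# U.S. social security number written as 999-99-9999.
SSN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")


def matches_ssn_pattern(value: str) -> bool:
    return SSN_PATTERN.fullmatch(value) is not None


# Valid values constraint: suits small, slowly changing sets.
# Deliberately tiny subset for illustration, not the full code list.
COUNTRY_ALPHA2 = frozenset({"US", "CA", "MX"})


def is_country_alpha2(value: str) -> bool:
    return value.strip().upper() in COUNTRY_ALPHA2


print(matches_ssn_pattern("123-45-6789"))  # True
print(matches_ssn_pattern("123456789"))    # False
print(is_country_alpha2("ca"))             # True
print(is_country_alpha2("ZZ"))             # False
```

A pattern accepts any value with the right shape, while a valid-values list accepts only known members, so "ZZ" has the shape of a two-character country code and is still rejected by the list.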

Data classification modeling requires intimate knowledge of the data in order to create the proper data constraints. To gain that knowledge, the data modeler can use the classification engine itself to expose what values the data actually contains before modeling the data classes.

How is this achieved? Base system data classes are provided to ensure no values can fall through the classification process. The base system data classes provide very basic classifications such as alpha, alphanumeric, Boolean, character, decimal, integer, and money. The intention is to ensure that all forms of data are covered. The character base system data class provides coverage for values that contain punctuation and/or symbols with alpha, alphanumeric, or numeric characters. The character base system class will also provide coverage for unexpected garbage characters.

The data modeler simply runs the classification engine against a data source before modeling the new data classes. Any values that are not classified by a modeled data class are classified by the system data classes instead. The data modeler can then analyze these values and use them to forward engineer new classification data types automatically.
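The fallback behavior described above can be sketched in Python as follows. The checks for each base system data class, the order in which they are tried, and the `classify` and `unmodeled_values` helpers are assumptions made for this example; the article names the base classes but does not define their rules.

```python
import re
from collections import defaultdict

# Assumed rules and order for the base system data classes.
BASE_SYSTEM_CLASSES = [
    ("boolean", lambda v: v.lower() in {"true", "false"}),
    ("integer", lambda v: re.fullmatch(r"[+-]?[0-9]+", v) is not None),
    ("decimal", lambda v: re.fullmatch(r"[+-]?[0-9]*\.[0-9]+", v) is not None),
    ("money", lambda v: re.fullmatch(r"\$[0-9]+(\.[0-9]{2})?", v) is not None),
    ("alpha", str.isalpha),
    ("alphanumeric", str.isalnum),
]


def classify(value: str, modeled: dict) -> tuple:
    """Return (class name, is_system_class) for one value."""
    for name, test in modeled.items():
        if test(value):
            return name, False
    for name, test in BASE_SYSTEM_CLASSES:
        if test(value):
            return name, True
    # Catch-all so no value falls through the classification process.
    return "character", True


def unmodeled_values(values: list, modeled: dict) -> dict:
    """Group values that only a base system data class could classify."""
    groups = defaultdict(list)
    for v in values:
        name, is_system = classify(v, modeled)
        if is_system:
            groups[name].append(v)
    return dict(groups)


# One hypothetical modeled data class; everything else falls back.
modeled = {"country_alpha2": lambda v: v in {"US", "CA", "MX"}}
sample = ["US", "Alabama", "42", "$12.50", "A1B2", "a@b#", "true", "3.14"]
review = unmodeled_values(sample, modeled)
print(review)
```

The grouped output is the material the data modeler would review, and the values in each group are candidates for feeding back into the modeling tool to define new data types.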

The goal is to provide the data modeler with the ability to analyze the data in order to understand what values are present for modeling new data classes. This approach provides the data modeler with the intimate knowledge necessary to create new data classes.

This is the second article in a series describing the evolution of data classification to address all types of data, not just sensitive data. The next article will continue to explain the modeling process and dig deeper into the details.
