Modeling Data Classes
Modernizing data classification to address all types of data requires a more advanced data structure for the data classes. This is necessary to address all of the different variations of values that represent the same business information. A simple example would be the possible variations of country values. The country values may come in the form of country names, two-character abbreviations, or even three-character abbreviations. All of these variations need to be addressed by the country data class.
To achieve this goal, the data structure is expanded to define data types. Our country data class would have three country data types, one for each variation: country names, two-character abbreviations, and three-character abbreviations.
However, this data structure has a problem: in some circumstances a data type may align with multiple data classes, creating a many-to-many relationship. For example, U.S. state names can also be first names, like Alabama, or last names, like Montana. A class type is introduced to resolve the many-to-many relationship.
The diagram above depicts the relationships as viewed from the data class perspective, but keep in mind that a data type may be associated with multiple class types. The class types manage the relationship between a data class and the data types. In other words, the class type identifies the data types that are valid for the data class.
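To make the relationships concrete, here is a minimal sketch in Python. It assumes a class type can be modeled as a simple link record between one data class and one data type. All names (`ClassType`, `data_types_for`, `data_classes_for`, and the sample class and type names) are hypothetical and chosen only for illustration; they are not taken from any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassType:
    """Hypothetical link record: data_type is valid for data_class."""
    data_class: str
    data_type: str


# Illustrative links only. One data class has several data types, and
# one data type ("US State Name") is linked to more than one data class.
CLASS_TYPES = [
    ClassType("Country", "Country Name"),
    ClassType("Country", "Country 2-Char Code"),
    ClassType("Country", "Country 3-Char Code"),
    ClassType("US State", "US State Name"),
    ClassType("First Name", "US State Name"),
]


def data_types_for(data_class: str) -> list[str]:
    """Data types that are valid for a data class."""
    return [ct.data_type for ct in CLASS_TYPES if ct.data_class == data_class]


def data_classes_for(data_type: str) -> list[str]:
    """Data classes that a data type is linked to."""
    return [ct.data_class for ct in CLASS_TYPES if ct.data_type == data_type]
```

With these sample links, `data_types_for("Country")` returns the three country data types, while `data_classes_for("US State Name")` returns both `"US State"` and `"First Name"`, which is the many-to-many case the class type is meant to resolve.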
When creating a new data class, the data modeler first creates the data types. A data type identifies the data constraints specific to a set of data. However, a classification data type requires more data constraints than the traditional data type and length defined for a column in a relational database. The data constraints must be more specific in order to classify values properly.
Instead of a data type like Boolean, character, decimal, integer, etc., the classification data type defines allowed and required characters. The characters can be alpha, numeric, punctuation, symbols, or spaces. The allowed and required characters can be one of the character types or a combination of character types. If the allowed and required characters include punctuation or symbols, then the modeler defines which punctuation or symbols are allowed as well. The length data constraint is a length range, rather than a fixed length.
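The basic constraints described above could be sketched as follows. This is an illustration under stated assumptions, not the actual engine: the `BasicConstraints` class, the `char_type` helper, and the sample length ranges are all hypothetical, and the split between punctuation and symbols follows Unicode general categories, which may differ from how a real classification tool draws that line.

```python
import unicodedata
from dataclasses import dataclass


def char_type(ch: str) -> str:
    """Map one character to a character type (assumed taxonomy)."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "numeric"
    if ch.isspace():
        return "space"
    category = unicodedata.category(ch)
    if category.startswith("P"):
        return "punctuation"
    if category.startswith("S"):
        return "symbol"
    return "other"


@dataclass(frozen=True)
class BasicConstraints:
    allowed: frozenset        # character types that may appear
    required: frozenset       # character types that must appear
    min_len: int              # length is a range, not a fixed size
    max_len: int
    allowed_marks: frozenset = frozenset()  # specific punctuation/symbols

    def matches(self, value: str) -> bool:
        if not (self.min_len <= len(value) <= self.max_len):
            return False
        seen = set()
        for ch in value:
            kind = char_type(ch)
            if kind not in self.allowed:
                return False
            if kind in ("punctuation", "symbol") and ch not in self.allowed_marks:
                return False
            seen.add(kind)
        return self.required <= seen


# Illustrative data types; the length ranges are made-up sample values.
country_code_2 = BasicConstraints(
    allowed=frozenset({"alpha"}), required=frozenset({"alpha"}),
    min_len=2, max_len=2,
)
country_name = BasicConstraints(
    allowed=frozenset({"alpha", "space", "punctuation"}),
    required=frozenset({"alpha"}),
    min_len=4, max_len=60,
    allowed_marks=frozenset({"'", "-", "."}),
)
```

Here `country_code_2.matches("US")` passes, while `"U1"` fails on an unallowed character type and `"USA"` fails on the length range.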
These are the basic data constraints that are required for the data type, whether the data type is intended for structured or unstructured data. The data modeler may define additional data constraints such as patterns, value patterns, ranges, valid values, and more. Be aware that there are a handful of data constraints specifically used for classifying values in unstructured data that are not available or used for classifying structured data.
For either structured or unstructured data, the basic data constraints need to be as accurate as possible in order to classify data properly. To ensure the accuracy, the data modeler feeds valid values to the modeling tool to infer the basic data constraints and the patterns constraint. This approach accelerates the modeling process and increases the accuracy of the data constraints.
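One way such inference might work is sketched below. It is a guess at the general idea, not the modeling tool's actual algorithm: the function names, the returned dictionary keys, and the pattern notation (`A` for a letter, `9` for a digit, everything else kept literally) are assumptions made for this example.

```python
def char_type(ch: str) -> str:
    """Simplified character typing (assumed taxonomy)."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "numeric"
    if ch.isspace():
        return "space"
    return "punctuation_or_symbol"


def to_pattern(value: str) -> str:
    """Replace letters with 'A' and digits with '9'; keep other characters."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)


def infer_constraints(values: list[str]) -> dict:
    """Infer basic constraints and patterns from known-valid sample values."""
    if not values:
        raise ValueError("at least one valid value is needed")
    kinds_per_value = [{char_type(ch) for ch in v} for v in values]
    lengths = [len(v) for v in values]
    return {
        # allowed: every character type seen in any sample
        "allowed": set().union(*kinds_per_value),
        # required: character types present in every sample
        "required": set.intersection(*kinds_per_value),
        # the specific punctuation or symbols that were seen
        "marks": {ch for v in values for ch in v
                  if char_type(ch) == "punctuation_or_symbol"},
        "min_len": min(lengths),
        "max_len": max(lengths),
        "patterns": sorted({to_pattern(v) for v in values}),
    }
```

Feeding in two made-up values shaped like social security numbers, `infer_constraints(["123-45-6789", "987-65-4321"])`, yields a single pattern `999-99-9999`, a length range of 11 to 11, and `-` as the only allowed mark. Feeding more values widens the ranges and adds patterns, which is why adjusting constraints by supplying values can be less error-prone than editing them by hand.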
The data modeler can manually customize the inferred data constraints, but the recommended way to make adjustments is to feed more values to the modeling tool. The data modeler may also need to add constraints in order to classify a specific set of values correctly. For example, pattern constraints suit phone numbers, social security numbers, and other values that have a specific format. For the country data types, valid values are a better fit because the country names and abbreviations are clearly defined and change very slowly over time.
Data classification modeling requires intimate knowledge of the data in order to create the proper data constraints. To gain that knowledge, the data modeler can use the classification engine itself to explore the data before modeling the data classes.
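The difference between the two kinds of constraint can be shown with a short sketch. The regular expression, the helper names, and the three-entry code list are assumptions for illustration; a real valid-values list would hold the full set of country codes.

```python
import re

# Pattern constraint: fits values with a fixed format.
# Assumed format for illustration: three digits, two digits, four digits.
SSN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

# Valid-values constraint: fits a closed, slowly changing set.
# Tiny illustrative subset, not a complete list of country codes.
COUNTRY_CODES_2 = frozenset({"US", "CA", "MX"})


def matches_pattern(value: str, pattern: re.Pattern) -> bool:
    """True only when the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None


def is_valid_value(value: str, valid_values: frozenset) -> bool:
    """Case-insensitive membership test against the valid values."""
    return value.strip().upper() in valid_values
```

A pattern accepts any value of the right shape, so `matches_pattern("123-45-6789", SSN_PATTERN)` is true for any digits in that layout. A valid-values check is stricter: `is_valid_value("us", COUNTRY_CODES_2)` is true, but `"ZZ"` is rejected even though it has the right shape for a two-character code.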
How is this achieved? Base system data classes are provided to ensure no values can fall through the classification process. The base system data classes provide very basic classifications such as alpha, alphanumeric, Boolean, character, decimal, integer, and money. The intention is to ensure that all forms of data are covered. The character base system data class provides coverage for values that contain punctuation and/or symbols with alpha, alphanumeric, or numeric characters. The character base system class will also provide coverage for unexpected garbage characters.
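A rough sketch of such a catch-all set of base classes follows. The class names come from the list above, but the rules behind them are assumptions made for this example: which words count as Boolean, the U.S.-style dollar format for money, and the order in which the checks run are all hypothetical.

```python
import re

_BOOLEAN_WORDS = {"true", "false", "yes", "no"}   # assumed vocabulary
_INTEGER = re.compile(r"[+-]?[0-9]+")
_DECIMAL = re.compile(r"[+-]?[0-9]*\.[0-9]+")
# Assumed money format: dollar sign, optional thousands separators,
# optional two-digit cents.
_MONEY = re.compile(
    r"-?\$[0-9]{1,3}(?:,[0-9]{3})*(?:\.[0-9]{2})?"
    r"|-?\$[0-9]+(?:\.[0-9]{2})?"
)


def base_system_class(value: str) -> str:
    """Return the first base class that fits; 'character' catches the rest."""
    if value.lower() in _BOOLEAN_WORDS:
        return "Boolean"
    if _INTEGER.fullmatch(value):
        return "integer"
    if _DECIMAL.fullmatch(value):
        return "decimal"
    if _MONEY.fullmatch(value):
        return "money"
    if value.isalpha():
        return "alpha"
    if value.isalnum():
        return "alphanumeric"
    # Anything with punctuation, symbols, spaces, or unexpected
    # characters lands here, so no value goes unclassified.
    return "character"
```

Because the final branch has no condition, every input receives some class: `base_system_class("A1B2")` gives `"alphanumeric"`, while `base_system_class("a@b#c")` falls through to `"character"`.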
The data modeler simply runs the classification engine against a data source before modeling the new data classes. For any values that are not classified using modeled data classes, the system data classes are used to classify the values. The data modeler is able to analyze these values and actually use them to automatically forward engineer new classification data types.
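The workflow in this step might look like the sketch below. It assumes modeled data classes can be represented as simple predicates and that the system data classes are supplied as a fallback function; `classify_source`, `simple_fallback`, and the sample values are hypothetical and stand in for the real classification engine.

```python
from collections import defaultdict
from typing import Callable


def classify_source(
    values: list[str],
    modeled: dict[str, Callable[[str], bool]],
    fallback: Callable[[str], str],
) -> tuple[list[tuple[str, list[str]]], dict[str, list[str]]]:
    """Classify each value with modeled classes, else with the fallback.

    Returns the per-value results plus the values that only a system
    class could classify, grouped by that system class for review.
    """
    results = []
    for_review = defaultdict(list)
    for value in values:
        hits = [name for name, test in modeled.items() if test(value)]
        if hits:
            results.append((value, hits))
        else:
            base = fallback(value)
            results.append((value, [base]))
            for_review[base].append(value)
    return results, dict(for_review)


def simple_fallback(value: str) -> str:
    """Much-reduced stand-in for the base system data classes."""
    if value.isdigit():
        return "integer"
    if value.isalpha():
        return "alpha"
    return "character"


# Illustrative run with one modeled class and made-up source values.
modeled_classes = {"Country 2-Char Code": lambda v: v in {"US", "CA"}}
results, for_review = classify_source(
    ["US", "555-0100", "Montana", "42"], modeled_classes, simple_fallback
)
```

In this run only `"US"` is claimed by a modeled class. The other three values are grouped under `for_review` by system class, and those groups are the candidate samples the modeler would analyze when creating new data types.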
The goal is to provide the data modeler with the ability to analyze the data in order to understand what values are present for modeling new data classes. This approach provides the data modeler with the intimate knowledge necessary to create new data classes.
This is the second article in a series describing the evolution of data classification to address all types of data, not just sensitive data. The next article will continue to explain the modeling process and dig deeper into the details.
All Rights Reserved by Complete Data Quality, Inc. ©2025