Data Integrity - a more detailed Data Science perspective
Darko Medin
Data Scientist and a Biostatistician. Developer of ML/AI models. Researcher in the fields of Biology and Clinical Research. Helping companies with Digital products, Artificial intelligence, Machine Learning.
Data integrity is one of the most important segments in today’s research. Data validity, structure quality, credibility, consistency, completeness and timeliness are some of the well-known concepts, but there are many less discussed ones such as storage, software, code and confounders. In this article a more detailed Data Science perspective will be shared - subject matter checks, Meta-data checks, Bivariate/Multivariate checks, Covariate checks, keycode checks, data storage validation, Data Resolution and Structure.
1. Data Validation ~
Data validation is not just removing duplicates and missing values and making sure the values are valid in terms of range, type etc., as many think (even though these are, of course, a necessary part of data validation).
Data must meet a much wider spectrum of well-defined rules to be considered valid. A Validation Criteria blueprint is one of the most essential segments of ensuring data validity and is ideally predefined. Data validity criteria should ensure that the data represents the subject matter, so a lot of communication between subject matter experts and Statisticians is needed to define the rules for the credibility and representative power of the data. Data should be technically acceptable (data type, range, uniqueness and many other technical criteria discussed in the text), original, unchanged, contextually credible and timely.
From my experience, the key for these criteria is something I would call ‘Separation Power’: how effective a criterion is in separating the wanted from the unwanted data.
Concise and accurate Meta-data is one of the keys for this. Criteria should be sound and clear, without any areas of doubt (such areas of doubt can create inaccuracies). Domain knowledge experts and Statisticians should both be included in this process. Important Data Science perspective ~ a Data Scientist should always think about how to automate the data validation process, be able to program this process correctly, and add failsafe mechanisms that will act in case the validation process itself produces errors.
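As an illustration, here is a minimal sketch of such an automated validation step, assuming a pandas DataFrame and a hypothetical rule set (the column names `subject_id` and `age` and the thresholds are illustrative only); each rule is wrapped in a failsafe so a broken rule is logged rather than crashing the whole process.

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("validation")

# Hypothetical rule set: each rule returns a boolean mask of valid rows.
RULES = {
    "age_in_range": lambda df: df["age"].between(0, 120),
    "id_not_null": lambda df: df["subject_id"].notna(),
}

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Apply each rule; a failing rule is logged instead of stopping the pipeline."""
    valid = pd.Series(True, index=df.index)
    for name, rule in RULES.items():
        try:
            mask = rule(df)
            log.info("%s: %d rows flagged", name, int((~mask).sum()))
            valid &= mask
        except Exception as err:   # failsafe: report the broken rule, keep going
            log.error("rule %s failed: %s", name, err)
    return df[valid]

# Toy usage
df = pd.DataFrame({"subject_id": [1, 2, None], "age": [34, 150, 41]})
print(validate(df))
```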
Just as Meta-data can be used to validate observations, covariate data, especially correlated covariates, can be used to validate the data too. The number of covariate variables increases the spectrum of validation possibilities. This type of check is referred to here as a Bivariate or Multivariate check rule.
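A small sketch of this idea, with made-up columns: two logically linked variables (systolic/diastolic pressure, admission/discharge dates) are checked against each other, so one variable validates the other.

```python
import pandas as pd

# Toy example: logically linked variables can validate each other.
# Diastolic pressure should always be lower than systolic pressure, and a
# discharge date can never precede the admission date (bivariate check rules).
df = pd.DataFrame({
    "systolic":   [120, 135, 70, 128],
    "diastolic":  [80,  88,  95, 84],
    "admitted":   pd.to_datetime(["2023-01-02", "2023-01-05", "2023-01-07", "2023-01-09"]),
    "discharged": pd.to_datetime(["2023-01-06", "2023-01-04", "2023-01-10", "2023-01-12"]),
})

df["bp_inconsistent"]   = df["systolic"] <= df["diastolic"]    # third row fails
df["date_inconsistent"] = df["discharged"] < df["admitted"]    # second row fails
print(df[["bp_inconsistent", "date_inconsistent"]])
```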
Defined key codes that can validate/invalidate any observation are among the most effective techniques for iterating different rules through the data. These key codes may serve as keys in relational databases, mostly in SQL format, and sometimes as dictionaries or hash tables. A different approach is to use key codes to identify a certain sequence of letters or numbers in the data. This type of validation makes it possible to validate only the observations that contain the keycode.
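A possible sketch of keycode-based validation, assuming a hypothetical key format (three letters, a dash, four digits) and a hash table of registered codes; only observations whose identifier is both well formed and registered pass.

```python
import pandas as pd

# Hypothetical key-code pattern: site code (3 letters) + dash + 4-digit number.
PATTERN = r"^[A-Z]{3}-\d{4}$"

# A hash table (set) of key codes that are known to be registered.
registered = {"BEL-0001", "BEL-0002", "ZAG-0107"}

ids = pd.Series(["BEL-0001", "bel-0002", "ZAG-0107", "ZAG-9999"])

well_formed = ids.str.match(PATTERN)   # format check on the key code
known       = ids.isin(registered)     # membership check against the key store
valid       = well_formed & known

print(pd.DataFrame({"id": ids, "well_formed": well_formed,
                    "known": known, "valid": valid}))
```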
2. Data Quality~
A well-known aspect is, of course, Accuracy. Accuracy itself is a very fluid concept. Measurement accuracy is an important part of the validation concept. If the data is not complete, it will affect Accuracy. If there is too much noise, it will also affect Accuracy. Making sure that the data is as complete as possible and has the lowest optimal level of noise is essential (noise can sometimes help, but not at the data validation stage).
One very often overlooked aspect is Data resolution. The same way an image or a computer screen has a certain resolution in megapixels, other types of data can have a resolution: numerical data has a number of decimals, audio data has harmonics, and biomedical data can have detection limits.
Statistical Perspective - The higher the resolution, the higher the ability to detect small details, which often play a critical role. To translate this into statistical terms, lower resolution data may produce a lot of bias and Type II errors when the effect sizes in comparisons are very small. For categorical data, resolution mostly comes from Meta-data: the more detailed the Meta-data, the higher the resolution when placing a certain observation in a defined category. It’s important to know that Type II errors can arise from sample size, but even if the sample size is good, low resolution might cause a slight increase in these errors.
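To illustrate the statistical point, a small simulation with assumed parameters (n = 50 per group, a small true effect of 0.3 SD, and a coarse measurement step of 2.0) compares the empirical power of a t-test on full-resolution data against the same data rounded to the coarse step; the rounded version typically shows slightly lower power, i.e. a slight increase in Type II errors.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, effect, sims, alpha = 50, 0.3, 2000, 0.05
step = 2.0   # coarse measurement resolution (assumed for illustration)

rejections_full, rejections_coarse = 0, 0
for _ in range(sims):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effect, 1.0, n)
    # same comparison, once at full resolution and once after coarse rounding
    rejections_full   += stats.ttest_ind(a, b).pvalue < alpha
    a_r, b_r = np.round(a / step) * step, np.round(b / step) * step
    rejections_coarse += stats.ttest_ind(a_r, b_r).pvalue < alpha

print("power, full resolution  :", rejections_full / sims)
print("power, coarse resolution:", rejections_coarse / sims)
```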
Data Quality - The level of information is at its highest in the clean but raw form, meaning the data is cleaned of unacceptable types of data but still untransformed. Transforming the data might be needed for further analytics, but it is important to know that most transformation techniques reduce (generally slightly) the level of information in the data. That is one of the reasons why the original data is the most credible.
3. Data completeness ~
Most researchers consider this aspect in terms of how many missing values are present and the coverage of the features, but there is much more to it. Unique indexes are also very important to have present in the data. The relational databases mostly in use today rely on these indexes to connect unique values across different tables, and they are needed to avoid duplicates too. Topics already discussed, such as full-context Meta-data and potential correlated covariates, might be among the factors that determine whether the data is complete or not.
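A minimal completeness report along these lines might look as follows (column names are illustrative): missing values per column, feature coverage, and a check that the key column is actually unique.

```python
import pandas as pd

# Toy data with one duplicated key and two missing cells.
df = pd.DataFrame({
    "subject_id": [101, 102, 102, 104],
    "biomarker":  [1.2, None, 3.4, 2.8],
    "visit_date": ["2023-01-01", "2023-02-01", None, "2023-03-01"],
})

report = {
    "missing_per_column":  df.isna().sum().to_dict(),
    "coverage_per_column": (1 - df.isna().mean()).round(2).to_dict(),
    "id_is_unique":        df["subject_id"].is_unique,
    "duplicate_ids":       df.loc[df["subject_id"].duplicated(keep=False),
                                  "subject_id"].tolist(),
}
print(report)
```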
Data Science perspective – When cleaning data, one of the most important aspects is dealing with missing values. Generally speaking, missing values can either be deleted using a column-wise or row-wise rule, or sometimes kept as acceptable for certain statistical tests, so there is no single general rule and each situation is different. For example, most PCA implementations require deletion of rows with missing values, but t-tests can easily be performed with missing values present.
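A short sketch of this difference with toy data: scikit-learn’s PCA needs the rows with missing values removed first, while SciPy’s t-test can simply omit them via `nan_policy="omit"`.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "x2": [2.1, 3.9, 6.2, np.nan, 10.1, 12.2],
    "group": ["a", "a", "a", "b", "b", "b"],
})

# PCA cannot handle NaN: row-wise deletion is required first.
complete = df[["x1", "x2"]].dropna()
scores = PCA(n_components=1).fit_transform(complete)

# A t-test, by contrast, can simply omit the missing values.
a = df.loc[df["group"] == "a", "x1"]
b = df.loc[df["group"] == "b", "x1"]
t, p = stats.ttest_ind(a, b, nan_policy="omit")

print(scores.ravel(), t, p)
```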
However, Meta-data or other types of contextual data can also be missing for some observations, and in my opinion such an observation is a highly probable candidate for removal.
Multivariate completeness. Sometimes the validity of the data is defined by the logic of other variables. One good example is a blood marker and the parameters of blood concentration. Depending on the factors of blood concentration and the amount of H2O in serum, blood marker concentrations might vary, so to be very accurate, all of these must be taken into account before the data for blood biomarkers can be considered fully complete.
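A sketch of such a multivariate completeness rule, with hypothetical column names: a biomarker value is only flagged as complete when the covariates needed to interpret it are present as well.

```python
import pandas as pd

# Hypothetical rule: a blood-marker value only counts as "complete" when the
# covariates needed to interpret it (e.g. hematocrit, serum water content)
# are also present - multivariate rather than per-cell completeness.
df = pd.DataFrame({
    "marker_ng_ml": [5.2, 4.8, 6.1],
    "hematocrit":   [0.44, None, 0.41],
    "serum_water":  [0.92, 0.93, None],
})

required = ["marker_ng_ml", "hematocrit", "serum_water"]
df["marker_complete"] = df[required].notna().all(axis=1)
print(df)
```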
4. Data integrity vs outliers ~
The principle of removing outliers just because they are outliers is incorrect. Removal of outliers can only be done correctly if other data integrity principles are breached (e.g. not representing the measurement, or being inaccurate). Outliers are just values which deviate from the majority of other values, and this deviation could arise from natural and valid variation in the data. Incorrect removal of outliers can bias the parametric central tendency and, instead of making the data valid for tests, make it invalid for many tests. When outliers are just a normal part of variation, the solution is to use tests which are insensitive to outliers, not to remove them.
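As a small illustration with toy numbers, a rank-based test such as the Mann-Whitney U uses a legitimately extreme value without being dominated by it, whereas removing that value would shift the group mean.

```python
import numpy as np
from scipy import stats

# Toy data: one legitimately extreme but valid value in group b.
a = np.array([5.1, 4.8, 5.3, 5.0, 4.9, 5.2])
b = np.array([5.6, 5.9, 5.7, 5.8, 6.0, 11.4])

# The rank-based test is insensitive to the extreme value; the t-test is not.
print("t-test         p =", stats.ttest_ind(a, b).pvalue)
print("Mann-Whitney U p =", stats.mannwhitneyu(a, b).pvalue)

# Removing the extreme value would bias the mean of group b downward.
print("mean of b with / without the extreme value:",
      b.mean().round(2), np.delete(b, -1).mean().round(2))
```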
5. Data Timeliness~
As mentioned before, one of the data validation rules is making sure that the data represents real-world conditions or the subject matter. But these conditions change over time, so making sure the data is not outdated is another rule. If the data spans a longer period of time, this check becomes even more important.
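A minimal sketch of a timeliness check, with an assumed staleness window of 180 days: records older than the window are flagged as outdated instead of being silently mixed with current data.

```python
import pandas as pd

records = pd.DataFrame({
    "site": ["A", "B", "C"],
    "last_updated": pd.to_datetime(["2024-01-05", "2023-06-01", "2024-02-20"]),
})

max_age = pd.Timedelta(days=180)          # assumed staleness window
as_of = pd.Timestamp("2024-03-01")        # assumed reference date
records["outdated"] = (as_of - records["last_updated"]) > max_age
print(records)
```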
6. The Structure ~
This is one of the most important segments of ensuring the integrity of the data. Every format in Data Science has certain structure rules. These rules make sure that the software/algorithms used to process the data are compatible with the format and configuration of the data. It should be noted that many of the segments mentioned before in data validation take part in the data structure, but their spatial and formatting configurations also play a role.
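One way to express such structure rules in code (the expected column names and dtypes here are purely illustrative) is a simple schema check against the incoming data frame.

```python
import pandas as pd

# Hypothetical structure rules: expected columns, their order, and their dtypes.
EXPECTED = {"subject_id": "int64", "visit": "object", "value": "float64"}

def check_structure(df: pd.DataFrame) -> list[str]:
    problems = []
    if list(df.columns) != list(EXPECTED):
        problems.append(f"column layout {list(df.columns)} != {list(EXPECTED)}")
    for col, dtype in EXPECTED.items():
        if col in df.columns and str(df[col].dtype) != dtype:
            problems.append(f"{col}: dtype {df[col].dtype}, expected {dtype}")
    return problems

df = pd.DataFrame({"subject_id": [1, 2], "visit": ["V1", "V2"], "value": ["3.4", "2.2"]})
print(check_structure(df))   # flags 'value' stored as text instead of float
```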
7. Structured validation ~
This part should not be confused with ‘The Structure of the data’. As seen from the discussion above, validation includes a complex set of procedures, data characteristics and rules. Structured validation is a process where all of these are combined to produce a faster and more effective validation principle and to associate the rules with each other. In Data Science, this process enables an effective approach to performing multiple validations at the same time, while also being able to customize the validation blueprint if needed. Today, different packages in Data Science software are well developed, and validated themselves, for structured validation.
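A plain-pandas sketch of the idea, without relying on any specific validation package: several named rules form one blueprint, are run in a single pass, and are reported together (the columns and rules are hypothetical).

```python
import pandas as pd

# Hypothetical validation blueprint: named rules combined into one structure.
BLUEPRINT = [
    ("id_present",   lambda df: df["subject_id"].notna()),
    ("age_in_range", lambda df: df["age"].between(0, 120)),
    ("units_valid",  lambda df: df["unit"].isin(["mg/dL", "mmol/L"])),
]

def structured_validate(df: pd.DataFrame) -> pd.DataFrame:
    """Run every rule in one pass and report the results side by side."""
    results = pd.DataFrame({name: rule(df) for name, rule in BLUEPRINT})
    results["all_rules_pass"] = results.all(axis=1)
    return results

df = pd.DataFrame({
    "subject_id": [1, None, 3],
    "age": [44, 37, 170],
    "unit": ["mg/dL", "mg/dL", "pounds"],
})
print(structured_validate(df))
```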
8. The code ~
Maintaining data integrity through time requires strong coding bases. From data science scripts to data management and database storage code, this part is essential in making sure that data is both validated and stored unchanged over time, but also accessed and communicated while preserving all the segments of data integrity. Errors in the code could render the data invalid, so it’s important to do regular checks on the code of any databases/platforms where data is stored.
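One common technique for the storage side, sketched here with a hypothetical file name (and assuming the file already exists), is to keep a SHA-256 checksum alongside the stored data so later reads can verify that the data has not changed.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

data_file = Path("measurements.csv")            # hypothetical stored dataset
checksum_file = data_file.with_suffix(".sha256")

if checksum_file.exists():
    unchanged = checksum_file.read_text().strip() == sha256_of(data_file)
    print("data unchanged since last check:", unchanged)
else:
    checksum_file.write_text(sha256_of(data_file))
    print("checksum recorded for", data_file)
```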
By Darko Medin, Senior Clinical Biostatistician and a Data Scientist