Avoiding Mistakes with Data Requirements in Projects

In a digital age where large amounts of diverse data drive much of research, business, economics, finance, industry, meteorology, AI, technology, software development and many other areas, data is the key ingredient enabling the majority of these projects.

In this edition I will focus on showing how to avoid common mistakes when defining the data requirements in these projects, from a data science perspective.


Mistakes to avoid:

Framing the problem while disregarding the availability of the data

In most projects today, whether they are business projects, innovation projects, software development or clinical research projects, different types of data have to be made available at some stage. One situation that happens frequently is managers framing the problems, and their resolution in the project goals, without knowing which types of data are available and in what quantity. What good is framing a problem that cannot be resolved because no adequate data is present? Keep in mind that the implementation of a project frequently depends on the availability of the data, so that is the first aspect to check in a project: does the data support the resolution of the problem?
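To make this check concrete, here is a minimal sketch of a data availability audit that could run before the problem is framed. The column names, thresholds and the pandas DataFrame are hypothetical examples, not taken from any real project:

```python
import pandas as pd

# Hypothetical requirements: column name -> minimum number of usable
# (non-null) rows needed to support the project's problem resolution.
requirements = {
    "patient_age": 500,
    "treatment_arm": 500,
    "outcome_score": 400,
}

def audit_availability(df: pd.DataFrame, requirements: dict) -> list:
    """Report whether each required column exists and has enough usable rows."""
    report = []
    for column, min_rows in requirements.items():
        if column not in df.columns:
            report.append((column, "MISSING from the dataset"))
        else:
            usable = int(df[column].notna().sum())
            if usable < min_rows:
                report.append((column, f"only {usable} usable rows, {min_rows} needed"))
            else:
                report.append((column, "OK"))
    return report

# Toy dataset standing in for whatever is actually available.
available = pd.DataFrame({
    "patient_age": [54, 61, None],
    "treatment_arm": ["A", "B", "A"],
})
for column, status in audit_availability(available, requirements):
    print(f"{column}: {status}")
```

If the audit fails, that is a signal to reframe the goals or plan data collection first, rather than discovering the gap mid-project.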


Thinking correlated data will provide all the answers, and selecting only correlated data.

Frequently, even data scientists will start with the initial question: is the data correlated or associated with the problem? They then use that question to judge how much of the problem the data can explain. But correlated variables often have no causative value (only sometimes, and not that frequently), so that question will not give the right answer most of the time. Correlated does not mean explanatory. Furthermore, correlations often arise from indirect associations and even from randomness. Instead of asking whether the data is correlated with the problem, collect the full available data and perform deeper analyses such as causal inference, feature engineering, structural equation modeling (SEM) and others. And do not underestimate non-correlated data, as it can have mediating effects on other data.
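As a toy illustration (simulated variables of my own, not real project data), two variables can be strongly correlated purely because a third, confounding variable drives both. Conditioning on the confounder makes the association vanish, showing the correlation had no causative value:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Confounder Z drives both X and Y; X has no direct effect on Y at all.
z = rng.normal(size=n)
x = 2.0 * z + rng.normal(size=n)
y = 3.0 * z + rng.normal(size=n)

print("corr(X, Y):", round(np.corrcoef(x, y)[0, 1], 2))  # strong, around 0.85

# Regress Z out of both variables and correlate the residuals:
# once the confounder is accounted for, X explains nothing about Y.
x_resid = x - np.polyval(np.polyfit(z, x, 1), z)
y_resid = y - np.polyval(np.polyfit(z, y, 1), z)
print("corr(X, Y | Z):", round(np.corrcoef(x_resid, y_resid)[0, 1], 2))  # near 0.0
```

A model that selected X purely for its correlation with Y would be learning the confounder's footprint, not a usable relationship.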


Thinking the main data will tell the whole story.

The main part of the data, often framed in data frames or datasets, has context around it. Interpreting the data without that context can cause many errors in both predictions and interpretations, whether in the analysis or in the products of the project. Gather the context around the data to the maximal extent possible by communicating with domain knowledge experts and with the data contributors/collectors. Use the metadata as an additional source of information.

There are two strategies I like to use in different projects:

A. If possible, include the context in the dataset as labels and additional information, and add those to the models or the analysis to account for them (see the sketch after these two points).

B. If it is not possible to include the background data in the model itself, always account for it in the interpretation and explainability of the project/product outputs.
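For strategy A, here is a minimal sketch, assuming scikit-learn is available and using hypothetical column names, of folding a context label (here, the collection site) into a model as an explicit feature:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy dataset: 'site' is contextual information from the data collectors,
# kept as an explicit label instead of being discarded.
df = pd.DataFrame({
    "measurement": [1.2, 3.4, 2.2, 4.1, 0.9, 3.8],
    "site": ["lab_A", "lab_B", "lab_A", "lab_B", "lab_A", "lab_B"],
    "outcome": [0, 1, 0, 1, 0, 1],
})

# One-hot encode the context label so the model can account for
# site-to-site differences alongside the main measurement.
model = Pipeline([
    ("prep", ColumnTransformer(
        [("site", OneHotEncoder(handle_unknown="ignore"), ["site"])],
        remainder="passthrough",
    )),
    ("clf", LogisticRegression()),
])
model.fit(df[["measurement", "site"]], df["outcome"])
print(model.predict(df[["measurement", "site"]]))
```

The design choice here is simply not to throw the context away: encoded as a feature, systematic differences between sites are modeled rather than silently biasing the measurements.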


Having just 'enough' available data.

When selecting the data, managers often think about the amounts available and select the data that just meets the threshold of what is needed. Wrong! It is a rare situation that 100% of the data gets used. Sometimes only parts of the data are actually labeled as useful, and there are situations where only 5 or 10% of the data ends up being used. Depending on the quality of the data and the project framework, I would advise defining at least twice to 10-fold (or more) the amount of data initially estimated as needed. After all the selection and data cleaning, it is important to still have a 'safe' amount of data left.
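That planning logic reduces to a back-of-the-envelope calculation; the yield rate and counts below are made-up examples:

```python
def required_raw_rows(needed_usable: int, expected_yield: float,
                      safety_factor: float = 2.0) -> int:
    """Raw records to plan for, given the fraction expected to survive
    labeling and cleaning, plus a safety factor (twice to 10-fold or more)."""
    return int(needed_usable / expected_yield * safety_factor)

# If the analysis needs 5,000 usable rows and experience says only ~10%
# of collected records end up labeled, clean and usable:
print(required_raw_rows(5_000, expected_yield=0.10))  # plan for 100,000 raw records
```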


Not defining the databases where data will be stored.

Framing the project without defined databases or data repositories for safe data storage can cause a very bad situation where data is unorganized, missing, lacking backups, without a clear structure and hard to access. Significant loss of both time and data can occur if this segment is not handled well. So it is essential to define exactly how and where the data will be stored.
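As one illustration (SQLite is chosen purely for the sketch, and the table layout is hypothetical), even a minimal schema agreed on at the framework stage pins down exactly where the data lives and how it is structured:

```python
import sqlite3

# A deliberately simple example: one table for raw records and one for
# their sources/provenance, agreed on before any data collection starts.
conn = sqlite3.connect("project_data.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS sources (
    source_id    INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    contact      TEXT                -- data contributor/collector
);
CREATE TABLE IF NOT EXISTS raw_records (
    record_id    INTEGER PRIMARY KEY,
    source_id    INTEGER NOT NULL REFERENCES sources(source_id),
    collected_at TEXT NOT NULL,      -- ISO 8601 timestamp
    payload      TEXT NOT NULL       -- raw measurement or document
);
""")
conn.commit()
conn.close()
```

Whatever the actual technology, the point is that storage, structure, backups and access are decided up front, not improvised once the data starts arriving.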


Not including Data Scientists early

Last, but in my opinion very important. From my experience, it is essential to include data scientists very early, at the project design stage. When project leaders arrive with an already finalized design, the data scientist will often discover that the wrong types of data were defined, that there is not enough data, or that data management and use are not well organized. Then parts of the project, or even the whole project, have to be redesigned. A much more efficient principle is to include a data scientist early, so the project design does not have to be repeated because of data problems.

So, a very important point: if a data scientist is to be included in a project, it is a good idea to do that early in the project.


By Darko Medin,

A Data Scientist and a Statistician
