Avoiding Mistakes with Data Requirements in Projects

In a digital age where large amounts of diverse data drive much of research, business, economics, finance, industry, meteorology, AI, technology, software development and many other areas, data is the key ingredient enabling the majority of these projects.

In this edition I will focus on showing how to avoid common mistakes when defining the data requirements in these projects, from a data science perspective.


Mistakes to avoid:

Framing the problem while disregarding the availability of the data

In most projects today, whether they are business projects, innovation projects, software development or clinical research projects, different types of data have to be made available at some stage. One situation that happens frequently is managers framing the problems, and their resolution in the project goals, without knowing which types of data are available and in what quantity. What good is framing a problem that cannot be resolved because no adequate data is present? Keep in mind that the implementation of a project frequently depends on the availability of the data, so that is the first aspect to check in a project: does the data support the resolution of the problem?
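To make this check concrete, here is a minimal sketch of a data availability audit that could run before the problem is framed. The column names, thresholds and the pandas DataFrame are hypothetical examples, not taken from any real project:

```python
import pandas as pd

# Hypothetical requirements: column name -> minimum number of usable
# (non-null) rows needed to support the project's problem resolution.
requirements = {
    "patient_age": 500,
    "treatment_arm": 500,
    "outcome_score": 400,
}

def audit_availability(df: pd.DataFrame, requirements: dict) -> list:
    """Report whether each required column exists and has enough usable rows."""
    report = []
    for column, min_rows in requirements.items():
        if column not in df.columns:
            report.append((column, "MISSING from the dataset"))
        else:
            usable = int(df[column].notna().sum())
            if usable < min_rows:
                report.append((column, f"only {usable} usable rows, {min_rows} needed"))
            else:
                report.append((column, "OK"))
    return report

# Toy dataset standing in for whatever is actually available.
available = pd.DataFrame({
    "patient_age": [54, 61, None],
    "treatment_arm": ["A", "B", "A"],
})
for column, status in audit_availability(available, requirements):
    print(f"{column}: {status}")
```

If the audit fails, that is a signal to reframe the goals or plan data collection first, rather than discovering the gap mid-project.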


Thinking correlated data will provide all the answers, and selecting only correlated data.

Frequently, even data scientists will start with the initial question: is the data correlated or associated with the problem? They then use that question to judge how much of the problem the data can explain. But correlated variables often have no causative value (only sometimes, and not that frequently), so that question will not give the right answer most of the time. Correlated does not mean explanatory. Furthermore, correlations often arise from indirect associations and even from randomness. Instead of asking whether the data is correlated with the problem, collect the full available data and perform deeper analyses such as causal inference, feature engineering, structural equation modeling (SEM) and others. And do not underestimate non-correlated data, as it can have mediating effects on other data.
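As a toy illustration (simulated variables of my own, not real project data), two variables can be strongly correlated purely because a third, confounding variable drives both. Conditioning on the confounder makes the association vanish, showing the correlation had no causative value:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Confounder Z drives both X and Y; X has no direct effect on Y at all.
z = rng.normal(size=n)
x = 2.0 * z + rng.normal(size=n)
y = 3.0 * z + rng.normal(size=n)

print("corr(X, Y):", round(np.corrcoef(x, y)[0, 1], 2))  # strong, around 0.85

# Regress Z out of both variables and correlate the residuals:
# once the confounder is accounted for, X explains nothing about Y.
x_resid = x - np.polyval(np.polyfit(z, x, 1), z)
y_resid = y - np.polyval(np.polyfit(z, y, 1), z)
print("corr(X, Y | Z):", round(np.corrcoef(x_resid, y_resid)[0, 1], 2))  # near 0.0
```

A model that selected X purely for its correlation with Y would be learning the confounder's footprint, not a usable relationship.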


Thinking the main data will tell the whole story.

The main part of the data, often framed in data frames or datasets, has context around it. Interpreting the data without that context can cause many errors in both predictions and interpretations, whether in the analysis or in the products of the project. Gather the context around the data to the maximal extent possible by communicating with domain knowledge experts and with the data contributors/collectors. Use the metadata as an additional source of information.

There are two strategies I like to use in different projects:

A. If possible, include the context in the dataset as labels and additional information, and add those to the models or the analysis to account for them (see the sketch after these two points).

B. If it is not possible to include the background data in the model itself, always account for it in the interpretation and explainability of the project/product outputs.
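For strategy A, here is a minimal sketch, assuming scikit-learn is available and using hypothetical column names, of folding a context label (here, the collection site) into a model as an explicit feature:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy dataset: 'site' is contextual information from the data collectors,
# kept as an explicit label instead of being discarded.
df = pd.DataFrame({
    "measurement": [1.2, 3.4, 2.2, 4.1, 0.9, 3.8],
    "site": ["lab_A", "lab_B", "lab_A", "lab_B", "lab_A", "lab_B"],
    "outcome": [0, 1, 0, 1, 0, 1],
})

# One-hot encode the context label so the model can account for
# site-to-site differences alongside the main measurement.
model = Pipeline([
    ("prep", ColumnTransformer(
        [("site", OneHotEncoder(handle_unknown="ignore"), ["site"])],
        remainder="passthrough",
    )),
    ("clf", LogisticRegression()),
])
model.fit(df[["measurement", "site"]], df["outcome"])
print(model.predict(df[["measurement", "site"]]))
```

The design choice here is simply not to throw the context away: encoded as a feature, systematic differences between sites are modeled rather than silently biasing the measurements.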


Having just 'enough' available data.

When selecting the data, managers often think about the amounts available and select the data that just meets the threshold of what is needed. Wrong! It is a rare situation that 100% of the data gets used. Sometimes only parts of the data are actually labeled as useful, and there are situations where only 5 or 10% of the data ends up being used. Depending on the quality of the data and the project framework, I would advise defining at least twice to 10-fold (or more) the amount of data initially estimated as needed. After all the selection and data cleaning, it is important to still have a 'safe' amount of data left.
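That planning logic reduces to a back-of-the-envelope calculation; the yield rate and counts below are made-up examples:

```python
def required_raw_rows(needed_usable: int, expected_yield: float,
                      safety_factor: float = 2.0) -> int:
    """Raw records to plan for, given the fraction expected to survive
    labeling and cleaning, plus a safety factor (twice to 10-fold or more)."""
    return int(needed_usable / expected_yield * safety_factor)

# If the analysis needs 5,000 usable rows and experience says only ~10%
# of collected records end up labeled, clean and usable:
print(required_raw_rows(5_000, expected_yield=0.10))  # plan for 100,000 raw records
```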


Not defining the databases where data will be stored.

Framing the project without defined databases or data repositories for safe data storage can cause a very bad situation where data is unorganized, missing, lacking backups, without a clear structure and hard to access. Significant loss of both time and data can occur if this segment is not handled well. So it is essential to define exactly how and where the data will be stored.
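As one illustration (SQLite is chosen purely for the sketch, and the table layout is hypothetical), even a minimal schema agreed on at the framework stage pins down exactly where the data lives and how it is structured:

```python
import sqlite3

# A deliberately simple example: one table for raw records and one for
# their sources/provenance, agreed on before any data collection starts.
conn = sqlite3.connect("project_data.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS sources (
    source_id    INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    contact      TEXT                -- data contributor/collector
);
CREATE TABLE IF NOT EXISTS raw_records (
    record_id    INTEGER PRIMARY KEY,
    source_id    INTEGER NOT NULL REFERENCES sources(source_id),
    collected_at TEXT NOT NULL,      -- ISO 8601 timestamp
    payload      TEXT NOT NULL       -- raw measurement or document
);
""")
conn.commit()
conn.close()
```

Whatever the actual technology, the point is that storage, structure, backups and access are decided up front, not improvised once the data starts arriving.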


Not including Data Scientists early

Last, but in my opinion very important. From my experience, it is essential to include data scientists very early, at the project design stage. When project leaders arrive with an already finalized design, the data scientist will often discover that the wrong types of data were defined, that there is not enough data, or that data management and use are not well organized. Then parts of the project, or even the whole project, have to be redesigned. A much more efficient principle is to include a data scientist early, so the project design does not have to be repeated because of data problems.

So, a very important point: if a data scientist is to be included in a project, it is a good idea to do that early in the project.


By Darko Medin,

A Data Scientist and a Statistician
