Top 10 Coding Mistakes Made by Data Scientists
A data scientist is often described as someone who is better at statistics than any software engineer and better at software engineering than any statistician. Many data scientists come from a statistics background and may not have much formal software engineering experience. As a data scientist myself, I’ve compiled a list of the 10 coding mistakes I encounter most often.
1. Failing to Share Data Alongside Code
In data science, both code and data are essential for replicating results. It may seem straightforward, yet it’s common to see that the data referenced in the code isn’t shared, making it difficult for others to reproduce the findings.
import pandas as pd
df = pd.read_csv('file-i-dont-have.csv')  # fails: the CSV was never shared alongside the code
do_stuff(df)
Solution: Efficient Data Sharing Techniques
To ensure that both your data and code are accessible for accurate result replication, consider using tools like d6tpipe to share data files seamlessly alongside your code. Alternatively, you can upload your data to cloud storage solutions such as Amazon S3, Google Drive, or a web server. Another viable option is saving your data to a database where recipients can easily retrieve the files. However, it’s important to avoid adding large data files directly to version control systems like Git. Instead, opt for these more efficient methods for sharing large datasets.
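If you choose the cloud-storage route, the upload can live right next to your analysis code. Below is a minimal sketch using boto3 to push a file to Amazon S3 and pull it back down; the bucket name and object key are placeholders, and your collaborators would need read access to the bucket.
import boto3
import pandas as pd

BUCKET = 'my-project-data'   # placeholder bucket name
KEY = 'raw/data.csv'         # placeholder object key

s3 = boto3.client('s3')
s3.upload_file('data.csv', BUCKET, KEY)        # publish the data next to the code

# collaborators download the same file before running the analysis
s3.download_file(BUCKET, KEY, 'data.csv')
df = pd.read_csv('data.csv')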
2. Hardcoding Inaccessible Paths
This issue mirrors the first one. If you hardcode file paths that others cannot access, they will not be able to execute your code without having to manually search for and modify these paths in multiple locations, which is inefficient and frustrating.
import pandas as pd
df = pd.read_csv('/path/i-dont/have/data.csv') # fails
do_stuff(df)
# or
import os
os.chdir('c:\\Users\\yourname\\desktop\\python') # fails
Solution: Streamlining Access to Data
To avoid the complications of hardcoded paths, it’s beneficial to use relative paths in your code. Alternatively, setting up global path configuration variables can simplify path management across different environments. Another efficient method is to utilize tools like d6tpipe, which can help ensure that your data is easily accessible and your code more portable. These practices promote better code usability and reproducibility.
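As a concrete illustration, here is one way to anchor paths to the project itself rather than to one person's machine, using pathlib; the data/ subfolder is an assumed project layout.
from pathlib import Path
import pandas as pd

# resolve paths relative to this script, not the current working directory
PROJECT_DIR = Path(__file__).resolve().parent
DATA_DIR = PROJECT_DIR / 'data'   # assumed layout: data files live in ./data

df = pd.read_csv(DATA_DIR / 'data.csv')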
3. Mixing Data with Code
Combining data, code, images, reports, and other files in the same directory can create a disorganized and cluttered workspace. This practice complicates the management of your project’s components and can lead to inefficiencies and errors when trying to navigate through such a messy directory.
├── data.csv
├── ingest.py
├── other-data.csv
├── output.png
├── report.html
└── run.py
Solution: Structuring Your Project Directory
To maintain an organized workspace, it’s advisable to categorize your directory into clear sections such as ‘data’, ‘reports’, and ‘code’. You can use project templates like those provided by Cookiecutter Data Science or d6tflow to help structure your projects effectively. Additionally, employ the data storage and sharing tools mentioned in the previous points to manage and distribute your data efficiently. This structured approach not only enhances the clarity of your project but also improves workflow efficiency.
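For instance, the cluttered listing above could be reorganized along these lines (loosely following the Cookiecutter Data Science convention; the exact folder names are just one reasonable choice):
├── data
│   ├── data.csv
│   └── other-data.csv
├── src
│   ├── ingest.py
│   └── run.py
└── reports
    ├── output.png
    └── report.html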
4. Committing Data with Source Code in Git
While version controlling your code is essential (and not doing so is indeed a mistake), using Git to also version control data files can lead to problems. While committing very small data files might be manageable, Git is not designed to handle large files effectively. This can slow down repository operations and complicate data management.
git add data.csv
Solution: Efficient Data Version Control
Instead of committing large data files directly to Git, consider using specialized tools designed for data version control. Options like d6tpipe, Data Version Control (DVC), and Git Large File Storage (LFS) are tailored for handling larger datasets efficiently. These tools enable you to version control your data without compromising the performance and manageability of your Git repository, aligning with the data-sharing methods recommended earlier.
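As a sketch of what this looks like in practice with DVC: the data file is tracked outside Git on the command line (dvc add, dvc push), and collaborators can then load it through DVC's Python API. The repository URL and file path below are placeholders, and the exact call should be checked against the DVC documentation.
import pandas as pd
import dvc.api

# placeholder repo URL and path; the file itself lives in DVC remote storage,
# while Git only tracks a small .dvc pointer file
with dvc.api.open('data/data.csv', repo='https://github.com/yourname/yourproject') as f:
    df = pd.read_csv(f)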
5. Writing Functions Instead of Using DAGs
When transitioning from basic coding to data science, many data scientists initially organize their code as a series of linearly executed functions, a practice picked up from early programming lessons. However, this approach can lead to inefficiencies and issues in more complex machine learning projects. It lacks the structured flow and dependency management provided by Directed Acyclic Graphs (DAGs), which are essential for handling more sophisticated workflows in data science (see 4 Reasons Why Your Machine Learning Code is Probably Bad).
import pandas as pd
import sklearn.svm

def process_data(data, parameter):
    data = do_stuff(data)
    data.to_pickle('data.pkl')

data = pd.read_csv('data.csv')
process_data(data)
df_train = pd.read_pickle('data.pkl')  # implicit dependency: only works if process_data() already ran
model = sklearn.svm.SVC()
model.fit(df_train.iloc[:,:-1], df_train['y'])
Solution: Utilizing Task-Based Workflows with Dependencies
To overcome the limitations of linear function chaining in data science projects, it’s beneficial to structure your code as a set of tasks that have defined dependencies. This method facilitates more organized and efficient workflows, especially when dealing with complex data processes. Tools like d6tflow and Apache Airflow are excellent for managing these task-based workflows. They allow you to define and automate the execution order of tasks based on their dependencies, enhancing both the reliability and scalability of your data science projects.
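To make this concrete, here is a minimal sketch following d6tflow's task pattern, where each step declares what it depends on and persists its own output; treat the exact class and method names as assumptions to verify against the d6tflow documentation.
import d6tflow
import pandas as pd
from sklearn.svm import SVC

class TaskGetData(d6tflow.tasks.TaskPqPandas):    # output saved as parquet
    def run(self):
        df = pd.read_csv('data.csv')
        self.save(df)

@d6tflow.requires(TaskGetData)
class TaskProcess(d6tflow.tasks.TaskPqPandas):
    def run(self):
        df = self.inputLoad()      # loads TaskGetData's output
        df = df.dropna()
        self.save(df)

@d6tflow.requires(TaskProcess)
class TaskTrain(d6tflow.tasks.TaskPickle):        # output saved as pickle
    def run(self):
        df = self.inputLoad()
        model = SVC()
        model.fit(df.iloc[:, :-1], df['y'])
        self.save(model)

# runs upstream tasks only if their outputs are missing or out of date
d6tflow.run(TaskTrain())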
6. Relying on For Loops
For loops are often one of the first programming constructs learned by new coders. While they are straightforward and easy to understand, for loops can be inefficient, particularly in data science tasks where processing speed is crucial. They are typically slower and more verbose than necessary, often indicating a lack of awareness of more efficient, vectorized alternatives that can handle operations in bulk.
import math
x = range(10)
avg = sum(x)/len(x)
std = math.sqrt(sum((i-avg)**2 for i in x)/len(x))
zscore = [(i-avg)/std for i in x]
# should be: scipy.stats.zscore(x)

# or
groupavg = []
for i in df['g'].unique():
    dfg = df[df['g']==i]
    groupavg.append(dfg['g'].mean())
# should be: df.groupby('g').mean()
Solution: Embracing Vectorized Functions
To enhance efficiency and reduce the reliance on slow for loops in your code, consider utilizing vectorized functions provided by libraries such as NumPy, SciPy, and pandas. These libraries offer a wide range of vectorized operations that can perform bulk calculations on data arrays much more quickly and with less code than traditional for loops. By adopting these vectorized functions, you can significantly speed up your data processing tasks and streamline your code.
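For example, both loops above collapse to single vectorized calls; the tiny DataFrame below is made up purely for illustration.
import numpy as np
import pandas as pd
import scipy.stats

x = np.arange(10)
zscores = scipy.stats.zscore(x)                 # replaces the manual mean/std/zscore loop

df = pd.DataFrame({'g': ['a', 'a', 'b', 'b'],
                   'value': [1.0, 2.0, 3.0, 4.0]})
group_means = df.groupby('g')['value'].mean()   # replaces the loop over unique groups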
7. Neglecting to Write Unit Tests
In the dynamic field of data science, as data, parameters, or user inputs change, there’s a high risk that your code may break, sometimes without immediate detection. This oversight can result in incorrect outputs, which, if used for decision-making, can lead to poor and costly decisions. Writing unit tests is crucial for ensuring that each part of your code functions correctly under various conditions, thus safeguarding against the potentially significant consequences of undetected errors.
assert df['id'].unique().shape[0] == len(ids) # have data for all ids?
assert (df.isna().mean() < 0.9).all() # no column is mostly missing values
assert df.groupby(['g','date']).size().max() ==1 # no duplicate values/date?
assert d6tjoin.utils.PreJoin([df1,df2],['id','date']).is_all_matched() # all ids matched?
Solution: Implementing Robust Data Quality Checks
To ensure the integrity and accuracy of your data, implementing assert statements can be a powerful method for ongoing data quality assurance. Libraries like pandas provide functions for testing data equality, which can help validate that your data meets expected criteria. Additionally, tools like d6tstack offer specialized checks for data ingestion, and d6tjoin is useful for verifying data joins.
For example, you can use pandas to perform equality tests to ensure that data transformations or merges are executed correctly. Below is a basic example of how you might use assertions in your data checks:
import pandas as pd
# Assuming df_expected and df_actual are two DataFrame objects you want to compare
assert df_expected.equals(df_actual), "The data frames are not equal."
# Using d6tjoin to check for successful joins
import d6tjoin
result = d6tjoin.utils.PreJoin([df1, df2], ['key1', 'key2']).is_all_matched()
assert result, "Data join mismatch found."
# Checking that required columns are present before ingestion
# (d6tstack provides prebuilt checks for CSV ingestion; see its documentation for the exact API)
required_columns = {'column1', 'column2'}
assert required_columns.issubset(df.columns), "Required columns are missing in the data frame."
These code snippets show how to implement checks that prevent data-related errors in your projects, ensuring that your outputs remain reliable and accurate.
8. Neglecting to Document Code
In the rush to deliver analysis results, it’s tempting to quickly hack together code without proper documentation. However, when clients or supervisors request changes or updates later, you may find yourself unable to recall the rationale behind your original implementation. This challenge is compounded if someone else needs to run or modify your code. Without documentation, understanding and maintaining the code becomes difficult and time-consuming, leading to potential errors and inefficiencies.
def some_complicated_function(data):
    data = data[data['column']!='wrong']
    data = data.groupby('date').apply(lambda x: complicated_stuff(x))
    data = data[data['value']<0.9]
    return data
Solution: Invest Time in Documentation
Taking the extra time to document your code, even after delivering the initial analysis, is invaluable. Clear documentation helps you understand your own code when revisiting it and makes it easier for others to follow your logic and make necessary updates. This practice not only saves time in the long run but also enhances your professional reputation. Comprehensive documentation includes explaining the purpose of functions, detailing the parameters and outputs, and describing any complex or non-obvious sections of the code. By doing so, you ensure your work is maintainable and reproducible, ultimately making you look like a pro.
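As an illustration, here is how the undocumented function from above might look once documented; the stated intent of each step is assumed purely for the sake of the example.
def some_complicated_function(data):
    """Clean raw records and apply per-date processing.

    Parameters
    ----------
    data : pandas.DataFrame
        Must contain 'column', 'date' and 'value' columns.

    Returns
    -------
    pandas.DataFrame
        Filtered and processed records.
    """
    # drop records flagged as invalid upstream (assumed meaning of 'wrong')
    data = data[data['column'] != 'wrong']
    # apply the per-date business logic encapsulated in complicated_stuff()
    data = data.groupby('date').apply(lambda x: complicated_stuff(x))
    # discard values at or above the 0.9 threshold (assumed cutoff)
    data = data[data['value'] < 0.9]
    return data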
9. Saving Data as CSV or Pickle Files
In data science, commonly used formats like CSVs and pickle files have notable drawbacks. While CSVs are widely used, they lack schemas, requiring users to repeatedly parse numbers and dates. Although pickles preserve data types, they are Python-specific and uncompressed, making them inefficient for large datasets. Both formats fall short when it comes to storing and managing substantial volumes of data effectively.
import pandas as pd

def process_data(data, parameter):
    data = do_stuff(data)
    data.to_pickle('data.pkl')     # pickle: Python-only and uncompressed

data = pd.read_csv('data.csv')     # CSV: no schema, numbers and dates re-parsed on every load
process_data(data)
df_train = pd.read_pickle('data.pkl')
Solution: Utilize Efficient Data Formats
To overcome the limitations of CSV and pickle files, opt for more advanced binary data formats that include schemas and support compression. Parquet is an excellent choice, as it efficiently stores data with built-in schema information and compression, making it ideal for large datasets. Additionally, tools like d6tflow can automate the process by saving task outputs as Parquet files, sparing you from manual handling. This approach ensures better data integrity, compatibility, and storage efficiency, enhancing overall workflow management.
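In pandas, switching to Parquet is a one-line change, provided a Parquet engine such as pyarrow or fastparquet is installed; the toy DataFrame below is just for illustration.
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2024-01-01', periods=3),
                   'value': [1.0, 2.0, 3.0]})

df.to_parquet('data.parquet')            # schema, dtypes and compression are handled for you
df2 = pd.read_parquet('data.parquet')    # dates come back as datetimes, no re-parsing needed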
10. Relying on Jupyter Notebooks
While Jupyter notebooks are as ubiquitous as CSV files in data science, their popularity doesn’t necessarily equate to best practice. They often encourage poor software engineering habits, such as: