Top 10 Coding Mistakes Made by Data Scientists
A data scientist is often described as someone who is better at statistics than any software engineer and better at software engineering than any statistician. Many data scientists come from a statistics background and may not have much formal software engineering experience. As a data scientist myself, I’ve compiled a list of the 10 coding mistakes I encounter most often.
1. Failing to Share Data Alongside Code
In data science, both code and data are essential for replicating results. It may seem straightforward, yet it’s common to see that the data referenced in the code isn’t shared, making it difficult for others to reproduce the findings.
import pandas as pd
df = pd.read_csv('file-i-dont-have.csv')  # fails: the CSV was never shared alongside the code
do_stuff(df)
Solution: Efficient Data Sharing Techniques
To ensure that both your data and code are accessible for accurate result replication, consider using tools like d6tpipe to share data files seamlessly alongside your code. Alternatively, you can upload your data to cloud storage solutions such as Amazon S3, Google Drive, or a web server. Another viable option is saving your data to a database where recipients can easily retrieve the files. However, it’s important to avoid adding large data files directly to version control systems like Git. Instead, opt for these more efficient methods for sharing large datasets.
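If you choose the cloud-storage route, the upload can live right next to your analysis code. Below is a minimal sketch using boto3 to push a file to Amazon S3 and pull it back down; the bucket name and object key are placeholders, and your collaborators would need read access to the bucket.
import boto3
import pandas as pd

BUCKET = 'my-project-data'   # placeholder bucket name
KEY = 'raw/data.csv'         # placeholder object key

s3 = boto3.client('s3')
s3.upload_file('data.csv', BUCKET, KEY)        # publish the data next to the code

# collaborators download the same file before running the analysis
s3.download_file(BUCKET, KEY, 'data.csv')
df = pd.read_csv('data.csv')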
2. Hardcoding Inaccessible Paths
This issue mirrors the first one. If you hardcode file paths that others cannot access, they will not be able to execute your code without having to manually search for and modify these paths in multiple locations, which is inefficient and frustrating.
import pandas as pd
df = pd.read_csv('/path/i-dont/have/data.csv') # fails
do_stuff(df)
# or
import os
os.chdir('c:\\Users\\yourname\\desktop\\python') # fails
Solution: Streamlining Access to Data
To avoid the complications of hardcoded paths, it’s beneficial to use relative paths in your code. Alternatively, setting up global path configuration variables can simplify path management across different environments. Another efficient method is to utilize tools like d6tpipe, which can help ensure that your data is easily accessible and your code more portable. These practices promote better code usability and reproducibility.
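As a concrete illustration, here is one way to anchor paths to the project itself rather than to one person's machine, using pathlib; the data/ subfolder is an assumed project layout.
from pathlib import Path
import pandas as pd

# resolve paths relative to this script, not the current working directory
PROJECT_DIR = Path(__file__).resolve().parent
DATA_DIR = PROJECT_DIR / 'data'   # assumed layout: data files live in ./data

df = pd.read_csv(DATA_DIR / 'data.csv')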
3. Mixing Data with Code
Combining data, code, images, reports, and other files in the same directory can create a disorganized and cluttered workspace. This practice complicates the management of your project’s components and can lead to inefficiencies and errors when trying to navigate through such a messy directory.
├── data.csv
├── ingest.py
├── other-data.csv
├── output.png
├── report.html
└── run.py
Solution: Structuring Your Project Directory
To maintain an organized workspace, it’s advisable to categorize your directory into clear sections such as ‘data’, ‘reports’, and ‘code’. You can use project templates like those provided by Cookiecutter Data Science or d6tflow to help structure your projects effectively. Additionally, employ the data storage and sharing tools mentioned in the previous points to manage and distribute your data efficiently. This structured approach not only enhances the clarity of your project but also improves workflow efficiency.
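For instance, the cluttered listing above could be reorganized along these lines (loosely following the Cookiecutter Data Science convention; the exact folder names are just one reasonable choice):
├── data
│   ├── data.csv
│   └── other-data.csv
├── src
│   ├── ingest.py
│   └── run.py
└── reports
    ├── output.png
    └── report.html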
4. Committing Data with Source Code in Git
While version controlling your code is essential (and not doing so is indeed a mistake), using Git to also version control data files can lead to problems. While committing very small data files might be manageable, Git is not designed to handle large files effectively. This can slow down repository operations and complicate data management.
git add data.csv
Solution: Efficient Data Version Control
Instead of committing large data files directly to Git, consider using specialized tools designed for data version control. Options like d6tpipe, Data Version Control (DVC), and Git Large File Storage (LFS) are tailored for handling larger datasets efficiently. These tools enable you to version control your data without compromising the performance and manageability of your Git repository, aligning with the data-sharing methods recommended earlier.
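As a sketch of what this looks like in practice with DVC: the data file is tracked outside Git on the command line (dvc add, dvc push), and collaborators can then load it through DVC's Python API. The repository URL and file path below are placeholders, and the exact call should be checked against the DVC documentation.
import pandas as pd
import dvc.api

# placeholder repo URL and path; the file itself lives in DVC remote storage,
# while Git only tracks a small .dvc pointer file
with dvc.api.open('data/data.csv', repo='https://github.com/yourname/yourproject') as f:
    df = pd.read_csv(f)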
5. Writing Functions Instead of Using DAGs
When transitioning from basic coding to data science, many data scientists initially organize their code as a series of linearly executed functions, a practice picked up from early programming lessons. However, this approach can lead to inefficiencies and issues in more complex machine learning projects. It lacks the structured flow and dependency management provided by Directed Acyclic Graphs (DAGs), which are essential for handling more sophisticated workflows in data science (see 4 Reasons Why Your Machine Learning Code is Probably Bad).
import pandas as pd
import sklearn.svm

def process_data(data, parameter):
    data = do_stuff(data)
    data.to_pickle('data.pkl')

data = pd.read_csv('data.csv')
process_data(data)
df_train = pd.read_pickle('data.pkl')  # implicit dependency: only works if process_data() already ran
model = sklearn.svm.SVC()
model.fit(df_train.iloc[:,:-1], df_train['y'])
Solution: Utilizing Task-Based Workflows with Dependencies
To overcome the limitations of linear function chaining in data science projects, it’s beneficial to structure your code as a set of tasks that have defined dependencies. This method facilitates more organized and efficient workflows, especially when dealing with complex data processes. Tools like d6tflow and Apache Airflow are excellent for managing these task-based workflows. They allow you to define and automate the execution order of tasks based on their dependencies, enhancing both the reliability and scalability of your data science projects.
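To make this concrete, here is a minimal sketch following d6tflow's task pattern, where each step declares what it depends on and persists its own output; treat the exact class and method names as assumptions to verify against the d6tflow documentation.
import d6tflow
import pandas as pd
from sklearn.svm import SVC

class TaskGetData(d6tflow.tasks.TaskPqPandas):    # output saved as parquet
    def run(self):
        df = pd.read_csv('data.csv')
        self.save(df)

@d6tflow.requires(TaskGetData)
class TaskProcess(d6tflow.tasks.TaskPqPandas):
    def run(self):
        df = self.inputLoad()      # loads TaskGetData's output
        df = df.dropna()
        self.save(df)

@d6tflow.requires(TaskProcess)
class TaskTrain(d6tflow.tasks.TaskPickle):        # output saved as pickle
    def run(self):
        df = self.inputLoad()
        model = SVC()
        model.fit(df.iloc[:, :-1], df['y'])
        self.save(model)

# runs upstream tasks only if their outputs are missing or out of date
d6tflow.run(TaskTrain())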
6. Relying on For Loops
For loops are often one of the first programming constructs learned by new coders. While they are straightforward and easy to understand, for loops can be inefficient, particularly in data science tasks where processing speed is crucial. They are typically slower and more verbose than necessary, often indicating a lack of awareness of more efficient, vectorized alternatives that can handle operations in bulk.
import math
x = range(10)
avg = sum(x)/len(x)
std = math.sqrt(sum((i-avg)**2 for i in x)/len(x))
zscore = [(i-avg)/std for i in x]
# should be: scipy.stats.zscore(x)

# or
groupavg = []
for i in df['g'].unique():
    dfg = df[df['g']==i]
    groupavg.append(dfg['g'].mean())
# should be: df.groupby('g').mean()
Solution: Embracing Vectorized Functions
To enhance efficiency and reduce the reliance on slow for loops in your code, consider utilizing vectorized functions provided by libraries such as NumPy, SciPy, and pandas. These libraries offer a wide range of vectorized operations that can perform bulk calculations on data arrays much more quickly and with less code than traditional for loops. By adopting these vectorized functions, you can significantly speed up your data processing tasks and streamline your code.
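For example, both loops above collapse to single vectorized calls; the tiny DataFrame below is made up purely for illustration.
import numpy as np
import pandas as pd
import scipy.stats

x = np.arange(10)
zscores = scipy.stats.zscore(x)                 # replaces the manual mean/std/zscore loop

df = pd.DataFrame({'g': ['a', 'a', 'b', 'b'],
                   'value': [1.0, 2.0, 3.0, 4.0]})
group_means = df.groupby('g')['value'].mean()   # replaces the loop over unique groups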
7. Neglecting to Write Unit Tests
In the dynamic field of data science, as data, parameters, or user inputs change, there’s a high risk that your code may break, sometimes without immediate detection. This oversight can result in incorrect outputs, which, if used for decision-making, can lead to poor and costly decisions. Writing unit tests is crucial for ensuring that each part of your code functions correctly under various conditions, thus safeguarding against the potentially significant consequences of undetected errors.
assert df['id'].unique().shape[0] == len(ids) # have data for all ids?
assert (df.isna().mean() < 0.9).all() # no column is mostly missing values
assert df.groupby(['g','date']).size().max() ==1 # no duplicate values/date?
assert d6tjoin.utils.PreJoin([df1,df2],['id','date']).is_all_matched() # all ids matched?
Solution: Implementing Robust Data Quality Checks
To ensure the integrity and accuracy of your data, implementing assert statements can be a powerful method for ongoing data quality assurance. Libraries like pandas provide functions for testing data equality, which can help validate that your data meets expected criteria. Additionally, tools like d6tstack offer specialized checks for data ingestion, and d6tjoin is useful for verifying data joins.
For example, you can use pandas to perform equality tests to ensure that data transformations or merges are executed correctly. Below is a basic example of how you might use assertions in your data checks:
import pandas as pd
# Assuming df_expected and df_actual are two DataFrame objects you want to compare
assert df_expected.equals(df_actual), "The data frames are not equal."
# Using d6tjoin to check for successful joins
import d6tjoin
result = d6tjoin.utils.PreJoin([df1, df2], ['key1', 'key2']).is_all_matched()
assert result, "Data join mismatch found."
# Checking that required columns are present before ingestion
# (d6tstack provides prebuilt checks for CSV ingestion; see its documentation for the exact API)
required_columns = {'column1', 'column2'}
assert required_columns.issubset(df.columns), "Required columns are missing in the data frame."
These code snippets show how to implement checks that prevent data-related errors in your projects, ensuring that your outputs remain reliable and accurate.
8. Neglecting to Document Code
In the rush to deliver analysis results, it’s tempting to quickly hack together code without proper documentation. However, when clients or supervisors request changes or updates later, you may find yourself unable to recall the rationale behind your original implementation. This challenge is compounded if someone else needs to run or modify your code. Without documentation, understanding and maintaining the code becomes difficult and time-consuming, leading to potential errors and inefficiencies.
def some_complicated_function(data):
    data = data[data['column']!='wrong']
    data = data.groupby('date').apply(lambda x: complicated_stuff(x))
    data = data[data['value']<0.9]
    return data
Solution: Invest Time in Documentation
Taking the extra time to document your code, even after delivering the initial analysis, is invaluable. Clear documentation helps you understand your own code when revisiting it and makes it easier for others to follow your logic and make necessary updates. This practice not only saves time in the long run but also enhances your professional reputation. Comprehensive documentation includes explaining the purpose of functions, detailing the parameters and outputs, and describing any complex or non-obvious sections of the code. By doing so, you ensure your work is maintainable and reproducible, ultimately making you look like a pro.
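As an illustration, here is how the undocumented function from above might look once documented; the stated intent of each step is assumed purely for the sake of the example.
def some_complicated_function(data):
    """Clean raw records and apply per-date processing.

    Parameters
    ----------
    data : pandas.DataFrame
        Must contain 'column', 'date' and 'value' columns.

    Returns
    -------
    pandas.DataFrame
        Filtered and processed records.
    """
    # drop records flagged as invalid upstream (assumed meaning of 'wrong')
    data = data[data['column'] != 'wrong']
    # apply the per-date business logic encapsulated in complicated_stuff()
    data = data.groupby('date').apply(lambda x: complicated_stuff(x))
    # discard values at or above the 0.9 threshold (assumed cutoff)
    data = data[data['value'] < 0.9]
    return data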
9. Saving Data as CSV or Pickle Files
In data science, commonly used formats like CSVs and pickle files have notable drawbacks. While CSVs are widely used, they lack schemas, requiring users to repeatedly parse numbers and dates. Although pickles preserve data types, they are Python-specific and uncompressed, making them inefficient for large datasets. Both formats fall short when it comes to storing and managing substantial volumes of data effectively.
import pandas as pd

def process_data(data, parameter):
    data = do_stuff(data)
    data.to_pickle('data.pkl')     # pickle: Python-only and uncompressed

data = pd.read_csv('data.csv')     # CSV: no schema, numbers and dates re-parsed on every load
process_data(data)
df_train = pd.read_pickle('data.pkl')
Solution: Utilize Efficient Data Formats
To overcome the limitations of CSV and pickle files, opt for more advanced binary data formats that include schemas and support compression. Parquet is an excellent choice, as it efficiently stores data with built-in schema information and compression, making it ideal for large datasets. Additionally, tools like d6tflow can automate the process by saving task outputs as Parquet files, sparing you from manual handling. This approach ensures better data integrity, compatibility, and storage efficiency, enhancing overall workflow management.
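In pandas, switching to Parquet is a one-line change, provided a Parquet engine such as pyarrow or fastparquet is installed; the toy DataFrame below is just for illustration.
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2024-01-01', periods=3),
                   'value': [1.0, 2.0, 3.0]})

df.to_parquet('data.parquet')            # schema, dtypes and compression are handled for you
df2 = pd.read_parquet('data.parquet')    # dates come back as datetimes, no re-parsing needed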
10. Relying on Jupyter Notebooks
While Jupyter notebooks are as ubiquitous as CSV files in data science, their popularity doesn’t necessarily equate to best practice. They often encourage poor software engineering habits, such as: