Data Science Salaries 2023 Dataset
Introduction
In today’s data-driven world, Data Science has emerged as a field with immense potential and exciting career prospects. With the increasing reliance on data for informed decision-making, companies are actively seeking skilled professionals who can navigate the complex world of data analysis and interpretation. Apart from the intellectually stimulating nature of the work, one of the key factors that make Data Science an attractive career choice is the potential for high salaries.
The Data Science Job Salaries Dataset provides us with valuable insights into the earning potential of different roles within the Data Science domain. By exploring this dataset, we can gain a comprehensive understanding of the salary trends, identify the most in-demand job titles, and uncover the factors that contribute to variations in salaries across regions and industries.
Importing Essential Libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker
Loading the Data Science Salaries Dataset
df = pd.read_csv('/kaggle/input/data-science-salaries-2023/ds_salaries.csv')
df.head()
Checking the Shape of the DataFrame
df.shape
(3755, 11)
This means that the DataFrame df has 3755 rows and 11 columns. Each row represents a unique data point (in this case, a data scientist's salary information), and each column represents a different attribute of the data point (such as work year, experience level, employment type, job title, salary, etc.).
Obtaining Information about the DataFrame
df.info()
The DataFrame consists of 3755 rows and 11 columns, with data types being integers and objects (typically strings in pandas). Each column contains only non-null entries, indicating there are no missing values. The memory usage of the DataFrame is approximately 322.8 KB.
Counting Null Values in the DataFrame
df.isnull().sum()
This returns a Series showing the total count of null values in each column. It is a quick way to check whether your data has any missing values and, if so, where they are. Here, there are no missing values.
Statistical Summary of the DataFrame
df.describe().astype(int)
The average salary in USD is around 137,570. The standard deviation values tell us how spread out the data is. For example, a high standard deviation in salary would mean that salaries vary a lot.
The minimum and maximum values give us the range of salaries, which in USD is from 5,132 to 450,000. The 25%, 50%, and 75% values are called percentiles. They tell us that 25% of data scientists earn 95,000 USD or less, and half of them earn 135,000 USD or less. Interestingly, there's a big difference between the maximum of the raw salary column and the maximum of salary_in_usd, which could be due to outliers, errors, or different currency scales. This could be worth investigating further.
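One way to start that investigation is to compare the raw salary with the USD salary per currency. The sketch below runs on a few made-up rows rather than the real file, and it assumes the dataset has a `salary_currency` column next to `salary` and `salary_in_usd`; adjust the names if your copy differs.

```python
import pandas as pd

# Made-up rows for illustration only; these are not values from the real dataset.
sample = pd.DataFrame({
    'salary': [120000, 6000000, 4500000, 95000, 80000],
    'salary_currency': ['USD', 'INR', 'INR', 'USD', 'EUR'],  # assumed column name
    'salary_in_usd': [120000, 73000, 55000, 95000, 86000],
})

# Largest raw salary and largest USD salary within each currency
by_currency = sample.groupby('salary_currency')[['salary', 'salary_in_usd']].max()
print(by_currency)

# Ratio of raw salary to USD salary: values far from 1 suggest a currency
# scale effect rather than a genuine outlier or a data error
sample['raw_to_usd'] = sample['salary'] / sample['salary_in_usd']
print(sample.sort_values('raw_to_usd', ascending=False).head())
```

If the largest raw salaries all sit in currencies with a high ratio, the gap is most likely a matter of currency scale, and `salary_in_usd` remains the safer column for comparisons.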
Number of job titles
unique_job_titles = df['job_title'].nunique()
print(f'Number of unique job titles: {unique_job_titles}')
Number of unique job titles: 93
Top 10 Job Titles in the Data Science Field
# Calculate the top 10 most common job titles
top_job_titles = df['job_title'].value_counts().nlargest(10)
# Create a new figure for the plot
plt.figure(figsize=(10, 6))
# Generate a bar plot using seaborn
barplot = sns.barplot(y=top_job_titles.index, x=top_job_titles.values, palette='viridis')
# Add text annotations to the bar plot
for i in range(top_job_titles.shape[0]):
    barplot.text(top_job_titles.iloc[i] + 10, i, top_job_titles.iloc[i], va='center')
# Add title and labels to the plot
plt.title('Top 10 Job Titles')
plt.xlabel('Count')
plt.ylabel('Job Title')
# Display the plot
plt.show()
Salary Statistics for Various Data Science Roles
# Map each display name to the job title used to filter the dataset
job_roles = {
    'Data Engineer': 'Data Engineer',
    'Data Scientist': 'Data Scientist',
    'Data Analyst': 'Data Analyst',
    'Machine Learning Engineer': 'Machine Learning Engineer',
    'Analytics Engineer': 'Analytics Engineer'}
# Iterate over the job roles and calculate the highest, lowest, and average salaries
for role, title in job_roles.items():
    salaries = df[df['job_title'] == title]['salary_in_usd']
    max_salary = salaries.max()
    min_salary = salaries.min()
    avg_salary = int(salaries.mean())

    # Print the salary summary for each role
    print(role + ':')
    print('  - Highest Salary:', max_salary)
    print('  - Lowest Salary:', min_salary)
    print('  - Average Salary:', avg_salary)
    print()
Comparison of Salary Summaries
# Define the job roles
job_roles = ['Data Engineer', 'Data Scientist', 'Data Analyst', 'Machine Learning Engineer', 'Analytics Engineer']
# Set the seaborn palette
sns.set_palette("viridis")
# Create a grid of subplots for job roles' salary summary
fig, axs = plt.subplots(len(job_roles), 1, figsize=(8, 5 * len(job_roles)))
# Iterate over job roles
for i, role in enumerate(job_roles):
    # Filter the dataset for the specific job role
    salaries = df[df['job_title'] == role]['salary_in_usd']

    # Calculate the highest, lowest, and average salaries
    max_salary = salaries.max()
    min_salary = salaries.min()
    avg_salary = int(salaries.mean())

    # Plot the salary summary
    sns.barplot(x=['Highest', 'Lowest', 'Average'], y=[max_salary, min_salary, avg_salary], ax=axs[i])
    axs[i].set_title(f'{role} Salary')
    axs[i].set_ylabel('Salary (USD)')

    # Add value labels to the bars
    for j, value in enumerate([max_salary, min_salary, avg_salary]):
        axs[i].text(j, value, f'${value:,}', ha='center', va='bottom')
# Adjust spacing between subplots and remove any excess blank space
plt.tight_layout()
# Show the plot
plt.show()
Salary distribution
# Distribution of salaries
salary_distribution = df['salary_in_usd'].describe().round(2)
print('\nSalary Distribution:')
print(salary_distribution)
# Salary distribution
plt.figure(figsize=(10, 6)) # Create a new figure for the plot with a specific size of 10 inches in width and 6 inches in height
sns.histplot(df[df['salary_in_usd'] < df['salary_in_usd'].quantile(0.95)], x='salary_in_usd', bins=20, kde=True, color='skyblue') # Generate a histogram plot using seaborn, restricting the data to values below the 95th percentile of 'salary_in_usd' column, with 20 bins, including a kernel density estimate, and setting the color to 'skyblue'
plt.title('Salary Distribution') # Add a title to the plot as 'Salary Distribution'
plt.xlabel('Salary in USD') # Label the x-axis as 'Salary in USD'
plt.ylabel('Frequency') # Label the y-axis as 'Frequency'
plt.show() # Display the plot
This code creates a histogram of the 'salary_in_usd' column, but only includes salaries below the 95th percentile to exclude outliers.
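To see what the percentile filter does on its own, here is a minimal sketch on a made-up Series (the numbers are invented for illustration; the last one plays the role of an outlier). It uses the same strict `<` comparison as the plotting code above.

```python
import pandas as pd

# Made-up salaries for illustration only
salaries = pd.Series([50000, 60000, 70000, 80000, 90000,
                      100000, 110000, 120000, 130000, 450000])

# Value below which roughly 95% of the observations fall
cutoff = salaries.quantile(0.95)

# Keep only the rows strictly below the cutoff
trimmed = salaries[salaries < cutoff]

print(f'Cutoff: {cutoff:,.0f}')
print(f'Kept {len(trimmed)} of {len(salaries)} rows')
```

Note that this only trims the plotted data; the original DataFrame is left untouched, so later statistics still include the highest salaries.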
Top job salaries
# Subset the data for the 10 most frequent job titles
top_job_titles = df['job_title'].value_counts().nlargest(10).index
subset_df = df[df['job_title'].isin(top_job_titles)]
# Set the figure size and adjust spacing
plt.figure(figsize=(12, 8))
sns.set(font_scale=1.0)
sns.set_style("whitegrid")
# Create the box plot
sns.boxplot(data=subset_df, x='job_title', y='salary_in_usd', order=top_job_titles, palette='viridis')
# Set plot title and labels
plt.title('Salary Distribution for Top Job Titles', fontsize=16)
plt.xlabel('Job Title', fontsize=14)
plt.ylabel('Salary in USD', fontsize=14)
# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha='right')
# Adjust y-axis limits for better visualization of outliers
plt.ylim(bottom=0)
# Add more space between plots
plt.tight_layout()
# Show the plot
plt.show()
Each box in the plot represents the salary range for a specific job title. By comparing the positions and lengths of the boxes, we can gain insights into the salary distribution across different job titles. If a box is positioned higher on the y-axis, it indicates a higher median salary for that job title compared to others. Similarly, if a box is longer, it suggests a wider salary range and potentially more variability in salaries.
The whiskers extending from the boxes represent the data within a certain range, usually 1.5 times the IQR. Any data points beyond the whiskers are considered outliers and are represented as individual points on the plot.
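The 1.5 times IQR rule can be reproduced by hand. The sketch below uses made-up salaries and pandas' default quantile interpolation, which may differ slightly from how a plotting library computes the quartiles, so treat it as an approximation of what the box plot shows.

```python
import pandas as pd

# Made-up salaries for illustration only
salaries = pd.Series([60000, 80000, 100000, 120000, 140000, 160000, 400000])

q1 = salaries.quantile(0.25)   # first quartile (bottom edge of the box)
q3 = salaries.quantile(0.75)   # third quartile (top edge of the box)
iqr = q3 - q1                  # interquartile range (height of the box)

# Points beyond these fences are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = salaries[(salaries < lower_fence) | (salaries > upper_fence)]
print(f'IQR: {iqr:,.0f}, fences: [{lower_fence:,.0f}, {upper_fence:,.0f}]')
print('Outliers:', outliers.tolist())
```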
By examining this box plot, we can observe the range of salaries and the variations across the top job titles. We can identify job titles with higher or lower median salaries, as well as those with wider or narrower salary distributions. This information can provide insights into salary trends, potential salary gaps, and the overall salary landscape for different job titles in the dataset.
Distribution of Employment Types
# Replace specific values in the 'employment_type' column
df['employment_type'] = df['employment_type'].replace({'FT': 'Full-time', 'PT': 'Part-time', 'C': 'Contract', 'I': 'Internship', 'F': 'Freelance', 'CT': 'Contract', 'FL': 'Freelance'})
# Calculate the distribution of values in the 'employment_type' column
employment_type_distribution = df['employment_type'].value_counts()
# Print the heading for the employment type distribution
print('Employment Type Distribution:')
# Print the distribution of employment types
print(employment_type_distribution)
# Create a new figure for the plot with a specific size of 10 inches in width and 6 inches in height
plt.figure(figsize=(10, 6))
# Generate a countplot using seaborn's 'countplot' function
barplot = sns.countplot(x='employment_type', data=df, order=df['employment_type'].value_counts().index, palette='viridis')
# The x-axis represents the 'employment_type' column from the DataFrame 'df'.
# The bars are ordered based on the value counts of each employment type.
# The color palette is set to 'viridis'.
# Iterate over the bars in the countplot
for p in barplot.patches:
    height = p.get_height()
    # Get the height (count) of the bar
    # Annotate each bar with its count
    barplot.annotate(format(round(height), ','), (p.get_x() + p.get_width() / 2., height), ha='center', va='center', xytext=(0, 5), textcoords='offset points')
    # The count is formatted with comma separators.
    # The annotation is horizontally centered at the top of the bar with a small upward offset.
# Add a title to the plot as 'Employment Type Distribution'
plt.title('Employment Type Distribution')
# Label the x-axis as 'Employment Type'
plt.xlabel('Employment Type')
# Label the y-axis as 'Count'
plt.ylabel('Count')
# Display the plot
plt.show()
The majority of data scientists are employed full-time, which aligns with the common expectation for this profession, with 3,718 instances. There are also 17 part-time, 10 contract, and 10 freelance positions.
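Expressed as shares, the imbalance is even clearer. This small sketch rebuilds the counts quoted above as a Series and converts them to percentages; on the real DataFrame, `df['employment_type'].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Counts quoted in the text above
counts = pd.Series({'Full-time': 3718, 'Part-time': 17, 'Contract': 10, 'Freelance': 10})

# Convert counts to percentages of all rows
shares = (counts / counts.sum() * 100).round(2)
print(shares)
```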
Average Salary by Job Title
# Calculate the average salary by job title
average_salary_by_job_title = df.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=False).astype(int)
# Print the heading for average salary by job title
print('\nAverage Salary by Job Title:')
# Print the average salary values for each job title
print(average_salary_by_job_title)
The job title with the highest average salary is "Data Science Tech Lead", with an average salary of $375,000.
The job title with the lowest average salary is "Power BI Developer", with an average salary of $5,409.
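Extreme averages like these often come from job titles with very few rows, so it can help to look at the row count and median next to the mean. The sketch below is illustrative only: the two extreme salaries are the averages quoted above, but the row counts and the other salaries are invented, and `min_count` is an arbitrary threshold.

```python
import pandas as pd

# Illustrative rows; the row counts here are invented, not taken from the dataset
sample = pd.DataFrame({
    'job_title': ['Data Scientist', 'Data Scientist', 'Data Scientist',
                  'Data Science Tech Lead', 'Power BI Developer'],
    'salary_in_usd': [120000, 140000, 160000, 375000, 5409],
})

# Mean, median, and number of rows per job title
summary = (sample.groupby('job_title')['salary_in_usd']
           .agg(['mean', 'median', 'count'])
           .sort_values('mean', ascending=False))
print(summary)

# Keep only titles backed by at least `min_count` rows (arbitrary threshold)
min_count = 3
reliable = summary[summary['count'] >= min_count]
print(reliable)
```

Applying the same `agg` call to the full DataFrame shows how many records stand behind each average before reading too much into the ranking.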
Top 15 Average Salaries by Job Title in the Data Science Field
# Select the top 15 average salaries from the 'average_salary_by_job_title' Series
top_15_salaries = average_salary_by_job_title.nlargest(15)
# Create a new figure for the plot with a specific size of 10 inches in width and 6 inches in height
plt.figure(figsize=(10, 6))
# Generate a bar plot using seaborn's 'barplot' function
barplot = sns.barplot(y=top_15_salaries.index, x=top_15_salaries.values, palette='viridis')
# Add text annotations to the bar plot
for i in range(top_15_salaries.shape[0]):
    # Create a label for the salary value by dividing it by 1000 and formatting it to display as a string with one decimal place followed by 'k'
    salary_label = f'${top_15_salaries.iloc[i] / 1000:.1f}k'
    # Place the label slightly to the right of the end of the bar, vertically centered on the bar
    barplot.text(top_15_salaries.iloc[i] + 1000, i, salary_label, ha='left', va='center')
# Add a title to the plot as 'Top 15 Average Salaries by Job Title'
plt.title('Top 15 Average Salaries by Job Title')
# Label the x-axis as 'Average Salary (USD)'
plt.xlabel('Average Salary (USD)')
# Label the y-axis as 'Job Title'
plt.ylabel('Job Title')
# Display the plot
plt.show()
The plot displays the top 15 job titles in terms of average salaries. Each bar represents a job title, and the length of the bar indicates the average salary associated with that job title.
Bottom salaries
# Select the bottom 15 average salaries from the 'average_salary_by_job_title' Series
bottom_15_salaries = average_salary_by_job_title.nsmallest(15)[::-1]
# Create a new figure for the plot with a specific size of 10 inches in width and 6 inches in height
plt.figure(figsize=(10, 6))
# Generate a bar plot using seaborn's 'barplot' function
barplot = sns.barplot(y=bottom_15_salaries.index, x=bottom_15_salaries.values, palette='viridis')
# Add text annotations to the bar plot
for i in range(bottom_15_salaries.shape[0]):
    # Create a label for the salary value by dividing it by 1000 and formatting it to display as a string with one decimal place followed by 'k'
    salary_label = f'${bottom_15_salaries.values[i] / 1000:.1f}k'
    # Add text annotations to the bar plot, positioning the salary value at the corresponding position on the x-axis and the label at the center of the bar
    barplot.text(bottom_15_salaries.values[i], i, salary_label, va='center')
# Add a title to the plot as 'Bottom 15 Average Salaries by Job Title'
plt.title('Bottom 15 Average Salaries by Job Title')
# Label the x-axis as 'Average Salary (USD)'
plt.xlabel('Average Salary (USD)')
# Label the y-axis as 'Job Title'
plt.ylabel('Job Title')
# Display the plot
plt.show()
The plot displays the bottom 15 job titles in terms of average salaries. Each bar represents a job title, and the length of the bar indicates the average salary associated with that job title.
Experience level distribution
# Replace specific values in the 'experience_level' column
df['experience_level'] = df['experience_level'].replace({'SE': 'Senior', 'MI': 'Mid-level', 'EN': 'Entry-level', 'EX': 'Executive'})
# Calculate the distribution of values in the 'experience_level' column
experience_level_distribution = df['experience_level'].value_counts()
# Print the heading for experience level distribution
print('\nExperience Level Distribution:')
# Print the distribution of experience levels
print(experience_level_distribution)
# Create a new figure for the plot with a specific size of 8 inches in width and 6 inches in height
plt.figure(figsize=(8, 6))
# Generate a countplot using seaborn's 'countplot' function
barplot = sns.countplot(data=df, x='experience_level', palette='viridis', order=['Executive', 'Senior', 'Mid-level', 'Entry-level'])
# The data is sourced from the DataFrame 'df', and the x-axis represents the 'experience_level' column.
# The color palette is set to 'viridis', and the order of the bars is specified as ['Executive', 'Senior', 'Mid-level', 'Entry-level'].
for p in barplot.patches:
    height = p.get_height()
    # Get the height (count) of each bar.
    barplot.annotate(format(round(height), ','), (p.get_x() + p.get_width() / 2., height), ha='center', va='center', xytext=(0, 5), textcoords='offset points')
    # Add text annotations to each bar, displaying the count with comma separators.
    # The annotation is positioned at the center of each bar's height with a small offset.
# Add a title to the plot as 'Experience Level Distribution'
plt.title('Experience Level Distribution')
# Label the x-axis as 'Experience Level'
plt.xlabel('Experience Level')
# Label the y-axis as 'Count'
plt.ylabel('Count')
# Display the plot
plt.show()
The plot shows the distribution of data scientists across different experience levels. The taller bars represent a higher number of data scientists in the 'Mid-level' and 'Senior' categories. On the other hand, the 'Executive' category has the fewest data scientists, indicated by the shortest bar. The 'Entry-level' category falls in between, with a moderate number of data scientists. This distribution gives us insights into the composition of data scientists based on their experience levels, helping us understand the experience requirements and expertise within the DS field.
Average salary by experience level
# Replace specific values in the 'experience_level' column (a no-op if the codes were already replaced above)
df['experience_level'] = df['experience_level'].replace({'EX': 'Executive', 'SE': 'Senior', 'MI': 'Mid-level', 'EN': 'Entry-level'})
# Calculate the average salary by experience level
average_salary_by_experience_level = df.groupby('experience_level')['salary_in_usd'].mean().sort_values(ascending=False).astype(int)
# Print the heading for average salary by experience level
print('\nAverage Salary by Experience Level:')
# Print the average salary values by experience level
print(average_salary_by_experience_level)
The experience level with the highest average salary is "EX" (Executive), with an average salary of approximately $194,931.
The experience level with the lowest average salary is "EN" (Entry-level), with an average salary of approximately $78,546.
# Replace experience level abbreviations with full descriptions
df['experience_level'] = df['experience_level'].replace({'EX': 'Executive', 'SE': 'Senior', 'MI': 'Mid-level', 'EN': 'Entry-level'})
# Calculate average salary by experience level
average_salary_by_experience_level = df.groupby('experience_level')['salary_in_usd'].mean().sort_values(ascending=False).astype(int)
# Set a lighter font style
sns.set(font_scale=0.8)
# Set the style without the dark background
sns.set_style('ticks')
# Create the plot
plt.figure(figsize=(10, 6))
barplot = sns.barplot(x=average_salary_by_experience_level.values, y=average_salary_by_experience_level.index, palette='viridis')
# Add value labels to the bars
for i, value in enumerate(average_salary_by_experience_level.values):
    barplot.text(value + 1000, i, f'${value/1000:.1f}k', ha='left', va='center')
# Set plot title and labels
plt.title('Average Salary by Experience Level')
plt.xlabel('Average Salary (USD)')
plt.ylabel('Experience Level')
# Show the plot
plt.show()
The "Executive" experience level has the highest average salary, followed by "Senior", "Mid-level", and "Entry-level". This suggests that as professionals gain more experience and progress in their careers, they tend to earn higher salaries.
The visualization provides insights into the relationship between experience level and salary in the field. It highlights the importance of experience in determining salary levels and can help individuals in understanding the salary expectations associated with different experience levels.
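If you would rather order the levels by seniority than by salary, an ordered categorical is one option. The sketch below uses made-up salaries; the level names match the labels used above, and the `levels` list is an assumption about the intended ranking.

```python
import pandas as pd

# Assumed seniority order, from least to most senior
levels = ['Entry-level', 'Mid-level', 'Senior', 'Executive']

# Made-up rows for illustration only
sample = pd.DataFrame({
    'experience_level': ['Senior', 'Entry-level', 'Executive', 'Mid-level', 'Senior'],
    'salary_in_usd': [150000, 70000, 200000, 100000, 160000],
})

# An ordered categorical makes groupby and sorting follow seniority, not the alphabet
sample['experience_level'] = pd.Categorical(
    sample['experience_level'], categories=levels, ordered=True)

avg = sample.groupby('experience_level', observed=True)['salary_in_usd'].mean()
print(avg)

# True when the average salary never decreases from one level to the next
print(avg.is_monotonic_increasing)
```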
Remote work ratios
# Replace specific values in the 'remote_ratio' column
df['remote_ratio'] = df['remote_ratio'].replace({0: 'In-office', 50: 'Hybrid', 100: 'Fully Remote'})
# Calculate the distribution of values in the 'remote_ratio' column
remote_ratio_distribution = df['remote_ratio'].value_counts()
# Print the heading for remote work ratio distribution
print('Remote Work Ratio Distribution:')
# Print the distribution of remote work ratios
print(remote_ratio_distribution)
# Reindex the 'remote_ratio_distribution' Series to ensure the desired order of categories: 'In-office', 'Hybrid', 'Fully Remote'
remote_ratio_distribution = remote_ratio_distribution.reindex(['In-office', 'Hybrid', 'Fully Remote'])
# Create a new figure for the plot with a specific size of 10 inches in width and 6 inches in height
plt.figure(figsize=(10, 6))
# Generate a bar plot using seaborn's 'barplot' function
barplot = sns.barplot(x=remote_ratio_distribution.index, y=remote_ratio_distribution.values, palette='viridis')
# The x-axis represents the categories from the 'remote_ratio_distribution' index,
# the y-axis represents the count values from the 'remote_ratio_distribution' values,
# and the color palette is set to 'viridis'.
# Add text annotations to the bars
for i, value in enumerate(remote_ratio_distribution.values):
    # Iterate over the values in the 'remote_ratio_distribution' values
    barplot.text(i, value + 100, f'{value}', ha='center', va='bottom')
    # Add text annotations to the bars, displaying the count values just above each bar.
# Add a title to the plot as 'Remote Work Ratio Distribution'
plt.title('Remote Work Ratio Distribution')
# Label the x-axis as 'Remote Work Ratio'
plt.xlabel('Remote Work Ratio')
# Remove the y-axis label
plt.ylabel('')
# Remove the y-axis tick labels
barplot.set_yticklabels([])
# Adjust the y-axis limit to improve the visibility of the bars and annotations
plt.ylim(0, remote_ratio_distribution.max() + 500)
# Improve the spacing between the elements of the plot
plt.tight_layout()
# Display the plot
plt.show()
Conclusion
In conclusion, the Data Science Job Salaries Dataset provides valuable insights into the job market for data science professionals. By examining the dataset, we were able to uncover patterns and trends that shed light on salary ranges, job titles, experience levels, employment types, and remote work arrangements in the field.
However, it is important to note that our analysis is not exhaustive, and there is still much more exploration and interpretation that can be done with this dataset. To continue the analysis and delve deeper into the findings, I recommend referring to the Kaggle notebook associated with this dataset. The notebook provides a platform for further exploration, data visualization, and advanced modeling techniques.
By leveraging the power of the Kaggle platform and the comprehensive dataset, researchers, aspiring data scientists, and industry professionals can continue to gain valuable insights and make informed decisions in the ever-evolving landscape of data science job opportunities.