Data Science has become increasingly popular over the years, and one of the most common questions is: how do I get started in this field? A frequent answer is: work on some competitions on the Kaggle platform.
Kaggle is indeed one of the most popular platforms when it comes to Data Science, and even if you are not willing to compete, there is a long list of datasets available for you to work on and sharpen your skills. One of these datasets is the 2019 Kaggle Machine Learning and Data Science Survey, and that is the one I am going to use in this project to see what we can learn from it.
The target questions that are going to guide the following workflow are:
Finally, a model will be built to see how well we can distinguish the different professions in the Data Science field, given other information such as salary, courses attended and activities performed.
# Loading packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
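The last few imports support the classification model mentioned in the introduction. As a minimal sketch of that kind of pipeline, here it is run end to end on synthetic stand-in data (the feature matrix, labels and parameter grid below are illustrative only, not the setup used later with the survey):

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Synthetic stand-in for the encoded survey features (illustrative only):
rng = np.random.RandomState(42)
X = rng.rand(300, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # two toy "professions"

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Small illustrative grid; the real notebook may tune different hyperparameters:
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid={'n_estimators': [50, 100], 'max_depth': [3, None]},
                    cv=3)
grid.fit(X_train, y_train)

# Weighted F1 handles class imbalance better than plain accuracy:
score = f1_score(y_test, grid.predict(X_test), average='weighted')
print(round(score, 2))
```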
# Reading the file:
survey = pd.read_csv('multiple_choice_responses.csv')
survey.head()
# Checking dataframe's shape:
survey.shape
As we can see, there are 246 different variables available in the survey and almost 20,000 observations. The first row contains the full text of each question, so I have decided to delete it and rename the columns so they relate to the content of each question.
Also, in a previous analysis, I selected some columns of interest that could help answer the target questions.
# List for selecting columns of interest:
target_columns = ['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6',
'Q9_Part_1', 'Q9_Part_2', 'Q9_Part_3', 'Q9_Part_4', 'Q9_Part_5', 'Q9_Part_6', 'Q9_Part_7', 'Q9_Part_8',
'Q10',
'Q13_Part_1', 'Q13_Part_2', 'Q13_Part_3', 'Q13_Part_4', 'Q13_Part_5', 'Q13_Part_6', 'Q13_Part_7',
'Q13_Part_8', 'Q13_Part_9', 'Q13_Part_10', 'Q13_Part_11', 'Q13_Part_12',
'Q15',
'Q18_Part_1', 'Q18_Part_2', 'Q18_Part_3', 'Q18_Part_4', 'Q18_Part_5', 'Q18_Part_6', 'Q18_Part_7',
'Q18_Part_8', 'Q18_Part_9', 'Q18_Part_10', 'Q18_Part_11', 'Q18_Part_12',
'Q28_Part_1', 'Q28_Part_2', 'Q28_Part_3', 'Q28_Part_4', 'Q28_Part_5', 'Q28_Part_6', 'Q28_Part_7',
'Q28_Part_8', 'Q28_Part_9', 'Q28_Part_10', 'Q28_Part_11', 'Q28_Part_12',
'Q34_Part_1', 'Q34_Part_2', 'Q34_Part_3', 'Q34_Part_4', 'Q34_Part_5', 'Q34_Part_6', 'Q34_Part_7',
'Q34_Part_8', 'Q34_Part_9', 'Q34_Part_10', 'Q34_Part_11', 'Q34_Part_12']
# Dictionary for transforming columns' name to more informative values:
columns_names = {'Q1': 'Age', 'Q2': 'Gender', 'Q3': 'Res_Country', 'Q4': 'Education',
'Q5': 'Current_Role', 'Q6': 'Company_Size',
'Q9_Part_1': 'actv_Data-Analysis', 'Q9_Part_2': 'actv_Data-Infra', 'Q9_Part_3': 'actv_ML-Prototypes',
'Q9_Part_4': 'actv_ML-Services', 'Q9_Part_5': 'actv_ML-Improvement', 'Q9_Part_6': 'actv_ML-Reserach',
'Q9_Part_7': 'actv_None', 'Q9_Part_8': 'actv_Other',
'Q10': 'Annual_Income',
'Q13_Part_1': 'course_Udacity', 'Q13_Part_2': 'course_Coursera', 'Q13_Part_3': 'course_edX',
'Q13_Part_4': 'course_DataCamp', 'Q13_Part_5': 'course_DataQuest', 'Q13_Part_6': 'course_Kaggle',
'Q13_Part_7': 'course_Fast-ai', 'Q13_Part_8': 'course_Udemy', 'Q13_Part_9': 'course_LinkedIn',
'Q13_Part_10': 'course_University-Degree', 'Q13_Part_11': 'course_None', 'Q13_Part_12': 'course_Other',
'Q15': 'Years_Coding',
'Q18_Part_1': 'prog_Python', 'Q18_Part_2': 'prog_R', 'Q18_Part_3': 'prog_SQL', 'Q18_Part_4': 'prog_C',
'Q18_Part_5': 'prog_C++', 'Q18_Part_6': 'prog_Java', 'Q18_Part_7': 'prog_Javascript',
'Q18_Part_8': 'prog_Typerscript', 'Q18_Part_9': 'prog_Bash', 'Q18_Part_10': 'prog_MATLAB',
'Q18_Part_11': 'prog_None', 'Q18_Part_12': 'prog_Other',
'Q28_Part_1': 'ml_framew_Scikit', 'Q28_Part_2': 'ml_framew_TensorFlow', 'Q28_Part_3': 'ml_framew_Keras',
'Q28_Part_4': 'ml_framew_RandomForest', 'Q28_Part_5': 'ml_framew_Xgboost', 'Q28_Part_6': 'ml_framew_Pytorch',
'Q28_Part_7': 'ml_framew_Caret', 'Q28_Part_8': 'ml_framew_LightGBM', 'Q28_Part_9': 'ml_framew_SparkMLib',
'Q28_Part_10': 'ml_framew_Fast-ai', 'Q28_Part_11': 'ml_framew_None', 'Q28_Part_12': 'ml_framew_Other',
'Q34_Part_1': 'relDB_MySQL', 'Q34_Part_2': 'relDB_PostgresSQL', 'Q34_Part_3': 'relDB_SQLite',
'Q34_Part_4': 'relDB_SQLServer', 'Q34_Part_5': 'relDB_Oracle', 'Q34_Part_6': 'relDB_Access',
'Q34_Part_7': 'relDB_AWS-relDB-Server', 'Q34_Part_8': 'relDB_AWS-DynamoDB', 'Q34_Part_9': 'relDB_AzureSQL',
'Q34_Part_10': 'relDB_GoogleCloudSQL', 'Q34_Part_11': 'relDB_None', 'Q34_Part_12': 'relDB_Other'}
# Selecting columns of interest:
survey = survey[target_columns]
# Deleting first row and updating the index:
survey = survey.drop([0], axis = 0).reset_index(drop = True)
# Changing columns' name:
survey.rename(columns = columns_names, inplace = True)
# Checking first rows after transformations:
survey.head()
# Checking dataframe's shape after transformations:
survey.shape
We have now reduced the original 246 columns to 64 columns of interest.
Next, let's check which positions the survey participants hold in their companies, since we are interested only in roles related to Data Science.
# Counting observations for different roles:
survey.Current_Role.value_counts()
Although the roles represent different fields and specializations, let's assume that the roles most related to Data Science, with activities that often overlap one another, are: Data Scientist, Data Analyst, Business Analyst, Data Engineer and DBA/Database Engineer.
# Filtering dataframe to the specific roles of interest:
roles_list = ['Data Scientist', 'Data Analyst', 'Business Analyst', 'Data Engineer', 'DBA/Database Engineer']
ds_survey = survey[survey['Current_Role'].isin(roles_list)].reset_index(drop = True)
ds_survey.shape
After this first round of filtering and transformations, we can start our exploratory analysis in order to answer the questions proposed earlier.
ds_survey.Education.value_counts()
Among the professions we filtered for, we can see that most people have a Master's degree as their highest education level. The second most common highest level of education is the Bachelor's degree.
Let's dig deeper into these numbers and try to get some extra information from them.
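As a quick sanity check, those counts can be turned into shares directly with `value_counts(normalize=True)`. A small sketch on toy data (in the notebook, the same call would be made on `ds_survey['Education']`):

```python
import pandas as pd

# Toy stand-in for the Education column:
edu = pd.Series(['Master’s degree'] * 5 + ['Bachelor’s degree'] * 3 + ['Doctoral degree'] * 2)

# Percentage of respondents per education level:
shares = (edu.value_counts(normalize=True) * 100).round(1)
print(shares)
```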
def bar_plot_edu(two_occupations, data = ds_survey, palette = 'GnBu_r'):
    '''
    INPUT:
    two_occupations - list, 1 or 2 "Current_Role" values as string
    data - dataframe, default = ds_survey
    palette - string, matplotlib palette, default = "GnBu_r"
    OUTPUT:
    figure plot - bar plot, x = Education Level, y = Percentage (one plot per "Current Role")
    '''
    # Defining the labels and style of the figure:
    x_label = 'Education Level'
    y_label = 'Percentage'
    sns.set_style("whitegrid")
    # Dictionary for transforming degree values:
    label_dict = {'Doctoral degree': 'Doctoral',
                  'Master’s degree': 'Master',
                  'Bachelor’s degree': 'Bachelor',
                  'Professional degree': 'Professional',
                  'Some college/university study without earning a bachelor’s degree': 'Not Bachelor',
                  'No formal education past high school': 'High School',
                  'I prefer not to answer': 'No Answer'}
    # If a plot for a single profession is required:
    if len(two_occupations) == 1:
        # Configuring column values:
        df = data[data['Current_Role'] == two_occupations[0]]
        ed_count = df['Education'].value_counts()
        label = list(ed_count.index)
        for k, v in label_dict.items():
            for i in range(0, len(label)):
                if k == label[i]:
                    label[i] = v
        height = [perc for perc in (ed_count.values/sum(ed_count.values))*100]
        # Creating the figure:
        fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (10, 7))
        sns.set_style("whitegrid")
        # Annotating percentages:
        for occ in range(0, len(label)):
            ax.text(occ, (height[occ] + 1), str(round(height[occ], 1)) + '%', c = 'k',
                    fontsize = 12, fontstretch = 'normal', fontweight = 'light', va = 'center', ha = 'center')
        # Setting the plot:
        sns.barplot(x = label, y = height, palette = palette).set(
            xlabel = x_label,
            ylabel = y_label,
            title = 'Education Level for ' + two_occupations[0],
            ylim = (0, 60))
        sns.despine(left = True)
        fig.show()
    # If a plot for two professions is required:
    elif len(two_occupations) == 2:
        # Configuring first column values:
        df0 = data[data['Current_Role'] == two_occupations[0]]
        ed_count0 = df0['Education'].value_counts()
        label0 = list(ed_count0.index)
        for k, v in label_dict.items():
            for i in range(0, len(label0)):
                if k == label0[i]:
                    label0[i] = v
        height0 = [perc for perc in (ed_count0.values/sum(ed_count0.values))*100]
        # Configuring second column values:
        df1 = data[data['Current_Role'] == two_occupations[1]]
        ed_count1 = df1['Education'].value_counts()
        label1 = list(ed_count1.index)
        for k, v in label_dict.items():
            for i in range(0, len(label1)):
                if k == label1[i]:
                    label1[i] = v
        height1 = [perc for perc in (ed_count1.values/sum(ed_count1.values))*100]
        # Creating the figure:
        fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (16.5, 5))
        sns.set_style("whitegrid")
        # Annotating percentages 1:
        for occ in range(0, len(label0)):
            ax[0].text(occ, (height0[occ] + 1), str(round(height0[occ], 1)) + '%', c = 'k',
                       fontsize = 12, fontstretch = 'normal', fontweight = 'light', va = 'center', ha = 'center')
        # Setting plot 1:
        sns.barplot(x = label0, y = height0, palette = palette, ax = ax[0]).set_title('Education Level for ' + two_occupations[0])
        ax[0].set_xlabel(x_label)
        ax[0].set_ylabel(y_label)
        ax[0].set_ylim(0, 60)
        # Annotating percentages 2:
        for occ in range(0, len(label1)):
            ax[1].text(occ, (height1[occ] + 1), str(round(height1[occ], 1)) + '%', c = 'k',
                       fontsize = 12, fontstretch = 'normal', fontweight = 'light', va = 'center', ha = 'center')
        # Setting plot 2:
        sns.barplot(x = label1, y = height1, palette = palette, ax = ax[1]).set_title('Education Level for ' + two_occupations[1])
        ax[1].set_xlabel(x_label)
        ax[1].set_ylabel(y_label)
        ax[1].set_ylim(0, 60)
        sns.despine(left = True)
        fig.show()
    # If more than two professions are required:
    else:
        print('two_occupations must be a list of 1 or 2 professions of interest')
bar_plot_edu([roles_list[0]], palette = 'summer')
bar_plot_edu([roles_list[1], roles_list[2]], palette = 'summer')
bar_plot_edu([roles_list[3], roles_list[4]], palette = 'summer')
We can see that, regardless of the role in the company, professionals with a Master's degree are always the majority among survey participants. The lowest proportion of Master's degrees is for DBA/Database Engineers, and it is still approximately 45%.
For all the occupations studied, the second most common education level is the Bachelor's degree. It is curious that, among data scientists, the Bachelor's and Doctoral degree percentages are quite close to each other, especially in comparison with the other professions.
When it comes to education, there is one more important aspect we want to understand: which courses these professionals are most likely to attend. The relevance of the courses and platforms available online is unquestionable, as many offer a more practical approach and allow students to progress at their own pace. In the post-pandemic world, the adoption of online education should become even more relevant.
# Creating a list with the columns related to courses and platforms:
platforms_list = ['course_Udacity', 'course_Coursera', 'course_edX', 'course_DataCamp', 'course_DataQuest', 'course_Kaggle',
'course_Fast-ai', 'course_Udemy', 'course_LinkedIn', 'course_University-Degree', 'course_None', 'course_Other']
# Defining function to create a figure with the information needed:
def platform_rank(two_occupations, data = ds_survey, palette = 'GnBu_r'):
    '''
    INPUT:
    two_occupations - list, 1 or 2 "Current_Role" (profession) values as string
    data - dataframe, default = ds_survey
    palette - string, matplotlib palette, default = "GnBu_r"
    OUTPUT:
    figure plot - horizontal bar plot, x = Percentage, y = Platforms used by each profession
    '''
    # Creating objects:
    plat1 = []
    plat2 = []
    perc1 = []
    perc2 = []
    df1 = pd.DataFrame()
    df2 = pd.DataFrame()
    # Defining the labels and style of the figure:
    x_label = 'Percentage'
    y_label = 'Educational Platform'
    sns.set_style("whitegrid")
    # If a plot for a single profession is required:
    if len(two_occupations) == 1:
        # Filtering dataframe for the specific profession:
        df = data[data['Current_Role'] == two_occupations[0]]
        # Creating new dataframe with the needed values:
        for platform in platforms_list:
            plat1.append(platform.split('_')[1])
            perc1.append(round(df[platform].notnull().mean()*100, 2))
        df1['Platform'] = plat1
        df1['Percentage'] = perc1
        df1 = df1.sort_values('Percentage', ascending = False).reset_index(drop = True)
        # Creating the figure:
        fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (9, 7))
        # Annotating percentages:
        for i in range(0, len(df1['Percentage'])):
            ax.text((df1['Percentage'][i] + 0.5), i, str(round(df1['Percentage'][i], 1)) + '%', c = 'k',
                    fontsize = 12, fontstretch = 'normal', fontweight = 'light', va = 'center')
        # Setting the plot:
        sns.barplot(x = 'Percentage', y = 'Platform', data = df1, palette = palette).set(
            xlabel = x_label,
            ylabel = y_label,
            title = 'Educational Platforms Used by ' + two_occupations[0] + 's',
            xlim = (0, 70))
        sns.despine(left = True)
        fig.show()
    # If a plot for two professions is required:
    elif len(two_occupations) == 2:
        # Filtering dataframe for the first profession:
        df = data[data['Current_Role'] == two_occupations[0]]
        # Creating new dataframes with the needed values:
        for platform in platforms_list:
            plat1.append(platform.split('_')[1])
            perc1.append(round(df[platform].notnull().mean()*100, 2))
        df1['Platform'] = plat1
        df1['Percentage'] = perc1
        df1 = df1.sort_values('Percentage', ascending = False).reset_index(drop = True)
        # Filtering dataframe for the second profession:
        df = data[data['Current_Role'] == two_occupations[1]]
        for platform in platforms_list:
            plat2.append(platform.split('_')[1])
            perc2.append(round(df[platform].notnull().mean()*100, 2))
        df2['Platform'] = plat2
        df2['Percentage'] = perc2
        df2 = df2.sort_values('Percentage', ascending = False).reset_index(drop = True)
        # Creating the figure:
        fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (15.5, 8.5))
        # Annotating percentages 1:
        for i in range(0, len(df1['Percentage'])):
            ax[0].text((df1['Percentage'][i] + 0.5), i, str(round(df1['Percentage'][i], 1)) + '%', c = 'k',
                       fontsize = 12, fontstretch = 'normal', fontweight = 'light', va = 'center')
        # Setting plot 1:
        sns.barplot(x = 'Percentage', y = 'Platform', data = df1, palette = palette,
                    ax = ax[0]).set_title('Educational Platforms Used by ' + two_occupations[0] + 's')
        ax[0].set_xlabel(x_label)
        ax[0].set_ylabel(y_label)
        ax[0].set_xlim(0, 55)
        # Annotating percentages 2:
        for i in range(0, len(df2['Percentage'])):
            ax[1].text((df2['Percentage'][i] + 0.5), i, str(round(df2['Percentage'][i], 1)) + '%', c = 'k',
                       fontsize = 12, fontstretch = 'normal', fontweight = 'light', va = 'center')
        # Setting plot 2:
        sns.barplot(x = 'Percentage', y = 'Platform', data = df2, palette = palette,
                    ax = ax[1]).set_title('Educational Platforms Used by ' + two_occupations[1] + 's')
        ax[1].set_xlabel(x_label)
        ax[1].set_ylabel(y_label)
        ax[1].set_xlim(0, 55)
        sns.despine(left = True)
        fig.show()
    # If more than two professions are required:
    else:
        print('two_occupations must be a list of 1 or 2 professions of interest')
platform_rank([roles_list[0]], palette = 'summer')
platform_rank([roles_list[1], roles_list[2]], palette = 'summer')
platform_rank([roles_list[3], roles_list[4]], palette = 'summer')
According to the professionals who participated in the survey, Coursera seems to be the most used educational platform. It is important to note that the question presented specific options to select from, and more than one could be chosen. With that said, a larger share of professionals have attended a course on Coursera than on any other platform.
It is interesting to see that for data engineers and DBA/database engineers, Kaggle stands out in second place, differing from the ranking for the other professionals.
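Because these are multiple-choice columns where a selected option appears as a non-null value, the share of respondents who picked each platform is simply the fraction of non-null entries per column, which is the `notnull().mean()` trick used inside the plotting function. A toy example (the dataframe below is illustrative, not the survey data):

```python
import pandas as pd
import numpy as np

# Toy multi-select columns: NaN means the option was not picked.
df = pd.DataFrame({'course_Coursera': ['Coursera', np.nan, 'Coursera', 'Coursera'],
                   'course_Kaggle':   [np.nan, 'Kaggle', np.nan, 'Kaggle']})

# Fraction of non-null entries per column, as a percentage:
pct = (df.notnull().mean() * 100).round(1)
print(pct)  # course_Coursera 75.0, course_Kaggle 50.0
```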
Let's now check the most common activities these professionals perform in their daily work. When it comes to Data Science, people are usually more attracted to topics related to machine learning algorithms, so it will be interesting to see whether that is the activity most professionals in the survey perform.
# Creating a list with the columns related to activities:
activities_list = ['actv_Data-Analysis', 'actv_Data-Infra', 'actv_ML-Prototypes',
'actv_ML-Services', 'actv_ML-Improvement', 'actv_ML-Reserach', 'actv_None', 'actv_Other']
# Defining function to create a figure with the information needed:
def activites_rank(two_occupations, data = ds_survey, palette = 'GnBu_r'):
    '''
    INPUT:
    two_occupations - list, 1 or 2 "Current_Role" (profession) values as string
    data - dataframe, default = ds_survey
    palette - string, matplotlib palette, default = "GnBu_r"
    OUTPUT:
    figure plot - horizontal bar plot, x = Percentage, y = Activities performed by each profession
    '''
    # Creating objects:
    actv1 = []
    actv2 = []
    perc1 = []
    perc2 = []
    df1 = pd.DataFrame()
    df2 = pd.DataFrame()
    # Defining the labels and style of the figure:
    x_label = 'Percentage'
    y_label = 'Activities Performed'
    sns.set_style("whitegrid")
    # If a plot for a single profession is required:
    if len(two_occupations) == 1:
        # Filtering dataframe for the specific profession:
        df = data[data['Current_Role'] == two_occupations[0]]
        # Creating new dataframe with the needed values:
        for activity in activities_list:
            actv1.append(activity.split('_')[1])
            perc1.append(round(df[activity].notnull().mean()*100, 2))
        df1['Activity'] = actv1
        df1['Percentage'] = perc1
        df1 = df1.sort_values('Percentage', ascending = False).reset_index(drop = True)
        # Creating the figure:
        fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (10, 7))
        # Annotating percentages:
        for i in range(0, len(df1['Percentage'])):
            ax.text((df1['Percentage'][i] + 0.5), i, str(round(df1['Percentage'][i], 1)) + '%', c = 'k',
                    fontsize = 12, fontstretch = 'normal', fontweight = 'light', va = 'center')
        # Setting the plot:
        sns.barplot(x = 'Percentage', y = 'Activity', data = df1, palette = palette).set(
            xlabel = x_label,
            ylabel = y_label,
            title = 'Activities Performed by ' + two_occupations[0] + 's',
            xlim = (0, 80))
        sns.despine(left = True)
        fig.show()
    # If a plot for two professions is required:
    elif len(two_occupations) == 2:
        # Filtering dataframe for the first profession:
        df = data[data['Current_Role'] == two_occupations[0]]
        # Creating new dataframe with the needed values:
        for activity in activities_list:
            actv1.append(activity.split('_')[1])
            perc1.append(round(df[activity].notnull().mean()*100, 2))
        df1['Activity'] = actv1
        df1['Percentage'] = perc1
        df1 = df1.sort_values('Percentage', ascending = False).reset_index(drop = True)
        # Filtering dataframe for the second profession:
        df = data[data['Current_Role'] == two_occupations[1]]
        # Creating new dataframes with the needed values:
        for activity in activities_list:
            actv2.append(activity.split('_')[1])
            perc2.append(round(df[activity].notnull().mean()*100, 2))
        df2['Activity'] = actv2
        df2['Percentage'] = perc2
        df2 = df2.sort_values('Percentage', ascending = False).reset_index(drop = True)
        # Creating the figure:
        fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (15.5, 8.5))
        # Annotating percentages 1:
        for i in range(0, len(df1['Percentage'])):
            ax[0].text((df1['Percentage'][i] + 0.5), i, str(round(df1['Percentage'][i], 1)) + '%', c = 'k',
                       fontsize = 12, fontstretch = 'normal', fontweight = 'light', va = 'center')
        # Setting plot 1:
        sns.barplot(x = 'Percentage', y = 'Activity', data = df1, palette = palette,
                    ax = ax[0]).set_title('Activities Performed by ' + two_occupations[0] + 's')
        ax[0].set_xlabel(x_label)
        ax[0].set_ylabel(y_label)
        ax[0].set_xlim(0, 60)
        # Annotating percentages 2:
        for i in range(0, len(df2['Percentage'])):
            ax[1].text((df2['Percentage'][i] + 0.5), i, str(round(df2['Percentage'][i], 1)) + '%', c = 'k',
                       fontsize = 12, fontstretch = 'normal', fontweight = 'light', va = 'center')
        # Setting plot 2:
        sns.barplot(x = 'Percentage', y = 'Activity', data = df2, palette = palette,
                    ax = ax[1]).set_title('Activities Performed by ' + two_occupations[1] + 's')
        ax[1].set_xlabel(x_label)
        ax[1].set_ylabel(y_label)
        ax[1].set_xlim(0, 60)
        sns.despine(left = True)
        fig.show()
    # If more than two professions are required:
    else:
        print('two_occupations must be a list of 1 or 2 professions of interest')
activites_rank([roles_list[0]], palette = 'cool')
activites_rank([roles_list[1], roles_list[2]], palette = 'cool')
activites_rank([roles_list[3], roles_list[4]], palette = 'cool')
We can see that data analysis is the most performed activity, except for data engineers. In their case, as expected, working on tasks related to data infrastructure is the most common, though even there data analysis remains an important activity.
According to the survey, dealing with data infrastructure is the second most common task for data and business analysts.
Even though activities related to machine learning are well represented among data scientists' tasks, data analysis still plays a more relevant role in their daily work. It is interesting to note that, even though infrastructure is most commonly associated with data engineers, more than 30% of data scientists perform activities related to data infrastructure.
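That kind of cross-role comparison can also be computed directly with a groupby, without plotting. A sketch on toy data (with the real dataframe, the equivalent call would group `ds_survey` by `Current_Role` over the `actv_Data-Infra` column):

```python
import pandas as pd
import numpy as np

# Toy frame: one multi-select activity column, NaN = option not selected.
toy = pd.DataFrame({'Current_Role': ['Data Scientist', 'Data Scientist',
                                     'Data Engineer', 'Data Engineer'],
                    'actv_Data-Infra': [np.nan, 'Infra', 'Infra', 'Infra']})

# Share of respondents per role who selected the data-infrastructure activity:
infra_share = toy.groupby('Current_Role')['actv_Data-Infra'].apply(
    lambda s: s.notnull().mean() * 100)
print(infra_share)
```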
Our goal now is to identify the most common tools used in Data Science, whether a programming language, a machine learning framework or a relational database.
# Creating a list with the columns related to programming languages:
languages_list = ['prog_Python', 'prog_R', 'prog_SQL', 'prog_C', 'prog_C++', 'prog_Java', 'prog_Javascript',
'prog_Typerscript', 'prog_Bash', 'prog_MATLAB', 'prog_None', 'prog_Other']
# Defining function to create a figure with the information needed:
def languages_rank(two_occupations, data = ds_survey, palette = 'GnBu_r'):
    '''
    INPUT:
    two_occupations - list, 1 or 2 "Current_Role" (profession) values as string
    data - dataframe, default = ds_survey
    palette - string, matplotlib palette, default = "GnBu_r"
    OUTPUT:
    figure plot - horizontal bar plot, x = Percentage, y = Languages used by each profession
    '''
    # Creating objects:
    lang1 = []
    lang2 = []
    perc1 = []
    perc2 = []
    df1 = pd.DataFrame()
    df2 = pd.DataFrame()
    # Defining the labels and style of the figure:
    x_label = 'Percentage'
    y_label = 'Programming Languages'
    sns.set_style("whitegrid")
    # If a plot for a single profession is required:
    if len(two_occupations) == 1:
        # Filtering dataframe for the specific profession:
        df = data[data['Current_Role'] == two_occupations[0]]
        # Creating new dataframe with the needed values:
        for language in languages_list:
            lang1.append(language.split('_')[1])
            perc1.append(round(df[language].notnull().mean()*100, 2))
        df1['Language'] = lang1
        df1['Percentage'] = perc1
        df1 = df1.sort_values('Percentage', ascending = False).reset_index(drop = True)
        # Creating the figure:
        fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (9, 7))
        # Annotating percentages:
        for i in range(0, len(df1['Percentage'])):
            ax.text((df1['Percentage'][i] + 0.5), i, str(round(df1['Percentage'][i], 1)) + '%', c = 'k',
                    fontsize = 12, fontstretch = 'normal', fontweight = 'light', va = 'center')
        # Setting the plot:
        sns.barplot(x = 'Percentage', y = 'Language', data = df1, palette = palette).set(
            xlabel = x_label,
            ylabel = y_label,
            title = 'Languages Used by ' + two_occupations[0] + 's',
            xlim = (0, 90))
        sns.despine(left = True)
        fig.show()
    # If a plot for two professions is required:
    elif len(two_occupations) == 2:
        # Filtering dataframe for the first profession:
        df = data[data['Current_Role'] == two_occupations[0]]
        # Creating new dataframe with the needed values:
        for language in languages_list:
            lang1.append(language.split('_')[1])
            perc1.append(round(df[language].notnull().mean()*100, 2))
        df1['Language'] = lang1
        df1['Percentage'] = perc1
        df1 = df1.sort_values('Percentage', ascending = False).reset_index(drop = True)
        # Filtering dataframe for the second profession:
        df = data[data['Current_Role'] == two_occupations[1]]
        # Creating new dataframes with the needed values:
        for language in languages_list:
            lang2.append(language.split('_')[1])
            perc2.append(round(df[language].notnull().mean()*100, 2))
        df2['Language'] = lang2
        df2['Percentage'] = perc2
        df2 = df2.sort_values('Percentage', ascending = False).reset_index(drop = True)
        # Creating the figure:
        fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (15.5, 8.5))
        # Annotating percentages 1:
        for i in range(0, len(df1['Percentage'])):
            ax[0].text((df1['Percentage'][i] + 0.5), i, str(round(df1['Percentage'][i], 1)) + '%', c = 'k',
                       fontsize = 12, fontstretch = 'normal', fontweight = 'light', va = 'center')
        # Setting plot 1:
        sns.barplot(x = 'Percentage', y = 'Language', data = df1, palette = palette,
                    ax = ax[0]).set_title('Languages Used by ' + two_occupations[0] + 's')
        ax[0].set_xlabel(x_label)
        ax[0].set_ylabel(y_label)
        ax[0].set_xlim(0, 80)
        # Annotating percentages 2:
        for i in range(0, len(df2['Percentage'])):
            ax[1].text((df2['Percentage'][i] + 0.5), i, str(round(df2['Percentage'][i], 1)) + '%', c = 'k',
                       fontsize = 12, fontstretch = 'normal', fontweight = 'light', va = 'center')
        # Setting plot 2:
        sns.barplot(x = 'Percentage', y = 'Language', data = df2, palette = palette,
                    ax = ax[1]).set_title('Languages Used by ' + two_occupations[1] + 's')
        ax[1].set_xlabel(x_label)
        ax[1].set_ylabel(y_label)
        ax[1].set_xlim(0, 80)
        sns.despine(left = True)
        fig.show()
    # If more than two professions are required:
    else:
        print('two_occupations must be a list of 1 or 2 professions of interest')
languages_rank([roles_list[0]], palette = 'autumn')
languages_rank([roles_list[1], roles_list[2]], palette = 'autumn')
languages_rank([roles_list[3], roles_list[4]], palette = 'autumn')
Undoubtedly, Python and SQL are the most used languages in the Data Science field. SQL only beats Python among DBA/database engineers, which is to be expected.
Among data scientists, Python is used by almost 80% of the survey participants, showing how popular and powerful this language is. On the other hand, the simplicity of SQL does not diminish its usefulness: no matter the role in Data Science, according to the survey, SQL will be really important for performing the necessary tasks.
R is consolidated in third position as an alternative to Python, although it does not stand out as the main language for any of the groups.
# Creating a list with the columns related to machine learning frameworks:
mlframew_list = ['ml_framew_Scikit', 'ml_framew_TensorFlow', 'ml_framew_Keras', 'ml_framew_RandomForest',
'ml_framew_Xgboost', 'ml_framew_Pytorch', 'ml_framew_Caret', 'ml_framew_LightGBM',
'ml_framew_SparkMLib', 'ml_framew_Fast-ai', 'ml_framew_None', 'ml_framew_Other']
# Defining function to create a figure with the information needed:
def mlframew_rank(two_occupations, data = ds_survey, palette = 'GnBu_r'):
    '''
    INPUT:
    two_occupations - list, 1 or 2 "Current_Role" (profession) values as string
    data - dataframe, default = ds_survey
    palette - string, matplotlib palette, default = "GnBu_r"
    OUTPUT:
    figure plot - horizontal bar plot, x = Percentage, y = Frameworks used by each profession
    '''
    # Creating objects:
    frame1 = []
    frame2 = []
    perc1 = []
    perc2 = []
    df1 = pd.DataFrame()
    df2 = pd.DataFrame()
    # Defining the labels and style of the figure:
    x_label = 'Percentage'
    y_label = 'Machine Learning Framework'
    sns.set_style("whitegrid")
    # If a plot for a single profession is required:
    if len(two_occupations) == 1:
        # Filtering dataframe for the specific profession:
        df = data[data['Current_Role'] == two_occupations[0]]
        # Creating new dataframe with the needed values:
        for framework in mlframew_list:
            frame1.append(framework.split('_')[2])
            perc1.append(round(df[framework].notnull().mean()*100, 2))
        df1['Framework'] = frame1
        df1['Percentage'] = perc1
        df1 = df1.sort_values('Percentage', ascending = False).reset_index(drop = True)
        # Creating the figure:
        fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (9, 7))
        # Annotating percentages:
        for i in range(0, len(df1['Percentage'])):
            ax.text((df1['Percentage'][i] + 0.5), i, str(round(df1['Percentage'][i], 1)) + '%', c = 'k',
                    fontsize = 12, fontstretch = 'normal', fontweight = 'light', va = 'center')
        # Setting the plot:
        sns.barplot(x = 'Percentage', y = 'Framework', data = df1, palette = palette).set(
            xlabel = x_label,
            ylabel = y_label,
            title = 'Frameworks Used by ' + two_occupations[0] + 's',
            xlim = (0, 80))
        sns.despine(left = True)
        fig.show()
    # If a plot for two professions is required:
    elif len(two_occupations) == 2:
        # Filtering dataframe for the first profession:
        df = data[data['Current_Role'] == two_occupations[0]]
        # Creating new dataframe with the needed values:
        for framework in mlframew_list:
            frame1.append(framework.split('_')[2])
            perc1.append(round(df[framework].notnull().mean()*100, 2))
        df1['Framework'] = frame1
        df1['Percentage'] = perc1
        df1 = df1.sort_values('Percentage', ascending = False).reset_index(drop = True)
        # Filtering dataframe for the second profession:
        df = data[data['Current_Role'] == two_occupations[1]]
        # Creating new dataframes with the needed values:
        for framework in mlframew_list:
            frame2.append(framework.split('_')[2])
            perc2.append(round(df[framework].notnull().mean()*100, 2))
        df2['Framework'] = frame2
        df2['Percentage'] = perc2
        df2 = df2.sort_values('Percentage', ascending = False).reset_index(drop = True)
        # Creating the figure:
        fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (15.5, 8.5))
        # Annotating percentages 1:
        for i in range(0, len(df1['Percentage'])):
            ax[0].text((df1['Percentage'][i] + 0.5), i, str(round(df1['Percentage'][i], 1)) + '%', c = 'k',
                       fontsize = 12, fontstretch = 'normal', fontweight = 'light', va = 'center')
        # Setting plot 1:
        sns.barplot(x = 'Percentage', y = 'Framework', data = df1, palette = palette,
                    ax = ax[0]).set_title('Frameworks Used by ' + two_occupations[0] + 's')
        ax[0].set_xlabel(x_label)
        ax[0].set_ylabel(y_label)
        ax[0].set_xlim(0, 60)
        # Annotating percentages 2:
        for i in range(0, len(df2['Percentage'])):
            ax[1].text((df2['Percentage'][i] + 0.5), i, str(round(df2['Percentage'][i], 1)) + '%', c = 'k',
                       fontsize = 12, fontstretch = 'normal', fontweight = 'light', va = 'center')
        # Setting plot 2:
        sns.barplot(x = 'Percentage', y = 'Framework', data = df2, palette = palette,
                    ax = ax[1]).set_title('Frameworks Used by ' + two_occupations[1] + 's')
        ax[1].set_xlabel(x_label)
        ax[1].set_ylabel(y_label)
        ax[1].set_xlim(0, 60)
        sns.despine(left = True)
        fig.show()
    # If more than two professions are required:
    else:
        print('two_occupations must be a list of 1 or 2 professions of interest')
mlframew_rank([roles_list[0]], palette = 'autumn')
mlframew_rank([roles_list[1], roles_list[2]], palette = 'autumn')
mlframew_rank([roles_list[3], roles_list[4]], palette = 'autumn')
Since we found that Python is the most used language among the survey participants, it was expected that Scikit-Learn would also be the most used machine learning framework, and we have just confirmed that.
What stands out is that among data and business analysts, Random Forest is the second most used option, while for the other professionals that place belongs to Keras.
# Creating a list with the columns related to relational databases:
reldb_list = ['relDB_MySQL', 'relDB_PostgresSQL', 'relDB_SQLite', 'relDB_SQLServer', 'relDB_Oracle',
'relDB_Access', 'relDB_AWS-relDB-Server', 'relDB_AWS-DynamoDB', 'relDB_AzureSQL',
'relDB_GoogleCloudSQL', 'relDB_None', 'relDB_Other']
# Defining function to create a figure with the information needed:
def reldb_rank(two_occupations, data = ds_survey, palette = 'GnBu_r'):
    '''
    INPUT:
    two_occupations - list, 1 or 2 "Current_Role" (profession) values as string
    data - dataframe, default = ds_survey
    palette - string, matplotlib palette, default = "GnBu_r"
    OUTPUT:
    figure plot - horizontal bar plot, x = Percentage, y = Relational databases used by each profession
    '''
    # Creating objects:
    db1 = []
    db2 = []
    perc1 = []
    perc2 = []
    df1 = pd.DataFrame()
    df2 = pd.DataFrame()
    # Defining the labels and style of the figure:
    x_label = 'Percentage'
    y_label = 'Relational Database'
    sns.set_style("whitegrid")
    # If a plot for a single profession is required:
    if len(two_occupations) == 1:
        # Filtering dataframe for the specific profession:
        df = data[data['Current_Role'] == two_occupations[0]]
        # Creating new dataframe with the needed values:
        for db in reldb_list:
            db1.append(db.split('_')[1])
            perc1.append(round(df[db].notnull().mean()*100, 2))
        df1['Database'] = db1
        df1['Percentage'] = perc1
        df1 = df1.sort_values('Percentage', ascending = False).reset_index(drop = True)
        # Creating the figure:
        fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (9, 7))
        # Annotating percentages:
        for i in range(0, len(df1['Percentage'])):
            ax.text((df1['Percentage'][i] + 0.5), i, str(round(df1['Percentage'][i], 1)) + '%', c = 'k',
                    fontsize = 12, fontstretch = 'normal', fontweight = 'light', va = 'center')
        # Setting the plot:
        sns.barplot(x = 'Percentage', y = 'Database', data = df1, palette = palette).set(
            xlabel = x_label,
            ylabel = y_label,
            title = 'Relational Databases Used by ' + two_occupations[0] + 's',
            xlim = (0, 50))
        sns.despine(left = True)
        fig.show()
    # If a plot for two professions is required:
    elif len(two_occupations) == 2:
        # Filtering dataframe for the specific profession:
        df = data[data['Current_Role'] == two_occupations[0]]
        # Creating new dataframe with the needed values:
        for db in reldb_list:
            db1.append(db.split('_')[1])
            perc1.append(round(df[db].notnull().mean()*100, 2))
        df1['Database'] = db1
        df1['Percentage'] = perc1
        df1 = df1.sort_values('Percentage', ascending = False).reset_index(drop = True)
# Filtering dataframe for the specific profession:
df = data[data['Current_Role'] == two_occupations[1]]
# Creating new dataframes with the needed values:
for db in reldb_list:
db2.append(db.split('_')[1])
perc2.append(round(df[db].notnull().mean()*100, 2))
df2['Database'] = db2
df2['Percentage'] = perc2
df2 = df2.sort_values('Percentage', ascending = False).reset_index(drop = True)
# Creating the figure:
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (15.5, 8.5))
# Annotating percentages 1:
for i in range(0, len(df1['Percentage'])):
ax[0].text((df1['Percentage'][i] + 0.5), i, str(round(df1['Percentage'][i], 1))+'%', c = 'k',
fontsize=12, fontstretch = 'normal', fontweight = 'light', va= 'center')
# Setting plot 1:
sns.barplot(x = 'Percentage', y = 'Database', data = df1, palette = palette,
ax = ax[0]).set_title('Relational Databases Used by ' + two_occupations[0] + 's')
ax[0].set_xlabel(x_label)
ax[0].set_ylabel(y_label)
ax[0].set_xlim(0,50)
# Annotating percentages 2:
for i in range(0, len(df2['Percentage'])):
ax[1].text((df2['Percentage'][i] + 0.5), i, str(round(df2['Percentage'][i], 1))+'%', c = 'k',
fontsize=12, fontstretch = 'normal', fontweight = 'light', va= 'center')
# Setting plot 2:
sns.barplot(x = 'Percentage', y = 'Database', data = df2, palette = palette,
ax = ax[1]).set_title('Relational Databases Used by ' + two_occupations[1] + 's')
ax[1].set_xlabel(x_label)
ax[1].set_ylabel(y_label)
ax[1].set_xlim(0,50)
sns.despine(left=True)
fig.show()
# If more than two professions are required:
else:
return print('two_occupations must be a list with 1 or 2 professions of interest')
reldb_rank([roles_list[0]], palette = 'autumn')
reldb_rank([roles_list[1], roles_list[2]], palette = 'autumn')
reldb_rank([roles_list[3], roles_list[4]], palette = 'autumn')
When it comes to relational databases, MySQL stands out in the majority of the professional groups, giving way only to SQL Server among DBAs/database engineers.
We can see that cloud services like Google Cloud SQL have not yet overtaken traditional databases like MySQL, PostgreSQL and SQL Server, but many believe it is just a matter of time, since cloud computing provides many benefits in terms of physical infrastructure, maintenance and costs (it is possible to pay according to storage size or usage time).
Now, let's talk about money.
# Checking how salary has been recorded in the survey:
ds_survey['Annual_Income'].value_counts()
# Checking NaN values for salary:
null_perc = ds_survey['Annual_Income'].isnull().mean()
print('About ' + str(int((null_perc*100))) + '% of the participants did not declare their incomes.')
Two things to be considered:
# Creating function to transform income range into average salary:
def range_to_avg_salary(row):
'''
INPUT:
row - range of salary values as string
OUTPUT:
avg_salary - average salary value as integer
'''
if type(row) == str:
# Salary range 0 - 999:
if row.startswith('$'):
avg_salary = 500
# Salary range > 500,000:
elif row.startswith('>'):
avg_salary = 500000
# Salary range 1,000 to 500,000:
else:
numbers = row.replace(',', '').split('-')
num1 = int(numbers[0])
num2 = int(numbers[1])
avg_salary = (num1 + (num2 + 1))/2
else:
avg_salary = np.nan
return avg_salary
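Before applying the function to the whole column, a quick sanity check on a few representative income strings confirms the conversion behaves as intended. The function below is a standalone copy of range_to_avg_salary, and the sample values mirror the formats used in the survey ('$0-999', '1,000-1,999', '> $500,000'):

```python
import numpy as np
import pandas as pd

def range_to_avg_salary(row):
    # Standalone copy of the function above, so this check runs on its own:
    if isinstance(row, str):
        if row.startswith('$'):        # '$0-999'
            return 500
        if row.startswith('>'):        # '> $500,000'
            return 500000
        low, high = (int(n) for n in row.replace(',', '').split('-'))
        return (low + (high + 1)) / 2
    return np.nan

samples = pd.Series(['$0-999', '1,000-1,999', '> $500,000', np.nan])
print(samples.apply(range_to_avg_salary).tolist())
# → [500, 1500.0, 500000, nan]
```

Each range is mapped to its midpoint, with the two open-ended brackets pinned to fixed values, so the resulting column is a rough but usable numeric proxy for income.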
# Applying function to the Annual_Income column:
ds_survey['Avg_Salary'] = ds_survey['Annual_Income'].apply(range_to_avg_salary)
# Deleting rows with NaN values for salary:
ds_survey_sal = ds_survey.dropna(subset = ['Avg_Salary'])
# Average salary by role (mean):
ds_survey_sal.groupby('Current_Role')['Avg_Salary'].mean().sort_values(ascending = False)
# Average salary by role (median):
ds_survey_sal.groupby('Current_Role')['Avg_Salary'].median().sort_values(ascending = False)
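The two groupby calls above can also be combined into a single table with `.agg`, which makes the mean/median comparison easier to scan side by side. A small sketch, using a toy frame standing in for ds_survey_sal:

```python
import pandas as pd

# Toy data standing in for ds_survey_sal:
toy = pd.DataFrame({'Current_Role': ['Data Scientist', 'Data Scientist', 'Data Engineer'],
                    'Avg_Salary': [60000, 120000, 80000]})
# One table with both aggregations per role:
summary = toy.groupby('Current_Role')['Avg_Salary'].agg(['mean', 'median'])
print(summary)
```

Looking at mean and median together is useful here because the open-ended '> $500,000' bracket can pull the mean up noticeably while leaving the median unaffected.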
# Average salary by gender (mean):
ds_survey_sal.groupby('Gender')['Avg_Salary'].mean().sort_values(ascending = False)
# Counting observations for 'Prefer to self_describe' value:
ds_survey_sal.Gender.value_counts()
# Since there are only 8 observations with the 'Prefer to self-describe' gender value, we are not going to consider them:
ds_survey_gender = ds_survey_sal[ds_survey_sal['Gender'] != 'Prefer to self-describe']
# Average salary by gender (mean) - excluding 'Prefer to self-describe':
ds_survey_gender.groupby('Gender')['Avg_Salary'].mean().sort_values(ascending = False)
# Average salary by gender (median) - excluding 'Prefer to self-describe':
ds_survey_gender.groupby('Gender')['Avg_Salary'].median().sort_values(ascending = False)
# Average salary by Education (mean):
ds_survey_sal.groupby('Education')['Avg_Salary'].mean().sort_values(ascending = False)
# Average salary by Education (median):
ds_survey_sal.groupby('Education')['Avg_Salary'].median().sort_values(ascending = False)
# Creating plot for average salary per occupation:
# Grouping by occupation and calculating the mean value:
occ_sal = ds_survey_sal.groupby('Current_Role')['Avg_Salary'].mean().sort_values(ascending = False)
# Creating the figure:
x_label = 'Occupation'
y_label = 'Mean Annual Salary'
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (9, 5))
# Annotating avg salary:
for occ in range(0, len(occ_sal.values)):
ax.text(occ, (occ_sal.values[occ] + 1500), str(int(occ_sal.values[occ]))+'$', c = 'k',
fontsize=12, fontstretch = 'normal', fontweight = 'light', va= 'center' , ha= 'center')
# Setting the plot:
sns.barplot(x = occ_sal.index, y = occ_sal.values, palette = 'winter').set(
xlabel = x_label,
ylabel = y_label,
title = 'Mean Annual Salary per Occupation',
ylim = (0,70000))
sns.despine(left=True)
fig.show()
# Grouping by Company Size and calculating the mean value:
comp_sal = ds_survey_sal.groupby('Company_Size')['Avg_Salary'].mean().sort_values(ascending = False)
# Creating the figure:
x_label = 'Company Size'
y_label = 'Mean Annual Salary'
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (9, 5))
# Annotating avg salary:
for comp in range(0, len(comp_sal.values)):
ax.text(comp, (comp_sal.values[comp] + 1500), str(int(comp_sal.values[comp]))+'$', c = 'k',
fontsize=12, fontstretch = 'normal', fontweight = 'light', va= 'center' , ha= 'center')
# Setting the plot:
sns.barplot(x = comp_sal.index, y = comp_sal.values, palette = 'winter').set(
xlabel = x_label,
ylabel = y_label,
title = 'Mean Annual Salary per Company Size - All Occupations',
ylim = (0,70000))
sns.despine(left=True)
fig.show()
# Grouping by Country and calculating the mean value:
country_sal = ds_survey_sal.groupby('Res_Country')['Avg_Salary'].mean().sort_values(ascending = False)
# Creating the figure:
x_label = 'Country of Residence'
y_label = 'Mean Annual Salary'
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (11, 5))
# Annotating avg salary:
for country in range(0, 7):
ax.text(country, (country_sal.values[country] + 1500), str(int(country_sal.values[country]))+'$', c = 'k',
fontsize=12, fontstretch = 'normal', fontweight = 'light', va= 'center' , ha= 'center')
# Setting the plot:
sns.barplot(x = country_sal.index[0:7], y = country_sal.values[0:7], palette = 'winter').set(
xlabel = x_label,
ylabel = y_label,
title = 'Top 7 Mean Annual Salary per Country of Residence - All Occupations',
ylim = (0,130000))
sns.despine(left=True)
fig.show()
Comparing the mean annual income across occupations, data scientists appear to be the best paid, followed by data engineers.
Across all the Data Science occupations, bigger companies seem to pay the best salaries, and the United States is the country with the highest average salary.
def plot_salary(occupation, ds_df = ds_survey_sal, ds_gen = ds_survey_gender, palette = 'GnBu_r', y_top = 90000):
'''
INPUT:
occupation - string, 1 "Current_Role" (profession) value as string
ds_df - dataframe with salaries converted to numeric, default = ds_survey_sal
ds_gen - dataframe with filtered gender, default = ds_survey_gender
palette - string, matplotlib palette, default = "GnBu_r"
y_top = numeric, default = 90000 (limit for y axis)
OUTPUT:
figure plot - bar plots, x = education level / gender, y = mean annual salary for the given profession
'''
# Filtering dataframe for specific occupation:
df1 = ds_df[ds_df['Current_Role'] == occupation]
# Grouping by Education:
ed_df = df1.groupby('Education')['Avg_Salary'].mean().sort_values(ascending = False)
# Simplifying degree value:
label_dict = {'Doctoral degree': 'Doctoral',
'Master’s degree': 'Master',
'Bachelor’s degree': 'Bachelor',
'Professional degree': 'Professional',
'Some college/university study without earning a bachelor’s degree': 'Not Bachelor',
'No formal education past high school': 'High School',
'I prefer not to answer': 'No Answer'}
label = list(ed_df.index)
for k, v in label_dict.items():
for i in range(0, len(label)):
if k == label[i]:
label[i] = v
# Filtering dataframe for specific occupation:
df2 = ds_gen[ds_gen['Current_Role'] == occupation]
# Grouping by Gender:
gen_df = df2.groupby('Gender')['Avg_Salary'].mean().sort_values(ascending = False)
sns.set_style("whitegrid")
# Creating the figure:
fig = plt.figure(figsize=(15.5,7))
gs = gridspec.GridSpec(nrows=1,
ncols=3,
figure=fig,
width_ratios= [1, 1, 1],
height_ratios=[1],
wspace=0.3,
hspace=1)
ax0 = fig.add_subplot(gs[0:2])
ax1 = fig.add_subplot(gs[2:3])
# Annotating avg salary per Education:
for ed in range(0, len(ed_df.values)):
ax0.text(ed, (ed_df.values[ed] + 1500), str(int(ed_df.values[ed]))+'$', c = 'k',
fontsize=12, fontstretch = 'normal', fontweight = 'light', va= 'center' , ha= 'center')
# Annotating avg salary per Gender:
for gn in range(0, len(gen_df.values)):
ax1.text(gn, (gen_df.values[gn] + 1500), str(int(gen_df.values[gn]))+'$', c = 'k',
fontsize=12, fontstretch = 'normal', fontweight = 'light', va= 'center' , ha= 'center')
# Setting plot 1:
sns.barplot(x = label, y = ed_df.values, palette = palette,
ax = ax0).set_title('Mean Annual Salary by Education - ' + occupation)
ax0.set_xlabel('Educational Level')
ax0.set_ylabel('Mean Annual Salary')
ax0.set_ylim(0, y_top)
# Setting plot 2:
sns.barplot(x = gen_df.index, y = gen_df.values, palette = palette,
ax = ax1).set_title('Mean Annual Salary by Gender - ' + occupation)
ax1.set_xlabel('Gender')
ax1.set_ylabel('Mean Annual Salary')
ax1.set_ylim(0, y_top)
sns.despine(left=True)
fig.show()
plot_salary(roles_list[0], palette = 'winter')
plot_salary(roles_list[1], palette = 'winter')
plot_salary(roles_list[2], palette = 'winter')
plot_salary(roles_list[3], palette = 'winter')
plot_salary(roles_list[4], palette = 'winter', y_top = 100000)
In terms of salary, there is a lot to consider before jumping to conclusions, since people who preferred not to declare their gender appear as the ones with the highest mean salary. Also, among data/business analysts and DBAs/database engineers, people who decided not to answer about their educational level rank well, with better mean annual salaries.
Given these limitations, a Doctoral degree is usually related to the highest salaries. For data scientists, data engineers and data analysts, a Master's degree also relates to better pay, although that does not seem to hold for the other occupations. On the other hand, a Bachelor's degree cannot be related to higher incomes, which could indicate that, except for Doctoral and Master's degrees, knowledge in the field of interest matters more than formal educational background.
Sadly, but not surprisingly, women have a lower average salary than men in all the Data Science roles considered here.
# Transforming the multiple-choice columns to dummy format (1 if the respondent selected the option, 0 otherwise):
full_list = platforms_list + activities_list + languages_list + mlframew_list + reldb_list
for col in full_list:
    ds_survey[col] = ds_survey[col].notnull().astype(int)
# Transforming Current Role into categorical type, since that's the column we want to predict:
ds_survey['Current_Role'] = ds_survey['Current_Role'].astype('category')
# Getting dummies from the other columns of interest:
cat_columns = ['Education', 'Years_Coding']
df_model = pd.get_dummies(ds_survey, columns = cat_columns, drop_first = True)
df_model.shape
# Since we want to see how well educational background, courses, skills and activities can distinguish different roles,
# we are going to drop the other columns that don't relate to that:
df_model = df_model.drop(columns = ['Age', 'Gender', 'Res_Country', 'Company_Size', 'Annual_Income', 'Avg_Salary'])
df_model.head()
# Splitting dataframe into X and y:
X = df_model.loc[:, df_model.columns != 'Current_Role']
y = df_model['Current_Role']
# Creating a Decision Tree model for the whole dataframe to check for main variables:
tree_model = DecisionTreeClassifier(random_state = 100)
tree_model.fit(X, y)
# Checking feature importance:
feat_imp = pd.DataFrame()
feat_imp['Feature'] = X.columns
feat_imp['Importance'] = tree_model.feature_importances_
feat_imp = feat_imp.sort_values('Importance', ascending = False).reset_index(drop = True)
# Checking 15 first most important features:
feat_imp.head(15)
# Checking 15 last important features:
feat_imp.tail(15)
We can see that one machine learning activity appears as the most important feature for distinguishing the different professions. We have seen before that machine learning activities are most commonly performed by data scientists, which is probably why this feature ranks first.
The other main features have similar importance values, with those related to courses and activities being the most common.
We are going to select 50 of them to build our predictive model, according to the importance order, and see how well our model performs on predicting which profession relates to each profile.
# Selecting the 50 most important features:
features = feat_imp['Feature'].head(50).tolist()
# Checking class balance:
df_model.Current_Role.value_counts()
We have five different classes, some of them with only a few observations. We are not going to balance the dataframe, since we just want an overall view of how well our model can distinguish the different classes/professions.
# Splitting data into train and test sets, keeping only the 50 selected features:
X_train, X_test, y_train, y_test = train_test_split(X[features], y, test_size = 0.30, random_state = 100)
# Using GridSearchCV to find the best parameters:
grid_values = {'n_estimators': [25, 50, 75, 100, 150], 'max_features': [3, 5, 7, 10, 13], 'max_depth': [None, 5, 15, 30]}
rf_model = RandomForestClassifier()
grid_rf_f1 = GridSearchCV(rf_model, param_grid = grid_values, scoring = 'f1_micro')
grid_rf_f1.fit(X_train, y_train)
# Best parameters:
grid_rf_f1.best_params_
# Best score:
grid_rf_f1.best_score_
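Instead of retyping the best parameters into a new model by hand, the estimator refit by the grid search can be reused directly: with the default refit=True, `best_estimator_` is already fit on the full training set. A minimal sketch on synthetic data (make_classification stands in for the survey features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train:
X_demo, y_demo = make_classification(n_samples=100, random_state=0)
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={'n_estimators': [10, 25]},
                    scoring='f1_micro')
grid.fit(X_demo, y_demo)
# best_estimator_ is refit on the full training data (refit=True by default),
# so it can be used for prediction right away:
best_rf = grid.best_estimator_
print(grid.best_params_)
```

This avoids the risk of the hand-copied parameters drifting out of sync with `best_params_` when the grid is rerun.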
# Training model for best parameters:
rf_final_model = RandomForestClassifier(max_depth = 15, max_features = 10, n_estimators = 75)
rf_final_model.fit(X_train, y_train)
# Predicting for test values:
y_pred = rf_final_model.predict(X_test)
f1 = f1_score(y_test, y_pred, average = 'micro')
f1
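The micro-averaged F1 alone does not show which roles get mixed up with which. A confusion matrix would reveal that; the sketch below uses hypothetical labels standing in for y_test and y_pred from the cells above:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels standing in for y_test / y_pred:
y_true = ['Data Scientist', 'Data Analyst', 'Data Scientist', 'Data Engineer']
y_hat  = ['Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Engineer']
labels = ['Data Analyst', 'Data Engineer', 'Data Scientist']
cm = pd.DataFrame(confusion_matrix(y_true, y_hat, labels=labels),
                  index=labels, columns=labels)
print(cm)  # rows = true role, columns = predicted role
```

Off-diagonal counts concentrated in a few cells (say, analysts predicted as data scientists) would tell us which profiles overlap most, which is exactly the kind of overlap discussed below.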
We can see that our model has not performed well at capturing the profile differences that lead to the different professions and roles in the Data Science field.
But it makes sense. Data Science is a relatively new field, characterized by the intersection of different areas of knowledge, such as statistics, computer science and business. With that being said, even though a data scientist may not be responsible for data infrastructure, they will still need to know some programming language, while a data engineer may not work with machine learning algorithms, but will still have to deal with data transformation when creating data pipelines.
It is still common to see companies announcing a job opportunity for, let's say, a data analyst, while the job description refers to a data engineer. We could not call it a big mistake, since these different 'specializations' have so much in common, and in some companies one person is responsible for the whole process, from data infrastructure to model deployment.