HR Analytics: Job Change of Data Scientists

In this project, I performed an exploratory analysis on the HR Analytics dataset to understand what the data contains, developed an ML pipeline to predict the possibility of an employee changing their job, and visualized my model predictions using a Streamlit web app hosted on Heroku. The dataset contains 19,158 observations. One goal is to create a process, in the form of a questionnaire, to identify employees who wish to stay versus leave using a CART model. I also wanted to see how the categorical features related to the target variable. I made a stackplot for each categorical feature and target, but for the clarity of the post I am only showing the stackplot for enrolled_course and target. Not every group in the plot is labeled; that's because I set the threshold to a relative difference of 50%, so that labels for groups with small differences won't clutter up the plot. In other words, if target=0 and target=1 were to have the same size, people enrolled in a full-time course would be more likely to be looking for a job change than not. New employees are generally people who are also starting a family and trying to balance their job with a spouse and kids, so there is a need to understand those employees better through more surveys or more work-life balance opportunities. Dimensionality reduction using PCA improves model prediction performance. I finished by making a quick heatmap, which led me to conclude that the actual relationship between these variables is weak; that is why I kept getting weak results.
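The idea behind the stackplot labels can be sketched in a few lines of pandas. This is only an illustration on invented toy data: the column name enrolled_university, the definition of "relative difference" (taken here relative to the smaller of the two shares), and the THRESHOLD name are assumptions, not the original plotting function.

```python
import pandas as pd

# Toy data standing in for the real dataset; the values are invented.
df = pd.DataFrame({
    "enrolled_university": ["no_enrollment"] * 8 + ["Full time course"] * 2
                           + ["no_enrollment"] * 6 + ["Full time course"] * 4,
    "target": [0] * 10 + [1] * 10,
})

# Share of each category within each target label (each column sums to 1),
# which controls for the different sizes of the two target groups.
shares = pd.crosstab(df["enrolled_university"], df["target"], normalize="columns")

# Relative difference between the two target groups, per category
# (assumption: measured against the smaller of the two shares).
rel_diff = (shares[1] - shares[0]).abs() / shares[[0, 1]].min(axis=1)

# Only categories whose shares differ by at least 50% would get a label.
THRESHOLD = 0.5
labelled = rel_diff[rel_diff >= THRESHOLD].index.tolist()

print(shares.round(2))
print(labelled)
```

Normalizing by column is what makes the two target groups comparable even though far more rows have target=0 than target=1.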
We have seen rampant demand for data-driven technologies in this era, and one of the key careers fueling it is the data scientist, which has earned the title of sexiest job out there. This demand, along with plenty of opportunities, gives greater flexibility to those who are lucky enough to work in the field. I wanted a challenge and tried to tackle this task I found on Kaggle: HR Analytics: Job Change of Data Scientists. This dataset consists of rows of data science employees who either are searching for a job change (target=1) or not (target=0). Each employee is described with various demographic features. The bar chart of value counts gives you an idea about how many values are available in each column, and we can see that multiple features have a significant amount of missing data (~30%). First, I'd like to take a look at how categorical features are correlated with the target variable. We can see from the plot that people who are looking for a job change (target 1) are at least 50% more likely to be enrolled in a full-time course than those who are not looking for a job change (target 0). Employees with less than one year, 1 to 5 years, and 6 to 10 years of experience tend to leave the job more often than others. And since the different companies had varying sizes (number of employees), we decided to see whether that has an impact on an employee's decision to call it quits at their current place of employment. I then moved on to using other variables to try to predict education_level, but first I had to make some changes to the data: I changed the gender and education_level columns. One deliverable is a questionnaire (a list of questions to identify candidates who will work for the company or will look for a new job).
The Random Forest classifier performs considerably better than the Logistic Regression classifier, albeit being more memory-intensive and time-consuming to train. The training dataset with 20133 observations is used for model building, and the built model is validated on the validation dataset of 8629 observations. The relatively small gap in accuracy and AUC scores suggests that the model did not significantly overfit. For the correlation between city_development_index and target I got a coefficient of -0.34, indicating a moderate negative relationship, which matches the negative relationship we saw from the violin plot. The classification models (CART, RandomForest, LASSO, RIDGE) identified three variables as significant for an employee's decision whether to leave or work for the company. The number of STEM majors is quite high compared to others. Someone who has been in their current role for 4+ years is more likely to work for the company than someone who has been in their current role for less than a year.
This dataset is designed to understand the factors that lead a person to leave their current job, for HR research too. It involves using model(s) to predict the probability that a candidate will look for a new job or will work for the company, as well as interpreting the factors that affect the employee's decision. Target isn't included in the test set, but the test target values data file is available for related tasks. Before jumping into the data visualization, it's good to take a look at what each feature means: the dataset includes numerical and categorical features, some of which have high cardinality. So I performed Label Encoding to convert these features into a numeric form. We also did some feature engineering, as we believed this might help us understand more about why an employee would seek another job. After splitting the data into train and validation sets, we get a distribution of class labels which shows that the classes are not imbalanced. There was only a slight increase in accuracy and AUC score from applying LightGBM over XGBoost, but there is a significant difference in the execution time of the training procedure. In the end, the HR department can have more options to recruit with the same budget compared with the old method, and also more time to focus on candidate qualification and get the best candidates for the company. If you liked the article, please hit the icon to support it.
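As a rough sketch of the label-encoding step, here is how a single categorical column could be turned into integer codes with scikit-learn's LabelEncoder. The column name education_level is taken from the feature list, but the values and the choice of this column are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented sample values for one categorical feature.
df = pd.DataFrame({"education_level": ["Graduate", "Masters", "Graduate", "Phd"]})

encoder = LabelEncoder()
# fit_transform learns the sorted set of classes and maps each value to its index.
codes = encoder.fit_transform(df["education_level"])
df["education_level_encoded"] = codes

print(list(encoder.classes_))
print(df)
```

Note that LabelEncoder assigns codes in alphabetical order of the class names, so the resulting integers do not necessarily reflect any natural ordering of the categories.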
The project proceeds in stages: a brief analysis of the data, a further exploratory analysis, data preparation and processing before modelling, building and optimizing the machine learning model, explaining the model's decisions, and finally applying machine learning to solve the business problem and meet the business objective. The source of this dataset is Kaggle. Many people sign up for the company's training. The company wants to increase recruitment efficiency by knowing which candidates are looking for a job change in their career, so they can be hired as data scientists; the task is to predict the probability that a candidate will work for the company. The simplest way to analyse the data is to look into the distributions of each feature. Let us first start by removing unnecessary columns, i.e., enrollee_id, as those are unique values, and city, as it is not very significant in this case. The violin plot shows the numeric variable city_development_index (CDI) against target. Another interesting observation we made was that, as the city development index for a particular city increases, a smaller share of the total workforce is looking to change jobs. Synthetically sampling the data using the Synthetic Minority Oversampling Technique (SMOTE) results in the best-performing Logistic Regression model, as seen from the highest F1 and Recall scores. XGBoost and LightGBM have good accuracy scores of more than 90%.
This blog intends to explore and understand the factors that lead a data scientist to change or leave their current job. The dataset I am planning to use is from Kaggle; the original dataset can be found there, and full details including all of my code are available in a notebook on Kaggle. I am pretty new to the KNIME Analytics Platform and have completed the self-paced basics course. To predict which candidates will change jobs, we can't use simple statistics and need machine learning, so that the company can categorize candidates into those who are looking for a job change and those who are not. So we need a new method which can reduce cost (money and time) and increase the probability of success in order to reduce CPH. In order to control for the size of the target groups, I made a function to plot the stackplot to visualize correlations between variables. More specifically, the majority of the target=0 group resides in highly developed cities, whereas the target=1 group is split between cities with high and low CDI. SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space, and drawing a new sample at a point along that line. Initially, we used Logistic Regression as our model. Random Forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. We identify important factors affecting the decision to stay or leave using MeanDecreaseGini from the RandomForest model.
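The SMOTE idea described above (pick a minority example, pick one of its close neighbours, and sample a point on the line between them) can be sketched with NumPy alone. This is a simplified illustration, not the implementation from any particular library; the function name smote_like_sample, the parameter k, and the toy points are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def smote_like_sample(minority, k=2, rng=rng):
    """Create one synthetic point by interpolating between a random minority
    sample and one of its k nearest minority neighbours."""
    i = rng.integers(len(minority))
    base = minority[i]
    # Euclidean distances from the base point to every minority sample.
    dists = np.linalg.norm(minority - base, axis=1)
    # Skip the first entry of the sort: for distinct points it is the base itself.
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = minority[rng.choice(neighbours)]
    # Pick a random point on the segment between base and neighbour.
    gap = rng.random()
    return base + gap * (neighbour - base)


# Invented minority-class samples in a two-dimensional feature space.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_point = smote_like_sample(minority)
print(new_point)
```

Because the new point lies on a segment between two existing minority samples, it always stays inside the region already covered by the minority class. In practice a library such as imbalanced-learn would normally be used instead of hand-rolled code.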
To improve candidate selection in their recruitment processes, a company collects data and builds a model to predict whether a candidate will continue to work for the company or not. The aim is to isolate the reasons that can cause an employee to leave their current company. What is the effect of company size on the desire for a job change? The stackplot shows groups as percentages of each target label, rather than as raw counts. A violin plot shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. After a final check of remaining null values, we went on to visualization. We see an imbalanced dataset: most people are not job-seeking. In terms of the individual cities, 56% of our data was collected from only 5 cities. Take a shot at building a baseline model that shows a basic metric. I used Random Forest to build the baseline model. Modelling the data suggests that experience is a factor, with a logistic regression model reaching an AUC of 0.75. For the purposes of exploring, let's just focus on the logistic regression for now. I ended up getting a slightly better result than the last time. Of course, there is a lot of work left to drive this analysis further if time permits.
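A baseline of this kind could look roughly like the following scikit-learn sketch. It is not the original code: the data is synthetic (generated with make_classification as a stand-in for the encoded HR features), and the split ratio, number of trees, and random seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the encoded HR features (assumption).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=42
)

# Stratify so both splits keep the same class proportions.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Accuracy uses hard predictions; AUC-ROC uses the predicted probability of class 1.
val_pred = model.predict(X_val)
val_proba = model.predict_proba(X_val)[:, 1]
accuracy = accuracy_score(y_val, val_pred)
auc = roc_auc_score(y_val, val_proba)

print(f"accuracy={accuracy:.3f}  AUC-ROC={auc:.3f}")
```

Reporting AUC-ROC next to accuracy matters on an imbalanced target, because a model that always predicts the majority class can still score a high accuracy.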
A company which is active in Big Data and Data Science wants to hire data scientists among people who successfully pass some courses conducted by the company. Based on the characteristics of the employees, HR wants to understand the factors affecting an employee's decision to stay in or leave their current job. This allows the company to reduce cost and time and to improve the quality of training, the planning of courses, and the categorization of candidates. The Kaggle task is to predict the probability that a candidate will work for the company. A sample submission corresponding to the enrollee_id values of the test set is provided too, with columns enrollee_id and target. The dataset is imbalanced. It contains 14 columns. Note: in the train data, there is one human error in the company_size column. In our case, company_size and company_type contain the most missing values, followed by gender and major_discipline. The approach to cleaning up the data had 6 major steps. Besides renaming a few columns for better visualization, there were no more apparent issues with our data. In the stackplot, notice that only the orange bar is labeled. Refer to my notebook for all of the other stackplots. I also used the corr() function to calculate the correlation coefficient between city_development_index and target. XGBoost is a scalable and accurate implementation of gradient boosting machines; it was built for model performance and computational speed and has proven to push the limits of computing power for boosted-tree algorithms. Choose an appropriate number of iterations by analyzing the evaluation metric on the validation dataset. It is a great approach for the first step.
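For illustration, the correlation step with corr() might look like this in pandas. The numbers below are invented toy values, so the resulting coefficient is not the one reported in the analysis.

```python
import pandas as pd

# Invented toy values; the real columns come from the Kaggle training file.
df = pd.DataFrame({
    "city_development_index": [0.92, 0.91, 0.90, 0.62, 0.55, 0.92, 0.62, 0.89],
    "target": [0, 0, 0, 1, 1, 0, 1, 1],
})

# Series.corr defaults to the Pearson correlation coefficient.
r = df["city_development_index"].corr(df["target"])
print(round(r, 2))
```

With a binary target, the Pearson coefficient is the point-biserial correlation: a negative value means that rows with target=1 tend to have a lower city development index.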
Features:
city_development_index: Development index of the city (scaled)
relevent_experience: Relevant experience of candidate
enrolled_university: Type of university course enrolled in, if any
education_level: Education level of candidate
major_discipline: Education major discipline of candidate
experience: Candidate total experience in years
company_size: Number of employees in current employer's company
last_new_job: Difference in years between previous job and current job
target: 0 = not looking for a job change, 1 = looking for a job change

For details of the dataset, see its Kaggle page. However, according to a survey, it seems some candidates leave the company once trained. Next, we need to convert categorical data to numeric format because sklearn cannot handle it directly. After applying SMOTE on the entire data, the dataset is split into train and validation sets. Disclaimer: I own the content of the analysis as presented in this post and in my Colab notebook.
In the company_size column, one value was entered as Oct-49 and is printed by pandas as 10/49, so we need to convert it into np.nan (NaN), i.e., a NumPy null or missing entry. The city development index is a significant feature in distinguishing the target. We calculated the distribution of experience among the employees in our dataset for a better understanding of experience as a factor that impacts the employee's decision. Does more training reduce attrition? Maybe job satisfaction plays a role? Using the pd.get_dummies function, we one-hot-encoded the nominal features, which allowed the categorical data to be interpreted by the model. This operation is performed feature-wise, in an independent way. Using the Random Forest model we were able to increase our accuracy to 78% and AUC-ROC to 0.785. Some notes about the data: the data is imbalanced, most features are categorical, some with high cardinality, and missing-value imputation can be part of the pipeline. For any suggestions or queries, leave your comments below and follow for updates.
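The two cleaning steps mentioned above, turning the mistyped company_size value into a missing entry and one-hot encoding nominal columns, can be sketched as follows. The sample rows are invented, and using major_discipline as the column to one-hot encode is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Invented sample rows; "10/49" is the mistyped company_size value.
df = pd.DataFrame({
    "company_size": ["50-99", "10/49", "<10", "10/49"],
    "major_discipline": ["STEM", "Arts", "STEM", "Humanities"],
})

# Treat the mistyped entry as missing.
df["company_size"] = df["company_size"].replace("10/49", np.nan)

# One-hot encode a nominal column: one indicator column per category.
encoded = pd.get_dummies(df, columns=["major_discipline"])

print(df["company_size"].isna().sum())
print(list(encoded.columns))
```

Each category becomes its own indicator column, so no artificial ordering is introduced between categories such as STEM and Arts.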