Archive December 2019

Logistic regression to predict absenteeism – approach

Business Problem:

In today's environment, high competitiveness increases the pressure on employees. It can lead to unachievable goals, which can cause health issues for employees, and those health issues in turn lead to absenteeism.

With a given dataset an organisation is trying to predict employee absenteeism.

What is absenteeism in the business context?

Absence from work during normal working hours, resulting in temporary incapacity to execute regular working activity.

Purpose Of Model:

Explore whether a person presenting certain characteristics is expected to be away from work at some points in time or not.

Dataset:

I have downloaded a data set from Kaggle called 'Absenteeism_data.csv', which contains the following information.

  • Reason_1 – A Type of Reason to be absent.
  • Reason_2 – A Type of Reason to be absent.
  • Reason_3 – A Type of Reason to be absent.
  • Reason_4 – A Type of Reason to be absent.
  • Month Value – Month in which employee has been absent.
  • Day of the Week – Day of the week on which the employee was absent.
  • Transportation Expense – Expense in dollars.
  • Distance to Work – Distance to the workplace in km.
  • Age – Age of the employee.
  • Daily Work Load Average – Average amount of time spent working per day, in minutes.
  • Body Mass Index – Body mass index of the employee.
  • Education – Education category (1 – high school, 2 – graduate, 3 – postgraduate, 4 – master or doctor).
  • Children – Number of children the employee has.
  • Pet – Whether the employee has a pet or not.
  • Absenteeism Time in Hours – How many hours an employee has been absent.

The following are the main actions we will take in this project.

  1. Build the model in Python.
  2. Save the result in MySQL.
  3. Visualise the end result in Tableau.

Python for model building:

We are going to take the following steps to predict absenteeism:

Load the data

Import the ‘Absenteeism_data.csv’ with the help of pandas

Identify dependent Variable i.e. identify the Y:

Logistic regression needs a categorical target, so we must find a way to say whether someone is 'being absent too much' or not. We take the median of 'Absenteeism Time in Hours' as the cut-off line. In this way the dataset is balanced: there is a roughly equal number of 0s and 1s for the logistic regression. Imbalanced classes are a common problem in ML, so this works well for us. Alternatively, if we had more data, we could have dealt with the issue in other ways; for instance, we could have assigned some arbitrary value as the cut-off line instead of the median.

Note that the line below assigns 1 to anyone who has been absent for 4 hours or more (more than 3 hours), which is the equivalent of taking half a day off. Initial code from the lecture:

targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > 3, 1, 0)
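To make the cut-off concrete, here is a minimal sketch on a handful of made-up values. The array `hours` is hypothetical and only stands in for the 'Absenteeism Time in Hours' column; it is not taken from the real data set.

```python
import numpy as np

# Hypothetical absence durations in hours (not from the real data set).
hours = np.array([0, 2, 3, 4, 8, 1, 40, 3])

# The median is the cut-off line; for these values it is 3.
cutoff = np.median(hours)

# 1 = absent more than the cut-off, 0 = otherwise.
targets = np.where(hours > cutoff, 1, 0)

print(cutoff)          # the cut-off used
print(targets)         # the 0/1 target for each row
print(targets.mean())  # share of 1s, a quick balance check
```

Checking the share of 1s is a quick way to see how balanced the resulting target is.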

Choose Algorithm to develop model:

As our Y (dependent variable) is 1 or 0, i.e. excessively absent or not, we are going to use logistic regression for our analysis.

Select Input for the regression:

We have to select all our x variables, i.e. all the independent variables we will use in the regression analysis.

Data Pre-processing:

Remove or treat missing values

In our case there are no missing values, so we don't have to worry about them. There are, however, some columns that add no value to our analysis, such as ID, which is unique in every case, so we will remove it.

Remove Outliers

In our case there are no outliers, so we don't have to worry. In general, if a variable has outliers, you can take the log of that x variable to reduce their influence.

Standardize the data

Standardization is one of the most common pre-processing tools. Since inputs of different magnitude (scale) can bias a model towards the features with high values, we want all inputs to be of similar magnitude. This is a peculiarity of machine learning in general: most (but not all) algorithms do badly with unscaled data. A very useful class we can use is StandardScaler. It has more capabilities than the straightforward 'pre-processing' method. We will create a variable that will contain the scaling information for this particular dataset.

Here’s the full documentation:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
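A minimal sketch of how StandardScaler stores and applies the scaling information. The two columns below are hypothetical values chosen only to show inputs on different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for two columns on very different scales,
# e.g. a transportation expense and an age.
X = np.array([[118.0, 28.0],
              [179.0, 33.0],
              [289.0, 50.0],
              [235.0, 41.0]])

scaler = StandardScaler()        # will contain the scaling information
scaler.fit(X)                    # learns each column's mean and standard deviation
X_scaled = scaler.transform(X)   # (x - mean) / std, column by column

print(scaler.mean_)              # the stored column means
print(X_scaled.mean(axis=0))     # approximately 0 for every column
print(X_scaled.std(axis=0))      # approximately 1 for every column
```

Keeping the fitted `scaler` object around matters later: new observations have to be transformed with the same means and standard deviations.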

Choose the columns to scale

In this section we choose which variables to transform or scale. In our case we transform ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Education', 'Pet' and 'Children'], because these columns contain categorical data stored in numerical form.

What about the other columns?

'Month Value', 'Day of the Week', 'Transportation Expense', 'Distance to Work', 'Age', 'Daily Work Load Average', 'Body Mass Index'. These are numerical values of data type int, so we do not transform them, but we keep them in our analysis. Be aware that the more common convention is the reverse: standardize the continuous numerical columns and leave 0/1 dummy columns unscaled, so that the dummy coefficients stay easy to interpret.

Note:

You may ask why we analyse the data manually, column by column.

It is always good to analyse data feature by feature: it gives us confidence in our model, and it makes the model's results easier to interpret.

Split Data into train and test

Divide the data into a train set and a test set, and build the model on the train set.

Apply Algorithm

As per our scenario, we are going to use logistic regression. The following steps will take place.

Train the model

First we divide the data into train and test sets, then we build our model on the train set.

Test the model

Once we have successfully developed our model, we need to test it on data it has not seen, which is the test set.

Find the intercept and coefficients

Extract the intercept (b0) and the coefficients (the beta values) from the model.

Interpreting the coefficients

Find out which features contribute the most to the prediction of Y.
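The split, train, test and coefficient steps above can be sketched with scikit-learn as follows. The data is synthetic and the 80/20 split and random seeds are assumptions made for illustration, not the settings of the original project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 3 already-scaled features (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# The target depends mainly on the first feature, plus some noise.
noise = rng.normal(scale=0.5, size=200)
y = (1.5 * X[:, 0] - 0.5 * X[:, 1] + noise > 0).astype(int)

# Keep 20% of the rows aside for testing.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=20)

# Train on the train set only.
reg = LogisticRegression()
reg.fit(x_train, y_train)

# Compare accuracy on seen and unseen data.
print("train accuracy:", reg.score(x_train, y_train))
print("test accuracy: ", reg.score(x_test, y_test))

# Intercept (b0) and one coefficient per feature.
print("intercept:   ", reg.intercept_)
print("coefficients:", reg.coef_[0])
# exp(coefficient) is the odds ratio for a one-unit increase in that feature.
print("odds ratios: ", np.exp(reg.coef_[0]))
```

A coefficient near 0 (odds ratio near 1) suggests the feature adds little to the prediction, which is the basis for interpreting the coefficients.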

Save the model

We need to save the model we have prepared so far. To do that, we pickle the model.

Two pickle files will be saved in your Python working directory: one called 'model' and the other 'scaler'.

To reuse the code from your .ipynb notebook as a module, save it as a .py file.
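A minimal sketch of the pickling step. To keep it self-contained, two plain dictionaries with hypothetical values stand in for the fitted model and scaler; fitted scikit-learn objects are pickled in exactly the same way. A temporary folder is used here, where a real project would use its own folder.

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted objects (hypothetical values).
model = {"intercept": -0.2, "coefficients": [0.9, -0.3]}
scaler = {"mean": [30.0, 220.0], "std": [6.0, 65.0]}

folder = tempfile.mkdtemp()  # use your project folder in practice

# Save each object to its own file, named as in the text.
for name, obj in (("model", model), ("scaler", scaler)):
    with open(os.path.join(folder, name), "wb") as f:
        pickle.dump(obj, f)

# Later, for example inside the exported .py module, load them back.
with open(os.path.join(folder, "model"), "rb") as f:
    loaded_model = pickle.load(f)
with open(os.path.join(folder, "scaler"), "rb") as f:
    loaded_scaler = pickle.load(f)

print(sorted(os.listdir(folder)))
```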

Check model performance on a totally new data set with the same features

Now we have a totally new data set, which has the same features as the previous data set but contains different values.

Note – To do that, the 'model' and 'scaler' files and the '.py' file should be in the same folder.

MySQL for data storage

Save the predictions in the database (MySQL)

It is always good to save the data and predictions in a centralised database. So create a database in MySQL, and create a table with all the fields available in your predicted data frame, i.e. 'df_new_obs'.

Import the 'pymysql' library to make a connection between the .ipynb notebook and MySQL.

Set up the connection with a user name and password, and insert the predicted output values into the database.
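A rough sketch of the insert step with pymysql. The host, user, password, database name, table name and the three example columns are all placeholders, not values from the original project; `save_predictions` is only defined here, not called, because it needs a running MySQL server.

```python
def build_insert(table, columns):
    """Return a parameterised INSERT statement for the given columns."""
    col_list = ", ".join(f"`{c}`" for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO `{table}` ({col_list}) VALUES ({placeholders})"


def save_predictions(rows, columns, table="predicted_outputs"):
    """Insert prediction rows into MySQL (connection details are placeholders)."""
    import pymysql  # imported here so build_insert works without it

    conn = pymysql.connect(host="localhost", user="your_user",
                           password="your_password",
                           database="predicted_outputs")
    try:
        with conn.cursor() as cursor:
            cursor.executemany(build_insert(table, columns), rows)
        conn.commit()  # make the inserted rows permanent
    finally:
        conn.close()


# With a pandas data frame you could use:
#   columns = list(df_new_obs.columns)
#   rows = df_new_obs.values.tolist()
columns = ["Age", "Probability", "Prediction"]  # hypothetical subset
rows = [(33, 0.81, 1), (28, 0.12, 0)]

sql = build_insert("predicted_outputs", columns)
print(sql)
```

Passing the values separately from the statement, rather than formatting them into the SQL string, lets the driver handle quoting.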

Tableau for data visualization

Connect the database with Tableau and visualize the result

Tableau is a strong tool for visualising data. So in our case we will connect our database with Tableau, visualise our result and present it to the business.

To connect Tableau with MySQL, we need to take the following steps.

  • Open the Tableau Desktop application.
  • Under connect, choose MySQL as the data source.
  • Enter your database address, username and password.
  • Select the database.
  • Drag the table and visualize your data.

Logistic Regression – Theory

These days, in analytics interviews, most interviewers ask questions about two algorithms: logistic regression and linear regression. Is there a reason behind this?

Yes: these algorithms are very easy to interpret. I believe you should have an in-depth understanding of them.

In this article we will learn about logistic regression in detail. So let's dive deep into logistic regression.

What is Logistic Regression?

Logistic regression is a classification technique which helps to predict the probability of an outcome that can only have two values. Logistic Regression is used when the dependent variable (target) is categorical.

Types of logistic regression:

  • Binary (pass/fail or 0/1)
  • Multinomial (cats, dogs, sheep)
  • Ordinal (low, medium, high)

A linear regression produces a straight line whose values are unbounded. A logistic regression, on the other hand, produces a logistic curve, which is limited to values between 0 and 1. Logistic regression is similar to linear regression, but the curve is constructed using the natural logarithm of the "odds" of the target variable, rather than the probability.

What is the sigmoid function?

To map predicted values to probabilities, we use the sigmoid function. The function maps any real value into another value between 0 and 1.

$$S(z) = \frac{1}{1 + e^{-z}}$$

Where:

  • S(z) = output between 0 and 1 (probability estimate)
  • z = input to the function (your algorithm's prediction, e.g. b0 + b1*x)
  • e = base of the natural logarithm
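The sigmoid can be checked numerically with a few lines of Python. The coefficients `b0` and `b1` below are hypothetical values used only to produce some inputs z.

```python
import math


def sigmoid(z):
    """Map any real number z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


b0, b1 = -1.0, 0.5  # hypothetical intercept and slope

for x in (-10, 0, 2, 10):
    z = b0 + b1 * x  # the linear prediction
    print(x, round(sigmoid(z), 4))
```

Large negative z values give outputs close to 0, large positive z values give outputs close to 1, and z = 0 gives exactly 0.5.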

Graph

In linear regression, we use the ordinary least squares (OLS) method to determine the best coefficients and attain a good model fit. In logistic regression, we use the maximum likelihood method instead.

How does the maximum likelihood method work?

For a binary classification (1/0), maximum likelihood tries to find the values of b0 and b1 such that the resulting probabilities are as close as possible to 1 for the observations whose actual class is 1, and as close as possible to 0 for those whose actual class is 0.
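As an illustration, the quantity being maximised can be computed by hand. The sketch below compares the log-likelihood of two hypothetical coefficient pairs on a small made-up data set; it does not perform the optimisation itself.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1, xs, ys):
    """Sum of log P(y_i | x_i) under the logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)  # predicted probability of class 1
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


# Hypothetical data: larger x tends to go with class 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

ll_flat = log_likelihood(0.0, 0.0, xs, ys)  # ignores x, always predicts 0.5
ll_good = log_likelihood(0.0, 1.5, xs, ys)  # slope follows the data

print(ll_flat, ll_good)
```

The pair with the higher (less negative) log-likelihood assigns probabilities closer to the actual classes, so it is the one maximum likelihood prefers.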

Logistic regression assumptions:

I found a very good consolidated list of assumptions on the Towards Data Science website, which I am reproducing here.

  • Binary logistic regression requires the dependent variable to be binary.
  • For a binary regression, the factor level 1 of the dependent variable should represent the desired outcome.
  • Only meaningful variables should be included.
  • The independent variables should be independent of each other. That is, the model should have little or no multicollinearity.
  • The independent variables are linearly related to the log of odds.
  • Logistic regression requires quite large sample sizes.

Performance evaluation methods for logistic regression

Akaike Information Criterion (AIC):

We can say AIC works as a counterpart of adjusted R-squared in multiple regression. The rule of thumb for AIC is: the smaller, the better. AIC penalizes an increasing number of coefficients in the model. In other words, adding more variables lowers the AIC only if they improve the fit enough to outweigh the penalty. This helps to avoid overfitting.

Measuring the AIC of a single model is not fruitful. To use AIC correctly, build 2-3 logistic models and compare their AIC values. The model with the lowest AIC is relatively better.
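A small sketch of such a comparison, using the standard definition AIC = 2k - 2 ln(L), where k is the number of estimated parameters and ln(L) is the maximised log-likelihood. The three candidate models and their log-likelihoods are invented for illustration.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical fitted models:
# (name, maximised log-likelihood, number of parameters incl. intercept)
candidates = [
    ("3 features", -120.4, 4),
    ("5 features", -112.9, 6),
    ("9 features", -112.1, 10),
]

scores = {name: aic(ll, k) for name, ll, k in candidates}
best = min(scores, key=scores.get)  # lowest AIC is relatively better

for name, value in scores.items():
    print(name, round(value, 1))
print("best:", best)
```

In this made-up case the 9-feature model fits slightly better than the 5-feature one, but not by enough to pay for its four extra parameters.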

Null Deviance and Residual Deviance:

  • Null deviance is calculated from the model with no features, i.e. only the intercept. The null model predicts the class via a constant probability.
  • Residual deviance is calculated from the model having all the features. For both null and residual deviance, the lower the value, the better the model. A residual deviance well below the null deviance indicates that the features improve the fit.

Confusion Matrix:

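It is nothing but a tabular representation of actual vs predicted values. For a binary target it has four cells: true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). It helps us to find the accuracy of the model, and comparing the matrix on train and test data helps to spot overfitting.

From these counts we can calculate the accuracy: (TP + TN) / (TP + TN + FP + FN).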

True Positive Rate (TPR):

It shows how many positive values, out of all the positive values, have been correctly predicted.

The formula to calculate the true positive rate is TP / (TP + FN). Equivalently, TPR = 1 - False Negative Rate. It is also known as Sensitivity or Recall.

False Positive Rate (FPR):

It shows how many negative values, out of all the negative values, have been incorrectly predicted.

The formula to calculate the false positive rate is FP / (FP + TN). Also, FPR = 1 - True Negative Rate.

True Negative Rate (TNR):

It represents how many negative values, out of all the negative values, have been correctly predicted. The formula to calculate the true negative rate is TN / (TN + FP). It is also known as Specificity.

False Negative Rate (FNR):

It indicates how many positive values, out of all the positive values, have been incorrectly predicted. The formula to calculate the false negative rate is FN / (FN + TP).

Precision:

It indicates how many values, out of all the predicted positive values, are actually positive. The formula is TP / (TP + FP).

F Score:

The F score is the harmonic mean of precision and recall. It lies between 0 and 1. The higher the value, the better the model. The formula is 2 * (precision * recall) / (precision + recall).
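The rates defined in this section can all be computed from the four confusion-matrix counts. The counts used below are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the rates from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)        # sensitivity / recall
    fpr = fp / (fp + tn)
    tnr = tn / (tn + fp)        # specificity
    fnr = fn / (fn + tp)
    precision = tp / (tp + fp)
    f_score = 2 * (precision * tpr) / (precision + tpr)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"TPR": tpr, "FPR": fpr, "TNR": tnr, "FNR": fnr,
            "precision": precision, "F score": f_score,
            "accuracy": accuracy}


# Hypothetical counts for 100 test observations.
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)

for name, value in m.items():
    print(name, round(value, 4))
```

Note how TPR + FNR and FPR + TNR each add up to 1, as stated in the definitions.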

Receiver Operating Characteristic (ROC):

The ROC curve is used to assess the performance of a classification model across all classification thresholds. The performance is summarised by the Area Under the Curve (AUC): the higher the area, the better the model. The ROC curve plots the True Positive Rate (Y axis) against the False Positive Rate (X axis).

In the graph below, the yellow line represents the ROC curve at a threshold of 0.5. At this point, sensitivity = specificity.
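To show where the curve comes from, the sketch below sweeps the threshold over a few hypothetical predicted probabilities, collects the (FPR, TPR) points and computes the area with the trapezoid rule. It assumes both classes are present; in practice a library routine would normally be used.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold step by step."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical actual classes and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
print(points)
print("AUC:", auc(points))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.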

Decision Tree – Theory

A decision tree is a very simple yet powerful algorithm for classification and regression. As the name suggests, it has a tree-like structure. It is a non-parametric technique. A decision tree typically starts with a single node, which branches into possible outcomes. Each of those outcomes leads to additional nodes, which branch off into other possibilities. This gives it a tree-like shape.

A decision tree can be explained using the binary tree below. Suppose you want to predict whether a person is fit from information such as age, eating habits and physical activity. The decision nodes here are questions like 'What's the age?', 'Does he exercise?', 'Does he eat a lot of pizzas?'. The leaves are outcomes: either 'fit' or 'unfit'. In this case it is a binary classification problem (a yes or no type of problem).

There are two main types of Decision Trees:

Classification trees (Yes/No types)

What we’ve seen above is an example of classification tree, where the outcome was a variable like ‘fit’ or ‘unfit’. Here the decision variable is categorical.

Regression trees (Continuous data types)

Here the decision or outcome variable is continuous, e.g. a number like 123.

Image source google.com

The top-most item, in this example "Age < 30 ?", is called the root. It's where everything starts from. Branches are what we call each line. The nodes at the ends of the branches, which are not split any further, are the leaves.

A general algorithm for a decision tree can be described as follows:

  1. Pick the best attribute/feature. The best attribute is one which best splits or separates the data.
  2. Ask the relevant question.
  3. Follow the answer path.
  4. Go to step 1 until you arrive at the answer.

Terms used with Decision Trees:

  1. Root Node – It represents the entire population or sample, which is further divided into two or more homogeneous sets.
  2. Splitting – The process of dividing a node into two or more sub-nodes.
  3. Decision Node – A sub-node that is divided into further sub-nodes is called a decision node.
  4. Leaf/Terminal Node – A node which does not split further is called a leaf node.
  5. Pruning – When we remove the sub-nodes of a decision node, the process is called pruning.
  6. Branch/Sub-tree – A sub-section of the entire tree is called a branch or sub-tree.
  7. Parent and child node – A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.

Let's understand the above terms with the image below.

Image Source google.com

Types of Decision Trees

  1. Categorical Variable Decision Tree – A decision tree which has a categorical target variable is called a categorical variable decision tree.
  2. Continuous Variable Decision Tree – A decision tree which has a continuous target variable is called a continuous variable decision tree.

Advantages of Decision Tree

  1. Easy to understand – The algorithm is very easy to understand, even for people from a non-analytical background. A person without statistical knowledge can interpret the result.
  2. Useful in data exploration – It is one of the fastest ways to identify the most significant variables and the relations between variables. It helps us to identify the variables which have the most power to predict the target variable.
  3. A decision tree does not require much data preparation effort from the user.
  4. The algorithm is, to an extent, not affected by outliers or missing values, so it requires less data cleaning effort compared to other models.
  5. The model can handle both numerical and categorical variables.
  6. Few hyper-parameters have to be set to get a working model, although limits such as maximum depth and minimum node size usually need tuning to control overfitting.

Disadvantages of Decision Tree

  1. Overfitting – It is the most common problem with decision trees. Overfitting is a phenomenon where the model creates a complex tree that does not generalize well to new data. It is addressed by setting constraints on the model parameters and by pruning.
  2. Small variations in the data can result in a completely different tree, which makes the model unstable. This problem is called variance, and it can be lowered by methods like bagging and boosting.
  3. If some class dominates the data, the decision tree learner can create a biased tree. So it is recommended to balance the data set prior to fitting the decision tree.
  4. Calculations can become complex when there are many class labels.

Decision Tree Flowchart

Image Source google.com

How does a tree decide where to split?

In a decision tree, the way splits are made affects the accuracy of the model. The decision criteria are different for classification and regression trees. A decision tree tries splits on all available variables and then selects the split which results in the most homogeneous sub-nodes.

The choice of algorithm also depends on the type of target variable. Two commonly used algorithms for decision trees are:

  1. CHAID – Chi-square Automatic Interaction Detector
  2. CART – Classification and Regression Trees

Let's discuss both methods in detail.

CHAID – Chi-square Automatic Interaction Detector

It is an algorithm that measures the statistical significance of the differences between the sub-nodes and the parent node. It works with a categorical target variable such as "yes" or "no".

The algorithm follows these steps:

  1. Iterate over all available x variables.
    1. Check whether the variable is numeric.
    2. If the variable is numeric, make it categorical by binning it into deciles or percentiles.
    3. Figure out all possible cuts.
    4. For each possible cut, perform a chi-square test and store the p-value.
    5. Choose the cut which gives the lowest p-value.
  2. Cut the data using the variable and the cut which give the lowest p-value.
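The chi-square test for a single candidate cut can be sketched as follows. This is only the test in steps 4 and 5 for a binary split and a yes/no target, not a full CHAID implementation; the candidate cuts and their counts are hypothetical, no continuity correction is applied, and the p-value shortcut is valid only for one degree of freedom (a 2x2 table).

```python
import math


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic and p-value for the table [[a, b], [c, d]].

    Rows are the two child nodes, columns are the 'yes' and 'no' counts.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical cuts: (yes, no) counts in the left child, then in the right child.
cuts = {"age < 30": (30, 10, 15, 45), "age < 45": (25, 25, 20, 30)}

results = {name: chi_square_2x2(*counts) for name, counts in cuts.items()}
best = min(results, key=lambda name: results[name][1])  # lowest p-value

for name, (stat, p) in results.items():
    print(name, round(stat, 3), p)
print("best cut:", best)
```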

CART – Classification and regression trees

There are basically two split criteria for this algorithm.

Gini index:

It says that if we select two items from a population at random, they must be of the same class, and the probability of this is 1 if the population is pure. It works with a categorical target variable such as "Success" or "Failure".

$$\text{Gini} = 1 - p^2 - (1 - p)^2$$

Here p is the proportion of "Success" in the node. Written this way, Gini measures impurity: it is 0 for a pure node and reaches its maximum of 0.5 when p = 0.5.

Gain = Gini of the parent node – weighted average of the Gini of the child nodes (weights are proportional to the population of each child node)

Steps to calculate Gini for a split

  1. Iterate over all available x variables.
    1. Check whether the variable is numeric.
    2. If the variable is numeric, make it categorical by binning it into deciles or percentiles.
    3. Figure out all possible cuts.
    4. Calculate the gain for each split.
    5. Choose the cut which gives the highest gain.
  2. Cut the data using the variable and the cut which give the maximum gain.
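A minimal sketch of the gain calculation for a binary target, using the Gini formula given above. The parent node and the two candidate splits are hypothetical counts.

```python
def gini(p):
    """Gini of a node where p is the proportion of 'Success'."""
    return 1 - p ** 2 - (1 - p) ** 2


def gini_gain(parent, children):
    """Gain of a split; parent and each child are (successes, total) pairs."""
    n = parent[1]
    weighted = sum(total / n * gini(successes / total)
                   for successes, total in children)
    return gini(parent[0] / parent[1]) - weighted


parent = (50, 100)              # 50 successes out of 100 rows
split_a = [(40, 50), (10, 50)]  # separates the classes fairly well
split_b = [(30, 60), (20, 40)]  # both children are still 50/50

gain_a = gini_gain(parent, split_a)
gain_b = gini_gain(parent, split_b)
print(gain_a, gain_b)
```

Split A would be chosen because it gives the higher gain; split B leaves both children as mixed as the parent, so its gain is zero.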

Entropy Tree:

To understand an entropy tree, we first need to understand what entropy is.

Entropy – Entropy measures the level of impurity in a group of examples. If the sample is completely homogeneous, the entropy is zero, and if the sample is equally divided (50%/50%), the entropy is one.

$$\text{Entropy} = -p \log_2 p - q \log_2 q$$

Here p and q are the probabilities of success and failure, respectively, in that node. Entropy is also used with a categorical target variable. The algorithm chooses the split which has the lowest entropy compared to the parent node and the other splits. The lower the entropy, the better.

Gain = Entropy of the parent node – weighted average of the entropy of the child nodes (weights are proportional to the population of each child node)

Steps to calculate entropy for a split

  1. Iterate over all available x variables.
    1. Check whether the variable is numeric.
    2. If the variable is numeric, make it categorical by binning it into deciles or percentiles.
    3. Figure out all possible cuts.
    4. Calculate the gain for each split.
    5. Choose the cut which gives the highest gain.
  2. Cut the data using the variable and the cut which give the maximum gain.
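The same comparison with entropy as the impurity measure, again on hypothetical counts. By convention, a pure node (p equal to 0 or 1) is given an entropy of 0.

```python
import math


def entropy(p):
    """Entropy of a node where p is the probability of success."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has no impurity
    q = 1 - p
    return -p * math.log2(p) - q * math.log2(q)


def information_gain(parent, children):
    """Gain of a split; parent and each child are (successes, total) pairs."""
    n = parent[1]
    weighted = sum(total / n * entropy(successes / total)
                   for successes, total in children)
    return entropy(parent[0] / parent[1]) - weighted


parent = (50, 100)
split_a = [(40, 50), (10, 50)]
split_b = [(30, 60), (20, 40)]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(gain_a, gain_b)
```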

Decision Tree Regression

As discussed above, a decision tree can also solve regression problems. Let's see what the steps are.

The algorithm involves the following steps.

  1. Iterate over all available x variables.
    1. Check whether the variable is numeric.
    2. If the variable is numeric, make it categorical by binning it into deciles or percentiles.
    3. Figure out all possible cuts.
    4. For each cut, calculate the MSE.
    5. Choose the cut and the variable which give the minimum MSE.
  2. Cut the data using the variable and the cut which give the minimum MSE.
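A sketch of the cut search for a single numeric variable. As a simplification of the steps above, it tries a cut between every pair of adjacent distinct x values instead of binning into deciles first; the data is hypothetical.

```python
def mse(values):
    """Mean squared error of predicting the mean of the values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_cut(xs, ys):
    """Return (cut, weighted MSE) for the cut with the lowest weighted MSE."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut is possible between equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * mse(left) + len(right) * mse(right)) / n
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbours
        if best is None or weighted < best[1]:
            best = (cut, weighted)
    return best


# Hypothetical data with an obvious jump between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 5.5, 20.0, 21.0, 19.0]

cut, err = best_cut(xs, ys)
print(cut, err)
```

Each side of the chosen cut then predicts the mean of its own y values, and the same search is repeated inside each side.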

Stopping Criteria of Decision Tree

  1. Pure node – If the tree finds a pure node, that particular leaf stops growing.
  2. User-defined maximum depth.
  3. Minimum number of observations in a node for it to be split.
  4. Minimum number of observations in a leaf.
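In scikit-learn, the last three criteria map onto constructor parameters of DecisionTreeClassifier, while pure nodes stop growing on their own. The sketch below uses synthetic data, and the specific limits (3, 20, 10) are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): 300 rows, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

tree = DecisionTreeClassifier(
    max_depth=3,           # user-defined depth
    min_samples_split=20,  # minimum observations in a node to split it
    min_samples_leaf=10,   # minimum observations in each leaf
    random_state=0,
)
tree.fit(X, y)

print("depth: ", tree.get_depth())
print("leaves:", tree.get_n_leaves())
```

Tightening these limits produces a smaller tree, which is one way to apply the constraints mentioned under overfitting.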