Category Machine learning projects

Logistic regression to predict absenteeism: approach

Business Problem:

In today's environment, high competitiveness increases the pressure on employees. It leads to unachievable goals, which can cause health issues for employees, and health issues in turn lead to absenteeism.

Given a dataset, an organisation is trying to predict employee absenteeism.

What is absenteeism in the business context?

Absence from work during normal working hours, resulting in temporary incapacity to execute regular working activity.

Purpose Of Model:

Explore whether a person presenting certain characteristics is expected to be away from work at some point in time or not.

Dataset:

I downloaded a dataset from Kaggle called ‘Absenteeism_data.csv’, which contains the following information.

  • Reason_1 – A type of reason for absence.
  • Reason_2 – A type of reason for absence.
  • Reason_3 – A type of reason for absence.
  • Reason_4 – A type of reason for absence.
  • Month Value – Month in which the employee was absent.
  • Day of the Week – Day of the week on which the employee was absent.
  • Transportation Expense – Expense in dollars.
  • Distance to Work – Distance to the workplace in km.
  • Age – Age of the employee.
  • Daily Work Load Average – Average amount of time spent working per day, in minutes.
  • Body Mass Index – Body mass index of the employee.
  • Education – Education category (1 – high school education, 2 – graduate, 3 – postgraduate, 4 – master or doctor).
  • Children – Number of children the employee has.
  • Pet – Whether the employee has a pet or not.
  • Absenteeism Time in Hours – How many hours the employee has been absent.

The following are the main actions we will take in this project.

  1. Build the model in Python.
  2. Save the results in MySQL.
  3. Visualise the end result in Tableau.

Python for model building:

We are going to take the following steps to predict absenteeism:

Load the data

Import ‘Absenteeism_data.csv’ with the help of pandas.
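The loading step can be sketched as follows. This is a minimal illustration, not the project's actual notebook: the three-row excerpt and its values are made up so the snippet runs without the real file, and only a few of the columns listed above are included.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt standing in for 'Absenteeism_data.csv';
# the values are invented for illustration only.
sample_csv = io.StringIO(
    "ID,Age,Distance to Work,Absenteeism Time in Hours\n"
    "11,33,36,4\n"
    "36,50,13,0\n"
    "3,38,51,2\n"
)

# With the real file this would be: raw_data = pd.read_csv('Absenteeism_data.csv')
raw_data = pd.read_csv(sample_csv)

# Work on a copy so the raw data stays untouched during pre-processing.
df = raw_data.copy()
print(df.shape)
print(df.dtypes)
```

Keeping an untouched copy of the raw data makes it easy to restart the pre-processing if a later step goes wrong.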

Identify the dependent variable, i.e. identify Y:

We need categories: we must find a way to say whether someone is ‘being absent too much’ or not. We decided to take the median of ‘Absenteeism Time in Hours’ as the cut-off line. This way the dataset will be balanced (there will be a roughly equal number of 0s and 1s for the logistic regression). Class imbalance is a common problem in ML, so this works well for us. Alternatively, if we had more data, we could have found other ways to deal with the issue; for instance, we could have assigned some arbitrary value as the cut-off line instead of the median.

Note that, with the cut-off at 3 hours, the line below assigns 1 to anyone who has been absent for 4 hours or more (more than 3 hours), which is the equivalent of taking half a day off. The initial code from the lecture is: targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > 3, 1, 0)
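A small sketch of the median cut-off, using made-up hours rather than the real column. The values are chosen only so that the median comes out at 3; `share_of_ones` is a hypothetical helper name used to check the balance.

```python
import numpy as np
import pandas as pd

# Invented absenteeism hours; the real values come from the dataset.
data_preprocessed = pd.DataFrame(
    {"Absenteeism Time in Hours": [0, 1, 2, 2, 3, 3, 4, 8, 8, 16, 40]}
)

hours = data_preprocessed["Absenteeism Time in Hours"]
cutoff = hours.median()

# 1 = absent more than the median number of hours, 0 = otherwise.
targets = np.where(hours > cutoff, 1, 0)

# Fraction of 1s; a value near 0.5 means the classes are roughly balanced.
share_of_ones = targets.sum() / targets.shape[0]
print(cutoff, share_of_ones)
```

Computing the cut-off from the data instead of hard-coding 3 keeps the targets balanced if the dataset changes.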

Choose the algorithm to develop the model:

As our Y (dependent variable) is 1 or 0, i.e. excessively absent or not, we are going to use logistic regression for our analysis.

Select Input for the regression:

We have to select all our X variables, i.e. all the independent variables that we will use in the regression analysis.

Data Pre-processing:

Remove or treat missing values

In our case there are no missing values, so we don’t have to worry about them. There are, however, some columns that add no value to our analysis, such as ID, which is unique in every case, so we will remove it.
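The two checks in this step can be sketched as below. The frame is a made-up stand-in for the real data; only the column names ‘ID’, ‘Age’ and ‘Body Mass Index’ come from the description above.

```python
import pandas as pd

# Invented rows; only the column names follow the dataset description.
df = pd.DataFrame(
    {"ID": [11, 36, 3], "Age": [33, 50, 38], "Body Mass Index": [30, 31, 31]}
)

# Count missing values per column; all zeros means nothing to treat.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# ID only identifies the row, so it carries no predictive information.
df_no_id = df.drop(columns=["ID"])
print(df_no_id.columns.tolist())
```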

Remove outliers

In our case there are no outliers, so we don’t have to worry. In general, if you have outliers, you can take the log of your x variable to reduce their influence.
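As an illustration of the log idea, here is a sketch on invented numbers. It uses `np.log1p` (log of 1 + x), which is an assumption made here so that zero values do not break the transform.

```python
import numpy as np

# Invented feature values with one extreme observation.
x = np.array([10.0, 12.0, 11.0, 13.0, 1000.0])

# log1p compresses large values much more strongly than small ones.
log_x = np.log1p(x)

# How far the largest value sits from the typical value, before and after.
ratio_raw = x.max() / np.median(x)
ratio_log = log_x.max() / np.median(log_x)
print(round(ratio_raw, 1), round(ratio_log, 1))
```

The extreme value is still in the data after the transform; it simply pulls on the model far less.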

Standardize the data

Standardization is one of the most common pre-processing tools. Since data of different magnitudes (scales) can bias a model towards high values, we want all inputs to be of similar magnitude. This is a peculiarity of machine learning in general: most (but not all) algorithms do badly with unscaled data. A very useful class we can use is StandardScaler. It has many more capabilities than the straightforward ‘preprocessing’ method. We will create a variable that will contain the scaling information for this particular dataset.

Here’s the full documentation:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
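A minimal sketch of the scaler variable mentioned above. The two columns and their values are chosen purely for illustration, and `absenteeism_scaler` is a hypothetical name.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented values for two of the numeric inputs.
inputs = pd.DataFrame(
    {"Age": [33, 50, 38, 39, 28], "Distance to Work": [36, 13, 51, 5, 36]}
)

# The scaler object stores the mean and standard deviation of each column.
absenteeism_scaler = StandardScaler()
absenteeism_scaler.fit(inputs)

# transform() subtracts the stored mean and divides by the stored std.
scaled_inputs = absenteeism_scaler.transform(inputs)
print(absenteeism_scaler.mean_)
print(scaled_inputs.mean(axis=0), scaled_inputs.std(axis=0))
```

Because the scaling information lives in the fitted object, the same object can later be applied to new observations so that they are scaled exactly like the training data.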

Choose the columns to scale

In this section we choose which variables need to be scaled. The columns [‘Reason_1’, ‘Reason_2’, ‘Reason_3’, ‘Reason_4’, ‘Education’, ‘Pet’ and ‘Children’] contain categorical data in numerical form, so they need separate treatment: standardizing them would make their coefficients hard to interpret, and such columns are usually left out of the scaling step.

What about the other columns?

‘Month Value’, ‘Day of the Week’, ‘Transportation Expense’, ‘Distance to Work’, ‘Age’, ‘Daily Work Load Average’, ‘Body Mass Index’. These are numerical values, so they are the columns we pass to the scaler, and we keep them in our analysis.

Note:

You may ask why we are doing the analysis manually, column by column.

Because it is always good to analyse the data feature by feature: it gives us confidence in our model and makes the model easier to interpret.

Split Data into train and test

Divide the data into train and test sets and build the model on the train set.
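The split can be sketched with scikit-learn's `train_test_split`. The 80/20 ratio, the `random_state` value and the use of `stratify` are assumptions made for this illustration; the inputs are placeholder numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder inputs (20 rows, 2 features) and balanced 0/1 targets.
inputs = np.arange(40).reshape(20, 2)
targets = np.array([0, 1] * 10)

# stratify keeps the share of 0s and 1s the same in both parts;
# random_state makes the shuffle reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    inputs, targets, train_size=0.8, random_state=20, stratify=targets
)
print(x_train.shape, x_test.shape)
```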

Apply Algorithm

As per our scenario, we are going to use logistic regression. The following steps will take place.

Train the model

First we divide the data into train and test sets. We build our model on the train set.

Test the model

Once we have successfully developed our model, we need to test it on new data, i.e. the test set.
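Training and testing can be sketched as follows. The data is synthetic (random features with a target built from them), so the accuracy figures say nothing about the real absenteeism model; the split ratio and seeds are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled inputs and the 0/1 targets.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = (x[:, 0] + 0.5 * x[:, 1] > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, random_state=20
)

# Fit on the train set only.
reg = LogisticRegression()
reg.fit(x_train, y_train)

# score() returns accuracy: the share of correctly predicted targets.
train_accuracy = reg.score(x_train, y_train)
test_accuracy = reg.score(x_test, y_test)
print(train_accuracy, test_accuracy)
```

A test accuracy far below the train accuracy would suggest overfitting, which is the reason the test rows are held back during fitting.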

Find the intercept and coefficients

Extract the intercept and the coefficients (the beta values) from the model.

Interpreting the coefficients

Find out which features add the most value to the prediction of Y.
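One common way to read the coefficients is to put them in a table and exponentiate them into odds ratios. The sketch below uses synthetic data and only three feature names borrowed from the dataset; `summary_table` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic, already-scaled inputs; names are a hypothetical subset.
rng = np.random.default_rng(0)
feature_names = ["Reason_1", "Age", "Pet"]
x = rng.normal(size=(300, 3))
noise = rng.normal(scale=0.5, size=300)
y = (2.0 * x[:, 0] - 1.0 * x[:, 2] + noise > 0).astype(int)

reg = LogisticRegression().fit(x, y)

# One row per term: the intercept first, then one row per feature.
summary_table = pd.DataFrame(
    {
        "Feature": ["Intercept"] + feature_names,
        "Coefficient": np.append(reg.intercept_, reg.coef_[0]),
    }
)

# exp(coefficient) is the multiplicative change in the odds of Y = 1
# for a one-unit increase in the feature.
summary_table["Odds ratio"] = np.exp(summary_table["Coefficient"])
summary_table = summary_table.sort_values("Odds ratio", ascending=False)
print(summary_table)
```

An odds ratio close to 1 (a coefficient close to 0) means the feature barely moves the prediction; values far above or below 1 mark the influential features.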

Save the model

We need to save the model we have prepared so far. To do that, we pickle the model.

Two pickle files will be saved in your Python working directory: one called ‘model’ and the other ‘scaler’.

To reuse your .ipynb notebook as a module, save it as a .py file.

Check model performance on a totally new dataset with the same features.

Now we have a totally new dataset which has the same features as the previous dataset but contains different values.

Note – To do that, the ‘model’ and ‘scaler’ files and the ‘.py’ file should be in the same folder.
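The save-and-reload round trip can be sketched as below. The data is synthetic, and a temporary folder is used here instead of the working directory so the example leaves nothing behind; the file names ‘model’ and ‘scaler’ follow the text.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic training data standing in for the absenteeism inputs.
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=(100, 2))
y = (x[:, 0] > 5.0).astype(int)

scaler = StandardScaler().fit(x)
model = LogisticRegression().fit(scaler.transform(x), y)

# Save both objects: new data must be scaled exactly like the training data.
folder = tempfile.mkdtemp()
for name, obj in (("model", model), ("scaler", scaler)):
    with open(os.path.join(folder, name), "wb") as file:
        pickle.dump(obj, file)

# Later (for example inside the .py module): load both and predict.
with open(os.path.join(folder, "model"), "rb") as file:
    loaded_model = pickle.load(file)
with open(os.path.join(folder, "scaler"), "rb") as file:
    loaded_scaler = pickle.load(file)

new_obs = rng.normal(loc=5.0, scale=2.0, size=(5, 2))
predictions = loaded_model.predict(loaded_scaler.transform(new_obs))
print(predictions)
```

Pickle files should only be loaded from a trusted source, and ideally with the same scikit-learn version that created them.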

MySQL for data storage

Save the predictions in a database (MySQL)

It is always good to save data and predictions in a centralised database. So create a database in MySQL and create a table with all the fields available in your predicted data frame, i.e. ‘df_new_obs’.

Import the ‘pymysql’ library to make the connection between the .ipynb notebook and MySQL.

Set up the connection with a user name and password and insert the predicted output values into the database.
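The insert step might look like the sketch below. The table name, column names, host and credentials are all placeholders, and `build_insert_statement`, `save_predictions` and `connect_and_save` are hypothetical helpers, not part of PyMySQL. The `pymysql` import sits inside the last function so the helpers can be read and tried without a database.

```python
def build_insert_statement(table_name, columns):
    """Build a parameterised INSERT using PyMySQL's %s placeholders."""
    column_list = ", ".join(f"`{column}`" for column in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO `{table_name}` ({column_list}) VALUES ({placeholders})"


def save_predictions(connection, table_name, columns, rows):
    """Insert all rows in one batch and commit the transaction."""
    sql = build_insert_statement(table_name, columns)
    with connection.cursor() as cursor:
        cursor.executemany(sql, rows)
    connection.commit()


def connect_and_save(rows):
    """Open a connection with placeholder credentials and store the rows."""
    import pymysql  # imported here so the helpers above work without it

    connection = pymysql.connect(
        host="localhost",             # placeholder
        user="your_user",             # placeholder
        password="your_password",     # placeholder
        database="predicted_outputs", # placeholder database name
    )
    try:
        save_predictions(
            connection,
            "predicted_outputs",      # placeholder table name
            ["Age", "Probability", "Prediction"],
            rows,
        )
    finally:
        connection.close()


print(build_insert_statement("predicted_outputs", ["Age", "Probability", "Prediction"]))
```

The rows are expected to be tuples or lists of plain Python values; for a data frame such as ‘df_new_obs’, `df_new_obs.values.tolist()` is one way to obtain them. Passing the values as parameters, rather than formatting them into the SQL string, lets the driver handle quoting.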

Tableau for Data visualization

Connect the database to Tableau and visualize the result

As we know, Tableau is a strong tool for visualising data. So in our case we will connect our database to Tableau, visualise our results and present them to the business.

To connect Tableau to MySQL, we need to take the following steps.

  • Open the Tableau Desktop application.
  • Under Connect, choose MySQL as the data source.
  • Enter your database address, username and password.
  • Select the database.
  • Drag the table and visualize your data.

Banking Credit Card Spend Prediction and Identify Drivers for Spends

Business Problem:

One of the global banks would like to understand what factors drive credit card spend. The bank wants to use these insights to calculate credit limits. In order to solve the problem, the bank conducted a survey of 5000 customers and collected data.

The objective of this case study is to understand what’s driving the total spend (primary card + secondary card). Given the factors, predict the credit limit for new applicants.

Data Availability:

  • Data for the case is available in xlsx format.
  • The data has been provided for 5000 customers.
  • A detailed data dictionary has been provided for understanding the data.
  • The data is encoded in numerical format to reduce its size; however, some of the variables are categorical. You can find the details in the data dictionary.

Let’s develop a machine learning model for further analysis.

Store Sales Prediction – Forecasting

Business Context:

The objective is to predict store sales using historical markdown data. One challenge of modelling retail data is the need to make decisions based on limited history. If Christmas comes but once a year, so does the chance to see how strategic decisions impacted the bottom line.

Business Problem:

You are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department-wide sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modelling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
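To make the holiday weighting concrete, here is a sketch of a weighted mean absolute error. The text only states that holiday weeks count five times as much; the choice of absolute error as the underlying metric is an assumption made here, and the sales figures are invented.

```python
def weighted_mean_absolute_error(actual, predicted, is_holiday, holiday_weight=5.0):
    """Mean absolute error where holiday weeks count holiday_weight times."""
    weights = [holiday_weight if holiday else 1.0 for holiday in is_holiday]
    weighted_errors = sum(
        weight * abs(a - p) for weight, a, p in zip(weights, actual, predicted)
    )
    return weighted_errors / sum(weights)


# Invented weekly sales: two normal weeks and one holiday week.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 250.0]
is_holiday = [False, False, True]

wmae = weighted_mean_absolute_error(actual, predicted, is_holiday)
plain_mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(round(wmae, 2), round(plain_mae, 2))
```

Here the errors are 10, 10 and 50, so the weighted score is (10 + 10 + 5 × 50) / (1 + 1 + 5) = 270 / 7, noticeably higher than the unweighted 70 / 3 because the large miss falls on the holiday week.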

Data Availability:

stores.csv: This file contains anonymized information about the 45 stores, indicating the type and size of store.

train.csv: This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

  • Store – the store number
  • Dept – the department number
  • Date – the week
  • Weekly_Sales – sales for the given department in the given store
  • IsHoliday – whether the week is a special holiday week

test.csv: This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.

features.csv: This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

  • Store – the store number
  • Date – the week
  • Temperature – average temperature in the region
  • Fuel_Price – cost of fuel in the region
  • MarkDown1-5 – anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
  • CPI – the consumer price index
  • Unemployment – the unemployment rate
  • IsHoliday – whether the week is a special holiday week
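The files are typically combined into one modelling table by joining on the shared keys. A sketch with invented rows follows; the ‘Type’ and ‘Size’ column names for stores.csv are assumptions based on its description, and filling missing markdowns with 0 is one possible choice, not a requirement.

```python
import pandas as pd

# Invented single-week rows mimicking the three files.
train = pd.DataFrame(
    {
        "Store": [1, 1],
        "Dept": [1, 2],
        "Date": ["2010-02-05", "2010-02-05"],
        "Weekly_Sales": [24000.0, 50000.0],
        "IsHoliday": [False, False],
    }
)
features = pd.DataFrame(
    {
        "Store": [1],
        "Date": ["2010-02-05"],
        "Temperature": [42.3],
        "MarkDown1": [float("nan")],  # NA: no markdown data for this week
        "IsHoliday": [False],
    }
)
stores = pd.DataFrame({"Store": [1], "Type": ["A"], "Size": [150000]})

# Left joins keep every training row even if a lookup is missing.
merged = train.merge(features, on=["Store", "Date", "IsHoliday"], how="left")
merged = merged.merge(stores, on="Store", how="left")
merged["Date"] = pd.to_datetime(merged["Date"])

# Treat a missing markdown as "no markdown" (an assumption).
markdown_cols = [col for col in merged.columns if col.startswith("MarkDown")]
merged[markdown_cols] = merged[markdown_cols].fillna(0)
print(merged)
```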

Credit Card Segmentation

Data Available:

  • CC GENERAL.csv

Business Context:

A bank wants to develop a customer segmentation to define its marketing strategy. The sample dataset summarizes the usage behaviour of about 9000 active credit card holders during the last 6 months. The file is at a customer level with 18 behavioural variables.

Business Requirements:

Advanced data preparation: Build an enriched customer profile by deriving “intelligent” KPIs such as:

  • Monthly average purchase and cash advance amount
  • Purchases by type (one-off, instalments)
  • Average amount per purchase and cash advance transaction,
  • Limit usage (balance to credit limit ratio),
  • Payments to minimum payments ratio etc.
  • Advanced reporting: Use the derived KPIs to gain insight into the customer profiles.
  • Identification of the relationships/affinities between services.
  • Clustering: Apply a data reduction technique (factor analysis) for variable reduction and a clustering algorithm to reveal the behavioural segments of credit card holders.
  • Identify the characteristics of each cluster using detailed profiling.
  • Provide strategic insights and implementation strategies for the given set of cluster characteristics.
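The derived KPIs listed above can be sketched with pandas. The column names follow the data dictionary for CC GENERAL.csv, the two customers are invented, and two assumptions are made: that the totals cover TENURE months, and that PURCHASES_TRX is a count of purchase transactions.

```python
import pandas as pd

# Two invented customers; column names follow the data dictionary.
cards = pd.DataFrame(
    {
        "CUST_ID": ["C1", "C2"],
        "BALANCE": [1000.0, 500.0],
        "PURCHASES": [1200.0, 0.0],
        "PURCHASES_TRX": [12, 0],
        "CASH_ADVANCE": [0.0, 600.0],
        "CREDIT_LIMIT": [4000.0, 1000.0],
        "PAYMENTS": [900.0, 300.0],
        "MINIMUM_PAYMENTS": [300.0, 150.0],
        "TENURE": [12, 12],
    }
)

kpis = pd.DataFrame({"CUST_ID": cards["CUST_ID"]})
kpis["MONTHLY_AVG_PURCHASE"] = cards["PURCHASES"] / cards["TENURE"]
kpis["MONTHLY_AVG_CASH_ADVANCE"] = cards["CASH_ADVANCE"] / cards["TENURE"]
kpis["LIMIT_USAGE"] = cards["BALANCE"] / cards["CREDIT_LIMIT"]
kpis["PAYMENT_TO_MIN_PAYMENT"] = cards["PAYMENTS"] / cards["MINIMUM_PAYMENTS"]

# Avoid dividing by zero for customers with no purchase transactions.
transactions = cards["PURCHASES_TRX"].where(cards["PURCHASES_TRX"] > 0)
kpis["AVG_AMOUNT_PER_PURCHASE"] = (cards["PURCHASES"] / transactions).fillna(0.0)
print(kpis)
```

On real data the denominators CREDIT_LIMIT and MINIMUM_PAYMENTS may also contain zeros or missing values, so they would need the same kind of guard.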

Data Dictionary:

  • CUST_ID: Credit card holder ID
  • BALANCE: Monthly average balance (based on daily balance averages)
  • BALANCE_FREQUENCY: Ratio of last 12 months with balance
  • PURCHASES: Total purchase amount spent during last 12 months
  • ONEOFF_PURCHASES: Total amount of one-off purchases
  • INSTALLMENTS_PURCHASES: Total amount of installment purchases
  • CASH_ADVANCE: Total cash-advance amount
  • PURCHASES_FREQUENCY: Frequency of purchases (percent of months with at least one purchase)
  • ONEOFF_PURCHASES_FREQUENCY: Frequency of one-off purchases
  • PURCHASES_INSTALLMENTS_FREQUENCY: Frequency of installment purchases
  • CASH_ADVANCE_FREQUENCY: Cash-advance frequency
  • AVERAGE_PURCHASE_TRX: Average amount per purchase transaction
  • CASH_ADVANCE_TRX: Number of cash-advance transactions
  • PURCHASES_TRX: Number of purchase transactions
  • CREDIT_LIMIT: Credit limit
  • PAYMENTS: Total payments (due amount paid by the customer to decrease their statement balance) in the period
  • MINIMUM_PAYMENTS: Total minimum payments due in the period
  • PRC_FULL_PAYMENT: Percentage of months with full payment of the due statement balance
  • TENURE: Number of months as a customer
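The clustering requirement (factor analysis for variable reduction, then a clustering algorithm) can be sketched with scikit-learn. K-means is used here as one possible clustering algorithm, and the data is synthetic: three artificial groups stand in for the real KPI matrix. The numbers of factors and clusters are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Synthetic KPI matrix: 3 artificial customer groups, 5 variables each.
rng = np.random.default_rng(42)
centres = np.array([[0.0] * 5, [6.0] * 5, [-6.0] * 5])
kpi_matrix = np.vstack([centre + rng.normal(size=(50, 5)) for centre in centres])

# 1. Put all variables on the same scale.
scaled = StandardScaler().fit_transform(kpi_matrix)

# 2. Reduce the 5 variables to 2 latent factors.
factors = FactorAnalysis(n_components=2, random_state=0).fit_transform(scaled)

# 3. Cluster the customers in factor space.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = kmeans.fit_predict(factors)

# Segment sizes are the starting point for profiling each cluster.
segment_sizes = np.bincount(segments)
print(segment_sizes)
```

In practice the number of clusters is usually chosen by comparing several values, for example with the elbow method or silhouette scores, before profiling the segments.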

Network Intrusion Detection

In this case study we need to predict anomalies and attacks in the network.

Business Problem:

The task is to build a network intrusion detection system to detect anomalies and attacks in the network.

There are two problems.

  1. Binomial classification: activity is normal or attack.
  2. Multinomial classification: activity is normal, DOS, PROBE, R2L or U2R.

Data Availability:

The data is the KDDCUP’99 data set, which is widely used as one of the few publicly available data sets for network-based anomaly detection systems.

For more about the data, you can visit http://www.unb.ca/cic/datasets/nsl.html

BASIC FEATURES OF EACH NETWORK CONNECTION VECTOR

  1. Duration: Length of time duration of the connection
  2. Protocol_type: Protocol used in the connection
  3. Service: Destination network service used
  4. Flag: Status of the connection – Normal or Error
  5. Src_bytes: Number of data bytes transferred from source to destination in single connection
  6. Dst_bytes: Number of data bytes transferred from destination to source in single connection
  7. Land: 1 if source and destination IP addresses and port numbers are equal; 0 otherwise.
  8. Wrong_fragment: Total number of wrong fragments in this connection
  9. Urgent: Number of urgent packets in this connection. Urgent packets are packets with the urgent bit activated.
  10. Hot: Number of “hot” indicators in the content such as: entering a system directory, creating programs and executing programs.
  11. Num_failed_logins: Count of failed login attempts.
  12. Logged_in: Login status; 1 if successfully logged in; 0 otherwise.
  13. Num_compromised: Number of “compromised” conditions.
  14. Root_shell: 1 if root shell is obtained; 0 otherwise.
  15. Su_attempted: 1 if “su root” command attempted or used; 0 otherwise.
  16. Num_root: Number of “root” accesses or number of operations performed as a root in the connection.
  17. Num_file_creations: Number of file creation operations in the connection.
  18. Num_shells: Number of shell prompts.
  19. Num_access_files: Number of operations on access control files.
  20. Num_outbound_cmds: Number of outbound commands in an ftp session.
  21. Is_hot_login: 1 if the login belongs to the “hot” list i.e., root or admin; else 0.
  22. Is_guest_login: 1 if the login is a “guest” login; 0 otherwise.
  23. Count: Number of connections to the same destination host as the current connection in the past two seconds
  24. Srv_count: Number of connections to the same service (port number) as the current connection in the past two seconds.
  25. Serror_rate: The percentage of connections that have activated the flag (4) s0, s1, s2 or s3, among the connections aggregated in count (23)
  26. Srv_serror_rate: The percentage of connections that have activated the flag (4) s0, s1, s2 or s3, among the connections aggregated in srv_count (24)
  27. Rerror_rate: The percentage of connections that have activated the flag (4) REJ, among the connections aggregated in count (23)
  28. Srv_rerror_rate: The percentage of connections that have activated the flag (4) REJ, among the connections aggregated in srv_count (24)
  29. Same_srv_rate: The percentage of connections that were to the same service, among the connections aggregated in count (23)
  30. Diff_srv_rate: The percentage of connections that were to different services, among the connections aggregated in count (23)
  31. Srv_diff_host_rate: The percentage of connections that were to different destination machines among the connections aggregated in srv_count (24)
  32. Dst_host_count: Number of connections having the same destination host IP address.
  33. Dst_host_srv_count: Number of connections having the same port number.
  34. Dst_host_same_srv_rate: The percentage of connections that were to the same service, among the connections aggregated in dst_host_count (32).
  35. Dst_host_diff_srv_rate: The percentage of connections that were to different services, among the connections aggregated in dst_host_count (32)
  36. Dst_host_same_src_port_rate: The percentage of connections that were to the same source port, among the connections aggregated in dst_host_srv_count (33).
  37. Dst_host_srv_diff_host_rate: The percentage of connections that were to different destination machines, among the connections aggregated in dst_host_srv_count (33).
  38. Dst_host_serror_rate: The percentage of connections that have activated the flag (4) s0, s1, s2 or s3, among the connections aggregated in dst_host_count (32).
  39. Dst_host_srv_serror_rate: The percentage of connections that have activated the flag (4) s0, s1, s2 or s3, among the connections aggregated in dst_host_srv_count (33).
  40. Dst_host_rerror_rate: The percentage of connections that have activated the flag (4) REJ, among the connections aggregated in dst_host_count (32).
  41. Dst_host_srv_rerror_rate: The percentage of connections that have activated the flag (4) REJ, among the connections aggregated in dst_host_srv_count (33).

Attack Class:
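Both classification problems need a target built from the raw connection label. The sketch below shows one way to derive them. The mapping is an illustrative subset of attack names commonly documented for the KDD’99 data, not a complete list, and should be checked against the dataset documentation; `ATTACK_CLASS`, `to_multiclass` and `to_binary` are hypothetical names.

```python
# Illustrative subset of raw labels and their attack class (an assumption;
# verify and complete against the dataset documentation).
ATTACK_CLASS = {
    "normal": "normal",
    "neptune": "DOS",
    "smurf": "DOS",
    "back": "DOS",
    "teardrop": "DOS",
    "ipsweep": "PROBE",
    "portsweep": "PROBE",
    "nmap": "PROBE",
    "satan": "PROBE",
    "guess_passwd": "R2L",
    "ftp_write": "R2L",
    "warezclient": "R2L",
    "buffer_overflow": "U2R",
    "rootkit": "U2R",
}


def to_multiclass(label):
    """Target for the multinomial problem: normal, DOS, PROBE, R2L or U2R."""
    return ATTACK_CLASS.get(label, "unknown")


def to_binary(label):
    """Target for the binomial problem: 0 = normal, 1 = attack."""
    return 0 if label == "normal" else 1


raw_labels = ["normal", "neptune", "nmap", "rootkit", "guess_passwd"]
multiclass_targets = [to_multiclass(label) for label in raw_labels]
binary_targets = [to_binary(label) for label in raw_labels]
print(multiclass_targets)
print(binary_targets)
```

Labels missing from the mapping come back as "unknown", which makes gaps in the mapping visible instead of silently assigning a class.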

Online Job Posting Analysis

Business Context:

The project seeks to understand the overall demand for labour in the Armenian online job market from the 19,000 job postings from 2004 to 2015 posted on Career Center, an Armenian human resource portal. Through text mining on this data, we will be able to understand the nature of the ever-changing job market, as well as the overall demand for labour in the Armenian economy. The data was originally scraped from a Yahoo! mailing group.

Business Objectives:

Our main business objective is to understand the dynamics of the labour market of Armenia using the online job portal posts as a proxy. A secondary objective is to implement advanced text analytics as a proof of concept to create additional features, such as an enhanced search function, that can add value for the users of the job portal.

So as a data scientist you need to answer the following business questions.

Job Nature and Company Profiles:

What are the types of jobs that are in demand in Armenia? How are the job natures changing over time?

Desired Characteristics and Skill-Sets:

What are the desired characteristics and skill-sets of the candidates based on the job description dataset? How are these desired characteristics changing over time?

IT Job Classification:

Build a classifier that can tell us from the job description and company description whether a job is IT or not, so that this column can be automatically populated for new job postings. After doing so, understand which important factors drive this classification.

Similarity of Jobs:

Given a job title, find the 5 top jobs that are of a similar nature, based on the job post.

What should be our Text mining goal?

For the IT Job Classification business question, you should aim to create supervised learning classification models that can accurately classify, based on the job text data, whether a posting is an IT job.
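A supervised text classifier of this kind can be sketched with a TF-IDF vectoriser feeding a logistic regression. This is one possible model choice, not the one prescribed by the case study, and the eight tiny job texts are invented for the illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented job texts; 1 = IT job, 0 = not an IT job.
job_texts = [
    "software developer java python programming",
    "database administrator sql server network",
    "web developer javascript html programming",
    "system administrator linux network server",
    "accountant finance bookkeeping tax reporting",
    "sales manager marketing customers retail",
    "chief accountant finance audit reporting",
    "marketing specialist sales promotion customers",
]
is_it = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF turns each text into weighted word counts; the regression
# then learns one coefficient per word.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(job_texts, is_it)

predictions = classifier.predict(
    ["python developer programming", "finance accountant audit"]
).tolist()
print(predictions)

# Words with the largest positive coefficients push a post towards "IT".
vectorizer = classifier.named_steps["tfidfvectorizer"]
model = classifier.named_steps["logisticregression"]
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}
top_indices = model.coef_[0].argsort()[::-1][:3]
print([index_to_term[index] for index in top_indices])
```

Reading the largest coefficients is one simple way to see which words drive the classification.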

For the Job Nature and Company Profiles business question, unsupervised learning techniques such as topic modelling, along with other techniques such as term frequency counting, will be applied to the data, including datasets segmented by time period. A qualitative assessment of the results will help us understand the job postings.

To understand the desired characteristics and skill-sets demanded by employers in the job ads, unsupervised learning methods such as K-means clustering will be used after appropriate dimension reduction.

For the job queries business question, we propose exploring the use of a Latent Semantic Model and matrix similarity methods for information retrieval. The results will be assessed qualitatively. To return the top 5 most similar job postings, the job text data is vectorised using different models such as word2vec and doc2vec. Similarity scores are then computed with cosine similarity, ranked and returned as the answer, which is evaluated individually for relevance.
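The ranking step can be sketched without any model at all. Here a plain bag-of-words count stands in for the word2vec or doc2vec vectors mentioned above, so only the cosine-similarity ranking is illustrated; the three job posts are invented, and `vectorise` and `most_similar` are hypothetical helper names.

```python
import math
from collections import Counter


def vectorise(text):
    """Bag-of-words vector: word -> count (a stand-in for doc2vec)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, job_posts, top_n=5):
    """Return the top_n (title, score) pairs ranked by similarity to query."""
    query_vector = vectorise(query)
    scored = [
        (cosine_similarity(query_vector, vectorise(text)), title)
        for title, text in job_posts.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(title, round(score, 3)) for score, title in scored[:top_n]]


job_posts = {
    "Java Developer": "java software developer backend",
    "Accountant": "finance accounting tax",
    "QA Engineer": "software testing engineer",
}
ranking = most_similar("java software developer", job_posts, top_n=2)
print(ranking)
```

Swapping `vectorise` for a trained embedding model changes the vectors but leaves the ranking logic the same.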

Data Understanding:

The data was obtained from a Kaggle competition. Each row represents a job post. The dataset is tabular, but many of the columns are textual/unstructured in nature. Most notably, the columns job Description, Job Requirement, Required Qual, ApplicationP and AboutC are textual. The column job post is an amalgamation of these various textual columns.

A sample job posting is also provided (attached with the data set).

Bank Review and Complaints Analysis

Business Problem

Central banks collect information about customer satisfaction with the services provided by different banks. They also collect information about complaints.

  • Bank users give ratings and write reviews about services on central bank websites. These reviews and ratings help banks evaluate the services provided and take the necessary action to improve customer service. While ratings are useful to convey the overall experience, they do not convey the context which led a reviewer to that experience.
  • If we look at only the rating, it is difficult to guess why the user rated the service as 4 stars. However, after reading the review, it is not difficult to identify that the review talks about good “service” and “expectations”.

So the business requirement is to analyze customer reviews and predict customer satisfaction from the reviews. It should include the following tasks.

  • Data processing
  • Key positive words/negative words (most frequent words)
  • Classification of reviews into positive, negative and neutral
  • Identify key themes of problems (using clustering, topic models)
  • Predicting star ratings using reviews
  • Perform intent analysis
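Two of these tasks, labelling reviews and finding the most frequent words, can be sketched with the standard library. The star thresholds (4 and 5 positive, 3 neutral, 1 and 2 negative), the short stop-word list and the four reviews are all assumptions made for the illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "and", "was", "is", "to", "of", "i", "my", "very", "with", "it", "for"}


def label_from_stars(stars):
    """Map a 1-5 star rating to a sentiment label (assumed thresholds)."""
    if stars >= 4:
        return "positive"
    if stars == 3:
        return "neutral"
    return "negative"


def tokenize(text):
    """Lower-case the text, keep words only, drop stop words."""
    return [word for word in re.findall(r"[a-z']+", text.lower()) if word not in STOP_WORDS]


def top_words_by_label(reviews, top_n=3):
    """Most frequent words per sentiment label."""
    counters = {}
    for stars, text in reviews:
        label = label_from_stars(stars)
        counters.setdefault(label, Counter()).update(tokenize(text))
    return {label: counter.most_common(top_n) for label, counter in counters.items()}


# Invented (stars, text) pairs mimicking the Stars and Text columns.
reviews = [
    (5, "Great service and great rates"),
    (4, "Helpful service"),
    (1, "Terrible delays and hidden fees"),
    (2, "Hidden fees again"),
]
top_words = top_words_by_label(reviews, top_n=2)
print(top_words)
```

The labels derived from the stars can also serve as targets when training a model to classify reviews or predict ratings from the text.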

Datasets:

BankReviews.xlsx.

The data is a detailed dump of customer reviews/complaints (~500) of different services at different banks.

Data Dictionary:

  • Date (Day the review was posted)
  • Stars (1–5 rating for the business)
  • Text (Review text)
  • Bank name
