Banking Credit Card Spend Prediction and Identify Drivers for Spends

Business Problem:

One of the global banks would like to understand what factors driving credit card spend are. The bank want use these insights to calculate credit limit. In order to solve the problem, the bank conducted survey of 5000 customers and collected data.

The objective of this case study is to understand what’s driving the total spend (Primary Card + Secondary card). Given the factors, predict credit limit for the new applicants.

Data Availability:

  • Data for the case are available in xlsx format.
  • The data have been provided for 5000 customers.
  • Detailed data dictionary has been provided for understanding the data in the data.
  • Data is encoded in the numerical format to reduce the size of the data however some of the variables are categorical. You can find the details in the data dictionary

Let’s develop a machine learning model for further analysis.

Store Sales Prediction – Forecasting

Business Context:

The objective is predicting store sales using historical markdown data. One challenge of modelling retail data is the need to make decisions based on limited history. If Christmas comes but once a year, so does the chance to see how strategic decisions impacted the bottom line.

Business Problem:

Company provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department-wide sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modelling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.

Data Availability:

stores.csv: This file contains anonymized information about the 45 stores, indicating the type and size of store.

train.csv: This is the historical training data, which covers to 2010-02-05 to 2012-11- 01, Within this file you will find the following fields:

  • Store – the store number
  • Dept – the department number
  • Date – the week
  • Weekly_Sales – sales for the given department in the given store
  • IsHoliday – whether the week is a special holiday week

test.csv: This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.

features.csv: This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

  • Store – the store number
  • Date – the week
  • Temperature – average temperature in the region
  • Fuel_Price – cost of fuel in the region
  • MarkDown1-5 – anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
  • CPI – the consumer price index
  • Unemployment – the unemployment rate
  • IsHoliday – whether the week is a special holiday week

Let’s develop a machine learning model for further analysis.

Linear Regression-Theory

Linear regression is a supervised machine learning technique where we need to predict a continuous output, which has a constant slope.

There are two main types of linear regression:

1. Simple Regression:

Through simple linear regression we predict response using single features.

If you recall, the line equation (y = mx + c) we studied in schools. Let’s understand what these parameters say and how this equation works in linear regression.

Y = βo + β1X + ∈

Where, Y = Dependent Variable ( This is the variable we predict )

            X = Independent Variable ( This is the variable we use to make a prediction )

            βo – This is the intercept term. It is the prediction value you get when X = 0

            β1 – This is the slope term. It explains the change in Y when X changes by 1 unit.

∈ – This represents the residual value, i.e. the difference between actual and predicted values.

2. Multivariable regression:

It is nothing but extension of simple linear regression. It attempts to model the relationship between two or more features and a response by fitting a linear equation to observed data.

Multi variable linear equation might look like this, where w represents the coefficients, or weights, our model will try to learn.

f(x,y,z)=w1x+w2y+w3z

Let’s understand it with example.

In a company for sales predictions, these attributes might include a company’s advertising spend on radio, TV, and newspapers.

Sales=w1Radio+w2TV+w3News

Linear Regression geometrical representation

So our goal in linear regression model is:

Find a line or plane that best fits the data points. Here best fit means minimise the sum of errors across our training data.

Types of Deliverable in linear regression:

Typically there are following questions that a business wanted to know

  1. They wanted to know their sales or profit prediction.
  2. Drivers(What drives the sales?)
    • All variable that have significant beta.
    • Which factors are detrimental /incremental?
    •  All the drivers, which one should target first?(Variable with highest absolute value)
  3. Why you are predicting these values?
    • To answer this question, you need calculate (beta*X )for each X variable and you need to choose the highest value and accordingly you can choose your driver after that convince business why you have chosen the particular driver.

So now the question arises how we calculate Beta values?

To calculate beta we use OLS(ordinary least squared) method.

Assumptions of Linear Regression:

1. X variables (Explanatory variable) should be linearly related to Y (Response Variable):

Meaning:

If you plot a scatter plot between x variable and Y, most of the data point should be around the straight line.

How to check?

Draw the scatter plot between each x variable and y variable.

What happens if the assumption is violated?

MSE(Error) will be high.

What to do if variable is not linear?

  • Drop the variable – But in this case will loose the information.
  • Take log(x+1) of x variables. 

2.Residual or the y variable should be normally distributed:

Meaning:

Residuals (errors) or Y, when plotted in a histogram produces a bell shaped curve.

How to check?

Plot a histogram of Y, when plotted histogram produces a bell- shaped curve then it follows normality.

Or we can also use  q-q plot(quantile- quantile plot) of residuals

What happens if the assumption is violated?

It means all the P values has been calculated wrongly.

What to do if assumption is violated?

In that case we need to transform our Y such a way so that it become normal. To do that we need to use log of Y.

3.There should not be any relationship between X variables (i.e no multicollinearity)

Meaning:

X variable should not have any linear relationship between themselves. It’s obvious that we don’t want same information repeat mode.

How to check?

  1. Calculate correlation between every X with every other X variable.
  2. Second method is calculate VIF(Variance influence factor)

What happens if the assumption is violated?

Your beta value sign will fluctuate.

What to do if assumption is violated?

Drop those X variable whose VIF is greater than 10(VIF>10)

4. The variance of error should remain constant over value of Y (Homoscedasticity/ No heteroskedasticity )

Meaning:

Spread of residuals should remain constant with values of Y.

How to check?

Draw scatter plot of residuals VS Y.

What happens if the assumption is violated?

Your P value will not accurate.

What to do if assumption is violated?

In that case we need to transform our Y such a way so that it become normal. To do that we need to use log of Y.

5. There should not be any auto-correlation between the residuals.

Meaning:

Correlation of residuals with lead residuals. Here lead residuals means next residual(Which we will see in next chapter )

How to check?

Use DW stats(Durbin Watson Stats)

            If DW stats ~ 2, then no auto correlation.

What happens if the assumption is violated?

Your P value will not accurate.

What to do if assumption is violated?

Understand the reason why it is happening?

If autocorrelation is due to Y then cannot build linear regression model.

If autocorrelation is due to X then drop that X variable.

In the next lecture we will see how to implement leaner regression in python.

Linear regression with python

Company Objective:

Let’s suppose You just got some contract work with an Ecommerce company based in New York City that sells clothing online but they also have in-store style and clothing advice sessions. Customers come in to the store, have sessions/meetings with a personal stylist, then they can go home and order either on a mobile app or website for the clothes they want.

The company is trying to decide whether to focus their efforts on their mobile app experience or their website. They’ve hired you on contract to help them figure it out! Let’s get started!

Just follow the steps below to analyze the customer data (it’s fake, don’t worry I didn’t give you real credit card numbers or emails ). Click here to download

Click here to download .ipnyb notebook