Archive November 2019

Pandas-Merging/Joining

Merging is the process of bringing two datasets together into one, aligning the rows from each based on common attributes or columns.

Data Merging:

The pandas.merge() method joins two data frames, aligning the rows from each by a “key” variable that contains matching values.

Pandas has separate merge() and join() methods, but both do similar work.

With pandas.merge(), you can only combine two data frames at a time. If you have more than two data frames to merge, you will have to call the method multiple times.

Let’s see pandas.merge() and some of the available arguments. Here is the general structure with the recommended bare-minimum arguments:

pandas.merge(left_data_frame, right_data_frame, on= , how= )

  • left is one of the data frames.
  • right is the other data frame.
  • on is the variable, a.k.a. the column, on which you want to merge. This is the key variable and has to have the same name in both data frames.
  • If the data frames have different column names for the merge variables, you can use left_on and right_on instead.
    • left_on is the variable name in the left data frame to be merged on.
    • right_on is the variable name in the right data frame to be merged on.

how is where you pass the merge options. These include:

  • “inner”, where only the observations with matching values for the “on” column in both data frames are kept.
  • “left”, where all observations from the data frame in the left argument are kept, regardless of whether they have matching values in the right data frame. Observations in the right data frame without a match on the “on” column are discarded.
  • “right”, where all observations from the data frame in the right argument are kept, regardless of whether they have matching values in the left data frame. Observations in the left data frame without a match on the “on” column are discarded.
  • “outer”, where all observations are kept from both data frames.

Now let’s understand these concepts with an example.

We are going to use the following datasets for the operations.

  • user_usage.csv – This dataset contains users’ monthly mobile usage details.
  • user_device.csv – This dataset contains details of individual “uses” of the system, with dates and device information.
  • android_devices.csv – This dataset contains device and manufacturer data, listing Android devices and their model codes.

Let’s understand the concept with code.
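For instance, here is a minimal sketch of merging the first two datasets; it assumes both files share a use_id key column, so adjust the key to your data:

import pandas as pd

# Load the two datasets (assumed to share a "use_id" key column)
user_usage = pd.read_csv('user_usage.csv')
user_device = pd.read_csv('user_device.csv')

# Inner merge: keep only rows whose use_id appears in both data frames
result = pd.merge(user_usage, user_device, on='use_id', how='inner')

# Left merge: keep every row of user_usage; unmatched device fields become NaN
result_left = pd.merge(user_usage, user_device, on='use_id', how='left')

print(result.head())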

Pandas-concat() method

The pandas.concat() method combines two data frames by stacking them on top of each other. If one of the data frames is missing a column that the other has, the corresponding observations are filled with NaN values.

new_concat_dataframe = pd.concat([dataframe1, dataframe2], ignore_index=True)

Note – If you wish for a new index starting at 0, pass ignore_index=True.

Let’s understand the concat() function through code.
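For instance, a minimal sketch with two toy data frames (the names and values are made up):

import pandas as pd

# Two toy data frames with partially overlapping columns
dataframe1 = pd.DataFrame({'name': ['Asha', 'Ben'], 'usage': [120, 80]})
dataframe2 = pd.DataFrame({'name': ['Carol'], 'device': ['GT-I9505']})

# Stack the frames; the missing "usage"/"device" cells are filled with NaN
new_concat_dataframe = pd.concat([dataframe1, dataframe2], ignore_index=True)
print(new_concat_dataframe)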

Pandas–Missing Data

In real-world scenarios, missing data is a big problem in data analysis. In machine learning and data mining, accuracy is compromised by the poor data quality that missing values cause.

Missing data is represented as NA (Not Available) or NaN (Not a Number) values in pandas.

Why is data missing?

Suppose you survey people and need their name, address, phone number, and income, but some respondents don’t want to share their address or income; those values end up missing from the dataset.

Finding missing values

To check for missing values in a pandas DataFrame we use the isnull() and notnull() functions. Both check whether each value is NaN or not. These functions can also be used on a pandas Series to find null values.
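For instance, a minimal sketch with made-up data:

import numpy as np
import pandas as pd

# A toy column with one missing value
df = pd.DataFrame({'income': [52000, np.nan, 61000]})

print(df['income'].isnull())        # True where the value is NaN
print(df['income'].notnull())       # True where the value is present
print(df['income'].isnull().sum())  # number of missing values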

Cleaning / Filling Missing values:

There are the following ways to treat missing values.

  1. Filling missing values using fillna() and replace():

To fill null values in a dataset we use fillna() or replace(). For example, we can call fillna() on a data frame column and pass either the column’s mean() or median() as the fill value.

#Impute with mean on column_1
df['column_1'] = df['column_1'].fillna(df['column_1'].mean())

#Impute with median on column_1
df['column_1'] = df['column_1'].fillna(df['column_1'].median())

Besides mean and median, imputing missing data with 0 can also be a good idea in some cases.

#Impute with value 0 on column_1
df['column_1'] = df['column_1'].fillna(0)

2. Dropping missing values using dropna():

This is not always a good way to treat missing values. If your data has a large number of missing values, you can’t use this method, because it may throw away important information.

To drop null values from a data frame, we use the dropna() function, which drops rows or columns containing null values in different ways.

#Drop rows with null values
df = df.dropna(axis=0)

#Drop rows where column_1 has null values
df = df.dropna(subset=['column_1'])

The axis parameter determines the dimension the function acts on:
axis=0 removes all rows that contain null values.
axis=1 removes all columns that contain null values.

Let’s understand the concept with Python.
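Here is a minimal combined sketch of fillna() and dropna() on a toy data frame:

import numpy as np
import pandas as pd

# Toy data with missing values in both columns
df = pd.DataFrame({'age': [25, np.nan, 31], 'city': ['Pune', 'Delhi', None]})

# Impute the numeric column, then drop any rows that are still incomplete
df['age'] = df['age'].fillna(df['age'].median())
df = df.dropna(axis=0)
print(df)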

Pandas–Operations

There are various useful pandas operations that come in handy in data analysis.

In this lecture we are going to cover the following topics:

  • Finding unique values
  • Selecting data with multiple conditions
  • Applying a function to a particular column
  • Removing a column
  • Getting column and index names
  • Sorting by column
  • Checking for null values
  • Filling NaN values with something else
  • Creating a pivot table
  • Changing a column name in a pre-existing data frame
  • The .map() and .apply() functions
  • Getting column names in the data frame
  • Changing the order of columns in the data frame
  • Adding a new column to an existing data frame
  • Data type conversion
  • Date and time conversion

Let’s see all these operations in Python.
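Here is a minimal sketch of a few of these operations (the data frame and column names are made up):

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 2, 3], 'col2': [10, 20, 30, 40]})

print(df['col1'].unique())                       # unique values
print(df[(df['col1'] > 1) & (df['col2'] < 40)])  # multiple conditions
df['col3'] = df['col2'].apply(lambda x: x * 2)   # apply a function to a column
df = df.drop('col3', axis=1)                     # remove a column
df = df.sort_values(by='col2', ascending=False)  # sort by column
df = df.rename(columns={'col1': 'key'})          # change a column name
print(df.columns)                                # get column names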

Data Science – An Introduction

What is Data science?

It is a field that deals with the identification and extraction of meaningful information from data sources with the help of various scientific methods and algorithms. This helps with better decision-making, promotional offers, and predictive analytics for any business or organization.

What are the skills required to be a Data scientist?

  • Programming Skills
    • Python
    • R
    • Database Query Languages.
  • Statistics and Probability
  • BI Tools – Tableau, Power BI, Qlik Sense
  • Business Domain Knowledge

Data Science Life Cycle:

[Figure: data science life cycle diagram]

Data Scientist vs. Data Analyst vs. Data Engineer

 Data Analyst:

It is an entry-level job for professionals interested in getting into a data-related role. Organisations expect a data analyst to understand data handling, modeling, and reporting techniques, along with a strong understanding of the business. A data analyst requires good knowledge of visualization tools and databases. The two most popular tools used by data analysts are SQL and Microsoft Excel.

It is also necessary for a data analyst to have good presentation skills. These help them communicate results to the team and reach proper solutions.

Data Engineer:

A data engineer specializes in preparing data for analytical use. They have a good grasp of data pipelining and performance optimization. A data engineer requires a strong technical background, with the ability to create and integrate APIs. Data engineering also involves developing platforms and architectures for data processing.

So what skills are required to be a data engineer?

  • Data Warehousing & ETL
  • Advanced programming knowledge
  • Machine learning concept knowledge
  • In-depth knowledge of SQL/ database
  • Hadoop-based Analytics
 
Data Scientist:

A data scientist is a person who applies their knowledge of statistics and machine learning models to make predictions that answer key business questions. They deal with big, messy datasets as big-data wranglers, applying math, programming, and statistics skills to clean and organize the data.

Once the data is clean, the data scientist applies machine learning algorithms to find hidden insights in the data and draw a meaningful summary from it.

Skill set for a data scientist:

  • In-depth programming knowledge of SAS/R/Python.
  • Statistics and mathematics concepts.
  • Machine learning algorithms.
  • Python libraries such as pandas, NumPy, SciPy, Matplotlib, Seaborn, and StatsModels.

Linear Regression-Theory

Linear regression is a supervised machine learning technique used to predict a continuous output, where the fitted relationship has a constant slope.

There are two main types of linear regression:

1. Simple Regression:

Through simple linear regression we predict the response using a single feature.

Recall the line equation (y = mx + c) we studied in school. Let’s understand what these parameters mean and how this equation works in linear regression.

Y = β0 + β1X + ε

Where:

  • Y = the dependent variable (the variable we want to predict)
  • X = the independent variable (the variable we use to make the prediction)
  • β0 = the intercept term; the predicted value of Y when X = 0
  • β1 = the slope term; the change in Y when X changes by 1 unit
  • ε = the residual term, i.e. the difference between the actual and predicted values

2. Multivariable regression:

It is an extension of simple linear regression: it models the relationship between two or more features and a response by fitting a linear equation to the observed data.

A multivariable linear equation might look like this, where the w terms represent the coefficients, or weights, our model will try to learn.

f(x, y, z) = w1·x + w2·y + w3·z

Let’s understand it with an example.

For sales prediction at a company, these features might include the company’s advertising spend on radio, TV, and newspapers.

Sales = w1·Radio + w2·TV + w3·News
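As an illustration, here is a minimal sketch of fitting such a model with scikit-learn; the spend and sales figures below are made up:

import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: Radio, TV, News advertising spend; target: sales
X = np.array([[10, 50, 5],
              [20, 40, 10],
              [30, 60, 8],
              [15, 45, 12]])
y = np.array([22.1, 24.5, 31.0, 23.4])

model = LinearRegression().fit(X, y)
print(model.coef_)       # the learned weights w1, w2, w3
print(model.intercept_)  # the intercept term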

Linear Regression geometrical representation

So our goal in a linear regression model is:

Find a line or plane that best fits the data points. Here, “best fit” means minimising the sum of errors across our training data.

Types of Deliverable in linear regression:

Typically, a business wants answers to the following questions:

  1. Sales or profit predictions.
  2. Drivers (what drives the sales?)
    • All variables that have a significant beta.
    • Which factors are detrimental or incremental?
    • Of all the drivers, which one should be targeted first? (The variable with the highest absolute beta value.)
  3. How to choose the driver?
    • To answer this, calculate beta*X for each X variable and choose the highest value; pick your driver accordingly, then convince the business why you chose that particular driver.

So now the question arises: how do we calculate the beta values?

To calculate the beta values we use the OLS (ordinary least squares) method, sketched below.
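Here is a minimal sketch of OLS via the normal equation β = (XᵀX)⁻¹Xᵀy, on made-up data:

import numpy as np

# Design matrix with an intercept column of ones, plus one X variable
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0],
              [1.0, 7.0]])
y = np.array([4.1, 5.9, 10.2, 13.8])

# Normal equation: beta = (X'X)^(-1) X'y
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta)  # [intercept beta0, slope beta1]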

Assumptions of Linear Regression:

1. X variables (explanatory variables) should be linearly related to Y (the response variable):

Meaning:

If you plot a scatter plot between an X variable and Y, most of the data points should lie around a straight line.

How to check?

Draw a scatter plot between each X variable and the Y variable.

What happens if the assumption is violated?

The MSE (Mean Squared Error) will be high. MSE is the average of the squared errors between the predicted and actual values. It can be written as:

MSE = (1/N) · Σ (Yi − (a1xi + a0))²

Where,

N = total number of observations
Yi = actual value
(a1xi + a0) = predicted value

What to do if a variable is not linear?

  • Drop the variable – but in that case we lose its information.
  • Take log(x+1) of the X variable.

2. Residuals or the Y variable should be normally distributed:

Meaning:

Residuals (errors) or Y, when plotted in a histogram, should produce a bell-shaped curve.

Residuals: the difference between the actual and predicted values is called the residual. If the observed points are far from the regression line, the residuals will be high and so will the cost function. If the points are close to the regression line, the residuals will be small and hence so will the cost function.

How to check?

Plot a histogram of Y; if it produces a bell-shaped curve, normality holds.

Alternatively, we can use a Q-Q plot (quantile-quantile plot) of the residuals, as sketched below.
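A minimal sketch with statsmodels; the residuals array here is a random stand-in for your model’s residuals:

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

# Stand-in residuals; replace with your model's residuals
residuals = np.random.normal(size=200)

# Points close to the 45-degree line suggest the residuals are normal
sm.qqplot(residuals, line='45')
plt.show()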

What happens if the assumption is violated?

The p-values will be calculated incorrectly.

What to do if assumption is violated?

In that case we need to transform Y in such a way that it becomes normal; to do that, we take the log of Y.

3. There should not be any relationship between the X variables (i.e. no multicollinearity):

Meaning:

The X variables should not have any linear relationship among themselves; obviously, we don’t want the same information repeated.

How to check?

  1. Calculate the correlation of every X variable with every other X variable.
  2. Calculate the VIF (variance inflation factor); see the sketch after this list.
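Here is a minimal sketch of the VIF check with statsmodels; the column names and values are made up, with two deliberately near-collinear columns:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy predictors; tv_spend and tv_spend_copy are nearly collinear on purpose
df = pd.DataFrame({
    'radio_spend': [10, 20, 30, 15, 25],
    'tv_spend': [50, 40, 60, 45, 55],
    'tv_spend_copy': [51, 40, 61, 44, 56],
})

X = sm.add_constant(df)  # add an intercept column before computing VIF
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop('const'))  # consider dropping predictors with VIF > 10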

What happens if the assumption is violated?

The signs of your beta values will fluctuate.

What to do if assumption is violated?

Drop those X variables whose VIF is greater than 10 (VIF > 10).

4. The variance of the errors should remain constant over the values of Y (homoscedasticity, i.e. no heteroskedasticity):

Meaning:

The spread of the residuals should remain constant across values of Y.

How to check?

Draw a scatter plot of residuals vs. Y.

What happens if the assumption is violated?

Your p-values will not be accurate.

What to do if assumption is violated?

In that case we need to transform Y in such a way that it becomes normal; to do that, we take the log of Y.

5. There should not be any autocorrelation among the residuals:

Meaning:

Autocorrelation is the correlation of each residual with the lead residual, where the lead residual means the next calculated residual.

How to check?

Use the Durbin-Watson (DW) statistic, as sketched below.

If the DW statistic is approximately 2, there is no autocorrelation.
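A minimal sketch with statsmodels; the residuals array is again a random stand-in for your model’s residuals:

import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Stand-in residuals; replace with your model's residuals
residuals = np.random.normal(size=200)

print(durbin_watson(residuals))  # a value near 2 suggests no autocorrelation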

What happens if the assumption is violated?

Your p-values will not be accurate.

What to do if assumption is violated?

Understand why it is happening.

If the autocorrelation is due to Y, we cannot build a linear regression model.

If the autocorrelation is due to an X variable, drop that X variable.

How to check Model Performance?

Goodness of fit determines how well the regression line fits the set of observations. The process of finding the best model among various models is called optimization. It can be assessed by the method below:

R-squared method:

  • R-squared is a statistical measure that determines the goodness of fit.
  • It measures the strength of the relationship between the dependent and independent variables on a scale of 0-100%.
  • A high value of R-squared means less difference between the predicted and actual values, and hence represents a good model.
  • It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression.
  • It can be calculated from the formula below:
R² = Explained variation / Total variation = 1 − (Sum of squared residuals / Total sum of squares)

In the next lecture we will see how to implement linear regression in Python.

Linear Regression with Python

Company Objective:

Suppose you just got some contract work with an e-commerce company based in New York City that sells clothing online, but also offers in-store style and clothing advice sessions. Customers come into the store, have sessions/meetings with a personal stylist, then go home and order the clothes they want on either the mobile app or the website.

The company is trying to decide whether to focus its efforts on the mobile app experience or the website. They’ve hired you on contract to help them figure it out. Let’s get started!

Just follow the steps below to analyze the customer data (it’s fake; don’t worry, I didn’t give you real credit card numbers or emails). Click here to download

Click here to download the .ipynb notebook
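To preview the workflow before opening the notebook, here is a minimal sketch; the file name and column names are assumptions, so adjust them to the downloaded data:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical file and column names for the customer data
customers = pd.read_csv('Ecommerce Customers.csv')
X = customers[['Time on App', 'Time on Website', 'Length of Membership']]
y = customers['Yearly Amount Spent']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(model.coef_)                  # compare the app vs. website coefficients
print(model.score(X_test, y_test))  # R-squared on held-out data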