Python is very easy to learn and implement. For many people, including myself, the Python language is easy to fall in love with. Since its first appearance in 1991, Python’s popularity has grown steadily. Among interpreted languages, Python is distinguished by its large and active scientific computing community. Adoption of Python for scientific computing in both industry applications and academic research has increased significantly since the early 2000s.
For data analysis, exploratory analysis, and data visualization, Python compares favorably with many other domain-specific open source and commercial programming languages and tools, such as R, MATLAB, SAS, Stata, and others. In recent years, Python’s improved library support (primarily pandas) has made it a strong alternative for data manipulation tasks. Combined with Python’s strength in general-purpose programming, it is an excellent choice as a single language for building data-centric applications.
In short, we should choose Python for data analysis for the following reasons:
It is a very simple language to understand.
It is open source.
It has strong data science libraries.
Apart from the long-standing demand in web development projects, the use of Python is only going to grow as AI/ML projects become more mainstream and popular with global businesses.
As you can see in the chart below, Python is the most sought-after language in the industry.
To successfully create and run the code, we will need an environment set up with both general-purpose Python and the special packages required for data science.
In this tutorial we will use Python 3, because Python 2 is no longer supported as of January 1, 2020, and Python 3 has been around since 2008. So if you are new to Python, it is definitely worth learning Python 3 rather than the old Python 2.
Anaconda is a package manager, an environment manager, a Python/R data science distribution, and a collection of over 1,500 open source packages. Anaconda is free and easy to install, and it offers free community support too.
From the Start menu, click the Anaconda Navigator desktop app.
Run Python in a Jupyter Notebook:
On Navigator’s Home tab, in the Applications panel on the right, scroll to the Jupyter Notebook tile and click the Install button to install Jupyter Notebook.
Launch Jupyter Notebook by clicking Jupyter Notebook’s Launch button. This will launch a new browser window (or a new tab) showing the Notebook Dashboard.
On the top of the right hand side, there is a drop down menu labeled “New”. Create a new Notebook with the Python version you installed.
Rename your Notebook. Either click on the current name and edit it or find rename under File in the top menu bar. You can name it to whatever you’d like, but for this example we’ll use MyFirstAnacondaNotebook.
In the first line of the Notebook, type or copy/paste print("Hello Anaconda")
Save your Notebook by either clicking the save and checkpoint icon or selecting File – Save and Checkpoint in the top menu.
NumPy is the most basic and powerful package for working with data in Python. It stands for ‘Numerical Python’. It is a library consisting of multidimensional array objects and a collection of routines for processing arrays. It contains a collection of tools and techniques that can be used to solve, on a computer, mathematical models of problems in science and engineering.
If you are going to work on data analysis or machine learning projects, you should have a solid understanding of NumPy, because other packages for data analysis (like pandas) are built on top of NumPy, and the scikit-learn package, which is used to build machine learning applications, works heavily with NumPy as well.
What is an Array?
An array is basically a pointer plus some metadata. It is a combination of a memory address, a data type, a shape, and strides.
The data pointer indicates the memory address of the first byte in the array.
The data type, or dtype, describes the kind of elements that are contained within the array.
The shape indicates the shape of the array.
The strides are the numbers of bytes that should be skipped in memory to go to the next element. If your strides are (10, 1), you need to proceed one byte to get to the next column and 10 bytes to locate the next row.
In short, an array contains information about the raw data, how to locate an element, and how to interpret an element.
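The four pieces described above can be inspected directly on any NumPy array. The snippet below is a minimal sketch; the array contents and the variable names (arr, bytes_table, address) are made up for illustration.

```python
import numpy as np

# A 3x4 array of 8-byte integers, stored row by row (C order).
arr = np.arange(12, dtype=np.int64).reshape(3, 4)

print(arr.dtype)    # int64
print(arr.shape)    # (3, 4)
print(arr.strides)  # (32, 8): 32 bytes to the next row, 8 bytes to the next column

# The data pointer is exposed as an integer memory address.
address = arr.ctypes.data

# One-byte elements in rows of 10 give the strides (10, 1) mentioned above.
bytes_table = np.zeros((3, 10), dtype=np.int8)
print(bytes_table.strides)  # (10, 1)
```

Note that strides are measured in bytes, not in elements, which is why the same shape gives different strides for different dtypes.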
Operations using NumPy:
Using NumPy, a developer can perform the following operations:
Mathematical and logical operations on arrays.
Operations related to linear algebra. NumPy has built-in functions for linear algebra and random number generation.
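Here is a short sketch of each kind of operation listed above. The numbers are arbitrary, and the seed is an assumption chosen only to make the random draw repeatable.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 2.0, 1.0])

# Mathematical and logical operations work element by element.
total = a + b        # array([4., 4., 4.])
is_greater = a > b   # array([False, False,  True])

# Linear algebra: solve the system m @ x = y.
m = np.array([[2.0, 0.0], [0.0, 4.0]])
y = np.array([2.0, 8.0])
x = np.linalg.solve(m, y)   # array([1., 2.])

# Random number generation with a fixed seed for repeatability.
rng = np.random.default_rng(seed=0)
samples = rng.random(3)     # three floats in [0, 1)
```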
It is highly recommended that you install Python using the Anaconda distribution, so that all underlying dependencies (such as linear algebra libraries) stay in sync when you use conda install. If you have Anaconda, install NumPy by going to your terminal or command prompt and typing "conda install numpy".
Pandas is an open source Python library which allows you to perform data manipulation, analysis, and cleaning. It is built on top of NumPy. It is one of the most important libraries for data science.
According to Wikipedia, the name “pandas” is derived from the term “panel data”, an econometrics term for data sets that include observations over multiple time periods for the same individuals.
Here are the advantages of pandas for data scientists:
It easily handles missing data.
It provides an efficient way to slice and wrangle data.
It is helpful to merge, concatenate or reshape the data.
It includes a powerful time series tool to work with.
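To make these advantages concrete, here is a small sketch that touches each one. The store, revenue, and city values are hypothetical, and filling a gap with the column mean is just one of several ways to handle missing data.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data with one missing value.
sales = pd.DataFrame({
    "store": ["A", "B", "C"],
    "revenue": [100.0, np.nan, 300.0],
})

# Handling missing data: fill the gap with the column mean.
sales["revenue"] = sales["revenue"].fillna(sales["revenue"].mean())

# Slicing: keep only the rows with revenue above 150.
high = sales[sales["revenue"] > 150]

# Merging: attach a city to each store through the shared "store" column.
cities = pd.DataFrame({"store": ["A", "B", "C"], "city": ["Pune", "Delhi", "Goa"]})
merged = sales.merge(cities, on="store")

# Time series: a daily series resampled to weekly totals.
days = pd.date_range("2024-01-01", periods=14, freq="D")
daily = pd.Series(1, index=days)
weekly = daily.resample("W").sum()
```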
How to install Pandas?
To install pandas, go to the command line/terminal and type "pip install pandas", or, if you have Anaconda installed on your system, type "conda install pandas". Once the installation is complete, go to your IDE (Jupyter) and import it by typing "import pandas as pd".
In the next chapter we will learn about pandas Series.
The first main data type we will learn about for pandas is the Series data type.
A Series is a one-dimensional data structure. It is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a number location. It also doesn’t need to hold numeric data; it can hold any arbitrary Python object.
An important point to remember about a pandas Series is:
The values of the data are mutable.
Let’s import Pandas and explore the Series object with the help of python.
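As a minimal sketch of the points above, the snippet below builds a labeled Series, reads it by label and by position, and changes a value in place. The student names and marks are invented for illustration.

```python
import pandas as pd

# A Series with string labels as its index (hypothetical marks for three students).
marks = pd.Series([85, 92, 78], index=["asha", "ben", "carl"])

by_label = marks["ben"]       # label-based access
by_position = marks.iloc[0]   # position-based access

# Values are mutable: an existing entry can be overwritten in place.
marks["carl"] = 80

# A Series can also hold arbitrary Python objects.
mixed = pd.Series([1, "two", [3, 4]])
print(mixed.dtype)  # object
```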
A DataFrame is a standard way to store data, with the data aligned in a tabular fashion in rows and columns.
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let us assume that we are creating a DataFrame with student data; it will look something like this.
A pandas DataFrame can be created using the following constructor: pandas.DataFrame(data, index, columns, dtype, copy)
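One common way to call the constructor is with a dictionary of columns. The sketch below assumes some made-up student records and row labels (s1, s2, s3); only the data and index arguments are used here.

```python
import pandas as pd

# Hypothetical student records; each dict key becomes a column.
data = {
    "name": ["Asha", "Ben", "Carl"],
    "age": [20, 21, 19],
    "marks": [85.0, 92.0, 78.0],
}

students = pd.DataFrame(data, index=["s1", "s2", "s3"])

# Each column is itself a Series that shares the DataFrame's index.
ages = students["age"]

# A single row can be looked up by its label.
row = students.loc["s2"]
print(students)
```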
Suppose you just got some contract work with an e-commerce company based in New York City that sells clothing online but also offers in-store style and clothing advice sessions. Customers come in to the store, have sessions/meetings with a personal stylist, then go home and order the clothes they want on either a mobile app or the website.
The company is trying to decide whether to focus their efforts on their mobile app experience or their website. They’ve hired you on contract to help them figure it out! Let’s get started!
Just follow the steps below to analyze the customer data (it’s fake, don’t worry, I didn’t give you real credit card numbers or emails).
Nowadays, in analytics interviews, most interviewers ask questions about two algorithms: logistic and linear regression. Is there a reason behind this? Yes, there is: these algorithms are very easy to interpret. I believe you should have an in-depth understanding of both.
In this article we will learn about logistic regression in detail. So let’s dive deep into logistic regression.
What is Logistic Regression?
Logistic regression is a classification technique which helps to predict the probability of an outcome that can only have two values. Logistic Regression is used when the dependent variable (target) is categorical.
Types of Logistic Regression:
In contrast to a linear regression, a logistic regression produces a logistic curve, which is limited to values between 0 and 1. Logistic regression is similar to linear regression, but the curve is constructed using the natural logarithm of the “odds” of the target variable, rather than the probability.
What is the Sigmoid Function?
To map predicted values to probabilities, we use the sigmoid function. The function maps any real value into another value between 0 and 1.
$$S(z) = \frac{1}{1 + e^{-z}}$$
where:
S(z) = output between 0 and 1 (probability)
z = input to the function (your algorithm’s prediction, e.g. b0 + b1*x)
e = base of the natural logarithm
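The formula translates almost directly into NumPy. This is a minimal sketch; the coefficients b0 and b1 and the inputs x are hypothetical values picked so that the middle prediction lands exactly on z = 0.

```python
import numpy as np

def sigmoid(z):
    """Map any real value (or array of values) into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a one-feature model: z = b0 + b1 * x
b0, b1 = -1.0, 2.0
x = np.array([-2.0, 0.5, 3.0])

# z works out to [-5, 0, 5], so the middle probability is exactly 0.5.
probabilities = sigmoid(b0 + b1 * x)
print(probabilities)
```

A large negative z gives a probability near 0, z = 0 gives 0.5, and a large positive z gives a probability near 1.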
In linear regression, we use the ordinary least squares (OLS) method to determine the best coefficients and attain a good model fit, but in logistic regression, we use the maximum likelihood method to determine the best coefficients and eventually a good model fit.
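How the Maximum Likelihood Method Works
For a binary classification (1/0), maximum likelihood will try to find the values of b0 and b1 such that the resultant probabilities are as close as possible to 1 for the observations that actually belong to class 1 and to 0 for the observations that belong to class 0.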
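One way to see this idea in action is plain gradient ascent on the log-likelihood. The sketch below is an illustration under stated assumptions, not how statistical packages fit the model in practice: the toy data, the learning rate, and the number of iterations are all made up.

```python
import numpy as np

def log_likelihood(b0, b1, x, y):
    """Log-likelihood of binary labels y under a one-feature logistic model."""
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy data: larger x tends to go with class 1.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

b0, b1 = 0.0, 0.0
start = log_likelihood(b0, b1, x, y)
learning_rate = 0.05  # assumed step size

# Gradient ascent: move b0 and b1 in the direction that raises the likelihood.
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
    b0 += learning_rate * np.sum(y - p)
    b1 += learning_rate * np.sum((y - p) * x)

end = log_likelihood(b0, b1, x, y)
print(start, end)
```

The log-likelihood rises from its starting value as the coefficients improve, and b1 ends up positive because larger x values go with class 1 in this toy data.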
Logistic Regression Assumptions:
I found a very good consolidated list of assumptions on the Towards Data Science website, which I am sharing here:
Binary logistic regression requires the dependent variable to be binary.
For a binary regression, the factor level 1 of the dependent variable should represent the desired outcome.
Only meaningful variables should be included.
The independent variables should be independent of each other. That is, the model should have little or no multicollinearity.
The independent variables are linearly related to the log of odds.
Logistic regression requires quite large sample sizes.
Evaluation methods of logistic regression:
Akaike Information Criteria (AIC):
We can say AIC works as a counterpart of adjusted R-squared in multiple regression. The rule of thumb for AIC is: the smaller, the better. AIC penalizes an increasing number of coefficients in the model. In other words, adding more variables to the model will not lower AIC unless they improve the fit enough to offset the penalty. This helps to avoid overfitting.
Measuring the AIC of a single model is not fruitful. To use AIC correctly, build 2-3 logistic models and compare their AICs. The model with the lowest AIC is relatively better.
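The standard definition is AIC = 2k - 2 ln(L), where k is the number of fitted coefficients and L is the maximized likelihood. The comparison below is a sketch with hypothetical log-likelihood values, not results from a real data set.

```python
def aic(log_likelihood, n_params):
    """AIC = 2k - 2 ln(L), where k counts the fitted coefficients."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fitted models: (maximized log-likelihood, number of coefficients)
model_a = aic(log_likelihood=-120.0, n_params=3)   # 246.0
model_b = aic(log_likelihood=-119.5, n_params=6)   # 251.0

# Model B fits slightly better but pays for three extra coefficients.
better = "A" if model_a < model_b else "B"
print(better)  # A
```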
Null Deviance and Residual Deviance:
Null deviance is calculated from the model with no features, i.e. only intercept. The null model predicts class via a constant probability.
Residual deviance is calculated from the model having all the features. For both null and residual deviance, the lower the value, the better the model.
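For ungrouped binary data, the deviance is -2 times the log-likelihood, so both quantities can be sketched in a few lines. The labels and the model probabilities below are hypothetical; in practice the probabilities would come from a fitted model.

```python
import numpy as np

def deviance(y, p):
    """Deviance = -2 * log-likelihood for binary labels y and probabilities p."""
    return -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([0, 0, 1, 1])

# Null model: the same probability (the share of 1s) for every observation.
p_null = np.full(len(y), y.mean())

# Hypothetical probabilities from a model that uses the features.
p_model = np.array([0.1, 0.3, 0.7, 0.9])

null_dev = deviance(y, p_null)
residual_dev = deviance(y, p_model)
print(null_dev, residual_dev)
```

A residual deviance well below the null deviance suggests that the features add information beyond the intercept alone.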
A confusion matrix is nothing but a tabular representation of actual vs. predicted values. This helps us to find the accuracy of the model and avoid overfitting. This is how it looks:
So now we can calculate the accuracy: (TP + TN) / (TP + TN + FP + FN).
True Positive Rate (TPR):
It shows how many positive values, out of all the positive values, have been correctly predicted.
The formula to calculate the true positive rate is TP / (TP + FN). Or TPR = 1 - False Negative Rate. It is also known as Sensitivity or Recall.
False Positive Rate (FPR):
It shows how many negative values, out of all the negative values, have been incorrectly predicted.
The formula to calculate the false positive rate is FP / (FP + TN). Also, FPR = 1 - True Negative Rate.
True Negative Rate (TNR):
It represents how many negative values, out of all the negative values, have been correctly predicted. The formula to calculate the true negative rate is TN / (TN + FP). It is also known as Specificity.
False Negative Rate (FNR):
It indicates how many positive values, out of all the positive values, have been incorrectly predicted. The formula to calculate the false negative rate is FN / (FN + TP).
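The four rates and the accuracy can be computed from the four cells of the confusion matrix. The counts below are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
# Hypothetical confusion-matrix counts
tp, fn, fp, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 0.85

tpr = tp / (tp + fn)   # 0.8, sensitivity / recall
fpr = fp / (fp + tn)   # 0.1
tnr = tn / (tn + fp)   # 0.9, specificity
fnr = fn / (fn + tp)   # 0.2

print(accuracy, tpr, fpr, tnr, fnr)
```

The pairs are complementary: TPR + FNR = 1 and FPR + TNR = 1.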
Precision:
It indicates how many values, out of all the predicted positive values, are actually positive. The formula is TP / (TP + FP).
F Score:
F score is the harmonic mean of precision and recall. It lies between 0 and 1. The higher the value, the better the model. The formula is 2 * (precision * recall) / (precision + recall).
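Here is a minimal sketch of both formulas, using hypothetical counts for true positives, false positives, and false negatives.

```python
# Hypothetical counts
tp, fp, fn = 40, 5, 10

precision = tp / (tp + fp)   # 40 / 45
recall = tp / (tp + fn)      # 40 / 50 = 0.8

# Harmonic mean of precision and recall.
f_score = 2 * (precision * recall) / (precision + recall)
print(round(f_score, 4))  # 0.8421
```

Because it is a harmonic mean, the F score is pulled toward the lower of the two values, so a model cannot score well by maximizing only precision or only recall.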
Receiver Operating Characteristic (ROC):
ROC is used to assess the performance of a classification model. It summarizes the model’s performance using the Area Under the Curve (AUC). The higher the area, the better the model. ROC is plotted with the True Positive Rate on the Y axis and the False Positive Rate on the X axis.
In the graph below, the yellow line represents the ROC curve at a threshold of 0.5. At this point, sensitivity = specificity.
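To show where the curve and its area come from, the sketch below builds the ROC points by hand and integrates them with the trapezoidal rule. The labels and scores are hypothetical; libraries such as scikit-learn provide ready-made functions for this, and the manual loop is only for illustration.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Sweep the threshold from high to low and record (FPR, TPR) at each step.
thresholds = np.sort(np.unique(y_score))[::-1]
tpr = [0.0]
fpr = [0.0]
for t in thresholds:
    predicted = y_score >= t
    tpr.append(np.sum(predicted & (y_true == 1)) / np.sum(y_true == 1))
    fpr.append(np.sum(predicted & (y_true == 0)) / np.sum(y_true == 0))

# Area under the curve by the trapezoidal rule.
auc = sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
          for i in range(1, len(fpr)))
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.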