November 2019

Nov 12, 2019

Matplotlib-Introduction

Matplotlib is the “grandfather” library of data visualization with Python. It was created by John Hunter to replicate the plotting capabilities of MATLAB (another programming language) in Python. So if you happen to be familiar with MATLAB, matplotlib will feel natural to you.

It is an excellent 2D and 3D graphics library for generating scientific figures.

Some of the major Pros of Matplotlib are:

Generally easy to get started for simple plots
Support for custom labels and texts
Great control of every element in a figure
High-quality output in many formats
Very customizable in general

Matplotlib allows you to create reproducible figures programmatically. Let’s learn how to use it! I encourage you to explore the official Matplotlib web page: http://matplotlib.org/

Installation of Matplotlib:

To install the latest release of matplotlib, you can use pip:

pip install matplotlib

You can also use conda to install the latest version of matplotlib:

conda install matplotlib
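
Once the install finishes, a quick way to check it is to import the library and draw one small figure. The snippet below is only a minimal sketch: the data values, labels and title are made up for illustration, and the "Agg" backend line is there so the script also runs on a machine without a display (leave it out in a notebook).

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# If the import above worked, the install is fine; print the version for reference.
print(matplotlib.__version__)

# Made-up data for a first plot.
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("First plot")
```

To look at the result, call fig.savefig("first_plot.png") to write an image file, or plt.show() when working with an interactive backend.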

From the next lecture onward we will learn how to plot different kinds of charts with the help of matplotlib.

Nov 12, 2019

Matplotlib-plots

By Datasciencelovers in Data visualization Tag area plot, bargraph, histogram, matplotlib, pieplot, scatter plot

Various plots can be created using Python’s matplotlib. Some of them are listed below:

Bar graph
Histogram
Scatter plot
Area plot
Pie plot

Now let’s understand how to plot the above graphs.
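
As a rough preview, the five plot types listed above can each be drawn with a single Axes method. This is a sketch with made-up data: the category names, counts and random samples are assumptions chosen only so that every panel has something to show.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up data for illustration.
rng = np.random.default_rng(0)
categories = ["A", "B", "C", "D"]
counts = [5, 3, 8, 6]
samples = rng.normal(loc=50, scale=10, size=200)
x = np.arange(10)
y = x + rng.normal(scale=1.0, size=10)

fig, axes = plt.subplots(2, 3, figsize=(12, 7))
axes = axes.ravel()

axes[0].bar(categories, counts)          # bar graph
axes[0].set_title("Bar graph")

axes[1].hist(samples, bins=15)           # histogram
axes[1].set_title("Histogram")

axes[2].scatter(x, y)                    # scatter plot
axes[2].set_title("Scatter plot")

axes[3].fill_between(x, x ** 2, alpha=0.5)  # area plot
axes[3].set_title("Area plot")

axes[4].pie(counts, labels=categories, autopct="%1.0f%%")  # pie plot
axes[4].set_title("Pie plot")

axes[5].axis("off")                      # unused sixth panel
fig.tight_layout()
```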

Nov 12, 2019

Seaborn-Introduction

By Datasciencelovers in Data visualization Tag Data visualization, python-seaborn, seaborn

As per Seaborn’s official website, they state,

“If matplotlib “tries to make easy things easy and hard things possible”, seaborn tries to make a well-defined set of hard things easy too”

So we can say seaborn is an amazing Python data visualization library built on top of matplotlib.

Why should one use Seaborn instead of matplotlib?

Seaborn comes with a large number of high-level interfaces and customized themes. Matplotlib lacks these, and it’s not easy to figure out the settings that make its plots attractive.
Matplotlib functions are built mainly around arrays, so plotting from dataframes usually takes extra steps, whereas seaborn functions accept a dataframe and its column names directly.
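
The dataframe point is easiest to see in code. Below is a small sketch, assuming a made-up dataframe with hypothetical columns height_cm, weight_kg and group: seaborn takes the dataframe plus column names, applies a theme, and handles the per-group colours and the legend itself.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd
import seaborn as sns

# Made-up data; the column names are hypothetical.
df = pd.DataFrame({
    "height_cm": [150, 160, 165, 172, 180, 185],
    "weight_kg": [50, 56, 61, 68, 77, 82],
    "group": ["a", "a", "b", "b", "c", "c"],
})

sns.set_theme(style="whitegrid")  # one call switches on a ready-made theme

# Columns are referenced by name; hue colours the points by group.
ax = sns.scatterplot(data=df, x="height_cm", y="weight_kg", hue="group")
```

Axis labels come from the column names, which is one less thing to set by hand.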

Installation:

To install the latest release of seaborn, you can use pip.

pip install seaborn

You can also use conda to install the latest version of seaborn:

conda install seaborn

Nov 12, 2019

Seaborn-Categorical Data Plots

By Datasciencelovers in Data visualization Tag categorical data plot, data analysis, Data visualization, python-seaborn, seaborn

Now let’s discuss using seaborn to plot categorical data! There are a few main plot types for this:

catplot (named factorplot before seaborn 0.9)
boxplot
violinplot
stripplot
swarmplot
barplot
countplot

Let’s go through examples of each!
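
Before the detailed examples, here is a compact sketch that draws six of these plot types side by side. The dataframe is synthetic, and its column names day and total_bill are assumptions made for the example.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: 25 bills for each of four days.
rng = np.random.default_rng(1)
days = ["Thu", "Fri", "Sat", "Sun"]
df = pd.DataFrame({
    "day": np.repeat(days, 25),
    "total_bill": rng.gamma(shape=4.0, scale=5.0, size=100),
})

fig, axes = plt.subplots(2, 3, figsize=(14, 8))
sns.boxplot(data=df, x="day", y="total_bill", ax=axes[0, 0])
sns.violinplot(data=df, x="day", y="total_bill", ax=axes[0, 1])
sns.stripplot(data=df, x="day", y="total_bill", ax=axes[0, 2])
sns.swarmplot(data=df, x="day", y="total_bill", ax=axes[1, 0])
sns.barplot(data=df, x="day", y="total_bill", ax=axes[1, 1])   # mean per day
sns.countplot(data=df, x="day", ax=axes[1, 2])                 # rows per day
fig.tight_layout()
```

catplot is the figure-level wrapper around the same plots: sns.catplot(data=df, x="day", y="total_bill", kind="box") creates its own figure and selects the plot type through kind.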

Nov 17, 2019

Seaborn- Matrix Plot

By Datasciencelovers in Data visualization Tag cluster map, Data visualization, heatmap, Matrix Plot, seaborn

Matrix plots allow you to plot data as color-encoded matrices and can also be used to indicate clusters within the data (later in the machine learning section we will learn how to formally cluster data).

So in this article we will deal with the following two plots:

Heatmaps: A heat map (or heatmap) is a graphical representation of data where values are depicted by color. Heat maps make it easy to visualize complex data and understand it at a glance. To use a heatmap, the data should be in matrix form, i.e. the index names and the column names must relate to each other in some way so that the value in each cell is meaningful.
Cluster maps: Cluster maps use hierarchical clustering. They cluster the data based on the similarity of the rows and columns.

Let’s begin by exploring seaborn’s heatmap and clustermap.
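
A correlation matrix is a convenient example of data in matrix form, because its index and its columns are the same set of variables. The sketch below uses synthetic data with hypothetical column names a to d; note that clustermap relies on SciPy for the hierarchical clustering, so SciPy needs to be installed.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data; column b is built to correlate with column a.
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
df["b"] = 0.8 * df["a"] + rng.normal(scale=0.3, size=50)

# Matrix form: rows and columns are both the variable names.
corr = df.corr()

# Heatmap: each cell is coloured by its value; annot prints the number too.
fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)

# Cluster map: same colours, but rows and columns are reordered by similarity.
grid = sns.clustermap(corr, cmap="coolwarm")
```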

Nov 17, 2019

Seaborn-Grids

By Datasciencelovers in Data visualization Tag Data visualization, Grids, pair grids, pair plots, seaborn

Grids are general types of plots that allow you to map plot types to rows and columns of a grid. This helps you create similar plots separated by features.

In this post we will discuss the following plots:

Pair Grid
Pair plots
Facet Grid
Joint Grid

Let’s see how to plot these graphs with the help of Python seaborn.

Nov 10, 2019

Banking Credit Card Spend Prediction and Identify Drivers for Spends

By Datasciencelovers in Machine learning projects Tag credit card spend prediction, linear regression, machine learning, regression, supervised learning

Business Problem:

A global bank would like to understand which factors drive credit card spend. The bank wants to use these insights to calculate credit limits. To solve the problem, the bank conducted a survey of 5000 customers and collected data.

The objective of this case study is to understand what’s driving the total spend (primary card + secondary card) and, given those factors, to predict the credit limit for new applicants.

Data Availability:

Data for the case are available in xlsx format.
The data have been provided for 5000 customers.
A detailed data dictionary has been provided for understanding the data.
The data is encoded in numerical format to reduce its size; however, some of the variables are categorical. You can find the details in the data dictionary.

Let’s develop a machine learning model for further analysis.
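
To make the idea concrete, here is a rough sketch of a linear regression on total spend. It does not use the survey file: the data is synthetic, and the column names income, age, num_cards and total_spend are hypothetical stand-ins rather than names from the data dictionary. Standardizing the features first makes the coefficient sizes comparable, which gives a simple first ranking of the drivers.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the survey data (all columns are hypothetical).
rng = np.random.default_rng(3)
n = 500
income = rng.normal(60.0, 15.0, n)
age = rng.integers(21, 70, n)
num_cards = rng.integers(1, 4, n)
primary = 5.0 * income + rng.normal(0.0, 20.0, n)
secondary = 1.0 * income + 20.0 * num_cards + rng.normal(0.0, 20.0, n)

df = pd.DataFrame({
    "income": income,
    "age": age,
    "num_cards": num_cards,
    "total_spend": primary + secondary,  # primary card + secondary card
})

features = ["income", "age", "num_cards"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["total_spend"], test_size=0.2, random_state=0
)

# Standardize so that coefficient magnitudes can be compared across features.
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

r2 = model.score(scaler.transform(X_test), y_test)
drivers = pd.Series(model.coef_, index=features).sort_values(key=np.abs, ascending=False)
print(f"R^2 on held-out data: {r2:.3f}")
print(drivers)
```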

Nov 10, 2019

Store Sales Prediction – Forecasting

By Datasciencelovers in Machine learning projects Tag linear regression, machine learning, store sales predection, supervised learning

Business Context:

The objective is predicting store sales using historical markdown data. One challenge of modelling retail data is the need to make decisions based on limited history. If Christmas comes but once a year, so does the chance to see how strategic decisions impacted the bottom line.

Business Problem:

The company has provided historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department-wide sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modelling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
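
The holiday weighting can be sketched as a weighted error measure. The text above only states the five-times weighting, so the choice of mean absolute error as the base metric is an assumption here, and the function name weighted_mae and the sample numbers are made up for illustration.

```python
import numpy as np


def weighted_mae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Mean absolute error in which holiday weeks count holiday_weight times."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))


# Made-up example: two ordinary weeks and one holiday week.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 250.0]
holiday = [False, False, True]

score = weighted_mae(actual, predicted, holiday)
print(score)  # (1*10 + 1*10 + 5*50) / (1 + 1 + 5) = 270 / 7
```

A miss of 50 in the holiday week dominates the score, which is why modelling the holiday weeks well matters so much.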

Data Availability:

stores.csv: This file contains anonymized information about the 45 stores, indicating the type and size of store.

train.csv: This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

Store – the store number
Dept – the department number
Date – the week
Weekly_Sales – sales for the given department in the given store
IsHoliday – whether the week is a special holiday week

test.csv: This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.

features.csv: This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

Store – the store number
Date – the week
Temperature – average temperature in the region
Fuel_Price – cost of fuel in the region
MarkDown1-5 – anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
CPI – the consumer price index
Unemployment – the unemployment rate
IsHoliday – whether the week is a special holiday week

Let’s develop a machine learning model for further analysis.
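
A natural first step is to join the three files into one modelling table. The sketch below builds tiny in-memory frames instead of reading the CSVs, so every value in it is made up. The field names follow the lists above, except the stores columns Type and Size, which are assumed names for the store type and size. Filling missing markdowns with 0 is also just one possible choice.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2010-02-05", "2010-02-12"])

# Tiny made-up stand-ins for train.csv, features.csv and stores.csv.
train = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [1, 1],
    "Date": dates,
    "Weekly_Sales": [20000.0, 40000.0],
    "IsHoliday": [False, True],
})
features = pd.DataFrame({
    "Store": [1, 1],
    "Date": dates,
    "Temperature": [40.0, 38.0],
    "Fuel_Price": [2.5, 2.6],
    "MarkDown1": [np.nan, np.nan],  # markdown data is missing for early dates
    "CPI": [211.0, 211.2],
    "Unemployment": [8.1, 8.1],
    "IsHoliday": [False, True],
})
stores = pd.DataFrame({"Store": [1], "Type": ["A"], "Size": [150000]})  # assumed column names

# Join weekly features by store and week, then store attributes by store.
merged = (
    train
    .merge(features, on=["Store", "Date", "IsHoliday"], how="left")
    .merge(stores, on="Store", how="left")
)

# Treat a missing markdown as "no markdown" (an assumption, not the only option).
markdown_cols = [c for c in merged.columns if c.startswith("MarkDown")]
merged[markdown_cols] = merged[markdown_cols].fillna(0.0)
```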

Nov 10, 2019

Credit Card Segmentation

By Datasciencelovers in Machine learning projects Tag clustering, credit card segmentation, machine learning, unsupervised learning

Data Available:

CC GENERAL.csv

Business Context:

A bank wants to develop a customer segmentation to define its marketing strategy. The sample dataset summarizes the usage behaviour of about 9000 active credit card holders during the last 6 months. The file is at a customer level with 18 behavioural variables.

Business Requirements:

Advanced data preparation: Build an enriched customer profile by deriving “intelligent” KPIs such as:

Monthly average purchase and cash advance amount
Purchases by type (one-off, instalments)
Average amount per purchase and cash advance transaction
Limit usage (balance to credit limit ratio)
Payments to minimum payments ratio, etc.

Advanced reporting: Use the derived KPIs to gain insight into the customer profiles.
Identification of the relationships/affinities between services.
Clustering: Apply a data reduction technique (factor analysis) for variable reduction and a clustering algorithm to reveal the behavioural segments of credit card holders.
Identify the characteristics of each cluster using detailed profiling.
Provide strategic insights and implementation strategies for the given set of cluster characteristics.
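
Several of the KPIs above can be derived with a few lines of pandas. The sketch below uses column names from the data dictionary on a made-up three-row frame. The KPI names, the use of TENURE as the number of months, and the helper safe_ratio are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Three made-up customers; column names follow the data dictionary.
df = pd.DataFrame({
    "PURCHASES": [1200.0, 0.0, 600.0],
    "ONEOFF_PURCHASES": [1200.0, 0.0, 200.0],
    "INSTALLMENTS_PURCHASES": [0.0, 0.0, 400.0],
    "CASH_ADVANCE": [0.0, 2400.0, 0.0],
    "BALANCE": [500.0, 3000.0, 100.0],
    "CREDIT_LIMIT": [2000.0, 4000.0, 1000.0],
    "PAYMENTS": [800.0, 300.0, 600.0],
    "MINIMUM_PAYMENTS": [200.0, 600.0, 0.0],
    "TENURE": [12, 12, 6],
})


def safe_ratio(numerator, denominator):
    """Divide two columns, returning NaN where the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)


kpi = pd.DataFrame(index=df.index)
kpi["MONTHLY_AVG_PURCHASE"] = safe_ratio(df["PURCHASES"], df["TENURE"])
kpi["MONTHLY_CASH_ADVANCE"] = safe_ratio(df["CASH_ADVANCE"], df["TENURE"])
kpi["LIMIT_USAGE"] = safe_ratio(df["BALANCE"], df["CREDIT_LIMIT"])
kpi["PAY_TO_MIN"] = safe_ratio(df["PAYMENTS"], df["MINIMUM_PAYMENTS"])

# Purchases by type: which kinds of purchase each customer makes.
one_off = df["ONEOFF_PURCHASES"] > 0
installment = df["INSTALLMENTS_PURCHASES"] > 0
kpi["PURCHASE_TYPE"] = np.select(
    [one_off & installment, one_off, installment],
    ["both", "one_off", "installment"],
    default="none",
)
print(kpi)
```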

Data Dictionary:

CUST_ID: Credit card holder ID
BALANCE: Monthly average balance (based on daily balance averages)
BALANCE_FREQUENCY: Ratio of last 12 months with balance
PURCHASES: Total purchase amount spent during last 12 months
ONEOFF_PURCHASES: Total amount of one-off purchases
INSTALLMENTS_PURCHASES: Total amount of installment purchases
CASH_ADVANCE: Total cash-advance amount
PURCHASES_FREQUENCY: Frequency of purchases (percent of months with at least one purchase)
ONEOFF_PURCHASES_FREQUENCY: Frequency of one-off purchases
PURCHASES_INSTALLMENTS_FREQUENCY: Frequency of installment purchases
CASH_ADVANCE_FREQUENCY: Cash-advance frequency
AVERAGE_PURCHASE_TRX: Average amount per purchase transaction
CASH_ADVANCE_TRX: Average amount per cash-advance transaction
PURCHASES_TRX: Average amount per purchase transaction
CREDIT_LIMIT: Credit limit
PAYMENTS: Total payments (due amount paid by the customer to decrease their statement balance) in the period
MINIMUM_PAYMENTS: Total minimum payments due in the period.
PRC_FULL_PAYMEN: Percentage of months with full payment of the due statement balance
TENURE: Number of months as a customer

Let’s develop a machine learning model for further analysis.
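
The clustering requirement, variable reduction with factor analysis followed by a clustering algorithm, might look roughly like the sketch below. It runs on synthetic data with three hypothetical KPI columns. K-means with two clusters and two factors are arbitrary choices for the example, not settings taken from the case study.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Synthetic customers drawn from two well-separated groups (hypothetical KPIs).
rng = np.random.default_rng(4)
low = rng.normal(loc=[500, 0.2, 1.0], scale=[100, 0.05, 0.2], size=(100, 3))
high = rng.normal(loc=[3000, 0.8, 4.0], scale=[300, 0.05, 0.5], size=(100, 3))
X = pd.DataFrame(
    np.vstack([low, high]),
    columns=["MONTHLY_AVG_PURCHASE", "LIMIT_USAGE", "PAY_TO_MIN"],
)

# 1. Put all variables on the same scale.
scaled = StandardScaler().fit_transform(X)

# 2. Reduce the variables to a smaller number of factors.
factors = FactorAnalysis(n_components=2, random_state=0).fit_transform(scaled)

# 3. Cluster customers in factor space.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
X["segment"] = kmeans.fit_predict(factors)

# 4. Profile each segment by its average KPI values.
profile = X.groupby("segment").mean()
print(profile)
```

On real data the number of factors and clusters would need to be chosen by inspection, for example by comparing segment profiles across several settings.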

Nov 10, 2019

Network Intrusion Detection

By Datasciencelovers in Machine learning projects Tag classification, logistic regression, machine learning, multi class classification, Network intrusion detection

In this case study we need to predict anomalies and attacks in the network.

Business Problem:

The task is to build a network intrusion detection system to detect anomalies and attacks in the network.

There are two problems.

Binomial classification: Activity is normal or attack.
Multinomial classification: Activity is normal or DOS or PROBE or R2L or U2R.
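
Both targets can be derived from one label column. The sketch below assumes the raw labels are attack names such as neptune or ipsweep. The mapping shown covers only a handful of names for illustration and is not the complete attack class table.

```python
import pandas as pd

# Partial, illustrative mapping from raw label to attack class.
attack_class = {
    "normal": "normal",
    "neptune": "DOS",
    "smurf": "DOS",
    "ipsweep": "PROBE",
    "portsweep": "PROBE",
    "guess_passwd": "R2L",
    "warezclient": "R2L",
    "buffer_overflow": "U2R",
    "rootkit": "U2R",
}

# Made-up sample of raw labels.
labels = pd.Series(["normal", "neptune", "ipsweep", "guess_passwd", "rootkit", "smurf"])

# Binomial target: 0 = normal, 1 = any attack.
binary = (labels != "normal").astype(int)

# Multinomial target: normal, DOS, PROBE, R2L or U2R.
multiclass = labels.map(attack_class)

print(pd.DataFrame({"label": labels, "binary": binary, "multiclass": multiclass}))
```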

Data Availability:

The data is the KDDCUP’99 data set, which is widely used as one of the few publicly available data sets for network-based anomaly detection systems.

For more about the data, you can visit http://www.unb.ca/cic/datasets/nsl.html

BASIC FEATURES OF EACH NETWORK CONNECTION VECTOR

Duration: Length of time duration of the connection
Protocol_type: Protocol used in the connection
Service: Destination network service used
Flag: Status of the connection – Normal or Error
Src_bytes: Number of data bytes transferred from source to destination in single connection
Dst_bytes: Number of data bytes transferred from destination to source in single connection
Land: if source and destination IP addresses and port numbers are equal then, this variable takes value 1 else 0
Wrong_fragment: Total number of wrong fragments in this connection
Urgent: Number of urgent packets in this connection. Urgent packets are packets with the urgent bit activated.
Hot: Number of “hot” indicators in the content, such as entering a system directory, creating programs and executing programs.
Num_failed_logins: Count of failed login attempts.
Logged_in: Login status: 1 if successfully logged in; 0 otherwise.
Num_compromised: Number of “compromised” conditions.
Root_shell: 1 if root shell is obtained; 0 otherwise.
Su_attempted: 1 if “su root” command attempted or used; 0 otherwise.
Num_root: Number of “root” accesses or number of operations performed as a root in the connection.
Num_file_creations: Number of file creation operations in the connection.
Num_shells: Number of shell prompts.
Num_access_files: Number of operations on access control files.
Num_outbound_cmds: Number of outbound commands in an ftp session.
Is_hot_login: 1 if the login belongs to the “hot” list, i.e., root or admin; else 0.
Is_guest_login: 1 if the login is a “guest” login; 0 otherwise.
Count: Number of connections to the same destination host as the current connection in the past two seconds
Srv_count: Number of connections to the same service (port number) as the current connection in the past two seconds.
Serror_rate: The percentage of connections with the Flag value S0, S1, S2 or S3, among the connections aggregated in Count.
Srv_serror_rate: The percentage of connections with the Flag value S0, S1, S2 or S3, among the connections aggregated in Srv_count.
Rerror_rate: The percentage of connections with the Flag value REJ, among the connections aggregated in Count.
Srv_rerror_rate: The percentage of connections with the Flag value REJ, among the connections aggregated in Srv_count.
Same_srv_rate: The percentage of connections that were to the same service, among the connections aggregated in Count.
Diff_srv_rate: The percentage of connections that were to different services, among the connections aggregated in Count.
Srv_diff_host_rate: The percentage of connections that were to different destination machines, among the connections aggregated in Srv_count.
Dst_host_count: Number of connections having the same destination host IP address.
Dst_host_srv_count: Number of connections having the same port number.
Dst_host_same_srv_rate: The percentage of connections that were to the same service, among the connections aggregated in Dst_host_count.
Dst_host_diff_srv_rate: The percentage of connections that were to different services, among the connections aggregated in Dst_host_count.
Dst_host_same_src_port_rate: The percentage of connections that were to the same source port, among the connections aggregated in Dst_host_srv_count.
Dst_host_srv_diff_host_rate: The percentage of connections that were to different destination machines, among the connections aggregated in Dst_host_srv_count.
Dst_host_serror_rate: The percentage of connections with the Flag value S0, S1, S2 or S3, among the connections aggregated in Dst_host_count.
Dst_host_srv_serror_rate: The percentage of connections with the Flag value S0, S1, S2 or S3, among the connections aggregated in Dst_host_srv_count.
Dst_host_rerror_rate: The percentage of connections with the Flag value REJ, among the connections aggregated in Dst_host_count.
Dst_host_srv_rerror_rate: The percentage of connections with the Flag value REJ, among the connections aggregated in Dst_host_srv_count.

Attack Class:

Let’s develop a machine learning model for further analysis.
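
As a starting point for the binomial problem, here is a rough logistic regression sketch. It does not read the real data set: the three feature columns borrow names from the list above, but their values are synthetic and generated so that attacks show high connection counts and error rates, purely to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic connections: 1 = attack, 0 = normal (values are made up).
rng = np.random.default_rng(5)
n = 400
is_attack = rng.integers(0, 2, n)
X = pd.DataFrame({
    "Count": rng.poisson(5 + 100 * is_attack),
    "Serror_rate": np.clip(rng.normal(0.05 + 0.8 * is_attack, 0.1), 0.0, 1.0),
    "Src_bytes": rng.exponential(500.0, n),
})

X_train, X_test, y_train, y_test = train_test_split(
    X, is_attack, test_size=0.25, random_state=0, stratify=is_attack
)

# Scale the features, then fit a logistic regression classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.3f}")
```

The same pipeline also accepts a target with more than two classes, which is how it would be used for the multinomial problem (normal, DOS, PROBE, R2L, U2R).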
