Why Python for Data Analysis?

Python is very easy to learn and implement. For many people, including myself, Python is an easy language to fall in love with. Since its first appearance in 1991, Python's popularity has kept growing. Among interpreted languages, Python is distinguished by its large and active scientific computing community. Adoption of Python for scientific computing in both industry applications and academic research has increased significantly since the early 2000s.

For data analysis, exploratory analysis, and data visualization, Python compares favorably with many other domain-specific open source and commercial programming languages and tools, such as R, MATLAB, SAS, and Stata. In recent years, Python’s improved library support (primarily pandas) has made it a strong alternative for data manipulation tasks. Combined with Python’s strength in general-purpose programming, it is an excellent choice as a single language for building data-centric applications.

In short, we should choose Python for data analysis for the following reasons.

  • It’s a very simple language to understand.
  • It’s open source.
  • It has strong libraries for data science.
  • Beyond its long-standing demand in web development projects, the use of Python keeps growing as AI/ML projects become more mainstream and popular with global businesses.

As the charts below show, Python is among the most sought-after languages in the industry.

[Charts: popularity over the years; trend over one year; IEEE Spectrum 2017 survey]

Python-Environment Setup

To create and run the code, we need an environment that has both general-purpose Python and the special packages required for data science.

In this tutorial we will use Python 3. Python 3 has been around since 2008, and Python 2 reached its end of life in January 2020 and is no longer supported. If you are new to Python, it is definitely worth learning Python 3 rather than the old Python 2.

Anaconda Installation:

Anaconda is a package manager, an environment manager, a Python/R data science distribution, and a collection of more than 1,500 open source packages. Anaconda is free and easy to install, and it offers free community support too.

To download Anaconda, go to https://www.anaconda.com/distribution/

More than 250 packages are installed automatically with Anaconda. You can also install other packages using the pip install command.

If you need an installation guide, see the Anaconda website: https://docs.anaconda.com/anaconda/install/

Open Navigator on Windows:

From the Start menu, click the Anaconda Navigator desktop app.

[Screenshot: Anaconda Navigator]

Run Python in a Jupyter Notebook:

  • On Navigator’s Home tab, in the Applications panel on the right, scroll to the Jupyter Notebook tile and click the Install button to install Jupyter Notebook.
  • Launch Jupyter Notebook by clicking Jupyter Notebook’s Launch button. This will launch a new browser window (or a new tab).
  • At the top right, there is a drop-down menu labeled “New”. Create a new Notebook with the Python version you installed.
  • Rename your Notebook. Either click on the current name and edit it, or find Rename under File in the top menu bar. You can name it whatever you’d like, but for this example we’ll use MyFirstAnacondaNotebook.
  • In the first line of the Notebook, type or copy/paste print("Hello Anaconda")
  • Save your Notebook by either clicking the Save and Checkpoint icon or selecting File – Save and Checkpoint in the top menu.
  • Select the cell and press Ctrl+Enter or Shift+Enter to run it.
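
As a minimal sketch of what that first cell might contain, the snippet below prints the greeting from the steps above and also reports the major Python version, so you can confirm the notebook is running Python 3. The variable name major_version is only an illustration and is not part of the Anaconda instructions.

```python
import sys

# The greeting from the steps above (note the plain, straight quotes).
print("Hello Anaconda")

# Report which major Python version the notebook kernel is running.
major_version = sys.version_info.major
print("Running Python", major_version)
```

If the second line of output does not show 3, the notebook was probably created with a different Python installation than the one this tutorial assumes.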

NumPy–Introduction

NumPy is the most basic and one of the most powerful packages for working with data in Python. It stands for ‘Numerical Python’. It is a library consisting of multidimensional array objects and a collection of routines for processing those arrays. It provides tools and techniques for solving, on a computer, mathematical models of problems in science and engineering.

If you are going to work on data analysis or machine learning projects, you should have a solid understanding of NumPy. Other packages for data analysis (like pandas) are built on top of NumPy, and the scikit-learn package, which is used to build machine learning applications, works heavily with NumPy as well.

What is an Array?

An array is essentially a pointer to a block of memory, together with the information needed to interpret it. It is a combination of a memory address, a data type, a shape, and strides.

  • The data pointer indicates the memory address of the first byte in the array.
  • The data type, or dtype, describes the kind of elements contained within the array.
  • The shape gives the size of the array along each dimension.
  • The strides are the number of bytes to skip in memory to reach the next element along each dimension. If your strides are (10,1), you move 1 byte to reach the next column and 10 bytes to reach the next row.

In short, an array contains information about the raw data, how to locate an element, and how to interpret an element.
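
The four pieces described above can be inspected directly on a NumPy array. The sketch below is only an illustration: it assumes a 2 x 10 array of 1-byte integers (int8), chosen so that the strides come out as (10, 1) like the example in the list.

```python
import numpy as np

# 20 one-byte integers arranged as 2 rows x 10 columns (illustrative data).
a = np.arange(20, dtype=np.int8).reshape(2, 10)

print(a.dtype)    # kind of element stored: int8
print(a.shape)    # size along each dimension: (2, 10)
print(a.strides)  # bytes to skip per dimension: (10, 1)

# Memory address of the first byte of the array's data.
address = a.__array_interface__["data"][0]
print(address)

# The transpose reuses the same raw data; only shape and strides change.
t = a.T
print(t.shape, t.strides)
print(np.shares_memory(a, t))
```

The transpose shows why an array is more than its raw bytes: the same memory is read in a different order simply by swapping the shape and strides.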

Operations using NumPy:

Using NumPy, a developer can perform the following operations:

  • Mathematical and logical operations on arrays.
  • Operations related to linear algebra. NumPy has built-in functions for linear algebra and random number generation.
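
The operations listed above might look like the following. This is a small sketch with made-up numbers; the arrays, the two-equation system, and the seed value are all assumptions chosen for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

# Mathematical operations work element by element.
total = a + b
scaled = a * 2

# Logical operations return boolean arrays.
mask = a > 2
both = np.logical_and(a > 1, b < 40)

# Linear algebra: solve the system 2x + y = 5 and x + 3y = 10.
coefficients = np.array([[2.0, 1.0], [1.0, 3.0]])
constants = np.array([5.0, 10.0])
solution = np.linalg.solve(coefficients, constants)

# Random number generation: three floats in the interval [0, 1).
rng = np.random.default_rng(seed=0)
samples = rng.random(3)

print(total, scaled, mask, both, solution, samples)
```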

Installation Instructions:

It is highly recommended that you install Python using the Anaconda distribution, so that all underlying dependencies (such as linear algebra libraries) stay in sync when you use conda install. If you have Anaconda, install NumPy by going to your terminal or command prompt and typing:

conda install numpy
or
pip install numpy

If you do not have Anaconda and cannot install it, please refer to the following URL: http://www.datasciencelovers.com/python-for-data-science/python-environment-setup/
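
After installing, one way to check that a package is visible to your Python environment is sketched below. The helper is_installed is a hypothetical name introduced here, not part of NumPy or Anaconda; it relies only on the standard library's importlib.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    # Hypothetical helper: find_spec returns None when nothing is found.
    return importlib.util.find_spec(package_name) is not None


# Report whether NumPy can be found in the current environment.
print("numpy", "is installed" if is_installed("numpy") else "is missing")
```

If the output says the package is missing, rerun the install command in the same environment that your notebook uses.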

Pandas-Introduction

What is pandas?

Pandas is an open source Python library that lets you perform data manipulation, analysis, and cleaning. It is built on top of NumPy. It is one of the most important libraries for data science.

According to Wikipedia, the name pandas “is derived from the term ‘panel data’, an econometrics term for data sets that include observations over multiple time periods for the same individuals.”

Why Pandas?

The following are the advantages of pandas for data scientists.

  • It handles missing data easily.
  • It provides an efficient way to slice and wrangle data.
  • It is helpful for merging, concatenating, or reshaping data.
  • It includes powerful time series tools.
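
Each of the advantages above can be seen in a few lines. The sketch below uses a small invented sales table; the column names, store labels, and city names are placeholders for illustration, not data from any real source.

```python
import numpy as np
import pandas as pd

# Invented example data: four daily records with one missing value.
sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=4, freq="D"),
    "store": ["A", "B", "A", "B"],
    "units": [10.0, np.nan, 7.0, 5.0],
})

# Missing data: count the gaps, then fill them with 0.
n_missing = int(sales["units"].isna().sum())
filled = sales.fillna({"units": 0.0})

# Slicing: keep only the rows for store A.
store_a = filled[filled["store"] == "A"]

# Merging: attach a city to each row from a second table.
stores = pd.DataFrame({"store": ["A", "B"], "city": ["City1", "City2"]})
merged = filled.merge(stores, on="store", how="left")

# Time series: total units in 2-day windows.
two_day = filled.set_index("date")["units"].resample("2D").sum()

print(n_missing)
print(store_a)
print(merged)
print(two_day)
```

Filling missing values with 0 is just one option; depending on the data, dropping the rows or filling with an average may be more appropriate.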

How to install Pandas?

To install pandas, go to the command line/terminal and type “pip install pandas”, or, if you have Anaconda installed on your system, type “conda install pandas”. Once the installation is complete, go to your IDE (Jupyter) and import it by typing “import pandas as pd”.

In the next chapter we will learn about pandas Series.
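
A quick way to confirm the import works is sketched below: it prints the installed version and builds a tiny table. The column names and values are placeholders chosen only for this check.

```python
import pandas as pd

# Show which pandas version was installed.
print(pd.__version__)

# Build a tiny table to confirm the library is usable (placeholder data).
df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})
print(df.shape)
```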