Pandas–Missing Data
In real-world data analysis, missing data is a common problem. In machine learning and data mining, accuracy suffers when missing values degrade the quality of the data.
Pandas represents missing data as NA (Not Available) or NaN (Not a Number) values.
Why is data missing?
Suppose you survey a group of people and ask for their name, address, phone number, and income. Some respondents will not want to share their address or income, so those fields end up empty in the dataset.
Finding missing values
To check for missing values in a pandas DataFrame, use the isnull() and notnull() functions. isnull() returns True where a value is missing, and notnull() returns True where a value is present. Both functions also work on a pandas Series.
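As a minimal sketch of these checks, assuming a small made-up survey DataFrame (the column names and values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; None and np.nan both count as missing.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen"],
    "address": ["12 Hill Rd", None, "7 Lake St"],
    "income": [52000.0, np.nan, np.nan],
})

# Boolean mask: True where a value is missing.
missing_mask = df.isnull()

# notnull() is the element-wise inverse of isnull().
present_mask = df.notnull()

# Summing the mask counts the missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# The same methods work on a single Series.
income_missing = df["income"].isnull()
print(income_missing.tolist())
```

Summing the boolean mask is a quick way to see which columns need attention before deciding how to treat them.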
Cleaning / Filling Missing values:
Missing values can be treated in the following ways.
1. Filling missing values using fillna(), replace():
To fill null values in a dataset, use fillna() or replace(). For example, call fillna() on a DataFrame column and pass the column's mean() or median() as the fill value.
#Impute with mean on column_1
df['column_1'] = df['column_1'].fillna(df['column_1'].mean())
#Impute with median on column_1
df['column_1'] = df['column_1'].fillna(df['column_1'].median())
Besides the mean and median, imputing missing data with 0 can also be a good idea in some cases, for example when a missing entry really means "none".
#Impute with value 0 on column_1
df['column_1'] = df['column_1'].fillna(0)
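The examples above use fillna(); replace() is handy when missing entries were recorded as a placeholder value rather than as NaN. The following is a minimal sketch, assuming a hypothetical column in which -999 marks a missing entry:

```python
import numpy as np
import pandas as pd

# Hypothetical column where -999 was used as a "missing" placeholder.
df = pd.DataFrame({"column_1": [10.0, -999.0, 30.0, np.nan]})

# Turn the placeholder into a real NaN so pandas treats it as missing.
df["column_1"] = df["column_1"].replace(-999.0, np.nan)

# Fill all NaN values with the column mean (mean() skips NaN by default).
df["column_1"] = df["column_1"].fillna(df["column_1"].mean())
print(df["column_1"].tolist())
```

Converting placeholders first matters here: if -999 were left in place, it would be included in the mean and distort the imputed value.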
2. Dropping missing values using dropna():
Dropping is often not the best way to handle missing values. If your data has a large number of missing values, avoid this method, because you might lose important information.
To drop null values from a DataFrame, use the dropna() function, which drops the rows or columns that contain null values, depending on its arguments.
#Drop rows with null values
df = df.dropna(axis=0)
#Drop rows with null values in column_1
df = df.dropna(subset=['column_1'])
The axis parameter determines the dimension that the function will act on.
axis=0 removes all rows that contain null values.
axis=1 removes all columns that contain null values instead.
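To make the effect of the axis parameter concrete, here is a small sketch on a made-up DataFrame (the column names and values are assumptions for illustration). It also shows the subset argument, which restricts the null check to particular columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data: column_1 has a gap in row 1, column_2 has a gap in row 2.
df = pd.DataFrame({
    "column_1": [1.0, np.nan, 3.0],
    "column_2": [4.0, 5.0, np.nan],
    "column_3": [7.0, 8.0, 9.0],
})

# axis=0: drop every row that has at least one null value.
rows_dropped = df.dropna(axis=0)

# axis=1: drop every column that has at least one null value.
cols_dropped = df.dropna(axis=1)

# subset: only look at column_1 when deciding which rows to drop.
subset_dropped = df.dropna(subset=["column_1"])

print(rows_dropped.index.tolist())
print(list(cols_dropped.columns))
print(subset_dropped.index.tolist())
```

Note how much data each call discards: with axis=0 only one row survives, while subset keeps every row whose column_1 value is present.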
Let’s understand the concept with Python.