
Data Cleaning Steps with Python and Pandas

Often we need to clean data before analyzing it, and Python with Pandas is a convenient toolkit for the job.

This tutorial explains the basic steps for data cleaning by example:

  • Basic exploratory data analysis
  • Detect and remove missing data
  • Drop unnecessary columns and rows
  • Detect outliers
  • Handle inconsistent data
  • Remove irrelevant features

What is Data Cleaning? What is dirty Data?

First let’s see what dirty data is:

Dirty data is inaccurate, incomplete or inconsistent data.

The common features of dirty data are:

  • spelling or punctuation errors
  • incorrect data associated with a field
  • incomplete data
  • outdated data
  • duplicated records

The process of fixing all issues above is known as data cleaning or data cleansing.

Usually the data cleaning process has several steps:

  • normalization (optional)
  • detect bad records
  • correct problematic values
  • remove irrelevant or inaccurate data
  • generate report (optional)

At the end of the process data should be:

  • complete
  • up to date
  • accurate
  • correct
  • consistent
  • relevant
  • normalized

What is the difference between tidy data and clean data, between data tidying and data cleaning? Data cleaning is about data quality, while data tidying is about data structure.

Setup

In this post we will use data from Kaggle: the 2019 Kaggle Machine Learning & Data Science Survey.

To read the data you need to use the following code:

import kaggle

link = 'eswarankrishnasamy/2019-kaggle-machine-learning-data-science-survey'
kaggle.api.authenticate()
kaggle.api.dataset_download_file(link, file_name='multiple_choice_responses.csv', path='data/')

The downloaded data can be read by:

import pandas as pd

df = pd.read_csv('data/multiple_choice_responses.csv.zip', low_memory=False)
Time from Start to Finish (seconds) | Q1 | Q2 | Q2_OTHER_TEXT | Q3
Duration (in seconds) | What is your age (# years)? | What is your gender? — Selected Choice | What is your gender? — Prefer to self-describe — Text | In which country do you currently reside?
510 | 22-24 | Male | -1 | France
423 | 40-44 | Male | -1 | India
83 | 55-59 | Female | -1 | Germany
391 | 40-44 | Male | -1 | Australia

Step 1: Exploratory data analysis in Python and Pandas

To start we can do basic exploratory data analysis in Pandas. This will show us more about data:

The first method is head(), which returns the first 5 rows of the dataset.

To see the first 5 rows and 5 columns we can do: df.iloc[0:5,0:5]

The result is truncated for the first 5 columns:

  | Time from Start to Finish (seconds) | Q1 | Q2 | Q2_OTHER_TEXT | Q3
0 | Duration (in seconds) | What is your age (# years)? | What is your gender? — Selected Choice | What is your gender? — Prefer to self-describe — Text | In which country do you currently reside?
1 | 510 | 22-24 | Male | -1 | France
2 | 423 | 40-44 | Male | -1 | India
3 | 83 | 55-59 | Female | -1 | Germany
4 | 391 | 40-44 | Male | -1 | Australia

Next we can see the number of rows and columns with df.shape.

The result is a tuple: (19718, 246), i.e. 19718 rows and 246 columns.

We can get similar information from df.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19718 entries, 0 to 19717
Columns: 246 entries, Time from Start to Finish (seconds) to Q34_OTHER_TEXT
dtypes: object(246)
memory usage: 37.0+ MB

Finally we can get more detailed information about the data values with the method describe(). This method generates descriptive statistics (summarizing the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values).

       | Time from Start to Finish (seconds) | Q1    | Q2    | Q2_OTHER_TEXT | Q3
count  | 19718                               | 19718 | 19718 | 19718         | 19718
unique | 4169                                | 12    | 5     | 46            | 60
top    | 450                                 | 25-29 | Male  | -1            | India
freq   | 42                                  | 4458  | 16138 | 19668         | 4786

Step 2: First rows as header with read_csv in Pandas

So far we saw that the first data row actually belongs to the header. We need to change how we read the data, passing header=[0,1] to build a two-level header:

df = pd.read_csv('data/multiple_choice_responses.csv.zip', low_memory=False, header=[0,1]) 

In order to simplify reading the data we can drop one level from the resulting multi-index, keeping only the short question codes.
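One way to do that, sketched on a tiny frame with the same two-level header shape (the values here are made up, not the real survey data), is:

```python
import pandas as pd

# Miniature stand-in for the survey frame: same two-level header shape
df = pd.DataFrame(
    [[510, "22-24"], [423, "40-44"]],
    columns=pd.MultiIndex.from_tuples([
        ("Time from Start to Finish (seconds)", "Duration (in seconds)"),
        ("Q1", "What is your age (# years)?"),
    ]),
)

# Drop the second level (the long question texts), keeping the short codes
df.columns = df.columns.droplevel(1)
print(df.columns.tolist())
```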

Step 3: Data tidying in Pandas

Next we can do data tidying, because tidy data helps Pandas’s vectorized operations.

For example, to read column ‘Q1’ we need to use the multi-index:

df[('Q1', 'What is your age (# years)?')]

0    22-24
1    40-44
2    55-59
3    40-44
4    22-24

Can we split that into two columns? It looks like all values are two numbers separated by a hyphen ‘-’. It is best to confirm that observation with:

df[('Q1', 'What is your age (# years)?')].value_counts() 

The last rows show us one value, 70+, which needs special attention:

45-49    949
50-54    692
55-59    422
60-69    338
70+      100
dtype: int64

Performing the split on a row containing 70+ will result in:

0     70+
1    None
Name: 182, dtype: object
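The split itself can be sketched on a small made-up sample:

```python
import pandas as pd

# Hypothetical sample of the age answers, including the special '70+' value
ages = pd.Series(["22-24", "40-44", "70+"])

# expand=True returns a DataFrame with one column per part;
# '70+' contains no '-', so its second part is missing (None)
parts = ages.str.split("-", expand=True)
print(parts)
```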

Step 4: Correcting and replacing data in Pandas

Next we can see how to correct the data above. We can correct the 70+ cases in two ways:


4.1. Replace values in column — Pandas

To replace the values in the column we can use the method .str.replace('70+', '70-120', regex=False) as follows:

df['Q1'].str.replace('70+', '70-120', regex=False) 

4.2. Fill NaN with string or 0 — Pandas

The other option is to fill the missing values after the split by:

Suppose that after the split we created a new column ‘max_age’; the missing values in its 70+ rows can then be filled with a string or 0.
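A sketch under that assumption (the column names and the cap value '120' are made up for illustration):

```python
import pandas as pd

ages = pd.Series(["22-24", "40-44", "70+"])
parts = ages.str.split("-", expand=True)
parts.columns = ["min_age", "max_age"]  # hypothetical column names

# Fill the missing upper bound of the '70+' group; '120' is an arbitrary cap
parts["max_age"] = parts["max_age"].fillna("120")
print(parts)
```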

Step 5: Detect NaN values in column Pandas

Now let’s see how we can detect NaN values. This will help us drop columns with NaN values.

5.1 Columns which contain only NaN values

To find columns which have only NaN values we can use two methods.

The first, df.isna().all(), gives a new Series with True or False per column name: True means the column contains only NaN values.

The second lists the names of such columns, by filtering the column index with that boolean Series.

In this dataset there is no column which contains only NaN values.
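Both checks can be sketched on a small made-up frame:

```python
import pandas as pd
import numpy as np

# Small made-up frame: column 'b' is entirely NaN, 'a' is not
sample = pd.DataFrame({"a": [1, 2], "b": [np.nan, np.nan]})

# True per column only when every value is NaN
only_nan = sample.isna().all()
print(only_nan)

# Names of the all-NaN columns
print(sample.columns[only_nan].tolist())
```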

5.2 Detect columns with NaN values

To detect columns which have at least one NaN value we can use df.isna().any():

Time from Start to Finish (seconds)    False
Q1                                     False
Q2                                     False
Q2_OTHER_TEXT                          False
Q3                                     False
                                       ...
Q34_Part_9                              True
Q34_Part_10                             True
Q34_Part_11                             True
Q34_Part_12                             True
Q34_OTHER_TEXT                         False
Length: 246, dtype: bool

So columns like ‘Q34_Part_9’ have NaN values. Columns like ‘Q1’ don’t have NaN values.
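The check can be sketched on a small made-up frame:

```python
import pandas as pd
import numpy as np

# Small made-up frame: 'b' has a NaN value, 'a' has none
sample = pd.DataFrame({"a": [1, 2], "b": [np.nan, 3]})

# True per column when at least one value is NaN
has_nan = sample.isna().any()
print(has_nan)
```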

Step 6: Drop columns in Pandas

Let’s say that we would like to drop columns based on name or NaN values. We can do that in several ways:

6.1 Drop one column by name

The parameters needed to drop a column are axis=1 and inplace=True; inplace=True means the operation modifies the DataFrame itself instead of returning a copy.

df.drop('Q1', axis=1, inplace=True) 

6.2 Drop multiple columns by name

We can list several columns to be removed:

df.drop(['Q1', 'Q2'], axis=1, inplace=True) 

6.3 Drop columns with NaN values

Finally we can drop columns which have NaN values, using df.dropna(axis=1).

We can use parameters like:

  • how — drop when ‘all’ or ‘any’ of the values are NaN
  • subset — labels along the other axis to consider
  • thresh — keep only columns with at least that many non-NaN values
  • inplace — modify the DataFrame in place
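The dropna variants above can be sketched on a made-up frame:

```python
import pandas as pd
import numpy as np

sample = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [np.nan, np.nan, np.nan],  # entirely NaN
    "c": [1, np.nan, 3],            # partially NaN
})

# Drop columns where every value is NaN
all_nan_dropped = sample.dropna(axis=1, how="all")

# Keep only columns with at least 2 non-NaN values (thresh counts non-NaN)
thresh_kept = sample.dropna(axis=1, thresh=2)

print(all_nan_dropped.columns.tolist())
print(thresh_kept.columns.tolist())
```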

Step 7: Detect and drop duplicate rows in Pandas

To detect duplicate rows in a Pandas DataFrame we can use the method duplicated().
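A minimal sketch on a made-up frame:

```python
import pandas as pd

# Made-up frame where the third row repeats the first
sample = pd.DataFrame({
    "Q1": ["22-24", "40-44", "22-24"],
    "Q2": ["Male", "Female", "Male"],
})

# duplicated() marks every repeat of an earlier row as True
print(sample.duplicated())

# Number of fully duplicated rows
print(sample.duplicated().sum())
```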

On the survey data, df.duplicated() marks 4 rows as duplicates.

We can tune the detection with the parameter keep; keep=False marks all copies of a duplicated row, not only the repeats.

For example, to get the indexes of all detected duplications:

df[df.duplicated(keep=False)].index 

Int64Index([11228, 12344, 16413, 16547, 16653, 18705, 19258, 19705], dtype='int64')

Since we have 246 columns (answers) it’s pretty suspicious that there are full duplications.


We can use the method df.drop_duplicates(subset=['Q1', 'Q2']) in order to drop rows that are duplicated in the given subset of columns:

df.drop_duplicates(subset=['Q1', 'Q2']) 

Step 8: Detect outliers in Pandas

We can detect outliers in Pandas in many ways. Here we will cover basic detection of numeric data:

Suppose we work with column: ‘Time from Start to Finish (seconds)’

We can see the min, max and the percentiles by:

df['Time from Start to Finish (seconds)'].describe() 
count     19717.000000
mean      14341.281027
std       74166.106601
min          23.000000
25%         340.000000
50%         540.000000
75%         930.000000
max      843612.000000
Name: (Time from Start to Finish (seconds), Duration (in seconds)), dtype: float64

So survey completion times range from 23 up to 843612 seconds. We can probably exclude some of the extremes.

Another way to detect outliers is visually, by plotting the data:

[Figure: distribution plot of ‘Time from Start to Finish (seconds)’ used to spot outliers]

From such a plot we can decide which threshold makes sense for us.
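Instead of a plot, a simple quantile cut can serve as a first numeric sketch (the 99th percentile is an arbitrary choice here, and the times are made up):

```python
import pandas as pd

# Hypothetical completion times in seconds, with one extreme value
times = pd.Series([340, 540, 930, 450, 843612])

# Treat everything above the 99th percentile as an outlier
threshold = times.quantile(0.99)
clean = times[times <= threshold]
print(clean.tolist())
```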

Step 9: Detect errors, typos and misspelling in Pandas

Finally let’s check how we can detect typos and misspelled words in a Pandas DataFrame. This will show how we can work with inconsistent or incomplete data.

For this purpose we are going to read the file ‘other_text_responses.csv’ into df_other, because it contains free-text input.

Let’s read the column at index 3 of this DataFrame:

df_other[df_other.columns[3]].value_counts().head(10) 
Excel              865
Microsoft Excel    392
excel              263
MS Excel            67
Google Sheets       61
Google sheets       44
Microsoft excel     38
Excel               33
microsoft excel     27
EXCEL               25

We can see different variations of the same tool — Excel.

In order to detect similar values we will use Python library difflib :

import difflib

difflib.get_close_matches('excl', ['Excel', 'Microsoft Excel ', 'MS Excel', 'excel'], n=1, cutoff=0.7)

The result of this will be ['excel'], the closest match above the cutoff.

So we can use Python in order to detect and fix misspelled words.

Code like the one below can help us create new column with corrected values:

import difflib

correct_values = {}
# value_counts(ascending=True) puts the most frequent spellings last,
# so they overwrite earlier mappings; each variant then maps
# to the most common similar spelling
words = df_other["Q14_Part_3_TEXT"].value_counts(ascending=True).index
for keyword in words:
    similar = difflib.get_close_matches(keyword, words, n=20, cutoff=0.6)
    for x in similar:
        correct_values[x] = keyword

df_other["corr"] = df_other["Q14_Part_3_TEXT"].map(correct_values)

Conclusion

In this article, we learned what clean data is and how to do data cleaning in Pandas and Python.

Among the topics we discussed were NaN values, duplicates, dropping columns and rows, and outlier detection.

We saw all the steps of the data cleaning process with examples. We covered important topics like tidy data and data quality.

