Handling NaN in Pandas

Dylan | Sep 16, 2019

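The Pandas library uses numpy.nan to represent missing values in DataFrames and Series. NaN stands for “Not a Number” and, interestingly enough, is itself a floating-point value: numpy.nan is a Python float, and a numeric column that contains it is stored with the numpy.float64 data type by default.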
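
As a quick check of that claim, here is a minimal sketch (the variable name s is just an illustration) showing that NaN is a float and that a Series containing it ends up as float64:

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary Python float
print(type(np.nan))

# NaN never compares equal to anything, not even itself,
# which is why we need isna() instead of == to find it
print(np.nan == np.nan)

# Integers mixed with NaN are upcast to float64 by default
s = pd.Series([1, 2, np.nan])
print(s.dtype)
```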

Data in the wild is often messy and incomplete and NaN values are an important tool for reflecting missing data. Depending on the data mining task at hand, there are several statistical approaches to handling missing data; however, this topic could consume several new posts.

For now, let’s focus on the basic coding tools and techniques for manipulating Pandas DataFrames.

Let’s begin with the head of the following DataFrame.

[Image: Example DataFrame head]

Our example DataFrame has four columns and 100 rows. Let’s pretend the values in Column B represent sensor readings, and that we happen to know readings of zero are impossible and must therefore be the result of a sensor error.

Feeding our machine learning model erroneous information could hurt its ability to make accurate predictions. Our model has no idea that zero isn’t a valid feature value, and if we use an unsupervised method such as clustering, it may treat this erroneous value as the basis for a cluster.

It’s very important that we eliminate this possibility in the data cleaning and preprocessing phase before training our model.

Replacing Specific Values with NaN

The easiest way to swap known values (such as zero) for np.nan is the pandas.DataFrame.replace method. If we know that all zero values in Column B are errors, converting them to NaN values is simple.

[Image: Column B 0 to NaN, code and resulting DataFrame head]
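
As a minimal sketch of this step, assuming a small made-up DataFrame (the column names and values are purely illustrative, not the post’s actual data), the replacement could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the example data
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [0.0, 5.0, 0.0, 7.0],   # zeros here are "sensor errors"
    "C": [0.0, -1.0, 2.0, 3.0],  # zeros here are valid readings
})

# Replace zeros with NaN in column B only; other columns are untouched
df["B"] = df["B"].replace(0, np.nan)
print(df)
```

Restricting the call to df["B"] matters here: calling replace on the whole DataFrame would also wipe out the valid zeros in the other columns.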

But how would we accomplish this if the errors span a whole range of values? Let’s now pretend that the values in Column C correspond to another sensor that logs another feature. For this sensor, the values can be 0 but they cannot be negative. Unfortunately, once in a while our sensor malfunctions and returns negative readings.

In this case, the set of possible erroneous readings is infinite: every negative number. It wouldn’t make sense to convert each value to NaN by hand. Thankfully, we don’t have to!

By using pandas.DataFrame.loc[] with a Boolean mask, we can select every value in a range at once and convert it to NaN.

[Image: Convert range of values to NaN, code and resulting DataFrame head]
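
Here is a minimal sketch of the Boolean-mask approach, again on a made-up DataFrame whose names and values are only assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the example data
df = pd.DataFrame({
    "B": [1.0, 2.0, 3.0, 4.0],
    "C": [0.0, -1.5, 2.0, -0.2],  # negative readings are "sensor errors"
})

# Rows where C is negative are selected by the mask;
# only column C of those rows is overwritten with NaN
df.loc[df["C"] < 0, "C"] = np.nan
print(df)
```

The same pattern works for any condition that yields a Boolean Series, for example an upper bound or a combination of conditions joined with & and |.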

Filling NaN Values with the Feature Mean

Replacing all NaN values with the feature’s mean value is a popular way of handling missing numeric data. This can be achieved easily on a pandas.DataFrame with a single line of code.

[Image: Convert NaN to the column’s mean value, code and resulting DataFrame head]
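
One way to write that single line, sketched on a made-up DataFrame (names and values are assumptions), is to pass the column means to fillna:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the example data
df = pd.DataFrame({
    "B": [1.0, np.nan, 3.0],
    "C": [np.nan, 4.0, 6.0],
})

# df.mean() skips NaN by default and returns one mean per column;
# fillna matches those means to the columns by name
filled = df.fillna(df.mean(numeric_only=True))
print(filled)
```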

However, there’s one thing we must bear in mind: the previous line replaces every NaN value in the DataFrame with the mean of its own column. If we only want to target the NaN values in Column B, we can restrict the fill to that column with another one-liner. As a result, the NaN value in Column C remains untouched.

[Image: Convert NaN in a specific column to the column’s mean value, code and resulting DataFrame head]
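
A minimal sketch of the column-specific version, assuming the same kind of made-up data as before:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the example data
df = pd.DataFrame({
    "B": [1.0, np.nan, 3.0],
    "C": [np.nan, 4.0, 6.0],
})

# Fill only column B with its own mean and assign the result back;
# column C is not part of the operation, so its NaN stays in place
df["B"] = df["B"].fillna(df["B"].mean())
print(df)
```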

By specifying the inplace parameter, we choose whether to fill the NaN values in our original DataFrame (inplace=True) or to leave the original untouched and get back a new DataFrame with the NaN values replaced by the feature’s mean value (inplace=False). By default, inplace is False.
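
To make the difference concrete, here is a small sketch (the DataFrame and variable names are made up for illustration) that runs both variants on the same data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the example data
original = pd.DataFrame({"B": [1.0, np.nan, 3.0]})

# Default (inplace=False): a new DataFrame is returned,
# and the original still contains its NaN
filled_copy = original.fillna(original.mean())
nan_count_after_copy = int(original["B"].isna().sum())

# inplace=True: the original DataFrame itself is modified
original.fillna(original.mean(), inplace=True)
nan_count_after_inplace = int(original["B"].isna().sum())

print(nan_count_after_copy, nan_count_after_inplace)
```

Assigning the result back (df = df.fillna(...)) has the same effect as inplace=True and is often easier to follow when several cleaning steps are chained.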

Dropping Rows with NaN Values

In some circumstances, it’s best to remove all instances (rows) with NaN values entirely from our DataFrame. Again, Pandas makes this easy for us!

[Image: Remove rows with NaN values, code]

By changing the axis argument from 0 to 1, Pandas instead drops every column that contains even a single NaN value.
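
Both variants can be sketched with pandas.DataFrame.dropna on a made-up DataFrame (names and values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the example data
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0],
    "B": [4.0, np.nan, 6.0],
    "C": [7.0, 8.0, 9.0],
})

# axis=0 (the default): drop every row that contains a NaN
rows_dropped = df.dropna(axis=0)

# axis=1: drop every column that contains a NaN
cols_dropped = df.dropna(axis=1)

print(rows_dropped)
print(cols_dropped)
```

Like fillna, dropna returns a new DataFrame by default, so df itself is unchanged here.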

Querying a Column for NaN Values

Sometimes it is useful to query an entire column for NaN values. By using pandas.isna() (also available as the isna() method on a Series or DataFrame), we instantly get a Boolean pandas.Series with one entry per row: True wherever the value in Column B is NaN and False wherever it is not.

[Image: Generate Boolean list of each index if it is NaN, code and output]
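
A minimal sketch of such a query, on a made-up column B (the values are assumptions), along with two common uses of the resulting mask:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the example data
df = pd.DataFrame({"B": [np.nan, 5.0, np.nan, 7.0]})

# Boolean Series: True where B is NaN
mask = df["B"].isna()
print(mask)

# True counts as 1, so summing the mask counts the NaN values
nan_count = int(mask.sum())
print(nan_count)

# The mask can also be used to select the affected rows
print(df[mask])
```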

If you need the inverse of this Boolean Series, you can use pandas.DataFrame.notna() or the equivalent Series method.

[Image: Generate Boolean list of each index if it is not NaN, code and output]
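
Sketched on the same kind of made-up column (values are assumptions), notna() gives the element-wise opposite of isna():

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the example data
df = pd.DataFrame({"B": [np.nan, 5.0, np.nan, 7.0]})

# Boolean Series: True where B holds a real value
valid = df["B"].notna()
print(valid)

# Keep only the rows where B is not NaN
clean = df[valid]
print(clean)
```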

Hopefully, this quick introduction to manipulating NaN values in Pandas objects was helpful! As always, if you have any questions, we would all benefit from being able to help answer them in the comments below! Also, if you have anything further to contribute, I look forward to future discussions in the comment section! Thank you for reading!
