Handling NaN in Pandas
Dylan | Sep 16, 2019
The Pandas library uses numpy.nan to represent NaN values in DataFrames and Series. NaN stands for "Not a Number" and, interestingly enough, is itself a floating-point value, so a numeric column that contains it is typically stored with the numpy.float64 data type.
Data in the wild is often messy and incomplete and NaN values are an important tool for reflecting missing data. Depending on the data mining task at hand, there are several statistical approaches to handling missing data; however, this topic could consume several new posts.
For now, let's focus on the basic coding tools and techniques for manipulating Pandas DataFrames.
Let’s begin with the following DataFrame head.
Our example DataFrame has four columns and 100 rows. Let's pretend the values in Column B represent sensor readings and we happen to know that readings of zero are impossible and therefore must be the result of a sensor error.
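To follow along, a stand-in DataFrame with the same shape can be built as in the sketch below. The column names A through D, the random values, and the seed are assumptions made for illustration; they are not the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the example data: 100 rows, four columns.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "A": rng.normal(size=100),
    # Zeros in B play the role of impossible sensor readings.
    "B": rng.integers(0, 10, size=100).astype(float),
    # Negative values in C play the role of a malfunctioning sensor.
    "C": rng.integers(-2, 10, size=100).astype(float),
    "D": rng.normal(size=100),
})

print(df.head())  # first five rows
print(df.shape)   # (100, 4)
```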
Feeding erroneous information to a machine learning model can hurt its ability to make accurate predictions. The model has no idea that zero isn't a valid feature value, and if we use an unsupervised method such as clustering, it may treat this erroneous value as the basis for a cluster.
It's very important that we eliminate this possibility in the data cleaning and preprocessing phase before training our model.
Replacing Specific Values with NaN
The easiest way to swap known values (such as zero) with np.nan is the replace method. If we know that all zero values in Column B are errors, converting them to NaN values is simple. Assigning the result back to the column keeps the update explicit.
df['B'] = df['B'].replace(0, np.nan)
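As a small check of the replacement step, here is a minimal sketch on a made-up column of four readings. It assigns the result back to the column instead of relying on inplace=True on a selected column, because in recent pandas versions a column selected this way may not propagate an in-place change to the parent frame.

```python
import numpy as np
import pandas as pd

# Made-up sensor readings; the two zeros are treated as errors.
df = pd.DataFrame({"B": [3.0, 0.0, 7.0, 0.0]})

# Series.replace returns a new Series, so assign it back to the column.
df["B"] = df["B"].replace(0, np.nan)

print(df["B"].isna().sum())  # 2
```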
But how would we accomplish this if the errors span a range of values instead of a single known value? Let's now pretend that the values in Column C correspond to another sensor that logs another feature. For this sensor, the values can be 0 but they cannot be negative. Unfortunately, once in a while our sensor malfunctions and returns negative readings.
In this case, the set of possible erroneous readings is infinite: all negative numbers. It wouldn't make sense to manually convert each value to NaN. Thankfully, we don’t have to!
By using DataFrame.loc with a Boolean mask, we can select an entire range of values and convert it to NaN in a single assignment.
df.loc[df['C'] < 0, 'C'] = np.nan
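The Boolean-mask approach can be sketched on a few made-up readings as follows. Selecting rows and column in one .loc call avoids chained assignment; Series.mask is shown in a comment as one possible alternative.

```python
import numpy as np
import pandas as pd

# Made-up readings; negative values are treated as sensor errors.
df = pd.DataFrame({"C": [1.5, -0.2, 0.0, -3.0]})

# Boolean Series: True wherever the reading is negative.
negative = df["C"] < 0

# Select rows and column in a single .loc call, then assign NaN.
df.loc[negative, "C"] = np.nan

# An alternative with the same effect would be:
# df["C"] = df["C"].mask(df["C"] < 0)

print(df["C"].isna().tolist())  # [False, True, False, True]
```

Note that the valid reading of 0.0 is left alone, since the mask only matches values strictly below zero.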
Filling NaN Values with the Feature Mean
Converting all NaN values to the feature's mean value is a popular way of handling missing numeric data. This can be achieved easily in a Pandas DataFrame with a single line of code.
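One way to write that fill is to pass the column means to fillna. The sketch below uses a made-up two-column frame and assumes every column is numeric; with non-numeric columns present, df.mean(numeric_only=True) would likely be needed.

```python
import numpy as np
import pandas as pd

# Made-up frame with one NaN in each column.
df = pd.DataFrame({"B": [1.0, np.nan, 3.0], "C": [np.nan, 4.0, 6.0]})

# df.mean() skips NaN by default and returns one mean per column;
# fillna matches those means to the columns by label.
filled = df.fillna(df.mean())

print(filled["B"].tolist())  # [1.0, 2.0, 3.0]
print(filled["C"].tolist())  # [5.0, 4.0, 6.0]
```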
However, there's one thing we must bear in mind: filling the whole DataFrame this way replaces every NaN value with the mean of its own column. If we only want to target the NaN values in Column B, we can apply the fill to that column alone, which leaves any NaN values in Column C untouched.
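A column-specific fill might look like the sketch below, again on a made-up frame. Only Column B is reassigned, so the NaN in Column C survives.

```python
import numpy as np
import pandas as pd

# Made-up frame with one NaN in each column.
df = pd.DataFrame({"B": [1.0, np.nan, 3.0], "C": [np.nan, 4.0, 6.0]})

# Fill only Column B with its own mean and assign the result back.
df["B"] = df["B"].fillna(df["B"].mean())

print(df["B"].tolist())      # [1.0, 2.0, 3.0]
print(df["C"].isna().sum())  # 1
```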
By specifying the inplace parameter, we can choose whether to fill the NaN values in our original DataFrame (inplace=True) or to create a new DataFrame object with the NaN values replaced by the feature’s mean value (inplace=False). By default, inplace is False.
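The difference between the two settings can be seen in a short sketch on a made-up single-column frame. The variable names here are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"B": [1.0, np.nan, 3.0]})  # made-up values

# Default (inplace=False): a new DataFrame comes back, df is unchanged.
new_df = df.fillna(df.mean())
original_still_has_nan = bool(df["B"].isna().any())

# inplace=True: df itself is modified.
df.fillna(df.mean(), inplace=True)
original_has_nan_now = bool(df["B"].isna().any())

print(original_still_has_nan)  # True
print(original_has_nan_now)    # False
```

Keeping the default and assigning the result to a name is often easier to follow, since the original data stays available for comparison.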
Dropping Rows with NaN Values
In some circumstances, it’s best to remove all instances (rows) with NaN values entirely from our DataFrame. Again, Pandas makes this easy for us!
dropped_df = df.dropna(axis=0)
By changing the axis from 0 to 1, Pandas allows us to drop entire columns if they contain even a single NaN value. One easy trick to remember which value corresponds to rows and which to columns is that 1 is a vertical line resembling a column.
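The effect of the two axis values can be compared side by side in a minimal sketch. The frame below is made up, with a single NaN in Column B.

```python
import numpy as np
import pandas as pd

# Made-up frame: the only NaN sits in row 0 of Column B.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [np.nan, 5.0, 6.0]})

rows_dropped = df.dropna(axis=0)  # drops row 0, keeps both columns
cols_dropped = df.dropna(axis=1)  # drops Column B, keeps all rows

print(rows_dropped.shape)          # (2, 2)
print(list(cols_dropped.columns))  # ['A']
```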
Querying a Column for NaN Values
Sometimes it is useful to query entire columns for NaN values. By calling isna() on a column, such as Column B, we get a Boolean Pandas Series with one entry per row: True marks a NaN value at that index, and False marks a non-NaN value.
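A minimal sketch of the query, using a made-up Column B. It also shows why a plain equality test is not a substitute: NaN never compares equal to anything, including itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"B": [1.0, np.nan, 3.0]})  # made-up values

# Boolean Series: True where Column B holds NaN.
is_missing = df["B"].isna()
print(is_missing.tolist())  # [False, True, False]

# The mask can be used to pull out the affected rows.
missing_rows = df[is_missing]

# Comparing with == does not find NaN values.
found_by_equality = bool((df["B"] == np.nan).any())
print(found_by_equality)  # False
```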
If you need the inverse of this Boolean Series, you can call notna() instead.
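The inverse query can be sketched the same way on a made-up Column B. The result matches negating the isna() mask with the ~ operator.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"B": [1.0, np.nan, 3.0]})  # made-up values

# Boolean Series: True where Column B holds a real value.
is_present = df["B"].notna()
print(is_present.tolist())  # [True, False, True]

# Same result as inverting the isna() mask.
same_as_negation = bool((is_present == ~df["B"].isna()).all())

# Keep only the rows with a valid reading in Column B.
valid_rows = df[is_present]
```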
Hopefully, this quick introduction to manipulating NaN values in Pandas objects was helpful! As always, if you have any questions, please ask them in the comments below, since we all benefit from working through the answers. Also, if you have anything further to contribute, I look forward to future discussions in the comment section! Thank you for reading!