Handling missing values in time series
Gustavo Sánchez Hurtado
Award-Winning Engineer, Researcher & Educator | Digital Transformation: Control Systems, IoT, and Machine Learning | PLC/SCADA programmer | Python/MATLAB | Node Red | Global Speaker, Author & Podcaster
In this note, we explore several methods for handling missing values in time series, represented here as NumPy arrays.
Imputing with Mean
One simple option is to impute missing values with the mean of the array's observed (non-missing) values.
import numpy as np
# Creating a numpy array with missing values
arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])
print("Initial array = ", arr)
# Imputing missing values with the mean of the observed values
# (np.nanmean ignores NaN entries when averaging)
mean = np.nanmean(arr)
arr[np.isnan(arr)] = mean
print("Imputed with mean average = ", arr)
Output:
Initial array = [ 1. 2. nan 4. 5. nan 7.]
Imputed with mean average = [1. 2. 3.8 4. 5. 3.8 7. ]
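When several series are stored side by side as the columns of a 2D array, the same idea can be applied per column. The following is only a minimal sketch under that assumption; the array X and the variable names are illustrative and not part of the original example.

```python
import numpy as np

# Hypothetical data: two series stored as the columns of a 2D array
X = np.array([[1.0, np.nan],
              [np.nan, 4.0],
              [3.0, 6.0]])

# Mean of each column, ignoring NaN entries
col_means = np.nanmean(X, axis=0)

# Replace each NaN with the mean of its own column (broadcast across rows);
# np.where returns a new array, so X itself is left untouched
X_filled = np.where(np.isnan(X), col_means, X)
print("Imputed per column = ", X_filled)
```

Because np.where builds a new array, the original data stays available for comparison.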
We can also impute with a constant value, such as zero.
# Imputing missing values with a constant value
arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])
constant = 0.0
arr[np.isnan(arr)] = constant
print("Imputed with constant value = ", arr)
Output:
Imputed with constant value = [1. 2. 0. 4. 5. 0. 7.]
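As a possible alternative, NumPy's np.nan_to_num can do the same replacement in one call and returns a copy by default. This is a minimal sketch; note that with its default arguments np.nan_to_num also replaces infinite values with very large finite numbers, which may or may not be what you want.

```python
import numpy as np

arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])

# Replace NaN with a constant; the input array is not modified (copy=True by default)
filled = np.nan_to_num(arr, nan=0.0)
print("Imputed with np.nan_to_num = ", filled)
```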
We can also use SimpleImputer from scikit-learn, here with the median value; other strategies are available: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
Because SimpleImputer expects a 2D array, the series is first reshaped into a single column.
from sklearn.impute import SimpleImputer
# Creating a numpy array with missing values
arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])
arr = arr.reshape(-1, 1)
# Imputing missing values
imputer = SimpleImputer(strategy='median')
arr = imputer.fit_transform(arr)
print("Imputed with SimpleImputer = ", arr.T)
Output:
Imputed with SimpleImputer = [[1. 2. 4. 4. 5. 4. 7.]]
Imputing with First Valid Element to the Left/Right
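We can also impute each missing value with the first valid element to its left (a forward fill), as in this example. A missing value at the very start of the array has no valid element to its left and is left unchanged.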
arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])
# Imputing missing values with the first valid element to the left
for i in range(len(arr)):
    if np.isnan(arr[i]):
        # Indices of all valid (non-NaN) elements before position i
        idx = np.where(~np.isnan(arr[:i]))[0]
        if len(idx) > 0:
            arr[i] = arr[idx[-1]]
print("Imputed with first valid element to the left = ", arr)
Output:
Imputed with first valid element to the left = [1. 2. 2. 4. 5. 5. 7.]
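The example above fills only from the left. A fill from the right can be obtained by reversing the array, filling from the left, and reversing back. The sketch below is one possible loop-free way to write both; the helper names forward_fill and backward_fill are illustrative, not from a library.

```python
import numpy as np

def forward_fill(values):
    """Replace each NaN with the closest valid element to its left."""
    values = np.asarray(values, dtype=float)
    valid = ~np.isnan(values)
    # At valid positions keep the position's own index, elsewhere use 0
    idx = np.where(valid, np.arange(len(values)), 0)
    # Running maximum gives, for every position, the index of the last valid element
    idx = np.maximum.accumulate(idx)
    # Leading NaNs map to index 0, which is itself NaN, so they stay NaN
    return values[idx]

def backward_fill(values):
    """Replace each NaN with the closest valid element to its right."""
    values = np.asarray(values, dtype=float)
    return forward_fill(values[::-1])[::-1]

arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])
left_filled = forward_fill(arr)
right_filled = backward_fill(arr)
print("Imputed with first valid element to the left = ", left_filled)
print("Imputed with first valid element to the right = ", right_filled)
```

For this array the left fill gives 1, 2, 2, 4, 5, 5, 7 and the right fill gives 1, 2, 4, 4, 5, 7, 7. Both helpers return new arrays and leave a NaN in place when there is no valid neighbour on the chosen side.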
A more sophisticated approach is to fit an ARMA model to the values that precede each gap and fill the gap with the model's one-step-ahead forecast.
from statsmodels.tsa.arima.model import ARIMA
# Creating a 1D numpy array with missing values
arr = np.array([1.0, 2.0,-1.0, -2.0, np.nan, 1.0, 2.0, np.nan, -1.0])
print("Initial array = ", arr)
# Imputing missing values using an ARMA model
for i in range(arr.shape[0]):
    if np.isnan(arr[i]):
        # Select past values to fit the ARMA model
        # (earlier gaps have already been filled by previous iterations)
        past_vals = arr[:i]
        # Fit an ARMA(1,1) model, i.e. ARIMA with order (1, 0, 1), to the past values
        model = ARIMA(past_vals, order=(1, 0, 1))
        model_fit = model.fit()
        # Predict the next value in the sequence
        arr[i] = model_fit.forecast()[0]
print("Imputed with ARMA model = ", arr)
Output:
Initial array = [ 1. 2. -1. -2. nan 1. 2. nan -1.]
Imputed with ARMA model = [ 1. 2. -1. -2. -0.357 1. 2. 1.20 -1. ]
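Imputing missing values for a periodic sequence
When the sequence is known to repeat with a fixed period, each missing value can be replaced with the value at the same position within the period.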
arr = np.array([1.0, 2.0, np.nan, 4.0, 1.0, np.nan, 3.0, np.nan, 1.0, np.nan, np.nan, 4.0])
print("Initial array = ", arr)
# Creating a numpy array containing the period of the sequence
period = np.array([1, 2, 3, 4])
# Imputing missing values using the corresponding value in the periodic sequence
for i in range(arr.shape[0]):
    if np.isnan(arr[i]):
        # Position within the period determines the replacement value
        arr[i] = period[i % len(period)]
print("Imputed for periodic sequence = ", arr)
Output:
Initial array = [ 1. 2. nan 4. 1. nan 3. nan 1. nan nan 4.]
Imputed for periodic sequence = [1. 2. 3. 4. 1. 2. 3. 4. 1. 2. 3. 4.]
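If the values of one period are not known in advance, they could be estimated from the data itself. The sketch below assumes only that the period length (here 4) is known; it averages the observed values at each position within the period and uses that average to fill the gaps. The names period_len, template and filled are illustrative.

```python
import numpy as np

arr = np.array([1.0, 2.0, np.nan, 4.0, 1.0, np.nan, 3.0, np.nan, 1.0, np.nan, np.nan, 4.0])
period_len = 4  # assumed to be known

# Estimate one period: average the observed values at each phase (0, 1, 2, 3)
template = np.array([np.nanmean(arr[phase::period_len]) for phase in range(period_len)])

# Fill each gap with the estimate for its phase, working on a copy
filled = arr.copy()
missing = np.isnan(arr)
filled[missing] = template[np.arange(len(arr))[missing] % period_len]
print("Estimated period = ", template)
print("Imputed for periodic sequence = ", filled)
```

For this array the estimated period is 1, 2, 3, 4, so the result matches the output above. If some phase has no observed value at all, np.nanmean returns NaN for it (with a warning) and the corresponding gaps remain unfilled.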