Handling missing values in time series

In this note, we explore several methods for handling missing values in time series, represented here as NumPy arrays.

Imputing with Mean

One simple option is to replace missing values with the mean of the observed values in the array.

import numpy as np

# Creating a numpy array with missing values
arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])
print("Initial array = ", arr)

# Imputing missing values with the mean of the observed (non-NaN) values
mean = np.nanmean(arr)
arr[np.isnan(arr)] = mean
print("Imputed with mean average = ", arr)

Output:
Initial array =  [ 1.  2. nan  4.  5. nan  7.]
Imputed with mean average =  [1.  2.  3.8 4.  5.  3.8 7. ]        
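
When several series are stored side by side, the same idea can be applied column by column. The snippet below is a minimal sketch, assuming a hypothetical 2D array named data in which each row is a time step and each column is a separate series; the values are made up for illustration.

```python
import numpy as np

# Hypothetical data: rows are time steps, columns are two separate series
data = np.array([[1.0, 10.0],
                 [np.nan, 20.0],
                 [3.0, np.nan],
                 [5.0, 30.0]])

# Mean of each column, ignoring NaNs
col_means = np.nanmean(data, axis=0)

# Keep observed values; use the column mean where a value is missing
imputed = np.where(np.isnan(data), col_means, data)
print("Imputed per column = ")
print(imputed)
```

np.where broadcasts col_means across the rows, so each gap receives the mean of its own column rather than a global mean.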

We can also impute with a constant value, such as zero:

# Imputing missing values with a constant value
arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])
constant = 0.0
arr[np.isnan(arr)] = constant
print("Imputed with constant value = ", arr)

Output:
Imputed with constant value =  [1. 2. 0. 4. 5. 0. 7.]        

We can also use SimpleImputer from scikit-learn, here with the median strategy. Other strategies are described in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

from sklearn.impute import SimpleImputer

# Creating a numpy array with missing values
arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])
arr = arr.reshape(-1, 1)

# Imputing missing values
imputer = SimpleImputer(strategy='median')
arr = imputer.fit_transform(arr)

print("Imputed with SimpleImputer = ", arr.T)

Output:
Imputed with SimpleImputer =  [[1. 2. 4. 4. 5. 4. 7.]]        

Imputing with First Valid Element to the Left/Right

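We can also replace each missing value with the first valid element to its left, as in the example below. A missing value at the very start of the array has no valid element to its left and stays NaN.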

arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])

# Imputing missing values with the first valid element to the left
for i in range(len(arr)):
    if np.isnan(arr[i]):
        idx = np.where(~np.isnan(arr[:i]))[0]
        if len(idx) > 0:
            arr[i] = arr[idx[-1]]

print("Imputed with first valid element to the left = ", arr)

Output:
Imputed with first valid element to the left =  [1. 2. 2. 4. 5. 5. 7.]        
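
The loop above covers the "left" case only. As a possible alternative, the sketch below fills from the left without an explicit loop and obtains the "right" case by reversing the array. The helper names forward_fill and backward_fill are our own, not library functions.

```python
import numpy as np

def forward_fill(values):
    """Replace each NaN with the first valid element to its left."""
    values = np.asarray(values, dtype=float)
    valid = ~np.isnan(values)
    # Position of each valid element, 0 where the element is missing
    idx = np.where(valid, np.arange(len(values)), 0)
    # Running maximum = position of the latest valid element seen so far
    idx = np.maximum.accumulate(idx)
    # Leading NaNs map to position 0, which is itself NaN, so they stay NaN
    return values[idx]

def backward_fill(values):
    """Replace each NaN with the first valid element to its right."""
    return forward_fill(np.asarray(values, dtype=float)[::-1])[::-1]

arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])
left_filled = forward_fill(arr)
right_filled = backward_fill(arr)
print("Filled from the left  = ", left_filled)
print("Filled from the right = ", right_filled)
```

Both helpers return a new array and leave the input unchanged. Gaps with no valid neighbour on the chosen side (leading NaNs for the left fill, trailing NaNs for the right fill) remain NaN.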

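A more sophisticated approach is to fit an ARMA model to the values preceding each gap and fill the gap with the model's one-step-ahead forecast. Here we use the ARIMA class from statsmodels with order=(1,0,1), which corresponds to an ARMA(1,1) model.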

from statsmodels.tsa.arima.model import ARIMA
# Creating a 1D numpy array with missing values
arr = np.array([1.0, 2.0,-1.0, -2.0, np.nan, 1.0, 2.0, np.nan, -1.0])
print("Initial array = ", arr)
# Imputing missing values using an ARMA model
for i in range(arr.shape[0]):
    if np.isnan(arr[i]):
        # Select past values to fit the ARMA model
        # (gaps before index i have already been filled in earlier iterations)
        past_vals = arr[:i]
        # Fit ARMA model to past values
        model = ARIMA(past_vals, order=(1,0,1))
        model_fit = model.fit()
        # Predict the next value in the sequence
        arr[i] = model_fit.forecast()[0]

print("Imputed with ARMA model = ", arr)

Output:
Initial array =  [ 1.  2. -1. -2. nan  1.  2. nan -1.]
Imputed with ARMA model =  [ 1.  2. -1. -2.  -0.357  1. 2.  1.20  -1. ]
        

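Imputing Missing Values for a Periodic Sequence

If the series is known to repeat with a fixed period, each missing value can be replaced with the value at the same position within the period.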

arr = np.array([1.0, 2.0, np.nan, 4.0, 1.0, np.nan, 3.0, np.nan, 1.0, np.nan, np.nan, 4.0])
print("Initial array = ", arr)
# Creating a numpy array containing the period of the sequence
period = np.array([1, 2, 3, 4])

# Imputing missing values using the corresponding value in the periodic sequence
for i in range(arr.shape[0]):
    if np.isnan(arr[i]):
        arr[i] = period[i % len(period)]

print("Imputed for periodic sequence = ", arr)

Output:
Initial array =  [ 1.  2. nan  4.  1. nan  3. nan  1. nan nan  4.]
Imputed for periodic sequence =  [1. 2. 3. 4. 1. 2. 3. 4. 1. 2. 3. 4.]        
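
The example above assumes the values of one period are known in advance. When only the period length is known, one option is to estimate those values from the observed data. The sketch below is an illustration under three assumptions: the period length (the hypothetical variable period_length) is known, the array length is a multiple of it, and every position within the period is observed at least once.

```python
import numpy as np

arr = np.array([1.0, 2.0, np.nan, 4.0, 1.0, np.nan, 3.0, np.nan, 1.0, np.nan, np.nan, 4.0])
period_length = 4  # assumed known in advance

# One row per cycle, one column per position within the cycle
cycles = arr.reshape(-1, period_length)

# Average the observed values at each position, ignoring NaNs
template = np.nanmean(cycles, axis=0)

# Fill each gap with the template value for its position in the cycle
missing = np.isnan(arr)
positions = np.arange(len(arr)) % period_length
imputed = arr.copy()
imputed[missing] = template[positions[missing]]

print("Estimated period = ", template)
print("Imputed with estimated period = ", imputed)
```

For noisy data the template is the per-position average over all cycles, so the filled values follow the typical cycle rather than any single one.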

Feel free to leave your comments below; I would be happy to answer.

At MYWAI we promote agile, explainable, reliable and affordable ML at the edge.
