Handling missing values in time series

Gustavo Sánchez Hurtado

Award-Winning Engineer, Researcher & Educator | Digital Transformation: Control Systems, IoT, and Machine Learning | PLC/SCADA programmer | Python/MATLAB | Node Red | Global Speaker, Author & Podcaster

Published: February 25, 2023

In this note, we explore different methods for handling missing values in time series, represented here as NumPy arrays.

Imputing with Mean. One simple option is imputing missing values with the mean of the array.

import numpy as np

# Creating a numpy array with missing values
arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])
print("Initial array = ", arr)

# Imputing missing values with the mean of the valid (non-NaN) entries
mean = np.nanmean(arr)
arr[np.isnan(arr)] = mean
print("Imputed with mean average = ", arr)

Output:
Initial array =  [ 1.  2. nan  4.  5. nan  7.]
Imputed with mean average =  [1.  2.  3.8 4.  5.  3.8 7. ]

Imputing with a Constant. We can also impute missing values with a constant value.

# Imputing missing values with a constant value
arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])
constant = 0.0
arr[np.isnan(arr)] = constant
print("Imputed with constant value = ", arr)

Output:
Imputed with constant value =  [1. 2. 0. 4. 5. 0. 7.]
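
Both snippets above overwrite the array in place. As a minimal sketch of a non-mutating variant, the same two imputations can be written with np.where and np.nan_to_num, which return new arrays. The variable names mean_filled and zero_filled are illustrative, not part of the original note. Note that np.nan_to_num also replaces infinities by default, which does not matter for this example because the array contains none.

```python
import numpy as np

arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])

# np.where builds a new array; arr itself keeps its NaNs
mean_filled = np.where(np.isnan(arr), np.nanmean(arr), arr)

# np.nan_to_num replaces NaN with the value passed via the nan keyword
zero_filled = np.nan_to_num(arr, nan=0.0)

print("Mean-filled copy = ", mean_filled)
print("Zero-filled copy = ", zero_filled)
print("Original still has NaNs = ", np.isnan(arr).any())
```

Keeping the original array untouched makes it easier to compare several imputation strategies on the same data.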

We can also use SimpleImputer from scikit-learn, in this case with the median value. More options are available: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

from sklearn.impute import SimpleImputer

# Creating a numpy array with missing values
arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])
arr = arr.reshape(-1, 1)

# Imputing missing values
imputer = SimpleImputer(strategy='median')
arr = imputer.fit_transform(arr)

print("Imputed with SimpleImputer = ", arr.T)

Output:
Imputed with SimpleImputer =  [[1. 2. 4. 4. 5. 4. 7.]]
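
As a rough sketch of the other options, the same array can be passed through SimpleImputer with each of its built-in strategies. The strategy names "mean", "median", "most_frequent" and "constant" and the fill_value parameter come from the scikit-learn API. The results dictionary and the fill value of -1.0 are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0]).reshape(-1, 1)

# Collect the imputed series for each statistic-based strategy
results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    results[strategy] = imputer.fit_transform(arr).ravel()

# The "constant" strategy fills with fill_value (-1.0 is an arbitrary choice)
imputer = SimpleImputer(strategy="constant", fill_value=-1.0)
results["constant"] = imputer.fit_transform(arr).ravel()

for name, values in results.items():
    print(name, "->", values)
```

SimpleImputer expects a 2D array and imputes each column independently, which is why the series is reshaped into a single column first.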

Imputing with First Valid Element to the Left/Right

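We can also replace each missing value with the first valid element to its left (a forward fill). A NaN at the start of the array has no valid element to its left, so the loop below leaves it unchanged.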

arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])

# Imputing missing values with the first valid element to the left
for i in range(len(arr)):
    if np.isnan(arr[i]):
        idx = np.where(~np.isnan(arr[:i]))[0]
        if len(idx) > 0:
            arr[i] = arr[idx[-1]]

print("Imputed with first valid element to the left = ", arr)

Output:
Imputed with first valid element to the left =  [1. 2. 2. 4. 5. 5. 7.]
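
The example above fills from the left only. As a sketch of how the fill to the right might look, the version below implements the left fill without an explicit Python loop and then obtains the right fill by reversing the array. The helper names forward_fill and backward_fill are hypothetical and not part of the original note.

```python
import numpy as np

def forward_fill(values):
    """Replace each NaN with the first valid element to its left."""
    values = np.asarray(values, dtype=float)
    valid = ~np.isnan(values)
    # For each position, the index of the latest valid entry seen so far
    # (0 if none has been seen yet)
    last_valid = np.where(valid, np.arange(len(values)), 0)
    last_valid = np.maximum.accumulate(last_valid)
    # A leading NaN maps to index 0, which is itself NaN, so it stays NaN
    return values[last_valid]

def backward_fill(values):
    """Replace each NaN with the first valid element to its right."""
    values = np.asarray(values, dtype=float)
    return forward_fill(values[::-1])[::-1]

arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])
print("Filled from the left  = ", forward_fill(arr))
print("Filled from the right = ", backward_fill(arr))
```

Both helpers return new arrays. A trailing NaN is left unchanged by backward_fill, just as a leading NaN is left unchanged by forward_fill.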

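Imputing with an ARMA Model. A more sophisticated approach is to fit an ARMA model to the values that precede each gap and use its one-step-ahead forecast as the imputed value.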

from statsmodels.tsa.arima.model import ARIMA
# Creating a 1D numpy array with missing values
arr = np.array([1.0, 2.0,-1.0, -2.0, np.nan, 1.0, 2.0, np.nan, -1.0])
print("Initial array = ", arr)
# Imputing missing values using an ARMA model
for i in range(arr.shape[0]):
    if np.isnan(arr[i]):
        # Select past values to fit the ARMA model
        # (earlier gaps have already been filled, so arr[:i] contains no NaN)
        past_vals = arr[:i]
        # Fit ARMA model to past values
        model = ARIMA(past_vals, order=(1,0,1))
        model_fit = model.fit()
        # Predict the next value in the sequence
        arr[i] = model_fit.forecast()[0]

print("Imputed with ARMA model = ", arr)

Output:
Initial array =  [ 1.  2. -1. -2. nan  1.  2. nan -1.]
Imputed with ARMA model =  [ 1.  2. -1. -2.  -0.357  1. 2.  1.20  -1. ]

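Imputing Missing Values for a Periodic Sequence

If the series is known to repeat with a fixed period, each missing value can be taken from the corresponding position within one period.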

arr = np.array([1.0, 2.0, np.nan, 4.0, 1.0, np.nan, 3.0, np.nan, 1.0, np.nan, np.nan, 4.0])
print("Initial array = ", arr)
# Creating a numpy array containing the period of the sequence
period = np.array([1, 2, 3, 4])

# Imputing missing values using the corresponding value in the periodic sequence
for i in range(arr.shape[0]):
    if np.isnan(arr[i]):
        arr[i] = period[i % len(period)]

print("Imputed for periodic sequence = ", arr)

Output:
Initial array =  [ 1.  2. nan  4.  1. nan  3. nan  1. nan nan  4.]
Imputed for periodic sequence =  [1. 2. 3. 4. 1. 2. 3. 4. 1. 2. 3. 4.]
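
The example above assumes the values of one period are known in advance. If only the period length is known, one possible sketch is to estimate the period from the observed data by averaging the valid values at each position in the cycle. The names period_length, cycles and template are illustrative. The approach assumes the array length is a multiple of the period length and that every position in the cycle is observed at least once; otherwise np.nanmean returns NaN for that position.

```python
import numpy as np

arr = np.array([1.0, 2.0, np.nan, 4.0, 1.0, np.nan, 3.0, np.nan, 1.0, np.nan, np.nan, 4.0])
period_length = 4  # assumed to be known

# One row per cycle, one column per position within the period
cycles = arr.reshape(-1, period_length)

# Average the valid observations at each position to estimate one period
template = np.nanmean(cycles, axis=0)

# Fill each gap with the estimate for its position in the cycle
filled = np.where(np.isnan(cycles), template, cycles).ravel()

print("Estimated period = ", template)
print("Imputed with estimated period = ", filled)
```

For this array the estimated period matches the hard-coded one, so the imputed sequence is the same as in the example above.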

Feel free to leave your comments below; I would be happy to answer.

At MYWAI we promote agile, explainable, reliable and affordable ML at the edge.
