Anomaly Detection in Python: Best Practices and Techniques
After writing previous articles on detailed data analytics, such as this one, I often receive questions about detecting and removing outliers.
For a quick-look analysis, excluding the top and bottom percentiles of a certain label (similar to what was done in the above article) may be a valid approach. For a more detailed analysis, however, other methods of anomaly detection should be used. Here, I describe a few commonly used methods of anomaly detection and demonstrate how they work on a specific example, the Weight and Height dataset available on Kaggle. Full details of the analysis can be found in this public Kaggle notebook.
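To make the quick-look option concrete, here is a minimal pandas sketch of percentile-based trimming (the column name 'Weight' and the 1%/99% cut-offs are assumptions for illustration; df is the dataframe loaded from the dataset):

import pandas as pd

# Keep only the rows between the 1st and 99th percentiles of a label
# (the column name and cut-offs are illustrative assumptions)
lower, upper = df['Weight'].quantile([0.01, 0.99])
df_trimmed = df[df['Weight'].between(lower, upper)]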
Interquartile range (IQR) method
This is probably the most common method of outlier detection. For our dataset, it gives the following results (a note for metric-system users: here, height is measured in inches and weight in pounds):
The clear drawback of the method is that it does not take into account potential correlations among the features (people’s height and weight, in this example), thus potentially discarding fully valid points.
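For reference, a minimal sketch of how the IQR rule can be applied per feature (the helper name and column names are assumptions, and k = 1.5 is the conventional multiplier; the exact code is in the linked notebook):

import pandas as pd

def detect_outliers_iqr(dataframe, columns, k=1.5):
    # Flag a point if any feature falls outside [Q1 - k*IQR, Q3 + k*IQR]
    mask = pd.Series(False, index=dataframe.index)
    for col in columns:
        q1, q3 = dataframe[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask |= (dataframe[col] < q1 - k * iqr) | (dataframe[col] > q3 + k * iqr)
    dataframe['outliers_iqr'] = mask.astype(int)
    return dataframe

df = detect_outliers_iqr(df, ['Height', 'Weight'])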
Isolation forest method
Another common method uses the isolation forest algorithm to find the outliers. Here is an example of its Python (sklearn's) implementation:
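(The snippet below is a minimal sketch of such an implementation; the contamination level and column names are assumptions, and the exact code is in the linked notebook.)

from sklearn.ensemble import IsolationForest

# contamination sets the expected share of outliers; the value here,
# like the column names, is an illustrative assumption
iso_forest = IsolationForest(contamination=0.01, random_state=42)
# fit_predict() returns -1 for outliers and 1 for inliers
df['outliers_iforest'] = (iso_forest.fit_predict(df[['Height', 'Weight']]) == -1).astype(int)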
Similar to the above example, this algorithm considers short/light and tall/heavy people as potential outliers and does not specifically capture the correlation between height and weight.
Local Outlier Factor (LOF) method
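A minimal scikit-learn sketch of applying LOF (the n_neighbors value, contamination level, and column names are assumptions; see the notebook for the exact code):

from sklearn.neighbors import LocalOutlierFactor

# LOF flags points whose local density is much lower than that of
# their neighbours (parameter values are illustrative assumptions)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
df['outliers_lof'] = (lof.fit_predict(df[['Height', 'Weight']]) == -1).astype(int)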
The results of applying this method look very different from the previous two plots:
One-class Support Vector Machine (SVM) method
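One possible scikit-learn sketch of this method (the nu value and column names are assumptions; the exact code is in the linked notebook):

from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Standard-scale the features first, since SVMs are sensitive to
# feature scales; nu bounds the fraction of points treated as
# outliers (both are illustrative assumptions)
X = StandardScaler().fit_transform(df[['Height', 'Weight']])
one_class_svm = OneClassSVM(kernel='rbf', nu=0.01, gamma='auto')
df['outliers_svm'] = (one_class_svm.fit_predict(X) == -1).astype(int)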
Here is what we see by applying the One-class SVM method:
Notably, this method also identified some of the middle-weight persons as potential outliers.
Autoencoder method
Finally, let us use the power of a deep-learning-based method (autoencoders). Here is the sample PyTorch implementation:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import StandardScaler

class Autoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(Autoencoder, self).__init__()
        # Encoder compresses the input down to hidden_dim dimensions
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 2 * hidden_dim),
            nn.ReLU(),
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU()
        )
        # Decoder reconstructs the input from the compressed representation
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, 2 * hidden_dim),
            nn.ReLU(),
            nn.Linear(2 * hidden_dim, input_dim),
            nn.Sigmoid()
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

class OutlierDataset(Dataset):
    def __init__(self, data):
        self.data = torch.tensor(data, dtype=torch.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

def detect_outliers_autoencoder(dataframe, hidden_dim=16, num_epochs=10, batch_size=32):
    # Convert the dataframe to a standard-scaled numpy array
    # (FEATURE_COLUMNS is defined elsewhere in the notebook)
    scaler = StandardScaler()
    data = scaler.fit_transform(dataframe[FEATURE_COLUMNS])
    # Create an outlier dataset
    dataset = OutlierDataset(data)
    # Split the data into training and validation sets
    val_split = int(0.2 * len(dataset))
    train_set, val_set = torch.utils.data.random_split(dataset, [len(dataset) - val_split, val_split])
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)
    # Initialize the autoencoder model
    input_dim = data.shape[1]
    model = Autoencoder(input_dim, hidden_dim)
    # Set the loss function and optimizer
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=3e-2, weight_decay=1e-8)
    # Train the autoencoder
    for epoch in range(num_epochs):
        train_loss = 0.0
        val_loss = 0.0
        # Training
        model.train()
        for batch in train_loader:
            # Zero the gradients
            optimizer.zero_grad()
            # Forward pass
            outputs = model(batch)
            # Compute the reconstruction loss
            loss = criterion(outputs, batch)
            # Backward pass and optimization
            loss.backward()
            optimizer.step()
            train_loss += loss.item() * batch.size(0)
        # Validation
        model.eval()
        with torch.no_grad():
            for batch in val_loader:
                outputs = model(batch)
                loss = criterion(outputs, batch)
                val_loss += loss.item() * batch.size(0)
        train_loss /= len(train_set)
        val_loss /= len(val_set)
        print(f'Epoch {epoch+1}/{num_epochs}: Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}')
    # Calculate the reconstruction error for each data point
    model.eval()
    with torch.no_grad():
        reconstructed = model(dataset.data)
        mse_loss = nn.MSELoss(reduction='none')
        error = torch.mean(mse_loss(reconstructed, dataset.data), dim=1)
    # Define a threshold for outlier detection (99th percentile of the error)
    threshold = np.percentile(error.numpy(), 99)
    # Create a boolean mask for the outliers
    outliers_mask = (error > threshold)
    # Mark the outliers
    dataframe['outliers_autoenc'] = outliers_mask.numpy().astype(int)
    return dataframe

# Detect outliers using the autoencoder method
df = detect_outliers_autoencoder(df)
resulting in the following outliers:
To conclude, let us not forget about the No Free Lunch theorem and stay “hungry” for knowledge of the data collection method and the domain. Indeed, even if your favourite algorithm does the job well on the datasets of your choice, there is no guarantee it will work equally well on a new dataset.
I hope these results are useful for you. In case of questions or comments, do not hesitate to write in the comments below.