The Impact of Public Sentiment on Bitcoin Returns: Part 4 - Prediction and Model Explanation


Introduction

In this section, we dive deeper into the predictive modeling process, using several machine learning models to forecast Bitcoin price movements based on various features. We will walk through each step, including feature engineering, data splitting, model selection, performance evaluation, and model explanation using SHAP. By the end, we will have a thorough understanding of how different variables impact Bitcoin price predictions.


Step 1: Data Import and Preprocessing

import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns        

The first step in any machine learning workflow is importing and cleaning the data. Here, we begin by loading the dataset, which contains Bitcoin-related features such as tweet counts, sentiment analysis scores, and price data.

# Load preprocessed Bitcoin tweets data
daily_data = pd.read_csv('/content/gdrive/MyDrive/snscrape_clean/03-tweets_by_timestamp-3.csv', 
                         lineterminator='\n', 
                         parse_dates=['timestamp'], 
                         index_col='timestamp')        
# Display the value counts for the next close label
daily_data['next_close_label'].value_counts()        

We load the data from a CSV file, which includes columns such as the number of tweets, sentiment polarity scores, and other Bitcoin-related metrics. We then perform basic data cleaning by dropping unnecessary columns that do not directly contribute to our model's prediction.
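As a minimal sketch of this cleaning step (the exact columns inspected and dropped are assumptions, since they depend on the raw export), one might first review the structure of the dataset and then remove columns that are not useful as model inputs:

# Inspect column types, missing values, and basic statistics
daily_data.info()
daily_data.describe()

# Drop columns that do not contribute to prediction
# (the column names below are placeholders for identifiers and raw text fields)
daily_data = daily_data.drop(columns=['tweet_id', 'raw_text'], errors='ignore')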

After cleaning the data, we create visualizations to understand how key features, such as tweet counts and sentiment analysis scores, vary over time. These visualizations help identify any significant trends or patterns in the data that may affect our predictions. For instance, we plot the number of tweets over time to observe how Twitter activity correlates with Bitcoin price movements.
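A minimal plotting sketch for this step is shown below; the column name tweet_count is an assumption about how the daily tweet volume is stored in the dataset:

# Plot the daily number of Bitcoin-related tweets over time
plt.figure(figsize=(14, 4))
plt.plot(daily_data.index, daily_data['tweet_count'], color='steelblue', linewidth=0.8)
plt.title('Number of Bitcoin-related tweets per day')
plt.xlabel('Date')
plt.ylabel('Tweet count')
plt.tight_layout()
plt.show()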


We also visualize the sentiment scores derived from different libraries (TextBlob, VADER, and Pattern) to assess their individual contributions to the model. These visualizations allow us to see the changes in sentiment over time and whether sentiment shifts correlate with price movements.
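The comparison can be sketched as below; the sentiment column names are assumptions and should be adjusted to match the actual dataset:

# Plot daily sentiment scores from TextBlob, VADER, and Pattern in stacked panels
sentiment_cols = ['textblob_polarity', 'vader_compound', 'pattern_polarity']  # assumed names
fig, axes = plt.subplots(len(sentiment_cols), 1, figsize=(14, 8), sharex=True)
for ax, col in zip(axes, sentiment_cols):
    ax.plot(daily_data.index, daily_data[col], linewidth=0.8)
    ax.set_ylabel(col)
axes[-1].set_xlabel('Date')
fig.suptitle('Daily sentiment scores over time')
plt.tight_layout()
plt.show()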


Step 2: Feature Engineering and Data Normalization

Feature engineering is a crucial step in transforming raw data into a format suitable for machine learning models. We create additional features that can enhance the predictive power of the models. For example, we calculate the percentage change for certain columns (such as the closing price or sentiment scores) to capture the relative movement of these variables from day to day. These percentage changes provide insights into how fluctuations in these variables might influence future price movements.

# Calculate percent change for relevant features
pct_change_data = daily_data.drop(['Open', 'next_open_label', 'next_close_label', 'next_return', 
                                   'next_close', 'High', 'Low', 'hlco_ratio', 'upper_shadow', 
                                   'lower_shadow', 'is_weekend'], axis=1).pct_change()

# Rename columns to reflect percentage change
pct_change_data = pct_change_data.rename(columns={col: f'{col} (change)' for col in pct_change_data.columns})

# Concatenate the percentage change data with the existing dataset
daily_data = pd.concat([daily_data, pct_change_data], axis=1)

# pct_change() produces NaN in the first row and inf when a value rises from zero;
# clean these up so that models which cannot handle them do not fail later on
daily_data = daily_data.replace([np.inf, -np.inf], np.nan).dropna()

To ensure that all features are on a comparable scale, we normalize the data using a MinMaxScaler. Normalization is necessary because machine learning models often perform better when the input features are on the same scale. For example, if one feature ranges from 0 to 1, while another ranges from 0 to 1000, the model may give disproportionate weight to the larger-scale feature. Normalizing the data helps avoid this issue, allowing the model to treat all features equally.

from sklearn.preprocessing import MinMaxScaler

# Initialize MinMaxScaler
scl = MinMaxScaler()

# Rescale every column to the [0, 1] range.
# Note: to avoid look-ahead leakage, the scaler should be fit on the training split only
# and then applied to the test split; it is fit on the full dataset here for simplicity.
scaled_data = scl.fit_transform(daily_data)

Step 3: Correlation and Clustering

Understanding the relationships between the variables in the dataset is key to building an effective model. We use correlation clustering to identify which variables are most strongly related to one another. By examining the correlation matrix, we can see which features are highly correlated and might be redundant, and which ones are less correlated and could provide additional unique information.

For instance, we might observe that features like Close and Open prices are highly correlated, suggesting that one might be sufficient to represent price movements. On the other hand, features like sentiment scores from different libraries might be weakly correlated with price data, indicating that they provide distinct information that could be valuable for prediction.

Once we generate the correlation matrix, we perform clustering to group variables that have similar patterns of correlation. This clustering helps us identify which variables should be considered together and which can be treated independently. Visualizing the correlation heatmap and clusters allows us to see the structure of the data and decide which features are most relevant for model training.

import scipy.cluster.hierarchy as sch

def cluster_corr(corr_array):
    # Cluster correlation matrix to group highly correlated variables
    pairwise_distances = sch.distance.pdist(corr_array)
    linkage = sch.linkage(pairwise_distances, method='complete')
    cluster_distance_threshold = pairwise_distances.max() / 2
    idx_to_cluster_array = sch.fcluster(linkage, cluster_distance_threshold, criterion='distance')
    idx = np.argsort(idx_to_cluster_array)
    return corr_array.iloc[idx, :].T.iloc[idx, :]

# Visualize the correlation heatmap after clustering
plt.figure(figsize=(14, 14))
plt.title('Correlation heatmap')
sns.heatmap(cluster_corr(daily_data.corr()), cmap="PiYG", annot=False, center=0)        



Step 4: Model Training and Testing

With the data prepared, we proceed to the model training and testing phase. We split the data into training and testing sets, typically using an 80/20 split. The training set is used to train the model, while the testing set is reserved for evaluating the model’s performance on unseen data.

from sklearn.model_selection import train_test_split

# Split data into training (80%) and testing (20%) sets.
# All forward-looking columns are dropped from the features: keeping next_close,
# next_return, or next_open_label would leak information about the target.
X = daily_data.drop(['next_close_label', 'next_open_label', 'next_close', 'next_return'], axis=1)  # Independent variables
y = daily_data['next_close_label']  # Dependent variable (target)

# shuffle=False preserves chronological order, so the model is always evaluated on data
# that comes after its training period
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, shuffle=False)

We experiment with several machine learning models, including Decision Tree, AdaBoost, XGBoost, and CatBoost. These models are selected for their ability to handle complex, non-linear relationships in the data, which is essential when predicting volatile financial assets like Bitcoin.

  • Decision Tree: This model creates a tree-like structure, where each node represents a decision based on a feature. It is easy to understand and can capture non-linear relationships, but it can be prone to overfitting.
  • AdaBoost: This is an ensemble method that combines the predictions of several weak learners to produce a stronger prediction. AdaBoost focuses on correcting the mistakes made by previous models, leading to improved performance over time.
  • XGBoost: A highly efficient implementation of gradient boosting, XGBoost is known for its speed and performance. It builds an ensemble of decision trees in a sequential manner, optimizing for the model's performance by correcting errors from previous trees.
  • CatBoost: This model is based on gradient boosting but is particularly effective for categorical data. CatBoost automates the handling of categorical variables and is known for its robustness and accuracy.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Initialize models
models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    # Note: in scikit-learn >= 1.2 the 'base_estimator' argument is named 'estimator'
    'AdaBoost': AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=42),
    'XGBoost': XGBClassifier(random_state=42),
    'CatBoost': CatBoostClassifier(iterations=1000, learning_rate=0.05, depth=6, random_state=42, verbose=0)
}

# Train each model and evaluate accuracy
best_accuracy = 0
best_model = None

for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    print(f'{name} Accuracy: {accuracy}')
    
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = model

print(f'Best Model: {best_model.__class__.__name__} with accuracy {best_accuracy}')        

After training the models, we evaluate their performance on the test set using accuracy. Among the models tested, AdaBoost emerges as the best performer, with an accuracy of 62%, making it the most suitable of these models for predicting the direction of Bitcoin's next-day return from the selected features.
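Because the up/down labels are not necessarily balanced, accuracy alone can be misleading. A quick supplementary check, not part of the original notebook, is to look at per-class precision and recall for the best model:

from sklearn.metrics import classification_report, confusion_matrix

# Per-class metrics and the confusion matrix for the best-performing model
y_pred = best_model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))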



Step 5: Best Performing Model - AdaBoost

AdaBoost’s superior performance makes it the model of choice for this prediction task. Its ensemble approach, which combines multiple weak models to form a stronger predictor, is especially effective when dealing with complex data like Bitcoin prices.

AdaBoost works by iteratively training a sequence of weak classifiers (typically decision trees) and adjusting the weights of misclassified instances. The model then combines the predictions of all these classifiers to produce the final output. AdaBoost’s ability to reduce bias and variance through this iterative correction process allows it to perform well even with noisy data.
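To make this iterative process concrete, scikit-learn exposes the weight and weighted error of each weak learner, as well as the ensemble's test accuracy after each boosting round. The following inspection sketch assumes best_model is the fitted AdaBoost classifier from the previous step:

# Weight and weighted training error of each weak learner in the fitted ensemble
for i, (w, err) in enumerate(zip(best_model.estimator_weights_, best_model.estimator_errors_)):
    print(f'Weak learner {i}: weight={w:.3f}, weighted error={err:.3f}')

# Test accuracy after each boosting round, showing how the ensemble improves (or saturates)
staged_acc = list(best_model.staged_score(X_test, y_test))
plt.plot(range(1, len(staged_acc) + 1), staged_acc)
plt.xlabel('Number of weak learners')
plt.ylabel('Test accuracy')
plt.title('AdaBoost test accuracy per boosting round')
plt.show()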

In the case of predicting Bitcoin returns, AdaBoost’s ability to correct errors from previous iterations helps it identify subtle patterns in the data that may be missed by simpler models like decision trees.


Step 6: Model Explanation using SHAP

While achieving good performance with a model is important, understanding how the model makes its predictions is equally crucial. For this, we turn to SHAP (Shapley Additive Explanations), a method based on cooperative game theory that provides an explanation for individual predictions made by machine learning models.

SHAP values represent the contribution of each feature to a specific prediction, allowing us to see how much each feature (such as tweet sentiment or closing price) influences the final prediction. We use SHAP to explain the predictions made by the AdaBoost model, helping us identify the most important features driving the model's decision-making process.

By visualizing SHAP values, we can interpret the impact of each feature on the model’s predictions. For example, a positive SHAP value for a particular feature means that the feature contributed positively to the model's prediction (increased the likelihood of a positive return), while a negative SHAP value indicates a negative contribution.

import shap

# Use SHAP for model interpretation.
# Note: shap.TreeExplainer does not support AdaBoost, so when the best model is AdaBoost
# we fall back to the model-agnostic (and slower) KernelExplainer with a sampled background set.
if isinstance(best_model, AdaBoostClassifier):
    explainer = shap.KernelExplainer(best_model.predict_proba, shap.sample(X_train, 100))
else:
    explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)

# For binary classification, KernelExplainer returns one array per class; keep the positive class
if isinstance(shap_values, list):
    shap_values = shap_values[1]

# Plot the SHAP summary plot to visualize feature importance
shap.summary_plot(shap_values, X_test)
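Beyond the summary plot, a dependence plot shows how the value of a single feature relates to its SHAP contribution. The feature name 'Close' is used here only as an example and should be replaced with whichever column the dataset actually uses:

# SHAP dependence plot for one feature (example column name)
shap.dependence_plot('Close', shap_values, X_test, interaction_index=None)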

Step 7: Interpreting the Results

The SHAP explanation reveals several key findings about which variables significantly influence the AdaBoost model’s predictions:

  1. Fear and Greed Index: The Fear and Greed Index has a significant effect on the model’s predictions. A shift from fear to greed (represented by red dots) often leads to a negative return, while a shift from greed to fear (represented by blue dots) results in a positive return. This suggests that market sentiment, as captured by this index, plays a crucial role in predicting price changes.
  2. Closing Price: The closing price is another critical factor in determining the next day’s return. When the closing price increases (represented by red color), the model predicts a negative return. Conversely, a decrease in the closing price (blue color) predicts a positive return. This highlights the importance of price action in forecasting future movements.
  3. Wikipedia Visits: The number of Wikipedia visits to the Bitcoin page is also a significant predictor. A decrease in visits (blue color) correlates with a negative return, reinforcing previous research that links public interest (as measured by online searches or page views) with Bitcoin price changes.
  4. Negative Tweets (VADER): Negative tweets, as identified by the VADER sentiment analysis tool, have an interesting relationship with Bitcoin returns. When the ratio of negative tweets increases, the model predicts a positive return for the next day, and when the ratio decreases, it predicts a negative return. This suggests that negative sentiment on social media can sometimes signal a rebound in Bitcoin's price.
  5. Candlestick Patterns: Candlestick patterns, such as the lower shadow (suggesting buying pressure) and the upper shadow (indicating selling pressure), also affect predictions. A strong lower shadow generally leads to a positive return, while a pronounced upper shadow suggests a negative return. This aligns with technical-analysis principles that read price patterns to anticipate future movements (see the sketch after this list for how these shadows are derived).
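For readers unfamiliar with these candlestick features, the shadows can be reconstructed from the daily open/high/low/close prices. The snippet below is a generic illustration rather than the original feature-engineering code, and it assumes the price columns are named Open, High, Low, and Close:

# Upper shadow: distance from the top of the candle body to the day's high (selling pressure)
# Lower shadow: distance from the bottom of the candle body to the day's low (buying pressure)
body_top = daily_data[['Open', 'Close']].max(axis=1)
body_bottom = daily_data[['Open', 'Close']].min(axis=1)
upper_shadow = daily_data['High'] - body_top
lower_shadow = body_bottom - daily_data['Low']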


Conclusion

In this section, we explored how machine learning models can be used to predict Bitcoin price movements. By training and evaluating several models, we found that AdaBoost was the best performer, achieving an accuracy of 62%. Using SHAP, we were able to explain the predictions made by the model, identifying key features like the Fear and Greed Index, closing price, Wikipedia visits, and VADER sentiment as important predictors. These findings align with existing research, emphasizing the role of market sentiment and public interest in forecasting Bitcoin prices.

If you're interested in diving deeper into the methods and replicating the analysis, all the code and notebooks used in this project are available here.

Future work could involve optimizing the models further or integrating additional features, such as macroeconomic indicators or real-time data sources.
