Building an LLM-Driven Stock Price Forecast (Prediction) with the S&P 500 Public Dataset
Juliano Souza
Director of Information Technology | Technology Mentor for Startups in the EMEA Region.
Making Money Move: A Practical Guide to Stock Price Prediction
Want to know where stock prices are heading? You're not alone. From Wall Street giants to independent researchers, everyone's trying to crack the code of market movements. I'll show you how to build your own stock prediction system using real S&P 500 data that updates daily.
This hands-on guide walks you through the entire process. We'll start with the basics - grabbing and cleaning up market data. Then we'll create a straightforward machine learning model to spot patterns and predict prices. I'll also share some interesting ways to use AI language models to factor in news and company reports that could affect stock performance.
By the end, you'll have a working system that can forecast stock prices up to two years out. While no prediction is perfect (if it were, we'd all be billionaires!), you'll learn practical techniques that many financial pros use to make more informed investment decisions.
Let's dive in and start building something useful.
Project and File Structure
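We will create a single Python file, tutorial_sp500_llm.py, containing all the code. The script expects the dataset at sp500_stocks/sp500_stocks.csv, and an optional requirements.txt can sit next to the script.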
Requirements File (Optional)
If you prefer to manage dependencies with a requirements.txt file, create one that lists the packages the script imports (pandas, numpy, matplotlib, and prophet), then install them with:
pip install -r requirements.txt
Main Script: tutorial_sp500_llm.py
Below is the complete code, contained in one single file:
#!/usr/bin/env python3
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from prophet import Prophet

# List of target symbols; add other tickers here
TARGET_SYMBOLS = [
    "AAPL",  # Apple
    "MSFT",  # Microsoft
    "AMZN",  # Amazon
    "TSLA",  # Tesla
    "GOOG",  # Alphabet
    "META",  # Meta
]

def load_full_data():
    """
    Load the CSV containing all S&P 500 data.
    Returns a DataFrame.
    Exits via sys.exit() if the file is not found.
    """
    csv_path = os.path.join("sp500_stocks", "sp500_stocks.csv")
    if not os.path.exists(csv_path):
        sys.exit(f"[ERROR] CSV file not found at: {csv_path}")
    print(f"[INFO] Loading dataset from: {csv_path}")
    df = pd.read_csv(csv_path)
    print(f"[INFO] DataFrame shape: {df.shape}")
    print(df.head(3))
    return df

def filter_by_symbol(df, symbol):
    """
    Return only the rows for the requested symbol.
    """
    df_symbol = df[df['Symbol'] == symbol].copy()
    if df_symbol.empty:
        print(f"[WARNING] No data for {symbol}.")
    else:
        print(f"[INFO] {symbol}: {len(df_symbol)} rows found.")
    return df_symbol

def preprocess_data_prophet(df_symbol):
    """
    Prepare the DataFrame in the format Prophet requires:
    - 'ds' (datetime) in place of 'Date'
    - 'y' (value) in place of 'Close'
    """
    # Work on a copy so the caller's DataFrame is left untouched
    df_symbol = df_symbol.copy()
    # Convert 'Date' to datetime and sort chronologically
    df_symbol['Date'] = pd.to_datetime(df_symbol['Date'])
    df_symbol = df_symbol.sort_values(by='Date')
    # Forward-fill missing values only after sorting, so each gap is
    # filled from the previous trading day rather than from file order
    df_symbol = df_symbol.ffill()
    # Prophet needs only two columns: ds (date) and y (value to predict)
    df_prophet = df_symbol[['Date', 'Close']].rename(
        columns={'Date': 'ds', 'Close': 'y'}
    )
    # Drop leading rows that forward-fill could not fill
    return df_prophet.dropna(subset=['y'])

def prophet_forecast(df_prophet, periods=365*2):
    """
    Train Prophet and generate a forecast for 'periods' days (2 years by default).
    Returns:
    - forecast: DataFrame with the predictions
    - model: the fitted Prophet object
    """
    # Initialize Prophet
    model = Prophet(daily_seasonality=True, yearly_seasonality=True)
    model.fit(df_prophet)
    # Build a DataFrame of future dates (history plus 'periods' calendar days)
    future = model.make_future_dataframe(periods=periods, freq='D')
    forecast = model.predict(future)
    return forecast, model

def main():
    # 1. Load the dataset
    df = load_full_data()
    # 2. Loop over the symbols
    for symbol in TARGET_SYMBOLS:
        print(f"\n[INFO] Processing symbol: {symbol}")
        df_symbol = filter_by_symbol(df, symbol)
        if df_symbol.empty:
            continue
        # 3. Prepare the data in Prophet format
        df_prophet = preprocess_data_prophet(df_symbol)
        # Check that we have enough data
        if len(df_prophet) < 50:
            print(f"[WARNING] {symbol}: too little data after preprocessing.")
            continue
        # 4. Fit the Prophet model and forecast 2 years ahead
        forecast, model = prophet_forecast(df_prophet, periods=365*2)
        print(f"[INFO] Showing part of the forecast for {symbol}:")
        print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(5))
        # 5. Main plot
        fig1 = model.plot(forecast, xlabel='Date', ylabel='Price (Close)')
        fig1.suptitle(f"2-year forecast for {symbol}", fontsize=14)
        out_file1 = f"{symbol}_forecast.png"
        fig1.savefig(out_file1)
        print(f"[INFO] Forecast plot saved to: {out_file1}")
        plt.close(fig1)
        # 6. Component plot (trend and seasonalities)
        fig2 = model.plot_components(forecast)
        fig2.suptitle(f"Forecast components for {symbol}", fontsize=14)
        out_file2 = f"{symbol}_forecast_components.png"
        fig2.savefig(out_file2)
        print(f"[INFO] Components plot saved to: {out_file2}")
        plt.close(fig2)


if __name__ == "__main__":
    main()
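One detail of prophet_forecast in the script above is worth a closer look: make_future_dataframe with freq='D' produces every calendar day, weekends included, while the price history only contains trading days. A common adjustment is to keep only Monday to Friday in the future dates. The sketch below illustrates the idea with the standard library only; the function name future_weekdays and the example date are hypothetical, and market holidays are not handled.

```python
from datetime import date, timedelta


def future_weekdays(last_observed: date, periods: int) -> list:
    """Return the Monday-Friday dates among the next `periods` calendar days."""
    calendar_days = (
        last_observed + timedelta(days=offset) for offset in range(1, periods + 1)
    )
    # date.weekday() returns 0 for Monday through 6 for Sunday
    return [day for day in calendar_days if day.weekday() < 5]


if __name__ == "__main__":
    # Hypothetical last trading day in the history (a Friday)
    horizon = future_weekdays(date(2024, 1, 5), periods=365 * 2)
    print(len(horizon), "weekday dates out of", 365 * 2, "calendar days")
```

With pandas, the equivalent filter on the frame returned by make_future_dataframe would be future[future['ds'].dt.dayofweek < 5], applied before calling model.predict.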
python3 -m venv env
source env/bin/activate
3. Install dependencies (via requirements.txt or individually):
pip install -r requirements.txt
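4. Ensure that the S&P 500 stocks file is in place. The script looks for sp500_stocks/sp500_stocks.csv relative to the current directory.
5. Execute the Python script:
python3 tutorial_sp500_llm.py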
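6. Wait for the script to finish. For each symbol it will print the last rows of the forecast and save two images: {symbol}_forecast.png and {symbol}_forecast_components.png.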
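Step 4 is the one most likely to go wrong, since the script reads the Date, Symbol, and Close columns and fails partway through if any of them is missing. A quick check of the CSV header before running the full script might look like the following sketch. The helper names missing_columns and check_file are hypothetical and not part of the script.

```python
import csv

# Columns the tutorial script reads from the dataset
REQUIRED_COLUMNS = {"Date", "Symbol", "Close"}


def missing_columns(lines, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the CSV header, sorted.

    `lines` is any iterable of text lines, such as an open file.
    """
    header = next(csv.reader(lines), [])
    present = {name.strip() for name in header}
    return sorted(set(required) - present)


def check_file(csv_path):
    """Open the CSV at `csv_path` and report which required columns are missing."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return missing_columns(handle)


if __name__ == "__main__":
    # In-memory sample header, used here instead of the real file
    print(missing_columns(["Date,Symbol,Open,High,Low\n"]))
```

Calling check_file("sp500_stocks/sp500_stocks.csv") should return an empty list when the dataset has the expected layout.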
Generated plots:
Example output for AAPL (Apple):
Here are some component plots, generated from the Kaggle dataset, that help in understanding the trend and the seasonal patterns at different time scales (week, month, day, etc.):
MSFT (Microsoft):
AMZN (Amazon):
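The plots show what the model predicts, not how far off it tends to be. One simple way to get a feel for that is to hold back the most recent part of the history, fit on the rest, and compare the forecast with the held-back prices. The sketch below shows the split and two common error measures in plain Python. The function names and the sample numbers are illustrative assumptions, not results from the dataset.

```python
def split_holdout(rows, holdout):
    """Split time-ordered rows into (train, test), keeping the last `holdout` rows as test."""
    if not 0 < holdout < len(rows):
        raise ValueError("holdout must be between 1 and len(rows) - 1")
    return rows[:-holdout], rows[-holdout:]


def mean_absolute_error(actual, predicted):
    """Average absolute difference, in the same units as the prices."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average absolute difference as a percentage of the actual price."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


if __name__ == "__main__":
    # Made-up closing prices, oldest first
    closes = [100.0, 102.0, 101.0, 105.0, 107.0, 110.0]
    train, test = split_holdout(closes, holdout=2)
    # Stand-in forecast: repeat the last training price for every held-back day
    naive_forecast = [train[-1]] * len(test)
    print("MAE:", mean_absolute_error(test, naive_forecast))
    print("MAPE (%):", mean_absolute_percentage_error(test, naive_forecast))
```

In the tutorial's setting, the train part would go to prophet_forecast and the yhat values for the held-back dates would take the place of naive_forecast. Prophet also ships its own helpers for this in prophet.diagnostics (cross_validation and performance_metrics).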
Data has emerged as the new gold
An invaluable resource for organizations that aspire to make high-impact, evidence-based decisions. From financial institutions using historical stock prices to forecast market behavior, to retailers analyzing consumer sentiment on social media to refine product offerings, access to public information empowers businesses to create predictive models that drive strategic outcomes. Healthcare providers, for instance, can combine publicly available patient statistics with broader demographic data to anticipate demand for certain medical services. Similarly, logistics and supply chain managers can incorporate open-source weather and traffic data to streamline route planning and reduce delivery times.
As public datasets continue to grow in both volume and variety, executives who invest in data-driven methodologies will discover a significant competitive edge. By integrating sophisticated analytics with publicly sourced information, organizations can not only gain deeper insights into emerging trends but also proactively adapt their business strategies. In short, the effective utilization of data—particularly publicly available data—has become a critical differentiator for companies seeking to optimize performance, innovate new offerings, and maintain long-term market leadership.