Building an LLM-Driven Stock Price Forecast (Prediction) with the S&P 500 Public Dataset

Making Money Move: A Practical Guide to Stock Price Prediction

Want to know where stock prices are heading? You're not alone. From Wall Street giants to independent researchers, everyone's trying to crack the code of market movements. I'll show you how to build your own stock prediction system using real S&P 500 data that updates daily.

This hands-on guide walks you through the entire process. We'll start with the basics: grabbing and cleaning up market data. Then we'll create a straightforward machine learning model to spot patterns and predict prices. I'll also share some interesting ways to use AI language models to factor in news and company reports that could affect stock performance.

By the end, you'll have a working system that can forecast stock prices up to two years out. While no prediction is perfect (if it were, we'd all be billionaires!), you'll learn practical techniques that many financial pros use to make more informed investment decisions.

Let's dive in and start building something useful.

Project and File Structure

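We will create a single Python file containing all the code. Here is the suggested structure:

sp500_stocks/
    sp500_stocks.csv         (dataset downloaded from Kaggle)
requirements.txt             (optional)
tutorial_sp500_llm.py        (main script)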

Requirements File (Optional)

If you prefer to manage dependencies with a requirements.txt file, you can create it with the following contents (the packages the script imports):

pandas
numpy
matplotlib
prophet

Then install them with:

pip install -r requirements.txt

Main Script: tutorial_sp500_llm.py

Below is the complete code, contained in one single file:

#!/usr/bin/env python3

import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from prophet import Prophet

# List of target symbols; add other tickers here
TARGET_SYMBOLS = [
    "AAPL",  # Apple
    "MSFT",  # Microsoft
    "AMZN",  # Amazon
    "TSLA",  # Tesla
    "GOOG",  # Alphabet
    "META",  # Meta
]

def load_full_data():
    """
    Load the CSV containing all S&P 500 data.
    Returns a DataFrame.
    Calls sys.exit() if the file is not found.
    """
    csv_path = os.path.join("sp500_stocks", "sp500_stocks.csv")
    if not os.path.exists(csv_path):
        sys.exit(f"[ERROR] CSV file not found at: {csv_path}")

    print(f"[INFO] Loading dataset from: {csv_path}")
    df = pd.read_csv(csv_path)
    print(f"[INFO] DataFrame shape: {df.shape}")
    print(df.head(3))
    return df

def filter_by_symbol(df, symbol):
    """
    Return only the rows for the requested symbol.
    """
    df_symbol = df[df['Symbol'] == symbol].copy()
    if df_symbol.empty:
        print(f"[WARNING] No data for {symbol}.")
    else:
        print(f"[INFO] {symbol}: {len(df_symbol)} rows found.")
    return df_symbol

def preprocess_data_prophet(df_symbol):
    """
    Prepare the DataFrame in the format Prophet requires:
      - 'ds' (datetime) in place of 'Date'
      - 'y' (value) in place of 'Close'
    """
    # Convert 'Date' to datetime
    if 'Date' in df_symbol.columns:
        df_symbol['Date'] = pd.to_datetime(df_symbol['Date'])

    # Sort by date first so the forward-fill below follows time order
    df_symbol.sort_values(by='Date', inplace=True)

    # Forward-fill any missing values
    df_symbol.ffill(inplace=True)

    # Rename columns for Prophet
    # Only two columns are needed: ds and y
    # ds = date, y = value to predict
    df_prophet = df_symbol[['Date', 'Close']].rename(
        columns={'Date': 'ds', 'Close': 'y'}
    )

    return df_prophet

def prophet_forecast(df_prophet, periods=365*2):
    """
    Fit Prophet and generate a forecast for 'periods' days (2 years by default).
    Returns:
      - forecast: DataFrame with the predictions
      - model: fitted Prophet object
    """
    # Initialize Prophet
    model = Prophet(daily_seasonality=True, yearly_seasonality=True)
    model.fit(df_prophet)

    # Build a DataFrame of future dates
    future = model.make_future_dataframe(periods=periods, freq='D')
    forecast = model.predict(future)

    return forecast, model

def main():
    # 1. Load the dataset
    df = load_full_data()

    # 2. Loop over the symbols
    for symbol in TARGET_SYMBOLS:
        print(f"\n[INFO] Processing symbol: {symbol}")

        df_symbol = filter_by_symbol(df, symbol)
        if df_symbol.empty:
            continue

        # 3. Prepare the data in Prophet format
        df_prophet = preprocess_data_prophet(df_symbol)

        # Check that there is enough data
        if len(df_prophet) < 50:
            print(f"[WARNING] {symbol}: too little data after preprocessing.")
            continue

        # 4. Fit the Prophet model and forecast 2 years ahead
        forecast, model = prophet_forecast(df_prophet, periods=365*2)

        print(f"[INFO] Showing part of the forecast for {symbol}:")
        print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(5))

        # 5. Main plot
        fig1 = model.plot(forecast, xlabel='Date', ylabel='Price (Close)')
        fig1.suptitle(f"2-year forecast for {symbol}", fontsize=14)
        out_file1 = f"{symbol}_forecast.png"
        fig1.savefig(out_file1)
        print(f"[INFO] Forecast plot saved to: {out_file1}")
        plt.close(fig1)

        # 6. Component plot (trend and seasonalities)
        fig2 = model.plot_components(forecast)
        fig2.suptitle(f"Forecast components for {symbol}", fontsize=14)
        out_file2 = f"{symbol}_forecast_components.png"
        fig2.savefig(out_file2)
        print(f"[INFO] Components plot saved to: {out_file2}")
        plt.close(fig2)

if __name__ == "__main__":
    main()
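
The script fits Prophet on the full price history, so by itself it gives no indication of how accurate the forecast is. One common check is to hold out the most recent rows, forecast them, and compare the error against a naive baseline. The sketch below assumes a DataFrame in the same ds/y layout that preprocess_data_prophet returns. The helper names (chronological_split, mae, mape) and the synthetic price series are illustrative only and are not part of the original script; the metric code is kept free of Prophet so it can be run on its own.

```python
import numpy as np
import pandas as pd


def chronological_split(df, holdout):
    """Split a ds/y DataFrame into (train, test), keeping time order."""
    df = df.sort_values("ds").reset_index(drop=True)
    return df.iloc[:-holdout], df.iloc[-holdout:]


def mae(actual, predicted):
    """Mean absolute error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual must be non-zero)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)


# Synthetic stand-in for one symbol's closing prices (not real market data).
dates = pd.date_range("2024-01-01", periods=10, freq="D")
prices = pd.DataFrame(
    {"ds": dates, "y": [100, 101, 102, 103, 104, 105, 106, 107, 108, 110]}
)

train, test = chronological_split(prices, holdout=2)

# Naive baseline: repeat the last training value for every held-out day.
naive_pred = np.full(len(test), train["y"].iloc[-1])

naive_mae = mae(test["y"], naive_pred)
naive_mape = mape(test["y"], naive_pred)
print(f"naive baseline: MAE={naive_mae:.2f}  MAPE={naive_mape:.2f}%")
```

To score the Prophet model the same way, one would fit it on train, predict the dates in test, and pass the resulting yhat column to mae and mape. A forecast that does not beat the naive baseline on held-out data is a sign the model is adding little.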
        

1. Open a terminal and navigate to the project folder.
2. (Optional) Create and activate a virtual environment:

python3 -m venv env
source env/bin/activate        

3. Install dependencies (via requirements.txt or individually):

pip install -r requirements.txt        

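4. Download the S&P 500 Stocks dataset from Kaggle and make sure sp500_stocks.csv sits in a sp500_stocks folder inside the current directory, which is where the script looks for it: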

https://www.kaggle.com/datasets/andrewmvd/sp-500-stocks

5. Execute the Python script:

python3 tutorial_sp500_llm.py

6. Wait for the script to finish. It will:

  • Load and preprocess the data
  • Train a baseline ML model (Prophet) for each symbol
  • Predict stock prices for 2 years into the future
  • Demonstrate a placeholder LLM integration
  • Generate plots showing historical data vs. forecasted values, and the forecast components, for every symbol listed in the script
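
The script listing above does not show the LLM step, so what follows is only a hypothetical sketch of what a placeholder integration might look like. It assumes some callable that takes a prompt string and returns the model's text reply; here a keyword-based stub (fake_llm) stands in for a real API call so the example runs offline. Every name below (build_prompt, parse_score, fake_llm, daily_sentiment) and the headlines themselves are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def build_prompt(symbol, headline):
    """Ask for a single sentiment number so the reply is easy to parse."""
    return (
        f"Rate the likely impact of this headline on {symbol} stock "
        f"from -1 (very negative) to 1 (very positive). "
        f"Reply with the number only.\nHeadline: {headline}"
    )


def parse_score(reply):
    """Convert the reply to a float clamped to [-1, 1]; fall back to 0.0."""
    try:
        return max(-1.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0


def fake_llm(prompt):
    """Stand-in for a real LLM call: a crude keyword rule, NOT a model."""
    headline = prompt.rsplit("Headline:", 1)[-1].lower()
    if "beats" in headline or "record" in headline:
        return "0.8"
    if "recall" in headline or "lawsuit" in headline:
        return "-0.6"
    return "not sure"  # unparseable on purpose, to exercise the fallback


def daily_sentiment(symbol, dated_headlines, llm=fake_llm):
    """Average the per-headline scores for each date."""
    by_day = defaultdict(list)
    for day, headline in dated_headlines:
        by_day[day].append(parse_score(llm(build_prompt(symbol, headline))))
    return {day: mean(scores) for day, scores in sorted(by_day.items())}


# Invented headlines, for illustration only.
headlines = [
    ("2024-01-02", "Company beats earnings expectations"),
    ("2024-01-02", "Regulator opens lawsuit over app store fees"),
    ("2024-01-03", "Analysts publish mixed outlook"),
]
sentiment = daily_sentiment("AAPL", headlines)
print(sentiment)
```

A per-day score like this could be joined onto the ds/y frame and offered to Prophet as an extra regressor through add_regressor, with the caveat that future values of the regressor would then also be needed at prediction time.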

Generated plots:

Example output for AAPL (Apple):


Here are the component plots produced from the Kaggle dataset. They help to understand the trend and the seasonal patterns in the series at different time scales (weekly, yearly, daily):

MSFT (Microsoft):

AMZN (Amazon):
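
For reference, the component plots reflect the additive structure that Prophet fits by default. As a rough summary, the forecast is modelled as a sum of a trend term, seasonal terms, and holiday effects:

```latex
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
```

Here g(t) is the trend, s(t) collects the periodic seasonal terms (daily, weekly, and yearly for this script), h(t) is the holiday effect, and \varepsilon_t is the error term. The script configures no holidays, so h(t) contributes nothing in this case.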

Data has emerged as the new gold

Data is an invaluable resource for organizations that aspire to make high-impact, evidence-based decisions. From financial institutions using historical stock prices to forecast market behavior, to retailers analyzing consumer sentiment on social media to refine product offerings, access to public information empowers businesses to create predictive models that drive strategic outcomes. Healthcare providers, for instance, can combine publicly available patient statistics with broader demographic data to anticipate demand for certain medical services. Similarly, logistics and supply chain managers can incorporate open-source weather and traffic data to streamline route planning and reduce delivery times.

As public datasets continue to grow in both volume and variety, executives who invest in data-driven methodologies will discover a significant competitive edge. By integrating sophisticated analytics with publicly sourced information, organizations can not only gain deeper insights into emerging trends but also proactively adapt their business strategies. In short, the effective utilization of data—particularly publicly available data—has become a critical differentiator for companies seeking to optimize performance, innovate new offerings, and maintain long-term market leadership.

