Building an LLM-Driven Stock Price Forecast (Prediction) with the S&P 500 Public Dataset

Juliano Souza

Director of Information Technology | Technology Mentor for Startups in the EMEA Region.

Published: December 22, 2024

Making Money Move: A Practical Guide to Stock Price Prediction

Want to know where stock prices are heading? You're not alone. From Wall Street giants to independent researchers, everyone's trying to crack the code of market movements. I'll show you how to build your own stock prediction system using real S&P 500 data that updates daily.

This hands-on guide walks you through the entire process. We'll start with the basics - grabbing and cleaning up market data. Then we'll create a straightforward machine learning model to spot patterns and predict prices. I'll also share some interesting ways to use AI language models to factor in news and company reports that could affect stock performance.

By the end, you'll have a working system that can forecast stock prices up to two years out. While no prediction is perfect (if it were, we'd all be billionaires!), you'll learn practical techniques that many financial pros use to make more informed investment decisions.

Let's dive in and start building something useful.

Project and File Structure

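We will create a single Python file containing all the code. Here is the suggested structure (the folder and file names match the paths used in the script):

tutorial_sp500_llm.py        main script
requirements.txt             optional dependency list
sp500_stocks/
    sp500_stocks.csv         dataset downloaded from Kaggle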

Requirements File (Optional)

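If you prefer to manage dependencies with a requirements.txt file, you can create it with the following contents (the third-party packages the script imports):

pandas
numpy
matplotlib
prophet

Then install them with:

pip install -r requirements.txt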
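Before running the main script, it can help to confirm that the packages it imports are actually available in the active environment. The following is a minimal sketch using only the standard library; the `missing_modules` helper and the `REQUIRED_MODULES` list are illustrative names, not part of the original script.

```python
import importlib.util

# Third-party modules imported by the tutorial script.
REQUIRED_MODULES = ["pandas", "numpy", "matplotlib", "prophet"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If anything is reported missing, installing from requirements.txt (or with pip individually) should resolve it.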

Main Script: tutorial_sp500_llm.py

Below is the complete code, contained in one single file:

#!/usr/bin/env python3

import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from prophet import Prophet

# Target symbols; add other tickers here
TARGET_SYMBOLS = [
    "AAPL",  # Apple
    "MSFT",  # Microsoft
    "AMZN",  # Amazon
    "TSLA",  # Tesla
    "GOOG",  # Alphabet
    "META",  # Meta
]

def load_full_data():
    """
    Load the CSV containing all the S&P 500 data.
    Returns a DataFrame.
    Calls sys.exit() if the file is not found.
    """
    csv_path = os.path.join("sp500_stocks", "sp500_stocks.csv")
    if not os.path.exists(csv_path):
        sys.exit(f"[ERROR] CSV file not found at: {csv_path}")

    print(f"[INFO] Loading dataset from: {csv_path}")
    df = pd.read_csv(csv_path)
    print(f"[INFO] DataFrame shape: {df.shape}")
    print(df.head(3))
    return df

def filter_by_symbol(df, symbol):
    """
    Return only the rows for the requested symbol.
    """
    df_symbol = df[df['Symbol'] == symbol].copy()
    if df_symbol.empty:
        print(f"[WARNING] No data for {symbol}.")
    else:
        print(f"[INFO] {symbol}: {len(df_symbol)} rows found.")
    return df_symbol

def preprocess_data_prophet(df_symbol):
    """
    Prepare the DataFrame in the format Prophet requires:
      - 'ds' (datetime) instead of 'Date'
      - 'y' (value) instead of 'Close'
    """
    # Work on a copy so the caller's DataFrame is not modified
    df_symbol = df_symbol.copy()

    # Convert 'Date' to datetime
    df_symbol['Date'] = pd.to_datetime(df_symbol['Date'])

    # Sort by date first, so the forward-fill below follows time order
    df_symbol = df_symbol.sort_values(by='Date')

    # Forward-fill missing values with the last earlier known value
    df_symbol = df_symbol.ffill()

    # Prophet needs only two columns: ds (date) and y (value to predict)
    df_prophet = df_symbol[['Date', 'Close']].rename(
        columns={'Date': 'ds', 'Close': 'y'}
    )

    # Drop leading rows that still have no price after the forward-fill
    df_prophet = df_prophet.dropna(subset=['y'])

    return df_prophet

def prophet_forecast(df_prophet, periods=365*2):
    """
    Train Prophet and generate a forecast for 'periods' days (2 years by default).
    Returns:
      - forecast: DataFrame with the predictions
      - model: the fitted Prophet object
    """
    # Initialize and fit Prophet
    model = Prophet(daily_seasonality=True, yearly_seasonality=True)
    model.fit(df_prophet)

    # Build a DataFrame of future dates.
    # freq='D' means calendar days, so weekends are included.
    future = model.make_future_dataframe(periods=periods, freq='D')
    forecast = model.predict(future)

    return forecast, model

def main():
    # 1. Load the dataset
    df = load_full_data()

    # 2. Loop over the symbols
    for symbol in TARGET_SYMBOLS:
        print(f"\n[INFO] Processing symbol: {symbol}")

        df_symbol = filter_by_symbol(df, symbol)
        if df_symbol.empty:
            continue

        # 3. Prepare the data in Prophet format
        df_prophet = preprocess_data_prophet(df_symbol)

        # Check that we have enough data
        if len(df_prophet) < 50:
            print(f"[WARNING] {symbol}: too few rows after preprocessing.")
            continue

        # 4. Fit the Prophet model and forecast 2 years ahead
        forecast, model = prophet_forecast(df_prophet, periods=365*2)

        print(f"[INFO] Showing part of the forecast for {symbol}:")
        print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(5))

        # 5. Main plot
        fig1 = model.plot(forecast, xlabel='Date', ylabel='Price (Close)')
        fig1.suptitle(f"2-year forecast for {symbol}", fontsize=14)
        out_file1 = f"{symbol}_forecast.png"
        fig1.savefig(out_file1)
        print(f"[INFO] Forecast plot saved to: {out_file1}")
        plt.close(fig1)

        # 6. Component plot (trend and seasonalities)
        fig2 = model.plot_components(forecast)
        fig2.suptitle(f"Forecast components for {symbol}", fontsize=14)
        out_file2 = f"{symbol}_forecast_components.png"
        fig2.savefig(out_file2)
        print(f"[INFO] Components plot saved to: {out_file2}")
        plt.close(fig2)

if __name__ == "__main__":
    main()
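One detail of prophet_forecast worth knowing: because the future dates are generated with freq='D', the forecast includes Saturdays and Sundays, when the market is closed. A possible refinement, which is not part of the original script, is to keep only weekdays in the future DataFrame before calling predict. The sketch below uses a plain pandas date range as a stand-in for the output of make_future_dataframe; `keep_weekdays` is a hypothetical helper name, and market holidays are not handled.

```python
import pandas as pd


def keep_weekdays(future):
    """Return only the rows of `future` whose 'ds' date is Monday to Friday."""
    # dayofweek: Monday=0 ... Sunday=6
    return future[future["ds"].dt.dayofweek < 5].reset_index(drop=True)


if __name__ == "__main__":
    # Stand-in for model.make_future_dataframe(periods=14, freq='D')
    future = pd.DataFrame({"ds": pd.date_range("2024-12-23", periods=14, freq="D")})
    weekdays_only = keep_weekdays(future)
    print(len(future), "calendar days ->", len(weekdays_only), "weekdays")
```

In the script, this would mean passing keep_weekdays(future) to model.predict instead of future.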

1. Open a terminal and navigate to the project folder.
2. (Optional) Create and activate a virtual environment:

python3 -m venv env
source env/bin/activate

3. Install dependencies (via requirements.txt or individually):

pip install -r requirements.txt

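4. Download the S&P 500 Stocks dataset from Kaggle and make sure sp500_stocks.csv sits in a sp500_stocks folder inside the current directory, which is where the script looks for it: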

https://www.kaggle.com/datasets/andrewmvd/sp-500-stocks

5. Execute the Python script:

python3 tutorial_sp500_llm.py

6. Wait for the script to finish. It will:

Load and preprocess the data
Train a baseline ML model (Prophet) for each symbol
Predict stock prices for 2 years into the future
Demonstrate a placeholder LLM integration
Generate a plot showing historical data vs. forecasted values for every symbol listed in the script
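The script listing does not show what the placeholder LLM integration looks like, so the following is only one hypothetical shape it could take: score each news headline with a language model and average the scores per day. Everything here is an assumption for illustration, including the function names, the keyword lists, and the made-up headlines. The `stub_llm_score` function stands in for a real model call so that the sketch runs offline.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical keyword lists used only by the offline stub below.
POSITIVE_WORDS = {"beats", "record", "growth", "upgrade"}
NEGATIVE_WORDS = {"misses", "lawsuit", "recall", "downgrade"}


def stub_llm_score(headline):
    """Stand-in for an LLM call: return a sentiment score in [-1, 1].

    A real integration would send the headline to a language model and
    parse its answer; this stub only counts keywords.
    """
    words = set(headline.lower().split())
    pos = len(words & POSITIVE_WORDS)
    neg = len(words & NEGATIVE_WORDS)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)


def daily_sentiment(headlines, scorer=stub_llm_score):
    """Average the scorer's output per date.

    `headlines` is an iterable of (date_string, headline_text) pairs.
    Returns a dict mapping each date to its mean score.
    """
    scores_by_date = defaultdict(list)
    for date, text in headlines:
        scores_by_date[date].append(scorer(text))
    return {date: mean(scores) for date, scores in scores_by_date.items()}


if __name__ == "__main__":
    # Made-up headlines, for illustration only.
    sample = [
        ("2024-12-20", "Apple beats estimates on record services growth"),
        ("2024-12-20", "Apple faces lawsuit over app store fees"),
        ("2024-12-21", "Analysts issue upgrade for Apple"),
    ]
    print(daily_sentiment(sample))
```

A per-day score like this could be merged into the Prophet input as an extra column and registered with Prophet's add_regressor, although values for that column would then also have to be supplied for the forecast period.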

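Generated plots:

For each symbol the script saves two images: {symbol}_forecast.png (the forecast) and {symbol}_forecast_components.png (the trend and seasonality components).

[Figure: example output for AAPL (Apple)]

The component plots help to show the trend and the seasonal patterns in the time series (weekly, yearly, daily):

[Figure: output for MSFT (Microsoft)]

[Figure: output for AMZN (Amazon)]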
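The plots compare history and forecast only visually. One way to put a number on forecast quality, which the original script does not do, is to hold back the most recent part of the history, forecast that period, and measure the error against the real prices. Below is a minimal sketch of two common error measures in plain Python; the function names and the sample numbers are illustrative.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average absolute error as a percentage of the actual value.

    Assumes no actual value is zero, which holds for stock prices.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Illustrative numbers, not real prices.
    actual = [100.0, 102.0, 104.0]
    predicted = [98.0, 103.0, 108.0]
    print(f"MAE:  {mean_absolute_error(actual, predicted):.2f}")
    print(f"MAPE: {mean_absolute_percentage_error(actual, predicted):.2f}%")
```

With the script's output, `actual` would be the held-back 'y' values and `predicted` the matching 'yhat' values from the forecast DataFrame.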

Data has emerged as the new gold

Data is an invaluable resource for organizations that aspire to make high-impact, evidence-based decisions. From financial institutions using historical stock prices to forecast market behavior, to retailers analyzing consumer sentiment on social media to refine product offerings, access to public information empowers businesses to create predictive models that drive strategic outcomes. Healthcare providers, for instance, can combine publicly available patient statistics with broader demographic data to anticipate demand for certain medical services. Similarly, logistics and supply chain managers can incorporate open-source weather and traffic data to streamline route planning and reduce delivery times.

As public datasets continue to grow in both volume and variety, executives who invest in data-driven methodologies will discover a significant competitive edge. By integrating sophisticated analytics with publicly sourced information, organizations can not only gain deeper insights into emerging trends but also proactively adapt their business strategies. In short, the effective utilization of data—particularly publicly available data—has become a critical differentiator for companies seeking to optimize performance, innovate new offerings, and maintain long-term market leadership.
