Extracting and Transforming Tick Data from B3

Introduction

The Brazilian stock exchange, known as B3 (Brasil, Bolsa, Balcão), is one of the largest exchanges in the world and a rich source of financial data. Among the types of data available, tick data is particularly valuable for high-frequency trading, algorithmic strategies, and quantitative analysis. This article walks through extracting tick data from B3 and transforming it into a clean pandas DataFrame that you can feed into your financial models.

Understanding Tick Data

Tick data records every transaction that occurs in the market, including details such as the price, volume, and timestamp of each trade. This granular level of data is essential for:

  • Analyzing market microstructure
  • Developing and backtesting trading algorithms
  • Conducting detailed market research

Steps to Extract Tick Data from B3

# Import necessary libraries
import os
import wget
import zipfile
import pandas as pd

# Define the trading date to download. The value below is only an example, and
# the YYYY-MM-DD format is an assumption to verify against B3's file service.
date = '2024-06-03'

# Define the URL to download the tick data from B3 for that date
url = 'https://arquivos.b3.com.br/apinegocios/tickercsv/' + date

# Define the file name for the downloaded data, which is a ZIP file
file_name = date + '_B3_TickData.zip'

# Download the ZIP file from the URL and save it with the defined file name
wget.download(url, file_name)

# Get the directory where the ZIP file is saved (abspath avoids an empty
# string when file_name has no directory component)
zip_dir = os.path.dirname(os.path.abspath(file_name))

# Create a ZipFile object to work with the ZIP file
with zipfile.ZipFile(file_name, 'r') as zip_ref:
    # Extract all the contents of the ZIP file into the directory
    zip_ref.extractall(zip_dir)

# Print a blank line, then a confirmation message after extraction
print('\n')
print("1/8 - Extracted all contents of", file_name)

# Get the current working directory to locate the extracted files
folder_ref = os.getcwd()

# List all files in the current working directory
files = os.listdir(folder_ref)

# Filter the list to include only the text files with '_NEGOCIOSAVISTA.txt' in their name
files_txt = [i for i in files if i.endswith('_NEGOCIOSAVISTA.txt')]

# Read the first text file in the filtered list into a pandas DataFrame
df = pd.read_csv(files_txt[0], sep=";")


Explanation

Import Libraries: The script begins by importing the libraries it needs: os for directory operations, wget (a third-party package) for downloading files, zipfile for handling ZIP files, and pandas for data manipulation.

Define URL and File Name: The url variable is constructed using a base URL and the date variable, which represents the date for which tick data is being downloaded. The file_name variable is the name under which the downloaded ZIP file will be saved.
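As a minimal sketch of this step, the URL and file name can be built from a datetime.date instead of a hand-typed string. The helper name build_download_target is hypothetical, and the YYYY-MM-DD date format is an assumption that should be checked against B3's file service.

```python
from datetime import date

# Base URL taken from the script above.
BASE_URL = 'https://arquivos.b3.com.br/apinegocios/tickercsv/'


def build_download_target(day):
    """Return (url, file_name) for one trading day.

    Assumes the endpoint expects the date as YYYY-MM-DD.
    """
    date_str = day.isoformat()
    url = BASE_URL + date_str
    file_name = date_str + '_B3_TickData.zip'
    return url, file_name


url, file_name = build_download_target(date(2024, 6, 3))
print(url)
print(file_name)
```

Starting from a date object rules out malformed strings and makes it easy to loop over a range of trading days.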

Download ZIP File: The wget.download() function is used to download the ZIP file from the specified URL and save it locally with the defined file name.
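If installing wget is not an option, the same download can be sketched with the standard library alone. This is a swapped-in alternative, not the method the script uses; download_file and its timeout parameter are hypothetical names chosen for illustration.

```python
import shutil
import urllib.request


def download_file(url, file_name, timeout=60):
    """Stream the response body of url into file_name and return file_name.

    urllib raises URLError (or HTTPError for HTTP error statuses) on failure,
    so a missing file for a non-trading day surfaces as an exception.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(file_name, 'wb') as out:
        shutil.copyfileobj(response, out)
    return file_name
```

It would be called as download_file(url, file_name), with the same two variables the script passes to wget.download().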

Extract ZIP File:

  • The directory of the ZIP file is obtained using os.path.dirname().
  • A ZipFile object is created to open and manipulate the ZIP file.
  • The extractall() method is used to extract all contents of the ZIP file into the same directory where the ZIP file is located.
  • A confirmation message is printed to indicate successful extraction.
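The extraction steps above can be wrapped in a small function, sketched here under the assumption that the archive should be unpacked next to itself. The name extract_zip is hypothetical. It also returns the member names, which removes the need to guess what the archive contained.

```python
import os
import zipfile


def extract_zip(file_name):
    """Extract file_name into its own directory.

    Returns (zip_dir, members), where members lists the archive contents.
    """
    # abspath guarantees a non-empty directory even for a bare file name.
    zip_dir = os.path.dirname(os.path.abspath(file_name))
    with zipfile.ZipFile(file_name, 'r') as zip_ref:
        members = zip_ref.namelist()
        zip_ref.extractall(zip_dir)
    return zip_dir, members
```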

Locate Extracted Files:

  • The current working directory is obtained using os.getcwd().
  • A list of all files in the current directory is created using os.listdir().
  • This list is filtered to include only the text files ending with _NEGOCIOSAVISTA.txt.
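A variation on the file lookup, sketched with pathlib, takes the folder as an argument instead of relying on the current working directory. The function name find_trade_files is hypothetical. Sorting the result makes the choice of the first file deterministic when several days have been extracted into the same folder.

```python
from pathlib import Path


def find_trade_files(folder, suffix='_NEGOCIOSAVISTA.txt'):
    """Return the sorted paths in folder whose names end with suffix."""
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and path.name.endswith(suffix)
    )
```

An empty list signals that nothing matched, which is worth checking before indexing the first element.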

Read Data into DataFrame: The first file in the filtered list is read into a pandas DataFrame using pd.read_csv(), with the semicolon (;) as the separator.
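To see what this read produces without downloading anything, the sketch below parses a tiny in-memory sample. The column names come from the script; the rows themselves are invented for illustration and are not real B3 data.

```python
import io

import pandas as pd

# Invented sample in the same semicolon-separated layout.
sample = (
    'DataReferencia;CodigoInstrumento;PrecoNegocio;QuantidadeNegociada;HoraFechamento\n'
    '2024-06-03;PETR4;38,15;100;100000123\n'
    '2024-06-03;VALE3;61,40;200;93015456\n'
)

df_sample = pd.read_csv(io.StringIO(sample), sep=';')
print(df_sample.dtypes)
print(df_sample['PrecoNegocio'].iloc[0])
```

Two details are worth noticing. PrecoNegocio is read as text because of the decimal comma, and HoraFechamento is read as an integer, so a time such as 09:30:15.456 loses its leading zero. Both are handled in the transformation stage.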

This code downloads, extracts, and reads one day of B3 tick data into a pandas DataFrame. That DataFrame is the starting point for the transformation steps that follow.

Steps to Transform Tick Data from B3


# Import necessary libraries
import pandas as pd

# Update 'PrecoNegocio' column to replace decimal commas with dots and convert to float
# (regex=False treats the comma as a literal character)
df['PrecoNegocio'] = df.PrecoNegocio.str.replace(",", ".", regex=False).astype('float')
print('2/8 - PrecoNegocio Updated')

# Fill missing values in 'CodigoParticipanteComprador' and 'CodigoParticipanteVendedor' with 0
# Convert the columns to integer type and then to string type
df[['CodigoParticipanteComprador', 'CodigoParticipanteVendedor']] = df[['CodigoParticipanteComprador', 'CodigoParticipanteVendedor']].fillna(0)
df[['CodigoParticipanteComprador', 'CodigoParticipanteVendedor']] = df[['CodigoParticipanteComprador', 'CodigoParticipanteVendedor']].astype('int').astype('str')
print('3/8 - Codigos Participantes Updated')

# Update 'HoraFechamento' to ensure it is a string and pad with leading zeros to make it 9 characters long
df['HoraFechamento'] = df['HoraFechamento'].astype(str).str.zfill(9)
# Reformat 'HoraFechamento' to the format HH:MM:SS.sss
df['HoraFechamento'] = df['HoraFechamento'].apply(lambda x: f"{x[:2]}:{x[2:4]}:{x[4:6]}.{x[6:9]}")
# Convert the formatted string to datetime.time type
df['HoraFechamento'] = pd.to_datetime(df['HoraFechamento'], format='%H:%M:%S.%f').dt.time
print('4/8 - HoraFechamento Updated')

# Create a new index by concatenating 'CodigoInstrumento', 'CodigoIdentificadorNegocio', 'DataReferencia', and 'HoraFechamento'
str1 = df.CodigoInstrumento
str2 = df.CodigoIdentificadorNegocio.astype(str)
str3 = df.DataReferencia.astype(str)
str4 = df.HoraFechamento.astype(str)
newindex = str1 + '_' + str2 + '_' + str3 + '_' + str4
df['Index'] = newindex
# Set the new 'Index' column as the index of the DataFrame
df = df.set_index('Index')
print('5/8 - New_Index Created')

# Remove the specified columns from the DataFrame
df.drop(columns=['AcaoAtualizacao', 'TipoSessaoPregao', 'DataNegocio'], inplace=True)
print('6/8 - Columns Remove Updated')

# Rename the columns using a dictionary to map old names to new names
dicionario = {
    'DataReferencia': 'Dia',
    'CodigoInstrumento': 'Instrumento',
    'PrecoNegocio': 'Preco',
    'QuantidadeNegociada': 'Quantidade',
    'HoraFechamento': 'Hora',
    'CodigoIdentificadorNegocio': 'Cod_Negocio',
    'CodigoParticipanteComprador': 'Comprador',
    'CodigoParticipanteVendedor': 'Vendedor'
}
df.rename(dicionario, axis=1, inplace=True)
print('7/8 - Columns Rename Updated')

# Reorder the columns in the specified new order
new_order = ['Cod_Negocio', 'Instrumento', 'Dia', 'Hora', 'Preco', 'Quantidade', 'Comprador', 'Vendedor']
df = df[new_order]
print('8/8 - Columns New Order Updated')

# Print completion message
print('Data Extraction and Transformation - Done')

Explanation

Updating ‘PrecoNegocio’ Column:

  • Replaces the decimal commas in the PrecoNegocio column with dots so the values can be parsed as floats.
  • Converts the column to float type.
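The price conversion can be tried in isolation on a small Series. The values below are invented for illustration.

```python
import pandas as pd

# Invented prices written with a decimal comma, as in the source file.
prices = pd.Series(['38,15', '61,40', '7,00'], name='PrecoNegocio')

# Replace the literal comma with a dot, then cast the strings to float.
converted = prices.str.replace(',', '.', regex=False).astype(float)
print(converted.tolist())
```

An alternative is to pass decimal=',' to pd.read_csv() so the column is parsed as float at load time, which makes this step unnecessary.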

Updating Participant Codes:

  • Fills missing values in CodigoParticipanteComprador and CodigoParticipanteVendedor columns with 0.
  • Converts these columns to integer type and then to string type.
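The reason for the intermediate integer cast is easier to see on a toy frame. A column containing missing values is read as float, so converting it straight to string would produce codes such as '3.0'. The sketch below uses invented codes and assumes, as the script does, that 0 is an acceptable placeholder for a missing participant.

```python
import numpy as np
import pandas as pd

cols = ['CodigoParticipanteComprador', 'CodigoParticipanteVendedor']

# Invented participant codes; NaN marks a missing value.
codes = pd.DataFrame({
    'CodigoParticipanteComprador': [3.0, np.nan, 72.0],
    'CodigoParticipanteVendedor': [np.nan, 8.0, 120.0],
})

# Fill gaps with 0, drop the fractional part, then store the codes as text.
cleaned = codes[cols].fillna(0).astype('int').astype('str')
print(cleaned)
```

Storing the codes as strings signals that they are identifiers rather than quantities, so nobody sums or averages them by accident.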

Updating ‘HoraFechamento’ Column:

  • Ensures the HoraFechamento column is a string and pads it with leading zeros to make it 9 characters long.
  • Reformats the column to the format HH:MM:SS.sss and converts it to datetime.time type.
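The same logic applied to a single value shows why the padding matters. The function below is a plain-Python sketch with a hypothetical name, parse_hora, and it assumes the raw value is an integer in HHMMSSmmm form.

```python
from datetime import datetime, time


def parse_hora(raw):
    """Convert an HHMMSSmmm value such as 93015123 to datetime.time."""
    text = str(raw).zfill(9)  # restore the leading zero lost when read as int
    formatted = f'{text[:2]}:{text[2:4]}:{text[4:6]}.{text[6:9]}'
    return datetime.strptime(formatted, '%H:%M:%S.%f').time()


print(parse_hora(93015123))  # prints 09:30:15.123000
```

Without zfill(9), the eight-digit value 93015123 would be sliced as 93:01:51 and fail to parse.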

Creating a New Index:

  • Concatenates CodigoInstrumento, CodigoIdentificadorNegocio, DataReferencia, and HoraFechamento columns to create a new index.
  • Sets this new index as the index of the DataFrame.
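A composite key like this is only useful if it is unique, which is worth verifying rather than assuming. The sketch below builds the key on invented rows and checks it with Index.is_unique.

```python
import pandas as pd

# Invented trades; two share a timestamp but differ in trade identifier.
trades = pd.DataFrame({
    'CodigoInstrumento': ['PETR4', 'PETR4', 'VALE3'],
    'CodigoIdentificadorNegocio': [10, 20, 10],
    'DataReferencia': ['2024-06-03'] * 3,
    'HoraFechamento': ['10:00:00.123', '10:00:00.123', '10:00:01.000'],
})

key = (
    trades['CodigoInstrumento'] + '_'
    + trades['CodigoIdentificadorNegocio'].astype(str) + '_'
    + trades['DataReferencia'] + '_'
    + trades['HoraFechamento']
)
trades = trades.set_index(key.rename('Index'))
print(trades.index.is_unique)
```

If is_unique ever comes back False on real data, the duplicated keys can be inspected with trades.index[trades.index.duplicated()].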

Removing Unnecessary Columns:

  • Drops the AcaoAtualizacao, TipoSessaoPregao, and DataNegocio columns from the DataFrame.

Renaming Columns:

  • Renames the columns using a dictionary that maps old column names to new, more meaningful names.

Reordering Columns:

  • Reorders the columns in the specified order to ensure a logical and consistent structure.
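Renaming and reordering can also be chained without inplace=True, as in this sketch on an invented one-row frame that uses a subset of the mapping above.

```python
import pandas as pd

# Invented row with a subset of the source columns.
raw = pd.DataFrame({
    'DataReferencia': ['2024-06-03'],
    'CodigoInstrumento': ['PETR4'],
    'AcaoAtualizacao': [0],
    'PrecoNegocio': [38.15],
    'QuantidadeNegociada': [100],
})

rename_map = {
    'DataReferencia': 'Dia',
    'CodigoInstrumento': 'Instrumento',
    'PrecoNegocio': 'Preco',
    'QuantidadeNegociada': 'Quantidade',
}
new_order = ['Instrumento', 'Dia', 'Preco', 'Quantidade']

# Selecting by new_order both reorders and keeps only the listed columns.
tidy = raw.rename(columns=rename_map)[new_order]
print(list(tidy.columns))
```

Because the final selection keeps only the listed columns, any column left out of new_order, such as AcaoAtualizacao here, is dropped as a side effect.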

Completion Message:

  • Prints a completion message to indicate that the data extraction and transformation process is done.

This transformation process ensures the tick data from B3 is clean, well-structured, and ready for analysis or further processing.







