Basic Data Preparation with Python (Stock Price Data)
Python stock price analysis 03: Learn typical data preparation process with stock prices data
Python Stock Price Data Analysis Series
Introduction
In my previous?article, I talked about extracting a list of stock prices online using Python. At the end of the tutorial, I stored the data in multiple JSON files.
As a data scientist or data engineer, you might be asked to convert hundreds if not thousands of files into 1 table before pushing them into the database. These files can be CSV, JSON, .xlsb, etc.
In this tutorial, we’re going to combine these JSON files that we get in the previous tutorials into one file.
Data & Resources
Data:?GDrive Link
Medium:?medium.com/@foocheechuan
Youtube:?Chee-Chuan
If you didn’t follow the tutorial in my previous article, download the data above to follow along.
Read the files
I used a relative path to navigate to the folder used in my previous article. In your case, you can just use path = ‘your-folder-path’.
Then we use os.listdir to list out all the files in that folder.
import os
path = os.path.abspath(f"../04 Extract Stock Prices in Bulk/JSON")
json_folder = os.listdir(path)
json_folder
Since there are multiple files in the folder, we need to loop through the files and read the file one-by-one.
import json
for files in json_folder:
# Reading from json file
with open(f'{path}\\{files}', 'r') as openfile:
json_object = json.load(openfile)
It’s nothing if we just keep replacing the data every time we read a file. So let’s convert json_object into a pandas data frame before continuing to the next element.
Convert into a Pandas Data?Frame
Convert json_object into a pandas data frame in each loop. I’ll copy the code I used in my previous article about extracting stock prices online.
import pandas as pd
import json
for files in json_folder:
# Reading from json file
with open(f'{path}\\{files}', 'r') as openfile:
json_object = json.load(openfile)
data = json_object['Time Series (Daily)']
# Convert into data frame
df = pd.DataFrame.from_dict({(i): data[i]
for i in data.keys()},
orient='index')
df_date = df.rename_axis('Date').reset_index()
This will help convert each of our JSON files into the following format.
Label with its respective ticker?symbol
There is one problem with the data frame. When we loop through all stock data, we wouldn’t know which stock the data belong to after we append them into one table. With that said, we need to create a column in the data frame to label its respective stock ticker symbol.
In python, a string variable can be treated as an iterable array so we can use a substring method to retrieve the first 4 letters of the file names
# Extract first 4 letters of the file names
for files in json_folder:
print(files[:4])
领英推荐
Add the code into the loop.
import pandas as pd
import json
for files in json_folder:
# Reading from json file
with open(f'{path}\\{files}', 'r') as openfile:
json_object = json.load(openfile)
data = json_object['Time Series (Daily)']
# Convert into data frame
df = pd.DataFrame.from_dict({(i): data[i]
for i in data.keys()},
orient='index')
df_date = df.rename_axis('Date').reset_index()
# Add a column for the stock's ticker
df_date['Ticker'] = files[:4]
Now we can append the data tables into one table.
Append multiple?tables
We need to first create an empty data frame called df_all before we can append all stock price tables into that empty table. Note that this declaration is done before the for loop.
With that, we will replace df and df_date in each of the loops but df_all will not be replaced and will store all tables in the previous loop.
import pandas as pd
import json
df_all = pd.DataFrame()
for files in json_folder:
# Reading from json file
with open(f'{path}\\{files}', 'r') as openfile:
json_object = json.load(openfile)
data = json_object['Time Series (Daily)']
# Convert into data frame
df = pd.DataFrame.from_dict({(i): data[i]
for i in data.keys()},
orient='index')
df_date = df.rename_axis('Date').reset_index()
# Add a column for the stock's ticker
df_date['Ticker'] = files[:4]
df_all = pd.concat([df_all,df_date])
else:
display(df_all)
Check that df_all has all the stocks we required.
df_all['Ticker'].unique()
Reset index
Note that there are 30653 rows of data but the last index is 3142. This happened because the index of each table wasn’t reset when we append the tables together and the index remained as its index in the original table. So we need to reset the index in the combined table.
df_out = df_all.reset_index(drop=True)
Output Data
Output data to the same folder (created a folder called Output to store the final data)
output_path = os.path.abspath(f".\\Output")
df_out.to_csv(f'{output_path}\\FAANG-MT Combined Data.csv',index=False)
Video Tutorial
Conclusion
That’s all for today’s article. Thank you for reading today’s article. See you again in my next article.
About Me
Currently working as a Data Scientist. I provide consultancy, training, and professional services for data analytics problems to my clients worldwide. Would love to share my experience as a consultant so that everyone can learn something from it.
Medium: medium.com/@foocheechuan
Youtube: Chee-Chuan