Course: Hands-On Advanced Python: Data Engineering Basics

Loading and parsing JSON data

- [Instructor] Let's start by making sure that we can successfully load the dataset so we can work on it. To do this, we're going to load the JSON file and parse it into a Python object, and luckily, Python gives us some high-level functions for doing exactly that. Again, if we look at the data file, we can see that it is an array of objects. It starts with an opening square bracket and ends with a closing one all the way down at the end, and each element is contained within curly braces.
So first, let's open the file using Python's built-in open function. I'm going to open up, in chapter one, the parsedata.py file, and I'm going to do this with a context manager. The context manager starts with the with keyword, and then I'm going to write open and the path to the file. The file, remember, is up in the root directory of the examples, so I've got to work my way back up there: up a directory, and then up another directory, and then it's sample-weather-history.json. I'm going to open that in reading mode, so I'll specify r, and then I'll write as weatherfile. So now I'll have a variable named weatherfile that refers to the open file.
All right, so once the file is open, we can use the json module to load and parse the data. But to do that, I first have to import it, so I'm going to import json. Once I've got the json module imported, I can use its load function to load the data file. So I'll create a variable named weatherdata, and I'm going to call json.load and pass in the weatherfile that I just opened. All right, so once the data is loaded, we can get some basic information about it to make sure that we did everything correctly.
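As a rough sketch of the loading step described here: the block below writes a tiny stand-in file to a temporary folder so that it runs anywhere, then opens it with a context manager and parses it with json.load. The two records and their date values are made up for illustration; only the file name, the date field, and the weatherfile and weatherdata names come from the walkthrough.

```python
import json
import os
import tempfile

# Stand-in for the real data file: a tiny JSON array of objects.
# The records and their values are invented for this sketch.
sample_records = [{"date": "2017-01-03"}, {"date": "2020-02-29"}]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample-weather-history.json")
    with open(path, "w") as outfile:
        json.dump(sample_records, outfile)

    # Open the file in read mode with a context manager, then parse it.
    with open(path, "r") as weatherfile:
        weatherdata = json.load(weatherfile)

# A JSON array of objects is parsed into a Python list of dictionaries.
print(type(weatherdata).__name__, len(weatherdata))
```

In the course file itself, the path would presumably be the relative one described above, two directories up from the chapter folder, rather than a temporary file.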
So for example, I can print out the length of the weatherdata variable, and that should tell me how many records I have. It should be a fairly large number, because we've got about five and a half years of data in here. Let's also print out the first item in the array that we just loaded. To do that, I'm going to import the pretty printer, pprint. This is a module that prints data structures in an easy-to-read format. I'm going to call pp, because that's the function, and I'll print out the very first item in the weatherdata array.
Okay, so let's go ahead and run this. First I'm going to save it, and then there are a couple of ways I can run this code. I can right-click on the file name here and open it up in the integrated terminal. I'll get back to that in a second, because there's an easier way to do this. If I right-click right within the source file, down here I can choose run Python. This appears once you have the Python extensions installed, and I'll choose run Python in terminal. When I do that, you can see that the terminal comes up, and let me give it some more room. There we go. You can see that it has executed the Python file, and sure enough, when we print out the length of the weatherdata array, we get 1,977 records. So almost 2,000 data points. We also have the first item in weatherdata, and sure enough, here that is. So we've loaded our JSON correctly, and now we have an array of Python objects that we can work on.
All right, let me close this. If you want to open up the terminal directly, you can either right-click here and choose open in integrated terminal, or you can open the terminal manually by typing Control + backtick. When that happens, you get the terminal opened up. If you right-click on a folder and choose open in integrated terminal, you'll get a terminal that's already open to that chapter's folder.
So if I now get a directory listing, you can see that there's my code, and I can run it manually just by typing python and then the name of the file I want to run, python parse. There we go, and you can see we get the same result. It's up to you; you don't have to do it this way. Throughout the course, I'll be right-clicking and choosing run in terminal, just because it's the easiest way to do it.
All right, so now that we have our data successfully loaded, let's see if we can figure out how many days of weather data we have for each year. I'm going to create an empty dictionary named years, and then I'm going to loop over all of the data that I just loaded and categorize it by year. For each day in weatherdata, I'm going to create a key for my dictionary. That key is going to come from the date field, and I'm only going to take the first four characters of the date, because again, if you go back and look at the data, you'll see that the first four characters are the year. And the data is not already sorted by year. You can see that we're jumping around between weeks, and if you look through the data, you'll see that it's not in any particular sorted order. So I'm going to get the key, and then if that key already exists in the years dictionary, I'm going to increment the count for that key. Otherwise, I'll just initialize that key's value to one. When we're done, I'm going to pretty print the contents of that dictionary, and I'll give it a width of five to make it format nicely.
Let's go ahead and comment out the previous code so that it doesn't interfere with our output. All right, and now let's run this again. I'll go ahead and run the Python file in the terminal, and sure enough, we can see the output of my dictionary. So we've got full datasets for the years 2017 up to 2021.
That's 365 data points per year, except that there are 366 data points in 2020 because that was a leap year, and it looks like we've got about half that much data for the year 2022. All right, so at this point, it looks like we're loading and parsing the data just fine, so we can continue with the rest of the chapter.
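The per-year tally described in this walkthrough might look something like the sketch below. The four records are a hypothetical, deliberately unsorted stand-in for the loaded data, and the date format is an assumption; the walkthrough only says that the first four characters of the date field are the year.

```python
import pprint

# Hypothetical stand-in for the loaded list; the real file has far more
# records and other fields that are not shown in this sketch.
weatherdata = [
    {"date": "2018-07-04"},
    {"date": "2017-01-03"},
    {"date": "2018-12-25"},
    {"date": "2020-02-29"},
]

years = {}
for day in weatherdata:
    # The first four characters of the date are the year.
    key = day["date"][0:4]
    if key in years:
        years[key] += 1  # year seen before: increment its count
    else:
        years[key] = 1   # first record for this year

# A very narrow width forces each key/value pair onto its own line.
pprint.pp(years, width=5)
```

The standard library's collections.Counter would produce the same tally in a single line, but the explicit loop mirrors the steps as they are narrated.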
