课程: Google Cloud Professional Machine Learning Engineer Cert Prep

BigQuery data pipelines with Colab

Let's go through and build a very realistic data cleansing and visualization pipeline using Google BigQuery. This happens quite a bit. You have to go through multiple stages of looking at your data, cleaning your data, and visualizing your data. So let's go ahead and take a look at this page views here. You can see for Wikipedia, it's got datehour, wiki, title, views, and if I wanted to query it, I could just select, you know, "Query in new tab" and it'll give me the beginning of a query. Now, in my scenario here, what I want to do is actually look at some of the top views on Wikipedia. And to start with, I can actually go through here and just throw in a query I've got. So we have SELECT views, title. We're going to look at this BigQuery pageviews_2023 table and I want to look at this particular date range right here, which is 04 -18. And we want to actually look at the top, let's say, 1,000 views here. Now, if I want to format it, we can go ahead and format it just to make sure everything looks good and from here we can actually go through and run. All right. So once we go through here and run this, what we'll see is that the results do come back, but they're not exactly the way I want. You can see here, it's got cookie information, main page. So, of course, people are going to be viewing the main page of Wikipedia. That's nice to know. But look at this, we have some results here, like Noah_Cyrus, for example. I want to dig into that a little bit more. So I need to clean this query up. So how do we do this? Well, let's go ahead and go through and clean it up by making a more complex query. What I do is I say, "Look, I want to look at this date range, but I don't want to see these other types of titles," right? The main page, special Wikipedia data, right? These are some of the things I want to exclude. And if we go through here and we run this, it's actually going to filter out a lot of the things that I didn't like before, right? Great. Now, we again need to look through here and find some ways to filter out the data because look at this, this shows up quite a bit. And my impression when I've gone through and queried it is it's also a main page in another language. So all we need to do is just tweak this query a little bit. We'll just say OR title LIKE and let's go ahead and paste this in. And then from here, let's get rid of this comma right there. Let's go ahead and format the query. Perfect. And let's run it again. So a lot of times you will need to, kind of, iteratively go through here and pick out the query. So this looks a lot better, right? So we see this, we've actually got a lot of results here that look pretty good. And we could even go through and filter in more and more and more. But one of the things that we can do next here, instead of only using SQL is we could actually go to explore the data. So let's go ahead and explore the data, but this time with Colab notebook. And CoLab notebook allows us to query things in Python. So I could actually go through and filter what's happening by looking at it in Python. So the first thing that I need to do is I need to run a setup and this will do some off code that gets me hooking up into here. And then we go through and we do that same query. Now, it just copies that query, which is right here, which looks good. And then we can actually load this result set, which we have the results right here. Here the exact same thing, but this time it's a data frame. And then finally, we can go through here and we can look at the describe. And this is good, but, again, we need to actually go through and dedupe this. So what we can do is I can also take some code here that I've got handy and we can filter this a little bit better. So one of the first things that we can do is we can actually say, let's go ahead and group by title and sum of the view. So all we have to do is go through here and add another piece of code. Right there. We say results, result grouped by, sum it, reset the index, and let's go ahead and review it again. Aha. There we go. So we have a lot better of a deduplication type view now that shows us what's going on. It may not be perfect, but it's a pretty good start and all we need to do now is visualize it. So in order to visualize it, again, we just add a code cell right here and I can just use seaborne to code it up. So first up, I'll go ahead and import seaborne as sns. Let's go ahead and do that. Let's go ahead and get a top 25 result here. So that's the next thing that we'll do. So we'll go ahead and say code. We'll say top 25. Perfect. And then finally, all I need to do is create a bar plot using seaborne. Let's go ahead and run that. Perfect, sns.set "white grid." Let's go ahead and set this title top 25 most viewed Wikipedia pages on 04 -18. Let's go ahead and query it. What do we got? Aha. Perfect. So we have a pretty good view here of what's going on. In the day of 04 -18, we can see Noah_Cyrus is a very big Wikipedia page. We see ChatGPT is a very big Wikipedia page. We see, you know, other people as well and this gives us a nice overview of how to actually look at what's popular in the world by using off the shelf tools from Google BigQuery as well as Colab notebook.

内容