Course: Intermediate SQL for Data Scientists

Overview of data science operations

- [Instructor] Data is an integral part of how many organizations work, whether those organizations are businesses, governments, research institutes, or something else. Now, sometimes when we think about data, we think about small amounts of data, say, the data that's needed for me to complete an online transaction so I can purchase a book online. But other times, we're looking at large amounts of data, because our interest isn't so much in conducting a single business transaction as in analyzing, say, thousands of transactions and trying to gain insights into our customers' behaviors and interests. What we're focused on in this course is that second category of data problem. We want to help our colleagues gather insights from data to help with the operations of our particular organizations, and to do that, we set up data science operations within our organizations.

Well, what kinds of things do we need to do from a data perspective if we want to have a data science operation? First of all, we have to be able to link data from different sources, because no one system, whether it's a transaction processing system, an HR system that tracks employees, or a payroll system, can tell the complete picture of the business, so we often have to pull different pieces together to get a bigger, more comprehensive view of our organization. Now, oftentimes we're pulling a lot of data together, but we don't always need to work with all of it, so we have to be able to filter it and focus on particular subsets of data. We also need to reformat data so that we have a consistent representation within our data science operations. A key thing that we typically do in data science operations is look at aggregate data, because we want to understand the big picture. So rather than look at a single transaction, we want to look at patterns over the course of thousands of transactions. And we want to be able to answer specific questions about business operations. For example, we might want to know which stores are in the top 10% most profitable in a particular retail chain, or which stores are not as profitable this quarter as they were last quarter. Those are very specific questions that business people may ask, and we, from a data science and analytics perspective, are able to answer them with the techniques we're going to talk about in this course (a query along these lines is sketched at the end of this overview).

Now, of course, to do data science, we have to start with data, which means we have to go to various sources. Oftentimes, the data we need is already in a database, and it could be a relational database or a NoSQL database. Relational databases are often used for transaction processing systems and data warehouses. NoSQL databases are typically used in a couple of cases. One is when you have very large volumes of data arriving at very high velocity, that is, a large amount of data coming in over short periods of time. The other is when the data is semi-structured, not as well structured or as consistent as what we typically find in relational databases. We may also get data straight from applications that are running in our data centers or in the cloud, and we may get data streaming in from mobile devices.
Now, if you're in an organization that has, say, a fleet of vehicles, those vehicles might be instrumented with IoT, or Internet of Things, sensors, which collect data about the state of the vehicle and send that information off for analysis. We may also be looking at web logs: applications typically write out information about their state at any point in time, and sometimes we need to be able to analyze that. Those are examples of the kinds of application-related data sources we might work with. And then there's a third category I want to talk about, one that's easily missed, and that's manually managed data. We may work with colleagues who collect and work with a lot of data and then essentially track it in spreadsheets or in a local database, so that data is only available in the spreadsheet or database they manage. Sometimes we need to pull that into our data analysis as well.

How do we get data from these various types of sources into a single place that we can work with? The process is known as extraction, transformation, and loading, or ETL. The idea is that we extract, or read, the data from its source, whether that's a database, a spreadsheet, an application, or an IoT data stream. Then we transform the data. We can think of each data source as a puzzle piece, and the transformations are the operations that reshape those puzzle pieces so that they all fit together in a logical way. And finally, once we have the data transformed, we can load it into a relational database, at which point we can start using SQL to analyze it.

Now, transformation is a very broad topic, so let's look at just some simple examples. A transformation could be something as simple as making sure text values don't have any extra whitespace, spaces or tabs, at the beginning or end of the text. We also often have to reformat dates; different systems use different representations for dates, so we want to make sure we have a consistent way of looking at them. Also, two different systems that talk about the same thing, say, the departments in your organization, might use different codes to refer to it: one application might use two-letter codes for departments, and another might spell out the full name of the department. When we're transforming the data, we'll want to pick one standard set of codes or categories and make sure we use it consistently. And we may have other kinds of transformations, something as simple as making sure data values are properly cased, or reformatting numeric values into currency values, or vice versa. Those are just some examples of simple transformations (a small sketch of a few of them appears below), and transformations can get quite complex, especially if there's additional business logic involved.

And then finally, after the transformations are performed and the data is loaded, we get to what many of us consider the most interesting part, which is the analysis phase. Here we're going to work with various tools, whether it's SQL or Python or Spark, to do different kinds of analysis.
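Here is the sketch mentioned above: a few of those simple cleanup transformations expressed as one SQL query over a hypothetical staging table. The table name staging_employees, its columns, and the input formats (two-letter department codes, MM/DD/YYYY date text, salary stored as text with a currency symbol) are assumptions for illustration, not a schema from this course, and functions like INITCAP and TO_DATE are PostgreSQL-style; other databases offer similar functions under different names.

    -- Hypothetical staging table: staging_employees(full_name, dept_code, hire_date_text, salary_text)
    SELECT
        INITCAP(TRIM(full_name))               AS full_name,   -- strip stray spaces, standardize casing
        CASE UPPER(TRIM(dept_code))                            -- map two-letter codes to one standard set of names
            WHEN 'HR' THEN 'Human Resources'
            WHEN 'IT' THEN 'Information Technology'
            ELSE 'Unknown'
        END                                    AS department,
        TO_DATE(hire_date_text, 'MM/DD/YYYY')  AS hire_date,   -- one consistent date representation
        CAST(REPLACE(REPLACE(salary_text, '$', ''), ',', '') AS NUMERIC(12,2)) AS salary  -- currency text to a numeric value
    FROM staging_employees;

In practice, a query like this would run as part of the transformation step and its output would be loaded into the analysis database.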
And the goal here is to derive insights from the data so we can provide those insights to the colleagues who make decisions about our organization's operations, strategies, and tactics. So next, we're going to take a look at some specific SQL commands that we will be using throughout this course.
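As a preview of those commands, here is the kind of query that the earlier question about the top 10% most profitable stores might turn into. This is a minimal sketch under assumed names: the tables stores and store_sales, the columns amount and sale_date, and the use of total sales as a stand-in for profitability are all illustrative assumptions. It shows the pieces described above: linking tables with a join, filtering to a subset with WHERE, aggregating with GROUP BY, and answering the specific question with the NTILE window function.

    -- Hypothetical tables: stores(store_id, store_name) and store_sales(store_id, sale_date, amount)
    WITH quarterly_sales AS (
        SELECT
            s.store_id,
            s.store_name,
            SUM(t.amount) AS total_sales                        -- aggregate thousands of transactions per store
        FROM stores AS s
        JOIN store_sales AS t ON t.store_id = s.store_id        -- link data from two tables
        WHERE t.sale_date >= DATE '2024-01-01'                  -- filter to one quarter
          AND t.sale_date <  DATE '2024-04-01'
        GROUP BY s.store_id, s.store_name
    ),
    ranked AS (
        SELECT
            store_id,
            store_name,
            total_sales,
            NTILE(10) OVER (ORDER BY total_sales DESC) AS decile  -- split stores into ten equal groups
        FROM quarterly_sales
    )
    SELECT store_id, store_name, total_sales
    FROM ranked
    WHERE decile = 1;                                             -- keep only the top 10%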
