课程: Intermediate SQL for Data Scientists

Reformatting character data

- [Instructor] In data science and data analysis, we often work with numbers, but we also often work with text and short strings. So it's important to think about how we can use different functions in SQL for manipulating text, such as reformatting text. So in this video, we're going to work with the company_departments table because that has a couple of text columns in there. So let's start by just selecting all of the columns from the data side, company_departments table. And let's just run that, make sure I got everything typed correctly. Okay. Yes, and what we see here is we see all of the data in the company_departments table, and we'll notice, it's all lowercase. So if we want to list all of the department names, we can list just department_name and run that and get just one column back. But the format of the text that we're looking at is however it is stored in the row. Now this may be fine for, you know, we may want all lowercase in the database for some reason, but for some reporting requirement, we really want all of these department names to be uppercase. Well, in that case, we can use the upper function and simply pass in the name of the column we want or the text that we want to uppercase. And when we use the upper function, we'll again, get back a single column, but now everything is uppercase. Now, sometimes, you may just want the first letter or the initial letter capitalized, and you want everything else however it normally is. Well, in that case, we can use the initcap function and pass in department_name. And now if we run that, we'll see only the first name is capitalized. Now one of the nice things about the way SQL works is that you can take the results of a function and pass that in as a parameter and argument to another function. So for example, here, we're taking department_name and then whatever else it is, but we are going to make sure the first letter in the department_name is capitalized. Well, what if we want to then lowercase? So basically undo our upper initial capitalization. We can do that. And so what we've done is we take the department_name, we turn the first or initial letter into a capital, and then we lower it again. So we're getting back to right where we started from. So it's not terribly useful, but I just wanted to demonstrate how we could have a whole series of functions that take as their input, the results of other functions. Because that can be quite handy when you want to do a whole series of operations on a string. Now sometimes, when we work with strings, we can run into some subtle problems with white spaces. This happens more than I would like at least in my work where I might be working with a new data set. It might come out of maybe an older source data system that doesn't have quite the, you know, data quality checks that we might be used to. So for example, let's say we're working with somebody's name. And we want to work with, say the name Kelly. We can for example, type SELECT Kelly. And if we run this, this is simply going to return the word Kelly. Now, if I put a space in front of the word Kelly and run this, I get a result back, which looks surprisingly similar. There's a space before the K in this case, but it looks very similar. Like I couldn't tell if I was just looking at the result whether or not there might be an extra space there. Well, let's test and make sure I'm actually looking at something other than capital K-E-L-L-Y. So let's run this. So I'm basically doing a Boolean test. I am asserting that this string is the same as this string. So character for character, are these two the same character string? And the answer is false. No, they're not. That's because we've got an extra space here. So that's throwing us off. If we remove the extra space and then run this check, of course, we're going to get true again. So sometimes, when we're working with strings, we want to use functions which help us trim off white space from the beginning and the end of a column. So for example, if we use the function called ltrim or left trim, and pass in the word Kelly, and we just apply ltrim, we're going to get just Kelly back. So let's test that and see if this is equal to K-E-L-L-Y. And it is. Now, if we add a space at the beginning before Kelly and run this and do an ltrim, we get a true back. That's because ltrim will remove the extra white space. So let's put a couple more in there. Whoops, don't want a period for sure. Let's put a couple more in there and run this. And what we see here is ltrim will trim up all of the leading white spaces, whether it's one space or two or three spaces, it will remove those. So that is one way of trimming down. Now let's imagine we have an extra space after the word Kelly. So this is on the right side of the string. So let's run that. And now, it turns out the ltrim does not take care of the extra white space on the right side. However, we can do what we were talking about before, which is pass the results of this function into another function. And in this case, I'm going to use rtrim or right trim, and I want to right trim the results that I get back from my ltrim over this string that Kelly with spaces on both sides. So now we're trimming the left side, and then we're taking the result of that and trimming the right side. So let's see if we get a true back. We do. So we can do this kind of trimming both on the left and on the right to help us clean up potentially unseen or non-printing characters. Now in addition to working with things like uppercase and lowercase or initializing capitalization and trimming off leading white spaces from the beginning and ends of strings, sometimes, what we want to do is we want to construct larger strings from component, text columns, or text strings that we have. And to do that, we can use the concatenate operator. So let's look at an example of how we can concatenate or join together two strings. Let's look at the employee's table. And from the employee's table, I want to get the job title. And I want to concatenate it to the last name. And this is going to be from data.sci.employees. So the double pipe is the concatenation operator, so it just runs two strings together. So let's run this and see what we got. And sure enough, what we have is we have a job title, like in this case, in the first result, we have structural engineer immediately followed by the last name, in this case, Kelly. Now what we're doing here is we're missing some kind of separator or delimiter that indicates we have, you know, the end of the job title and the beginning of the last name. Well, one way we can fix that is to add a delimiting character, like a dash sign, and we can concatenate that to the job title and then we can concatenate the results of that concatenation operation to last name. So we'll take job title, we'll append or concatenate a dash, and then we'll append or concatenate last name. So let's see how that looks. Okay, so this is better. This is a little easier to read. So now I can see easily that's structural engineer Kelly and recruiting manager Carr and marketing assistant Alexander. What we can do is use the basically the double pipe operator. Now when we work with the double pipe operator and with column names, we want to think about what happens when a column has a null value. So let's say there might be someone, for some reason, their last name is missing and there's a null in for their last name. What happens when we concatenate a null to the string using the pipe operator? Well, let's see. What we get back, the entire result of the concatenation operator is null. It's not simply the job title, dash, and then nothing else. You might expect that, but that's not how the concat operator works.

内容