Course: AWK Essential Training

Regular expression basics

- [Instructor] If you're going to use AWK for any kind of real-world task, you'll very quickly find yourself using regular expressions. Regular expressions are used throughout the Unix world and in many other contexts. They're covered in detail in the course Using Regular Expressions, but for now, I'm just going to give you a quick introduction in case you haven't encountered them before.

A regular expression is a special kind of string that defines a pattern which other strings are said to match or not match. Regular expressions in AWK are usually written between slashes, but in some circumstances you can use a string written between quotes as a regular expression. As with the conversion of numbers to strings and vice versa, a string is converted to a regular expression if you use it where a regular expression is expected.

The most basic regular expression is a simple string like /abc/. The pattern abc matches any string which contains the letters a, b, and c in that order with nothing in between them. Regular expressions in AWK are case sensitive by default. So here we see that /abc/ matches abc. It also matches any string that has abc somewhere in it, whatever comes before or after. But it does not match an a, a b, and a c with something in between the b and the c. It doesn't match a subset of itself, such as ab, and because it's case sensitive, the lowercase abc does not match the uppercase ABC.

One of the most common places to use a regular expression is as the entire pattern part of a pattern-action statement. If you do this, AWK performs the action for every record that matches the pattern. Here's our Duke of York example. We'll just print the entire line for each line of dukeofyork.txt. As you saw before, here's how you can print up before lines containing up and down before lines containing down. We'll add the pattern /up/ and print up, and then we'll add the pattern /down/ and print down.
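
As a rough sketch of the two ideas so far, the commands below first filter a few strings with /abc/ and then run two pattern-action statements over a short poem. The sample strings, the three-line poem, and the "up: " and "down: " labels are made up for illustration; they are stand-ins and not the contents of dukeofyork.txt or the exact commands used in the course.

```shell
# Hypothetical sample strings; none of these come from the course files.
samples='abc
xxabcxx
abxc
ab
ABC'

# A bare /regex/ pattern with no action prints every record that matches.
matches=$(printf '%s\n' "$samples" | awk '/abc/')
printf '%s\n' "$matches"
# prints:
#   abc
#   xxabcxx

# Made-up stand-in for dukeofyork.txt (the real file is not reproduced here).
poem='They marched up the hill
They marched down again
Neither up nor down'

# Two pattern-action statements: each one runs for every line its pattern
# matches, so a line that matches both patterns is printed by both.
tagged=$(printf '%s\n' "$poem" |
  awk '/up/ {print "up: " $0} /down/ {print "down: " $0}')
printf '%s\n' "$tagged"
# prints:
#   up: They marched up the hill
#   down: They marched down again
#   up: Neither up nor down
#   down: Neither up nor down
```

In this sample, only the third line contains both words, so it is the only one that appears under both labels.
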
Note that the last line of the poem, which contains both up and down, is printed twice, once by the first statement and once by the second.

One of the other common uses of a regular expression is with the tilde (~) comparison operator, which evaluates as true if the first argument, a string, matches the second argument, a regular expression. Similarly, the exclamation point tilde (!~) comparison operator evaluates as true if the first argument does not match the second. You can use these operators anywhere a comparison can be used, including in the pattern of a pattern-action statement. For example, whereas /up/ prints every line that contains up anywhere, $4 ~ /up/ prints only those lines for which the fourth field contains up, of which there are only two.

In addition to matching simple strings, you can use metacharacters to create more sophisticated patterns. The period matches any single character. For example, /a.c/ matches abc, or axc, or indeed a, anything, c, but it does not match ac, because there has to be exactly one character between the a and the c.

The backslash metacharacter removes the special meaning of a following metacharacter. For example, a backslash before the period matches a literal period: /a\.c/ matches a.c but does not match abc, or axc, or an a and a c with anything other than a period between them. You can use the backslash to remove the special meaning of the backslash itself, so /a\\c/ matches a\c but nothing with anything other than a backslash between the a and the c. And you can use the backslash to escape the special meaning of the slash, which would otherwise end the regular expression. If you just entered a/c between slashes, AWK would interpret the a as the regular expression, and then the c at the end would be extra characters and would cause an error. In this case, you can backslash the slash, /a\/c/, generating a pattern which matches a/c.

The caret and dollar sign match the beginning and end of the string respectively. /^abc/ matches a string that begins with abc, even with additional letters after it.
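
The tilde operators and the period and backslash metacharacters can be tried out with a short sketch like the one below. The two sample records and the list of test strings are invented for this example and are not taken from the course files.

```shell
# Made-up records: "up" is the fourth field of the first line only,
# although both lines contain "up" somewhere.
records='They marched them up hill
up they went again today'

# ~ is true when the field matches the regex; !~ is true when it does not.
field_hits=$(printf '%s\n' "$records" | awk '$4 ~ /up/')
field_misses=$(printf '%s\n' "$records" | awk '$4 !~ /up/')
printf '%s\n' "$field_hits"     # prints: They marched them up hill
printf '%s\n' "$field_misses"   # prints: up they went again today

# Hypothetical test strings for the metacharacter patterns.
strings='abc
axc
ac
a.c
a\c
a/c'

# An unescaped period matches any single character, so only "ac" is dropped.
dot=$(printf '%s\n' "$strings" | awk '/a.c/')
# An escaped period matches only a literal period.
literal_dot=$(printf '%s\n' "$strings" | awk '/a\.c/')
# A doubled backslash matches one literal backslash.
backslash=$(printf '%s\n' "$strings" | awk '/a\\c/')
# An escaped slash does not end the regular expression.
slash=$(printf '%s\n' "$strings" | awk '/a\/c/')

printf '%s\n' "$dot"           # prints abc, axc, a.c, a\c, a/c (one per line)
printf '%s\n' "$literal_dot"   # prints: a.c
printf '%s\n' "$backslash"     # prints: a\c
printf '%s\n' "$slash"         # prints: a/c
```

The AWK programs are wrapped in single quotes so that the shell passes $4 and the backslashes through to AWK unchanged.
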
/^abc/ does not match a string that begins with something other than abc, even if it contains abc. /abc$/ does not match a string that ends with something other than abc, but it does match the dabc that /^abc/ did not.

If you're familiar with grep, you need to be aware that in AWK the caret and dollar sign match the beginning and end of the string being compared, which is not always the whole line. For example, this AWK command, $3 ~ /the/, prints all the lines from our Duke of York file in which $3, the third field, matches the, including them, they, and neither. If we add a caret to that, the caret matches the beginning of the field, not the line. So now it prints only those lines where the third field begins with the, which includes them and they, but not neither.
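
The anchoring behavior can be checked with a small sketch like this one. The word list and the five sample lines are made up for illustration and are not the contents of the Duke of York file; only the third field of each line matters here.

```shell
# Hypothetical strings for the anchors.
words='abc
abcd
dabc'

starts=$(printf '%s\n' "$words" | awk '/^abc/')   # begins with abc
ends=$(printf '%s\n' "$words" | awk '/abc$/')     # ends with abc
printf '%s\n' "$starts"   # prints abc and abcd
printf '%s\n' "$ends"     # prints abc and dabc

# Made-up lines whose third fields are: the, them, neither, they, up.
sample='We followed the road
We saw them leave
We know neither side
We heard they left
We went up hill'

# Unanchored: the third field only has to contain "the" somewhere.
contains=$(printf '%s\n' "$sample" | awk '$3 ~ /the/')
# Anchored: ^ refers to the start of the third field, not the start of the line.
begins=$(printf '%s\n' "$sample" | awk '$3 ~ /^the/')
# With no field named, the regex is compared against the whole line,
# so here ^ does mean the start of the line and nothing matches.
line_start=$(printf '%s\n' "$sample" | awk '/^the/')

printf '%s\n' "$contains"   # first four lines (the, them, neither, they)
printf '%s\n' "$begins"     # lines 1, 2 and 4 (the, them, they)
```

Comparing the last two commands shows the difference from grep: the same /^the/ pattern selects three lines when compared with $3 and none when compared with the whole record.
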
