AWK: the string-oriented programming language that you may need to know

Hi everyone, I hope you are doing well. My original intention was to write a more concise post about AWK, but due to LinkedIn's post length restrictions, here I am.

This time I want to recommend the book Sed and Awk, written by Dale Dougherty and Arnold Robbins. I'm a huge fan of Arnold's work; for me, he is the go-to resource for learning the fundamentals of Unix and its core principles. Want to learn Classic Shell Scripting? Arnold has a book for that. Need to learn the Korn Shell because your workplace drives the IBM Connect Direct software through ksh scripts? Arnold has a book for that, too.

AWK, I believe, is a great tool that you are going to use just a couple of times per year. So you might think that spending precious time learning this old, domain-specific, Turing-complete programming language with an arcane syntax is not worth it. Well, you are wrong. If you are a Data Engineer, Data Scientist, Data Analyst, or have any data-related job, you are already dealing with some kind of flat-file modification or extraction. If you are one of the lucky ones, you can do this manually in a text editor. If you are not, you are probably using a complicated Bash or Python script (or one in any other programming language) with a ton of regular expressions. The complexity lies in the string manipulation, and sometimes the abstraction required is just too complex.

For example, let's say (and this example is already fully covered in the book) that you need to extract one or more items from a flat file. The file contains multiple items, each item can have multiple descriptions, one per row, and items are separated from each other by blank lines. In the snippet below, we have four items in total. Two of them are Books, one about Sed & AWK and the other about Data Mining. They span 4 and 6 lines of text, respectively, counting the item line itself. Therefore, the number of descriptions per item is not fixed.

Book:   Sed & AWK
        Author: Dale Dougherty & Arnold Robbins
        Editorial: O'Reilly
        Pages: 407


Lamp:   Philips Hue Light
        Price: $79.97
        Battery Life: 10 Hours


Desk:   VIVO standing Desk 
        Brand: VIVO
        Type: Standing Desk
        Price: $139.99
        Color: Black
        Rating: 4.7/5


Book:   Data Mining
        Author: Charu C. Aggarwal
        Editorial: Springer
        Pages: 746
        Topic: Data Science Topics
        
        Audience: Graduate students. 
      

How can you extract any of the items along with its variable number of descriptions? I bet you have already started thinking about how to solve this in your favorite programming language and realized that, while it looks simple, it is not trivial. AWK gives you a complete set of tools for problems of this nature and for string manipulation in general.

The problem above can be solved with just two lines of code:

awk 'BEGIN { FS = "\n"; RS = "" }
$1 ~ search { print $0 }' search="$1" "$2"
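
The two lines above rely on what AWK calls paragraph mode: setting RS to the empty string makes every block of text separated by blank lines a single record, and setting FS to a newline makes every line inside that block a field. The sketch below only illustrates that behavior. The sample data is a shortened, made-up version of the file above, and the temporary file and the variable names are my own assumptions, not part of the book's example.

```shell
#!/bin/sh
# Made-up sample data, written to a temporary file so the sketch is self-contained.
data=$(mktemp)
cat > "$data" <<'EOF'
Book:   Sed & AWK
        Author: Dale Dougherty & Arnold Robbins
        Pages: 407


Lamp:   Philips Hue Light
        Price: $79.97
EOF

# RS = ""   : records are separated by one or more blank lines.
# FS = "\n" : each line of a record is one field, so NF is its line count
#             and $1 is the item line.
summary=$(awk 'BEGIN { FS = "\n"; RS = "" }
{ print NR ": " $1 " (" NF " lines)" }' "$data")

echo "$summary"
rm -f "$data"
```

Because the whole item is one record, $1 is always the item line (Book:, Lamp:, and so on), which is why matching against $1 is enough to select an item with all of its descriptions.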

If we generalize the problem a little by using a regular expression to match the item name, we get the following script:

[Image: the awk_search_item.sh script]
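
The script itself was posted as an image and is not reproduced in this text, so what follows is only a minimal sketch of what such a script might look like, not the original. The function name search_item, the variable names, and the shortened sample data are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: search_item is a hypothetical helper, not the original script.
# Usage: search_item PATTERN FILE
search_item() {
    awk -v pattern="$1" '
        # Paragraph mode on input; ORS keeps a blank line between printed items.
        BEGIN { FS = "\n"; RS = ""; ORS = "\n\n" }
        # Print the whole record when its first line matches the regex.
        $1 ~ pattern { print }
    ' "$2"
}

# Small made-up sample so the sketch runs on its own.
data=$(mktemp)
cat > "$data" <<'EOF'
Book:   Sed & AWK
        Pages: 407


Lamp:   Philips Hue Light
        Price: $79.97


Book:   Data Mining
        Pages: 746
EOF

books=$(search_item '^Book:' "$data")   # selects both Book items
lamp=$(search_item 'Lamp' "$data")      # selects only the Lamp item
printf '%s\n' "$books"
rm -f "$data"
```

Passing the pattern with -v keeps the regular expression out of the AWK program text, so the same program can be reused for any item name. One caveat: AWK processes backslash escapes in values assigned this way, so a pattern containing backslashes may need them doubled.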

Just thinking about how to solve problems like this without AWK makes my head hurt. Running the above script, which I called awk_search_item.sh, to search for all the Book items in our flat file produces the following output. There we can see that the script matches the Book items and prints both matches with all of their descriptions.

[Image: output of awk_search_item.sh when searching for the Book items]

OK, but what about the pretty but expensive Hue lamp?

[Image: output of awk_search_item.sh when searching for the Lamp item]

Conclusions

Yes, you will probably only use AWK a few times a year, but knowing it is the difference between spending hours or days figuring out an apparently simple task and getting it done so that you can move on to your more exciting data-related work.

Regards.

Azeem.
