Why I love the stream editor (sed)
Photo by Guillaume Jaillet on Unsplash

We have a situation here...

In one of the projects we have been working on, we received a lot of pipeline side-scan sonar data as PDF tables: old, manually typed tables that had been scanned into PDF.

To cut a long story short, we managed to digitize them into CSV files.

Yes, we managed to capture everything, but the data was unstructured: lots of whitespace, numbers mixed with letters, the dimensions (length and height) combined in a single cell, and the Easting and Northing coordinates mixed across records. And we had quite a lot of data.

What are the options? Manual editing in Excel? Excel automation (VBA or recorded macros)? And don't forget, we have a few hundred files to process.

What about processing them as text with a modern programming language? Perl? Python? Yes, why not?

Wait....remember this....

and this ....

Then why not use sed? Let's see what we can do from the command-line terminal.

We use regular expressions (regex) for pattern matching. Regex itself is beyond the scope of this article, but I describe the regex used in each step to help those who are not familiar with it.

Now, using only sed's substitute command, we first eliminate whitespace by replacing / / with nothing, //.
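
A minimal sketch of this first step, run on a made-up row (the values are hypothetical, since the real files are not reproduced here). The trailing g tells sed to replace every space on the line, not just the first one. Note that / / matches space characters only, so tabs would be left alone.

```shell
# Hypothetical messy row, padded with spaces the way the scanned tables were.
row=' 1 , 0.100 , 0.250 , 512345.6 E , 3.5 x 0.4 '

# s/ //g : substitute each space with nothing, everywhere on the line.
clean=$(printf '%s\n' "$row" | sed 's/ //g')

printf '%s\n' "$clean"
# prints: 1,0.100,0.250,512345.6E,3.5x0.4
```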

The second step is to remove E and N using the regex pattern /[NE],/, replaced by /,/. In other words, we search for an N or E followed by a comma and replace it with just the comma (so the N or E is removed). The second command is simply appended to the sed command line after the previous one, separated by the ; symbol.
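
As a rough illustration of chaining two commands with ; the sketch below runs both substitutions in one sed call. The coordinates are invented for the example. An E or N that is not directly followed by a comma is left untouched, which is why the whitespace has to go first.

```shell
# Hypothetical fragment of a row: two Easting values and an L x H cell.
row=' 512345.6 E , 512400.1 E , 3.5 x 0.4'

# First command strips spaces, second drops an N or E that sits before a comma.
clean=$(printf '%s\n' "$row" | sed 's/ //g; s/[NE],/,/g')

printf '%s\n' "$clean"
# prints: 512345.6,512400.1,3.5x0.4
```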

In the third step we split L x H into two columns. Since the comma is the field separator (delimiter), we just have to replace x with a comma, using the regex /x/ and the replacement /,/.
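
A small sketch of the split, again on an invented row. The match is case-sensitive, so a lowercase x becomes a comma while an uppercase X (such as a hypothetical X-01 label) is left alone.

```shell
# Hypothetical row after the first two steps; X-01 is a made-up label.
row='1,0.100,0.250,512345.6,512400.1,3.5x0.4,X-01'

# s/x/,/g : every lowercase x becomes a field separator.
split=$(printf '%s\n' "$row" | sed 's/x/,/g')

printf '%s\n' "$split"
# prints: 1,0.100,0.250,512345.6,512400.1,3.5,0.4,X-01
```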

The next step is the most critical: we remove an unnecessary field together with the newline character \n, so that the Northing values end up in the same record as the other values. We use the -z option (zero-terminated records: sed splits the input on NUL characters instead of newlines, so an ordinary text file is read as one block), which lets sed match across line breaks instead of working line by line.

We use the regex /X[^,]*\n,,,/ to remove X-..., the newline, and the three commas that precede the Northing data on the next line.
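
The sketch below shows the join on a made-up two-line record. It assumes GNU sed, since -z is a GNU extension, and the X-01 label and coordinates are hypothetical. Because [^,]* cannot cross a comma, each match stays inside its own record even though sed sees the whole file at once.

```shell
# Hypothetical record split over two lines: the Northing values sit on the
# second line, behind three empty fields.
data=$(printf '%s\n' \
  '1,0.100,0.250,512345.6,512400.1,3.5,0.4,X-01' \
  ',,,9123456.7,9123500.2,,')

# -z makes the whole input one record, so \n can be matched like any character.
joined=$(printf '%s\n' "$data" | sed -z 's/X[^,]*\n,,,//g')

printf '%s\n' "$joined"
# prints: 1,0.100,0.250,512345.6,512400.1,3.5,0.4,9123456.7,9123500.2,,
```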

After this step, each record sits on a single line, with the Northing values in the same record as the other values.

After that, we just trim the unnecessary double comma at the end by replacing /,,/ with //.
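
A quick sketch of the trim on an invented row. One thing worth keeping in mind: the pattern is not anchored to the end of the line, so it removes any pair of adjacent commas. That is fine as long as the data has no empty field in the middle of a record, which is an assumption about these files.

```shell
# Hypothetical joined record with the leftover ",," at the end.
row='1,0.100,0.250,512345.6,512400.1,3.5,0.4,9123456.7,9123500.2,,'
trimmed=$(printf '%s\n' "$row" | sed 's/,,//g')
printf '%s\n' "$trimmed"
# prints: 1,0.100,0.250,512345.6,512400.1,3.5,0.4,9123456.7,9123500.2

# Caveat: an empty field in the middle would be swallowed as well.
gap=$(printf '%s\n' '7,,8' | sed 's/,,//g')
printf '%s\n' "$gap"
# prints: 78
```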

The last step is to replace the old header, which runs from Span... down to H(m), using the regex /S.*)/, with the new header /No,KPStart,KPEnd,EastStart,EastEnd,Length,Depth,NorthStart,NorthEnd/
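
Here is a hedged sketch of the header swap. The old header text is invented (only its start, Span, and its end, H(m), come from the description above), and GNU sed is assumed for -z. With -z the .* is greedy across lines, so the match runs from the first S to the last ) in the whole file. This works only if no data row contains a closing parenthesis.

```shell
new='No,KPStart,KPEnd,EastStart,EastEnd,Length,Depth,NorthStart,NorthEnd'

# Hypothetical old header followed by one cleaned data row.
data=$(printf '%s\n' \
  'SpanNo,KPFrom,KPTo,Start,End,L,H(m)' \
  '1,0.100,0.250,512345.6,512400.1,3.5,0.4,9123456.7,9123500.2')

# No g flag: the header is replaced once, from the first S to the last ).
out=$(printf '%s\n' "$data" | sed -z "s/S.*)/$new/")

printf '%s\n' "$out"
```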

That's all! We can save the changes back to the input files by adding -i (the in-place option), and we can batch-process all input files (e.g. data1.csv, data2.csv, ..., data1000.csv) using the glob pattern data*.csv

sed -z -i "s/ //g; s/[NE],/,/g; s/x/,/g; s/X[^,]*\n,,,//g; s/,,//g; s/Sp.*),/No,KPStart,KPEnd,EastStart,EastEnd,Length,Depth,NorthStart,NorthEnd/" data*.csv        
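
Since -i overwrites the files, it may be worth trying the command on a throwaway copy first. The sketch below does that with a made-up file (the layout and values are hypothetical, chosen to resemble the description above) and uses GNU sed's -i.bak form, which keeps each original next to the result with a .bak suffix.

```shell
# Throwaway directory with one invented input file.
dir=$(mktemp -d)
printf '%s\n' \
  'Span No, KP Start, KP End, Start, End, L x H (m),' \
  ' 1 , 0.100 , 0.250 , 512345.6 E , 512400.1 E , 3.5 x 0.4 , X-01' \
  ',,, 9123456.7 N , 9123500.2 N ,,' \
  ' 2 , 0.300 , 0.410 , 512500.0 E , 512560.3 E , 4.1 x 0.6 , X-02' \
  ',,, 9123600.5 N , 9123650.8 N ,,' > "$dir/data1.csv"

# Same command as above, but -i.bak keeps the untouched file as data1.csv.bak.
sed -z -i.bak "s/ //g; s/[NE],/,/g; s/x/,/g; s/X[^,]*\n,,,//g; s/,,//g; s/Sp.*),/No,KPStart,KPEnd,EastStart,EastEnd,Length,Depth,NorthStart,NorthEnd/" "$dir"/data*.csv

result=$(cat "$dir/data1.csv")
backup_lines=$(wc -l < "$dir/data1.csv.bak")
printf '%s\n' "$result"
# prints:
# No,KPStart,KPEnd,EastStart,EastEnd,Length,Depth,NorthStart,NorthEnd
# 1,0.100,0.250,512345.6,512400.1,3.5,0.4,9123456.7,9123500.2
# 2,0.300,0.410,512500.0,512560.3,4.1,0.6,9123600.5,9123650.8

rm -rf "$dir"
```

The backups end in .csv.bak, so they do not match data*.csv and a second run will not pick them up.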

The result is a clean CSV file: one record per line under the new header.

All the data reformatting is done in a single pass, almost instantly!




