Why I love stream editor (sed)

Purnomo Setyawendha

Pipeline Integrity Engineer

Published: February 9, 2022

We have a situation here ...

In one of the projects we have been working on, we received a lot of pipeline side-scan sonar data as PDF tables: old, manually typed tables that had been scanned into PDF.

To cut a long story short, we managed to digitize them into CSV files.

Yes, we managed to capture everything, but the data was not structured: a lot of whitespace, numbers mixed with letters, dimension data (Length and Height) combined in a single cell, Easting and Northing coordinates mixed in one record, and quite a lot of data overall.

What are the options? Manual editing in Excel? Excel automation (VBA or recorded macros)? And don't forget, we have a few hundred files to process.

What if we process them as text using a modern programming language? Perl? Python? Yes, why not?

Wait....remember this....

and this ....

Then why not use sed? Let's see what we have from the command-line terminal:

We use regular expressions (regex) for pattern matching. Regex itself is beyond the scope of this article, but I describe the regex used in each step to help those who are not familiar with it.

Now, using only sed's substitute command, we first eliminate whitespace by replacing / / (a space) with nothing, //.
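A minimal sketch of this first step, run on a made-up row (the values are hypothetical; the real survey data is not reproduced here). Note that / / matches only the space character, so any tabs would be left in place:

```shell
# Hypothetical sample row with stray spaces around the values.
row=' 1 , 1.250 , 1.262 '

# s/ //g replaces every space with nothing; the g flag applies it
# to every match on the line, not just the first one.
step1=$(printf '%s\n' "$row" | sed 's/ //g')
printf '%s\n' "$step1"   # prints: 1,1.250,1.262
```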

The second step is to remove E and N using the regex pattern /[NE],/, replaced by /,/. In other words, we search for N or E followed by a comma and replace it with just the comma (so the N or E is removed). The second command is simply appended to the sed command line after the previous one, separated by the ; symbol.
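As a rough illustration of chaining the two commands with ;, here is a sketch on invented coordinate fields (the numbers are hypothetical). Because the pattern requires a comma after the letter, an E or N at the very end of a line with no trailing comma would not be removed:

```shell
# Hypothetical coordinate fields; real values are not shown in the article.
row='456789.1 E, 456795.3 E, 9876543.2 N,'

# Two substitutions chained with ';': strip spaces first, then drop
# an N or E that sits directly before a comma.
step2=$(printf '%s\n' "$row" | sed 's/ //g; s/[NE],/,/g')
printf '%s\n' "$step2"   # prints: 456789.1,456795.3,9876543.2,
```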

In the third step we split L x H into two columns. Since the comma is the field separator (delimiter), we just have to replace x with a comma, using the regex /x/ replaced by /,/.
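A small sketch of the split, using a hypothetical "L x H" cell. It assumes a lowercase x appears nowhere else in the data, since s/x/,/g replaces every x it finds, including any in header text:

```shell
# Hypothetical dimension cell in "Length x Height" form.
cell='12.5 x 0.4'

# Remove spaces, then turn the x into a field separator.
step3=$(printf '%s\n' "$cell" | sed 's/ //g; s/x/,/g')
printf '%s\n' "$step3"   # prints: 12.5,0.4
```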

The next step is the most critical: we remove an unnecessary field together with the newline character \n, so that the Northing values end up in the same record as the other values. We use the -z option (zero-terminated records, a GNU sed option) so that sed processes the input as a whole, across newlines, rather than line by line.

We use the regex /X[^,]*\n,,,/ to remove the X-... field, the newline, and the three commas that precede the Northing data on the next line.

After this step, the data looks like this:
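To make the effect of -z concrete, here is a sketch on a hypothetical two-line record (the "X-1" field and the numbers are invented for illustration). It assumes GNU sed; without -z, sed strips the newline from each line before matching, so \n in the pattern would never match:

```shell
# Hypothetical record split over two lines: the first line ends with an
# unwanted "X-1" field, and the Northing values follow ",,," on the next line.
step4=$(printf '12.5,0.4,X-1\n,,,9876543.2,9876549.8,,\n' |
  sed -z 's/X[^,]*\n,,,//g')

# The two physical lines are now joined into one record.
printf '%s\n' "$step4"   # prints: 12.5,0.4,9876543.2,9876549.8,,
```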

After that, we just trim the unnecessary double comma at the end by replacing /,,/ with //.
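A sketch of the trimming step on a hypothetical row. Since /,,/ is not anchored, it removes any pair of commas, so an empty field in the middle of a row would be swallowed as well. The anchored variant shown last is only an alternative I am suggesting for line-by-line use (without -z), not part of the original command:

```shell
# Hypothetical row with a trailing double comma.
row='12.5,0.4,9876543.2,9876549.8,,'
step5=$(printf '%s\n' "$row" | sed 's/,,//g')
printf '%s\n' "$step5"   # prints: 12.5,0.4,9876543.2,9876549.8

# Side effect of the unanchored pattern on a row with an empty middle field:
loose=$(printf '1,,3,,\n' | sed 's/,,//g')     # gives: 13
# Anchored alternative (an assumption, not the author's command):
strict=$(printf '1,,3,,\n' | sed 's/,,$//')    # gives: 1,,3
printf '%s %s\n' "$loose" "$strict"
```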

The last step is to replace the header, from Span..... down to H(m), using the regex /Sp.*),/ (as written in the full command) to match the old header:

and replace it with the new header /No,KPStart,KPEnd,EastStart,EastEnd,Length,Depth,NorthStart,NorthEnd/
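A sketch of the header swap, using an invented old header (the real one is only described as running from "Span..." to "H(m)"). It assumes GNU sed and that no data row contains "),": with -z the whole file is one record and . also matches newlines, so the greedy .* runs up to the last ")," in the file.

```shell
# New header taken from the article.
new='No,KPStart,KPEnd,EastStart,EastEnd,Length,Depth,NorthStart,NorthEnd'

# Hypothetical old header followed by one hypothetical data row.
step6=$(printf 'SpanNo,KPStart,KPEnd,Easting,L,H(m),\n1,1.250,1.262\n' |
  sed -z "s/Sp.*),/$new/")

# First line is now the new header; the data row is untouched.
printf '%s\n' "$step6"
```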

That's all! We can save the changes back into the input files by adding -i (the in-place option), and we can run a batch operation over all input files (e.g. data1.csv, data2.csv, ..., data1000.csv) using the glob pattern data*.csv
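Because -i overwrites the input files, it may be worth trying the batch run on copies first. The sketch below works in a throwaway directory with two hypothetical files, and uses -i.bak (a suffix attached to -i) so that each original is kept as a .bak file; the file contents and the shortened command are made up for illustration:

```shell
# Work in a temporary directory so no real files are touched.
tmpdir=$(mktemp -d)
printf ' 1 , 2 x 3 \n' > "$tmpdir/data1.csv"   # hypothetical contents
printf ' 4 , 5 x 6 \n' > "$tmpdir/data2.csv"

# -i.bak edits each file in place and keeps the original as <name>.bak;
# the glob expands to every matching file.
sed -i.bak 's/ //g; s/x/,/g' "$tmpdir"/data*.csv

batch=$(cat "$tmpdir"/data*.csv)
printf '%s\n' "$batch"
```

Once the output looks right, the .bak files can be deleted, or the plain -i form can be used instead.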

sed -z -i "s/ //g; s/[NE],/,/g; s/x/,/g; s/X[^,]*\n,,,//g; s/,,//g; s/Sp.*),/No,KPStart,KPEnd,EastStart,EastEnd,Length,Depth,NorthStart,NorthEnd/" data*.csv
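To see the whole command at work without touching any file, here is a sketch that feeds it a small input on standard input (so -i is dropped). The header text, the "X-1" field, and all numbers are hypothetical stand-ins for the real survey tables, and GNU sed is assumed for -z:

```shell
# Hypothetical raw input: a header, then one record split over two lines.
input='Span No, KP Start, KP End, Easting Start, Easting End, L x H (m),
1, 1.250, 1.262, 456789.1 E, 456795.3 E, 12.5 x 0.4, X-1
,,, 9876543.2 N, 9876549.8 N,,'

# Same substitutions as the full command, reading from stdin instead of files.
final=$(printf '%s\n' "$input" | sed -z "s/ //g; s/[NE],/,/g; s/x/,/g; s/X[^,]*\n,,,//g; s/,,//g; s/Sp.*),/No,KPStart,KPEnd,EastStart,EastEnd,Length,Depth,NorthStart,NorthEnd/")

# Expect a nine-column header followed by one nine-column record.
printf '%s\n' "$final"
```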

and the result looks like this:

All the data reformatting is done at once, at the speed of light!
