Big Data in Construction. Part 1-2: First Dataset. Tika OCR. Extracting content and metadata.

In this course, we will convert PDF files into text. The PDF files are our unstructured data: using Python and a few additional libraries, we will extract their text, bring it into tabular form, and then visualize the resulting data on the Kaggle platform.

This process can be divided into stages:

  • First, we convert the PDF files to text using the Apache Tika library. We then split the resulting text into lines, use regular expressions to select only the data we need, and collect that data into an array.
  • To apply these operations to every file in the folder, we wrap them in a function.
  • After that, we save the data in CSV format
  • and upload the resulting file to the Kaggle platform.
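
To make these stages concrete, here is a minimal sketch of the text-to-CSV part of the pipeline. It is not the course code: the sample text, the field labels, the regular expression, and the helper names text_to_record and records_to_csv are all assumptions chosen for illustration, and the PDF-to-text step is replaced by a hard-coded string.

```python
import csv
import io
import re

# Hypothetical sample standing in for the text that a PDF-to-text step returns.
SAMPLE_TEXT = """Drawing title: Ground floor plan

Engineer: J. Smith
Date: 2020-03-15
Some line we do not need
"""

# Hypothetical pattern: keep only "Label: value" lines for the labels we want.
FIELD_PATTERN = re.compile(r"^(Drawing title|Engineer|Date):\s*(.+)$")


def text_to_record(text):
    """Split text into lines and collect the matching fields into a dict."""
    record = {}
    for line in text.splitlines():
        match = FIELD_PATTERN.match(line.strip())
        if match:
            record[match.group(1)] = match.group(2)
    return record


def records_to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text (one row per record)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


record = text_to_record(SAMPLE_TEXT)
csv_text = records_to_csv([record], ["Drawing title", "Engineer", "Date"])
print(csv_text)
```

In the real workflow, each PDF would yield one such record, and the CSV text would be written to a file before uploading it to Kaggle.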

Since I work in the construction industry, we will work with PDF files of drawings, but in your case they do not have to be drawings. They can be accounts, contracts, or any other PDF documents that you use in your work.

In the first dataset we will have six PDF files. Each file is a drawing. From these drawings we will extract the engineer's name, the title of the drawing, the creation date, the number of changes, and the comments on the changes.

So, run Visual Studio Code (see video above). Create a new file and save it to the “Big Data Course” folder. To avoid overwriting the finished files, either create a new folder inside the Big Data Course folder or give the file a different name. We save the file as “Data from PDF” with the .py extension, which means the code will be written in Python.

To get started, we import the os and glob modules. We need these two built-in modules to work with file paths and to find the files in a folder.

First, let us print our file names. To do this, we store the path to our drawings in the variable input_path. Then we create a loop over all the files in our drawings folder.

input_path is our folder path. The second argument, the *.pdf pattern, indicates that we take all files with the .pdf extension in the drawings folder. Then, in the loop body, we write print(input_file): the print function outputs the names of all the files in the folder.
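
A minimal sketch of this step could look as follows. The folder name "drawings" and the helper name list_pdf_files are assumptions for illustration; the original code is only shown in the video and may differ.

```python
import glob
import os


def list_pdf_files(input_path):
    """Return the sorted paths of all .pdf files directly inside input_path."""
    pattern = os.path.join(input_path, "*.pdf")
    return sorted(glob.glob(pattern))


input_path = "drawings"  # assumed folder name, adjust to your own path

# Print the name of every PDF file found in the folder.
for input_file in list_pdf_files(input_path):
    print(input_file)
```

If the folder does not exist or contains no PDF files, glob simply returns an empty list and nothing is printed.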

To convert PDF to text, we need another library, this time from Apache: the Tika library.

Tika is a content detection and analysis framework written in Java. It detects and extracts metadata and text from over a thousand different file types. All of these file types can be parsed through a single interface, which makes Tika useful for search engine indexing, content analysis, translation, and much more.

For more information about the Tika library, you can read the article at the link: Apache Tika: What is it and why should I use it?

Requirements:

  • First, we need to install the tika-python library. This can be done with pip from the command line.
  • Java 7 or higher must also be installed, so that the library can launch the Tika REST server in the background.

The function that converts a file into text is parser.from_file; it takes the name of our file as its argument (see video above).

At the beginning of our code, we import parser from the tika library. To install the library, we run pip install tika in the command line.
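
As a hedged sketch of how this call can be used: parser.from_file from tika-python returns a dictionary whose "content" key holds the extracted text and whose "metadata" key holds the file metadata. The wrapper name parse_pdf and its optional from_file parameter are assumptions added here so that the function can be tried out without a running Tika server.

```python
def parse_pdf(pdf_path, from_file=None):
    """Parse one PDF with Tika and return a (content, metadata) tuple."""
    if from_file is None:
        # Imported lazily, so the function can be defined without Tika installed.
        from tika import parser
        from_file = parser.from_file

    parsed = from_file(pdf_path)

    # "content" may be None when no text could be extracted, so fall back to "".
    content = parsed.get("content") or ""
    metadata = parsed.get("metadata") or {}
    return content, metadata
```

With Tika and Java installed, calling parse_pdf on one of the drawing files returns its text and metadata. The first call can take a while, because the library starts the Tika server in the background.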

(The library installs, but errors occur. The error message contains a link to a file that we need to save to the computer: download the .jar file and save it to any folder. Next, we need to install the Java Runtime Environment: search for “Java Runtime Environment” in the browser, download the latest version, and install it. After installing Java, an error occurs again. This one is an error in the latest version of Tika. We remove the installed version with pip uninstall tika and install version 1.23 instead, with pip install tika==1.23.)

Installation of this version is error-free.

Now our code runs without errors. In the terminal, the code prints the names of all the files in the folder. Next, as an example, we output the text of the first PDF file. To do this, we write the break statement at the end of the loop body, which stops the loop after its first iteration. We can see that the text has been extracted. This is what our PDF file looks like as text (see video above).

We can now access this text through the “content” key of the parsed result. Printing it lets us see more clearly where the lines are in the file and how they are arranged.

Now, with the splitlines method, we divide the resulting text into lines and remove the empty ones.
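
A minimal sketch of this step is shown below. The helper name non_empty_lines and the sample string are assumptions for illustration; text extracted from a real drawing will look different.

```python
def non_empty_lines(text):
    """Split text into lines, strip surrounding whitespace, drop blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical sample with blank and whitespace-only lines, as extracted text often has.
sample = "\n\nDrawing title\n   \nEngineer: J. Smith\n\n"
print(non_empty_lines(sample))  # ['Drawing title', 'Engineer: J. Smith']
```

Stripping each line before testing it also removes lines that contain only spaces, which a plain comparison with the empty string would keep.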

Next lesson: Part 1-3: Regular Expression in Python. Pattern matching in Python with RegEx.


***

If you don't want to wait for new articles to be released, you can find a complete course on data extraction. For all my readers, there is a 30% discount for all new students: coupon30getoff . Thanks for reading.

If you like my content, please consider buying me a coffee. Thank you for your support, I really appreciate it! buymeacoffee.com/boikoartem

