Big Data in Construction. Part 1-2: First Dataset. Tika OCR. Extracting content and metadata.
Artem Boiko
DataDrivenConstruction.io | Data Analytics and Data Processing in Construction | Democratize Data Access | AI and LLM with CAD (BIM)
In this course, we will transform a PDF file into text. The PDF file is our unstructured data: using Python and a few additional libraries, we will convert it to text, then to tabular form, and finally visualize the resulting data on the Kaggle platform.
This process can be divided into stages:
- In the first stage, we will convert PDF files to text using the Apache Tika library. We then split the resulting text into lines, use regular expressions to select only the data we need, and collect that data into an array.
- To apply these operations to every file in the folder, we will create a function.
- After that, we will save our data in CSV format.
- Finally, we will upload the resulting file to the Kaggle platform.
As I work in the construction industry, we will work with PDF files of drawings, but in your case they do not have to be drawings. They can be accounts, documents, contracts, or any other PDF documents that you use in your work.
In the first dataset we will have six PDF files. Each file is a drawing. From these drawings we will take the engineer's name, the title of the drawing, the creation date, the number of changes, and the comments on the changes.
So, run Visual Studio Code (see video above). Create a new file and save it in the “Big Data Course” folder. To avoid overwriting the finished files, either create a new folder inside the Big Data Course folder or give the file a different name. We save the file under the name Data from PDF with the .py extension, which means the code will be written in Python.
To get started, we import the os and glob libraries. We need these two built-in libraries to work with file paths and to find files by name pattern.
First, let's output our file names. To do this, we write the path to our drawings in the variable input_path. Then we create a loop over all the files in our drawings folder.
input_path is our folder path. The second parameter of the function indicates that we take all files with the PDF extension in the drawings folder. Then, in the loop body, we write print(input_file): the print function outputs the names of all the files in the folder.
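The listing step described above might look like the following minimal sketch. The helper name list_pdf_files and the folder name "drawings" are assumptions made for illustration; the code shown in the video may be organized differently.

```python
import glob
import os


def list_pdf_files(input_path):
    """Return the paths of all PDF files directly inside input_path, sorted."""
    # The pattern "*.pdf" restricts the match to files with the PDF extension.
    pattern = os.path.join(input_path, "*.pdf")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    input_path = "drawings"  # assumed folder name; replace with your own path
    for input_file in list_pdf_files(input_path):
        print(input_file)
```

Sorting is optional; it only makes the output order predictable, since glob does not guarantee any particular order.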
To convert PDF to text, we need a new library from Apache - the Tika library.
Tika is a content detection and analysis framework written in Java. It detects and extracts metadata and text from over a thousand different file types. All of these file types can be parsed through a single interface, which makes Tika useful for search engine indexing, content analysis, translation, and much more.
For more information about the Tika library, you can read the article at the link: Apache Tika: What is it and why should I use it?
Requirements:
- First, we need to install the tika-python library; this can be done via pip from the command line.
- To allow the library to launch the Tika REST server in the background, Java 7 or higher also needs to be installed.
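Before running any parsing code, it can help to check both requirements from Python. The sketch below is only an illustration: the function name check_requirements is hypothetical, and the check looks only for a java executable on the PATH, not for its version.

```python
import importlib.util
import shutil


def check_requirements():
    """Report whether the tika package and a java executable can be found."""
    return {
        # find_spec returns None when the package is not installed.
        "tika": importlib.util.find_spec("tika") is not None,
        # shutil.which returns None when no "java" executable is on the PATH.
        "java": shutil.which("java") is not None,
    }


if __name__ == "__main__":
    for name, found in check_requirements().items():
        print(f"{name}: {'found' if found else 'missing'}")
```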
The function that converts a file into text is called parser.from_file, and it takes the name of our file (see video above).
At the beginning of our code, we import the parser module from the tika library. To install the Tika library, we run pip install tika in the command line.
(The library is installed, but errors occur. The error message contains a link to a file that we need to save to the computer: download the “Jar” file and save it to any folder. With the file saved, we now need to install the Java Runtime Environment. Search for Java Runtime Environment in the browser, download the latest version, and install it. After installing Java, an error occurs again. This is an error in the latest version of Tika, so we remove the installed version with the pip “uninstall” command and install version 1.23.)
Installation of this version is error-free.
Now our code runs without errors. In the terminal, the code prints the names of all the files in the folder. Now, as an example, we output the text from the first PDF file. To do this, we write the break statement at the end of the loop body, which stops the loop after its first iteration. Now we see that the text has been received. This is what our PDF file looks like as text (see video above).
We can now access this text through the content key of the parsed result. This helps us better understand where and how the lines are laid out in the file.
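As a rough sketch of this step, the snippet below parses the first PDF in the folder and prints its text. It assumes that tika-python and Java are installed and that the drawings live in a folder named "drawings". The helper extract_pdf and its optional parse argument are hypothetical conveniences introduced here (the argument lets the function be tried without a running Tika server); they are not part of the Tika API.

```python
import glob
import os


def extract_pdf(pdf_path, parse=None):
    """Return (text, metadata) for one file.

    parse is a hypothetical hook: any callable that behaves like
    tika.parser.from_file. When omitted, the real Tika parser is used.
    """
    if parse is None:
        # Imported lazily: requires `pip install tika` and a Java runtime.
        from tika import parser
        parse = parser.from_file
    parsed = parse(pdf_path)
    # The result is a dict; "content" can be None when no text is extracted.
    text = parsed.get("content") or ""
    metadata = parsed.get("metadata") or {}
    return text, metadata


if __name__ == "__main__":
    input_path = "drawings"  # assumed folder name
    for input_file in glob.glob(os.path.join(input_path, "*.pdf")):
        text, metadata = extract_pdf(input_file)
        print(text)
        break  # stop after the first file, as described above
```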
Now, with the splitlines method, we will divide the resulting text into lines and remove the empty lines.
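One way to do this is sketched below. The function name non_empty_lines and the sample text are invented for illustration; real Tika output from a drawing will look different. Lines that contain only whitespace are treated as empty here, which is an assumption about what "empty" should mean.

```python
def non_empty_lines(text):
    """Split text into lines and drop lines that are empty or whitespace-only."""
    return [line for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Invented sample; extracted text often contains runs of blank lines.
    sample = "\n\nDrawing title: Floor plan\n\n   \nEngineer: J. Doe\n"
    for line in non_empty_lines(sample):
        print(line)
```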
Next lesson: Part 1-3: Regular Expression in Python. Pattern matching in Python with RegEx.