Big Data in Construction. Part 1-2: First Dataset. Tika OCR. Extracting content and metadata.

Artem Boiko

DataDrivenConstruction.io | Data Analytics and Data Processing in Construction | Democratize Data Access | AI and LLM with CAD (BIM)

Published: October 24, 2020

In this course, we will turn PDF files into text. The PDF files are our unstructured data: we will convert them to text using Python and a few additional libraries, bring the text into tabular form, and then visualize the resulting data on the Kaggle platform.

This process can be divided into stages:

1. Convert the PDF files to text using the Apache Tika library, then split the resulting text into lines.
2. Use regular expressions to select only the data we need and collect this data into an array.
3. Wrap these operations in a function so that they can be applied to every file in the folder.
4. Save the data in CSV format.
5. Upload the resulting file to the Kaggle platform.
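The regular-expression and CSV stages are covered in later lessons, but a rough sketch may help to picture where the course is heading. Everything specific here is an assumption made for illustration: the "Drawn by:" label, the sample lines, and the helper names extract_engineer and rows_to_csv are hypothetical and do not come from the course files.

```python
import csv
import io
import re

# Hypothetical label pattern; real drawings will need their own expressions.
ENGINEER = re.compile(r"^Drawn by:\s*(.+)$")


def extract_engineer(lines):
    """Return the text after 'Drawn by:' from the first matching line, or None."""
    for line in lines:
        match = ENGINEER.match(line)
        if match:
            return match.group(1).strip()
    return None


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up lines standing in for the text of one drawing.
sample_lines = ["Ground floor plan", "Drawn by: J. Smith"]
rows = [{"file": "a.pdf", "engineer": extract_engineer(sample_lines)}]
print(rows_to_csv(rows, ["file", "engineer"]))
```

Keeping the extraction and the CSV writing in separate functions means each stage can be checked on its own before the whole folder is processed.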

As I work in the construction industry, we will use PDF files of drawings, but your files do not have to be drawings. They can be accounts, contracts, or any other PDF documents that you use in your work.

The first dataset consists of six PDF files, each of which is a drawing. From these drawings we will extract the engineer's name, the title of the drawing, the creation date, the number of changes, and the comments on those changes.

So, start Visual Studio Code (see video above). Create a new file and save it in the “Big Data Course” folder. If the finished course files are already there, create a new subfolder or give the file a different name so that you do not overwrite them. We save the file as “Data from PDF” with the .py extension, which means the code will be written in Python.

To get started, we import the os and glob modules. These two modules from the Python standard library let us build file paths and find files by name pattern.

First, we print our file names. To do this, we store the path to the folder with our drawings in the variable input_path. Then we create a loop over all the files in that folder.

input_path is our folder path. The second argument of the function call specifies that we take every file with the .pdf extension in the drawings folder. In the loop body we write print(input_file); the print function outputs the name of each file in the folder.
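The listing step described above might look roughly like this. It is a minimal sketch: the folder name "drawings" and the helper name list_pdf_files are assumptions, and the exact code in the video may differ.

```python
import glob
import os


def list_pdf_files(input_path):
    """Return the sorted paths of all .pdf files directly inside input_path."""
    pattern = os.path.join(input_path, "*.pdf")  # e.g. drawings/*.pdf
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    input_path = "drawings"  # assumed folder name; replace with your own path
    for input_file in list_pdf_files(input_path):
        print(input_file)
```

glob.glob returns paths in no guaranteed order, so the sketch sorts them to make the output repeatable.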

To convert PDF to text, we need a new library from Apache - the Tika library.

Tika is a content detection and analysis framework written in Java. It detects and extracts metadata and text from over a thousand different file types. All of these file types can be parsed through a single interface, which makes Tika useful for search engine indexing, content analysis, translation, and much more.

For more information about the Tika library, you can read the article at the link: Apache Tika: What is it and why should I use it?

Requirements:

First, we need to install the tika-python library; this can be done with pip on the command line.
Java 7 or higher must also be installed so that the library can launch the Tika REST server in the background.

The function that converts a file into text is called parser.from_file, and it takes the name of our file as its argument (see video above).

At the beginning of our code, we import the parser module from the tika package. To install the library, we run “pip install tika” on the command line.

(The library installs, but errors occur when the code runs. The error message contains a link to a file that we need to save to the computer: download the “.jar” file and save it to any folder. Next, we need to install Java: search for “Java Runtime Environment” in the browser, download the latest version, and install it. After installing Java, an error occurs again; this one is caused by the version of Tika that was the latest at the time of recording. We remove the installed version with “pip uninstall tika” and install version 1.23 with “pip install tika==1.23”.)

With this version, the installation completes without errors.
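With the library installed, the conversion step can be sketched as follows. parser.from_file is the tika-python call named above, and it returns a dictionary; the wrapper pdf_to_text and its parse argument are hypothetical additions for this sketch, included only so the function can be tried without Java or a running Tika server.

```python
def pdf_to_text(pdf_path, parse=None):
    """Return the text extracted from pdf_path, or '' if nothing was extracted.

    parse is a hypothetical hook: any callable that takes a path and returns
    a dict with a "content" key. By default tika's parser.from_file is used.
    """
    if parse is None:
        # Imported lazily so this module loads even where tika is not installed.
        from tika import parser
        parse = parser.from_file
    parsed = parse(pdf_path)
    # "content" can be None, for example when a PDF holds only scanned images.
    return parsed.get("content") or ""


# Typical use, assuming tika and Java are installed:
# text = pdf_to_text("drawings/example.pdf")  # placeholder file name
```

The first real call may take a while, because the library has to start the Tika server in the background.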

Now our code runs without errors. In the terminal, the code prints the names of all the files in the folder. Next, as an example, we output the text of the first PDF file. To do this, we write the break statement at the end of the loop body, which stops the loop after its first iteration. Now we see that the text has been extracted. This is what our PDF file looks like as text (see video above).

We can now access this text through the “content” key of the parsed result. This makes it easier to see where and how the lines are arranged in the file.
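To illustrate the shape of the parsed result, here is a small sketch that uses a made-up dictionary in place of a real call to parser.from_file. The keys "content", "metadata" and "status" are the ones tika-python returns; the values are invented for illustration only.

```python
# Made-up stand-in for the dictionary returned by parser.from_file().
parsed = {
    "status": 200,
    "metadata": {"Content-Type": "application/pdf"},
    "content": "\n\nGround floor plan\n\nDrawn by: J. Smith\n",
}

text = parsed["content"]

# repr() shows the newline characters explicitly, which makes it easier
# to see where the lines begin and end.
print(repr(text))

# The metadata is a separate dictionary inside the same result.
print(parsed["metadata"]["Content-Type"])
```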

Now, with the splitlines method, we divide the resulting text into lines and remove the empty ones.
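One possible way to write this step is shown below. The helper name non_empty_lines is an assumption, and stripping surrounding whitespace from each line is an extra choice made in this sketch rather than something the lesson states.

```python
def non_empty_lines(text):
    """Split text into lines, trim whitespace, and drop lines that are empty."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Made-up sample text with blank and whitespace-only lines.
sample = "\n\nGround floor plan\n   \nDrawn by: J. Smith\n"
print(non_empty_lines(sample))
```

Lines that contain only spaces are treated as empty here, which is usually what is wanted for text extracted from a PDF.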

Next lesson: Part 1-3: Regular Expression in Python. Pattern matching in Python with RegEx.

Links to previous lessons:

Part 1-1: Choosing python IDE. Anaconda. Install Python.

***

If you don't want to wait for new articles to be released, you can find a complete course on data extraction. For all my readers there is a 30% discount for all new students: coupon30getoff. Thanks for reading.

If you like my content, please consider buying me a coffee. Thank you for your support, I really appreciate it! buymeacoffee.com/boikoartem


