Small tips on the topic of data collecting

Sergey Zhuravlev

Co-Founder @ Phoenix platform | Head of RnD

Published: December 10, 2023

Intro

Over the last year or so, I have started several projects, all of them related to algorithm development, hypothesis testing, and similar work.

When you have a concept of how your algorithm should work, you normally start development, and at some point you have to apply it to real data. In this article I will show my solution to the problem of collecting images for your own specific needs.

Problem statement

My master's degree is in geology, so, like any geologist with a career in IT, I had been thinking about building an application that identifies rocks from photos. I asked geologists I know personally for labeled datasets for this purpose, but no one had any. So I decided to build my own set of images to train the model. I looked for open databases but could not find anything suitable for my needs.

So I had a choice: find images of rocks manually or collect them automatically. Several months ago I released a library for data scraping that we use in Phoenix, but that version of the library was developed for collecting text and numerical data, so it did not fit this task. Here is the solution I came up with instead.

Solution

First, I used ChatGPT to create a list of the most popular rocks. It gave me about a hundred names, which I used as search queries for Google.

Let's see how a Google URL looks when you search for something, for example, granite.

https://www.google.com/search?newwindow=1&client=firefox-b-d&sca_esv=589597305&sxsrf=AM9HkKmoaUyHoXv80c9_BNxyiXCSL0A-_g:1702230884149&q=granite+rock&tbm=isch&source=lnms&sa=X&ved=2ahUKEwjyj7OZuIWDAxVxhv0HHRfkBJIQ0pQJegQIDBAB&biw=1600&bih=867&dpr=2

In the middle of the URL we see the text we typed into the search box (the q parameter). If we replace the word “granite” with “limestone”, for example, and run the search, we will see search results for limestone.
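The substitution described above can be sketched in Python. This is a minimal illustration, not the original code: the function name build_image_search_url is hypothetical, and it assumes that only the q and tbm=isch parameters from the URL above are needed (the remaining parameters look like browser and session details).

```python
from urllib.parse import urlencode


def build_image_search_url(rock_name: str) -> str:
    """Build a Google image search URL for one rock name.

    Assumption: only `q` (the query) and `tbm=isch` (image search)
    are kept from the full URL shown above.
    """
    params = {"q": f"{rock_name} rock", "tbm": "isch"}
    # urlencode escapes the query and turns spaces into '+'
    return "https://www.google.com/search?" + urlencode(params)


print(build_image_search_url("limestone"))
# https://www.google.com/search?q=limestone+rock&tbm=isch
```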

Perfect, we see pictures for the updated request. Now let's save the page locally as HTML and look for the general structure and the names of the classes used to wrap the images.

We see that each image is represented as a link in an img tag with the class “rg_i Q4LuWd”, so the first step is to read the HTML page and collect the links stored in tags with this class.
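One way to collect those links, using only the Python standard library, might look like the sketch below. The names ImageLinkParser and extract_image_links are hypothetical. The sketch also assumes that the link sits in the src or data-src attribute and starts with http. The class name “rg_i Q4LuWd” comes from the saved page and may change whenever Google updates its markup.

```python
from html.parser import HTMLParser

# Class names observed in the saved page; Google may change them at any time.
TARGET_CLASSES = {"rg_i", "Q4LuWd"}


class ImageLinkParser(HTMLParser):
    """Collects links from <img> tags that carry both target classes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        if not TARGET_CLASSES <= classes:
            return
        # Assumption: the link is stored in src or data-src
        link = attr.get("src") or attr.get("data-src")
        if link and link.startswith("http"):
            self.links.append(link)


def extract_image_links(html: str) -> list[str]:
    """Return all matching image links found in an HTML string."""
    parser = ImageLinkParser()
    parser.feed(html)
    return parser.links
```

To use it on a locally saved page, read the file into a string first, for example with pathlib.Path("page.html").read_text(encoding="utf-8"), where page.html is a placeholder file name.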

I wrote some Python code to load the page as a string and tried to extract links from it. I got 20 links, so the next step is to download the images and save them locally.
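A download-and-save function can be very short. The version below is a sketch under assumptions rather than the original function: the name save_image and the 10-second timeout are illustrative choices, and it uses urllib from the standard library.

```python
from pathlib import Path
from urllib.request import urlopen


def save_image(url: str, target: Path) -> Path:
    """Download one image and write its bytes to `target`."""
    with urlopen(url, timeout=10) as response:
        target.write_bytes(response.read())
    return target
```

Some servers reject requests without a browser-like User-Agent header, so in practice the call may need a urllib.request.Request object with custom headers.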

Just 3 lines of code and the function is ready. Afterwards I wrapped the code in a small loop that runs the following steps for each of the 100 rocks:

1. Creating the Google request URL

2. Loading the page with results

3. Extracting the links where the images are stored

4. Creating a folder for each rock name

5. Saving the images to that folder
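The five steps above can be sketched as one self-contained loop. This is an illustration under assumptions, not the code used for the article: the names fetch and collect_rock_images, the User-Agent header, the numbered .jpg file names, and the download parameter (which lets you swap in a different downloader) are all hypothetical. Google may also return different markup to a script than to a browser, so the class filter may need adjusting.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request, urlopen

HEADERS = {"User-Agent": "Mozilla/5.0"}  # assumption: browser-like header


class _ImgParser(HTMLParser):
    """Collects http(s) links from <img class="rg_i Q4LuWd"> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        link = attr.get("src") or attr.get("data-src") or ""
        if tag == "img" and {"rg_i", "Q4LuWd"} <= classes and link.startswith("http"):
            self.links.append(link)


def fetch(url: str) -> bytes:
    """Download raw bytes from a URL (used for pages and images)."""
    with urlopen(Request(url, headers=HEADERS), timeout=10) as response:
        return response.read()


def collect_rock_images(rock_names, output_dir, download=fetch):
    """Run the five steps for every rock; returns {rock: saved image count}."""
    saved = {}
    for rock in rock_names:
        # 1. Create the Google request URL
        query = urlencode({"q": f"{rock} rock", "tbm": "isch"})
        url = "https://www.google.com/search?" + query
        # 2. Load the page with results
        html = download(url).decode("utf-8", errors="replace")
        # 3. Extract the links where the images are stored
        parser = _ImgParser()
        parser.feed(html)
        # 4. Create a folder for the rock name
        folder = Path(output_dir) / rock.replace(" ", "_")
        folder.mkdir(parents=True, exist_ok=True)
        # 5. Save the images to that folder (assumed .jpg extension)
        for index, link in enumerate(parser.links):
            (folder / f"{index:03d}.jpg").write_bytes(download(link))
        saved[rock] = len(parser.links)
    return saved
```

A call such as collect_rock_images(["granite", "limestone"], "rocks") would then create one folder per rock under rocks. The sketch has no error handling, so a single failed download stops the run; wrapping step 5 in a try/except block would be a reasonable next step.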

Discussion

It took me about 30 minutes to write the code, and it saved me at least several days of work assembling a dataset of 2000 images, so I think the result is really good. Of course, you have to review the saved results, because search results may not be as accurate as you expect and some downloaded images may not be suitable for your needs. Still, 2000 images in one click is something outstanding.

In the next few weeks I am going to clean up the code and make it available for you to use. I hope this will be my small contribution to the work of anyone who faces the same problem.

#phoenix #datamining #ML #datamanagement #scrapping #python #AI
