Small tips on data collection


Intro

Over the last year or so I have started several projects, all of them related to algorithm development, hypothesis testing and similar work.

Once you have a concept of how your algorithm should work, you normally start developing it, and at some point you have to apply it to real data. In this article I will show my solution to the problem of collecting images for your own specific needs.

Problem statement

My Master's degree is in geology, so like any geologist with a career in IT, I was thinking about developing an application that identifies rocks from photos. I asked geologists I know personally for labeled datasets for this purpose, but no one had any. OK, so I decided to build my own set of images to train the model. I also looked for open databases but could not find anything suitable for my needs.

So I had a choice: find images of rocks manually or collect them automatically. Several months ago I released a data scraping library that we use in Phoenix, but that version of the library is designed for collecting text and numerical data, so for images I needed a different solution, which I describe here.

Solution

First, I used ChatGPT to create a list of the most popular rocks. It gave me about a hundred names, which I used as search queries for Google.

Let's see what a Google URL looks like when you search for something, for example granite:

https://www.google.com/search?newwindow=1&client=firefox-b-d&sca_esv=589597305&sxsrf=AM9HkKmoaUyHoXv80c9_BNxyiXCSL0A-_g:1702230884149&q=granite+rock&tbm=isch&source=lnms&sa=X&ved=2ahUKEwjyj7OZuIWDAxVxhv0HHRfkBJIQ0pQJegQIDBAB&biw=1600&bih=867&dpr=2

In the middle of the URL we see the text we typed into the search box (q=granite+rock). If we replace the word “granite” with “limestone”, for example, and run the search, we will see search results for limestone.
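The query swap can be sketched in Python. This is a minimal sketch, assuming that only the q parameter (the search text) and tbm=isch (both taken from the URL above) are needed, and that the remaining parameters are browser- or session-specific and can be dropped. The function name build_image_search_url is hypothetical.

```python
from urllib.parse import urlencode

GOOGLE_SEARCH = "https://www.google.com/search"


def build_image_search_url(rock_name: str) -> str:
    """Return a Google image-search URL for '<rock_name> rock'."""
    params = {
        "q": f"{rock_name} rock",  # search text, as in q=granite+rock
        "tbm": "isch",             # copied from the URL above (assumed to select image results)
    }
    # urlencode turns spaces into '+', matching the form seen in the browser URL
    return f"{GOOGLE_SEARCH}?{urlencode(params)}"


print(build_image_search_url("limestone"))
# https://www.google.com/search?q=limestone+rock&tbm=isch
```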


Perfect, we see pictures for the updated query. Now let's save the page locally as HTML and look for the general structure and the class names of the elements that wrap the images.


We see that each image is represented as a link in an img tag with the class “rg_i Q4LuWd”, so the first step is to read the HTML page and collect the links stored in tags with this class.
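A minimal sketch of this extraction step, using only the Python standard library. It assumes the link sits in the src attribute of the img tag, with data-src as a fallback (the fallback is an assumption), and the names ImageLinkParser and extract_image_links are hypothetical. The class names come from the saved page and may differ on a page saved at another time.

```python
from html.parser import HTMLParser

# Class names observed in the saved results page
IMAGE_CLASSES = {"rg_i", "Q4LuWd"}


class ImageLinkParser(HTMLParser):
    """Collect links from <img> tags that carry both image classes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        classes = set((attributes.get("class") or "").split())
        if not IMAGE_CLASSES <= classes:
            return
        # Assumption: the link is in src, or in data-src when src is absent
        link = attributes.get("src") or attributes.get("data-src")
        if link:
            self.links.append(link)


def extract_image_links(html_text: str) -> list[str]:
    """Return all matching image links found in an HTML string."""
    parser = ImageLinkParser()
    parser.feed(html_text)
    return parser.links


# Usage with a locally saved page (the file name is hypothetical):
# from pathlib import Path
# links = extract_image_links(Path("granite.html").read_text(encoding="utf-8"))
```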

I wrote some Python code to load the page as a string and extract the links from it. I got 20 links, so the next step is to download the images and save them locally.
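A download helper for this step could look like the sketch below. It is an illustration using the standard library, not the function from the article. The name save_image, the 30-second timeout and the choice of urllib are all assumptions.

```python
import urllib.request
from pathlib import Path


def save_image(link: str, target: Path) -> Path:
    """Download one image link and write its bytes to the target path."""
    # Create the destination folder if it does not exist yet
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(link, timeout=30) as response:
        target.write_bytes(response.read())
    return target


# Usage (link and path are placeholders):
# save_image("https://example.com/granite.jpg", Path("rocks/granite/000.jpg"))
```

Because urlopen also understands data: links, the same helper works when a page embeds an image directly in the src attribute instead of pointing to a file.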


Just 3 lines of code and the function is ready. After that I wrapped it in a small piece of code that runs in a loop for each of the 100 rocks:

1. Creating the Google URL request

2. Loading the page with results

3. Extracting the links where the images are stored

4. Creating a folder for each rock name

5. Saving the images to that folder
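The five steps above can be sketched as one loop. This is a self-contained illustration under several assumptions, not the author's code: the helper names, the browser-like User-Agent header, the .jpg file extension and the reduced URL (only q and tbm) are all hypothetical choices. The fetch parameter lets you swap in another downloader, which is also handy for testing offline.

```python
import urllib.request
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlencode


def fetch_bytes(url: str) -> bytes:
    """Download a URL; the browser-like User-Agent header is an assumption."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


class _ImgParser(HTMLParser):
    """Collect src/data-src of <img> tags with the rg_i and Q4LuWd classes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = set((attributes.get("class") or "").split())
        link = attributes.get("src") or attributes.get("data-src")
        if tag == "img" and {"rg_i", "Q4LuWd"} <= classes and link:
            self.links.append(link)


def collect_rock_images(rock_names, out_root="rocks", fetch=fetch_bytes):
    """Run the five steps for every rock; return {rock name: files saved}."""
    saved = {}
    for rock in rock_names:
        # 1. Create the Google URL request
        query = urlencode({"q": f"{rock} rock", "tbm": "isch"})
        url = f"https://www.google.com/search?{query}"
        # 2. Load the page with results
        page = fetch(url).decode("utf-8", errors="replace")
        # 3. Extract the links where the images are stored
        parser = _ImgParser()
        parser.feed(page)
        # 4. Create a folder for this rock name
        folder = Path(out_root) / rock.replace(" ", "_")
        folder.mkdir(parents=True, exist_ok=True)
        # 5. Save the images to that folder (.jpg extension is an assumption)
        count = 0
        for index, link in enumerate(parser.links):
            try:
                (folder / f"{index:03d}.jpg").write_bytes(fetch(link))
                count += 1
            except (OSError, ValueError):
                continue  # skip links that cannot be downloaded
        saved[rock] = count
    return saved


# Usage (performs real network requests):
# print(collect_rock_images(["granite", "limestone"]))
```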

Discussion

It took me about 30 minutes to write the code, but it saved me at least several days of work in creating a dataset of 2000 images, so I think the result is really good. Of course, you have to review the saved results, because the search results may not be as accurate as you expect and some downloaded images may not be suitable for your needs. Still, 2000 images in one click is something outstanding.

In the next few weeks I am going to clean up the code and make it available for you to use. I hope this will be my small contribution to the work of anyone who faces the same problem.

#phoenix #datamining #ML #datamanagement #scrapping #python #AI

