An introduction to Generic Web Data Extraction

Data extraction plays a vital role in many applications, and web crawling helps you gather data from various sources. Several crawling and scraping libraries are available, such as jsoup, Xsoup, Scrapy and HtmlUnit; which one you pick depends largely on the programming language you want to build your crawler in. jsoup is a Java library that lets you extract, clean and manipulate data from HTML. You can select elements by tag, attribute, attribute value, style, text and so on. It also comes with many traversal methods for moving between HTML elements, and you can even use regular expressions to match elements. In many real-world applications you need to build a clean collection of a specific entity in order to present accurate information about it. For example, your company may sell cars, mobile phones and other products from various brands, and the catalogue depends on the region and many other factors.

Let's say you are given a new task along these lines. How should you approach the problem? Suppose you live in India, you have found https://www.cardekho.com/ , and you want to build a clean collection of cars organised by brand and other required factors. You learnt about jsoup in your college days for some exercise, so you are excited to take up the challenge and start writing a crawler for the site. It goes well, and customers are happy with the information they get on your site. After a few days you start getting feedback that a few attributes/specifications, or a few models, are missing from your site. You see a problem here and start analysing it, and you somehow end up browsing https://www.zigwheels.com/ , which shows the missing attributes and a few new car models. You are happy to have found a solution. Because the HTML structure is different there, you have to write new code to crawl this site as well. Once the crawler for zigwheels is done, you are back on track. But what if you get the same feedback again in the future, because this site also lacks a few models or attributes? Now you are in a dilemma. You spend a lot of time analysing various sites and find that the HTML structure is different for each one. You discuss this with your colleagues and try to find the right solution. You need to make the crawler generic, but how?

Again you read about various extraction libraries, XPath and CSS selectors. The whole point is that you can make the crawler generic only if you can locate any given attribute uniquely. By uniquely I mean that there should be no dependency on the order of HTML tags such as td, tr or span. You first give XPath a try, but an absolute XPath may not work for you because it depends on the position of each tag. For example:

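For the product URL https://www.cardekho.com/maruti/swift-2018 and the mileage attribute, you get the XPath shown below.

Xsoup :

        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;
        import us.codecraft.xsoup.Xsoup;

        String url = "https://www.cardekho.com/maruti/swift-2018";
        // Absolute XPath: every step depends on the position of a tag in the page
        String xpathMileage = "/html/body/main/section[1]/table/tbody/tr/td[1]/table/tbody/tr[2]/td/table/tbody/tr[3]/td[1]";
        // Jsoup.parse(String) would treat the URL itself as HTML, so fetch the page instead
        Document document = Jsoup.connect(url).get();
        String result = Xsoup.compile(xpathMileage).evaluate(document).get();
        System.out.println("Mileage : " + result);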

If you externalise this configuration, life gets easier: when you find a new source tomorrow, you do not need to write any extra code. You just push the configuration to an external store, e.g. a database. It works fine, and you start to think it will always work. But after a few days you find that some models do not have certain attributes, and because the order of the rows has changed, the same XPath now returns the value of a different attribute. You need to avoid this and find another solution. You hear about CSS selectors from someone and do a lot of research again to make the crawler generic.
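
To make the failure concrete, here is a minimal sketch of a positional lookup going wrong. It assumes two made-up spec tables (the labels and values are hypothetical, not taken from any real site) and uses the JDK's built-in XPath engine on inline, well-formed markup instead of Xsoup on a live page, so that it runs without any dependencies.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class PositionalLookupDemo {

    // Hypothetical spec table of a model that lists every attribute.
    static final String FULL_SPEC =
        "<table>"
        + "<tr><td>Engine</td><td>1197 cc</td></tr>"
        + "<tr><td>Mileage</td><td>22 kmpl</td></tr>"
        + "<tr><td>Seats</td><td>5</td></tr>"
        + "</table>";

    // Hypothetical spec table of a model that has no "Engine" row.
    static final String PARTIAL_SPEC =
        "<table>"
        + "<tr><td>Mileage</td><td>22 kmpl</td></tr>"
        + "<tr><td>Seats</td><td>5</td></tr>"
        + "</table>";

    // The configured path only says "second row, second cell";
    // it knows nothing about the label in that row.
    static final String MILEAGE_BY_POSITION = "/table/tr[2]/td[2]";

    // Evaluates an XPath expression against a markup string and
    // returns the text of the first matching node.
    static String evaluate(String markup, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(markup)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // prints "22 kmpl": the second row really is the mileage row
        System.out.println(evaluate(FULL_SPEC, MILEAGE_BY_POSITION));
        // prints "5": the rows shifted up, so the path now reads the seat count
        System.out.println(evaluate(PARTIAL_SPEC, MILEAGE_BY_POSITION));
    }
}
```

Nothing fails loudly in the second case: the lookup still returns a value, it is just the value of the wrong attribute, which is why this kind of bug tends to surface only through customer feedback.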

We discussed that if you can identify any attribute uniquely, it is possible to make the crawler generic. This is where jsoup works for you. You come up with an idea based on text search, e.g. searching for "mileage". You check various sites again and find a common pattern: attributes are shown as key-value pairs. So if you can find the key uniquely, irrespective of where it sits in the HTML, the extraction stays generic even when the order changes. Here is the solution:

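Jsoup :

        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;
        import org.jsoup.nodes.Element;

        String url = "https://www.cardekho.com/maruti/swift-2018";
        // :matchesOwn takes a regex; (?i) makes the label match case-insensitive
        String cssSelector = "td:matchesOwn((?i)mileage)";
        Document document = Jsoup.connect(url).get();

        // Locate the key cell by its text, then read the value from the cell next to it
        Element key = document.select(cssSelector).first();
        return "Mileage : " + key.nextElementSibling().text();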
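
The same key-then-sibling idea can be tried offline. The sketch below is only an illustration under stated assumptions: the two tables are made up, the label has to match exactly (ignoring case) rather than by regular expression, and the JDK's DOM parser stands in for jsoup so that nothing has to be downloaded. `KeySiblingDemo` and `valueFor` are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class KeySiblingDemo {

    // Hypothetical spec table with the mileage in the second row.
    static final String ORIGINAL_ORDER =
        "<table>"
        + "<tr><td>Engine</td><td>1197 cc</td></tr>"
        + "<tr><td>Mileage</td><td>22 kmpl</td></tr>"
        + "</table>";

    // Same data, but the mileage row has moved and a new row was added.
    static final String CHANGED_ORDER =
        "<table>"
        + "<tr><td>Mileage</td><td>22 kmpl</td></tr>"
        + "<tr><td>Seats</td><td>5</td></tr>"
        + "<tr><td>Engine</td><td>1197 cc</td></tr>"
        + "</table>";

    // Parses well-formed markup into a DOM document.
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    // Finds the td whose text equals the key (ignoring case) and
    // returns the text of the next element sibling, if there is one.
    static Optional<String> valueFor(String markup, String key) {
        NodeList cells = parse(markup).getElementsByTagName("td");
        for (int i = 0; i < cells.getLength(); i++) {
            Node cell = cells.item(i);
            if (!cell.getTextContent().trim().equalsIgnoreCase(key)) {
                continue;
            }
            Node next = cell.getNextSibling();
            while (next != null && next.getNodeType() != Node.ELEMENT_NODE) {
                next = next.getNextSibling(); // skip whitespace and comments
            }
            if (next != null) {
                return Optional.of(next.getTextContent().trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Both lines print "22 kmpl" even though the row order differs.
        System.out.println(valueFor(ORIGINAL_ORDER, "mileage").orElse("not found"));
        System.out.println(valueFor(CHANGED_ORDER, "mileage").orElse("not found"));
    }
}
```

Because the lookup is driven by the label and not by the row index, adding, removing or reordering rows does not change the result, and a missing key is reported as "not found" instead of returning some other attribute's value.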

So the whole idea is to locate the key's HTML element uniquely and then read the text of its sibling as the value. You may encounter a few sites where the key sits inside nested HTML tags, but that too can be moved into configuration held in an external source. That is how you can build a generic data extraction framework that gets data from various sites.
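
One possible shape for such a configuration is sketched below. Everything in it is an assumption made for illustration: the `SiteRule` fields, the in-memory map standing in for a database table, and above all the rule values, which do not describe the real markup of either site. It again uses the JDK's DOM parser on inline markup rather than jsoup on live pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class ConfigDrivenExtractor {

    // Per-site rule: which tag holds the label, and how many parents to
    // climb from that tag to reach the cell whose sibling holds the value.
    static final class SiteRule {
        final String keyTag;
        final int parentHops;

        SiteRule(String keyTag, int parentHops) {
            this.keyTag = keyTag;
            this.parentHops = parentHops;
        }
    }

    // Stand-in for rows loaded from an external store such as a database.
    // The rule values are invented for this sketch.
    static final Map<String, SiteRule> RULES = Map.of(
        "www.cardekho.com", new SiteRule("td", 0),
        "www.zigwheels.com", new SiteRule("span", 1));

    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    // Looks up the rule for the host, finds the label element, climbs the
    // configured number of parents and returns the next sibling's text.
    static Optional<String> extract(String host, String markup, String key) {
        SiteRule rule = RULES.get(host);
        if (rule == null) {
            return Optional.empty(); // no configuration for this source yet
        }
        NodeList candidates = parse(markup).getElementsByTagName(rule.keyTag);
        for (int i = 0; i < candidates.getLength(); i++) {
            Node node = candidates.item(i);
            if (!node.getTextContent().trim().equalsIgnoreCase(key)) {
                continue;
            }
            for (int hop = 0; hop < rule.parentHops && node != null; hop++) {
                node = node.getParentNode();
            }
            Node sibling = (node == null) ? null : node.getNextSibling();
            while (sibling != null && sibling.getNodeType() != Node.ELEMENT_NODE) {
                sibling = sibling.getNextSibling();
            }
            if (sibling != null) {
                return Optional.of(sibling.getTextContent().trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String flat = "<table><tr><td>Mileage</td><td>22 kmpl</td></tr></table>";
        String nested = "<table><tr><td><span>Mileage</span></td><td>21 kmpl</td></tr></table>";
        System.out.println(extract("www.cardekho.com", flat, "mileage").orElse("not found"));
        System.out.println(extract("www.zigwheels.com", nested, "mileage").orElse("not found"));
    }
}
```

With this layout, supporting a new source means adding one more rule to the store; the extraction code itself stays the same.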
