Web Scraping: The Art of Extracting Data (Quality Engineering)

Introduction

Big data is extremely valuable in today's increasingly data-driven society. According to a recent analysis from Research and Markets, the big data market is predicted to grow from $162.6 billion in 2021 to $273.4 billion in 2026.

Web scraping lets you collect data from freely accessible sources, such as web pages. Although many off-the-shelf scraping tools are available, learning a practical programming language like Python lets you write custom code that scrapes web pages quickly and accurately.

Web scraping, sometimes referred to as web data extraction or web harvesting, is the process of obtaining information from a website. Although you can do this manually, automated web scraping solutions complete the work far more quickly and effectively when a project calls for data from hundreds or even thousands of web pages.

Web scraping technologies gather and export extracted data into a central local database, spreadsheet, or API for in-depth analysis.

Web crawlers and web scrapers work together to extract particular data from web pages. Web scraping software can connect to the internet through HTTP or a web browser. In a subsequent section of this post, we'll go into more depth about web crawlers and web scrapers.

In order to extract data, a scraper must first fetch the website. Fetching is the process of retrieving a web page; the browser performs the same action whenever a user visits a page. The page's content is then parsed (i.e., its syntax is examined), reformatted, or searched, and the retrieved data is entered into a database or put into a spreadsheet.
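As a rough illustration of the parse-and-extract half of that cycle, the sketch below uses only the JDK. A hard-coded HTML string stands in for a fetched page, and a regular expression pulls out the link targets; the class and method names are hypothetical, and a regex is only acceptable for a sketch like this, while real scrapers should use an HTML parser such as jsoup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FetchParseSketch {

    // Pull every href value out of a chunk of HTML.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Stand-in for the HTML the fetch step would return.
        String page = "<html><body>"
                + "<a href=\"https://example.com/a\">A</a>"
                + "<a href=\"https://example.com/b\">B</a>"
                + "</body></html>";
        // "Store" step: here we just print; a real pipeline would
        // write the results to a database or spreadsheet.
        extractLinks(page).forEach(System.out::println);
    }
}
```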

Quality Engineering Use Cases

DOM Verification

  • DOM verification is an important step in test automation, and web scraping helps here
  • It helps capture DOM elements and validate them
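One way to sketch DOM verification with only the JDK is to parse a well-formed markup snippet with the built-in XML parser and assert that an expected element is present. The class and method names here are hypothetical; real-world HTML is rarely well-formed XML, so in practice a lenient parser such as jsoup would be used instead.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DomVerificationSketch {

    // Verify that the markup contains an element with the expected tag and text.
    // Unparseable markup counts as a verification failure.
    public static boolean hasElementWithText(String markup, String tag, String expectedText) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(tag);
            for (int i = 0; i < nodes.getLength(); i++) {
                if (expectedText.equals(nodes.item(i).getTextContent())) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Demo</title></head><body><h1>Welcome</h1></body></html>";
        System.out.println(hasElementWithText(page, "h1", "Welcome"));
    }
}
```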

Business Automation

  • You might occasionally need to gather a lot of data from a collection of websites.
  • You must complete this swiftly, methodically, and consistently. These data sets may be automatically extracted using web scraping technologies.
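To make the "methodically and consistently" point concrete: once several sites have been scraped, the per-site results need to be combined in a repeatable way. A tiny stdlib sketch (the class and method names are hypothetical) that merges link lists into one deduplicated, sorted set:

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class BatchCollector {

    // Merge link lists scraped from several sites into one
    // deduplicated set; TreeSet keeps the output sorted, so
    // repeated runs produce identical results.
    public static Set<String> mergeResults(List<List<String>> perSiteLinks) {
        Set<String> merged = new TreeSet<>();
        for (List<String> links : perSiteLinks) {
            merged.addAll(links);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> scraped = List.of(
                List.of("https://a.example/x", "https://b.example/y"),
                List.of("https://b.example/y", "https://c.example/z"));
        System.out.println(mergeResults(scraped));
    }
}
```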

Understanding Similar Products

  • Web data extraction is essential for market research.
  • The resultant information is used by market researchers to inform their research in various areas of study, including competitor analysis, pricing analysis, research and development, and market trend analysis.

How to scrape website data using Java?

  • Step 1: Set up the environment

    <dependencies>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.14.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.12.0</version>
        </dependency>
    </dependencies>

  • Step 2: Inspect the page you want to scrape
  • Step 3: Send an HTTP request and scrape the HTML

    public static String returnResponse(String url) {
        try {
            URL obj = new URL(url);
            HttpURLConnection con = (HttpURLConnection) obj.openConnection();
            // Optional request header; some sites reject requests without a User-Agent
            con.setRequestProperty("User-Agent", "Mozilla/5.0");
            int responseCode = con.getResponseCode();
            System.out.println("Response code: " + responseCode);
            // Read the response body line by line into a single string
            BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
            String inputLine;
            StringBuilder response = new StringBuilder();
            while ((inputLine = in.readLine()) != null) {
                response.append(inputLine);
            }
            in.close();
            return response.toString();
        } catch (IOException ie) {
            ie.printStackTrace();
            return null;
        }
    }

  • Step 4: Extracting specific sections

    public static void main(String[] args) {
        String responseBody = returnResponse("https://sourceforge.net/projects/orangehrm/");
        if (responseBody == null) {
            System.out.println("Request failed; nothing to parse");
            return;
        }
        Document doc = Jsoup.parse(responseBody);
        // Select every anchor element that carries an href attribute
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            System.out.println(link.attr("href"));
        }
    }
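Printing every link is rarely the end goal; the extracted hrefs usually need to be narrowed to the section of the site you care about. A small stdlib helper (the class and method names are hypothetical, not part of the article's code) that keeps only the absolute links pointing at a given host:

```java
import java.net.URI;
import java.util.List;
import java.util.stream.Collectors;

public class LinkFilter {

    // Keep only the absolute links whose host matches the one we
    // are interested in; relative or malformed hrefs are skipped.
    public static List<String> linksForHost(List<String> hrefs, String host) {
        return hrefs.stream()
                .filter(h -> {
                    try {
                        return host.equals(URI.create(h).getHost());
                    } catch (IllegalArgumentException e) {
                        return false; // not a parseable URI
                    }
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> hrefs = List.of(
                "https://sourceforge.net/projects/orangehrm/files/",
                "https://example.com/other",
                "/relative/path");
        System.out.println(linksForHost(hrefs, "sourceforge.net"));
    }
}
```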

How to scrape website data using Selenium?

  • Step 1: Set up the environment

        <!-- https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-java -->
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-java</artifactId>
            <version>4.12.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/io.github.bonigarcia/webdrivermanager -->
        <dependency>
            <groupId>io.github.bonigarcia</groupId>
            <artifactId>webdrivermanager</artifactId>
            <version>5.5.3</version>
        </dependency>        

  • Step 2: Inspect the page you want to scrape
  • Step 3: Send an HTTP request and scrape the HTML
  • Step 4: Extracting specific sections

    public static void seleniumStart(String url) {
        WebDriverManager.chromedriver().setup();
        WebDriver driver = new ChromeDriver();
        try {
            driver.get(url);
            // XPath that selects every anchor carrying an href attribute
            List<WebElement> hrefElements = driver.findElements(By.xpath("//a[@href]"));
            for (WebElement hrefElement : hrefElements) {
                System.out.println(hrefElement.getDomProperty("href"));
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            driver.quit();
        }
    }

    public static void main(String[] args) {
        seleniumStart("https://sourceforge.net/projects/orangehrm/");
    }

How to grab console logs of the browser?

  • You can grab the browser's console output using Selenium
  • Use the LogEntries class to fetch the logs and print them line by line
  • This lets you parse the output and use it for analysis

    public static void checkConsoleSelenium(String url) {
        WebDriverManager.chromedriver().setup();
        // Chrome only exposes its console output if browser logging
        // is enabled up front via the goog:loggingPrefs capability
        LoggingPreferences logPrefs = new LoggingPreferences();
        logPrefs.enable(LogType.BROWSER, Level.ALL);
        ChromeOptions options = new ChromeOptions();
        options.setCapability("goog:loggingPrefs", logPrefs);
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get(url);
            driver.manage().timeouts().implicitlyWait(Duration.ofSeconds(20));
            LogEntries entries = driver.manage().logs().get(LogType.BROWSER);
            System.out.println("Started capturing the logs");
            System.out.println("==========================================================");

            for (LogEntry e : entries) {
                System.out.println(e);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            driver.quit();
        }
    }

    public static void main(String[] args) {
        checkConsoleSelenium("https://sourceforge.net/projects/orangehrm/");
    }

Summary:

  • Web scraping is an important technique if you want to leverage extracted data in your own use cases
  • There are several ways to scrape a web page, and a couple of them are listed above
  • All the code is available here: https://github.com/sohambpatel/WebSrappingTest
