Web Scraping: The Art of Extracting Data (Quality Engineering)

Introduction

Big data is extremely valuable in today's increasingly data-driven society. According to a recent analysis from Research and Markets, the big data market is predicted to grow from $162.6 billion in 2021 to $273.4 billion in 2026.

Web scraping lets you collect data from freely accessible sources, such as web pages. Although many off-the-shelf scraping tools are available, learning a practical programming language like Python lets you write custom code that scrapes web pages quickly and accurately.

Web scraping, sometimes referred to as web data extraction or web harvesting, is the process of obtaining information from a website. Although you can do this manually, automated web scraping solutions complete the work far more quickly and effectively when a project calls for data from hundreds or even thousands of web pages.

Web scraping technologies gather and export extracted data into a central local database, spreadsheet, or API for in-depth analysis.

Web crawlers and web scrapers work together to extract particular data from web pages. Web scraping software can connect to the internet through HTTP or a web browser. In a subsequent section of this post, we'll go into more depth about web crawlers and web scrapers.

In order to extract data, a scraper must first fetch the website. Fetching is the process of retrieving a web page; the browser performs the same action whenever a user visits a page. The page's content is then parsed (i.e., its syntax is examined), reformatted, or searched, and the retrieved data is entered into a database or put into a spreadsheet.
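As a rough illustration of the parse-and-extract half of that cycle, the sketch below uses only the JDK. A hard-coded HTML string stands in for a fetched page, and a regular expression pulls out the link targets; the class and method names are hypothetical, and a regex is only acceptable for a sketch like this, while real scrapers should use an HTML parser such as jsoup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FetchParseSketch {

    // Pull every href value out of a chunk of HTML.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Stand-in for the HTML the fetch step would return.
        String page = "<html><body>"
                + "<a href=\"https://example.com/a\">A</a>"
                + "<a href=\"https://example.com/b\">B</a>"
                + "</body></html>";
        // "Store" step: here we just print; a real pipeline would
        // write the results to a database or spreadsheet.
        extractLinks(page).forEach(System.out::println);
    }
}
```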

Quality Engineering Use Cases

DOM Verification

  • DOM verification is an important step in test automation, and web scraping helps here
  • It helps capture DOM elements and validate them
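One way to sketch DOM verification with only the JDK is to parse a well-formed markup snippet with the built-in XML parser and assert that an expected element is present. The class and method names here are hypothetical; real-world HTML is rarely well-formed XML, so in practice a lenient parser such as jsoup would be used instead.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DomVerificationSketch {

    // Verify that the markup contains an element with the expected tag and text.
    // Unparseable markup counts as a verification failure.
    public static boolean hasElementWithText(String markup, String tag, String expectedText) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(tag);
            for (int i = 0; i < nodes.getLength(); i++) {
                if (expectedText.equals(nodes.item(i).getTextContent())) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Demo</title></head><body><h1>Welcome</h1></body></html>";
        System.out.println(hasElementWithText(page, "h1", "Welcome"));
    }
}
```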

Business Automation

  • You might occasionally need to gather a lot of data from a collection of websites.
  • You must complete this swiftly, methodically, and consistently. These data sets may be automatically extracted using web scraping technologies.
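To make the "methodically and consistently" point concrete: once several sites have been scraped, the per-site results need to be combined in a repeatable way. A tiny stdlib sketch (the class and method names are hypothetical) that merges link lists into one deduplicated, sorted set:

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class BatchCollector {

    // Merge link lists scraped from several sites into one
    // deduplicated set; TreeSet keeps the output sorted, so
    // repeated runs produce identical results.
    public static Set<String> mergeResults(List<List<String>> perSiteLinks) {
        Set<String> merged = new TreeSet<>();
        for (List<String> links : perSiteLinks) {
            merged.addAll(links);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> scraped = List.of(
                List.of("https://a.example/x", "https://b.example/y"),
                List.of("https://b.example/y", "https://c.example/z"));
        System.out.println(mergeResults(scraped));
    }
}
```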

Understanding Similar Products

  • Web data extraction is essential for market research.
  • The resultant information is used by market researchers to inform their research in various areas of study, including competitor analysis, pricing analysis, research and development, and market trend analysis.

How to scrape website data using Java?

  • Step 1: Set up the environment

    <dependencies>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.14.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.12.0</version>
        </dependency>
    </dependencies>

  • Step 2: Inspect the page you want to scrape
  • Step 3: Send an HTTP request and scrape the HTML

    public static String returnResponse(String url) {
        try {
            URL obj = new URL(url);
            HttpURLConnection con = (HttpURLConnection) obj.openConnection();
            // Optional request header; some sites reject requests without a User-Agent
            con.setRequestProperty("User-Agent", "Mozilla/5.0");
            int responseCode = con.getResponseCode();
            System.out.println("Response code: " + responseCode);
            // Read the response body line by line into a single string
            BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
            String inputLine;
            StringBuilder response = new StringBuilder();
            while ((inputLine = in.readLine()) != null) {
                response.append(inputLine);
            }
            in.close();
            return response.toString();
        } catch (IOException ie) {
            ie.printStackTrace();
            return null;
        }
    }

  • Step 4: Extracting specific sections

    public static void main(String[] args) {
        String responseBody = returnResponse("https://sourceforge.net/projects/orangehrm/");
        if (responseBody == null) {
            System.out.println("Request failed; nothing to parse");
            return;
        }
        Document doc = Jsoup.parse(responseBody);
        // Select every anchor element that carries an href attribute
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            System.out.println(link.attr("href"));
        }
    }
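Printing every link is rarely the end goal; the extracted hrefs usually need to be narrowed to the section of the site you care about. A small stdlib helper (the class and method names are hypothetical, not part of the article's code) that keeps only the absolute links pointing at a given host:

```java
import java.net.URI;
import java.util.List;
import java.util.stream.Collectors;

public class LinkFilter {

    // Keep only the absolute links whose host matches the one we
    // are interested in; relative or malformed hrefs are skipped.
    public static List<String> linksForHost(List<String> hrefs, String host) {
        return hrefs.stream()
                .filter(h -> {
                    try {
                        return host.equals(URI.create(h).getHost());
                    } catch (IllegalArgumentException e) {
                        return false; // not a parseable URI
                    }
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> hrefs = List.of(
                "https://sourceforge.net/projects/orangehrm/files/",
                "https://example.com/other",
                "/relative/path");
        System.out.println(linksForHost(hrefs, "sourceforge.net"));
    }
}
```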

How to scrape website data using Selenium?

  • Step 1: Set up the environment

        <!-- https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-java -->
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-java</artifactId>
            <version>4.12.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/io.github.bonigarcia/webdrivermanager -->
        <dependency>
            <groupId>io.github.bonigarcia</groupId>
            <artifactId>webdrivermanager</artifactId>
            <version>5.5.3</version>
        </dependency>        

  • Step 2: Inspect the page you want to scrape
  • Step 3: Send an HTTP request and scrape the HTML
  • Step 4: Extracting specific sections

    public static void seleniumStart(String url) {
        WebDriverManager.chromedriver().setup();
        WebDriver driver = new ChromeDriver();
        try {
            driver.get(url);
            // XPath that selects every anchor carrying an href attribute
            List<WebElement> hrefElements = driver.findElements(By.xpath("//a[@href]"));
            for (WebElement hrefElement : hrefElements) {
                System.out.println(hrefElement.getDomProperty("href"));
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            driver.quit();
        }
    }

    public static void main(String[] args) {
        seleniumStart("https://sourceforge.net/projects/orangehrm/");
    }

How to grab console logs of the browser?

  • You can grab the browser's console output using Selenium
  • Use the LogEntries class to fetch the logs and print them line by line
  • This lets you parse the output and use it for analysis

    public static void checkConsoleSelenium(String url) {
        WebDriverManager.chromedriver().setup();
        // Chrome only exposes its console output if browser logging
        // is enabled up front via the goog:loggingPrefs capability
        LoggingPreferences logPrefs = new LoggingPreferences();
        logPrefs.enable(LogType.BROWSER, Level.ALL);
        ChromeOptions options = new ChromeOptions();
        options.setCapability("goog:loggingPrefs", logPrefs);
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get(url);
            driver.manage().timeouts().implicitlyWait(Duration.ofSeconds(20));
            LogEntries entries = driver.manage().logs().get(LogType.BROWSER);
            System.out.println("Started capturing the logs");
            System.out.println("==========================================================");

            for (LogEntry e : entries) {
                System.out.println(e);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            driver.quit();
        }
    }

    public static void main(String[] args) {
        checkConsoleSelenium("https://sourceforge.net/projects/orangehrm/");
    }

Summary:

  • Web scraping is an important technique if you want to leverage extracted data in your own use cases
  • There are several ways to scrape a web page, and a couple of them are listed above
  • All the code is available here: https://github.com/sohambpatel/WebSrappingTest
