The 4 Practice Problems Non-Devs Need to Start Scraping the Web


Welcome back! Or, if this is the first stop on your web scraping journey, welcome!

In this article, we'll cover four introductory concepts that help anyone better understand how websites are built and how to extract information from them with a web scraper.

Feeling lost? Go read my earlier article, Demystifying Data: Web Scraping for All Professionals. It covers what web scrapers are, how they can be used, and why they are valuable.

Now, let's dive in.

Console

Before discussing the complexities of the Document Object Model (DOM) and how JavaScript manipulates web page content, it's crucial to understand the tool that makes this interaction possible—the browser console.

The console is not just a feature of your web browser; it's a powerful gateway to the inner workings of web pages, providing real-time insights and interaction capabilities with the DOM.

Practice 1: Displaying "Hello, World" on LinkedIn.com

Let's start by opening the console on LinkedIn.com:

1. Navigate to LinkedIn: Open your preferred web browser and go to LinkedIn.com.

2. Accessing the Console:

  • For Google Chrome, Mozilla Firefox, and Edge Users: Right-click anywhere on the LinkedIn webpage (avoid clicking on images or links to prevent unwanted menus from popping up). Select "Inspect" from the context menu that appears. This action opens the developer tools panel. Click on the "Console" tab within the developer tools. This is where you can type and execute your JavaScript code.
  • For Safari Users: First, ensure that the Developer menu is enabled. Go to Safari's Preferences, click on the Advanced tab, and at the bottom, check the option that says "Show Develop menu in the menu bar". Now, navigate to LinkedIn.com, click on the "Develop" menu in the top menu bar, and then select "Show JavaScript Console."

3. Your First Command: With the console open, let's execute a simple JavaScript command to demonstrate how you can interact with the LinkedIn page. Copy and paste the following line into the console at the blinking cursor:

document.write('Hello, World!');        

4. Execute the Command: Press Enter to run the command. The LinkedIn page will be replaced with a blank page displaying "Hello, World!". This command is a basic example of how JavaScript can manipulate a webpage.


This is what your page should look like after completing Practice Problem 1

Fun Fact: "Hello, World!" is one of the most common first tasks someone learning to program will take on (and that now includes you)!
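Because document.write replaces the whole page, a gentler way to experiment is console.log, which only prints to the console and leaves the page untouched. Here is a minimal sketch; the variable name greeting is just an illustration.

```javascript
// Safe to run in any browser console: nothing here changes the page.
const greeting = 'Hello, ' + 'World!';

console.log(greeting);        // prints the text to the console
console.log(greeting.length); // prints how many characters it contains
```

If something goes wrong while you experiment, reloading the page brings back the original content.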


DOM (Document Object Model)

When a web page is loaded in the browser, the Document Object Model (DOM) is created from that page's HTML (or XML) content, resulting in a tree-like structure of nodes. Each node represents a part of the document, such as an element, attribute, or text content.

DOM: A Family Tree

Imagine the DOM as a family tree, where every family member is a part of the tree but in different positions (parents, children, siblings, etc.). Similarly, elements in the DOM have these relationships, and this structure allows JavaScript to interact with, modify, and manipulate web pages in real time.

Pro Tip: For beginners, a good way to visualize the DOM is to right-click on a webpage and select "Inspect" or "Inspect Element." This will open your browser’s Developer Tools, and you can see the DOM in the "Elements" tab. Here, you can explore how HTML elements are nested and how they relate to each other.
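The same nesting can also be printed with a few lines of JavaScript. The sketch below uses a hypothetical stand-in for a tiny page (plain objects shaped like DOM elements, with a tagName and a list of children), so treat it as an illustration of the tree idea rather than a real page.

```javascript
// Hypothetical stand-in for a tiny page: plain objects shaped like DOM elements.
const page = {
  tagName: 'BODY',
  children: [
    { tagName: 'H1', children: [] },
    { tagName: 'TABLE', children: [
      { tagName: 'TBODY', children: [
        { tagName: 'TR', children: [] },
        { tagName: 'TR', children: [] },
      ] },
    ] },
  ],
};

// Walk a node and all of its descendants, indenting each level by two spaces.
function describeTree(node, depth = 0) {
  const lines = [' '.repeat(depth * 2) + node.tagName];
  for (const child of Array.from(node.children)) {
    lines.push(...describeTree(child, depth + 1));
  }
  return lines;
}

console.log(describeTree(page).join('\n'));
```

Real elements also expose tagName and children, so in a browser console describeTree(document.body) should work the same way, though the output on a large page can be very long.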

JavaScript can manipulate the DOM, which means it can change the document structure, alter styles, and modify content. This is how web pages become interactive and dynamic. For example, when you fill out a form on a website and see instant validation feedback, that's the DOM in action.

Continuing from our initial exploration of JavaScript in the browser console, let's delve deeper and create something a bit more complex—a table. This exercise will further your understanding of the Document Object Model (DOM) and how JavaScript can be used to dynamically generate HTML content.

Practice 2: Dynamically Creating a Table with JavaScript

Now that you're familiar with executing basic JavaScript commands in the console, let's create a dynamic table that displays data. This example will help you understand how to manipulate the DOM to add new elements to a webpage.

Note: If you don't feel very comfortable with some of these terms yet, that is ok and expected! We will get you up and running by extracting some data, then come back in later articles to flesh out the concepts we are glossing over.

1. Prepare the JavaScript Code: We'll start by defining the structure and data for our table. The code below creates a table element, a tbody element (which will contain all our table rows), and an array of data that represents the rows and cells of our table:

// Create table and tbody elements
let table = document.createElement('table');
let tbody = document.createElement('tbody');

// Set table border for visibility
table.border = '1';

// Array of table data (the first row is the header)
let data = [
    ['Name', 'Age', 'City'],
    ['Alice', '24', 'New York'],
    ['Bob', '30', 'Los Angeles'],
    ['Charlie', '28', 'Chicago'],
    ['Diana', '35', 'Houston']
];

2. Populate the Table: We'll loop through the array to create rows (<tr>) and cells (<td> or <th>). The first row will be treated as the table header:

// Loop through each data row
data.forEach((row, index) => {
    let tr = document.createElement('tr'); // Create a new row
    
    // Loop through each cell in the row
    row.forEach(cellText => {
        let cell;
        if (index === 0) {
            cell = document.createElement('th'); // Table header cell for the first row
        } else {
            cell = document.createElement('td'); // Standard table cell
        }
        cell.textContent = cellText; // Set cell text content
        tr.appendChild(cell); // Append cell to row
    });

    tbody.appendChild(tr); // Append row to tbody
});

table.appendChild(tbody); // Append tbody to table
        

3. Display the Table: Finally, we'll add the entire table to the document's body, making it visible on the page:

document.body.appendChild(table); // Append the table to the body
        

4. Execute the Code: Copy the code from steps 1 through 3, in order. Return to the LinkedIn page (reload it first if it still shows "Hello, World!" from Practice 1), open the console again, paste the code into the console, and press Enter.

You should see a table appear on the LinkedIn page with the data we defined. This table demonstrates how JavaScript can dynamically generate and manipulate HTML content based on data.
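To see the markup that this kind of loop produces without touching a page, the sketch below builds the same structure as a plain string. Note that this is a different technique from the createElement approach used above, the names sampleRows and tableToHtml are made up for illustration, and the function does not escape special characters, so it is only suitable for simple sample data.

```javascript
// A shortened copy of the sample data; the first row is the header.
const sampleRows = [
  ['Name', 'Age', 'City'],
  ['Alice', '24', 'New York'],
];

// Turn an array of rows into an HTML string, using <th> for the first row
// and <td> for every other row. No escaping is done here.
function tableToHtml(rowData) {
  const body = rowData.map((row, index) => {
    const tag = index === 0 ? 'th' : 'td';
    const cells = row.map(text => `<${tag}>${text}</${tag}>`).join('');
    return `<tr>${cells}</tr>`;
  }).join('');
  return `<table><tbody>${body}</tbody></table>`;
}

console.log(tableToHtml(sampleRows));
```

Reading the printed string next to the "Elements" tab can make it easier to see how each array entry ends up as a row or a cell.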

This is what the page should look like after Practice 2 has been executed.


Selectors

Selectors are patterns that match elements in the DOM tree. They come from CSS, and JavaScript uses them through methods such as querySelector and querySelectorAll to target HTML elements on a webpage. That makes them central to reading and modifying a page's content and structure.

There are several types of selectors you can use from JavaScript:

  • Element Selector: Selects all elements with the specified tag name.
  • ID Selector: Selects a single element with the specified ID.
  • Class Selector: Selects all elements with the specified class.
  • Attribute Selector: Selects all elements with a specified attribute and value.

By combining these selectors, you can create complex selection patterns, making it easy to target and manipulate elements on your web page precisely.
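As a rough illustration, here is one example pattern for each selector type, plus a combined one. The ID '#main' and the class '.highlight' are hypothetical names, so swap in ones that exist on the page you are inspecting.

```javascript
// One example pattern per selector type. '#main' and '.highlight' are
// hypothetical names; replace them with an ID or class from your page.
const selectorExamples = {
  element: 'tr',               // every <tr> element
  id: '#main',                 // the element whose id is "main"
  class: '.highlight',         // every element with class "highlight"
  attribute: '[border="1"]',   // every element whose border attribute is "1"
  combined: 'table tbody tr',  // every <tr> inside a <tbody> inside a <table>
};

// Count the matches for each pattern. The check lets the snippet run
// harmlessly outside a browser, where no `document` exists.
if (typeof document !== 'undefined') {
  for (const [kind, pattern] of Object.entries(selectorExamples)) {
    console.log(kind, pattern, document.querySelectorAll(pattern).length);
  }
}
```

A count of 0 simply means nothing on the current page matches that pattern.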

Practice 3: Selecting and Highlighting Table Data

Now that we have a table filled with data on our LinkedIn page, let's use JavaScript selectors to interact with this table. We'll focus on selecting specific elements within the table and applying some changes to them.

Task: Highlight a Specific Row

Your task is to write a JavaScript command that selects the third row of the table (counting the header row, so this is Bob's row) and changes its background color to highlight it.

1. Understand the Structure: Recall that our table is structured with a <tbody> containing several <tr> elements (rows). The first row holds the <th> header cells, and every other row holds <td> cells. We want to target the third <tr> within <tbody>.

2. Open Your Console: Navigate to the LinkedIn page where you previously added the table, and open your browser's console.

3. Use a Selector: We'll use the querySelectorAll method to select all rows (<tr> elements) within the table's body (<tbody>). Then, we'll target the third row by its index.

let rows = document.querySelectorAll('table tbody tr');        

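4. Target the Third Row: In JavaScript, array indexing starts at 0, so the third row is at index 2. Because our header row also lives inside <tbody>, index 0 is the header, index 1 is Alice's row, and index 2 is Bob's row.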

let thirdRow = rows[2];        

5. Change the Background Color: Apply a new background color to this row to highlight it.

// A light yellow color for highlighting
thirdRow.style.backgroundColor = '#FFFF99';         

6. Execute the Code: Paste the combined commands into your console and press Enter.

let rows = document.querySelectorAll('table tbody tr'); 
let thirdRow = rows[2]; 
thirdRow.style.backgroundColor = '#FFFF99';        

After running the above commands, the third row of your table should now be highlighted with a light yellow background.
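If the table is missing, rows[2] is undefined and the style line fails with an error. One way to guard against that, and to avoid off-by-one mistakes, is a small helper that takes a 1-based position. This is only a sketch: pickRow and fakeRows are made-up names, and the plain array stands in for the list that querySelectorAll returns.

```javascript
// Return the row at a 1-based position, or null when there is no such row.
function pickRow(rowList, position) {
  const row = rowList[position - 1];
  return row === undefined ? null : row;
}

// A plain array standing in for the rows of our table (header first).
const fakeRows = ['header', 'Alice', 'Bob', 'Charlie', 'Diana'];

console.log(pickRow(fakeRows, 3)); // Bob
console.log(pickRow(fakeRows, 9)); // null, because there is no ninth row
```

In the browser, the same helper can be given document.querySelectorAll('table tbody tr'); checking the result for null before touching its style avoids the error.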

This is what your page should look like after Practice 3 is completed.

Methods

Methods in JavaScript and the DOM are functions attached to objects that perform actions or calculations using that object’s properties. The DOM provides various methods to interact with, manipulate, and retrieve information from HTML elements.

Here is a list of some common DOM methods, along with a few frequently used properties:

  • getElementById: Selects an element by its ID.
  • getElementsByClassName: Returns a collection of all elements with the specified class name.
  • getElementsByTagName: Returns a collection of all elements with the specified tag name.
  • querySelector: Returns the first element that matches a specified CSS selector.
  • querySelectorAll: Returns all elements that match a specified CSS selector.
  • appendChild: Adds a new child element to an existing element.
  • removeChild: Removes a child element from an existing element.
  • setAttribute: Sets the value of an attribute on the specified element.
  • getAttribute: Returns the value of a specified attribute on the element.
  • removeAttribute: Removes an attribute from an element.
  • addEventListener: Sets up a function to be called whenever the specified event is delivered to the target.
  • removeEventListener: Removes an event listener from an element.
  • innerText (a property, not a method): Sets or returns the text content of an element.
  • innerHTML (a property): Sets or returns the HTML content inside an element.
  • style (a property): Reads or changes the inline style of an element.
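Methods are called with parentheses, while properties are simply read or assigned. The sketch below shows the difference on a plain object; person, name, and greet are hypothetical names used only for illustration.

```javascript
// A plain object with one property (name) and one method (greet).
const person = {
  name: 'Alice',
  greet() {
    // A method can use the object's own properties through `this`.
    return 'Hi, ' + this.name;
  },
};

console.log(person.name);    // reading a property: no parentheses
console.log(person.greet()); // calling a method: parentheses required
```

DOM elements follow the same pattern: something like textContent is read as a property, while something like querySelector is called as a method.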

Practice 4: Extracting and Logging Data from Your Table

Now that we've successfully highlighted a specific row in our table (Hint! We actually used a method to complete the last practice), let's take it a step further by extracting data from a particular cell within that row and logging it to the console.

Task: Log the Name from the Third Row

Write a JavaScript command that selects the first cell (<td>) of the third row of the table (counting the header row) and logs the name contained in that cell to the console.

Steps to Follow:

1. Review the Structure: Remember that our table is structured with a <tbody> containing several <tr> elements (rows): a header row of <th> cells followed by data rows of <td> cells. We're interested in the first cell of the third row within <tbody>.

2. Open Your Console: Ensure you're on the LinkedIn page where your table exists, and open your browser's console.

3. Select the Third Row: Use the querySelectorAll method to select all rows within the table's body, and then target the third row by its index.

Don't Forget: In JavaScript, array indexing starts at 0. So, the third row will be at index 2.

let thirdRow = document.querySelectorAll('table tbody tr')[2];        

4. Select the First Cell of the Third Row: Now, target the first cell (<td>) of the selected row. Since cells are direct children of rows, you can use querySelector to select the first cell.

let firstNameCell = thirdRow.querySelector('td');        

5. Log the Cell's Content: Retrieve the text content of the selected cell and log it to the console.

console.log(firstNameCell.textContent);        

6. Execute the Code: Combine and run the commands in your console.

let thirdRow = document.querySelectorAll('table tbody tr')[2];
let firstNameCell = thirdRow.querySelector('td');
console.log(firstNameCell.textContent);

After executing the above code, the console should display the name from the first cell of the third row of your table. With the sample data, and assuming your table is the only one on the page, that name is Bob.
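The same idea extends from one cell to the whole table. The sketch below collects the text of every cell into an array of arrays. The names rowsToArrays and mockRows are made up, and the plain objects only imitate the two features of real rows that the function relies on (children and textContent).

```javascript
// Collect the trimmed text of every cell, row by row.
function rowsToArrays(rowList) {
  return Array.from(rowList, row =>
    Array.from(row.children, cell => cell.textContent.trim())
  );
}

// Plain objects imitating table rows, so the function can be tried anywhere.
const mockRows = [
  { children: [{ textContent: 'Name' }, { textContent: 'Age' }] },
  { children: [{ textContent: 'Alice' }, { textContent: ' 24 ' }] },
];

console.log(rowsToArrays(mockRows));
```

In the browser console, rowsToArrays(document.querySelectorAll('table tbody tr')) should return the rows of the practice table in the same shape, assuming it is the only table on the page.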

What you should see in your console after Practice 4.

Final Remarks

And now you've done it! Congratulations, you've successfully:

  • Navigated your console
  • Manipulated the DOM
  • Used a selector
  • Scraped content from a webpage

We breezed through this section, and you made it look easy. Pat yourself on the back.

As I mentioned earlier, if you don't feel very comfortable with some of these terms yet, that is anticipated! Now that we have you up and running with a foundation in some introductory topics, I will flesh out some of the concepts we glossed over in future articles.

Let me know in the comments what you struggled with in this tutorial. What topics do you wish I expanded on?

