The 4 Practice Problems Non-Devs Need to Start Scraping the Web
Melodie Hays
NWA Fast 15 2024 | SupplierWiki Founder | Advocate for Retail Suppliers and Women in STEM
Welcome back! Or, depending on when I have found you on your journey to scrape the web, Welcome!
In this article, we are going to cover four introductory concepts that will allow anyone to understand better how websites are made and how they can extract information from those websites using a web scraper.
Feeling lost? Go read my earlier article, Demystifying Data: Web Scraping for All Professionals. It covers what web scrapers are, how they can be used, and why they are valuable.
Now, let's dive in.
Console
Before discussing the complexities of the Document Object Model (DOM) and how JavaScript manipulates web page content, it's crucial to understand the tool that makes this interaction possible—the browser console.
The console is not just a feature of your web browser; it's a powerful gateway to the inner workings of web pages, providing real-time insights and interaction capabilities with the DOM.
Practice 1: Displaying "Hello, World" on LinkedIn.com
Let's start by opening the console on LinkedIn.com:
1. Navigate to LinkedIn: Open your preferred web browser and go to LinkedIn.com.
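2. Accessing the Console: Right-click anywhere on the page and select "Inspect" to open your browser's Developer Tools, then click the "Console" tab. In Chrome, the keyboard shortcut Ctrl+Shift+J (Windows/Linux) or Cmd+Option+J (Mac) opens the console directly.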
3. Your First Command: With the console open, let's execute a simple JavaScript command to demonstrate how you can interact with the LinkedIn page. Copy and paste the following line into the console at the blinking cursor:
document.write('Hello, World!');
4. Execute the Command: Press Enter to run the command. The LinkedIn page will be replaced with a blank page displaying "Hello, World!". This command is a basic example of how JavaScript can manipulate a webpage. Nothing on LinkedIn itself has changed; refresh your browser to get the original page back.
Fun Fact: "Hello, World!" is one of the most common first tasks someone learning to program will take on (and that now includes you)!
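Because document.write replaces the whole page when it is run from the console, a gentler variant can be handy while you experiment. The sketch below is one possible alternative and not part of the original exercise: it adds a heading to the top of the existing page instead of erasing it. The helper names makeGreeting and showBanner are made up for illustration, and the typeof document check is only there so the snippet also runs outside a browser.

```javascript
// Build the greeting text (hypothetical helper, for illustration only).
function makeGreeting(name) {
  return `Hello, ${name}!`;
}

// Add the text as a heading at the top of the current page,
// leaving the rest of the page in place.
function showBanner(text) {
  const banner = document.createElement('h1');
  banner.textContent = text;
  document.body.prepend(banner);
  return banner;
}

if (typeof document !== 'undefined') {
  // In a browser console: the page stays intact and gains a heading.
  showBanner(makeGreeting('World'));
} else {
  // Outside a browser there is no page, so just print the text.
  console.log(makeGreeting('World'));
}
```

On a busy page the new heading may sit behind the site's own header, so you might need to scroll to the very top to spot it.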
DOM (Document Object Model)
When a web page is loaded in the browser, the Document Object Model (DOM) is created based on that page’s HTML or XML content, resulting in a tree-like structure of nodes. Each node represents a part of the document, such as an element, attribute, or text content.
Imagine the DOM as a family tree, where every family member is a part of the tree but in different positions (parents, children, siblings, etc.). Similarly, elements in the DOM have these relationships, and this structure allows JavaScript to interact with, modify, and manipulate web pages in real time.
Pro Tip: For beginners, a good way to visualize the DOM is to right-click on a webpage and select "Inspect" or "Inspect Element." This will open your browser’s Developer Tools, and you can see the DOM in the "Elements" tab. Here, you can explore how HTML elements are nested and how they relate to each other.
JavaScript can manipulate the DOM, which means it can change the document structure, alter styles, and modify content. This is how web pages become interactive and dynamic. For example, when you fill out a form on a website and see instant validation feedback, that's the DOM in action.
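To make the family-tree picture concrete, here is a small sketch that prints an indented outline of a page's elements, parents first and children underneath. It is an illustration under a few assumptions: the function name outlineTree and the maxDepth limit are my own choices, and sampleTree is a made-up stand-in so the code also runs where no real page is available.

```javascript
// Return an indented outline of an element and its descendants.
// Works on any object exposing `tagName` and `children`, which real DOM
// elements do; `maxDepth` keeps the output short on large pages.
function outlineTree(node, maxDepth = 2, depth = 0) {
  const lines = ['  '.repeat(depth) + node.tagName.toLowerCase()];
  if (depth < maxDepth) {
    for (const child of Array.from(node.children)) {
      lines.push(...outlineTree(child, maxDepth, depth + 1));
    }
  }
  return lines;
}

// Stand-in tree so the sketch also runs outside a browser (made-up data).
const sampleTree = {
  tagName: 'HTML',
  children: [
    { tagName: 'HEAD', children: [{ tagName: 'TITLE', children: [] }] },
    { tagName: 'BODY', children: [{ tagName: 'H1', children: [] }] },
  ],
};

// In a browser console, start from the real <html> element instead.
const root = typeof document !== 'undefined' ? document.documentElement : sampleTree;
console.log(outlineTree(root).join('\n'));
```

Each level of indentation is one generation in the tree: html is the parent of head and body, which are siblings, and so on. Raising maxDepth shows more generations.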
Continuing from our initial exploration of JavaScript in the browser console, let's delve deeper and create something a bit more complex—a table. This exercise will further your understanding of the Document Object Model (DOM) and how JavaScript can be used to dynamically generate HTML content.
Practice 2: Dynamically Creating a Table with JavaScript
Now that you're familiar with executing basic JavaScript commands in the console, let's create a dynamic table that displays data. This example will help you understand how to manipulate the DOM to add new elements to a webpage.
Note: If you don't feel very comfortable with some of these terms yet, that is ok and expected! We will get you up and running by extracting some data, then come back in later articles to flesh out the concepts we are glossing over.
1. Prepare the JavaScript Code: We'll start by defining the structure and data for our table. The code below creates a table element, a tbody element (which will contain all our table rows), and an array of data that represents the rows and cells of our table:
// Create table and tbody elements
let table = document.createElement('table');
let tbody = document.createElement('tbody');
// Set table border for visibility
table.border = '1';
// Array of table data
let data = [
  ['Name', 'Age', 'City'],
  ['Alice', '24', 'New York'],
  ['Bob', '30', 'Los Angeles'],
  ['Charlie', '28', 'Chicago'],
  ['Diana', '35', 'Houston']
];
2. Populate the Table: We'll loop through the array to create rows (<tr>) and cells (<td> or <th>). The first row will be treated as the table header:
// Loop through each data row
data.forEach((row, index) => {
  let tr = document.createElement('tr'); // Create a new row
  // Loop through each cell in the row
  row.forEach(cellText => {
    let cell;
    if (index === 0) {
      cell = document.createElement('th'); // Table header cell for the first row
    } else {
      cell = document.createElement('td'); // Standard table cell
    }
    cell.textContent = cellText; // Set cell text content
    tr.appendChild(cell); // Append cell to row
  });
  tbody.appendChild(tr); // Append row to tbody
});
table.appendChild(tbody); // Append tbody to table
3. Display the Table: Finally, we'll add the entire table to the document's body, making it visible on the page:
document.body.appendChild(table); // Append the table to the body
4. Execute the Code: Copy the code from steps 1 through 3 above as one block. Return to the LinkedIn page (refresh it if it still shows "Hello, World!"), open the console again, paste the code into the console, and press Enter.
You should see a table appear on the LinkedIn page with the data we defined. This table demonstrates how JavaScript can dynamically generate and manipulate HTML content based on data.
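Once the step-by-step version makes sense, the same idea can be wrapped in a reusable function. The sketch below is one way to do that, not the only one: the name buildTable, the doc parameter, and the shortened people array are my own assumptions for illustration. Passing the document in as a parameter is a design choice that lets the function be tried out away from a real page.

```javascript
// Build a <table> from an array of rows; the first row becomes the header.
// `doc` defaults to the page's document, but any object with a compatible
// createElement works, which keeps the function easy to test.
function buildTable(data, doc = globalThis.document) {
  const table = doc.createElement('table');
  const tbody = doc.createElement('tbody');
  data.forEach((row, rowIndex) => {
    const tr = doc.createElement('tr');
    row.forEach((cellText) => {
      // Header cells for the first row, standard cells for the rest.
      const cell = doc.createElement(rowIndex === 0 ? 'th' : 'td');
      cell.textContent = cellText;
      tr.appendChild(cell);
    });
    tbody.appendChild(tr);
  });
  table.appendChild(tbody);
  return table;
}

// Sample data in the same shape as the article's array (shortened).
const people = [
  ['Name', 'Age', 'City'],
  ['Alice', '24', 'New York'],
  ['Bob', '30', 'Los Angeles'],
];

if (typeof document !== 'undefined') {
  // In a browser console: build the table and add it to the page.
  const table = buildTable(people);
  table.border = '1'; // border for visibility, as in the steps above
  document.body.appendChild(table);
}
```

Calling buildTable with a different array produces a different table without touching the loop again.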
Selectors
In JavaScript, selectors are used to target and manipulate HTML elements on a webpage. They are patterns that match against elements in the DOM tree, and they play a crucial role in accessing and modifying the content and structure of a web page.
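There are several types of selectors you can use from JavaScript. They follow CSS selector syntax, and the most common are:
Type selectors: match elements by tag name, such as tr or td.
Class selectors: match elements by class, written with a leading dot, such as .header.
ID selectors: match the element with a given id, written with a leading hash, such as #main.
Attribute selectors: match elements by attribute, written in square brackets, such as [href].
Descendant selectors: match elements nested inside others, written with a space, such as table tbody tr.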
By combining these selectors, you can create complex selection patterns, making it easy to target and manipulate elements on your web page precisely.
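To get a feel for how different selector patterns behave on a real page, a small sketch like the following counts how many elements each pattern matches. The class name highlight and the id main are made up for illustration, so on most pages those two will match nothing; countMatches is likewise a hypothetical helper, not a built-in.

```javascript
// A few common CSS selector patterns. The class and ID names here are
// made up for illustration; real pages use their own.
const exampleSelectors = {
  type: 'tr',                            // every <tr> element
  className: '.highlight',               // elements with class="highlight"
  id: '#main',                           // the element with id="main"
  attribute: 'a[href]',                  // <a> elements that have an href
  descendant: 'table tbody tr',          // <tr> inside <tbody> inside <table>
  combined: 'tbody tr:nth-child(3) td',  // cells in the third row of a tbody
};

// Count how many elements each selector matches under `root`.
function countMatches(root, selectors) {
  const counts = {};
  for (const [name, selector] of Object.entries(selectors)) {
    counts[name] = root.querySelectorAll(selector).length;
  }
  return counts;
}

if (typeof document !== 'undefined') {
  // In a browser console: show the counts for the current page.
  console.table(countMatches(document, exampleSelectors));
}
```

A count of zero simply means nothing on the current page matches that pattern, which is a quick way to check a selector before relying on it.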
Practice 3: Selecting and Highlighting Table Data
Now that we have a table filled with data on our LinkedIn page, let's use JavaScript selectors to interact with this table. We'll focus on selecting specific elements within the table and applying some changes to them.
Task: Highlight a Specific Row
Your task is to write a JavaScript command that selects the third row of the table and changes its background color to highlight it. Note that the header row counts here: our Practice 2 code placed it inside <tbody> along with the data rows, so the third row is Bob's.
1. Understand the Structure: Recall that our table is structured with a <tbody> containing several <tr> elements (rows). The first row holds <th> header cells, and each row after it holds multiple <td> cells. We want to target the third <tr> within <tbody>.
2. Open Your Console: Navigate to the LinkedIn page where you previously added the table, and open your browser's console.
3. Use a Selector: We'll use the querySelectorAll method to select all rows (<tr> elements) within the table's body (<tbody>). Then, we'll target the third row by its index.
let rows = document.querySelectorAll('table tbody tr');
4. Target the Third Row: In JavaScript, array indexing starts at 0. So, the third row will be at index 2.
let thirdRow = rows[2];
5. Change the Background Color: Apply a new background color to this row to highlight it.
// A light yellow color for highlighting
thirdRow.style.backgroundColor = '#FFFF99';
6. Execute the Code: Paste the combined commands into your console and press Enter.
let rows = document.querySelectorAll('table tbody tr');
let thirdRow = rows[2];
thirdRow.style.backgroundColor = '#FFFF99';
After running the above commands, the third <tr> in your table's <tbody> (Bob's row, since the header row counts as the first) should now be highlighted with a light yellow background.
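On a real page there may be other tables, and the row you ask for may not exist. The sketch below shows one cautious way to handle both cases. It rests on an assumption worth stating plainly: that the table we appended is the last table on the page, which holds when it was added with document.body.appendChild and nothing was added after it. The name highlightRow is a hypothetical helper.

```javascript
// Highlight one row of a table; returns true if the row existed.
function highlightRow(table, rowIndex, color = '#FFFF99') {
  const rows = table.querySelectorAll('tbody tr');
  const row = rows[rowIndex];
  if (!row) {
    return false; // nothing to highlight at that index
  }
  row.style.backgroundColor = color;
  return true;
}

if (typeof document !== 'undefined') {
  // Assumption: the table we appended is the last <table> on the page.
  const tables = document.querySelectorAll('table');
  const ourTable = tables[tables.length - 1];
  if (ourTable) {
    highlightRow(ourTable, 2); // index 2 is the third row
  } else {
    console.log('No table found. Run Practice 2 first.');
  }
}
```

Checking that the row exists before styling it avoids the "Cannot set properties of undefined" style of error you would otherwise see when the index is out of range.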
Methods
Methods in JavaScript and the DOM are functions attached to objects that perform actions or calculations using that object’s properties. The DOM provides various methods to interact with, manipulate, and retrieve information from HTML elements.
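Here is a list of some common DOM methods, all of which appear in this article:
document.createElement(tagName): creates a new element, such as a table or tr.
parent.appendChild(child): adds an element as the last child of another element.
document.querySelector(selector): returns the first element that matches a selector.
document.querySelectorAll(selector): returns every element that matches a selector.
document.write(text): writes text into the document, as in Practice 1.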
Practice 4: Extracting and Logging Data from Your Table
Now that we've successfully highlighted a specific row in our table (Hint! We actually used a method to complete the last practice), let's take it a step further by extracting data from a particular cell within that row and logging it to the console.
Task: Log the Name from the Third Row
Write a JavaScript command that selects the first cell (<td>) of the third row of the table (counting the header row as the first) and logs the name contained in that cell to the console.
Steps to Follow:
1. Review the Structure: Remember that our table is structured with a <tbody> containing several <tr> elements (rows). The first row holds <th> header cells, and each row after it holds multiple <td> cells. We're interested in the first cell of the third row within <tbody>.
2. Open Your Console: Ensure you're on the LinkedIn page where your table exists, and open your browser's console.
3. Select the Third Row: Use the querySelectorAll method to select all rows within the table's body, and then target the third row by its index.
Don't Forget: In JavaScript, array indexing starts at 0. So, the third row will be at index 2.
let thirdRow = document.querySelectorAll('table tbody tr')[2];
4. Select the First Cell of the Third Row: Now, target the first cell (<td>) of the selected row. Since cells are direct children of rows, you can use querySelector to select the first cell.
let firstNameCell = thirdRow.querySelector('td');
5. Log the Cell's Content: Retrieve the text content of the selected cell and log it to the console.
console.log(firstNameCell.textContent);
6. Execute the Code: Combine and run the commands in your console.
let thirdRow = document.querySelectorAll('table tbody tr')[2];
let firstNameCell = thirdRow.querySelector('td');
console.log(firstNameCell.textContent);
After executing the above code, the console should display the name from the first cell of the third row of your table. With the data from Practice 2, that name is Bob.
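Reading one cell is the core move of a web scraper; reading every cell is the same move in a loop. As a hedged sketch of where this leads, the function below turns a whole table into a list of records keyed by the header row. The name tableToObjects is made up for illustration, and the code assumes the first row holds the column names and that our table is the last one on the page.

```javascript
// Convert a table into an array of objects keyed by the header row.
// Assumes the first row holds the column names, as in our Practice 2 table.
function tableToObjects(table) {
  const rows = Array.from(table.querySelectorAll('tr')).map((tr) =>
    Array.from(tr.querySelectorAll('th, td')).map((cell) => cell.textContent.trim())
  );
  const [header, ...body] = rows;
  if (!header) {
    return []; // empty table
  }
  return body.map((cells) =>
    Object.fromEntries(header.map((name, i) => [name, cells[i]]))
  );
}

if (typeof document !== 'undefined') {
  // Assumption: the table we appended is the last <table> on the page.
  const tables = document.querySelectorAll('table');
  const ourTable = tables[tables.length - 1];
  if (ourTable) {
    console.table(tableToObjects(ourTable));
  }
}
```

With the Practice 2 table, each record would have Name, Age, and City properties, which is a convenient shape for copying into a spreadsheet or saving as JSON.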
Final Remarks
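And now you've done it! Congratulations, you've successfully:
Opened the browser console and displayed "Hello, World!" on LinkedIn.com
Dynamically created a table with JavaScript
Selected and highlighted a table row using selectors
Extracted data from a table cell and logged it to the console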
We breezed through this section, and you didn't even make it look difficult. Pat yourself on the back.
As I mentioned earlier, if you don't feel very comfortable with some of these terms yet, that is anticipated! Now that we have you up and running with a foundation in some introductory topics, I will flesh out some of the concepts we glossed over in future articles.
Let me know in the comments what you struggled with in this tutorial. What topics do you wish I expanded on?