XPath vs CSS selectors: a comparison

When creating a web scraper, one of the first decisions is which type of selector to use.

But what is a selector, and which types can you choose from? Let’s find out together in this article by The Web Scraping Club.

What are selectors?

To gather data with a web scraper, one of the first tasks is to find out where the data we’re interested in sits on the page, and to do this we need selectors.

Basically, a selector is an object that, given a query, returns a portion of a web page. The language we write this query in can be XPath or CSS.

What types of selectors are there?

Regardless of the syntax we use, the structure of the web page determines which types of selectors are available.

  • ID-based selectors, where our target object is defined by a unique ID

<span id="searchToggle" tabindex="0">Search</span>        

  • Attribute-based selectors, where our target object is defined by one of its attributes (a class, in this case).

<span class="search-toggle-icon">Search</span>         

While the first type is the best for identifying our target object, IDs are often not available, so we need to fall back on attribute-based selectors that build a unique path to the desired object.
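
The difference between the two types can be sketched with Python's standard library. This is only an illustration: the page fragment is a hypothetical one assembled from the two snippets above (the third span is an assumption added to show a non-unique class), and xml.etree.ElementTree supports just a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment built from the two snippets above.
# The third <span> is an assumption, added to show that a class can repeat.
page = ET.fromstring(
    '<div>'
    '<span id="searchToggle" tabindex="0">Search</span>'
    '<span class="search-toggle-icon">Search</span>'
    '<span class="search-toggle-icon">Close</span>'
    '</div>'
)

# ID-based selector: an ID should identify a single element.
by_id = page.findall(".//span[@id='searchToggle']")

# Attribute-based selector: every element sharing the class matches.
by_class = page.findall(".//span[@class='search-toggle-icon']")

id_count = len(by_id)
class_texts = [element.text for element in by_class]
print(id_count, class_texts)
```

Since the class-based query returns more than one element, it usually needs a more specific path before it points to a single target.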

How to choose a good selector?

There are some best practices to follow when choosing a selector in a web scraping project:

  • The selector should define a unique and unambiguous path to the target element or group of elements.
  • It should be clear which element the selector refers to without having to look it up in the page source.
  • In each project, especially larger ones where more people are involved, only one type of selector (XPath or CSS) should be used across all scrapers.
  • The selector should be as generic as possible while remaining accurate, so that it keeps working if the website changes.

How do XPATH selectors work?

XPath (XML Path Language) is a query language for locating nodes in an XML document. Since an HTML page is parsed into a similar tree of nodes, we can also use XPath to locate elements in web pages.

Each element has a path from the root of the document to the element itself, so by navigating the DOM of the page we can point exactly to our target.

XPATH expressions are composed of three parts:

  • Axis
  • Node tests
  • Predicates

Axis

The axis defines the direction in which the expression navigates the tree. The most common are ‘//’, which descends through the tree to any depth, and ‘..’, which refers to the parent node.

Node tests

Node tests may consist of specific node names or more general expressions, for example text(), which we use when we need to extract the text of a node.

Predicates

Predicates can be seen as filters: when we specify in the following expression that we need only the items belonging to the class “class1”, we are using a predicate.

string = xpath('//a[@class="class1"]/text()')

Here we used the axis ‘//’, the node test ‘a’, the predicate ‘[@class="class1"]’, and the node test ‘text()’.
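
A runnable approximation of this expression, using only Python's standard library, might look like the sketch below. The sample markup is hypothetical. Note that xml.etree.ElementTree implements only a subset of XPath: it does not support the text() node test, so the text is read from the `.text` attribute of each matched element instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, used only to illustrate the three parts of the expression.
doc = ET.fromstring(
    '<body>'
    '<a class="class1" href="/first">First link</a>'
    '<a class="class2" href="/second">Second link</a>'
    '<p><a class="class1" href="/third">Third link</a></p>'
    '</body>'
)

# './/'                descends through the tree (axis)
# 'a'                  keeps only <a> elements (node test)
# "[@class='class1']"  filters on the class attribute (predicate)
links = doc.findall(".//a[@class='class1']")

# ElementTree has no text() node test, so we read .text on each element.
texts = [link.text for link in links]
print(texts)
```

The link with class “class2” is filtered out by the predicate, while the one nested inside the paragraph is still found because ‘//’ descends to any depth.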

For a deeper understanding, there’s a great Wikipedia page about it, where the general concepts of XPath are explained regardless of the programming language in which it is used.

How do CSS selectors work?

Most HTML pages are styled with CSS, and the same selectors that CSS uses to apply styles can be used to locate DOM elements.

CSS selectors can be divided into these basic types:

  • Universal selectors, which select all the elements on a page.
  • Type selectors, which select all elements that have the given node name.
  • Class selectors, which select all elements that have the given class attribute.
  • ID selectors, which select an element based on the value of its id attribute. There should be only one element with a given ID in a document.
  • Attribute selectors, which select all elements that have the given attribute.
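
To make the five types concrete, here is a toy matcher written for this article. It is not a real CSS engine: the `matches` and `select` helpers and the sample markup are hypothetical, and it handles only a single simple selector at a time (no combinators, no attribute values). In a real scraper this job is done by the selector library of your choice.

```python
import xml.etree.ElementTree as ET

def matches(element, selector):
    """Check one element against a single simple CSS selector (toy version)."""
    if selector == "*":                                      # universal selector
        return True
    if selector.startswith("#"):                             # ID selector
        return element.get("id") == selector[1:]
    if selector.startswith("."):                             # class selector
        return selector[1:] in element.get("class", "").split()
    if selector.startswith("[") and selector.endswith("]"):  # attribute selector
        return selector[1:-1] in element.attrib
    return element.tag == selector                           # type selector

def select(root, selector):
    """Return every element in the tree (root included) matching the selector."""
    return [element for element in root.iter() if matches(element, selector)]

# Hypothetical page fragment.
page = ET.fromstring(
    '<div id="main">'
    '<span id="searchToggle" tabindex="0">Search</span>'
    '<span class="search-toggle-icon active">Search</span>'
    '<a href="/news">News</a>'
    '</div>'
)

counts = {s: len(select(page, s)) for s in ["*", "span", ".active", "#searchToggle", "[href]"]}
print(counts)
```

The class check splits the attribute on whitespace because an element can carry several classes at once, which is how CSS class selectors behave.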

A practical comparison

Let’s compare the syntax of XPath and CSS selectors on the same page, in this case the Hacker News home page. Let’s say we want to locate the title of the top-ranked story.

Let’s open the browser’s developer tools and, in the Elements tab, right-click on the first title and select “Copy → Copy selector”.

Hacker news page

Here’s the result for the CSS selector

td:nth-child(3) > span > a        

While for XPATH we have

//td[3]/span/a        

In this case, the syntax is pretty similar and clear for both of them.
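
The XPath version can be tried offline with Python's standard library. The table row below is a simplified, hypothetical stand-in for a Hacker News row (the class names, URL and title are assumptions, not the real markup), and the leading ‘.’ is needed because xml.etree.ElementTree only accepts relative paths.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for one row of the Hacker News table.
table = ET.fromstring(
    '<table><tr>'
    '<td class="title"><span class="rank">1.</span></td>'
    '<td class="votelinks"></td>'
    '<td class="title"><span class="titleline">'
    '<a href="https://example.com/story">Example story title</a>'
    '</span></td>'
    '</tr></table>'
)

# Same path as //td[3]/span/a, written relative to the parsed root.
link = table.find(".//td[3]/span/a")

title = link.text
href = link.get("href")
print(title, href)
```

One subtle difference: ‘td[3]’ means the third td among its td siblings, while ‘td:nth-child(3)’ means a td that is the third child of its parent, whatever the type of the other children. In a row made only of td cells the two coincide.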

Here’s a syntax comparison table between the two, extracted from the Slotix GitHub repository.

Xpath and CSS comparison

Final remarks

Which one to use between XPATH and CSS selectors?

There’s no easy answer here: it depends on the habits and coding style of each person.

I’ve read on several websites that CSS selectors are faster but, unless you’re building an ultra-high-frequency scraper, this should not matter.

The main difference between the two is that XPath is a little more flexible, since you can navigate the tree in both directions (down to descendants and up to parent items), while CSS selectors are generally limited to moving down the tree.

I’m personally a big fan of XPath, as I usually find its expressions clearer than CSS ones, but it’s mostly a matter of personal taste. I hope this article helps you choose which one to adopt in your scraping projects.
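
Moving upward with ‘..’ can be sketched as follows, again with Python's standard library and a hypothetical product list (the class names and item names are assumptions). The goal is to find the list items that contain a “sold-out” marker, which means starting from the marker and stepping back to its parent.

```python
import xml.etree.ElementTree as ET

# Hypothetical product list: only the first item carries a "sold-out" marker.
catalog = ET.fromstring(
    '<ul>'
    '<li><span class="name">Item A</span><span class="sold-out">Sold out</span></li>'
    '<li><span class="name">Item B</span></li>'
    '</ul>'
)

# Find the marker, then step up to its parent <li> with '..'.
sold_out_items = catalog.findall(".//span[@class='sold-out']/..")

# From each parent, go back down to the sibling that holds the name.
names = [item.find("span[@class='name']").text for item in sold_out_items]
print(names)
```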
