XPath vs CSS selectors: a comparison

Pierluigi Vinciguerra

Co Founder and CTO at Databoutique.com | Writing on The Web Scraping Club

Published: April 2, 2023

When creating a web scraper, one of the first decisions is to choose which type of selector to use.

But what is a selector, and which types can you choose from? Let’s find out together in this article by The Web Scraping Club.

What are selectors?

To gather data with a web scraper, one of the first tasks is to find out where the data we’re interested in sits in the page, and to do this we need selectors.

Basically, a selector is an object that, given a query, returns a portion of a web page. The language we write this query in can be XPath or CSS.

What types of selectors are there?

Regardless of the syntax we use, the structure of the web page determines which types of selectors are available.

ID-based selectors, where our target object is defined by a unique ID

<span id="searchToggle" tabindex="0">Search</span>

Attribute-based selectors, where our target object is defined by its attribute (like a class of objects in this case).

<span class="search-toggle-icon">Search</span>

While ID-based selectors are the best way to pin down our target object, IDs are often not available, so we need to fall back on attribute-based selectors that describe a unique path to the desired object.
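As a rough illustration of the two kinds of selector, the sketch below runs both against the two span snippets above, using the XPath subset built into Python's standard xml.etree.ElementTree module. The wrapping div and the variable names are assumptions made for the example; real pages are rarely well-formed XML, so a scraper would normally rely on an HTML-aware parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment built from the two <span> examples above.
PAGE = """
<div>
  <span id="searchToggle" tabindex="0">Search</span>
  <span class="search-toggle-icon">Search</span>
</div>
"""

root = ET.fromstring(PAGE.strip())

# ID-based selector: the id is expected to be unique in the document.
by_id = root.findall(".//span[@id='searchToggle']")

# Attribute-based selector: spans whose class attribute is exactly this value.
by_class = root.findall(".//span[@class='search-toggle-icon']")

print(len(by_id), by_id[0].get("tabindex"))
print(len(by_class), by_class[0].text)
```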

How to choose a good selector?

There are some best practices to use when choosing a selector in our web scraping project:

The selector should determine a unique and unambiguous path to the target element or group of elements.
It should be clear which element the selector refers to without examining the page's code.
In our projects, especially larger ones where more people are involved, only one type of selector (XPath or CSS) should be used across all scrapers.
The selector should be as generic as possible while remaining accurate, so that it keeps working if the website changes.
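To make the last point concrete, here is a minimal sketch comparing a position-based selector with an attribute-based one before and after a layout change. The two page snippets, the price id and the extract helper are all hypothetical, and the example uses the XPath subset of Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before and after the site adds a banner at the top.
OLD_LAYOUT = "<html><body><div>menu</div><div><span id='price'>9.99</span></div></body></html>"
NEW_LAYOUT = ("<html><body><div>banner</div><div>menu</div>"
              "<div><span id='price'>9.99</span></div></body></html>")

BRITTLE = "./body/div[2]/span"    # depends on the exact position of the div
ROBUST = ".//span[@id='price']"   # depends only on the element's own attribute


def extract(markup, path):
    """Return the text of the first node matching path, or None if nothing matches."""
    node = ET.fromstring(markup).find(path)
    return None if node is None else node.text


for name, path in (("brittle", BRITTLE), ("robust", ROBUST)):
    print(name, extract(OLD_LAYOUT, path), extract(NEW_LAYOUT, path))
```

The positional selector stops matching as soon as a new div shifts the layout, while the attribute-based one keeps pointing at the same element.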

How do XPath selectors work?

XPath (XML Path Language) is a query language for locating nodes in an XML document. Since an HTML page is parsed into a similar tree of nodes, we can use XPath to locate elements in web pages as well.

Each element has a path from the root of the document to the element itself, so by navigating the DOM of the page we can point exactly to our target.

Each step of an XPath expression can be composed of three parts:

Axis
Node tests
Predicates

Axis

The axis refers to the navigation direction of the expression. The most common ones are ‘//’, which states that we’re descending the tree, and ‘..’, which refers to the parent node.
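A small sketch of both directions, using Python's standard xml.etree.ElementTree module and a made-up table. ElementTree only implements a subset of XPath and does not accept a leading ‘//’ on an element, so the expressions start with ‘.//’ (descend from the current node) instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two rows, one link in each.
DOC = """
<table>
  <tr><td><a>one</a></td><td>no link</td></tr>
  <tr><td><a>two</a></td></tr>
</table>
"""
root = ET.fromstring(DOC.strip())

# './/' descends: every <a> at any depth below the root.
links = root.findall(".//a")

# '..' steps back up: the parent element of each matched <a>.
parents = root.findall(".//a/..")

print([a.text for a in links])
print([p.tag for p in parents])
```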

Node tests

Node tests may consist of specific node names or of more general expressions, for example text(), which we use when we need to extract the text of a node.

Predicates

Predicates can be seen as filters. In the following expression, where we specify that we need only the items belonging to the class “class1”, we are using a predicate.

string = xpath('//a[@class="class1"]/text()')

Here we used all three parts: the axis ‘//’, the predicate ‘[@class="class1"]’, and the node tests ‘a’ and ‘text()’.
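For a version that can be run as-is, here is a minimal sketch with Python's standard xml.etree.ElementTree module. The list markup is invented for the example. ElementTree's XPath subset has no text() step, so the axis, node test and predicate go into the query and the text is then read from each matched element; fuller XPath engines accept the complete expression shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with links of two different classes.
PAGE = """
<ul>
  <li><a class="class1" href="/a">First</a></li>
  <li><a class="class2" href="/b">Second</a></li>
  <li><a class="class1" href="/c">Third</a></li>
</ul>
"""
root = ET.fromstring(PAGE.strip())

# Axis './/' + node test 'a' + predicate [@class='class1'].
matches = root.findall(".//a[@class='class1']")

# No text() step in this XPath subset: read the text from each element instead.
texts = [a.text for a in matches]
print(texts)
```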

For a deeper understanding, there’s a great Wikipedia page about it, where the general concepts of XPath are explained regardless of the programming language in which it is used.

How do CSS selectors work?

Most HTML pages are styled with CSS, and the same syntax that style rules use to target elements can be used to locate DOM elements.

Just as with XPath selectors, they can be divided into several types:

Universal selectors, which select all the elements on a page.
Type selectors, which select all elements that have the given node name.
Class selectors, which select all elements that have the given class attribute.
ID selectors, which select an element based on the value of its id attribute. There should be only one element with a given ID in a document.
Attribute selectors, which select all elements that have the given attribute.
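Python's standard library has no CSS selector engine, so the sketch below does not run CSS directly. Instead it pairs one example of each selector type with a rough XPath counterpart and evaluates that with xml.etree.ElementTree. The page fragment and the mapping are assumptions made for illustration, and the class mapping is a simplification: it only matches when the class attribute is exactly "item", whereas a real CSS class selector also matches elements carrying several classes.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment.
PAGE = """
<body>
  <div id="main">
    <p class="item">first</p>
    <p class="item">second</p>
    <a href="/next">next</a>
  </div>
</body>
"""
root = ET.fromstring(PAGE.strip())

# CSS selector -> rough XPath counterpart (ElementTree subset).
CSS_TO_XPATH = {
    "*": ".//*",                   # universal: every element below the root
    "p": ".//p",                   # type: by node name
    ".item": ".//*[@class='item']",  # class: exact attribute match only
    "#main": ".//*[@id='main']",   # ID: by the value of the id attribute
    "[href]": ".//*[@href]",       # attribute: any element that has href
}

counts = {css: len(root.findall(xpath)) for css, xpath in CSS_TO_XPATH.items()}
for css, n in counts.items():
    print(f"{css:8} matches {n} element(s)")
```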

A practical comparison

Let’s compare the syntax of XPath and CSS selectors on the same page, in this case the Hacker News home page. Let’s say we want to locate the title of the top-ranked story.

Let’s open the browser’s developer tools and, in the Elements tab, right-click on the first title and select “Copy → Copy selector”.

Here’s the result for the CSS selector:

td:nth-child(3) > span > a

While for XPath we have:

//td[3]/span/a

In this case, the syntax is pretty similar and clear for both of them.
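The XPath version can be tried offline with Python's standard xml.etree.ElementTree module. The row below is a simplified, invented stand-in for a Hacker News entry, not the site's real markup, and the expression starts with ‘.//’ because ElementTree does not accept a leading ‘//’ on an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified row: rank, vote cell, then the title cell.
ROW = """
<table>
  <tr>
    <td>1.</td>
    <td>vote</td>
    <td><span><a href="https://example.com/story">Example story title</a></span></td>
  </tr>
</table>
"""
root = ET.fromstring(ROW.strip())

# Third <td> of the row, then its <span>, then the link inside it.
first_title = root.find(".//td[3]/span/a")
print(first_title.text, first_title.get("href"))
```

One subtle difference: td[3] counts only td siblings, while td:nth-child(3) counts every sibling element. The two agree here because the row contains nothing but td cells.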

Here’s a syntax comparison table between the two, extracted from the Slotix GitHub repository.

[Image: XPath and CSS syntax comparison table]

Final remarks

Which one should you use, XPath or CSS selectors?

There’s no easy answer here: it depends on the habits and coding style of each person.

I’ve read on several websites that CSS selectors are faster but, unless you’re building an ultra-high-frequency scraper, this should not matter.

The main difference between the two is that XPath is a little more flexible, since you can navigate in both directions (down to descendants and up to parent items), while CSS selectors traditionally only move down towards descendants.

I’m personally a big fan of XPath, as I usually find its expressions clearer than CSS, but it’s mostly a matter of personal taste. I hope this article helped you choose which one to adopt in your scraping projects.
