Sitecore Search Multi-language Web Crawler (Advanced). How to extract all content text for search.

Yauheni Tryputsko

Lead .Net and Sitecore developer, Sitecore Technology MVP 2025

Published: January 2, 2025

In this article, I will guide you through the process of creating an advanced Web Crawler capable of handling multiple languages in Sitecore Search. Additionally, I’ll demonstrate how to extract all text content from <p> tags and utilize it effectively for search functionality.

Let’s get started! To create an advanced Web Crawler, begin by navigating to the Source page and clicking Add Sources. For this article, select Web Crawler (Advanced) as the source type. This setup will allow us to extract all the necessary information from our pages and leverage it for content search.

After creating the Web Crawler, the first step is to configure its settings. The key setting to focus on is Max URLs. For this example, I’ve set the value to 1000, which corresponds to the approximate number of pages on my site. Be sure to adjust this value based on the total number of pages you need to crawl.

Once the Web Crawler settings are configured, the next step is to select a Trigger Type and configure the trigger. Sitecore Search offers five trigger types: JS, RSS, Request, Sitemap, and Sitemap Index. For my Web Crawler, I chose Sitemap as the trigger type. I find this to be a straightforward option for retrieving all pages without the need for complex request logic.

The next important setting is Available Locales. Since my site supports two languages—en (English) and ar-sa (Arabic-Saudi Arabia)—I selected en_us and ar_sa for this configuration. If your desired languages are not already available in Sitecore Search, you can add them by navigating to Administrative -> Domain Setting -> Locale. It’s worth noting that my site uses English without a specified region, but Sitecore Search requires languages to include a region code. In the following steps, I’ll explain how to map Sitecore languages to the corresponding Sitecore Search languages.

When your Web Crawler supports two or more languages, Sitecore Search provides an additional setting called Locale Extractor. This setting allows you to configure the logic that determines which content is associated with each language. The Locale Extractor feature supports three types: URL, Header, and JS. Since I’m using the Sitemap trigger type to crawl all pages from the site map and extract data directly from the HTML of those pages, the most suitable locale types are JS or URL. I chose JS, as it provides the flexibility needed to handle the differences between my Sitecore languages and the languages defined in Sitecore Search.

The following code maps the lang attribute of the <html> tag to a Sitecore Search locale:

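function extract(request, response) {
    // response.body exposes the page through a jQuery-style selector API
    const $ = response.body;
    let langValue = $('html').attr('lang');

    // The site uses English without a region; map it to the Sitecore Search locale
    if (langValue === 'en') {
        langValue = 'en_US';
    }

    return langValue;
}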

As you can see, the logic includes one condition: if langValue is en, it is mapped to en_US, which corresponds to the en_us locale selected in Sitecore Search. Any other value is returned unchanged. This ensures that the language code from the HTML tag aligns with the format expected by Sitecore Search.
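The same mapping idea can be generalized when a site has more than two languages. The sketch below is not part of the original setup: normalizeLocale and DEFAULT_REGIONS are hypothetical names, and it assumes that Sitecore Search locale keys follow the lowercase language_region form of the locales selected earlier (en_us, ar_sa). If your domain expects different casing, such as the en_US value returned by the code above, adjust the output accordingly.

```javascript
// Hypothetical map of region-less languages to a default region.
// List only the languages your own site uses.
const DEFAULT_REGIONS = { en: 'us' };

// Hypothetical helper: convert an HTML lang value (e.g. 'en', 'ar-SA')
// into a lowercase language_region string (e.g. 'en_us', 'ar_sa').
function normalizeLocale(lang) {
    const value = (lang || '').trim().toLowerCase();
    if (!value) {
        return null; // the page has no usable lang attribute
    }
    // 'ar-sa' -> ['ar', 'sa']; 'en' -> ['en']
    const [language, region] = value.split(/[-_]/);
    const resolvedRegion = region || DEFAULT_REGIONS[language];
    // Fall back to the bare language code when no region is known.
    return resolvedRegion ? `${language}_${resolvedRegion}` : language;
}

// Possible use inside a Locale Extractor (assumes the same extract signature
// as the code above):
// function extract(request, response) {
//     const $ = response.body;
//     return normalizeLocale($('html').attr('lang'));
// }
```

Keeping the mapping in a small pure function makes it easy to check outside Sitecore Search before pasting it into the extractor.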

At this stage, the Web Crawler retrieves pages from the sitemap and associates their content with the correct locales. The final step is to extract the page data (HTML content) and map it to Sitecore Search attributes, which is done in the Document Extractor. The Document Extractor supports three types: CSS, XPath, and JS. In most cases, XPath is the preferred choice because of its simplicity: you map XPath expressions to the available attributes without writing any code.

However, I encountered a limitation with XPath. Sitecore pages are composed of multiple components, each of which may contain text. For a comprehensive search that covers all of that text, it is ideal to create a single attribute (e.g., Full Text) that stores all textual content, such as the text from every <p> tag. XPath supports queries that select all <p> tags, but the returned result is an array, and Sitecore Search's XPath extractor processes only the first element of that array. I also tried creating the Full Text attribute as an array, but this did not resolve the issue.

To overcome this, I switched the Document Extractor to the JS type. This allows me to write custom logic that retrieves all <p> tags and combines their text content into a single string attribute. Below is an example of the JavaScript logic used to achieve this:

function extract(request, response) {
    // response.body exposes the page through a jQuery-style selector API
    const $ = response.body;

    return [{
        'id': $('meta[name="page-id"]').attr('content'),
        'name': $('title').text(),
        'url': $('link[rel="canonical"]').attr('href'),
        'type': $('meta[name="page-type"]').attr('content'),
        'title': $('meta[property="og:title"]').attr('content'),
        'description': $('meta[property="og:description"]').attr('content'),
        // Collect the text of every <p> tag and join it into a single string
        'full_text': $('p').map(function () {
            return $(this).text().trim();
        }).get().join(' ')
    }];
}
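Joining raw paragraph text can leave line breaks, empty fragments, and repeated blocks (for example, the same teaser rendered by several components) in the Full Text attribute. The sketch below shows one way to clean the fragments before joining them. buildFullText is a hypothetical helper, not part of Sitecore Search, and dropping duplicate paragraphs is a design choice of this sketch rather than something the original setup does.

```javascript
// Hypothetical helper: merge text fragments into one search-friendly string.
function buildFullText(fragments) {
    const seen = new Set();
    const cleaned = [];
    for (const fragment of fragments) {
        // Collapse runs of whitespace (spaces, newlines, non-breaking spaces).
        const text = String(fragment).replace(/\s+/g, ' ').trim();
        // Skip empty fragments and paragraphs that were already added.
        if (text && !seen.has(text)) {
            seen.add(text);
            cleaned.push(text);
        }
    }
    return cleaned.join(' ');
}

// Possible use inside the Document Extractor (same selectors as above):
// 'full_text': buildFullText($('p').map(function () {
//     return $(this).text();
// }).get())
```

If repeated paragraphs should count toward relevance on your site, remove the seen check and keep only the whitespace cleanup.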

One important detail to keep in mind for multi-language setups is the Localized checkbox. Don’t forget to enable it, as this ensures that the extracted data is correctly associated with the appropriate language.

Now you have a clear understanding of how to create an Advanced Web Crawler for multiple languages and how to extract text content from specific HTML tags. I hope this guide helps you effectively utilize Sitecore Search for your projects.
