Sitecore Search Multi-language Web Crawler (Advanced). How to extract all content text for search.
In this article, I will guide you through the process of creating an advanced Web Crawler capable of handling multiple languages in Sitecore Search. Additionally, I’ll demonstrate how to extract all text content from <p> tags and utilize it effectively for search functionality.
Let’s get started! To create an advanced Web Crawler, begin by navigating to the Source page and clicking Add Sources. For this article, select Web Crawler (Advanced) as the source type. This setup will allow us to extract all the necessary information from our pages and leverage it for content search.
After creating the Web Crawler, the first step is to configure its settings. The key setting to focus on is Max URLs. For this example, I’ve set the value to 1000, which corresponds to the approximate number of pages on my site. Be sure to adjust this value based on the total number of pages you need to crawl.
Once the Web Crawler settings are configured, the next step is to select a Trigger Type and configure the trigger. Sitecore Search offers five trigger types: JS, RSS, Request, Sitemap, and Sitemap Index. For my Web Crawler, I chose Sitemap as the trigger type. I find this to be a straightforward option for retrieving all pages without the need for complex request logic.
The next important setting is Available Locales. Since my site supports two languages, en (English) and ar-sa (Arabic, Saudi Arabia), I selected en_us and ar_sa for this configuration. If your desired languages are not already available in Sitecore Search, you can add them by navigating to Administrative -> Domain Setting -> Locale. It’s worth noting that my site uses English without a specified region, but Sitecore Search requires languages to include a region code. In the following steps, I’ll explain how to map Sitecore languages to the corresponding Sitecore Search locales.
When your Web Crawler supports two or more languages, Sitecore Search provides an additional setting called Locale Extractor. This setting allows you to configure the logic that determines which content is associated with each language. The Locale Extractor feature supports three types: URL, Header, and JS. Since I’m using the Sitemap trigger type to crawl all pages from the site map and extract data directly from the HTML of those pages, the most suitable locale types are JS or URL. I chose JS, as it provides the flexibility needed to handle the differences between my Sitecore languages and the languages defined in Sitecore Search.
The following code maps the lang attribute of the <html> tag to a Sitecore Search locale:
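function extract(request, response) {
    $ = response.body;
    // Read the language from the lang attribute of the <html> tag
    let langValue = $('html').attr('lang');
    // The site uses "en" without a region; map it to the Sitecore Search locale
    if (langValue === 'en') {
        langValue = 'en_us';
    }
    return langValue;
}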
As you can see, my logic includes a condition: if the langValue is en, it will be mapped to en_us (the Sitecore Search locale). This ensures that the language code from the HTML tag aligns with the format expected by Sitecore Search.
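If your site has more languages, or its lang values use a different casing or separator (for example ar-SA), the same idea can be generalized. The following is only a minimal sketch, assuming that Sitecore Search locales are lowercase language_region codes such as en_us and ar_sa. The names toSearchLocale and DEFAULT_REGIONS and the en_us fallback are hypothetical choices of mine, not part of the Sitecore Search API:

```javascript
// Hypothetical lookup: region to assume when the page's lang has no region.
const DEFAULT_REGIONS = { en: 'us', ar: 'sa' };

// Convert an HTML lang value ("en", "ar-SA") into a locale in the assumed
// Sitecore Search format ("en_us", "ar_sa"). Returns null when no locale
// can be derived from the input.
function toSearchLocale(lang) {
    if (!lang) {
        return null;
    }
    const parts = lang.trim().toLowerCase().split(/[-_]/);
    const language = parts[0];
    const region = parts[1] || DEFAULT_REGIONS[language];
    return region ? language + '_' + region : null;
}

function extract(request, response) {
    const $ = response.body;
    // Assumed default: fall back to en_us when the page has no usable lang.
    return toSearchLocale($('html').attr('lang')) || 'en_us';
}
```

With this sketch, adding a language means adding one entry to the lookup table instead of another if statement. Language tags with a script subtag (such as zh-Hans-CN) are not handled and would need extra logic.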
At this stage, the Web Crawler retrieves pages from the Sitemap and associates their content with the correct locales. The final step is to extract the page data (HTML content) and map it to Sitecore Search attributes by configuring the Document Extractor. The Document Extractor supports three types: CSS, XPath, and JS. In most cases, XPath is the preferred choice due to its simplicity: you map XPath expressions to the available attributes without writing any code.
However, I encountered a limitation with XPath. As you may know, Sitecore pages are composed of multiple components, each potentially containing text. For a comprehensive search that covers all of these components, it’s ideal to create a single attribute (e.g., Full Text) that stores all textual content, such as the text from <p> tags. An XPath query can select all <p> tags, but the result is an array, and Sitecore Search’s XPath extractor only processes the first element of that array. I also tried defining the Full Text attribute as an array, but this didn’t resolve the issue.
To overcome this, I switched to the JS type for the Document Extractor. This allows me to write custom logic that retrieves all <p> tags and combines their text content into a single string attribute. Below is an example of the JavaScript logic used to achieve this:
function extract(request, response) {
    $ = response.body;
    return [{
        'id': $('meta[name="page-id"]').attr('content'),
        'name': $('title').text(),
        'url': $('link[rel="canonical"]').attr('href'),
        'type': $('meta[name="page-type"]').attr('content'),
        'title': $('meta[property="og:title"]').attr('content'),
        'description': $('meta[property="og:description"]').attr('content'),
        // Collect the text of every <p> tag and join it into a single string
        'full_text': $('p').map(function() {
            return $(this).text().trim();
        }).get().join(' ')
    }];
}
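On real pages, some <p> tags are empty or contain line breaks, and a meta tag may be missing, so it can help to clean the values before returning them. Here is a minimal sketch of that idea. The helpers joinText and firstNonEmpty are hypothetical names I introduce for illustration, and falling back from og:title to the <title> tag is my assumption about a reasonable default, not something Sitecore Search requires:

```javascript
// Hypothetical helper: collapse whitespace inside each piece, trim it,
// drop empty pieces, and join the rest with single spaces.
function joinText(parts) {
    return parts
        .map(function (part) {
            return String(part == null ? '' : part).replace(/\s+/g, ' ').trim();
        })
        .filter(function (part) {
            return part.length > 0;
        })
        .join(' ');
}

// Hypothetical helper: return the first argument that is a non-blank string,
// trimmed, or an empty string when there is none.
function firstNonEmpty(...values) {
    for (const value of values) {
        if (typeof value === 'string' && value.trim() !== '') {
            return value.trim();
        }
    }
    return '';
}

function extract(request, response) {
    const $ = response.body;
    const paragraphs = $('p').map(function () {
        return $(this).text();
    }).get();
    return [{
        // ...other attributes as in the extractor above...
        'title': firstNonEmpty(
            $('meta[property="og:title"]').attr('content'),
            $('title').text()
        ),
        'full_text': joinText(paragraphs)
    }];
}
```

The helpers are plain functions, so you can test them outside Sitecore Search before pasting them into the Document Extractor.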
One important detail to keep in mind for multi-language setups is the Localized checkbox. Don’t forget to enable it, as this ensures that the extracted data is correctly associated with the appropriate language.
Now you have a clear understanding of how to create an Advanced Web Crawler for multiple languages and how to extract text content from specific HTML tags. I hope this guide helps you effectively utilize Sitecore Search for your projects.