Managing bot indexing with OAuth2 applications

The title is quite generic; however, the article refers to a Java application that implements Spring Security to handle OAuth2, fronted by a web server.

For a request on a new session, the application will initially redirect to the OAuth2 server, and then redirect back to the requested page. This scenario, involving multiple redirects, does not sit well with bots. The drop in pages being indexed, based on data from the Google Search Console, made it quite clear that the process needed amendment.

In fact, there were several other scenarios where a redirect would occur, and these also resulted in indexing failures.

So, there were several steps to be implemented. Step 1 was to ensure that any naturally occurring redirects would not be executed for indexing bots. In some respects, OAuth2 also acts as a natural deterrent against unwanted bots, i.e. without having to explicitly exclude or handle unwanted bots, it is possible to manage the allowed ones. Where the bot User-Agent is one of the allowed bots, a session variable is created, and that session value is used to process the request in a particular manner. The application could then filter the request to ensure a minimal level of processing, with redirects prevented, i.e. do just enough to allow the page to be indexed. A sketch of such a filter is shown below.
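
By way of illustration, here is a minimal sketch of that idea, assuming a Spring Boot 3 application (on Spring Boot 2 the jakarta.servlet imports would be javax.servlet instead). The class name, the bot list and the allowedBot session attribute are hypothetical, not the article's actual implementation.

  // Minimal sketch of Step 1: flag requests from allowed indexing bots in the session.
  import java.io.IOException;
  import java.util.List;

  import jakarta.servlet.FilterChain;
  import jakarta.servlet.ServletException;
  import jakarta.servlet.http.HttpServletRequest;
  import jakarta.servlet.http.HttpServletResponse;

  import org.springframework.stereotype.Component;
  import org.springframework.web.filter.OncePerRequestFilter;

  @Component
  public class AllowedBotFilter extends OncePerRequestFilter {

      // Example allow-list; substitute the bots that should be able to index the site
      private static final List<String> ALLOWED_BOTS =
              List.of("Googlebot", "Google-InspectionTool", "bingbot");

      @Override
      protected void doFilterInternal(HttpServletRequest request,
                                      HttpServletResponse response,
                                      FilterChain chain) throws ServletException, IOException {
          String userAgent = request.getHeader("User-Agent");
          if (userAgent != null && ALLOWED_BOTS.stream().anyMatch(userAgent::contains)) {
              // Flag the session so downstream processing can stay minimal
              // and avoid issuing redirects for this request
              request.getSession(true).setAttribute("allowedBot", Boolean.TRUE);
          }
          chain.doFilter(request, response);
      }
  }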

Step 2 required the bot indexing request to bypass any OAuth2 handling. With Spring Boot applications, it is possible to map multiple URLs, i.e. variations on a URL, to the same resource. One of these URLs, used for bots, is set to be ignored by OAuth2, ensuring that there is no authentication redirect. There was also some additional Spring Security configuration implemented to support this, along the lines of the sketch below.
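
The sketch below assumes Spring Security 6 style configuration and a hypothetical /bot/** prefix for the bot-indexing URLs; the actual paths and rules will differ. The controller maps both URL variants to the same resource, while the security configuration leaves the bot variant unauthenticated so no OAuth2 redirect is issued for it.

  // Sketch of Step 2: one resource, two URLs, with the bot variant left unauthenticated.
  import org.springframework.context.annotation.Bean;
  import org.springframework.context.annotation.Configuration;
  import org.springframework.security.config.Customizer;
  import org.springframework.security.config.annotation.web.builders.HttpSecurity;
  import org.springframework.security.web.SecurityFilterChain;
  import org.springframework.web.bind.annotation.GetMapping;
  import org.springframework.web.bind.annotation.PathVariable;
  import org.springframework.web.bind.annotation.RestController;

  @Configuration
  public class BotSecurityConfig {

      @Bean
      public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
          http
              .authorizeHttpRequests(auth -> auth
                  // The bot-indexing variants are served without authentication,
                  // so no OAuth2 redirect is ever issued for them
                  .requestMatchers("/bot/**").permitAll()
                  .anyRequest().authenticated())
              .oauth2Login(Customizer.withDefaults());
          return http.build();
      }
  }

  @RestController
  class ArticleController {

      // Spring MVC allows several paths to map to the same handler;
      // /bot/articles/{id} is the indexable variant of /articles/{id}
      @GetMapping({"/articles/{id}", "/bot/articles/{id}"})
      public String article(@PathVariable String id) {
          return "article " + id;
      }
  }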

Step 3 required an update to the robots.txt file. As well as amending it to only allow specific bots, there were also changes to the URLs that were allowed/disallowed, and the addition of the new, indexable URLs that bypass OAuth2 for the same resource. Ad bots were also excluded. However, not all bots will adhere to the rules that are set; robots.txt is really guidance for the bots that do. An illustrative example follows.
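
The robots.txt below is illustrative only; the bot names and the /bot/ prefix are placeholders for whatever the site actually allows.

  # Allowed indexing bots may crawl the bot-indexing URLs only
  User-agent: Googlebot
  User-agent: bingbot
  Allow: /bot/
  Disallow: /

  # Ad bots are excluded explicitly, since some of them ignore the wildcard group
  User-agent: AdsBot-Google
  Disallow: /

  # Everything else is disallowed
  User-agent: *
  Disallow: /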

Step 4 required an update to all sitemap URLs. They were amended to use the new bot-indexing URLs, as per the previous step. The new sitemaps were also submitted through the Google Search Console, ready for use. Google also provides a console-specific user agent, Google-InspectionTool, to allow for page indexing tests. By testing a sample URL from the updated sitemap, it was possible to ensure that the new URLs would be indexed successfully.
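
For illustration, a sitemap entry now simply points at the bot-indexing variant of a page; the domain and path below are placeholders.

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <!-- /bot/articles/123 stands in for the bot-indexing variant of a page URL -->
    <url>
      <loc>https://www.example.com/bot/articles/123</loc>
    </url>
  </urlset>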

Step 5 was to ensure that end users clicking on a search-engine result do not use the same URL as the bot, i.e. users should continue to see no change to the URLs they are used to. This is particularly important where reporting considerations are in place, as well as bookmarks, and to ensure that users never have more than a single URL to a resource. Also bear in mind that the bot-indexing URL behaves differently to the standard URL, i.e. it is augmented for minimal processing and bypasses OAuth2. The web server configuration was amended to look for allowed bot user agents, and only amend URLs where the request was not from one of them. With Apache, this was possible with something similar to the rules below.

  # User agent is not one of the 'allowed' indexing bots
  RewriteCond %{HTTP_USER_AGENT} !(your list of allowed bots) [NC]
  # Request is a GET
  RewriteCond %{REQUEST_METHOD} (GET) [NC]
  # URL starts with 'whatever you set the bot url to'
  RewriteCond %{REQUEST_URI} ^/whatever(.*)$
  # Remove /whatever from the URL and redirect to the regular version
  RewriteRule ^/whatever(.*) $1 [R=301,L,NC,NE,NS,QSA]

These steps ensure that allowed bots are serviced on URLs that are specific to indexing, while users clicking on a search result are redirected to the regular version of the URL.

