Managing bot indexing with OAuth2 applications

The title is quite generic; however, the article refers to a Java application that implements Spring Security to handle OAuth2, fronted by a web server.

For a request on a new session, the application will initially redirect to the OAuth2 server, and then redirect back to the requested page. This scenario, involving multiple redirects, does not sit well with bots. The drop in pages being indexed, based on data from the Google Search Console, made it quite clear that the process needed amendment.

In fact, there were several other scenarios where a redirect would occur, and these also resulted in indexing failures.

So, there were several steps to be implemented. Step 1 was to ensure that any naturally occurring redirects would not be executed for indexing bots. In some respects, OAuth2 also acts as a natural deterrent against unwanted bots, i.e. without having to explicitly exclude or handle unwanted bots, it is possible to manage the allowed ones. Where the bot User-Agent is one of the allowed bots, a session variable is created, and that session value is used to process the request in a particular manner. The application could then filter the request to ensure a minimal level of processing, with redirects prevented, i.e. do just enough to allow the page to be indexed. A sketch of such a filter is shown below.
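
By way of illustration, here is a minimal sketch of that idea, assuming a Spring Boot 3 application (on Spring Boot 2 the jakarta.servlet imports would be javax.servlet instead). The class name, the bot list and the allowedBot session attribute are hypothetical, not the article's actual implementation.

  // Minimal sketch of Step 1: flag requests from allowed indexing bots in the session.
  import java.io.IOException;
  import java.util.List;

  import jakarta.servlet.FilterChain;
  import jakarta.servlet.ServletException;
  import jakarta.servlet.http.HttpServletRequest;
  import jakarta.servlet.http.HttpServletResponse;

  import org.springframework.stereotype.Component;
  import org.springframework.web.filter.OncePerRequestFilter;

  @Component
  public class AllowedBotFilter extends OncePerRequestFilter {

      // Example allow-list; substitute the bots that should be able to index the site
      private static final List<String> ALLOWED_BOTS =
              List.of("Googlebot", "Google-InspectionTool", "bingbot");

      @Override
      protected void doFilterInternal(HttpServletRequest request,
                                      HttpServletResponse response,
                                      FilterChain chain) throws ServletException, IOException {
          String userAgent = request.getHeader("User-Agent");
          if (userAgent != null && ALLOWED_BOTS.stream().anyMatch(userAgent::contains)) {
              // Flag the session so downstream processing can stay minimal
              // and avoid issuing redirects for this request
              request.getSession(true).setAttribute("allowedBot", Boolean.TRUE);
          }
          chain.doFilter(request, response);
      }
  }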

Step 2 required the bot indexing request to bypass any OAuth2 handling. With Spring Boot applications, it is possible to map multiple URLs, i.e. variations on a URL, to the same resource. One of these URLs, used for bots, is set to be ignored by OAuth2, ensuring that there is no authentication redirect. There was also some additional Spring Security configuration implemented to support this, along the lines of the sketch below.
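
The sketch below assumes Spring Security 6 style configuration and a hypothetical /bot/** prefix for the bot-indexing URLs; the actual paths and rules will differ. The controller maps both URL variants to the same resource, while the security configuration leaves the bot variant unauthenticated so no OAuth2 redirect is issued for it.

  // Sketch of Step 2: one resource, two URLs, with the bot variant left unauthenticated.
  import org.springframework.context.annotation.Bean;
  import org.springframework.context.annotation.Configuration;
  import org.springframework.security.config.Customizer;
  import org.springframework.security.config.annotation.web.builders.HttpSecurity;
  import org.springframework.security.web.SecurityFilterChain;
  import org.springframework.web.bind.annotation.GetMapping;
  import org.springframework.web.bind.annotation.PathVariable;
  import org.springframework.web.bind.annotation.RestController;

  @Configuration
  public class BotSecurityConfig {

      @Bean
      public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
          http
              .authorizeHttpRequests(auth -> auth
                  // The bot-indexing variants are served without authentication,
                  // so no OAuth2 redirect is ever issued for them
                  .requestMatchers("/bot/**").permitAll()
                  .anyRequest().authenticated())
              .oauth2Login(Customizer.withDefaults());
          return http.build();
      }
  }

  @RestController
  class ArticleController {

      // Spring MVC allows several paths to map to the same handler;
      // /bot/articles/{id} is the indexable variant of /articles/{id}
      @GetMapping({"/articles/{id}", "/bot/articles/{id}"})
      public String article(@PathVariable String id) {
          return "article " + id;
      }
  }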

Step 3 required an update to the robots.txt file. As well as amending it to only allow specific bots, there were also changes to the URLs that were allowed/disallowed, and the addition of the new, indexable URLs that bypass OAuth2 for the same resource. Ad bots were also excluded. However, not all bots will adhere to the rules that are set; robots.txt is really guidance for the bots that do. An illustrative example follows.
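
The robots.txt below is illustrative only; the bot names and the /bot/ prefix are placeholders for whatever the site actually allows.

  # Allowed indexing bots may crawl the bot-indexing URLs only
  User-agent: Googlebot
  User-agent: bingbot
  Allow: /bot/
  Disallow: /

  # Ad bots are excluded explicitly, since some of them ignore the wildcard group
  User-agent: AdsBot-Google
  Disallow: /

  # Everything else is disallowed
  User-agent: *
  Disallow: /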

Step 4 required an update to all sitemap URLs. They were amended to use the new bot-indexing URLs, as per the previous step. The new sitemaps were also submitted through the Google Search Console, ready for use. Google also provides a console-specific user agent, Google-InspectionTool, to allow for page indexing tests. By testing a sample URL from the updated sitemap, it was possible to ensure that the new URLs would be indexed successfully.
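
For illustration, a sitemap entry now simply points at the bot-indexing variant of a page; the domain and path below are placeholders.

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <!-- /bot/articles/123 stands in for the bot-indexing variant of a page URL -->
    <url>
      <loc>https://www.example.com/bot/articles/123</loc>
    </url>
  </urlset>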

Step 5 was to ensure that end users clicking on a search-engine result do not use the same URL as the bot, i.e. users should continue to see no change to the URLs they are used to. This is particularly important where reporting considerations are in place, as well as bookmarks, and to ensure that users never have more than a single URL to a resource. Also bear in mind that the bot-indexing URL behaves differently to the standard URL, i.e. it is augmented for minimal processing and bypasses OAuth2. The web server configuration was amended to look for allowed bot user agents, and only amend URLs where the request was not from one of them. With Apache, this was possible with something similar to the rules below.

  # User agent is not one of the 'allowed' indexing bots
  RewriteCond %{HTTP_USER_AGENT} !(your list of allowed bots) [NC]
  # Request is a GET
  RewriteCond %{REQUEST_METHOD} (GET) [NC]
  # URL starts with 'whatever you set the bot url to'
  RewriteCond %{REQUEST_URI} ^/whatever(.*)$
  # Remove /whatever from the URL and redirect to the regular version
  RewriteRule ^/whatever(.*) $1 [R=301,L,NC,NE,NS,QSA]

These steps ensure that allowed bots are serviced on URLs that are specific to indexing, while users clicking on a search result are redirected to the regular version of the URL.

