How to Fix Common Robots.txt Issues
Muhammad Abubakar
SEO Consultant | Scaled 50+ Brands to $10M+ Through SEO | Helping Founders Scale Their Digital Presence | Business Growth Strategist
What Is Robots.txt?
Robots.txt is a useful and powerful tool for instructing search engine crawlers on how you want them to crawl your website.
It contains instructions for bots that tell them which webpages they can and cannot access. Robots.txt files are most relevant for web crawlers from search engines like Google. Managing robots.txt is an important component of technical SEO.
If pages or a section of your site are disallowed from crawling through the robots.txt file, then any indexing or serving directives on those pages will not be found and will therefore be ignored.
For example, Googlebot will not see:
Indexing directives such as a noindex robots meta tag
Other metadata content on the page
What Can Robots.txt Do?
Robots.txt can achieve a variety of results across a range of different content types. For web pages, it can block crawling entirely: blocked pages may still appear in search results, but they will not have a text description, and non-HTML content on the page will not be crawled either.
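For instance, a minimal robots.txt that blocks one directory while leaving the rest of the site crawlable might look like this (a sketch; the /private/ path is a placeholder):

User-agent: *
Disallow: /private/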
How Dangerous Are Robots.txt Mistakes?
A mistake in robots.txt can have unintended consequences, but it’s often not the end of the world. The good news is that by fixing your robots.txt file, you can recover from any errors quickly and (usually) in full.
7 Common Robots.txt Mistakes
The best way to find robots.txt errors is with a site audit. This lets you uncover technical SEO issues at scale so you can resolve them. Here are common issues with robots.txt specifically:
1. Robots.txt Not In The Root Directory
Search robots can only discover the file if it’s in your root folder. That’s why there should be nothing but a forward slash between your domain (the .com or equivalent) and the ‘robots.txt’ filename in the URL of your robots.txt file.
If there’s a subfolder in there, your robots.txt file is probably not visible to the search robots, and your website is probably behaving as if there was no robots.txt file at all.
How To Fix: Move your robots.txt file to your root directory. It’s worth noting that this requires root access to your server.
Some content management systems will upload files to a “media” subdirectory (or something similar) by default, so you might need to circumvent this to get your robots.txt file in the right place.
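For example, assuming your domain is example.com (a placeholder), crawlers will only look for the file at:

https://www.example.com/robots.txt

A file uploaded to https://www.example.com/media/robots.txt will be ignored entirely.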
2. Poor Use Of Wildcards
Robots.txt supports two wildcard characters:
Asterisk (*) – matches any sequence of valid characters, like a Joker in a deck of cards.
Dollar sign ($) – denotes the end of a URL, allowing you to apply rules only to the final part of the URL, such as the filetype extension.
How To Fix: Test your wildcard rules using a robots.txt testing tool to ensure they behave as expected. Be cautious with wildcard usage to prevent accidentally blocking or allowing too much.
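For example, the following rules block every URL containing a query string and every PDF file, while leaving everything else crawlable (a sketch; the patterns are illustrative):

User-agent: *
Disallow: /*?
Disallow: /*.pdf$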
3. Noindex In Robots.txt
This one is more common on websites that are over a few years old.
Google stopped obeying noindex rules in robots.txt files on September 1, 2019. If your robots.txt file was created before that date or contains noindex instructions, you will likely see those pages indexed in Google’s search results.
How To Fix: The solution to this problem is to implement an alternative “noindex” method. One option is the robots meta tag, which you can add to the head of any webpage you want to prevent Google from indexing.
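For example, placing this tag in the head of a page tells Google not to index it:

<meta name="robots" content="noindex">

Remember that the page must remain crawlable in robots.txt, or Googlebot will never see the tag.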
4. Blocked Scripts And Stylesheets
It might seem logical to block crawler access to external JavaScript files and cascading stylesheets (CSS). However, remember that Googlebot needs access to CSS and JS files to “see” your HTML and PHP pages correctly.
If your pages are behaving oddly in Google’s results, or it looks like Google is not seeing them correctly, check whether you are blocking crawler access to required external files.
How To Fix: A simple solution is to remove the line from your robots.txt file that is blocking access.
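If you need to keep a directory blocked but still let crawlers reach the assets inside it, an Allow rule can carve out an exception (a sketch, assuming your scripts and stylesheets live under a placeholder /assets/ path):

User-agent: *
Disallow: /assets/
Allow: /assets/*.css
Allow: /assets/*.js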
5. No XML Sitemap URL
This is more about SEO than anything else. The robots.txt file is one of the first places Googlebot looks when it crawls your website, so listing your XML sitemap there gives the crawler a head start in learning the structure and main pages of your site.
This is not strictly an error – omitting a sitemap should not negatively affect the core functionality and appearance of your website in the search results – but it is still worth adding if you want to give crawlers a helping hand.
How To Fix: You can include the URL of your XML sitemap in your robots.txt file.
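For example (example.com and the sitemap path are placeholders):

Sitemap: https://www.example.com/sitemap.xml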
6. Access To Development Sites
Blocking crawlers from your live website is a no-no, but so is allowing them to crawl and index pages that are still under development. Forgetting to remove the following lines from robots.txt when a site goes live is one of the most common mistakes among web developers; it can stop your entire website from being crawled and indexed correctly.
User-agent: *
Disallow: /
How To Fix: It’s best practice to add a disallow instruction to the robots.txt file of a website under construction so the general public doesn’t see it until it’s finished. Equally, it’s crucial to remove the disallow instruction when you launch the completed website.
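One way to confirm the block has been lifted is a quick check with Python’s standard-library robots.txt parser (a minimal sketch; the domain is a placeholder, and note that this simple parser does not understand wildcard rules):

import urllib.robotparser

# Fetch and parse the live robots.txt file
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# False while "Disallow: /" is present; True once it has been removed
print(rp.can_fetch("*", "https://www.example.com/"))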
7. Using Absolute URLs
While using absolute URLs in things like canonicals and hreflang is best practice, the inverse is true for URLs in the robots.txt file. When you use an absolute URL, there’s no guarantee that crawlers will interpret it as intended and that the disallow/allow rule will be followed.
How To Fix: Using relative paths in the robots.txt file is the recommended approach for indicating which parts of a site should not be accessed by crawlers.
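For example (the domain and path are placeholders):

# Not recommended: absolute URL
Disallow: https://www.example.com/private/

# Recommended: relative path
Disallow: /private/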
How To Recover From A Robots.txt Error
If a mistake in robots.txt has unwanted effects on your website’s search appearance, the first step is to correct robots.txt and verify that the new rules have the desired effect.
When you are confident that robots.txt is behaving as desired, you can try to get your site re-crawled as soon as possible.
Submit an updated sitemap in Google Search Console and use the URL Inspection tool to request a re-crawl of any pages that have been inappropriately delisted.
Unfortunately, you are at the whim of Googlebot – there’s no guarantee as to how long it might take for any missing pages to reappear in the Google search index.
Final Thoughts
Review the best practices covered above and compare your site against the common errors we’ve listed to ensure your robots.txt file is implemented correctly. Where robots.txt errors are concerned, prevention is always better than the cure.
Edits to robots.txt should be made carefully by experienced developers, double-checked, and – where appropriate – subject to a second opinion.
If possible, test changes in a sandbox editor before pushing them live to your real-world server to avoid inadvertently creating availability issues. Make sure your server is not automatically redirecting requests for robots.txt or serving varying versions of the file. Benchmark your site’s performance before and after changes.