How Google Interprets robots.txt Definitions #GoogleSeoGuide #s1ep36

Google's automated crawlers support the Robots Exclusion Protocol (REP). This means that before crawling a site, Google's crawlers download and parse the site's robots.txt file to extract information about which parts of the site may be crawled. The REP isn't applicable to Google's crawlers that are controlled by users (for example, feed subscriptions), or crawlers that are used to increase user safety (for example, malware analysis).

This page describes Google's interpretation of the REP. For the original standard, see RFC 9309.

What is a robots.txt file

If you don't want crawlers to access sections of your site, you can create a robots.txt file with appropriate rules. A robots.txt file is a simple text file containing rules about which crawlers may access which parts of a site. For example, the robots.txt file for example.com may look like this:

# This robots.txt file controls crawling of URLs under https://example.com.
# All crawlers are disallowed to crawl files in the "includes" directory, such
# as .css, .js, but Google needs them for rendering, so Googlebot is allowed
# to crawl them.
User-agent: *
Disallow: /includes/

User-agent: Googlebot
Allow: /includes/

Sitemap: https://example.com/sitemap.xml        
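As a quick sanity check, rules like the example above can be tested with Python's standard-library urllib.robotparser. Note that this parser implements only the basic exclusion protocol (no wildcard support and simpler precedence than Google's parser), and the bot name SomeOtherBot is just an illustrative placeholder:

```python
from urllib.robotparser import RobotFileParser

# Parse the example rules directly; no network fetch is needed.
rules = """\
User-agent: *
Disallow: /includes/

User-agent: Googlebot
Allow: /includes/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The generic group blocks /includes/ for unmatched crawlers...
print(rp.can_fetch("SomeOtherBot", "https://example.com/includes/app.js"))  # False
# ...but the more specific Googlebot group re-allows it.
print(rp.can_fetch("Googlebot", "https://example.com/includes/app.js"))     # True
```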

File location and range of validity

You must place the robots.txt file in the top-level directory of a site, on a supported protocol.

The URL for the robots.txt file is, like other URLs, case-sensitive. In the case of Google Search, the supported protocols are HTTP, HTTPS, and FTP. On HTTP and HTTPS, crawlers fetch the robots.txt file with a non-conditional HTTP GET request; on FTP, crawlers use a standard RETR (RETRIEVE) command with anonymous login.

The rules listed in the robots.txt file apply only to the host, protocol, and port number where the robots.txt file is hosted.
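To see which robots.txt file governs a given URL, it is enough to keep the scheme, host, and explicit port and replace the path. A minimal sketch with Python's urllib.parse (it deliberately skips edge cases such as normalizing default port numbers):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the robots.txt URL that governs page_url: same protocol,
    host, and port, with the path fixed to /robots.txt."""
    parts = urlsplit(page_url)
    # netloc keeps any explicit port, so https://example.com:8181/x
    # maps to https://example.com:8181/robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://example.com/folder/file"))
# https://example.com/robots.txt
print(robots_txt_url("https://example.com:8181/folder/file"))
# https://example.com:8181/robots.txt
```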

Examples of valid robots.txt URLs

The following examples show robots.txt URLs and the URL paths they're valid for.

https://example.com/robots.txt

This is the general case. It's not valid for other subdomains, protocols, or port numbers. It's valid for all files in all subdirectories on the same host, protocol, and port number.

Valid for:

  • https://example.com/
  • https://example.com/folder/file

Not valid for:

  • https://other.example.com/
  • http://example.com/
  • https://example.com:8181/

https://www.example.com/robots.txt

A robots.txt file on a subdomain (in this example, the www subdomain) is only valid for that subdomain.

Valid for: https://www.example.com/

Not valid for:

  • https://example.com/
  • https://shop.www.example.com/
  • https://www.shop.example.com/

https://example.com/folder/robots.txt

Not a valid robots.txt file. Crawlers don't check for robots.txt files in subdirectories.

ftp://example.com/robots.txt

Valid for: ftp://example.com/

Not valid for: https://example.com/

https://212.96.82.21/robots.txt

A robots.txt file with an IP address as the host name is only valid for crawling of that IP address as the host name. It isn't automatically valid for all websites hosted on that IP address (though it's possible that the robots.txt file is shared, in which case it would also be available under the shared host name).

Valid for: https://212.96.82.21/

Not valid for: https://example.com/ (even if hosted on 212.96.82.21)

https://example.com:80/robots.txt

Standard port numbers (80 for HTTP, 443 for HTTPS, 21 for FTP) are equivalent to their default host names.

Valid for:

  • https://example.com:80/
  • https://example.com/

Not valid for: https://example.com:81/

https://example.com:8181/robots.txt

Robots.txt files on non-standard port numbers are only valid for content made available through those port numbers.

Valid for: https://example.com:8181/

Not valid for: https://example.com/




Handling of errors and HTTP status codes

When requesting a robots.txt file, the HTTP status code of the server's response affects how the robots.txt file will be used by Google's crawlers. The following list summarizes how Googlebot treats robots.txt files for different HTTP status codes.

  • 2xx (success): HTTP status codes that signal success prompt Google's crawlers to process the robots.txt file as provided by the server.
  • 3xx (redirection): Google follows at least five redirect hops as defined by RFC 1945 and then stops and treats it as a 404 for the robots.txt. This also applies to any disallowed URLs in the redirect chain, since the crawler couldn't fetch rules due to the redirects. Google doesn't follow logical redirects in robots.txt files (frames, JavaScript, or meta refresh-type redirects).
  • 4xx (client errors): Google's crawlers treat all 4xx errors, except 429, as if a valid robots.txt file didn't exist. This means that Google assumes there are no crawl restrictions. Don't use 401 and 403 status codes to limit the crawl rate; 4xx status codes, except 429, have no effect on crawl rate. Learn how to limit your crawl rate.
  • 5xx (server errors): Because the server couldn't give a definite response to Google's robots.txt request, Google temporarily interprets 5xx and 429 errors as if the site is fully disallowed. Google will try to crawl the robots.txt file until it obtains a non-server-error HTTP status code. If you need to temporarily suspend crawling, serve a 503 HTTP status code for every URL on the site.
  • Other errors: A robots.txt file which cannot be fetched due to DNS or networking issues, such as timeouts, invalid responses, reset or interrupted connections, and HTTP chunking errors, is treated as a server error.
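The status-code handling above can be condensed into a small decision function. This is a simplified sketch of the behavior described, not Google's implementation; the policy-name strings are invented for illustration, and None stands for a DNS or networking failure:

```python
def robots_crawl_policy(status):
    """Map the HTTP status of a robots.txt fetch to a crawl policy,
    per the rules above. A sketch, not Google's implementation."""
    if status is None or status == 429 or 500 <= status < 600:
        # Server errors, 429, and network/DNS failures: treat the site
        # as fully disallowed until a non-server-error response arrives.
        return "assume-fully-disallowed"
    if 300 <= status < 400:
        # Follow up to five redirect hops; after that the result is
        # treated as a 404, i.e. no crawl restrictions.
        return "follow-redirect"
    if 200 <= status < 300:
        return "use-rules-as-served"
    if 400 <= status < 500:
        # All other 4xx: as if no valid robots.txt file exists.
        return "assume-no-restrictions"
    raise ValueError(f"unexpected HTTP status: {status}")
```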




Caching

Google generally caches the contents of a robots.txt file for up to 24 hours, but may cache it longer in situations where refreshing the cached version isn't possible (for example, due to timeouts or 5xx errors). The cached response may be shared by different crawlers. Google may increase or decrease the cache lifetime based on max-age Cache-Control HTTP headers.
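Picking a cache lifetime from a max-age Cache-Control directive can be sketched like this. The 24-hour fallback mirrors the default described above; real header parsing has more edge cases than this regular expression handles:

```python
import re

DEFAULT_LIFETIME = 24 * 3600  # the ~24-hour default described above

def cache_lifetime_seconds(cache_control):
    """Return a cache lifetime in seconds taken from a Cache-Control
    header's max-age directive, falling back to the default. A
    simplified sketch of header parsing."""
    if cache_control:
        m = re.search(r"max-age\s*=\s*(\d+)", cache_control)
        if m:
            return int(m.group(1))
    return DEFAULT_LIFETIME

print(cache_lifetime_seconds("public, max-age=3600"))  # 3600
print(cache_lifetime_seconds(None))                    # 86400
```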




File format

The robots.txt file must be a UTF-8 encoded plain text file and the lines must be separated by CR, CR/LF, or LF.

Google ignores invalid lines in robots.txt files, including the Unicode Byte Order Mark (BOM) at the beginning of the file, and uses only the valid lines. For example, if the content downloaded is HTML instead of robots.txt rules, Google will try to parse the content, extract rules, and ignore everything else.

Similarly, if the character encoding of the robots.txt file isn't UTF-8, Google may ignore characters that are not part of the UTF-8 range, potentially rendering robots.txt rules invalid.

Google currently enforces a robots.txt file size limit of 500 kibibytes (KiB). Content after the maximum file size is ignored. You can reduce the size of the robots.txt file by consolidating rules that would result in an oversized file. For example, place excluded material in a separate directory.




Syntax and rules

Valid robots.txt lines consist of a field, a colon, and a value. Spaces are optional but recommended to improve readability. Space at the beginning and at the end of the line is ignored. To include comments, precede your comment with the # character. Keep in mind that everything after the # character will be ignored. The general format is <field>:<value><#optional-comment>.

Google supports the following fields:

  • user-agent: Identifies which crawler the rules apply to.
  • allow: A URL path that may be crawled.
  • disallow: A URL path that may not be crawled.
  • sitemap: The complete URL of a sitemap.

The allow and disallow fields are also called rules (also known as directives). These rules are always specified in the form of rule: [path], where [path] is optional. By default, there are no restrictions on crawling for the designated crawlers. Crawlers ignore rules without a [path].

The [path] value, if specified, is relative to the root of the website from where the robots.txt file was fetched (using the same protocol, port number, host and domain names). The path value must start with / to designate the root, and the value is case-sensitive. Learn more about URL matching based on path values.

Grouping of lines and rules

You can group together rules that apply to multiple user agents by repeating?user-agent?lines for each crawler.

Order of precedence for user agents

Only one group is valid for a particular crawler. Google's crawlers determine the correct group of rules by finding in the robots.txt file the group with the most specific user agent that matches the crawler's user agent. Other groups are ignored. All non-matching text is ignored (for example, both googlebot/1.2 and googlebot* are equivalent to googlebot). The order of the groups within the robots.txt file is irrelevant.

If there's more than one specific group declared for a user agent, all the rules from the groups applicable to the specific user agent are combined internally into a single group. User agent specific groups and global groups (*) are not combined.

Examples

Matching of user-agent fields


user-agent: googlebot-news
(group 1)

user-agent: *
(group 2)

user-agent: googlebot
(group 3)        

  • Googlebot News: googlebot-news follows group 1, because group 1 is the most specific group.
  • Googlebot (web): googlebot follows group 3.
  • Googlebot Storebot: Storebot-Google follows group 2, because there is no specific Storebot-Google group.
  • Googlebot News (when crawling images): When crawling images, googlebot-news follows group 1. googlebot-news doesn't crawl the images for Google Images, so it only follows group 1.
  • Otherbot (web): Other Google crawlers follow group 2.
  • Otherbot (news): Other Google crawlers that crawl news content, but don't identify as googlebot-news follow group 2. Even if there is an entry for a related crawler, it is only valid if it's specifically matching.
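The group selection in these examples can be sketched as follows. The matching here is a simplification (the longest matching group name wins, with the '*' group as fallback) and the data shapes are hypothetical, not Google's parser:

```python
def select_group(groups, crawler_product):
    """Choose the robots.txt group for a crawler per the precedence
    rules above: the most specific (longest) matching user-agent name
    wins; the '*' group is the fallback. `groups` maps a lowercased
    user-agent token to its rules. A sketch with hypothetical shapes."""
    # Ignore version suffixes such as "googlebot/1.2".
    product = crawler_product.split("/")[0].lower()
    # Trailing '*' in a group name (e.g. "googlebot*") is equivalent
    # to the bare name, so strip it before comparing.
    candidates = [ua for ua in groups
                  if ua != "*" and product.startswith(ua.rstrip("*"))]
    if candidates:
        return groups[max(candidates, key=len)]
    return groups.get("*")

groups = {
    "googlebot-news": ["disallow: /fish"],
    "googlebot": ["disallow: /carrots"],
    "*": ["disallow: /shrimp"],
}
print(select_group(groups, "Googlebot-News/2.1"))  # most specific group wins
print(select_group(groups, "Storebot-Google"))     # falls back to the '*' group
```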

Grouping of rules

1. If there are multiple groups in a robots.txt file that are relevant to a specific user agent, Google's crawlers internally merge the groups. For example:


user-agent: googlebot-news
disallow: /fish

user-agent: *
disallow: /carrots

user-agent: googlebot-news
disallow: /shrimp        

The crawlers internally group the rules based on user agent, for example:


user-agent: googlebot-news
disallow: /fish
disallow: /shrimp

user-agent: *
disallow: /carrots        
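That merge step can be sketched with a hypothetical data shape of (user-agents, rules) pairs in file order; this is an illustration of the merge, not Google's parser:

```python
from collections import defaultdict

def merge_groups(parsed_groups):
    """Merge multiple groups declared for the same user agent into one,
    as in the example above. `parsed_groups` is a list of
    (user_agents, rules) pairs in file order."""
    merged = defaultdict(list)
    for agents, rules in parsed_groups:
        for agent in agents:
            # Rules from later groups for the same agent are appended.
            merged[agent.lower()].extend(rules)
    return dict(merged)

parsed = [
    (["googlebot-news"], ["disallow: /fish"]),
    (["*"], ["disallow: /carrots"]),
    (["googlebot-news"], ["disallow: /shrimp"]),
]
print(merge_groups(parsed))
```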

2. Rules other than allow, disallow, and user-agent are ignored by the robots.txt parser. This means that the robots.txt snippet below is treated as one group, and thus both user-agent a and b are affected by the disallow: / rule:


user-agent: a
sitemap: https://example.com/sitemap.xml

user-agent: b
disallow: /        

When the crawlers process the robots.txt rules, they ignore the sitemap line. For example, this is how the crawlers would understand the previous robots.txt snippet:


user-agent: a
user-agent: b
disallow: /        

URL matching based on path values

Google, Bing, and other major search engines support a limited form of wildcards for path values. These wildcard characters are:

  1. * designates 0 or more instances of any valid character.
  2. $ designates the end of the URL.

Example path matches

  • / Matches the root and any lower level URL.
  • /* Equivalent to /.
  • /$ Matches only the root. Any lower level URL is allowed for crawling.
  • /example Matches any path that starts with /example. Note that the matching is case-sensitive.
  • /example* Equivalent to /example.
  • /example/ Matches anything in the /example/ folder.
  • /*.php Matches any path that contains .php.
  • /*.php$ Matches any path that ends with .php.
  • /example*.php Matches any path that contains /example and .php, in that order.
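These matching rules can be sketched as a small matcher that treats * as any run of characters and a trailing $ as an end anchor, with matching anchored at the start of the path. This is a simplification of what real parsers do, but it reproduces the examples above:

```python
import re

def path_rule_matches(rule_path, url_path):
    """True if url_path matches rule_path using robots.txt wildcards:
    '*' matches any run of characters, a trailing '$' anchors the end,
    and matching is anchored at the start of the path."""
    anchored = rule_path.endswith("$")
    body = rule_path[:-1] if anchored else rule_path
    pattern = "".join(".*" if c == "*" else re.escape(c) for c in body)
    if anchored:
        pattern += "$"
    return re.match(pattern, url_path) is not None

print(path_rule_matches("/*.php", "/folder/index.php"))  # True: contains .php
print(path_rule_matches("/*.php$", "/index.php?x=1"))    # False: must end with .php
print(path_rule_matches("/$", "/"))                      # True: root only
print(path_rule_matches("/$", "/page"))                  # False: root only
```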

Order of precedence for rules

When matching robots.txt rules to URLs:

1. Crawlers use the most specific rule, based on the length of the rule path.

2. In case of conflicting rules, including those with wildcards, Google uses the least restrictive rule.

The following examples demonstrate which rule Google's crawlers will apply on a given URL.

Sample situations

https://example.com/page

allow: /p
disallow: /

Applicable rule: allow: /p, because it's more specific.

https://example.com/folder/page

allow: /folder
disallow: /folder

Applicable rule: allow: /folder, because in case of conflicting rules, Google uses the least restrictive rule.

https://example.com/page.htm

allow: /page
disallow: /*.htm

Applicable rule: disallow: /*.htm, because the rule path is longer and it matches more characters in the URL, so it's more specific.

https://example.com/page.php5

allow: /page
disallow: /*.ph

Applicable rule: allow: /page, because in case of conflicting rules, Google uses the least restrictive rule.

https://example.com/

allow: /$
disallow: /

Applicable rule: allow: /$, because it's more specific.

https://example.com/page.htm

allow: /$
disallow: /

Applicable rule: disallow: /, because the allow rule only applies to the root URL.
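The precedence logic in the sample situations above can be sketched as follows. Specificity is measured by rule-path length, which is a simplification for wildcard rules, and the helper matcher mirrors the wildcard semantics described earlier; this is an illustration, not Google's implementation:

```python
import re

def _matches(rule_path, url_path):
    """robots.txt path match: anchored at the start of the path,
    '*' matches any run of characters, trailing '$' anchors the end."""
    anchored = rule_path.endswith("$")
    body = rule_path[:-1] if anchored else rule_path
    pattern = "".join(".*" if c == "*" else re.escape(c) for c in body)
    if anchored:
        pattern += "$"
    return re.match(pattern, url_path) is not None

def applicable_rule(rules, url_path):
    """Pick the winning rule for url_path per the precedence above:
    the longest matching rule path is the most specific; when equally
    specific rules conflict, allow (least restrictive) wins. `rules`
    is a list of ("allow"|"disallow", path) pairs."""
    matching = [(kind, path) for kind, path in rules
                if _matches(path, url_path)]
    if not matching:
        return None  # no rule matches: crawling is allowed
    longest = max(len(path) for _, path in matching)
    finalists = [r for r in matching if len(r[1]) == longest]
    finalists.sort(key=lambda r: r[0] != "allow")  # allow first on ties
    return finalists[0]

print(applicable_rule([("allow", "/p"), ("disallow", "/")], "/page"))
# ('allow', '/p')
print(applicable_rule([("allow", "/page"), ("disallow", "/*.htm")], "/page.htm"))
# ('disallow', '/*.htm')
```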


To be continued...

Thank you for learning with us. This episode is based on the Fundamentals of SEO Starter Guide by Google. Remember, you can always follow all of the episodes of "Google SEO Guide" from the document below:

How Will We Do It?

  • We will learn SEO
  • We will learn about Digital Marketing
  • We will not give up
  • We will not stop till we get results

We won’t care what anyone says, and SEO & Digital Marketing will change our lives!!!

What Will We Learn?

Click the link below and get the #LearningSEOsocially Calendar Season 1&2. Follow this plan and learn with me how to develop #SEO skills with free guides, resources, tools, and more.

Follow Each Other to #LearnSEO Socially:

  • Twitter: https://twitter.com/search_doc/
  • LinkedIn: https://tr.linkedin.com/in/nazimkopuz

BONUS:

Don't forget to get the list of 29 Must-Have #SEO Skills, from basic to mastery.
