robots.txt #GoogleSEOGuide #s1ep33

What is a robots.txt file used for?

A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests.

Note: robots.txt files are not a mechanism for keeping a website out of Google.

To keep a website out of Google, block indexing with noindex or password-protect the page.

A robots.txt file is used primarily to manage crawler traffic to your site, and usually to keep a file off Google, depending on the file type:

  • Web page: You can use a robots.txt file for web pages (HTML, PDF, or other non-media formats that Google can read), to manage crawling traffic if you think your server will be overwhelmed by requests from Google's crawler, or to avoid crawling unimportant or similar pages on your site.

Warning: Don't use a robots.txt file as a means to hide your website from Google search results. If other pages point to your website with descriptive text, Google could still index the URL without visiting the page. If you want to block your website from search results, use another method such as password protection or noindex.

  • Media file: Use a robots.txt file to manage crawl traffic, and also to prevent image, video, and audio files from appearing in Google search results. This won't prevent other pages or users from linking to your image, video, or audio file.
  • Resource file: You can use a robots.txt file to block resource files such as unimportant images, scripts, or style files, if you think that pages loaded without these resources will not be significantly affected by the loss. However, if the absence of these resources makes it harder for Google's crawler to understand the page, don't block them; otherwise Google won't do a good job of analyzing pages that depend on those resources.

Understand the limitations of a robots.txt file

Before you create or edit a robots.txt file, you should know the limits of this URL blocking method. Depending on your goals and situation, you might want to consider other mechanisms to ensure your URLs are not findable on the web.

1. robots.txt rules may not be supported by all search engines:

The instructions in robots.txt files cannot enforce crawler behavior on your site; it's up to the crawler to obey them. While Googlebot and other respectable web crawlers obey the instructions in a robots.txt file, other crawlers might not. Therefore, if you want to keep information secure from web crawlers, it's better to use other blocking methods, such as password-protecting private files on your server.

2. Different crawlers interpret syntax differently:

Although respectable web crawlers follow the rules in a robots.txt file, each crawler might interpret the rules differently. You should know the proper syntax for addressing different web crawlers, as some might not understand certain instructions.

3. Disallowed pages in robots.txt can still be indexed if linked to from other sites:

While Google won't crawl or index the content blocked by a robots.txt file, it might still find and index a disallowed URL if it is linked from other places on the web. As a result, the URL and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To properly prevent your URL from appearing in Google search results, password-protect the files on your server, use the noindex meta tag or response header, or remove the page entirely.
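
For reference, a noindex rule can be expressed either as a robots meta tag in a page's HTML head:

<meta name="robots" content="noindex">

or as an HTTP response header, which also works for non-HTML files such as PDFs:

X-Robots-Tag: noindex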

Create a robots.txt file

A robots.txt file lives at the root of your site. So, for the site www.example.com, the robots.txt file lives at www.example.com/robots.txt. robots.txt is a plain text file that follows the Robots Exclusion Standard.

A robots.txt file consists of one or more rules. Each rule blocks or allows access for a given crawler to a specified file path on the domain or subdomain where the robots.txt file is hosted. Unless you specify otherwise in your robots.txt file, all files are implicitly allowed for crawling.

If you use a site hosting service, such as Wix or Blogger, you might not need to (or be able to) edit your robots.txt file directly. Instead, your provider might expose a search settings page or some other mechanism to tell search engines whether or not to crawl your page.

Here is a simple robots.txt file with two rules:

User-agent: Googlebot
Disallow: /checkout/

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml        

Here's what that robots.txt file means:

  1. The user agent named Googlebot is not allowed to crawl any URL that starts with https://www.example.com/checkout/.
  2. All other user agents are allowed to crawl the entire site. This could have been omitted and the result would be the same; the default behavior is that user agents are allowed to crawl the entire site.
  3. The site's sitemap file is located at https://www.example.com/sitemap.xml.

See the syntax guidelines and examples below for more.

Basic guidelines for creating a robots.txt file

Creating a robots.txt file and making it generally accessible and useful involves four steps:

1. Create a robots.txt file

You can use almost any text editor to create a robots.txt file. For example, Notepad, TextEdit, vi, and emacs can create valid robots.txt files. Don't use a word processor; word processors often save files in a proprietary format and can add unexpected characters, such as curly quotes, which can cause problems for crawlers. Make sure to save the file with UTF-8 encoding if prompted during the save file dialog.

Format and location rules:

  • The file must be named robots.txt.
  • Your site can have only one robots.txt file.
  • The robots.txt file must be located at the root of the website host to which it applies. For instance, to control crawling on all URLs below https://www.example.com/, the robots.txt file must be located at https://www.example.com/robots.txt. It cannot be placed in a subdirectory (for example, at https://example.com/pages/robots.txt). If you're unsure about how to access your website root or need permission to do so, contact your web hosting service provider. If you can't access your website root, use an alternative blocking method such as meta tags.
  • A robots.txt file can be posted on a subdomain (for example, https://website.example.com/robots.txt) or on non-standard ports (for example, https://example.com:8181/robots.txt).
  • A robots.txt file applies only to paths within the protocol, host, and port where it is posted. That is, rules in https://example.com/robots.txt apply only to files on https://example.com/, not to subdomains such as https://m.example.com/, or alternate protocols, such as http://example.com/.
  • A robots.txt file must be a UTF-8 encoded text file (which includes ASCII). Google may ignore characters that are not part of the UTF-8 range, potentially rendering robots.txt rules invalid.

2. Add rules to the robots.txt file

Rules are instructions for crawlers about which parts of your site they can crawl. Follow these guidelines when adding rules to your robots.txt file:

  • A robots.txt file consists of one or more groups.
  • Each group consists of multiple rules (also known as directives), one rule per line. Each group begins with a User-agent line that specifies the target of the group. A group gives the following information:
    • Who the group applies to (the user agent).
    • Which directories or files that agent can access.
    • Which directories or files that agent cannot access.
  • Crawlers process groups from top to bottom. A user agent can match only one rule set, which is the first, most specific group that matches a given user agent.
  • The default assumption is that a user agent can crawl any page or directory not blocked by a disallow rule.
  • Rules are case-sensitive. For instance, disallow: /file.asp applies to https://www.example.com/file.asp, but not https://www.example.com/FILE.asp.
  • The # character marks the beginning of a comment.

Google's crawlers support the following rules in robots.txt files:

• user-agent: [Required, one or more per group] Specifies the name of the automatic client, known as a search engine crawler, that the rule applies to. This is the first line of any rule group. Google user agent names are listed in the Google list of user agents. Using an asterisk (*) matches all crawlers except the various AdsBot crawlers, which must be named explicitly. For example:


# Example 1: Block only Googlebot
User-agent: Googlebot
Disallow: /

# Example 2: Block Googlebot and AdsBot
User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /

# Example 3: Block all crawlers except AdsBot (AdsBot crawlers must be named explicitly)
User-agent: *
Disallow: /        

• disallow: [At least one or more disallow or allow entries per rule] A directory or page, relative to the root domain, that you don't want the user agent to crawl. If the rule refers to a page, it must be the full page name as shown in the browser. It must start with a / character, and if it refers to a directory, it must end with the / mark.

• allow: [At least one or more disallow or allow entries per rule] A directory or page, relative to the root domain, that may be crawled by the user agent just mentioned. This is used to override a disallow rule to allow crawling of a subdirectory or page in a disallowed directory. For a single page, specify the full page name as shown in the browser. In the case of a directory, end the rule with a / mark.
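
For example, the following group blocks crawling of a directory while still allowing a single page inside it (the directory and file names are only illustrative):

User-agent: *
Disallow: /archive/
Allow: /archive/highlights.html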

• sitemap: [Optional, zero or more per file] The location of a sitemap for this website. The sitemap URL must be a fully-qualified URL; Google doesn't assume or check http/https/www/non-www alternates. Sitemaps are a good way to indicate which content Google should crawl, as opposed to which content it can or cannot crawl. Learn more about sitemaps. Example:


Sitemap: https://example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap.xml        

All rules, except sitemap, support the * wildcard for a path prefix, suffix, or entire string.
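
For instance, the following rules (with illustrative paths) use * to match a path prefix and $ to match a file extension at the end of a URL:

User-agent: *
# Block any URL whose path starts with /private
Disallow: /private*
# Block any URL that ends with .pdf
Disallow: /*.pdf$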

Lines that don't match any of these rules are ignored.

3. Upload the robots.txt file

Once you've saved your robots.txt file to your computer, you're ready to make it available to search engine crawlers. There's no single tool that can help you with this, because how you upload the robots.txt file to your site depends on your site and server architecture. Get in touch with your hosting company or search its documentation; for example, search for "upload files hetzner".
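
If you manage the server yourself, the upload can be as simple as copying the file into the web server's document root, for example over SSH (the host name and path here are hypothetical; your server's document root may differ):

$ scp robots.txt user@example.com:/var/www/html/robots.txt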

4. Test robots.txt markup

To test whether your newly uploaded robots.txt file is publicly accessible, open a private browsing window in your browser and navigate to the location of the robots.txt file, for example, https://example.com/robots.txt. If you see the contents of your robots.txt file, you're ready to test the markup.
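
You can also check this from the command line with a tool like cURL (using the same placeholder domain); a 200 status code means the file is publicly reachable:

$ curl -I https://example.com/robots.txt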

Google offers two options for testing robots.txt markup:

  1. The robots.txt Tester in Search Console. You can only use this tool for robots.txt files that are already accessible on your site.
  2. If you're a developer, check out and build Google's open source robots.txt library, which is also used in Google Search. You can use this tool to test robots.txt files locally on your computer.

Submit the robots.txt file to Google

Once you've uploaded and tested your robots.txt file, Google's crawlers will automatically find and start using it. You don't have to do anything. If you've updated your robots.txt file and need to refresh Google's cached copy as soon as possible, see how to submit an updated robots.txt file in the sections below.

Useful robots.txt rules

Here are some common useful robots.txt rules:

Disallow crawling of the entire website

Keep in mind that in some situations URLs from the website may still be indexed, even if they haven't been crawled.

Note: This does not match the various AdsBot crawlers, which must be named explicitly.

User-agent: *
Disallow: /        

Disallow crawling of a directory and its contents

Append a forward slash to the directory name to disallow crawling of a whole directory.


User-agent: *
Disallow: /calendar/
Disallow: /junk/
Disallow: /books/fiction/contemporary/        

Allow access to a single crawler

Only Googlebot-News may crawl the whole site.


User-agent: Googlebot-News
Allow: /

User-agent: *
Disallow: /        

Allow access to all but a single crawler

Unnecessarybot may not crawl the site; all other bots may.


User-agent: Unnecessarybot
Disallow: /

User-agent: *
Allow: /        

Disallow crawling of a single web page

For example, disallow the useless_file.html page located at https://example.com/useless_file.html, and other_useless_file.html in the junk directory.
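
A matching rule set looks like this:

User-agent: *
Disallow: /useless_file.html
Disallow: /junk/other_useless_file.html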

Disallow crawling of the whole site except a subdirectory

Crawlers may only access the public subdirectory.


User-agent: *
Disallow: /
Allow: /public/        

Block a specific image from Google Images

For example, disallow the dogs.jpg image.


User-agent: Googlebot-Image
Disallow: /images/dogs.jpg        

Block all images on your site from Google Images

Google can't index images and videos without crawling them.


User-agent: Googlebot-Image
Disallow: /        

Disallow crawling of files of a specific file type

For example, disallow crawling of all .gif files.


User-agent: Googlebot
Disallow: /*.gif$        

Disallow crawling of an entire site, but allow Mediapartners-Google

This implementation hides your pages from search results, but the Mediapartners-Google web crawler can still analyze them to decide what ads to show visitors on your site.


User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /        

Use the * and $ wildcards to match URLs that end with a specific string

For example, disallow all .xls files.


User-agent: Googlebot
Disallow: /*.xls$        

Update your robots.txt file

To update the rules in your existing robots.txt file, download a copy of your robots.txt file from your site and make the necessary edits.

If you use a site hosting service, such as Wix or Blogger, you might not need to (or be able to) edit your robots.txt file directly. Instead, your provider might expose a search settings page or some other mechanism to tell search engines whether or not to crawl your page.

If you want to hide or unhide one of your pages from search engines, search for instructions about modifying your page visibility in search engines on your hosting service, for example, search for "wix hide page from search engines".

Download your robots.txt file

You can download your robots.txt file in various ways, for example:

• Navigate to your robots.txt file, for example https://example.com/robots.txt, and copy its contents into a new text file on your computer. Make sure you follow the guidelines related to the file format when creating the new local file.

• Download an actual copy of your robots.txt file with a tool like cURL. For example:


$ curl https://example.com/robots.txt -o robots.txt        

• Use the robots.txt Tester in Search Console to download a copy of your robots.txt file.

  1. Click Submit in the bottom-right corner of the robots.txt editor. This action opens up a Submit dialog.
  2. Download your robots.txt code from the robots.txt Tester page by clicking Download in the Submit dialog.

Edit your robots.txt file

Open the robots.txt file you downloaded from your site in a text editor and make the necessary edits to the rules. Make sure you use the correct syntax and that you save the file with UTF-8 encoding.

Upload your robots.txt file

Upload your new robots.txt file to the root of your domain as a text file named robots.txt. The way you upload a file to your site is highly platform and server dependent. Check out our tips for finding help with uploading a robots.txt file to your site.

If you do not have permission to upload files to the root of your domain, contact your domain manager to make changes.

Refresh Google's robots.txt cache

During the automatic crawling process, Google's crawlers notice changes you made to your robots.txt file and update the cached version every 24 hours. If you need to update the cache faster, use the Submit function of the robots.txt Tester.

  1. Click View uploaded version to see that your live robots.txt is the version that you want Google to crawl.
  2. Click Submit to notify Google that changes have been made to your robots.txt file and request that Google crawl it.
  3. Check that your newest version was successfully crawled by Google by refreshing the page in your browser to update the tool's editor and see your live robots.txt code. After you refresh the page, you can also click the dropdown to view the timestamp of when Google first saw the latest version of your robots.txt file.


To be continued...

Thank you for learning with us. This episode is based on the Fundamentals of SEO Starter Guide by Google. Remember, you can always follow all of the episodes of "Google SEO Guide" from the document below:


How Will We Do It?

  • We will learn SEO
  • We will learn about Digital Marketing
  • We will not give up
  • We will not stop till we get results

We won’t care what anyone says, and SEO & Digital Marketing will change our lives!!!

What Will We Learn?

Click the link below and get the #LearningSEOsocially Calendar Seasons 1 & 2. Follow this plan and learn with me how to develop #SEO skills with free guides, resources, tools, and more.


Follow Each Other to #LearnSEO Socially:

  • Twitter: https://twitter.com/search_doc/
  • LinkedIn: https://tr.linkedin.com/in/nazimkopuz

BONUS:

Don't forget to get the list of 29 Must-Have #SEO Skills, from basic to mastery.
