OpenAI outlines details of GPTBot web crawler - but do you REALLY want to BLOCK it? Daily Dose of Digital - 07/08/23
Daily Dose of Digital - OpenAI GPTBot News


OpenAI, the team behind the dominant generative AI tool ChatGPT, has recently published the details of its latest web crawler, aptly named GPTBot. This technology promises to change the way websites are crawled and offer new insights into site activity. The main noise around the digital community today, though, is that this information means website owners can track GPTBot's crawls, set access permissions, and disallow access to specific parts of their site, or to the whole site, using the robots.txt protocol. I would ask, though: with the future of SEO and AI converging, would you really want to block access in all cases?

Understanding GPTBot's User Agent Token:

To identify GPTBot, you can look for its unique user agent token: "GPTBot" (without any trailing full stop).

The full user-agent string is "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)".

This user-agent string will appear in the logs when GPTBot accesses your website.
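
If you want a quick way to spot those visits, here is a minimal Python sketch that checks a user-agent string (or a whole log line) for the GPTBot token. The function names and the sample log lines are my own illustrative assumptions, not anything OpenAI publishes, and a user agent can be spoofed, so treat a match as a hint rather than proof.

```python
# Token taken from the article; everything else here is a hypothetical helper.
GPTBOT_TOKEN = "GPTBot"


def is_gptbot(text: str) -> bool:
    """Return True if the text (a user-agent string or log line) mentions GPTBot."""
    return GPTBOT_TOKEN.lower() in text.lower()


def gptbot_hits(log_lines):
    """Keep only the log lines whose text contains the GPTBot token."""
    return [line for line in log_lines if is_gptbot(line)]


if __name__ == "__main__":
    # Made-up access log lines, purely for illustration.
    sample = [
        '203.0.113.5 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 (Windows NT 10.0)"',
        '40.83.2.70 - - "GET /blog HTTP/1.1" 200 "Mozilla/5.0 AppleWebKit/537.36 '
        '(KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    ]
    for line in gptbot_hits(sample):
        print(line)
```

The match is case-insensitive on purpose, since log formats and proxies do not always preserve the original casing.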

Stay Updated with GPTBot's IP Range:

Currently, GPTBot operates within the IP range 40.83.2.64/28. It's essential to note that this range might change in the future. To ensure you have the most up-to-date information, regularly check OpenAI's files for any updates.
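
To cross-check a visitor's IP address against that range, something like the sketch below would do, using Python's standard ipaddress module. A /28 covers 16 addresses, here 40.83.2.64 through 40.83.2.79. The function name is a hypothetical one of mine, and the hard-coded range is simply the one quoted above, so it will go stale if OpenAI changes it.

```python
import ipaddress

# Range quoted in the article; it may change, so check OpenAI's published list.
GPTBOT_NETWORK = ipaddress.ip_network("40.83.2.64/28")


def in_gptbot_range(ip: str) -> bool:
    """Return True if the address falls inside the quoted GPTBot range.

    Raises ValueError if the string is not a valid IP address.
    """
    return ipaddress.ip_address(ip) in GPTBOT_NETWORK
```

Combining this with the user-agent check gives a somewhat stronger signal than either test alone, because the user agent on its own is easy to fake.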

Granting or Disallowing Access:

Website owners have the flexibility to decide whether to allow or disallow GPTBot from accessing their site, just like they would with any other crawler. OpenAI provides documentation for GPTBot, which enables users to configure access permissions using the robots.txt protocol effectively.

GPTBot Disallow & Customisation from OpenAI's site
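
For anyone who wants to test rules before publishing them, here is a small sketch using Python's built-in urllib.robotparser. The two rule sets follow the patterns in OpenAI's GPTBot documentation (block everything, or allow one directory and disallow another); the directory names and the example.com URLs are placeholders, and the helper function is my own assumption. Python's parser is only an approximation of how any given crawler reads robots.txt, so treat the result as a sanity check.

```python
from urllib.robotparser import RobotFileParser

# Block GPTBot from the whole site.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Let GPTBot into one directory and keep it out of another (placeholder paths).
PARTIAL = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def gptbot_allowed(robots_txt: str, url: str) -> bool:
    """Parse robots.txt text and report whether GPTBot may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Pass the short token, not the full user-agent string: the parser
    # only looks at the part of the agent name before the first "/".
    return parser.can_fetch("GPTBot", url)
```

With these rules, gptbot_allowed(PARTIAL, "https://example.com/directory-1/post") comes back True, while the same call against /directory-2/ or against BLOCK_ALL comes back False. Other crawlers are unaffected, because the rules only name GPTBot.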

I appreciate there are tonnes of use cases where blocking this would be the safest and most secure approach. However, for those of you concerned about attribution and security, I would suggest that there are still some benefits to your content being used by ChatGPT and other ML models, and there may be even more in the future as the scene evolves.

These platforms can reference sources and cite your website as the original location of the information (in fact, some, like Perplexity, do this very well). So I would argue that rather than blocking the entire site from being crawled, we should be encouraging users and content creators to take more responsibility for citing references; perhaps we could be educating prompt writers to ask ChatGPT for references where required. So my view: blocking GPTBot outright might not be the savviest move by default.

GPTBot - generated on MidJourney

The Purpose Behind GPTBot's Crawling:

GPTBot's primary purpose is to crawl web pages, and the data it collects may be used to improve future AI models. However, OpenAI takes data privacy and ethics seriously. To protect users, sources that require paywall access, are known to collect personally identifiable information (PII), or violate OpenAI's policies are filtered out during the crawling process.

Enhancing AI Models and Safety:

Allowing GPTBot access to your website can play a vital role in improving the accuracy and overall capabilities of AI models. By gathering diverse data from across the web, AI models can be trained to offer more precise and reliable responses, benefiting users worldwide. OpenAI remains dedicated to advancing AI safely and responsibly.

Webmaster Concerns Addressed:

Recently, some webmasters raised concerns about GPTBot's activity on their sites. OpenAI promptly addressed these issues, emphasising that website owners can control GPTBot's access by configuring the robots.txt file. This ensures that only desired areas of the website are accessible to the web crawler.

Moving Beyond ChatGPT Plugins:

With the introduction of GPTBot, OpenAI has expanded its web crawling capabilities beyond ChatGPT plugins. This latest innovation allows for more comprehensive exploration of the web, propelling AI technology forwards.

In Summary:

OpenAI's GPTBot is a "game-changing web crawler that promises to enhance AI capabilities and revolutionise web browsing experiences". Website owners now have the power to manage GPTBot's access, ensuring the protection of sensitive data while contributing to the advancement of AI. As OpenAI continues to refine and evolve GPTBot, the future of AI-driven web crawling holds significant potential.
