How to block OpenAI’s new AI-training web crawler from ingesting your data

A man is seen using the OpenAI ChatGPT artificial intelligence chat website in this illustration photo on 18 July, 2023. (Photo by Jaap Arriens/NurPhoto via Getty Images)

ChatGPT creator OpenAI has released a new web crawler — called GPTBot — along with directions on how to block it.

ChatGPT is one of the most capable AI systems ever built, despite recent reports of its wavering intelligence. OpenAI, the company behind the AI chatbot, continues to train its large language models (LLMs), like GPT-3.5 and GPT-4.

Web crawlers, used by search engines like Google and Bing to scan websites and index content, are also used by AI companies to gather training data for LLMs. These models learn from the content of websites and any other data their developers choose to train them on. Using a web crawler expedites this process by making massive amounts of data available for training.

“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” OpenAI notes in its GPTBot documentation. The company claims it is filtering out web pages that require paywall access, are known to gather personally identifiable information, or contain text that violates OpenAI’s policies.

Developers have the option of blocking the GPTBot from accessing their sites and using their information to train AI systems.

OpenAI explains how to disallow or customize GPTBot access to your site.

Screenshot: OpenAI | Image Composition: Maria Diaz/ZDNET

To block GPTBot from accessing a site altogether, the site owner can add a “User-agent: GPTBot” line to the site’s robots.txt file, followed by “Disallow: /”.
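
For site owners who want to check the rule before publishing it, here is a minimal sketch that uses Python's standard-library robots.txt parser. The helper name `is_allowed`, the second bot name, and the example.com URLs are placeholders chosen for illustration. The parser only approximates how a crawler reads the file, and GPTBot's own interpretation may differ.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content that blocks GPTBot from the entire site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text.

    Hypothetical helper for illustration; not part of OpenAI's documentation.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain.
    page = "https://example.com/any/page.html"
    print(is_allowed(ROBOTS_TXT, "GPTBot", page))        # False
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", page))  # True
```

Because the file names only GPTBot, crawlers with other user-agent tokens are unaffected by this rule.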

OpenAI also lets site owners customize GPTBot’s access so that it crawls only certain parts of a site. To do so, add a “User-agent: GPTBot” line to the site’s robots.txt, followed by “Allow: /directory-1/” and “Disallow: /directory-2/”, substituting the directories you want to permit and block.
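
The partial-access rules can be checked the same way. The sketch below assumes Python's standard-library parser and uses the placeholder directory names from OpenAI's example. The helper `gptbot_access` and the example.com base URL are hypothetical, and a real crawler may resolve rules differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content that lets GPTBot into one directory and blocks another.
PARTIAL_ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def gptbot_access(robots_txt, paths, base="https://example.com"):
    """Map each path to True/False depending on whether GPTBot may fetch it.

    Hypothetical helper; `base` is a placeholder domain.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch("GPTBot", base + path) for path in paths}


if __name__ == "__main__":
    report = gptbot_access(
        PARTIAL_ROBOTS_TXT,
        ["/directory-1/a.html", "/directory-2/b.html", "/other/c.html"],
    )
    for path, allowed in report.items():
        print(path, "allowed" if allowed else "blocked")
```

In this parser, a path that matches neither rule (such as `/other/c.html` here) remains allowed, so any directory that should be off limits needs its own “Disallow” line.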

OpenAI had not previously announced the use of web crawlers to train GPT-3.5, the LLM behind the free version of ChatGPT, or GPT-4, its newest LLM, which is available to ChatGPT Plus subscribers and also powers Bing AI.

Though it’s unclear whether GPTBot was used to train OpenAI’s currently available LLMs, it could be the web crawler gathering data to train GPT-5, especially as the company filed to trademark the name in July. While OpenAI has not announced a release date for GPT-5, the new LLM is expected to be larger and more powerful than GPT-4, which is currently considered one of the largest LLMs available.

Since the launch of ChatGPT, OpenAI has been hit with several lawsuits alleging that the AI tool is stealing data from users, including a copyright infringement case that made the company the target of an FTC investigation. Websites like Stack Overflow, Reddit, and Twitter have said they plan to begin charging AI companies to access their data.