How to stop AI chatbots from scraping your website's content

As it stands, AI chatbots have a free license to scrape your website and use its content without your permission. Worried about your content being used by such tools?



The good news is that you can block AI tools from accessing your website, but there are some caveats. Here, we show you how to block bots using the robots.txt file for your website, as well as the pros and cons of doing so.



How do AI chatbots access your web content?

AI chatbots are trained using multiple datasets, some of which are open source and publicly available. For example, GPT-3 was trained on five datasets, according to a research paper published by OpenAI:

  1. Common Crawl (60% of training weight)
  2. WebText2 (22% of training weight)
  3. Books1 (8% of training weight)
  4. Books2 (8% of training weight)
  5. Wikipedia (3% of training weight)

Common Crawl includes petabytes (thousands of TB) of data from websites collected since 2008, similar to how Google’s search algorithm crawls web content. WebText2 is a dataset created by OpenAI, containing approximately 45 million web pages linked to Reddit posts with at least three upvotes.

So, in the case of ChatGPT, the AI bot isn't accessing and crawling your web pages directly, not yet, anyway. However, OpenAI's announcement of web browsing plugins for ChatGPT has raised concerns that this may be about to change.

In the meantime, website owners should keep an eye out for new AI chatbots as they hit the market. Bard is the other big name in the field, and very little is known about the datasets used to train it. Of course, we know that Google's search bots are constantly crawling web pages, but that doesn't necessarily mean Bard has access to the same data.

Why are some website owners concerned?

The biggest concern for website owners is that AI bots like ChatGPT, Bard, and Bing Chat devalue their content. AI bots use existing content to generate their responses while also reducing the need for users to visit the original source. Instead of visiting websites to access information, users can simply have Google or Bing generate a summary of the information they need.

When it comes to AI chatbots in search, the big concern for website owners is traffic loss. In Bard's case, the AI bot rarely includes citations in its responses to tell users which pages its information comes from.

So, in addition to replacing website visits with AI responses, Bard eliminates almost any chance of the originating website getting traffic, even if the user wants more information. Bing Chat, on the other hand, more commonly links to sources of information.

[Screenshot: a Bing Chat response showing citations for the source of the information]

In other words, the current fleet of generative AI tools uses the work of content creators to systematically replace the need for content creators. Ultimately, you have to ask what incentive that leaves website owners to keep publishing content. And, by extension, what happens to AI bots when websites stop serving the content they rely on to function?

How to block AI bots from your website

If you don’t want AI bots to use your web content, you can block them from accessing your site using the robots.txt file. Unfortunately, you have to block every single bot and specify it by name.

For example, the Common Crawl bot is called CCBot, and you can block it by adding the following code to your robots.txt file:

  User-agent: CCBot
  Disallow: /

This will prevent Common Crawl from crawling your website in the future, but it won’t remove any data already collected from previous crawls.
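
Once the rule is in place, you may want to confirm it parses the way you expect. Python's standard-library urllib.robotparser can fetch and evaluate a live robots.txt file. The sketch below is a minimal example; the example.com URLs are placeholders for your own domain.

  # Check how a robots.txt file treats given crawlers, using Python's
  # standard-library parser. Replace example.com with your own domain.
  from urllib.robotparser import RobotFileParser

  parser = RobotFileParser("https://example.com/robots.txt")
  parser.read()  # fetches and parses the live robots.txt

  # can_fetch() returns True if the named user agent may crawl the URL
  for agent in ("CCBot", "ChatGPT-User", "Googlebot"):
      allowed = parser.can_fetch(agent, "https://example.com/some-article")
      print(f"{agent}: {'allowed' if allowed else 'blocked'}")

If the CCBot rule above is in place, the script should report CCBot as blocked while leaving other crawlers unaffected.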

If you’re concerned about new ChatGPT plugins accessing your web content, OpenAI has already posted instructions for blocking its bot. In this case, the ChatGPT bot is called ChatGPT-User, and you can block it by adding the following code to your robots.txt file:

  User-agent: ChatGPT-User
  Disallow: /
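
Both rules can live in the same robots.txt file; each User-agent line starts a new group, and groups are separated by a blank line. Combining the two examples above gives:

  User-agent: CCBot
  Disallow: /

  User-agent: ChatGPT-User
  Disallow: /

Any crawler not named in the file falls back to the default behavior, so regular search engine bots remain unaffected.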

However, preventing search engine AI bots from crawling your content is another problem. Because Google is very secretive about the training data it uses, it’s impossible to identify which bots you’ll need to block and whether they’ll even respect commands in your robots.txt file (many crawlers don’t).

How effective is this method?

Blocking AI bots in your robots.txt file is the most effective method currently available, but it's not particularly reliable.

The first problem is that you have to specify every bot you want to block, and who can keep track of every AI bot that hits the market? The next problem is that the commands in your robots.txt file are voluntary instructions: while Common Crawl, ChatGPT, and other compliant bots respect them, many bots do not.
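
One way to find out whether a blocked crawler actually honors your rules is to look for its user-agent string in your server's access logs. Below is a minimal sketch; the log path and the combined log format (user agent as the last quoted field) are assumptions you should adjust to match your server.

  # Count requests from known AI crawler user agents in an access log.
  # The log path and format are assumptions; adjust them to your server.
  import re
  from collections import Counter

  AI_BOTS = ("CCBot", "ChatGPT-User")  # extend as new crawlers appear
  hits = Counter()

  with open("/var/log/nginx/access.log") as log:
      for line in log:
          # In the combined log format, the user agent is the last quoted field
          quoted = re.findall(r'"([^"]*)"', line)
          if quoted:
              for bot in AI_BOTS:
                  if bot in quoted[-1]:
                      hits[bot] += 1

  for bot, count in hits.items():
      print(f"{bot}: {count} requests despite Disallow rules")

If a bot keeps appearing in your logs after you have disallowed it, the robots.txt approach is not working for that crawler.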

The other big caveat is that you can only block AI bots from future crawls. You can't remove data from previous crawls or send requests to companies like OpenAI to erase all of your data.

Unfortunately, there is no easy way to block all AI bots from accessing your website, and manually blocking every single bot is next to impossible. Even if you keep up with the latest AI bots roaming the web, there’s no guarantee they’ll respect all commands in your robots.txt file.

The real question here is whether the results are worth the effort, and the short answer is (almost certainly) no.

There are also potential downsides to blocking AI bots from your website. Most importantly, you won’t be able to gather meaningful data to prove whether tools like Bard are benefiting or harming your search engine marketing strategy.

Yes, you can assume that a lack of citations is damaging, but you're only guessing if you lack the data because you've blocked AI bots from accessing your content. It was a similar story when Google first introduced featured snippets to Search.

[Screenshot: an example of a featured snippet in Google Search]

For relevant queries, Google displays a snippet of web page content on the results page, answering the user's question. This means users don't have to click through to a website to get the answer they're looking for. This caused panic among website owners and SEO experts who rely on traffic from search queries.

However, the types of queries that trigger featured snippets are usually low-value searches like "what is X" or "what's the weather like in New York." Anyone who wants in-depth information or a comprehensive weather report will still click through, and those who don't were never particularly valuable in the first place.

You might find it’s a similar story with generative AI tools, but you’ll need the data to prove it.

Don’t rush into anything

Website owners and publishers are understandably concerned about AI technology and frustrated by the idea of bots using their content to generate instant responses. However, this is not the time to rush into counter-offensive moves. AI is a rapidly evolving field, and things will continue to change quickly. Take the opportunity to see how things develop, and analyze the potential threats and opportunities AI presents.

The current system of relying on the work of content creators to replace them is not sustainable. Whether companies like Google and OpenAI change their approach or governments introduce new regulations, something has to give. At the same time, the negative implications of AI chatbots for content creation are becoming increasingly apparent, which website owners and content creators can use to their advantage.
