Blocking AI Bots From Scraping Websites Gets Boost From Cloudflare

Along with adding an “easy button” to stop AI developers from training their models on web content, the security firm shared its findings on AI bot activity.

Jul 5, 2024

Cloudflare, a global internet security firm that claims to protect nearly 20% of the world’s web traffic, has launched what it calls an "easy button" for website owners who want to block AI services from accessing their content. The move comes as demand for content used to train AI models has skyrocketed.

Cloudflare's core service, which serves as an internet proxy, scans and filters web traffic before it reaches websites. On average, the firm says its network sees over 57 million requests per second.

"To help preserve a safe internet for content creators, we've just launched a brand new 'easy button' to block all AI bots," Cloudflare said in its announcement on Wednesday. "We hear clearly that customers don't want AI bots visiting their websites, and especially those that do so dishonestly."

While some AI companies properly identify their web scraping bots and respect website instructions to stay away, not all of them are transparent about their activities.

The new simple setting is being made available to all Cloudflare customers, including those on its free tier.

Dissecting AI bot activity

Along with its announcement, Cloudflare shared a plethora of information about the AI crawler activity it observes across its systems.

According to Cloudflare's data, AI bots accessed around 39% of the top one million “internet properties” using Cloudflare in June. However, only 2.98% of these properties took measures to block or challenge those requests. Cloudflare also mentions that “the higher-ranked (more popular) an internet property is, the more likely it is to be targeted by AI bots.”

The firm said web crawlers operated by TikTok owner ByteDance, Amazon, Anthropic, and OpenAI were the most active. ByteDance's Bytespider led in the number of requests, the scope of its activity, and the frequency of being blocked. GPTBot, managed by OpenAI and used to collect training data for products like ChatGPT, ranked second in both crawling activity and blocks.

The web crawler for Perplexity, which has recently drawn controversy for its content crawling practices, was detected visiting a fraction of a percent of the sites Cloudflare protects.

While website owners can implement their own rules to block known web crawlers, Cloudflare said that most of its clients who do so block only the more mainstream AI developers like OpenAI, Google, or Meta, and not the top crawler from ByteDance or those of other companies.
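
As a rough illustration of that gap, here is a minimal Python sketch that assumes the blocking rules take the form of a robots.txt file, which the article does not specify. The file contents, the example.com URL, and the helper name allowed_agents are hypothetical; GPTBot and Bytespider are the crawler names mentioned above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that names only one well-known AI crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, agents, url="https://example.com/article"):
    """Report, per user agent, whether the rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(ROBOTS_TXT, ["GPTBot", "Bytespider"])
print(result)
```

Under these assumed rules, GPTBot is turned away while Bytespider falls through to the catch-all entry and is allowed. A robots.txt file is also only a request, so it affects just the crawlers that choose to honor it.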

AI versus AI

Cloudflare's report highlighted how some AI bot operators are resorting to deceptive tactics to sidestep measures to block them, attempting to pass off their crawler activity as legitimate web traffic.
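
To see why name-based blocking is easy to sidestep, consider a minimal Python sketch of a filter that matches on the user agent string a client sends. The blocklist, the function name, and both user agent strings are hypothetical and chosen only for illustration; this is not how Cloudflare's detection works.

```python
# Hypothetical blocklist of crawler name fragments, lowercased.
BLOCKED_TOKENS = ("gptbot", "bytespider")


def blocked_by_user_agent(user_agent: str) -> bool:
    """Naive filter: block only if the declared user agent names a known bot."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


# Illustrative strings: one crawler identifies itself, one poses as a browser.
honest = "Mozilla/5.0 (compatible; GPTBot/1.0)"
spoofed = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/125.0 Safari/537.36"
)

print(blocked_by_user_agent(honest))   # crawler that declares itself
print(blocked_by_user_agent(spoofed))  # crawler posing as a browser
```

The filter catches the crawler that identifies itself and misses the one that claims to be a browser, which is why a check based on the declared name alone is not enough.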

"Sadly, we've observed bot operators attempt to appear as though they are a real browser by using a spoofed user agent," Cloudflare wrote.

As it turns out, AI is a key tool in the company's arsenal to stop automated activity—whether from AI developers, search engines, or malicious attackers. Cloudflare said it uses a machine learning model to assign a “bot score” to each request made to a website protected by its services, with low scores indicating a low likelihood that the activity is legitimate.

With Cloudflare's massive dataset on global internet traffic, the model takes into account a number of signals, including the request's IP address, user agent, and behavior patterns, to determine the bot score.

To illustrate this, Cloudflare said it looked at traffic from a specific bot known for its evasive behavior. The results were telling: all detections were scored below 30 out of 100, with the vast majority falling into the bottom two bands, indicating a score of 9 or less. In other words, even with attempts to obscure its source, the bot's activity patterns gave it away—allowing Cloudflare to block it.
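
A score like this can be turned into a decision with a simple cutoff. The Python sketch below is an assumption-laden illustration: the function name is hypothetical, and the cutoff of 30 simply mirrors the figure quoted above rather than any blocking rule Cloudflare has published.

```python
def action_for_score(bot_score: int, block_below: int = 30) -> str:
    """Return 'block' for scores under the cutoff, otherwise 'allow'.

    The default cutoff of 30 is an assumption taken from the figure in
    the article, not a documented Cloudflare setting.
    """
    if not 0 <= bot_score <= 100:
        raise ValueError("bot score must be between 0 and 100")
    return "block" if bot_score < block_below else "allow"


# Made-up example scores, used only to show the cutoff in action.
decisions = {score: action_for_score(score) for score in (9, 29, 30, 85)}
print(decisions)
```

Because the score reflects behavior as well as the declared identity, a request with a convincing browser user agent can still land under the cutoff.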

Protecting web content

Generative AI models rely on vast volumes of existing content, much of it collected from across the web. For those models to keep providing current information, their developers must keep collecting it on a large scale.

Website owners and content creators are pushing back, with large publishers like news organizations taking legal action against AI companies. In the aforementioned case of Perplexity, publications like Forbes and Wired claim it is taking and republishing content without permission. Music publisher Sony preemptively warned over 700 tech firms to stay away in May, and this week Warner Music Group did the same.

The threat can be an existential one for publishers, should AI increasingly provide information to users without referring them to the source. A recent study published by SparkToro’s CEO Rand Fishkin suggested that 60% of people searching for information on Google stopped visiting the websites offering it because Google’s AI provided summarized answers immediately.

Edited by Ryan Ozawa.
