IndexNow and the future of web crawling
IndexNow is a new way to alert search engines when new or updated content is available. In this post, I describe how it’s different from web crawling or XML sitemaps and what it means for the future of web crawling.
What IndexNow is, how to use it, and what it means for the future of web crawling
In The Beginning of API Indexing, I explain that search engine crawling is insufficient, outdated, and wasteful. Instead of crawling sites, search engines should enable site owners to bring the content to them. The trend toward indexing APIs was driven more by Bing than by Google. Maybe Google doesn’t want to give up its monopoly on having the largest web index; maybe there are technical reasons. Either way, improved crawl efficiency is good for the whole web, not just search engines, because it leads to lower server loads and lower energy costs.
Either way, serving content to search engines via APIs comes with four basic benefits: less spam, because search engines can simply throttle API access for spammers; fewer or no rendering issues, because search engines can ask for the rendered HTML straight from the site; lower resource waste, because search engines don’t have to crawl the web anymore; and higher cost efficiency.
Now Bing, in collaboration with Yandex and other search engines, has launched IndexNow, an open protocol for pinging new content directly to search engines.
IndexNow vs. XML Sitemaps
IndexNow is not a full indexing API that delivers the whole HTML to search engines but rather an XML sitemap on steroids. According to the official documentation, IndexNow notifies search engines about new URLs, so they no longer have to crawl XML sitemaps, which are limited in size and freshness. You can still use both, though.
The documentation also states that if a URL changes multiple times a day, say on a news or weather site, IndexNow is not the optimal solution. However, search engines prioritize URLs submitted through IndexNow over URLs found another way. Submitted URLs don’t have to return a 200 status code: a 404, for example, notifies search engines that a page is no longer available, and a redirect gets crawled faster.
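As a minimal sketch, a single-URL ping is just a GET request with `url` and `key` parameters. The endpoint and parameter names follow the IndexNow documentation; the domain, key, and URLs below are made-up placeholders:

```python
from urllib.parse import urlencode

# api.indexnow.org forwards submissions to all participating search engines.
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"

def build_ping_url(url: str, key: str) -> str:
    """Build the GET request URL that notifies search engines about one URL."""
    return f"{INDEXNOW_ENDPOINT}?{urlencode({'url': url, 'key': key})}"

# A live page and a deleted page are submitted the same way --
# the 404 the crawler then sees tells search engines the page is gone.
print(build_ping_url("https://example.com/new-post", "abc123"))
print(build_ping_url("https://example.com/deleted-post", "abc123"))
```

Fetching the first URL with any HTTP client is all it takes to submit a page.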
URLs discovered through IndexNow count towards crawl budget (or crawl quota, as Bing calls it). It’s unclear how IndexNow changes crawl budget, but I would imagine that not having to discover URLs through links or XML sitemaps is much more efficient and should increase a site’s crawl budget.
How to use IndexNow
Using IndexNow is very straightforward:
Go to the key generator and generate a key to prove ownership of the site
Host the key in a text file in your root directory
Submit new URLs with parameters through a GET request
Monitor crawl rate and indexing through Bing Webmaster Tools
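The steps above can be sketched end to end in Python. The bulk-submission endpoint and JSON fields (`host`, `key`, `keyLocation`, `urlList`) come from the IndexNow documentation; the domain, key, and URLs are placeholders:

```python
import json
from urllib import request

KEY = "6c9bbe0d4f2a4d0e8a1b3c5d7e9f0a1b"        # placeholder; use the key generator
HOST = "example.com"
KEY_LOCATION = f"https://{HOST}/{KEY}.txt"      # key hosted as a text file in the root

def build_submission(urls: list[str]) -> request.Request:
    """Build the POST request that submits a batch of URLs for one host."""
    payload = {
        "host": HOST,
        "key": KEY,
        "keyLocation": KEY_LOCATION,
        "urlList": urls,
    }
    return request.Request(
        "https://api.indexnow.org/indexnow",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )

req = build_submission([
    f"https://{HOST}/new-post",
    f"https://{HOST}/updated-post",
])
# request.urlopen(req) would send it; crawl activity then shows up
# in Bing Webmaster Tools.
```

The single-URL GET ping works for one-off changes; the JSON POST shown here batches many URLs in one request, which is the more practical option for a CMS plugin.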
Each host (subdomain) needs its own key, and you can use different keys for different content management systems.
The role of CDNs in indexing the web
Many platforms plan to adopt IndexNow, but Cloudflare sticks out for a couple of reasons. First, CDNs have a good “view” of the web: because they proxy so many sites, they are best set up to track bot and human traffic. Even though 77% of websites don’t use CDNs, according to W3C, Cloudflare, as the market leader, has a decent grasp of when URLs change and can help search engines with change discovery.
Second, Cloudflare launched Crawler Hints, a product that helps with common indexing problems, and IndexNow plays right into it. It goes to show that the problem IndexNow tries to solve is a big one.
Cloudflare says about 45% of internet traffic comes from bots, including 5% from “good bots” like search engine crawlers. But 53% of that good-bot traffic is wasted on recrawling URLs that haven’t changed, crawling spam, or crawling other irrelevant content. That’s where Crawler Hints comes in.
From Cloudflare:
At Cloudflare, we see traffic from all the major search crawlers, and have spent the last year studying how often these bots revisit a page that hasn't changed since they last saw it. Every one of these visits is a waste. And, unfortunately, our observation suggests that 53% of this crawler traffic is wasted.
The position of CDNs in the web’s infrastructure and their wide-ranging overview of traffic activity make them an important partner for IndexNow and an interesting gateway into more efficient indexing. I expect more movement on that front in the near future.
The growing pains of crawling the web
IndexNow provides many benefits: it allows webmasters to notify all search engines at once, it democratizes indexing, and it reduces the resources search engines need to crawl the web.
Search engines have been struggling with crawling for a while. Challenges include spam and JavaScript rendering, but also the increased use of nofollow tags, which is one of the reasons Google started treating nofollow as a suggestion rather than a directive.
I don’t think web discovery through links is the best approach, and I expect more search engines to lean on webmasters to bring new content to them through APIs like IndexNow.