The end of crawling and the beginning of API indexing

In the future, search engines might not come out to get content. Webmasters might bring it to them with API indexing.

Jul 28, 2020

Google became the most successful startup in history by crawling the web, building an index of web pages, and ranking them based on popularity. Now I see signs for a potential paradigm shift from crawling to indexing APIs. In the future, search engines might not come out to get content. Webmasters might bring it to them.

tl;dr
The exploding growth of the web, recent problems Google had with indexing, and Bing's and Google's development in the space lead me to think that the crawling the web might eventually be replaced by indexing APIs.

To prepare, webmasters should:

Sign up for Bing's indexing API.
Try submitting regular pages via Google's job indexing API.
Try out RankMath's WordPress plugin.
Play with David Sottimano's Node JS template.

The goal of search engines, at least for Google, is to "organize the world's information and make it universally accessible and useful". But the web is exploding in size. Google discovers hundreds of billions of web pages, most of which are spam. It's also tricky to adapt to all the different coding frameworks that are being used and lastly, crawling the web is not cheap.

Google needs to keep the index as small as possible while making sure it includes only the best results. Think about it. Having a huge index is just a vanity goal. The quality of indexed results is what matters. Anything else is inefficient.

That's why Google doesn't like spending much time crawling and rendering low-quality sites. They just fill the index with trash. That's why pruning works well. It's also the basis for the idea of crawl budget.

4 good reasons to use indexing APIs over crawling

Crawling is an essential part of information retrieval and the success of search engines. Why would they change their ways?

Less spam

Spam was a problem for search engines from the start. As I wrote in the problem with spam and search, spam can be a lethal threat for Google because it wastes so many crawl resources, provides a bad experience for searchers, and spammers are getting more sophisticated. Google's algorithms need to keep up.

Link spam was always one of the biggest issues. Now that Google is becoming better at understanding semantics and user-satisfaction, they rely less and less on links for ranking. However, they do still rely very much on links for indexing.

Indexing APIs could solve a big part of the spam issue because they create a bottleneck. Indexing is more controllable. And which spammer in their right mind would submit spam straight to Google? That's like a thief trying to steal from the police.

Search engines could use certain signals to decide which content to accept and which sources to throttle to prevent API spam, such as:

Verified age verification of the account
Site impressions
Quality of submitted content

Fewer rendering issues

One archnemesis of search engines is rendering Javascript. A neat experiment by Bartosz from Onely shows that Google cannot render all frameworks properly and big sites still fail at getting it right. To be fair, Google has made big steps forward on this issue but it's still not solved.

Indexing APIs could be a solution because they provide webmasters the opportunity to submit the fully rendered HTML. Search engines wouldn't have to worry about rendering as much.

This could open a vulnerability for cloaking but in the end, it's the same challenge Google faces with dynamic rendering today. Google seems to be able to solve it.

Resource sparing

Several factors decide how often and what Google crawls, for example, the popularity of a URL and how often it changes (source).

As the web grows to many billions of documents and search engine indexes scale similarly, the cost of re-crawling every document all the time becomes increasingly high.https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/34570.pdf

But since 2019, we've seen reoccurring problems with indexing bugs:

Part of the challenge is Google's transition to mobile-first indexing. Even though Google confirmed that there is one single index, the needed crawl resources must have grown significantly because Google needs to assess both versions (desktop and mobile) of a site.

Indexing APIs would be much more resource-sparing. Google wouldn't have to ping servers, figure out the canonical state of a URL, or follow robots.txt directives. Schedulers wouldn't have to figure out how often to come back to crawl a URL. They just render, index, and rank the content that webmasters want to be indexed.

More cost-efficient

In 2011, Jeff Dean gave a presentation you should watch. He explains the complicated architecture of Google’s indexing services and the challenges of storing the web.

Jeff mentions some factors that impact a search engine’s index size:

# of documents
Queries processed per day
Metadata (additional data about documents)
Update frequency
Average crawl time per document

Instead, I looked at Google's data center CapEx (capital expenditures), meaning how much Google reports to spend on data centers. This gives us a rough estimate on cost for the whole process of building and filing an index, rendering pages, ranking them, and everything that involves images, maps, and other search verticals.

In 2018, Google spent $9b on data centers with a total CapEx of $25.5b (source), so roughly 35% go into data centers. They planned to spend $13b in 2019 but in the first quarter, CapEx turned out to be $21.8b, which means data center cost were around $7.6b - already half of what was planned. To be clear, I'm not sure how much of those $7.6b went into data centers for Google Cloud versus Google Search.

When COVID hit the world, Google announced to slow hiring and spend less on data centers. The announcement came with a little detail that says a lot. Pichai, Alphabet's CEO, said:

"We are reevaluating the pace of our investment plans for the remainder of 2020. Beyond hiring, we continue to invest, but will be recalibrating the focus and pace of our investments in areas like data centers and machines, and non business essential marketing and travel.”

Recalibrating investments in data centers? That’s core infrastructure!

Here’s what I think: COVID provided a good opportunity to announce actions that had to be taken either way. I think data center cost turned out to be closer to $30b in 2019 than the projected $13b. If that holds true, data center cost made up 18% of total revenue (30b / 161b).

Google’s total revenue in 2019 was $161b. Gross profit was $89b, which is only a 16% increase YoY compared to the annual increase of 18% the previous year. Google needs to keep its profitability rate up and saving money by replacing web crawling with indexing APIs would be one way to do that.

What SEOs can do today

The big question is "how can we prepare?" I got 4 recommendations for you.

1. Try out Bing's indexing API

"We believe that enabling this change will trigger a fundamental shift in the way that search engines, such as Bing, retrieve and are notified of new and updated content across the web. Instead of Bing monitoring often RSS and similar feeds or frequently crawling websites to check for new pages, discover content changes and/or new outbound links, websites will notify the Bing directly about relevant URLs changing on their website. This means that eventually search engines can reduce crawling frequency of sites to detect changes and refresh the indexed content."https://blogs.bing.com/webmaster/january-2019/bingbot-Series-Get-your-content-indexed-fast-by-now-submitting-up-to-10,000-URLs-per-day-to-Bing

Bing is a bit ahead of the curve on this one: they launched an indexing API already in March 2019, called the "content submission API", with a 10,000 URL starter limit. You can expand that quota a lot but need permission on an individual basis.

Bing offers two versions of the indexing API: Adaptive URL submission and Batch Adaptive URL submission. The latter allows you to submit URLs in batches.

If you use Botify, you can leverage their partnership with Bing. The platform will submit the content for you, even beyond the 10K URL limit.

Or, as Wordpress user, you should try out Bing's Wordpress plugin for the content submission API. The company also partners with Yoast, which automatically sends your content to Bing.

Interesting: Bing clearly states that you can shoot more URLs through the API the longer your site is verified.

"The daily quota per site will be determined based on the site verified age in Bing Webmaster tool, site impressions and other signals that are available to Bing”https://blogs.bing.com/webmaster/january-2019/bingbot-Series-Get-your-content-indexed-fast-by-now-submitting-up-to-10,000-URLs-per-day-to-Bing

2. Play around with Google's (limited) indexing API

Google limits its indexing API to jobs and streaming events at the moment but specifies that the limitation is momentarily:

"Currently, the Indexing API can only be used to crawl pages with either JobPosting or BroadcastEvent embedded in a VideoObject.”https://developers.google.com/search/apis/indexing-api/v3/quickstart

Google's documentation also makes the difference between XML sitemaps and the indexing API clear: "We recommend using the Indexing API instead of sitemaps because the Indexing API prompts Googlebot to crawl your pages sooner than updating the sitemap and pinging Google."

XML sitemaps tell a search engine that something on a URL has changed but Google still has to visit the URL and crawl it. Content submission APIs (maybe a better term for indexing APIs) allow you to send the full content to search engines.

This touches on an interesting point, which is the partnership between Wordpress and Google. The partnership provides many interesting points and is a good basis for another article but for indexing APIs, it could be a major lever: Wordpress hosts ~37% of the web (source). For Google, it's a great bottleneck to plug into and cover a huge part of the web at once. I could see partnerships with hosters and domain registrars follow.

4. David Sottimano's Node JS template

David Sottimano wrote a cool Node JS template for Google's indexing API.

Google’s new Indexing API support page say “it can only be used to crawl pages with either job posting or livestream structured data”, but of course I was curious and it turns out that we can get regular pages crawled as well, and damn fast.

Crawling the web is not sustainable

XML sitemaps were the first steps toward a less crawl-dependent indexing process. Indexing APIs are the next step. XML sitemaps only tell search engines when a URL changed but not what changed. Indexing APIs take it to the next level by submitting the whole content to search engines.

I can't see search engines would stop crawling completely but reduce it down to a minimum. Google has always been good at incentivizing webmasters to follow their demands, simply by having the finger on the traffic floodgates.