SEOs are underestimating the Yandex leak
Many SEOs dismiss the Yandex source code leak and, in doing so, miss a chance to learn something new.
In late January, around 45 GB of Yandex source code was leaked, including lists of ranking factors and their coefficients (weights). Since then, some members of the SEO scene have worked hard on decoding the material.
However, many SEOs have also publicly dismissed the value of the Yandex leak. The most common arguments are:
Yandex is not Google
We don’t know if the Yandex leak is real
Don’t obsess over ranking factors
Yandex scraped Google; it's just a copy
The leak is just a tiny window into how Yandex ranks websites
"What would you change about the way you optimize anyway?"
The code repo is outdated
I was very surprised by how many SEOs dismissed the value of the documents. I've been making a living with SEO for 13 years, and I've never encountered better insights into how modern search engines work. What better model do we have?
My take: This reaction is mostly based on fear of being wrong, losing work, and having less room for interpretation. SEO being such a black box has many benefits, but it also has costs. The timing is also interesting: Google’s Q4 earnings were weak, and ChatGPT is disrupting the search ecosystem. Many fear that SEO could go away or evolve from its current form and leave less room for competition.
The most common objections to the Yandex leak
I want to respectfully object to some of the most common arguments I hear because I think they're holding us back:
"Yandex is not Google." When you compare a few search results, you realize that the overlap is small. Yandex has some overlap with similar results in the top 10 (example in the table below) but not for exact positions.
You could argue about which results are "better" and what "better" even means in the context of search. Is Google the dominant search engine because they deliver the best results? Would we be happy with Yandex if Google and Bing didn't exist?
And yet, Yandex is no hobby project but a public company that made almost $5B in revenue in 2021. It's also a search engine that's censored by the Russian regime and ranks conspiracy content. But in general, it's not as far away from Google as many people make it seem.
Compare the search results between Google and Bing. Most people would probably consider Bing closer to Google than Yandex is. And yet, the overlap between Bing and Google is just as small (see table below).
"We don’t know if the Yandex leak is real." Yandex officially confirmed the leak (source).
"The code repo is outdated." The leaked files date back to February 2022, so the code repository is not very outdated. It's no accident that former employees leaked the code during that time since that's when Russia started the war against Ukraine. Realistically, the former Yandex employee(s) leaked the code to shine a light on Yandex's widespread censorship and misinformation.
Of course, Yandex has an interest in making it seem like what leaked to the public is mostly not in use anymore to minimize security risks. But developer Arseniy Shestakov "verified that at least some of archives for sure contain modern source code for company services as well as documentation pointing to real intranet URLs." (source)
"The leak is just a tiny window into how Yandex ranks websites." Some people argue only one Yandex code repository has been leaked since not all source code lives in a single repository. However, even Google has most code in one single repository:
Much like the code that underpins Windows, the 2 billion lines that drive Google are one thing. They drive Google Search, Google Maps, Google Docs, Google+, Google Calendar, Gmail, YouTube, and every other Google Internet service, and yet, all 2 billion lines sit in a single code repository available to all 25,000 Google engineers. (source)
"Yandex scraped Google; it's just a copy." Some of the source code shows that Yandex crawled Google, but there is no evidence (so far) that Yandex used the data to rank search results. It's likely that Yandex crawled Google to compare results.
"Don’t obsess over ranking factors." There is a big difference between obsession and curiosity. Just because you're trying to figure out what works in SEO doesn't mean you're obsessing over ranking factors. "Just create good content", "focus on the user experience" or "Google will figure it out" are borderline naive simplifications. You might as well say, "just focus on not running into other cars" when learning to drive.
The most successful SEOs have always had a near-pathological curiosity about how things work. "Don't chase the algo" is another saying used to dispel the myth that we could reverse engineer which signals Google rewarded or punished with an algorithm update. But not even trying to learn more from updates is equally ignorant.
"What would you change about the way you optimize anyway?" Ah, one of the better reactions! "We've known these things for a long time" is what I see a lot of SEOs claiming when they read about the Yandex source code. But that's not true. Knowing and assuming are two pairs of shoes. Our SEO knowledge comes from experience, anecdotes, experiments, ranking factor studies and a few signals Google has officially confirmed. We've never seen these signals in the source code of a modern search engine like Yandex before.
The ultimate piece of evidence would be Google confirming the use of signals found in Yandex's source code. But the fact that Yandex uses many ranking factors we long assumed to exist shows it's not unrealistic that they actually work.
SEO highlights of the Yandex source code
Shoutout to Martin MacDonald, Mike King, Alex Buraks and Dan Taylor for doing the work and sharing useful insights from the Yandex code base leak.
Malte Landwehr found that 19% of the Yandex ranking factors focused on user signals, 6% on links, and 6% on content relevance. Remember when Semrush published a ranking factor study that showed traffic to a site had the highest rank coefficient, and the study was bashed by the SEO community? I hope next time we're more curious.
Back when I worked at Searchmetrics, we found that ranking factors became increasingly category-specific, and today, I'd argue they're query-specific. The Yandex source code includes static, dynamic (query-specific), and user-related ranking factors, some binary and some gradient (continuous): static factors describe the website, dynamic factors the query and document, and user factors the user's location, language, search history, etc.
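To make that distinction concrete, here's a minimal sketch in Python of how static, dynamic and user factors could feed into one relevance score. Every factor name, value and weight below is invented for illustration and is not taken from the leaked code:

```python
# Illustrative only: factor names, values and weights are invented,
# not taken from the leaked Yandex code.

STATIC_FACTORS = {"host_age_days": 2400, "is_https": 1}      # about the website itself
DYNAMIC_FACTORS = {"query_in_title": 1, "bm25_body": 7.3}    # about the query/document pair
USER_FACTORS = {"same_region": 1, "language_match": 1}       # about the searcher

WEIGHTS = {
    "host_age_days": 0.0001,
    "is_https": 0.02,
    "query_in_title": 0.15,
    "bm25_body": 0.05,
    "same_region": 0.08,
    "language_match": 0.06,
}

def score(*factor_groups: dict) -> float:
    """Combine all factor groups into a single weighted relevance score."""
    return sum(
        WEIGHTS[name] * value
        for group in factor_groups
        for name, value in group.items()
    )

print(round(score(STATIC_FACTORS, DYNAMIC_FACTORS, USER_FACTORS), 3))
```

In reality, the combination isn't a hand-weighted sum but is learned by a model (MatrixNet, later CatBoost, in Yandex's case); the three classes of inputs stay the same, though.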
So far, almost 18,000 ranking factors across different modalities have been found in the source code, but it seems only around 1,900 are not deprecated. Just as humans are bad at grasping the impact of compound interest, the outcome of an algorithm with this many factors is incredibly hard to estimate. Add the mix of binary and gradient ranking factors mentioned above, and reverse engineering becomes practically impossible. The fact that so many parts of a website and the broader web ecosystem influence organic rankings is astonishing, but it's also encouraging for SEOs because it means there is room to compete.
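A toy simulation makes the point. Only the rough count of ~1,900 active factors is borrowed from the leak coverage; every weight and value below is random:

```python
import random

# Toy simulation: ~1,900 active factors with small random weights.
# Nothing here reflects the real Yandex weights, only the rough factor count.
random.seed(42)
n_factors = 1900
weights = [random.uniform(-0.05, 0.05) for _ in range(n_factors)]
values = [random.random() for _ in range(n_factors)]

values[0] = 0.0                                   # the one binary factor we "optimize"
baseline = sum(w * v for w, v in zip(weights, values))

values[0] = 1.0                                   # flip it on
optimized = sum(w * v for w, v in zip(weights, values))

print(f"baseline score:   {baseline:.3f}")
print(f"after the change: {optimized:.3f}")
# The delta from any single low-weight factor drowns in the noise of the other
# 1,899, which is why isolating it from the outside (rankings alone) is so hard.
```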
Yandex seems to follow similar information retrieval best practices as Google, like the inverted index or embeddings. Yandex also built its own machine learning models, like the gradient boosting system MatrixNet, which was used to determine rank coefficients before it was replaced by CatBoost in 2017. Knowing where and how MatrixNet was used gives an idea of how modern search engines adjust and fine-tune ranking factors.
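For anyone who hasn't built one, an inverted index is simply a map from each term to the documents that contain it. A minimal sketch, not Yandex's implementation:

```python
from collections import defaultdict

docs = {
    1: "yandex leaked source code",
    2: "google ranking factors leaked",
    3: "ranking factors in the yandex source code",
}

# Build the inverted index: each term maps to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Candidate retrieval for a query is then a set intersection over its terms;
# ranking models (MatrixNet/CatBoost, embeddings) only score these candidates.
def candidates(query: str) -> set:
    postings = [index[term] for term in query.split() if term in index]
    return set.intersection(*postings) if postings else set()

print(candidates("yandex ranking factors"))  # {3}
```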
A suggestion for how to think about the Yandex leak
If researchers had the complete DNA sequence of cancer in mice, would they dismiss it because mice are different from humans? No! So, why do we dismiss the Yandex leak so quickly?
The best way to go about this, in my mind, is to use the Yandex ranking factors as a basis for SEO tests. It's hard to isolate single factors in most cases, especially when they have a low coefficient, but at scale, we might gain valuable insights. We can also get a better understanding of what metrics to measure. For example, I never look at link age when analyzing a backlink profile, but I'll certainly do so now and watch out for patterns.
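As a concrete starting point, here's a rough sketch of how I'd pull link age out of a standard backlink export. The file name and column names ("first_seen", "referring_domain") are assumptions you'd adapt to your tool; Ahrefs, Semrush and Majestic all label these fields slightly differently:

```python
import pandas as pd

# Hypothetical backlink export: adjust the file name and column names
# to whatever your backlink tool actually produces.
backlinks = pd.read_csv("backlinks.csv", parse_dates=["first_seen"])

backlinks["link_age_days"] = (pd.Timestamp.today() - backlinks["first_seen"]).dt.days

# Patterns to watch: the overall age distribution and the median age per domain.
print(backlinks["link_age_days"].describe())
print(
    backlinks.groupby("referring_domain")["link_age_days"]
    .median()
    .sort_values(ascending=False)
    .head(10)
)
```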
Imagine we got Google’s ranking factors, and then the search experience changed to a ChatGPT-like model. Wouldn’t we still want to understand what the winning formula was all these years?