A massive leak of Google Search documents has revealed the ranking algorithm's inner workings.
Thanks to a cache of leaked Google papers, we now have an unrivaled view into Google Search, highlighting some of the key components Google considers when determining content ranking.
What happened? On March 13, an automated bot known as Yoshi-code-bot published some 2,500 documents on GitHub. The documents appear to originate from Google's internal Content API Warehouse. Rand Fishkin, co-founder of SparkToro, was later given access to these records.
Why it matters to us. For SEOs who know how to interpret it, this glimpse into how Google's ranking system may work is priceless. One of the biggest stories of 2023 was the unprecedented look at Yandex Search ranking factors that a leak gave us.
This Google document leak? It will probably be one of the most significant stories in the history of Google Search and SEO.
What’s inside?
Here’s what we know about the internal documents:
Links matter. Shocking, I know. Link diversity and relevance remain key, the documents show. And PageRank is still very much alive within Google’s ranking features. PageRank for a website’s homepage is considered for every document.
Successful clicks matter. This should not be a shocker, but according to the documents, if you want to rank well, you need to keep creating great content and user experiences. Google uses a variety of measurements, including badClicks, goodClicks, lastLongestClicks, and unsquashedClicks.
Also, longer documents may get truncated, while shorter content gets a score (from 0-512) based on originality. Scores are also given to Your Money Your Life content, like health and news.
What does it all mean?
Brand matters. Fishkin’s big takeaway? Brand matters more than anything else.
Entities matter. Authorship lives. Google stores author information associated with the content and tries to determine whether an entity is the author of the document.
SiteAuthority: Google uses something called “siteAuthority”.
Chrome Data. A module called ChromeInTotal indicates that Google uses data from its Chrome browser for ranking.
Whitelists. A couple of modules indicate Google whitelists certain domains related to elections and COVID-19 – isElectionAuthority and isCovidLocalAuthority. Though we’ve long known Google (and Bing) have “exception lists” when “specific algorithms inadvertently impact websites.”
Small sites. Another feature is smallPersonalSite, for a small personal site or blog. Mike King of iPullRank speculated that Google could boost or demote such sites via a Twiddler. However, that remains an open question. Again, we don’t know for certain how much these features are weighted.
Other interesting findings. According to Google’s internal documents:
The source. Erfan Azimi, CEO and director of SEO for digital marketing agency EA Eagle Digital, posted a video claiming responsibility for sharing the documents with Fishkin. Azimi is not employed by Google.
What’s in the docs?
Here, you’ll find more than 2,500 pages of API documentation containing 14,014 attributes (API features) that appear to come from Google’s internal Content API Warehouse. Many of these attributes play an important role in Google’s ranking process.
However, this documentation doesn’t show the weight of particular elements in the search ranking algorithm. It also doesn’t indicate which elements are actually used in the ranking systems. But it does show incredible detail about the data Google collects.
Here’s an example of the document format:
Google myths revealed
To minimize manipulation of search results, the Google team has closely guarded the details of how their algorithms work and what truly influences rankings.
And now, thanks to the leaked information, we can compare those details with what Google has said publicly. Many claims that Google representatives once made about various aspects of search engine optimization have turned out to be untrue. Much of the leaked data directly contradicts Google’s official and public statements.
Let’s take a look at some of the most popular myths debunked by the leaked documentation.
Domain Authority
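Google spokespeople have said numerous times that Google doesn’t use domain authority to rank pages. Yet the leaked documentation refers to a feature called siteAuthority, although it doesn’t show how, or how heavily, that feature is used.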
Sandbox
Google has repeatedly claimed that there is no such thing as a “sandbox” for new sites, meaning their age does not affect their ranking. John Mueller stated this in 2019.
He also said in 2017 that domain age does not influence rankings in Google’s search results.
Leaked documentation mentions a hostAge attribute used “to sandbox fresh spam in serving time.” This fact fully contradicts Google’s denial of a sandbox for new websites.
Chrome data
Matt Cutts previously claimed that Google does not use Chrome data for search ranking or quality purposes.
Leaked documentation shows that Chrome data is used by Google for ranking. For example, it is used to generate the Sitelinks SERP feature. Another module related to page quality scores includes a site-level measure of views from Chrome.
Links
Links remain important for Google. Attributes like sourceType indicate a loose relationship between the value of a page and where it sits in Google’s index, which appears to be organized into tiers.
This suggests that the higher the tier, the more valuable the link. Pages considered “fresh” are also treated as high quality. In other words, getting links from highly ranking pages and fresh pages yields better ranking performance. This could also be why websites generating links from fresh, high-quality pages at scale see more benefit than traditional link earning, where links may come from outdated content.
In this context, it’s also worth mentioning PageRank, which remains relevant, as evidenced by the leaked documentation. The data shows that Google decides how to value a link based on how much it trusts the homepage. Homepage PageRank is considered for all pages.
As always, in your link-building strategy, you should focus on the quality and relevance of your links and not just the volume.
Content
As for the content, there are several interesting points in this documentation. Let’s take a quick look at them.
Google evaluates the originality of short content and gives it an OriginalContentScore (from 0 to 512). The score is therefore likely involved in how Google identifies thin content, which is not just a matter of content length.
The documents show Google’s attempts to associate dates with pages. The following attributes show this: bylineDate (the date explicitly set on the page), syntacticDate (a date extracted from the URL or the title), and semanticDate (a date derived from the content of the page).
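To make the three date attributes more concrete, here is a minimal Python sketch. It is not Google's implementation: the PageDates container, the extract_syntactic_date helper, and the date pattern are all hypothetical and only illustrate what "a date extracted from the URL or the title" could look like in practice.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PageDates:
    """Hypothetical container; only the three field names come from the leak."""
    bylineDate: Optional[date] = None     # date explicitly set on the page
    syntacticDate: Optional[date] = None  # date extracted from the URL or title
    semanticDate: Optional[date] = None   # date derived from the page content


# Assumed pattern: YYYY/MM/DD or YYYY-MM-DD, not embedded in a longer number.
_DATE_RE = re.compile(r"(?<!\d)(\d{4})[/-](\d{1,2})[/-](\d{1,2})(?!\d)")


def extract_syntactic_date(url_or_title: str) -> Optional[date]:
    """Return the first valid calendar date found in a URL or title, if any."""
    match = _DATE_RE.search(url_or_title)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Looks like a date but is not one, e.g. month 13.
        return None


page = PageDates(
    syntacticDate=extract_syntactic_date(
        "https://example.com/blog/2024/05/28/google-leak"
    )
)
```

A site owner could run a check like this over their own URLs and titles to see which date a simple parser would pick up, and compare it with the byline date shown on the page.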
Authors
Google places heavy emphasis on E-E-A-T. If you ever have any doubts about the importance of content authorship for ranking, this documentation dispels them. It indicates that Google explicitly stores author information.
Panda algorithm
According to the documentation, to determine quality content, Google uses a scoring modifier based on user behavior and external links, applying it at various levels (domain, subdomain, and subdirectory).
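For readers unsure what the three levels mean for a given URL, the following Python sketch splits a URL into a domain, subdomain, and subdirectory key. The site_levels function is hypothetical and says nothing about how Google computes or applies the modifier. It also assumes, naively, that the registrable domain is the last two host labels, which is wrong for suffixes such as .co.uk.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit


def site_levels(url: str) -> Dict[str, Optional[str]]:
    """Split a URL into illustrative domain, subdomain, and subdirectory keys."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = host.split(".")

    # Naive assumption: the registrable domain is the last two labels.
    domain = ".".join(labels[-2:]) if len(labels) >= 2 else host

    # Treat the first path segment as a subdirectory only when something
    # follows it, or when the path ends with a slash.
    segments = [segment for segment in parts.path.split("/") if segment]
    has_directory = len(segments) >= 2 or (bool(segments) and parts.path.endswith("/"))
    subdirectory = f"{host}/{segments[0]}/" if has_directory else None

    return {"domain": domain, "subdomain": host, "subdirectory": subdirectory}


levels = site_levels("https://blog.example.com/recipes/pasta.html")
```

Here levels groups the page under example.com, blog.example.com, and blog.example.com/recipes/, which are the kinds of buckets a site-section-level score could be attached to.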
The document pays significant attention to NavBoost’s data (or click data), which focuses on relevancy and user intent. The documentation proves that the search engine uses it in ranking.
Google’s documentation clarifies that Panda is far simpler than we thought. You just need to create high-quality, relevant content that receives many user clicks. Focusing on getting more relevant traffic and improving user experience will show Google that your page should rank higher.
Demotions
The document also contains information about the reasons for ranking drops. Various demotions are applied for issues like:
This information isn’t groundbreaking, but it will help you confirm that you’re on the right track and remind you what to avoid.
Summary
This leak confirmed many long-held suspicions about how Google works internally, and put others to rest.
While Google wants to support and advise webmasters, it's important to remember that it also takes care not to give spammers opportunities to manipulate search results.
The best way to gain real insight into SEO is through practice and firsthand experience. Evaluate any outside advice critically, even when it comes from Google.