Turns out you can understand the Internet without reading the words.
Dixon Jones
CEO. Board member. NED advisor. Startup veteran in the digital SaaS space. BA (Hons.). MBA. FRSA.
Creating a good search engine is never easy. Just ask Bing (previously Live.com, previously MSN, previously Microsoft Search, previously LookSmart...). But it turns out that knowing what is ON a page is often less important than knowing how pages are grouped together. There are some very powerful research papers that demonstrate this theory (https://maj.to/1yZOdW5 covers the TrustRank paper and https://maj.to/1Bb1lHu is a Stanford paper), but actually a few good analogies suffice. Think about going into a good old-fashioned library to find the mass of the Moon. All those pages of data... where do you start? The natural sciences section, of course! Opening random pages would be absurd, but once you have found the Astronomy subsection of the natural sciences, the chances are you'll find the answer in every third book on that shelf.
Majestic used this logic, combined with its crawl of three BILLION web pages a day, to categorize the entire Internet and build a marketing search engine without saving a word of online content. Three billion is a lot, by the way: for a mental comparison, Twitter claims only 500 million unique tweets a day.
It turns out that web pages link just like people do. If you are a mum with a child at school, there is a very strong chance that you will, within two degrees of separation, know every other mum through MULTIPLE paths. By following links on web pages at scale, Majestic has been able to work out not only what every web page is about, but also how influential it is in that category.
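To make the "influential" half of that claim concrete, here is a hedged sketch of how influence can fall out of link structure alone, using a few iterations of the classic PageRank-style recurrence. Majestic publishes its own Flow Metrics; this toy formula and the example graph are illustrative assumptions, not their method.

```python
# Toy influence-from-links sketch (PageRank-style iteration).
# NOT Majestic's formula; an illustrative assumption only.

def influence(links, iterations=20, damping=0.85):
    """links: {page: [pages it links to]}. Returns an influence score per page."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            for t in targets:
                # each page passes a share of its score along its outgoing links
                new[t] += damping * score[page] / len(targets)
        score = new
    return score

# The school-mums analogy as a graph: the hub everyone links to scores highest.
links = {"mum_a": ["school"], "mum_b": ["school"], "school": ["mum_a"]}
scores = influence(links)
print(max(scores, key=scores.get))  # -> "school"
```

The point of the sketch is simply that nobody had to read a page to score it; the score emerges from who links to whom.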
How Did They Do That?
A traditional search engine has these four distinct steps:
1. Data Collection
2. Data Grouping
3. Data Indexing
4. Data Matching
But traditional engines have spent billions to achieve their results. Majestic may have spent millions, so how did it get so far on such limited resources?
1. Data Collection
Majestic has become one of the top ten largest Internet crawlers on the planet (beating Yandex and Baidu outside their home countries) by crawling differently. They crowd-sourced the crawl!
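The details of Majestic's distributed crawler are not public, but the flavour of a crowd-sourced crawl worker is easy to sketch: fetch a page, keep the outgoing links, and throw the words away. Everything below (names and structure) is a hypothetical illustration, not Majestic's code.

```python
# Hypothetical crowd-sourced crawl worker: keep the link graph, discard the words.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

def crawl_one(url):
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    parser = LinkExtractor(url)
    parser.feed(html)
    return url, parser.links  # only the edges survive; the content is discarded

# Each volunteer machine would report its (url, links) edges to a central collector.
```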
2. Data Grouping
This is the magic. Whilst many engines tried to group data based entirely or in part on the on-page content, Majestic looked at the links between content. As the image at the top of the page tries to demonstrate, if four pages are all authorities on the subject of (say) blue widgets, then the shaded page close to the authority pages is much more likely to be about blue widgets than the one spatially far away.
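A toy version of that picture, purely as an illustration of the idea and not Majestic's algorithm: score a page's likely topic by the known authorities that link to it.

```python
# Toy "blue widgets" neighbourhood vote; an illustration, not Majestic's algorithm.
from collections import Counter

def topic_by_neighbourhood(page, inbound, authority_topics):
    """inbound: {page: [pages linking to it]}; authority_topics: {page: topic}."""
    votes = Counter(
        authority_topics[src]
        for src in inbound.get(page, [])
        if src in authority_topics
    )
    return votes.most_common(1)[0][0] if votes else None

inbound = {"shaded.example": ["a1", "a2", "a3", "a4"], "far-away.example": ["z1"]}
authorities = {a: "blue widgets" for a in ["a1", "a2", "a3", "a4"]}
print(topic_by_neighbourhood("shaded.example", inbound, authorities))   # blue widgets
print(topic_by_neighbourhood("far-away.example", inbound, authorities)) # None
```

The shaded page inherits its label from its neighbourhood; the distant page, with no authoritative neighbours, stays unclassified.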
3. Data Indexing
Now the cost savings get extreme, because Majestic does not need to index all the content. It already knows what a page is about and how influential it is. Majestic just saved billions...
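To see where the saving comes from, compare what has to be stored. A content engine keeps an inverted index over every word; a link-classified engine needs only a tiny record per page. The field names below are assumed for illustration, not Majestic's schema.

```python
# Illustrative per-page record; field names are assumptions, not Majestic's schema.
from dataclasses import dataclass

@dataclass
class PageRecord:
    url: str
    topic: str        # inferred from the link graph, never from the page text
    influence: float  # link-derived score within that topic

record = PageRecord("astro-blog.example/moon-mass", "astronomy", 0.87)
# Dozens of bytes per page, versus an inverted index over every word on it.
```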
4. Data Matching
By using the same principle of looking at links, Majestic has also been able to categorize keywords in a similar way. So when a user types in a search phrase, Majestic can match the keyword to the pages.
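Under those assumptions the matching step becomes almost trivial: map the phrase to a category, then return that category's pages ordered by influence. Here is a self-contained toy version with invented data, not Majestic's search API.

```python
# Toy matching step; invented data, not Majestic's search API.
toy_index = {  # category -> [(url, link-derived influence)]
    "credit cards": [("cards-blog.example", 0.41), ("bank.example/cards", 0.92)],
}
phrase_to_category = {"credit cards": "credit cards", "visa card": "credit cards"}

def search(phrase):
    category = phrase_to_category.get(phrase.lower())
    if category is None:
        return []
    return sorted(toy_index[category], key=lambda page: page[1], reverse=True)

print(search("Credit Cards"))
# [('bank.example/cards', 0.92), ('cards-blog.example', 0.41)]
```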
Think it doesn't work? Well, it has SOME legs... its search engine is only in alpha (as Majestic's core business is the link intelligence database itself), but here are the results for the phrase "Credit Cards"... and they are not bad!
If you would like to be contacted when more research comes out, or to get into the beta program, get yourself a free account now at Majestic and register for the beta testing program here.
Where SEARCH Marketing MEETS Artificial Intelligence & Digital Sustainability
9y · I guess the "new Reading" on the web is Scrolling :-)
Dixon Jones · CEO. Board member. NED advisor. Startup veteran in the digital SaaS space. BA (Hons.). MBA. FRSA.
9y · Depends where you start from, Chris - but there are 10 types of people... those that understand binary and those that don't. I don't... so I start at words.
Murex Developer at ICBC Standard Bank Plc
9y · There are words on t'internet?