Navigating the AI Search Revolution: Vector Search and Knowledge Bases

In the dynamic world of digital marketing, staying ahead of the curve means continuously learning and adapting. I constantly tell our team that the best thing you can do to get ahead in your industry is to read about it. You'd be surprised how few people have a genuine passion for what they do, and being better informed makes your work more enjoyable and enriching as you uncover patterns, insights and value you might not have recognised before.

This episode is going to get a little nerdy, but don't fret; there are still some other goodies below, like my response to a proposed solution to automate Pedestrian TV's publishing licences with AI, and our thoughts on Google choosing to no longer deprecate cookies in Chrome.


Recently, I delved into Marie Haynes' insightful book "SEO in the Gemini Era," which explores how artificial intelligence is reshaping search. In this month's MoM's the Word column, I wanted to unpack some of the key concepts from the book (while still leaving the meatier stuff for you to go and purchase it!) and highlight what I think are the key implications for digital publishers, advertising agencies, marketers, and brands to understand and be across.

In previous episodes of "MoM's the Word," I've shared various ways we utilise AI as a business, as well as providing handy guidance on prompting. Recently, as I delved deeper into the intricacies of search and after reading Marie's book, a significant and satisfying realisation dawned upon me. This book profoundly changed my understanding of how search works and finally put in context why Google has been claiming to be an "AI-first" company since 2016 when CEO Sundar Pichai made this proclamation at the company's I/O developer conference. Let me explain...


Imagine you're studying for your final exams, and you have a group of friends who each excel in different subjects. There's Sam, who is a whiz at maths; Alex, who knows everything about history; and Taylor, who is a science genius. Whenever you have a question, you don’t just ask anyone; you ask the friend who knows the most about that particular subject. This way, you get the best possible answer.

Prompting an AI model is very much the same. It involves guiding the AI to provide expert responses based on specific inputs. This technique is crucial for getting accurate and relevant answers rather than generic or incorrect ones. It’s like steering the AI towards the areas where it has the most expertise, akin to navigating a mind map. By doing this, you ensure the AI accesses the most relevant parts of its knowledge base, producing high-quality responses.

But how does the AI train itself and continuously improve to ensure it's providing you with helpful answers? The process hinges on user interaction data and feedback.

Every interaction you have with the AI contributes to its learning process. When you ask questions and rate the responses, this data is fed back into the system, allowing the AI to identify patterns and preferences. Over time, it refines its algorithms based on this feedback, iterating and evolving to become more accurate and efficient in understanding and responding to your queries.

And this, my friends, is where my epiphany came.


In Marie Haynes' book, she explains that Google's search engine uses a combination of advanced AI models and ranking systems to deliver accurate and relevant search results. These models and systems work together to understand and interpret user queries, determine the quality of web content, and rank search results accordingly.

These models and systems include:

  • RankBrain: a machine learning system introduced in 2015 that helps Google understand the meaning behind search queries by interpreting context and intent, using vectors to represent words and phrases.
  • Neural Matching: uses neural networks to comprehend synonyms and related concepts, ensuring queries match relevant results even if the exact keywords aren't present.
  • BERT (Bidirectional Encoder Representations from Transformers): a deep learning model introduced to Search in 2019 that processes each word in relation to all the other words in a sentence, enhancing Google's understanding of natural language and complex, conversational searches.
  • MUM (Multitask Unified Model): an advanced AI model designed to handle complex search tasks by understanding and generating language across 75 different languages and analysing text, images and video.
  • The Knowledge Graph: a database that enhances search results with semantic information gathered from various sources, helping Google provide direct answers and in-depth results by understanding context.
  • E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness): a set of guidelines used to assess content quality, particularly important for health, finance and other critical areas.
  • User interaction signals: click-through rates, time spent on a page and bounce rates, which allow Google to refine its algorithms based on user satisfaction.
  • Mixture of Experts (MoE) models: a network of neural networks, each selectively called upon so that the part of the model best suited to the data being processed handles it in the most efficient and useful way.
  • Vector search: the conversion of text into numerical vectors to understand the context and semantic meaning of search queries, helping Google deliver more accurate and relevant results by matching query intent with appropriate content.
  • NavBoost: a system we learnt about through Google's latest antitrust trial, which leverages 13 months of user interaction data and signals to predict what the next searchers are likely to find most helpful.

Sounds complicated, doesn't it?

These models and systems are interconnected, creating a robust framework that powers Google's search engine. For instance, RankBrain and Neural Matching work together to understand query intent, while BERT enhances this understanding by providing deeper context. The Knowledge Graph and E-E-A-T guidelines ensure that the information presented is reliable and authoritative. Meanwhile, user interaction signals continuously refine the algorithms to improve user satisfaction. The Mixture of Experts models and vector search further enhance the system's ability to deliver precise and relevant results, making Google's search engine one of the most advanced and effective tools available. By leveraging these sophisticated AI models and ranking systems, Google can provide search results that are not only accurate and relevant but also contextually rich and user-focused.


Google’s knowledge base is like a massive, well-organised library, where every book (or webpage) is connected by a web of relationships. This web, known as Google’s Knowledge Graph, links people, places, things, and concepts together so that Google can understand how they relate to one another (it also reminded me of this guy who's built a graph of all Wikipedia pages and you can start to see the similarities). It’s akin to creating a mind map for studying, where different ideas and facts are connected, allowing you to see the bigger picture. This all helps to give Google further context about what an article is about.

Now, think of Google as a giant brain made up of thousands of friends, each one an expert in a different topic. When you search for something, Google figures out which "expert" to ask, just like you decide which friend to ask for help with your homework. This system, known as the Mixture of Experts (MoE) model, ensures that your query is routed to the most relevant "expert" (or neural network), providing you with the best possible answer. It's important to realise that these experts are not actual human experts in medicine, finance and so on, but rather separate neural networks that might be "expert" in, say, understanding how a particular word is used in context. If you'd like to delve deeper into how this works, Hugging Face has a great breakdown.
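To make the routing idea a touch more concrete, here's a minimal sketch of top-1 expert routing in Python. It's purely illustrative, not Google's (or anyone's) production architecture: the expert count, dimensions and softmax gate are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(42)

NUM_EXPERTS = 4   # assumption: a tiny pool of "expert" networks
DIM = 8           # assumption: toy embedding dimension

# Each "expert" is just a small linear layer in this sketch.
expert_weights = [rng.normal(size=(DIM, DIM)) for _ in range(NUM_EXPERTS)]

# The gate scores how well each expert suits the incoming query vector.
gate_weights = rng.normal(size=(DIM, NUM_EXPERTS))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x):
    """Route a single input vector to the highest-scoring expert (top-1 routing)."""
    gate_scores = softmax(x @ gate_weights)   # probability per expert
    chosen = int(np.argmax(gate_scores))      # pick the most relevant expert
    output = x @ expert_weights[chosen]       # only that expert does the work
    return output, chosen, gate_scores

query_vector = rng.normal(size=DIM)           # stand-in for an embedded query
_, expert_id, scores = moe_forward(query_vector)
print(f"Routed to expert {expert_id} with gate probabilities {np.round(scores, 2)}")
```

The key point is the routing step: only the selected expert does any work for a given input, which is what lets MoE models scale capacity without scaling the compute spent on every single query.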

Side note: I was interested to hear that Meta chose not to use a mixture of experts (MoE) model architecture for its latest Llama 3.1 model, opting instead for a standard decoder-only transformer architecture with minor adaptations. They prioritised training stability, given the massive scale of the 405B parameter model trained on over 15 trillion tokens. The standard architecture is simpler to implement and optimise compared to MoE models and offers proven stability and performance across various language tasks. Meta found the potential benefits of MoE did not outweigh the advantages of the standard approach at that scale, where training alone required over 16,000 NVIDIA H100 GPUs. This conventional architecture allowed Meta to focus on other innovations, such as improved data quality, advanced post-training procedures, and efficient inference optimisations through quantisation, creating a competitive model while maintaining the benefits of an open-source release.

To make this even more effective, Google uses vector searching. Vector searching represents a shift from traditional keyword-based search methods. It involves converting text into numerical vectors, which allows AI models to understand the context and semantic meaning of search queries. This advancement enables search engines to deliver more accurate and relevant results. For instance, Google uses vector search to interpret complex queries and match them with content that might not contain the exact keywords but addresses the search intent. This shift underscores the importance of creating content that is contextually rich and relevant rather than just keyword-stuffed.


By converting text into numerical vectors (and coupling this with other information like your location, time of day, etc.), Google can quickly match the exact intent behind your query with the most appropriate content, making its search results incredibly accurate and relevant. This multidimensional approach is what makes Google so effective at delivering precise answers to even the most complex queries.
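If you're curious what that looks like mechanically, below is a minimal sketch of vector search using cosine similarity. The embeddings here are random stand-ins (a real system would use a trained embedding model, and Google's pipeline is vastly more sophisticated), so the ranking itself is arbitrary; the point is the mechanism: the nearest vectors win, not exact keyword matches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "document embeddings" — in reality these come from a trained embedding model.
documents = {
    "best hiking boots for winter": rng.normal(size=16),
    "how to make sourdough bread": rng.normal(size=16),
    "waterproof footwear for cold weather hikes": rng.normal(size=16),
}

def cosine_similarity(a, b):
    """Measure how closely two vectors point in the same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def vector_search(query_embedding, doc_embeddings, top_k=2):
    """Rank documents by semantic closeness to the query vector."""
    scored = [(title, cosine_similarity(query_embedding, emb))
              for title, emb in doc_embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

query = rng.normal(size=16)  # stand-in for an embedded search query
for title, score in vector_search(query, documents):
    print(f"{score:+.3f}  {title}")
```

With real embeddings, a query like "warm shoes for snowy trails" would land nearest the hiking and footwear documents even though it shares almost no keywords with them, which is exactly the shift away from keyword matching described above.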


So, where does my epiphany come in? I realised that just as we get the best out of AI by prompting it correctly and understanding its knowledge base and capabilities, we also need to demonstrate to Google that we are experts in our field, show what we are capable of, and prove we are genuinely helpful to users if we want our content to surface in its search results.

But it goes even beyond that. From the DOJ's reports, we know that Google uses user interaction data to train and improve its ranking models. This means that every time you perform a search, click on a link, or spend time on a webpage, Google collects this data to understand what is most relevant and useful to users.

Google’s AI-driven algorithms are designed to learn from this vast amount of data. They analyse patterns in user behaviour, such as which links are clicked the most and which search results lead to longer site visits. This helps the algorithms understand the nuances of user intent and preference. For instance, if many users click on a particular result and spend a significant amount of time on that page, Google’s system learns that this page is likely providing valuable content. Google can then use this information in machine learning systems designed to predict, for future searches, whether a page is likely to meet a searcher's needs, and adjust its rankings accordingly.

Additionally, Google employs advanced techniques like supervised learning, where its AI models are trained on labelled data sets that help them recognise and predict user preferences. It also uses reinforcement learning, a method where algorithms improve their responses based on the rewards and feedback received from user interactions.
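As a purely illustrative example of the supervised-learning idea (and emphatically not NavBoost or any real Google system), the sketch below trains a simple classifier on made-up interaction signals to predict whether a page satisfied users. The data, features and model choice are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)

# Synthetic interaction signals per page: [click-through rate, dwell time (s), bounce rate]
n_pages = 200
X = np.column_stack([
    rng.uniform(0.01, 0.60, n_pages),   # click-through rate
    rng.uniform(5, 300, n_pages),       # dwell time in seconds
    rng.uniform(0.10, 0.95, n_pages),   # bounce rate
])

# Fabricated "helpful page" labels: higher CTR and dwell, lower bounce -> more likely helpful.
y = ((X[:, 0] > 0.2) & (X[:, 1] > 60) & (X[:, 2] < 0.6)).astype(int)

# Supervised learning: fit a model on labelled examples of interaction outcomes.
model = LogisticRegression(max_iter=1000).fit(X, y)

new_page = np.array([[0.35, 140.0, 0.40]])  # hypothetical metrics for an unseen page
print(f"Predicted probability the page is helpful: {model.predict_proba(new_page)[0, 1]:.2f}")
```

The real systems are vastly more complex, but the shape of the idea is the same: labelled interaction outcomes in, a prediction of future usefulness out.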

By continuously integrating this user interaction data, Google’s AI systems become more adept at delivering relevant search results, enhancing the overall user experience. Essentially, every interaction you have with Google’s search engine helps it to learn and evolve, ensuring it remains an effective and reliable tool for finding information. This feedback loop is a critical component in maintaining and improving the accuracy and efficiency of Google's search rankings.

Now that we know this, we can put it into practice. As Marie puts it:

"The secret is to have a website and content that people seek out and click on. That usually means having an audience that recognises you as a go-to source on your topic. More importantly than that, it means understanding what that audience’s needs are."

If you’re genuinely a business or personality that people seek out as a source of information on your topics, and if you’re creating content that people find helpful, you are much more likely to align with Google’s Helpful Content Guidelines than an SEO who is attempting to manipulate topical authority.

"You cannot produce original, insightful content that truly demonstrates experience and trustworthiness by outsourcing all of your writing to a copywriter and publishing with minimal editing and no added insight. But for years, many businesses thrived on this model!"

Marie Haynes' book brought about a significant realisation: AI models, like Google's, now constantly improve and predict which content will be most helpful to users based on user interaction data, such as clicks, time spent on a page, and bounce rates. To rank well in today's AI-driven search landscape, it's not enough to produce keyword-rich content. Digital publishers and marketers must demonstrate topical authority by consistently creating high-quality, insightful content that genuinely helps users. This involves deeply understanding user intent and comprehensively addressing their needs.

This is where the concept of Information Gain becomes crucial. Google's guidance on helpful content emphasises the importance of original, high-quality material that showcases experience, expertise, authoritativeness and trustworthiness (E-E-A-T). This is what users genuinely value and find helpful. By incorporating this effort and expertise, as well as an understanding of vector search techniques, brands can optimise their content to better align with user intent. This means creating content that provides real value and answers the underlying questions users might have, rather than merely targeting specific keywords.

In essence, the future of search is about moving beyond superficial keyword strategies and embracing a deeper, more nuanced approach to content creation. By focusing on genuine expertise and user-centric content, brands can not only improve their search rankings but also build lasting trust and engagement with their audience.


A Note on AI Search Quality & HCU Recoveries

Since Google's launch of its Helpful Content Update (HCU) and AI Overviews, the quality of Google Search results has also come into question. The systems we've explored above are by no means perfect. If you want to see what Google's systems "aim" to achieve, you can read the Google Quality Rater Guidelines; but Google still has plenty of work and iteration ahead to get there, and to deliver the diversity of results that people expect. The HCU has faced criticism for potentially favouring larger publishers over smaller, independent content creators. Many small publishers have reported significant traffic drops, sometimes as high as 50-80%, even for well-researched and carefully written content. Claims of favouritism towards big brands (perhaps because these very AI systems find that people like big brands), a lack of transparency, and the dominance of major platforms like Reddit and Quora are prevalent. This shift makes it harder for smaller, specialised blogs to compete, raising concerns about the subjectivity in determining "helpfulness" and a potential reduction in content diversity. Critics argue that Google's definition of helpful content might inadvertently favour larger, mainstream publishers over niche content creators, challenging traditional SEO strategies and creating an uneven impact.

Additionally, the introduction of AI-generated overviews has led to concerns about the potential reduction in traffic to original content creators, accuracy and reliability issues, and a lack of diverse perspectives. Critics worry about overreliance on AI-generated content, which may miss out on more nuanced or in-depth information available on original websites. There are also concerns about potential misinformation, loss of context, and changes in user behaviour, potentially reducing critical thinking and evaluation of sources.

Looking ahead, Danny Sullivan has indicated that a major core update is coming in the next few weeks, advising people to "buckle up" for potentially significant impacts on search rankings. Sullivan mentioned that sites affected by the September 2023 HCU could potentially recover with the next core update if Google's systems believe they've improved. John Mueller, Google's Senior Search Analyst, emphasised that recovery can take time, sometimes months, and might require another update cycle. Both Sullivan and Mueller advise site owners to focus on creating high-quality, helpful content for users rather than optimising specifically for Google's algorithms. Sullivan also suggested that great sites with content people like receive traffic from various sources, not just Google search, implying that site owners should diversify their traffic sources. Mueller indicated that Google's systems continuously reassess sites, but significant changes might require another update cycle to be fully recognised.


Practical Applications & Strategies for Publishers in the AI-Era

Are any of you still with me? I completely understand if not, as I may not have done a great job of explaining the epiphany I had; to be fair to myself, even I get confused by how it all fits together. It's something I could see Marie wrestling with as well as I read her book and she pieced it all together. But if you're still here and want to know what to do with all of this newfound knowledge, here are some helpful tips to point you in the right direction:

1. Understand and Implement Structured Data and Schema Markup

Gain an understanding of Named Entity Recognition (NER) and disambiguation, and implement schema markup so search engines can grasp the context and relationships within your content. Structuring your content this way improves its chances of being featured in knowledge bases and AI summaries, and can also enhance your chances of being included in rich snippets and knowledge panels, driving more organic traffic to your site.
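As a starting point, here's a hedged sketch of generating Article schema markup as JSON-LD with Python. The property values are placeholders, so check schema.org and Google's structured data documentation for the fields that actually apply to your content type.

```python
import json

# Placeholder values — swap in your real page details before using.
article_schema = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Navigating the AI Search Revolution",
    "author": {"@type": "Person", "name": "Scott Purcell"},
    "publisher": {"@type": "Organization", "name": "Man of Many"},
    "datePublished": "2024-08-01",  # placeholder date
    "about": ["vector search", "knowledge graphs", "SEO"],
}

# Embed the output inside a <script type="application/ld+json"> tag in the page <head>.
print(json.dumps(article_schema, indent=2))
```

The named entities ("Scott Purcell", "Man of Many") and topical "about" properties are what help search engines disambiguate who and what your page concerns, which is the NER point above.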

2. Focus on User Engagement

Monitor user interaction metrics closely. High engagement rates signal to Google's AI models that your content is valuable and relevant. Encourage user interaction through engaging multimedia, clear and concise writing, and interactive elements.

3. Build Topical Authority

Consistently produce high-quality content around your core topics. Establish your brand as a go-to source for reliable and insightful information. This builds trust with both users and search engines, enhancing your topical authority.

Concluding Thoughts on the AI Search Era

The recent launch of OpenAI's SearchGPT prototype underscores the epiphany I had from reading Marie Haynes' book. To rank well in today's evolving AI search landscape, it's critical to understand how AI works and the importance of demonstrating topical authority through high-quality, insightful content that users love. As search evolves, with tools like SearchGPT providing real-time answers and emphasising transparency and trustworthiness, publishers must focus on creating original, authoritative content that genuinely helps users, aligning with Google's E-E-A-T principles. If you're simply pumping out generic content or mass-produced AI slop, it simply won't cut it. This shift highlights the need to adapt our strategies to ensure visibility and relevance in the AI-driven search era.

Marie Haynes' "SEO in the Gemini Era" provides invaluable insights into how AI is transforming search. By understanding and leveraging these concepts, digital publishers, advertising agencies, marketers, and brands can stay ahead in the evolving search landscape. The key takeaway is clear: focus on creating genuinely helpful, high-quality content that demonstrates your expertise and aligns with user intent. This approach will not only improve your search rankings but also build lasting trust with your audience.

For further reading, I highly recommend diving into Marie Haynes' comprehensive work, which offers detailed strategies and examples for navigating the AI-driven world of SEO.


Responding to a Controversial LinkedIn Post on Using AI-Generated Articles to Breathe Life into the Remains of Pedestrian's Licences

Recently, a post on LinkedIn sparked controversy among publishers, suggesting that someone could start a profitable local publishing business using the remains of the Pedestrian licences. The author proposed a model where LLMs (Large Language Models) are trained on the overseas parent publications, localised to an Australian tone, and used to generate content from local press releases. He estimated that with minimal staff, this model could produce 500 articles per week, each garnering 3,000 views and generating $10 CPM in revenue, resulting in a business with roughly $780k in annual revenue.
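For transparency, here's the arithmetic behind that headline figure, using the post's own assumptions rather than numbers I endorse:

```python
articles_per_week = 500
views_per_article = 3_000
cpm_dollars = 10           # revenue per 1,000 page views, as proposed
weeks_per_year = 52

weekly_views = articles_per_week * views_per_article       # 1,500,000 views
weekly_revenue = weekly_views / 1_000 * cpm_dollars        # $15,000
annual_revenue = weekly_revenue * weeks_per_year           # $780,000
print(f"Annual revenue under these assumptions: ${annual_revenue:,.0f}")
```

The maths holds up; it's the inputs that don't, as I explain below.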

Here’s my take on the proposal, along with feedback from other publishers. My original response can be found here: LinkedIn Post.

While I acknowledge the merits of trade platforms or press-release summaries from the likes of Mi3 Fast News, I'm sceptical about the proposed model's monetisation potential for a consumer lifestyle or content site. There are several flawed assumptions:


  1. Google's Guidance on AI-Generated Content: As we've explored above, Google emphasises original, high-quality content that displays expertise, experience, authoritativeness, and trustworthiness (EEAT). AI-generated content risks lacking these qualities, leading to poor user engagement and lower search rankings. Websites relying on AI content have been hit hard by Google’s algorithm updates, which favour unique, original content.
  2. Training Costs and Copyright Issues: Properly training an LLM on vast archives requires significant computational power and data, incurring high costs and potential copyright issues. The proposal underestimates these complexities.
  3. Editorial Workload: The suggestion that one person can fact-check and sub-edit 125 articles weekly devalues the important work of editorial staff and journalists. Quality control would suffer, undermining the credibility of the content.
  4. Traffic and CPM Assumptions: The assumption of 3,000 views per article and $10 CPM is overly optimistic. Data from June 2024 shows that prominent sites like Pedestrian TV and Gizmodo had 500k-1.7M monthly page views. Achieving such high traffic per article is unrealistic, especially for AI-generated content. The average CPM for Google Display Ads is $3.12, far below the proposed $10.
  5. Licensing Costs and Brand Value: The costs of international licences and the risk to brand value are significant. Most reputable global publications would be reluctant to adopt such a model, fearing damage to their brand reputation.


Other industry responses included:


  • Kate Sturgess: Criticised the idea as lazy and out of touch with recovering local media.
  • Genevieve Jacobs AM: Found the proposal depressing and counterproductive to creating quality independent journalism.
  • Ben Shepherd: Pointed out the unlikely assumptions about reader engagement, CPM, and licensing costs.
  • James McManus: Mocked the idea, suggesting it reflects a lack of understanding about digital publishing.


It doesn't take long to look around and find plenty of examples of where this model has already gone wrong.



Update: I thought I'd also share this from Simon Owens's latest newsletter that was published today: "Legacy media brands rarely die. Instead, they're taken over by spammy marketers who then run AI-generated content farms that squeeze out all remaining SEO and brand juice from the outlet. Probably a major challenge for tech platforms moving forward is figuring out a way to quickly identify these zombie outlets so their article distribution can be cut off before it reaches unaware news consumers." Here's the article from Wired he is referencing which is worth a read.

A related issue is the phenomenon where AI systems may begin to "cannibalise" their own generated data. This can lead to a situation known as "model collapse," where models trained on their own outputs produce increasingly nonsensical or low-quality results. When AI-generated content is used in training datasets, it can introduce errors that accumulate over generations of models, potentially degrading performance over time. Experts suggest that even a small fraction of AI-generated content in training datasets can be detrimental.
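As a toy illustration of that degradation (a deliberately crude sketch, not a model of any real LLM), the snippet below repeatedly fits a Gaussian to synthetic samples drawn from the previous generation's fit, while dropping the tails to mimic the way generative models under-represent rare data. Diversity, proxied here by the standard deviation, collapses within a handful of generations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Generation 0: "real" data with healthy diversity.
data = rng.normal(loc=0.0, scale=1.0, size=2000)

for generation in range(1, 8):
    mu, sigma = data.mean(), data.std()
    # The next generation is trained only on synthetic samples from the previous model,
    # and (crudely mimicking generative models) under-represents rare, tail-of-distribution data.
    synthetic = rng.normal(loc=mu, scale=sigma, size=2000)
    data = synthetic[np.abs(synthetic - mu) < 1.5 * sigma]
    print(f"Generation {generation}: std of training data = {data.std():.3f}")
```

Each run of the loop loses a little more of the original distribution's tails, which is the intuition behind why even a modest share of AI-generated material in training data can compound into model collapse.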

In summary, while AI certainly offers cost-saving opportunities (and is something I'm still a huge fan and proponent of), it's not a substitute for the depth and quality of human-generated content. This is exactly why we've banned it from content published to our website, Man of Many. Publishers must focus on creating high-quality, original content to maintain credibility and user trust.


Recommended Reading: 7 Facts About News by Hal Crawford

Hal Crawford’s Substack post "Seven Facts About News" provides a deep dive into the findings from the annual Digital News Report by the University of Canberra. Here are the key takeaways:



  1. Mobile Dominance: Almost everyone uses a mobile phone for news consumption, with 67% of people using phones as their primary source.
  2. Generational Divide: Social media is the main news source for young people, while TV remains dominant for older generations.
  3. Decline in News Consumption: Across all media types, news consumption has decreased since 2016, including TV, websites, social media, radio, and print.
  4. Political Leanings: Online news tends to be more politically left-leaning than offline news, largely due to the strong left-wing orientation of outlets like The Guardian and younger online audiences.
  5. Gender Gap: Women, especially younger women, are less interested in news compared to men, and this gap is widening.
  6. Rise of Video News: Platforms like TikTok, YouTube, and Instagram are growing in popularity for news consumption, while Facebook's influence is declining.
  7. Algorithmic Influence: Algorithms play an increasing role in news distribution, particularly on platforms like TikTok, YouTube, and Instagram. This algorithmic curation leads to issues like outdated news being served as fresh content and low audience awareness of news source identity.



Crawford emphasises the impact of these trends on the media landscape, noting that news products must be designed for mobile consumption despite its challenges. He highlights the shift towards video news and the growing influence of algorithms, which often prioritise engagement over the accuracy and context of news.

For a more detailed read, you can visit Hal Crawford’s original post here.


Man of Many's Perspective on Google's Cookie Saga

The saga around third-party cookies has been nothing short of a rollercoaster for publishers. Initially, Google's plan to deprecate these cookies aimed to enhance user privacy but posed significant challenges. Many publishers struggled to understand cookies, attribution, and the intricate systems behind them. The constant shifts in Google's timeline—from 2022 to 2023, then 2025, and now a proposal to maintain user choice—reflect the complexity and uncertainty in the industry.

Google's latest move to delay the deprecation and introduce a new user choice experience in Chrome indicates a shift towards more user control. However, this is not an automatic return to the status quo. The likely default opt-out scenario suggests that publishers should not be complacent. Instead, they should prepare for a landscape where third-party cookies are no longer the norm.

The introduction of Privacy Sandbox APIs shows promise, with early tests indicating near-recovery levels of advertiser spend and ROI without third-party cookies. Yet, as Criteo's findings suggest, the full integration of Privacy Sandbox could still result in a significant drop in publisher revenues and increased reliance on Google’s ad services.

For publishers, the key takeaway is to take control of their own data. First-party cookies and direct audience data are more critical than ever. This shift allows publishers to forge direct deals and reduce dependence on ad-tech vendors. Building a comprehensive understanding of their audience and preparing for a cookie-less future is essential.

At Man of Many, we see this transition as an opportunity to innovate and strengthen our data strategies. While the road ahead is uncertain, focusing on what we can control—our data and audience relationships—will be crucial. This proactive approach will ensure we continue to deliver quality content and maintain economic viability in the evolving digital landscape.

In summary, Google's delay in deprecating third-party cookies highlights the ongoing complexity and uncertainty in the digital landscape. Publishers should not view this as a return to normal but must prepare for the potential of widespread user opt-out. Investing in first-party data and direct audience relationships is now more critical than ever, particularly as independent publishers face significant challenges without dedicated ad ops teams and the prohibitive costs of CDPs and DMPs. At Man of Many, we are proactively exploring subscription models and first-party data systems, prioritising audience feedback. By staying informed and proactive, publishers can navigate these changes effectively, ensuring both user trust and business sustainability.


If you liked what you've read, consider supporting us by subscribing to MoM's the Word, our monthly LinkedIn newsletter, above.

You can also check out our new online media kit here: https://advertise.manofmany.com

For further discussions on how Man of Many can align with your advertising goals and contribute to your success, feel free to contact us at [email protected].

Visit https://manofmany.com


About the Author: Scott Purcell, CFA, the Co-Founder of Man of Many and a CFA Charterholder, is a renowned figure in the media industry. His entrepreneurial spirit was recognised in 2017 when he was a Finalist for Young Entrepreneur at the NSW Business Chamber Awards. With a special focus on technology, finance, whisky, and general lifestyle content, Scott has collaborated with leading international brands like Apple, Samsung, IWC, and TAG Heuer. In 2023, his leadership and innovation were further acknowledged as he, alongside Frank Arthur, ranked #47 on the MediaWeek Power 100 List by MediaWeek. Under his guidance, Man of Many triumphed as the Best Media Platform at the B&T Awards, 2023, and clinched the titles for Best Engagement Strategy, Website of the Year, and Publish Leader of the Year at the Mumbrella Publish Awards, 2023, showcasing his unparalleled expertise and influence in the media sector. In 2024, he was recognised in MediaWeek's Next of the Best Awards for Publishing.
