ChatGPT: Running before Crawling: Robots.txt & Lying Bots
If you've been following recent news, you're likely aware that OpenAI is currently grappling with significant privacy concerns related to their web crawling practices. While web crawling may be a relatively novel concept for many, it's a long-standing and contentious issue within data science circles.
For well over a decade, data scientists, marketers, and business developers have been crawling platforms like Twitter (now known as X), Facebook, and LinkedIn. Even industries not typically associated with technological prowess, such as airlines, have been actively exploring how to glean insights from customers' purchasing behaviors and preferences. Whether you're a weekend traveler with your family or a frequent business traveler, airlines have, in many cases, already ascertained these details and use them to create marketing campaigns.
So, how do these entities obtain such information, and where do they draw the line? Web crawlers, along with third-party services like Sprinklr, Hootsuite, and others, allow for the analysis of social media behaviors. Additionally, third-party data providers such as Acxiom (census data and more) offer detailed insights into individual profiles and interests.
This prompts the question: how do these parties determine which data sources are accessible and which are off-limits? In 1994, "robots.txt" emerged as a way for site owners to tell bots which parts of a website they may access. It swiftly became a de facto standard that web crawlers, including those run by the prominent search engines of the day such as WebCrawler, Lycos, and AltaVista, were expected to follow. However, compliance is voluntary and far from universal; email harvesters, spambots, and malware often disregard these rules and may access restricted sections of websites.
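To make the robots.txt idea concrete, here is a minimal sketch in Python of how a well-behaved crawler might check the rules before fetching a page. It uses the standard library's urllib.robotparser. GPTBot is the user-agent token OpenAI publishes for its crawler; the robots.txt content, the example.com URLs, the name SomeOtherBot, and the helper functions are all made up for illustration and do not describe any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site (example.com).
# It blocks GPTBot everywhere and blocks all other bots from /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A compliant crawler asks before every request.
print(may_fetch(parser, "GPTBot", "https://example.com/articles/1"))
print(may_fetch(parser, "SomeOtherBot", "https://example.com/articles/1"))
print(may_fetch(parser, "SomeOtherBot", "https://example.com/private/data"))
```

Note that the check runs entirely on the crawler's side. Nothing stops a bot from skipping it or from announcing a different user-agent name, which is exactly how the non-compliant bots described above get into sections a site has tried to restrict.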
OpenAI, for instance, employs web crawling to index well-known sites such as Wikipedia, GitHub, and Stack Overflow. On the other hand, Bard relies heavily on Google's historical index data, while Meta's product leverages Facebook's extensive databases alongside other sources specifically owned by Meta. Collectively, these AI entities permeate the internet landscape, diligently processing massive volumes of data to answer our prompts.
The overarching objective is user engagement and satisfaction. Yet, what transpires when the focus shifts to pleasing users? The potential outcome is the propagation of "pink lies," wherein positive information is magnified while negative aspects are downplayed. This can lead to the generation of fabricated responses, often termed "hallucinations."
While experts in the field of AI cannot definitively pinpoint the root causes of such hallucinatory outputs, certain triggers are commonly identified. These include inadequate or fragmented training data, which leaves the model facing unfamiliar edge cases. For instance, when I asked about the song "Piti Piti Piti Bull" and its rise to fame, Bard initially responded with:
"The song 'Piti Piti Piti Bull' is not a real song. It is a meme that originated on TikTok in 2022. The song is a sped-up version of the Turkish pop song 'O Piti Piti' by Bilgen Bengü, which was released in 1978. The meme features a video of a bull running around and the sound of the sped-up song playing. The meme is often used to express excitement or energy."
ChatGPT, however, initially struggled to produce an answer. Subsequent prompts yielded a comprehensive response that delved into the song's evolution and its integration into internet culture:
"The song's popularity has endured over the years, and it has been used in various contexts, including as the basis for the 'Piti Piti Piti Bull' meme you mentioned earlier. The meme involves a sped-up version of the song and is often paired with videos of fast-moving or energetic content, such as a bull running around. The playful and lively nature of 'O Piti Piti' has made it a recognizable tune, and its catchy melody has contributed to its lasting popularity. It's fascinating how a song from the late 1970s can find new life and relevance in the digital age through internet memes and social media trends."
In conclusion, these large language models offer an innovative approach to harnessing the digital data landscape we've constructed. They have the potential to reduce repetitive tasks, uncover valuable patterns for enhancing our lives, and contribute to broader societal and environmental betterment. It's now critical for companies to enhance their grasp of their data, safeguard customer and corporate privacy, establish transparent rules for their AI tools, and solidify their AI-related missions and strategies to improve customer satisfaction and engagement.
Financial Advisor | Service innovation | AI wave surf
1 year ago: Hi Meltem, could I talk to you about 3 mines? at your schedule? I have missed following you. and now trying to catch up... : )
Financial Advisor | Service innovation | AI wave surf
1 year ago: Meltem, it is clearly illustrated. and that was 2 months ago! I just follow up : )
Digital Transformation Leader | AI Futures | Org Change | Board Director
1 year ago: Nice synopsis and example Meltem
Executive Advisor | Global Management Consultant | Author - Customer Driven Change | Speaker - Change & Innovation
1 year ago: Exceptionally well written. I understood it.
Wow! Very insightful article, Meltem Ballan, Ph.D. “Pink Lies and Hallucinations” sounds like a great title for a book :-). Very interesting to see how the different tools are harnessing the data - and the biases we see in that as well.