Nearly 12,000 API keys and passwords found in AI training dataset
Close to 12,000 valid secrets, including API keys and passwords, have been found in the Common Crawl dataset used for training multiple artificial intelligence models.
"The greatest challenge with the evolution of modernized AI, or 'synthetic cognition,' a term I use, is that we are quickly approaching mass acceptance without any standard control mechanisms. This goes beyond a tool's or a product's acceptance; it is inevitably data awareness while not understanding its inevitable consequences. The reality is that, as a technologist, I am excited about the future but wary about the short-term consequences. Privacy is only one aspect of this tidal wave. The true concern is that, since technology has now surpassed our intelligence, we must be aware that AI unchecked can become our master, since AI invariably does not need a human to exist. AI must be developed and created in our own image, with inalienable truths that cannot be bifurcated by its synthetic desires or by anyone else who chooses to do harm with it." - Pierre Bourgeix
The Common Crawl non-profit organization maintains a massive open-source repository of petabytes of web data collected since 2008, which is free for anyone to use.
Because of the large dataset, many artificial intelligence projects may rely, at least in part, on the digital archive for training large language models (LLMs), including ones from OpenAI, DeepSeek, Google, Meta, Anthropic, and Stability.
AWS root keys and Mailchimp API keys
Researchers at Truffle Security, the company behind the TruffleHog open-source scanner for sensitive data, found valid secrets after checking 400 terabytes of data from 2.67 billion web pages in the Common Crawl December 2024 archive.
They discovered 11,908 secrets that authenticate successfully, all of which developers had hardcoded, indicating that LLMs may be trained on insecure code.
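To illustrate what a hardcoded secret looks like to a scanner, here is a minimal Python sketch of pattern-based detection. It is not Truffle Security's method: the two regular expressions are assumptions based on commonly seen key formats (an AWS access key ID starting with `AKIA`, and a Mailchimp API key ending in a `-usNN` data center suffix), and the helper name `find_candidate_secrets` is hypothetical. Unlike the research described above, the sketch only flags candidates and does not check whether a key actually authenticates.

```python
import re

# Assumed patterns based on commonly seen key formats.
# These are illustrative and are NOT TruffleHog's actual detectors.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}


def find_candidate_secrets(text):
    """Return (kind, value) pairs for strings that look like hardcoded keys."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(0)))
    return hits


# Made-up front-end snippet. Both keys are placeholders, not real credentials.
SAMPLE_HTML = """
<script>
  var mailchimpKey = "0123456789abcdef0123456789abcdef-us12";
  var awsKey = "AKIAIOSFODNN7EXAMPLE";
</script>
"""

if __name__ == "__main__":
    for kind, value in find_candidate_secrets(SAMPLE_HTML):
        print(kind, value)
```

A pattern match alone says nothing about validity. Confirming that a candidate is live requires an authenticated request to the issuing service, which is the step that separates the 11,908 verified secrets from mere look-alikes.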
It should be noted that LLM training data is not used in raw form. It goes through a pre-processing stage that cleans the data and filters out unwanted content such as irrelevant, duplicate, harmful, or sensitive information.
Despite such efforts, it is difficult to remove confidential data, and the process offers no guarantee for stripping such a large dataset of all personally identifiable information (PII), financial data, medical records, and other sensitive content.
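The limits of such filtering can be seen in a small Python sketch of a pre-processing pass. It is only an illustration under stated assumptions, not any vendor's actual pipeline: the function name `preprocess`, the exact-match deduplication via SHA-256, and the single redaction regex are all hypothetical choices made for this example.

```python
import hashlib
import re

# Assumed redaction pattern covering two known key formats only (illustrative).
SECRET_RE = re.compile(r"\b(?:AKIA[0-9A-Z]{16}|[0-9a-f]{32}-us[0-9]{1,2})\b")


def preprocess(documents):
    """Drop exact duplicate documents, then redact strings matching SECRET_RE."""
    seen = set()
    cleaned = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a document already kept
        seen.add(digest)
        cleaned.append(SECRET_RE.sub("[REDACTED]", doc))
    return cleaned


if __name__ == "__main__":
    docs = [
        'key = "AKIAIOSFODNN7EXAMPLE"',      # placeholder key in a known format
        'key = "AKIAIOSFODNN7EXAMPLE"',      # exact duplicate, dropped
        'token = "s3cr3t-custom-format"',    # unknown format, passes through
    ]
    for line in preprocess(docs):
        print(line)
```

The third sample document passes through untouched because its secret does not match any known pattern, which is the kind of gap that makes complete removal hard to guarantee at this scale.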
"Nearly 1,500 unique Mailchimp API keys were hard coded in front-end HTML and JavaScript" - Truffle Security