October 02, 2024
Kannan Subbiah
FCA | CISA | CGEIT | CCISO | GRC Consulting | Independent Director | Enterprise & Solution Architecture | Former Sr. VP & CTO of MF Utilities | BU Soft Tech | itTrident
One of the most significant bottlenecks in training specialized AI models is the scarcity of high-quality, domain-specific data. Building enterprise-grade AI requires increasing amounts of diverse, highly contextualized data, of which there are limited supplies. This scarcity, sometimes known as the “cold start” problem, is only growing as companies license their data and further segment the internet. For startups and leading AI teams building state-of-the-art generative AI products for specialized use cases, public data sets also offer capped value, due to their lack of specificity and timeliness. ... Synthesizing data not only increases the volume of training data but also enhances its diversity and relevance to specific problems. For instance, financial services companies are already using synthetic data to rapidly augment and diversify real-world training sets for more robust fraud detection — an effort that is supported by financial regulators like the UK’s Financial Conduct Authority. By using synthetic data, these companies can generate simulations of never-before-seen scenarios and gain safe access to proprietary data via digital sandboxes.
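The augmentation idea above can be sketched minimally. This is an illustrative toy, not how production synthetic-data platforms work (they use rich generative models): it fits a simple Gaussian to a handful of real fraud-transaction amounts, then samples synthetic examples to enlarge the scarce minority class. All names and numbers here are hypothetical.

```python
import random
import statistics

# Hypothetical scenario: only 4 labeled fraud transactions exist,
# far too few to train a robust fraud-detection model.
real_fraud_amounts = [920.0, 880.0, 1010.0, 950.0]

# Fit a simple per-feature Gaussian to the real examples
# (a stand-in for a proper generative model).
mu = statistics.mean(real_fraud_amounts)
sigma = statistics.stdev(real_fraud_amounts)

# Sample 100 synthetic fraud amounts from the fitted distribution.
random.seed(42)
synthetic = [random.gauss(mu, sigma) for _ in range(100)]

# The augmented training set now has 104 fraud examples instead of 4.
augmented = real_fraud_amounts + synthetic
print(len(augmented))  # 104
```

Even this crude approach shows the mechanism: synthetic samples increase volume and can cover "never-before-seen" regions of the feature space, while the real records never leave the sandbox.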
Event sourcing is an approach to persisting data within a service. Instead of writing the current state to the database, and updating that stored data when the state changes, you store an event for every state change. The state can then be restored by replaying the events. Event-driven architecture is about communication between services. A service publishes any changes in its subdomain it deems potentially interesting for others, and other services subscribe to these updates. These events are carriers of state and triggers of actions on the subscriber side. While these two patterns complement each other well, you can have either without the other. ... Just as you can use Kafka without being event-driven, you can build an event-driven architecture without Kafka. And I’m not only talking about “Kafka replacements”, i.e. other log-based message brokers. I don’t know why you’d want to, but you could use a store-and-forward message queue (like ActiveMQ or RabbitMQ) for your eventing. You could even do it without any messaging infrastructure at all, e.g. by implementing HTTP feeds. Just because you could, doesn’t mean you should! A log-based message broker is most likely the best approach for you, too, if you want an event-driven architecture.
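The event-sourcing half of the distinction above can be sketched in a few lines. This is a minimal, hypothetical example (the `Event` and `AccountEventStore` names are invented for illustration): instead of storing a current balance and updating it in place, we append one event per state change and rebuild the state by replaying the event history.

```python
from dataclasses import dataclass

@dataclass
class Event:
    kind: str     # "deposited" or "withdrawn"
    amount: int

class AccountEventStore:
    """Persists every state change as an event; never stores current state."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def replay(self) -> int:
        """Restore the current balance by replaying all events in order."""
        balance = 0
        for e in self._events:
            if e.kind == "deposited":
                balance += e.amount
            elif e.kind == "withdrawn":
                balance -= e.amount
        return balance

store = AccountEventStore()
store.append(Event("deposited", 100))
store.append(Event("withdrawn", 30))
print(store.replay())  # 70
```

Note that nothing here involves messaging between services: this is purely a persistence pattern inside one service. Publishing those same events to other services over a broker (or an HTTP feed) would be the separate, complementary event-driven-architecture concern.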
Mostly AI provides enterprises with a platform to train their own AI generators that can produce synthetic data on the fly. The company started off by enabling the generation of structured tabular datasets, capturing the nuances of transaction records, patient journeys and customer relationship management (CRM) databases. Now, as the next step, it is expanding to text data. While proprietary text datasets – like emails, chatbot conversations and support transcriptions – are collected at large scale, they are difficult to use because they contain PII (like customer information), suffer from diversity gaps and are only partially structured. With the new synthetic text functionality on the Mostly AI platform, users can train an AI generator on any proprietary text they have and then deploy it to produce a cleansed synthetic version of the original data, free from PII or diversity gaps. ... The new feature, and its ability to unlock value from proprietary text without privacy concerns, makes it a compelling offering for enterprises looking to strengthen their AI training efforts. The company claims that training a text classifier on its platform’s synthetic text resulted in a 35% performance improvement compared to data generated by prompting GPT-4o-mini.
Enterprises are increasingly data-driven and rely heavily on the data they collect to make decisions, says Choudhary. A decade ago, a single application stored all its data in a relational database for weekly reporting. Today, data is scattered across various sources including relational databases, third-party data stores, cloud environments, on-premises systems, and hybrid models, says Choudhary. This shift has made data management much more complex, as all of these sources need to be harmonized in one place. Moreover, in the world of AI, both structured and unstructured data need to be of high quality. Choudhary states that failing to maintain data quality in the AI age would lead to “garbage in, disasters out.” Highlighting the relationship between AI and data observability in enterprise settings, he says that given the role of both structured and unstructured data in enterprises, data observability will become even more critical. ... However, AI also requires unstructured business context, such as documents from wikis, emails, design documents, and business requirement documents (BRDs). He stresses that this unstructured data adds context to the factual information on which business models are built.
Attackers are increasingly targeting supply chains, capitalizing on the trust between vendors and users to breach OT systems. This method offers a high return on investment, as compromising a single supplier can result in widespread breaches. The Dragonfly attacks, where attackers penetrated hundreds of OT systems by replacing legitimate software with Trojanized versions, exemplify this threat. ... Attack strategies are shifting from immediate exploitation to establishing persistent footholds within OT environments. Attackers now prefer to lie dormant, waiting for an opportune moment to strike, such as during economic instability or geopolitical events. This approach allows them to exploit unknown or unpatched vulnerabilities, as demonstrated by the Log4j and Pipedream attacks. ... Attackers are increasingly focused on collecting and storing encrypted data from OT environments for future exploitation, particularly in anticipation of practical quantum computing. This poses a significant risk to current encryption methods, potentially allowing attackers to decrypt previously secure data. Manufacturers must implement additional protective layers and consider future-proofing their encryption strategies — for instance by adopting post-quantum cryptography — to safeguard data against these emerging threats.
Unsurprisingly, open-source software's lineage is complex. Whereas commercial software is typically designed, built and supported by one corporate entity, open-source code could be written by a lone developer, a well-resourced open-source community or a teenage whiz kid. Libraries containing all of this open-source code, procedures and scripts are extensive. They can contain libraries within libraries, each with its own family tree. A single open-source project may have thousands of lines of code from hundreds of authors, which can make line-by-line code analysis impractical and may result in vulnerabilities slipping through the cracks. These challenges are further exacerbated by the fact that many libraries are stored on public repositories such as GitHub, which may be compromised by bad actors injecting malicious code into a component. Vulnerabilities can also be accidentally introduced by developers. Synopsys' OSSRA report found that 74% of the audited code bases had high-risk vulnerabilities. And don't forget patching, updates and security notifications — standard practices from commercial suppliers, but likely lacking (or far slower) in the world of open-source software.