Assessing Security Risks When Introducing And Using Big Data Technologies
Konstantin Vladimirovich Tserazov

In February 2024, the social network Reddit announced a cooperation agreement with Alphabet, the corporation behind the Google brand. Under the deal, Google will pay roughly $5 million a month to train its own artificial intelligence (AI) models, using Reddit's user-generated content as training material. The cooperation is planned for more than one year, so it is worth taking a closer look.

Essentially, Google is gaining access to Reddit's big data. The exact parameters of that data are a matter of contract detail, but the volumes involved are massive. This is where the security risks of using big data arise, on both the Reddit side and the Google side.

Accumulating big data requires Reddit's hardware and software to operate reliably, and a cyber attack can disable both. This makes it necessary to keep backup copies of the information in cloud storage, and those backups also need reliable protection. However, if the cloud is owned by a third party, the security risks increase, because there is no direct control over its security system.
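One common way to reduce dependence on a third-party cloud's security controls is to encrypt backups on the client side before uploading them, so the provider only ever holds ciphertext. Below is a minimal sketch in Python using the widely available cryptography library; the file names and the upload step are hypothetical placeholders, not details of the Reddit/Google case.

```python
# Minimal sketch: client-side encryption of a backup before it leaves the company.
# Assumes the `cryptography` package is installed; file paths are hypothetical.
from cryptography.fernet import Fernet

def encrypt_backup(plaintext_path: str, encrypted_path: str, key: bytes) -> None:
    """Encrypt a backup file locally so the cloud provider only stores ciphertext."""
    fernet = Fernet(key)
    with open(plaintext_path, "rb") as f:
        ciphertext = fernet.encrypt(f.read())
    with open(encrypted_path, "wb") as f:
        f.write(ciphertext)

if __name__ == "__main__":
    with open("backup.tar", "wb") as f:      # dummy backup file for the example
        f.write(b"example backup contents")
    key = Fernet.generate_key()              # in practice, keep the key in a key-management system
    encrypt_backup("backup.tar", "backup.tar.enc", key)
    # Only backup.tar.enc would then be uploaded to the third-party cloud.
```

The key point of this design is that the cloud provider's security posture matters less when it never sees readable data; key management then becomes the company's own responsibility.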

Accumulating big data on the Reddit side means obtaining a large amount of information that is unstructured, or structured only in the most general terms. To make sense of it and regularly package it into specific deliveries for Google, additional human resources will be needed at first. Google's AI models are in a similar position: Gemini currently operates not autonomously but with human participation.

However, giving more employees access to Reddit's big data increases the risk of information leakage and the overall risk of opportunistic behavior among staff. The data that Reddit users leave behind is a kind of "currency" with which they pay for free use of the social network. The public disclosure of the contract between Reddit and Google only underscores the value of this "currency."

But keeping the market for this "currency" in the "white" zone will be very difficult, given its Internet-native character. The most effective solution is to move at least the most valuable part of the big data onto a blockchain, which makes it possible to control all entry points to the data and to record who accessed it and when.
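As a rough illustration of the "record who accessed it and when" idea, the sketch below builds a tiny append-only, hash-chained access log in Python. It is a simplified stand-in for an on-chain audit trail, not any specific blockchain product; the user and dataset names are hypothetical.

```python
# Minimal sketch: a hash-chained access log, a simplified stand-in for an on-chain audit trail.
import hashlib
import json
import time

def append_access(log: list, user: str, dataset: str) -> dict:
    """Append a record linked to the hash of the previous one, so tampering is detectable."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    record = {"user": user, "dataset": dataset, "timestamp": time.time(), "prev_hash": prev_hash}
    record["hash"] = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    log.append(record)
    return record

def verify(log: list) -> bool:
    """Recompute every hash and check the chain links; any edit breaks verification."""
    prev_hash = "0" * 64
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if record["prev_hash"] != prev_hash:
            return False
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != record["hash"]:
            return False
        prev_hash = record["hash"]
    return True

if __name__ == "__main__":
    log = []
    append_access(log, "analyst_01", "comments_2024_02")   # hypothetical names
    append_access(log, "ml_pipeline", "comments_2024_02")
    print("log intact:", verify(log))                      # True until any record is altered
```

A real blockchain adds what this toy lacks: the log is replicated across many independent nodes, so no single party can quietly rewrite it.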

But there is a dilemma here. The blockchain most resistant to attack is a public distributed ledger with a large number of mutually independent nodes. At the same time, posting information on it should not be free, because the network's economics are what make a 51% attack prohibitively expensive. This, however, means additional costs, as the high transaction fees on the Ethereum blockchain illustrate. Ethereum competitors such as Solana, which position their distributed ledgers as lower-fee blockchains, have repeatedly suffered from network congestion and outages.
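To make the cost argument concrete, the short calculation below estimates what a single Ethereum transaction might cost under assumed conditions. The gas usage, gas price and ETH price are hypothetical inputs chosen only for illustration; real values fluctuate constantly.

```python
# Back-of-envelope estimate of a single Ethereum transaction fee.
# All inputs are hypothetical illustration values, not live market data.
GAS_USED = 50_000            # assumed gas for a small contract call storing a data hash
GAS_PRICE_GWEI = 40          # assumed gas price in gwei
ETH_PRICE_USD = 3_000        # assumed ETH/USD exchange rate

fee_eth = GAS_USED * GAS_PRICE_GWEI * 1e-9   # 1 gwei = 1e-9 ETH
fee_usd = fee_eth * ETH_PRICE_USD

print(f"fee: {fee_eth:.6f} ETH, about ${fee_usd:.2f} per transaction")
# At these assumptions, anchoring thousands of data packages per day adds up quickly.
```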

It is also important to remember that on a public blockchain the information becomes available to everyone, and its value may therefore decrease. A corporate blockchain, on the other hand, offers greater privacy for big data but is less secure and less resistant to attack than a public distributed ledger. Fundamentally, the picture does not change if one experiments not only with blockchains but with other forms of distributed ledger.

Note that Google, as the party receiving big data from another organization, faces similar security risks. The important point is that addressing these risks when implementing and using big data requires significant resources: power supply systems that are as invulnerable to attack as possible, in-house cloud storage, and either the use of well-known blockchains with high fees or the deployment of a proprietary distributed ledger.

In addition, we can predict that the work of training Google's key AI model, which is now partly done manually, will in future require AI assistants from other companies. These will have to be either purchased or rented, just as Google and Reddit now employ staff to work with big data.

It is not surprising that Reddit signed the agreement with Google a month before its New York IPO: fulfilling the contract requires serious financial resources, which can be raised on the stock exchange. Investors in Reddit shares, however, still want clarity on the business model for monetizing the service's user content, and they remain hesitant. As a result, Reddit stock, which peaked at $74.9 per share on March 26, traded as low as $44.6 on April 5.

Looking beyond the Reddit-Google case, the use of big data in organizations is very expensive, above all from a security point of view. Setting aside the training of AI models on big data, in other scenarios companies will increasingly need to entrust AI assistants with the entire cycle of accumulating and structuring big data, analyzing it, and making and implementing decisions based on it.

At the same time, to reduce security risks, companies need to take a bundled approach: corporate big data plus corporate, decentralized AI assistants. Why? Because AI assistants developed by third parties, whether purchased or rented, carry risks similar to those of using outside human resources to work with big data. The most secure use of big data therefore requires large investments and is available only to large corporations.

Take a modern hypermarket chain as an example. Information about customer behavior is collected by sensors (audio, video, thermal and other sensors), which gradually begin to "communicate" with one another and become elements of a corporate Internet of Things (IoT). Already at this stage of big-data collection, large investments are needed to keep the equipment and software running securely.

As a result, such a hypermarket collects big data every day showing, for example, how long a shopper's gaze lingered on a particular shelf or product; how the shopper put an item into the cart and later returned it to the shelf after changing their mind; how much time passed between those two actions, and so on. The number of such measurements is growing like an avalanche and already runs into the tens of thousands per day. Printed out, a single day of this big data would fill a hefty book of several thousand A4 pages. Analyzing all of it manually is impossible, which means AI is required.
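A quick back-of-envelope check of the "several thousand A4 pages" figure, under hypothetical assumptions about event counts and record sizes (the numbers below are illustrative, not taken from any real store):

```python
# Rough estimate of how many printed A4 pages one day of store telemetry might fill.
# All inputs are hypothetical illustration values.
EVENTS_PER_DAY = 50_000        # assumed sensor events (gaze, pick-up, put-back, ...)
CHARS_PER_EVENT = 300          # assumed size of one logged event as plain text
CHARS_PER_A4_PAGE = 3_000      # roughly one full page of plain text

pages = EVENTS_PER_DAY * CHARS_PER_EVENT / CHARS_PER_A4_PAGE
print(f"about {pages:,.0f} A4 pages per day")   # about 5,000 pages under these assumptions
```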

Once AI has been used to structure this information, calculate indicators and make decisions, the question becomes how to implement those decisions in merchandising: which product to put on promotion, how to rearrange goods on the shelf, and much more, including how to adjust the prices of particular goods based on the collected and processed big data. Human error is very likely here, as is the leakage of sensitive information to competitors.
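To show the kind of decision such a pipeline might automate, here is a deliberately simplified, rule-based sketch: if shoppers look at a product often but rarely buy it, flag it as a candidate for a promotion or a price cut. The thresholds, field names and sample data are hypothetical and serve only to illustrate the idea, not any real retailer's logic.

```python
# Toy rule: high attention but low conversion -> candidate for promotion / price cut.
# Thresholds and data are hypothetical illustration values.
from dataclasses import dataclass

@dataclass
class ShelfStats:
    sku: str
    gaze_events: int        # how often shoppers looked at the product
    purchases: int          # how often it ended up being bought
    put_backs: int          # how often it was returned to the shelf

def suggest_action(s: ShelfStats) -> str:
    conversion = s.purchases / s.gaze_events if s.gaze_events else 0.0
    if s.gaze_events >= 500 and conversion < 0.05:
        return f"{s.sku}: high attention, low conversion -> consider promotion or price cut"
    if s.put_backs > s.purchases:
        return f"{s.sku}: more put-backs than purchases -> review price or placement"
    return f"{s.sku}: no action"

if __name__ == "__main__":
    for stats in [ShelfStats("SKU-001", 800, 20, 60), ShelfStats("SKU-002", 300, 90, 10)]:
        print(suggest_action(stats))
```

In practice such rules would be learned or tuned by AI systems rather than hard-coded, which is exactly where automation both removes human error and concentrates sensitive commercial logic that must be protected.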

To minimize security risks in the foreseeable future, it will be necessary to create a closed AI loop within which the entire life cycle of big data takes place, including its continuous updating. A public blockchain would be brought into this loop by having the company's various AI assistants manage the ledger's nodes. At first these assistants will include third-party ones, but eventually it will be necessary to move to a network of independent, in-house AI assistants that manage the nodes. All of this is possible today, but it requires large financial resources.

Much attention will also have to be paid to the security of the hardware on which AI systems for big data are built. It is no surprise that in February reports appeared that OpenAI, the creator of the best-known neural network, ChatGPT, intends to raise up to $7 trillion for the development and production of its own chips for AI models working with big data. No single investment project in human history has targeted such a sum. In essence, OpenAI is following the same path of creating a closed loop, which will reduce security risks along the entire business value chain of its work with big data.

Author: Konstantin Tserazov, strategic business consultant, fintech expert, former senior vice president of Otkritie Bank.
