The Data and AI Life Cycle…Correlation?
Mohsien Hassim
Seasoned Business Transformation Executive with a solid Foundation in Finance/Technology/Risk (GRC/ESG)/Security (Cyber)/Strategy and Digital Transformation. AI Researcher & Enthusiast.
We have come to accept data, and big data in particular, as the new currency of our digital world, the new oil. Data drives decision-making, which in turn shapes how AI models perform, and it has opened a completely new world for business in understanding markets and consumer behaviour. According to the latest estimates, 402.74 million terabytes of data are created each day, where 'created' covers newly generated, captured, copied, or consumed data. In zettabytes, that equates to around 147 zettabytes per year, around 12 zettabytes per month, 2.8 zettabytes per week, or 0.4 zettabytes daily. It is estimated that 90% of the world's data was generated in the last two years alone.
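To put those figures into perspective, here is a quick back-of-the-envelope check of the conversions above (a minimal Python sketch, assuming decimal units where 1 zettabyte = 1 billion terabytes; the figures are estimates, not exact):

```python
# Sanity-check the article's figures, assuming decimal units (1 ZB = 1e9 TB).
TB_PER_DAY = 402.74e6            # ~402.74 million terabytes created per day

zb_per_day = TB_PER_DAY / 1e9    # convert terabytes to zettabytes
print(f"per day:   {zb_per_day:.2f} ZB")     # ~0.40 ZB
print(f"per week:  {zb_per_day * 7:.1f} ZB")     # ~2.8 ZB
print(f"per month: {zb_per_day * 30:.0f} ZB")    # ~12 ZB
print(f"per year:  {zb_per_day * 365:.0f} ZB")   # ~147 ZB
```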
We are surrounded by data: on our phones, our mobile devices, anything digital. With the rapid adoption of digital transformation, business is changing in ways that affect how we live and see the world, while manual processes and paper usage are in decline.
Data refers to raw information consisting of basic facts and figures: a collection of facts, statistics, and information used for analysis, reference, and decision-making. Like everything in life, data has a lifecycle, from the point of creation to its end, destruction.
Like all things in life, there needs to be a starting point. In an organisation, data is typically created in one of three ways: data acquisition, data entry or data capture. Data acquisition is the acquiring of existing data produced outside the organisation; data entry is the manual entry of new data within the organisation; and data capture is the capture of data generated by the organisation's various processes.
Once data is created, it needs to be stored. Stored data must be protected with an appropriate level of security, as data often contains sensitive elements and is subject to regulatory and legislative requirements. Robust backup and recovery processes must also be in place to ensure the data is retained throughout its lifecycle.
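As an illustration of protected storage with a backup copy, here is a minimal sketch assuming Python's third-party cryptography package; the paths, the record, and the key handling are illustrative only, and a real system would manage keys in a dedicated key-management service:

```python
# Minimal sketch: store a record encrypted at rest and keep a backup copy.
from cryptography.fernet import Fernet
from pathlib import Path

key = Fernet.generate_key()        # in practice, keep keys in a vault/KMS
cipher = Fernet(key)

record = b"customer_id=1042, balance=5300.00"
encrypted = cipher.encrypt(record)

primary = Path("storage/records.bin")
backup = Path("backup/records.bin")
for location in (primary, backup):
    location.parent.mkdir(parents=True, exist_ok=True)
    location.write_bytes(encrypted)   # same ciphertext in primary and backup

# Recovery: read the backup and decrypt with the retained key
restored = cipher.decrypt(backup.read_bytes())
assert restored == record
```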
Data is not created merely to be stored but to be used for a multitude of applications and purposes. It can be viewed, processed, modified or saved. While data is in use, all such activity should be tracked through audit trails that cannot be deleted. For critical data this is essential: it provides a digital footprint of who accessed the data, what changes were made, and when. The audit trail should also record whether data has been shared outside the organisation, giving better control over the risks of data leakage and theft of critical data.
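One common way to make an audit trail tamper-evident is to chain entries together with hashes, so that any deletion or edit breaks the chain. A minimal sketch, with illustrative field names (who, action, shared_externally) rather than any standard schema:

```python
# Minimal sketch of a tamper-evident, append-only audit trail: each entry
# records the hash of the previous one, so edits or deletions are detectable.
import hashlib
import json
import time

audit_log = []

def record_access(who: str, action: str, item: str, shared_externally: bool = False):
    prev_hash = audit_log[-1]["hash"] if audit_log else "genesis"
    entry = {
        "who": who, "action": action, "item": item,
        "shared_externally": shared_externally,
        "timestamp": time.time(), "prev_hash": prev_hash,
    }
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    audit_log.append(entry)

record_access("analyst_01", "view", "customer_master")
record_access("analyst_02", "modify", "customer_master")
record_access("analyst_02", "export", "customer_master", shared_externally=True)
```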
As data grows continuously, the large volumes of data within an organisation become a challenge to manage. Data that becomes 'old' and no longer needs to be accessed immediately or online can be archived. Data archival is the copying of data to another environment, where it is stored in case it is needed again. Archived data is removed from the active production environment(s) and can be restored to production when needed.
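The archive-and-restore pattern described above might look something like this minimal sketch; the directory names and the one-year age threshold are assumptions for illustration:

```python
# Minimal sketch: copy 'old' files to an archive, remove them from
# production, and restore on demand.
import shutil
import time
from pathlib import Path

PRODUCTION = Path("production")
ARCHIVE = Path("archive")
MAX_AGE_DAYS = 365  # anything untouched for a year is considered 'old'

def archive_old_files():
    PRODUCTION.mkdir(exist_ok=True)
    ARCHIVE.mkdir(exist_ok=True)
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for f in PRODUCTION.glob("*"):
        if f.is_file() and f.stat().st_mtime < cutoff:
            shutil.copy2(f, ARCHIVE / f.name)  # copy to the archive first
            f.unlink()                         # then remove from production

def restore(filename: str):
    shutil.copy2(ARCHIVE / filename, PRODUCTION / filename)
```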
Archiving 'old' data sounds great and appears to solve the challenge of growing data. However, it is not a silver bullet for data management. The volume of archived data will itself grow and become unmanageable, and organisations cannot keep all their data, archived or not, forever. Storage costs (on-premises or in the Cloud) and regulatory/compliance requirements mean that data must be kept for certain periods, after which it should be destroyed.
Data destruction, or purging, is the removal of every copy of a data item from an organisation, typically from an archive storage location. The challenge in this phase of the lifecycle is to ensure that the data has been properly destroyed, and, before destroying anything, that the data items have exceeded their required regulatory retention period.
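A minimal sketch of a retention check before purging; the seven-year period and the record layout are illustrative, as real retention rules vary by jurisdiction and data type:

```python
# Minimal sketch: only purge archived items whose retention period has lapsed.
from datetime import datetime, timedelta

RETENTION = timedelta(days=7 * 365)  # e.g. a seven-year retention rule

archived_items = [
    {"id": "inv-2015-001", "archived_on": datetime(2015, 3, 1)},
    {"id": "inv-2024-118", "archived_on": datetime(2024, 6, 15)},
]

def purge_expired(items, now=None):
    now = now or datetime.now()
    kept, purged = [], []
    for item in items:
        (purged if now - item["archived_on"] > RETENTION else kept).append(item)
    # 'purged' items would be securely erased from every storage location here
    return kept, purged

kept, purged = purge_expired(archived_items)
print([i["id"] for i in purged])  # items past retention, safe to destroy
```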
Artificial Intelligence, or AI, has seen a resurgence in recent years, mainly due to the availability of the required computing power and of large volumes of data. AI is "the science and engineering of making intelligent machines, especially intelligent computer programs". It is a way of making a computer, a computer-controlled robot, or software think intelligently, in a similar manner to intelligent humans, whose brains are said to produce an average of 70,000 thoughts a day.
The link between data and AI is unmistakable. High-quality data shapes AI systems into reliable interpreters, capable of navigating huge datasets and deriving meaningful insights from them. Data quality is non-negotiable in the world of AI. Data quantity matters too, and generally the more data the better, but if the data quality is poor, it is like using contaminated fuel in your car: it will not get you very far.
Artificial Intelligence thrives on data. This intrinsic dependence is at the heart of AI's capability, and it is not isolated to a specific type of AI technology: it spans all AI systems, from straightforward decision-making algorithms to intricate neural networks. All of them require data to develop and to function consistently, which is why Big Data is so important in the world of AI.
The AI lifecycle is the iterative process of moving from a business problem to an AI solution that solves that problem. Each of the steps in the life cycle is revisited many times throughout the design, development, and deployment phases.
Each stage in the AI project life cycle serves a vital role. The problem definition (Design) phase establishes the project's direction. Data acquisition and preparation create the foundation for the AI solution. Model development and training turn this foundation into a functional tool. Model evaluation and refinement ensure that the tool meets the expected standards. Finally, deployment brings the AI solution to its intended users, and maintenance keeps it running smoothly over time.
Any AI initiative must start with design, where the problem is identified and unpacked. During the Design phase, the business identifies and unpacks the problem until it is properly understood, and recognises the key objectives and requirements needed to define the desired business outcome.
Gathering quality data is key to any use of AI, as data is the lifeblood of AI. This step involves collecting and evaluating the data the AI solution requires: discovering available data sets, identifying data quality problems, and deriving initial insights into the data and a perspective on a data plan. In data preparation and wrangling, the working data set is constructed from the initial raw data into a format that the model can use. Although time-consuming and tedious, this work is critically important to developing a model that achieves the established goals.
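To make the wrangling step concrete, here is a minimal sketch using pandas with an invented toy dataset: handling a missing value, inconsistent number formats, inconsistent labels, and an implausible outlier. Real preparation steps depend entirely on the dataset at hand:

```python
# Minimal data-wrangling sketch on an invented toy dataset.
import pandas as pd

raw = pd.DataFrame({
    "age": [34, None, 29, 120],                        # missing value, outlier
    "income": ["52,000", "61000", "48,500", "57000"],  # inconsistent formats
    "churned": ["yes", "no", "NO", "Yes"],             # inconsistent labels
})

clean = raw.copy()
clean["income"] = clean["income"].str.replace(",", "").astype(float)
clean["churned"] = clean["churned"].str.lower().map({"yes": 1, "no": 0})
clean["age"] = clean["age"].fillna(clean["age"].median())  # impute missing age
clean = clean[clean["age"].between(0, 110)]                # drop implausible ages
print(clean)
```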
The ‘Develop’ phase involves experimenting with AI models using the data prepared in the previous phase. The AI project team trains, tests, evaluates and retrains different AI models to determine which best achieves the desired outcome. The training and selection process is iterative, as no AI model achieves its best performance the first time it is trained; only through repeated fine-tuning does the model come to produce the desired outcome. Evaluating how the models perform gives the organisation insight into their alignment with the key objectives and the suitability of the data.
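The train-evaluate-compare loop might look like this minimal sketch, using scikit-learn on one of its bundled datasets; the two candidate models and the accuracy metric are illustrative choices:

```python
# Minimal sketch: train several candidate models, evaluate on held-out data,
# and compare -- the best performer is then fine-tuned further.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {score:.3f}")
```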
No AI solution will succeed without clearly and precisely understanding the business challenge being solved and the desired outcome.
Developing AI models is pointless unless they are brought to 'life', that is, moved into the 'real' world: production. The Deploy phase involves migrating the developed model, once it has met the necessary performance tests, into the production environment. There, the AI model takes on new data that was not part of the training cycle.
The performance of the AI model must be monitored in the production environment as it processes live data, to ensure it can still produce the intended outcome. This is a test of the model's ability to adapt to new, 'unseen' data. Even a model that initially performs as expected can "drift", meaning its performance changes over time. Through ongoing monitoring of this drift, the AI model can be updated.
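One simple way to watch for drift is to compare the distribution of a live input feature against its training baseline. A minimal sketch using a two-sample Kolmogorov-Smirnov test; the synthetic data, the single feature, and the 0.01 threshold are all illustrative, and production monitoring would track many such signals:

```python
# Minimal drift-monitoring sketch: flag when live data no longer looks
# like the training data for one feature.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)  # baseline
live_feature = rng.normal(loc=0.4, scale=1.0, size=1000)      # shifted data

stat, p_value = ks_2samp(training_feature, live_feature)
if p_value < 0.01:
    print(f"Drift detected (KS={stat:.3f}) -- consider retraining the model")
else:
    print("Live data looks consistent with the training distribution")
```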
Data preparation is often the hardest and most time-consuming phase of the AI lifecycle.
This approach follows the agile methodology of continually retraining and refreshing the AI model. In addition to agile practices, AI models must undergo rigorous monitoring and maintenance, given their very nature and purpose, to ensure they continue to perform as trained, meet the desired outcome, and solve the business challenges as intended.
The data acquisition and preparation phase creates the foundation for the AI solution. It is arguably the most important phase, as AI models rely on quality data to provide the intended outputs that meet the stated objectives.
Data is the foundation of any AI solution. Without a clear understanding of the data required and of its make-up, a model cannot make use of it.
In addition, AI projects often need to adapt to changes quickly, whether these are changes in the project’s requirements, unexpected issues with the data, or new developments in AI technology. Building this adaptability into the project life cycle can be difficult but is crucial for long-term project success.
By understanding the data and AI lifecycles, the business and its project teams can critically evaluate each phase while paying particular attention to the role good-quality data plays from the outset. After all, poor data results in 'garbage in, garbage out', which we do not want to see in AI models.
The importance of the data lifecycle to the AI lifecycle is like that of blood to the human body: good blood makes for a well-functioning body. Prioritising data quality as the foundation for AI is a no-compromise approach. The lifecycle presented here offers some insight into the value of structuring the approach to AI projects and of using quality data. AI is still in its infancy; as it advances and becomes more complex and 'intelligent', the importance of good-quality data, and of following structured approaches to implementing AI, will only grow.
Sources:
Please see my other published articles on LinkedIn...