Week 3: From Data to AI

Data is not just a component of AI; it is its lifeblood. Without data, AI cannot exist.

Welcome to the third week of our Zero to Hero AI learning series! In this article, we will explore the different types of data, the steps involved in transforming data into AI systems, and how companies can prepare their data for the AI-driven future.


Data: The Lifeblood of AI


Types of Data

  • Structured Data is organized and easily searchable. It usually resides in databases and spreadsheets and follows a predefined format, such as rows and columns. Examples include a spreadsheet of customer information or a database of sales records.


  • Unstructured Data lacks a predefined format and is more challenging to process. It includes text, images, audio, and video. Examples include text documents (such as emails and reports), social media posts, images, and videos.
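To make the contrast concrete, here is a minimal Python sketch. The customer rows, the email text, and the variable names are all made up for illustration. The structured rows can be filtered directly by field name, while the same kind of fact inside free text has to be extracted before it can be used.

```python
import re

# Structured: every field has a name, so filtering is a direct lookup.
customers = [
    {"name": "Ana", "city": "Lisbon", "total_spent": 120.0},
    {"name": "Ben", "city": "Oslo", "total_spent": 45.5},
]
big_spenders = [c["name"] for c in customers if c["total_spent"] > 100]

# Unstructured: the amount is buried in free text and must be extracted first.
# (Hypothetical email; a regular expression is only one simple way to do this.)
email = "Hi team, Ana from Lisbon just placed an order worth 120.00 EUR. Please follow up."
amounts = [float(m) for m in re.findall(r"\d+\.\d+", email)]

print(big_spenders)
print(amounts)
```

A pattern like this breaks as soon as the wording changes, which is part of why unstructured data is harder to process.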



Using AI to Build AI: Recent advances in AI, such as GPT models, have made it easier to write code and to process unstructured data, which makes that data more valuable for generating insights and driving decisions.


From Data to AI

High-quality data empowers AI systems to learn effectively, recognize patterns, and make accurate predictions. The quantity and quality of data significantly influence an AI model's performance. Without sufficient and relevant data, AI systems may become inaccurate or biased, resulting in poor performance and unreliable outcomes.

AI systems are built from data through a series of steps that run from ingestion and preparation to model training, evaluation, and deployment. Here’s a detailed look at the process:


1- Data Ingestion:

This crucial initial step involves gathering relevant and high-quality data from various sources, such as:

  • Internal Databases: Proprietary data stored within organizational databases.
  • Public Datasets: Freely available data from sources like government databases or open research.
  • Social Media: User-generated content and interactions.
  • Sensors and IoT Devices: Real-time data from interconnected devices.

After collecting the data, the next step is to integrate and consolidate it into a central repository, such as a database, data warehouse, or data lake.



Example: Data is collected from multiple hospitals' EHR systems (electronic health records), patient wearable devices, and public health databases. This diverse data is integrated into a central data lake.
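As a rough illustration of this consolidation step, the sketch below gathers rows from three sources into one in-memory list standing in for the data lake. The source names, field names, and values are all hypothetical; a real data lake would be a storage system, not a Python list.

```python
from datetime import date

# Hypothetical raw extracts from three kinds of sources; all values are made up.
ehr_rows = [
    {"patient_id": "P001", "visit_date": date(2024, 1, 5), "systolic_bp": 142},
    {"patient_id": "P002", "visit_date": date(2024, 1, 6), "systolic_bp": 118},
]
wearable_rows = [
    {"patient_id": "P001", "day": date(2024, 1, 5), "avg_heart_rate": 88},
]
public_health_rows = [
    {"region": "North", "week": "2024-W01", "flu_cases": 310},
]


def ingest(source_name, rows):
    """Tag each raw row with its origin and keep the payload unchanged."""
    return [{"source": source_name, "payload": dict(row)} for row in rows]


# The "data lake" here is just a list that holds every record in raw form.
data_lake = []
data_lake += ingest("hospital_ehr", ehr_rows)
data_lake += ingest("wearables", wearable_rows)
data_lake += ingest("public_health", public_health_rows)

# Count how many records arrived from each source.
records_per_source = {}
for record in data_lake:
    name = record["source"]
    records_per_source[name] = records_per_source.get(name, 0) + 1

print(records_per_source)
```

Keeping the source tag next to each raw payload makes it possible to trace any record back to where it came from during later cleaning.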


2- Data Preparation:

Once data is collected, it must be cleaned and organized to be useful for AI models. Key data preparation steps include:

  • Data Cleaning: This involves removing duplicates, correcting errors, and handling missing values to ensure the dataset's integrity.
  • Data Transformation: Data must be normalized and converted into a consistent format, which might include scaling numerical values or encoding categorical variables.
  • Data Annotation: Labeling data, especially for supervised learning tasks, to provide the model with examples to learn from.



Example: In healthcare, electronic health records (EHRs) often contain errors or missing values. Preprocessing this data involves several steps:

  • Data Cleaning: Correcting erroneous entries and filling in missing patient information.
  • Data Transformation: Standardizing formats for patient information such as dates, medical codes, and numerical measurements.
  • Data Annotation: Medical professionals review patient records and add labels indicating diagnoses, such as "diabetes" or "hypertension". These labels help train AI models to recognize and predict these conditions from the data.
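The three preparation steps above can be sketched in a few lines of Python. Everything here is an illustrative assumption: the records, the two date formats, the labels, and the choice to fill a missing blood pressure value with the mean, which is only one simple option among many.

```python
from datetime import datetime

# Hypothetical raw EHR rows: one duplicate, one missing value, mixed date formats.
raw_records = [
    {"patient_id": "P001", "visit_date": "2024-01-05", "systolic_bp": 142, "smoker": "yes"},
    {"patient_id": "P001", "visit_date": "2024-01-05", "systolic_bp": 142, "smoker": "yes"},
    {"patient_id": "P002", "visit_date": "06/01/2024", "systolic_bp": None, "smoker": "no"},
    {"patient_id": "P003", "visit_date": "2024-01-07", "systolic_bp": 118, "smoker": "no"},
]
# Hypothetical labels that a clinician would supply during annotation.
clinician_labels = {"P001": "hypertension", "P002": "none", "P003": "none"}


def parse_date(text):
    """Accept ISO (YYYY-MM-DD) or day-first (DD/MM/YYYY) dates and return ISO."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


# Cleaning, part 1: drop duplicate visits while keeping the original order.
seen = set()
deduped = []
for row in raw_records:
    key = (row["patient_id"], row["visit_date"])
    if key not in seen:
        seen.add(key)
        deduped.append(dict(row))

# Cleaning, part 2: fill missing blood pressure with the mean of known values.
known_bp = [r["systolic_bp"] for r in deduped if r["systolic_bp"] is not None]
mean_bp = sum(known_bp) / len(known_bp)
for r in deduped:
    if r["systolic_bp"] is None:
        r["systolic_bp"] = mean_bp

# Transformation and annotation: standardize dates, scale blood pressure to
# the 0..1 range, encode the smoker flag as 0/1, and attach the label.
low = min(r["systolic_bp"] for r in deduped)
high = max(r["systolic_bp"] for r in deduped)
prepared = []
for r in deduped:
    prepared.append({
        "patient_id": r["patient_id"],
        "visit_date": parse_date(r["visit_date"]),
        "systolic_bp_scaled": (r["systolic_bp"] - low) / (high - low),
        "smoker": 1 if r["smoker"] == "yes" else 0,
        "label": clinician_labels[r["patient_id"]],
    })

for row in prepared:
    print(row)
```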


3- Feature Engineering:

  • Extracting Relevant Features: Identifying and extracting the most relevant features from raw data.
  • Creating New Features: Developing new features that can provide additional predictive power to the model.



Example: From the EHR data, relevant features such as patient age, blood pressure readings, medication history, and lifestyle factors (e.g., smoking status, exercise frequency) are extracted. Additional features like trends in vital signs over time and co-occurrence of chronic conditions are created to improve model performance in predicting disease outcomes.
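A minimal sketch of what this might look like in Python, assuming a single hypothetical patient record. The field names and values are made up. The trend feature is computed as a least-squares slope over successive readings, which is one simple way to summarize a change over time.

```python
def slope(values):
    """Least-squares slope of values against their position (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Hypothetical patient record assembled from cleaned EHR data.
patient = {
    "age": 63,
    "systolic_bp_history": [128, 134, 140, 146],  # oldest reading first
    "conditions": {"diabetes", "hypertension"},
    "smoker": True,
}

features = {
    # Extracted features: taken directly from the record.
    "age": patient["age"],
    "smoker": int(patient["smoker"]),
    "latest_systolic_bp": patient["systolic_bp_history"][-1],
    # Created features: derived values that are not stored anywhere in the raw data.
    "systolic_bp_trend": slope(patient["systolic_bp_history"]),
    "has_diabetes_and_hypertension": int(
        {"diabetes", "hypertension"} <= patient["conditions"]
    ),
}

print(features)
```

Here the blood pressure rises by 6 units per visit, so the trend feature captures a worsening pattern that the latest reading alone would not show.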


4- Model Selection:

  • Choosing Model Architecture: Selecting appropriate models such as decision trees, neural networks, or support vector machines, based on the problem and data characteristics.
  • Considering Computational Resources: Balancing the model complexity with available computational power and time constraints.


Example: To predict whether a patient will return to the hospital soon (patient readmission), we might use a model such as a Gradient Boosting Machine (GBM). GBMs are effective at capturing complex interactions between features in the data, such as age, medical history, and lab results. They can also report how much each feature contributes to the prediction, which helps us understand how strongly a factor like age or medical history is linked to readmission.
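To give a feel for how gradient boosting works, here is a deliberately tiny sketch in plain Python. It is a toy under stated assumptions and not how production GBM libraries are implemented: it uses one feature, squared-error loss, and one-split trees (stumps), and it predicts a made-up numeric risk score instead of a yes/no readmission label. All data and function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold on xs that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosting(xs, ys, rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the remaining errors."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical toy data: patient age vs. a made-up readmission risk score.
ages = [30, 40, 50, 60, 70, 80]
risk = [0.10, 0.12, 0.20, 0.35, 0.55, 0.60]

one_round = fit_boosting(ages, risk, rounds=1)
many_rounds = fit_boosting(ages, risk, rounds=20)
print("error after 1 round:  ", mse(one_round, ages, risk))
print("error after 20 rounds:", mse(many_rounds, ages, risk))
```

Each round corrects part of the error left by the previous rounds, which is the core idea behind boosting. In practice you would use an established library and evaluate on held-out data, since training error alone says nothing about overfitting.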


5- Model Training:

  • Data Splitting: Dividing data into training, validation, and test sets to ensure unbiased evaluation.
  • Training and Tuning: Training the model on the training set and tuning hyperparameters using the validation set to optimize performance.



Example: We split the EHR data into three sets: 70% for training, 15% for validation, and 15% for testing. First, we train the GBM model on the training set. Next, we tune hyperparameters such as tree depth and the minimum number of samples per split, using the validation set to compare settings. Finally, we evaluate the model on the test set to check that it makes accurate predictions on new, unseen data.
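A 70/15/15 split like this one can be sketched with the Python standard library. The function name, the fixed seed, and the stand-in records are assumptions for illustration.

```python
import random


def split_dataset(records, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle a copy of the records, then cut it into train/validation/test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test


# Stand-in for 200 prepared patient records.
records = list(range(200))
train, val, test = split_dataset(records)
print(len(train), len(val), len(test))
```

Shuffling before splitting avoids accidental ordering effects, such as all records from one hospital landing in the test set. For data with a time component or repeated visits per patient, a random split may leak information, so a split by date or by patient is often a safer choice.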


6- Evaluation, Deployment, and Monitoring:

  • Assessing Performance: Evaluating the model on unseen data using metrics such as accuracy, precision, recall, and the F1-score, which combines precision and recall into a single number.
  • Generalization: Ensuring the model generalizes well to new, unseen data and does not overfit. Overfitting is when a model performs well on training data but poorly on new data.
  • Deployment and Monitoring: Deploying the trained model into production systems and regularly monitoring its performance. This includes retraining with new data and updating the model as needed to maintain accuracy and relevance.
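The evaluation metrics named above can be computed directly from a model's predictions. The sketch below assumes binary labels (1 for readmitted, 0 for not readmitted) and uses made-up predictions.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Precision: of the patients flagged as positive, how many really were?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the truly positive patients, how many did the model catch?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test-set labels and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Running the same function on the training set and on the test set gives a quick overfitting check: a large drop from training to test scores suggests the model has memorized the training data.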



The steps involved in building AI systems and the role of data can vary based on multiple factors. For instance, Large Language Models (LLMs) like GPT (Generative Pre-trained Transformer) use vast amounts of internet text for pretraining, human-provided question-answer pairs for supervised fine-tuning, human feedback for reward modeling, and continuous feedback loops for reinforcement learning. Check out my detailed explanation of how GPT was created in this article:

How GPT Was Created



Examples of Data-Powered AI Applications:

Example 1: Healthcare:

AI models use patient data to predict disease outbreaks, personalize treatments, and improve diagnostics. For instance, an AI system can analyze electronic health records and genetic data to identify patterns associated with specific diseases, leading to earlier detection and better treatment plans.

Example 2: Retail:

Retailers analyze customer data to optimize inventory, personalize marketing, and enhance the shopping experience. A retailer might use transaction data and browsing history to recommend products to customers, increasing sales and customer satisfaction.

Example 3: Finance:

Financial institutions leverage transaction data to detect fraud, assess credit risk, and provide investment recommendations. AI systems can analyze large volumes of transaction data in real time to identify suspicious activities and prevent fraud.

Example 4: Manufacturing:

AI-powered predictive maintenance systems use sensor data from machinery to predict failures and schedule maintenance before issues occur, reducing downtime and maintenance costs.

Example 5: Marketing:

By leveraging social media data and customer feedback, companies can develop highly targeted marketing campaigns. AI can analyze sentiment and trends to create personalized advertisements that resonate with specific audiences.



Preparing Your Company Data for the AI Era:

Data is the most precious asset of your company in the era of AI. Here's what every organization needs to do to prepare:


1. Show Executive Commitment

  • Develop and share a detailed AI strategy that aligns with the company’s goals.
  • Provide enough budget and staff for AI projects.
  • Senior leaders should actively support and engage in AI initiatives.


2. Promote a Data-Driven Culture

  • Offer training to help employees understand and use data effectively.
  • Break down departmental barriers and form cross-functional teams.
  • Recognize and reward employees who use data to drive results.


3. Invest in Modern Data Infrastructure

  • Replace old systems with modern, scalable cloud solutions and computing resources.
  • Implement platforms that help integrate, store, and access data across the organization.
  • Use AI tools to automate data cleaning and management tasks.


4. Set Up Strong Data Governance

  • Establish rules for data usage, privacy, and security.
  • Assign individuals or teams to maintain data quality and oversee governance.
  • Regularly check and audit data practices to ensure compliance and improvement.


5. Connect Data Silos

  • Create centralized data repositories to make data accessible across the organization.
  • Implement protocols and platforms for easy and secure data exchange.
  • Use data and content management solutions that support interoperability.
  • Build a knowledge graph to connect your data semantically.


6. Build an AI-Skilled and Data-Skilled Workforce

  • Recruit data scientists, AI engineers, and analysts.
  • Offer continuous learning opportunities for employees.
  • Support the R&D team, find partners, and access new AI research.


7. Prioritize Ethical AI Practices

  • Set clear rules for AI use, ensuring transparency and fairness.
  • Regularly review the effects of AI on employees and customers.
  • Ensure AI projects consider diverse perspectives and avoid biases.


8. Keep Monitoring and Adapting

  • Keep up with the latest AI advancements and industry trends.
  • Encourage innovation and experimentation across teams.
  • Regularly review and adjust AI initiatives based on performance.


The journey from data to AI involves crucial steps that highlight the essential role of data in AI development. By focusing on data teams, data collection, preparation, and ethical practices, organizations can ensure their AI systems are accurate and effective. As AI technology advances, maintaining data quality and integrity will be key for organizations to fully leverage the potential of AI for innovation and success.



To help you learn more, stay connected, and keep up to date, each article will feature key influencers to follow whose content is both relevant and insightful. Starting with:

  • Alex Freberg: Lead Data Analyst, Founder of Analyst Builder.
  • Ashleigh N. Faith: Leader in Semantics and Knowledge Graphs, Founder of IsA DataThing.
  • Barr Moses: Leader and Entrepreneur in Data Observability and Reliability.
  • Chad Sanderson: Data Leader, Entrepreneur, and Leader of Data Quality Camp.
  • Daliana Liu: Data Science Coach, Host of The Data Scientist Show.
  • Joe Reis: Instructor and Author in Data Engineering and Architecture.
  • Jon Krohn: Leader in Data Science, Host of SuperDataScience.
  • Zach Wilson: Leader in Data Engineering, Founder of DataExpert.io.

Feel free to mention other influencers and top voices in the discussion section below.


In this Zero to Hero: Learn AI Newsletter, we will publish one article per week. Next week, we'll introduce machine learning. Check out the plan here:

AI Learning Paths: What to Learn and What's the Plan?

Share your thoughts and suggestions. Join us in shaping and sharing this learning journey.

