Ghanaian Food Vision model

Over the past few weeks, I delved into the field of computer vision, exploring the various neural network architectures used for tasks like object detection, segmentation, and classification. One paper that particularly caught my attention was "An Image is Worth 16x16 Words," the Google paper that introduced the Vision Transformer (ViT) architecture. It showed how computer vision models can learn from images in a way similar to how large language models (LLMs) process text sequences.

As I learned about these advancements, I became excited about building my own computer vision project: a food classification model that recognizes Ghanaian dishes, trained on the Devkyle Ghanaian food dataset from Hugging Face. For this, I leveraged ConvNeXt, a convolutional neural network (CNN) architecture that improves on traditional CNNs by adapting design principles from the Vision Transformer. In this article, I'll walk you through the entire process: why I chose ConvNeXt, the challenges I faced (like overfitting), and how I overcame them with data augmentation, early stopping, and learning rate scheduling.


The Inspiration: The Original FoodVision model

Before diving into my food classification model, I want to give a shout-out to the project that inspired me: FoodVision. This project, initially developed to classify just five food categories (sandwich, pizza, pasta, doughnuts, and burger), was built using the Vision Transformer (ViT) B16 model.

The original FoodVision was impressive for its simplicity and performance, focusing on demonstrating how Vision Transformers could be applied to image classification tasks. By leveraging ViT's attention mechanism, which breaks an image into small patches and treats each patch like a "word," it allowed the model to achieve high accuracy while capturing both local and global patterns in the images.

However, as exciting as FoodVision was, I wanted to take the idea a step further. Rather than classifying just five food items, I aimed to:

  1. Expand the number of food categories: My model now classifies 30 different Ghanaian food items instead of just five.
  2. Experiment with the ConvNeXt model: I wanted a balance between model complexity and scalability, so I chose ConvNeXt. ConvNeXts are CNNs that borrow design ideas from Vision Transformers, such as patch-based processing and modern training recipes, but they do not use self-attention. Instead, ConvNeXt relies on a technique called depth-wise convolution, which reduces the amount of computation needed while still maintaining accuracy (see the Medium article "ConvNeXt: A Family of Pure ConvNet Models").
  3. Overcome the limitations of training time and resource usage: By experimenting with different techniques like data augmentation, schedulers, and early stopping, I optimized the model for better performance over more extended training periods.
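To make the depth-wise convolution idea in point 2 concrete, here is a simplified sketch of a ConvNeXt-style block. It is an illustration, not the library implementation: layer scale and stochastic depth from the paper are omitted for brevity.

```python
import torch
from torch import nn

class ConvNeXtBlock(nn.Module):
    """Simplified ConvNeXt block: a 7x7 depthwise convolution mixes
    information spatially (standing in for ViT self-attention), then a
    pointwise MLP mixes channels."""

    def __init__(self, dim: int):
        super().__init__()
        # groups=dim makes the convolution depthwise: each channel is
        # filtered independently, far cheaper than a dense convolution.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # inverted bottleneck: expand
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)  # project back down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)      # (N, C, H, W) -> (N, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)      # back to (N, C, H, W)
        return residual + x

block = ConvNeXtBlock(dim=96)
out = block(torch.randn(1, 96, 56, 56))
print(out.shape)  # torch.Size([1, 96, 56, 56])
```

Because each channel gets its own 7x7 filter, the spatial mixing costs grow with the channel count rather than its square, which is the efficiency win over a dense convolution of the same kernel size.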

This new project builds on the foundation laid by FoodVision but with simplicity, scalability and a broader range of foods, making it more useful for real-world applications like restaurant menu scanning for tourists or food/calorie-tracking apps.


Why the Upgrade?

The motivation behind building a model for African cuisine was simple: to give a better representation of African cuisine in vision models and to test the boundaries of what could be achieved with more diverse data. While the original project was an excellent starting point, I knew that scaling up the model to classify more food categories would present unique challenges, such as:

  • Handling a broader range of visual diversity: More categories mean more variance in textures, shapes, and colors.
  • Managing overfitting and generalization: With more food items, the model needs to generalize well across different types of images, something I struggled with early on.

Through these upgrades, my goal was not just to recreate what FoodVision had done, but to enhance it, making the model more robust and scalable to real-world scenarios.


Key Improvements

  • More Categories: Instead of 5, the model now classifies 30 underrepresented food items, making it more versatile for food recognition tasks.
  • Enhanced Model Training: I implemented advanced techniques to combat overfitting and optimize training, like data augmentation, schedulers, and early stopping, which helped stabilize the model.
  • Better Deployment: While FoodVision was a great demo, deploying my version on Hugging Face Spaces ensures that anyone can interact with the model live, testing its ability to classify a wider range of food.



Dataset: Devkyle Ghanaian food dataset with 30 Classes

For this extended model, I worked with the Devkyle Ghanaian food dataset, a small collection of images featuring 30 different types of food. The Ghanaian food dataset provided a solid foundation for training, with diverse and challenging examples across all 30 categories. This diversity pushed the model to capture fine details, making it a great learning experience in balancing data diversity and model performance.


Dataset Details: Image Distribution

For this project, I organized my dataset into two subsets to ensure a well-rounded evaluation of the model's performance:

  • Training Set: 119 images – This set is used to train the model, allowing it to learn the features and characteristics of each food category.
  • Validation Set: 30 images – This set is used during training to tune the model's hyperparameters and make adjustments. It helps to assess how well the model generalizes on unseen data.

This structured approach to dataset distribution ensures that the model is adequately trained and validated, leading to a more reliable assessment of its performance in classifying the 30 food categories.
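The 119/30 split above can be reproduced with `torch.utils.data.random_split`. The tensors here are synthetic placeholders for the real images; only the split logic is the point.

```python
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader

# Placeholder dataset: 149 fake "images" with labels over 30 classes.
images = torch.randn(149, 3, 224, 224)
labels = torch.randint(0, 30, (149,))
dataset = TensorDataset(images, labels)

# Fixed generator so the 119/30 split is reproducible across runs.
train_set, val_set = random_split(
    dataset, [119, 30], generator=torch.Generator().manual_seed(42))

train_loader = DataLoader(train_set, batch_size=8, shuffle=True)
val_loader = DataLoader(val_set, batch_size=8)
print(len(train_set), len(val_set))  # 119 30
```

Seeding the generator matters with a dataset this small: a different split can noticeably change the validation numbers.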


Training the Model

I trained the model for 15 epochs, which took about 20 minutes, and over that time it steadily improved its ability to recognize the food categories. However, I encountered some challenges, particularly with overfitting: the model excelled at classifying the training data but struggled when presented with new, unseen images.
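The shape of that training loop can be sketched as below. The model and data here are tiny toy placeholders so the sketch runs in seconds; the real run used the ConvNeXt model on the food images.

```python
import torch
from torch import nn

def train(model, loader, epochs, lr=1e-3):
    """Plain training loop: one cross-entropy pass over the data per
    epoch, returning the average loss per epoch for monitoring."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    history = []
    for _ in range(epochs):
        total, n = 0.0, 0
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            total += loss.item()
            n += 1
        history.append(total / n)
    return history

# Toy stand-ins: 64 random "feature vectors", 30 classes.
X = torch.randn(64, 16)
y = torch.randint(0, 30, (64,))
loader = [(X[i:i + 8], y[i:i + 8]) for i in range(0, 64, 8)]
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 30))
history = train(model, loader, epochs=15)
print(len(history))  # 15
```

Tracking the per-epoch loss history is what later makes overfitting visible: training loss keeps falling while validation loss flattens or rises.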

Handling Overfitting: Data Augmentation and Early Stopping

To reduce overfitting and make the model generalize better, I used:

  • Data Augmentation: Flipping, rotating, zooming, and shifting images created more diversity in the dataset, forcing the model to learn generalized patterns.
  • Learning Rate Scheduler: This helped by gradually reducing the learning rate, ensuring the model didn’t make drastic updates to the weights in later stages of training.
  • Early Stopping: By stopping the training process when the validation loss stopped improving, I avoided over-training the model.


Performance Metrics:

Here’s how the model performed after 15 epochs of training:

  • Accuracy: 83.75%
  • Precision: 88.93%
  • Recall: 83.75%
  • F1 Score: 83.87%

Although the model isn’t perfect (about 83% accuracy overall), it performs well in predicting the majority of food items accurately and quickly! This is a significant improvement from my original model, and I plan to keep refining it to improve these metrics further.
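Metrics like those above can be computed with scikit-learn; the labels below are toy values, not the model's actual predictions, and `average="weighted"` is my assumption about how the multi-class precision/recall/F1 were aggregated.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Toy ground-truth vs. predicted class ids (not the real model outputs).
y_true = [0, 1, 2, 2, 3, 3, 1, 0]
y_pred = [0, 1, 2, 1, 3, 3, 1, 2]

acc = accuracy_score(y_true, y_pred)
# average="weighted" weights each class's score by its support.
prec = precision_score(y_true, y_pred, average="weighted", zero_division=0)
rec = recall_score(y_true, y_pred, average="weighted", zero_division=0)
f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
print(f"accuracy={acc:.2f}")  # accuracy=0.75
```

Note that with weighted averaging, recall equals accuracy in the multi-class single-label case, which matches the identical 83.75% figures above.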


How Can the Model Be Improved?

While the model is performing well, there are several ways to enhance it further:

  1. Expand the Dataset: The current dataset of 4 images per category is limited. Adding more images and using data augmentation (rotation, zoom, flips) can introduce more diversity, helping the model generalize better.
  2. Hyperparameter Tuning: The training parameters could be optimized further. Using techniques like grid search or learning rate scheduling can find the most efficient settings for faster and better learning.


Deployment on Hugging Face

Once the model was trained, I deployed it on Hugging Face to make it accessible to everyone. Hugging Face Spaces offers a user-friendly interface where anyone can test the model in real time. You can try it out here: Ghanaian Food Vision Hugging Face Space.


What's Next?

In the future, I plan to expand the model to classify even more African food items, making it even more versatile and useful. Additionally, I will explore other model architectures beyond ConvNeXt to see if they can improve accuracy and performance. This exploration could lead to discovering new techniques and strategies in food classification, ultimately enhancing the user experience and practical applications of the model.


Conclusion

It was a rewarding experience to build an African food classification model. It taught me how to tackle common machine learning challenges, such as overfitting and tuning hyperparameters for better performance.

If you’re working on similar projects or want to collaborate, or you just want to learn how I did all of it, reach out! I’d love to connect and hear your thoughts!
