Leveraging VGG16 and Nearest Neighbors for Efficient Image Classification and Similarity Retrieval: A case study on Outdoor Place Recognition
Introduction
The scope of this project lies within the field of Image Recognition, an essential subset of Machine Learning and Computer Vision. The main objective of this study is to design an image recognition model capable of identifying the location of a queried image by comparing it to a set of reference images within a predefined library. The images are categorized into three types:
1. day_left
2. day_right
3. night_right
The GardensPointWalking dataset is a collection of images captured on the Gardens Point Campus of the Queensland University of Technology (QUT) in Brisbane, Australia. It is an outdoor dataset typically used in visual place recognition research, including visual navigation and visual localization tasks. The images were recorded under varied conditions, such as different times of day and different environmental conditions, to create a wide-ranging dataset that simulates real-world challenges for image recognition models. The diversity in lighting, weather, shadows, and scene layout makes this dataset ideal for testing the robustness and adaptability of visual place recognition algorithms.
In this dataset, the images are organized into categories based on the time and direction of recording, as mentioned above. The paths followed while capturing the images are almost the same for all categories, so the challenge lies in identifying the minor differences between similar yet distinct images and classifying them accurately. This makes the dataset an excellent resource for developing and evaluating image recognition models in contexts where a model must distinguish between extremely similar images and understand the nuanced differences induced by changes in environmental conditions, potentially improving the adaptability and accuracy of such models under varied scenarios.
In the place recognition task, the basic functionality to be addressed is that the model is given an image as input and returns an image from its reference database that was captured at the same location. To accomplish this, the model leverages feature extraction, which uses pre-trained networks to extract meaningful features from the images, and then applies a distance calculation strategy to identify the image in the reference library closest to the query image.
The model relies on the VGG16 architecture, a deep learning model pre-trained on the ImageNet dataset, a large-scale dataset used in visual object recognition research that contains over 14 million hand-annotated images. The model can be switched to the MobileNet architecture with minimal adjustments, catering to different application requirements.
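As a brief illustration of this idea, the following sketch (assuming TensorFlow/Keras is installed; the function name and image size are illustrative, not the project's exact code) loads VGG16 without its classification head and turns a single image into a feature vector; swapping in MobileNet would only change the import and its matching preprocessing function.

import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing import image

# Backbone without the ImageNet classification head; MobileNet(weights="imagenet",
# include_top=False) could be dropped in here with minimal changes.
backbone = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

def extract_feature_vector(img_path):
    # Load and resize the image to VGG16's expected input size
    img = image.load_img(img_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    # Run the backbone and flatten the resulting feature maps into one vector
    return backbone.predict(x, verbose=0).flatten()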
Finally, the output is a prediction indicating the class to which the query image most likely belongs. Such image recognition models have vast applications, ranging from image search engines and surveillance to advanced driver-assistance systems in automobiles.
Related work
Place recognition has long been a subject of interest in the field of Computer Vision, with numerous methodologies proposed to address this complex problem. In recent years, Deep Learning, a subset of Machine Learning, has significantly influenced these solutions, with Convolutional Neural Networks (CNNs) in particular leading the charge. CNNs, designed to automatically and adaptively learn spatial hierarchies of features, have shown exemplary performance in image recognition tasks.
Several significant research studies have applied these networks to place recognition, including "NetVLAD: CNN Architecture for Weakly Supervised Place Recognition," which proposed a new CNN architecture with a specific focus on place recognition. This architecture leverages the ability of CNNs to extract highly discriminative information from raw pixel data while demonstrating robustness to striking appearance changes. However, the use of pre-trained models such as VGG16 and MobileNet for feature extraction in place recognition has also gained significant traction in the research community. These models are beneficial due to their training on large-scale, diverse datasets such as ImageNet. This pre-training enhances their capability to extract high-level features from images, proving particularly effective in place recognition.
Regarding specific methodologies, the use of the Nearest Neighbors algorithm for classification, similar to this project's approach, is widely seen in place recognition systems. Its efficiency in handling high-dimensional data, combined with its ease of implementation, makes it a preferred choice for this type of task.
However, the problem of place recognition continues to present several challenges, predominantly due to the dynamic nature of our environments. From fluctuating illumination conditions to infrastructural changes, numerous factors pose robustness challenges to these systems. This continues to stimulate ongoing research in the field, driving the emergence of novel methodologies capable of adapting to these changing circumstances.
My intention in this project is to apply Convolutional Neural Networks to an image recognition task. Specifically, the implementation relies on the VGG16 model, introduced by the Visual Geometry Group at the University of Oxford and designed for image classification and recognition tasks. VGG16 provides excellent results due to its depth: it contains 16 weight layers that contribute to learning features, enabling it to capture more complex features from images with excellent performance. This architecture has been widely adopted in various applications in the literature, reflecting its effectiveness in image recognition tasks.
To tackle the problem of classifying the given images into three categories, the model uses the softmax activation function in its output layer, a common choice for multi-class classification problems. Along with that, the GardensPointWalking dataset is split into training and test subsets, and the project uses the K-Nearest Neighbors (KNN) algorithm as an efficient and easy-to-implement method. After reshaping the training images into a 2D array, the KNN model is fitted so that it can retrieve images similar to a given query image based on Euclidean distance, a usage well established for image retrieval systems in the literature.
Methodology
The rapid advancements in the field of Neural Networks and artificial intelligence have paved the way for complex tasks such as image classification and recognition. Convolutional Neural Networks, a class of deep learning models, have proven to be highly effective in handling such tasks and have been increasingly incorporated in a variety of applications including, but not restricted to, autonomous vehicles, augmented reality, and healthcare. However, training these deep learning models from scratch demands an extensive amount of data along with vast computational power and can be time-consuming. This limitation instigated the development of transfer learning, which allows the use of models pre-trained on large datasets and the reuse of the acquired knowledge on a different but related problem. This study employs one such transfer learning technique using a pre-trained VGG16 model, a widely recognized model in the machine learning community for its effectiveness in image-based tasks.
The methodology begins with importing all the necessary Python libraries and modules, including the required machine learning libraries for image processing and machine learning algorithms. Importing these modules is essential because they contain the functionality needed to load, manipulate, and design the image-handling machine learning model used in this analysis; this is done just after drafting the step-by-step plan of the project. After all the modules have been imported, a critical step in the image classification process is to initialize the pre-trained VGG16 model. VGG16 is a convolutional neural network model that is particularly known for its excellent performance in image recognition tasks. In my case, the model is initialized without the top classification layer, because the final dense layer of the original VGG16 model is specific to the categorization of the 1000 classes of the ImageNet dataset, which I do not require for this task.
The next step involves fine-tuning the base VGG16 model to specialize it for the given task. Layers are added to the imported model using the functional API provided by TensorFlow's Keras. I have added a Flatten layer to convert the feature maps into a one-dimensional vector, followed by dense layers acting as fully connected layers. A final Dense layer with a softmax activation function is used as the output layer to predict the probability of each of the three classes of the dataset. These added layers, unlike the pre-trained layers of VGG16, are trained from scratch during model training. The model is then compiled using the Adam optimizer, chosen for its efficiency in handling sparse gradients on noisy problems. To guide the optimization process, a loss function is defined; sparse categorical cross-entropy is used because it is suitable for multi-class classification problems.
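A minimal sketch of this head is shown below, assuming three classes and 224x224 RGB inputs; the size of the intermediate dense layer (256) is an assumption rather than the project's exact configuration.

from tensorflow.keras import Model, layers
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the pre-trained convolutional weights frozen

x = layers.Flatten()(base.output)                    # feature maps -> 1-D vector
x = layers.Dense(256, activation="relu")(x)          # fully connected layer (size assumed)
outputs = layers.Dense(3, activation="softmax")(x)   # one probability per class

model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])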
Based on all of this and following the model's configuration, the training data needs to be prepared. For this, a function named 'load_images_and_extract_features' is created. This function reads all the images from a given directory, pre-processes and resizes them to match the input size of VGG16, extracts the features of the images using the VGG16 model, and appends them to an array while keeping track of each image's label.
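A hedged sketch of what such a helper might look like, assuming one sub-folder per class under the dataset directory; the exact signature and target size are assumptions based on the description above.

import os
import numpy as np
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.vgg16 import preprocess_input

def load_images_and_extract_features(base_dir, feature_model, target_size=(224, 224)):
    features, labels = [], []
    class_names = sorted(os.listdir(base_dir))   # e.g. day_left, day_right, night_right
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(base_dir, class_name)
        for fname in sorted(os.listdir(class_dir)):
            img = image.load_img(os.path.join(class_dir, fname), target_size=target_size)
            x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
            features.append(feature_model.predict(x, verbose=0)[0])  # extracted features
            labels.append(label)                                      # keep track of the label
    return np.array(features), np.array(labels)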
To test the model's robustness, the dataset is split into a training and a test set using an 80/20 partition, meaning that 80% of the data is used for training the model and the remaining 20% is used for testing the model's performance. A technique called early stopping is implemented in the training phase: it monitors the validation loss and stops the training process if the model does not show an improvement over consecutive iterations, preventing overfitting. The model is then trained on the prepared training data for 10 epochs, and its performance on each epoch is evaluated using the test set. The training process involves fine-tuning the weights of the added layers to minimize the loss function while keeping the weights of the pre-trained VGG16 layers frozen.
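The split and training loop could look roughly like the sketch below; the variable names X and y, the patience value, and the random seed are assumptions rather than the project's exact settings.

from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping

# X holds the prepared image inputs and y the integer class labels (0, 1, 2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)            # 80/20 partition

early_stop = EarlyStopping(monitor="val_loss", patience=3,
                           restore_best_weights=True)  # stop when val loss stops improving

history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=10,
                    callbacks=[early_stop])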
To compare the model's performance on the training and the test sets, an accuracy plot is generated. In addition, the precision and recall metrics for each of the three classes are plotted to provide a holistic view of the model's performance.
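A minimal plotting sketch, assuming history is the object returned by model.fit above and Matplotlib is available.

import matplotlib.pyplot as plt

plt.plot(history.history["accuracy"], label="training accuracy")
plt.plot(history.history["val_accuracy"], label="validation accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.title("Training vs. validation accuracy")
plt.legend()
plt.show()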
The main part of the code also includes an implementation of the K-Nearest Neighbors model. The KNN model is trained to find the most similar image in the training set for a given test image. This is done as an additional step to showcase the model's ability to retrieve similar instances from the training set.
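A sketch of this retrieval step using scikit-learn's NearestNeighbors; the flattening to 2-D is discussed in the Experiments section, and k=1 (the single closest image) is an assumption here.

from sklearn.neighbors import NearestNeighbors

# Flatten each image (or feature map) into a single row
X_train_flat = X_train.reshape(len(X_train), -1)
X_test_flat = X_test.reshape(len(X_test), -1)

knn = NearestNeighbors(n_neighbors=1, metric="euclidean")
knn.fit(X_train_flat)

# Index of the training image closest to the first test image
distances, indices = knn.kneighbors(X_test_flat[:1])
most_similar_index = indices[0][0]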
Finally, the predictions made by the model on the test set are evaluated using a confusion matrix, a classification report, and precision-recall curves. These evaluation metrics provide insightful information about the model's performance and its ability to correctly classify images. In conclusion, this methodology demonstrates a robust way of implementing a deep learning model for a multi-class image classification task using pre-trained models, ensuring accurate and efficient performance with a minimal amount of computational resources.
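The evaluation could be produced with scikit-learn as in the sketch below, assuming y_test holds the integer labels and model is the trained network from the earlier sketches.

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

y_pred = np.argmax(model.predict(X_test), axis=1)   # class with the highest probability

print(confusion_matrix(y_test, y_pred))
print(classification_report(
    y_test, y_pred, target_names=["day_left", "day_right", "night_right"]))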
Code
Feel free to download the full code from the following link: https://github.com/minidu97/Place-Recognition.git
Experiments and Results
In this experiment, a Convolutional Neural Network based model was used to classify images into three categories: 'day_left', 'day_right', and 'night_right'. The pre-trained VGG16 model was used as a base model with a few additional dense layers added to it. The Adam optimizer was used with sparse categorical cross-entropy as the loss function.
Selection of VGG16 Over MobileNet
At the beginning of the project, the main intention was to implement it using MobileNet rather than VGG16, but after going through several comparisons and related articles I found that VGG16 is well suited to this task. Here are some of the key points I came across:
- Complexity and richness of the features: VGG16 has a larger depth, represented by more layers (16 weight layers), compared to MobileNet. This depth often translates into a richer hierarchy of features and therefore better performance, which can be especially beneficial when dealing with high-resolution images, where more complex patterns and structures exist.
- Proven performance on large-scale datasets: VGG16 has shown exceptional performance on ImageNet, a very large and diverse dataset, which places it among the top-performing models for image classification tasks.
- Customizability: VGG16 is more flexible in the sense that it provides a more straightforward, intuitive structure, composed of a symmetrical and sequential architecture. This makes it easier to understand, manipulate, and modify according to specific project needs.
The model was trained on a dataset which was split into 80% training data and 20% testing data. The training process was regulated by monitoring the validation loss and using early stopping to avoid overfitting. The model was trained for a total of 10 epochs; this epoch count can be increased to search for a better solution.
Along with that, in this experiment I made another change by reshaping the X_train and X_test arrays from 4D to 2D, where each row corresponds to a single image and each column corresponds to a pixel value. The original shape of X_train and X_test is (num_images, width, height, channels). Here, 'num_images' represents the total number of images, 'width' and 'height' define the dimensions of each image, and 'channels' refers to the color channels of the image, which is typically 3 for RGB images.
However, the Nearest Neighbors model, which I aim to use, cannot work with this 4D shape. KNN requires a 2D input where each row represents a sample and each column represents a feature.
So, I used a reshape operation to flatten the 'width', 'height', and 'channels' dimensions into a single dimension, resulting in a 2D array. The "-1" in the reshape function makes NumPy automatically calculate the number of columns needed, and the reshaping operation unravels all the pixels of an image into a single row, so each pixel becomes a feature in the transformed array. This array now meets the input requirements of the KNN model, allowing KNN to be used for training and prediction, as sketched below.
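The reshape itself is a one-liner; the array shape in the sketch below is a made-up example rather than the dataset's actual dimensions.

import numpy as np

X_train = np.random.rand(120, 224, 224, 3)          # (num_images, width, height, channels)
X_train_2d = X_train.reshape(X_train.shape[0], -1)  # -1 lets NumPy infer 224*224*3 columns
print(X_train_2d.shape)                              # (120, 150528)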
A function to retrieve a similar image from the KNN model was added to the experiment to showcase the model's ability to identify similar trends or patterns within the dataset. This is particularly useful in image recognition systems as it not only enables the identification of images that are similar but also facilitates comparison based on learned features. The retrieval of similar images is an important aspect of various real-world applications such as object recognition, image categorization, and face recognition. Therefore, by including this function in the experiment, I was able to demonstrate the practical usefulness of the model in not just classifying, but also finding relationships and similarities in image data.
The results of the model's performance were illustrated through a plot of training accuracy vs validation accuracy across the epochs.
The K-Nearest Neighbors (KNN) algorithm with the Euclidean metric was also used on the reshaped training data to predict the labels of the test data for comparison with the results from the CNN-based model. The following results were obtained from this model.
The model accuracy graph showed a large gap between training and test accuracy at the beginning; towards the end, the test accuracy dropped slightly before resuming an increasing trend over the following epochs.
In the code I have added a confusion matrix, which is a table used to describe the performance of a machine learning model on a set of test data for which the true values are known. In my solution, there is a multiclass classification problem with 3 classes, so the confusion matrix is a 3x3 matrix. The classes are 'day_left', 'day_right', and 'night_right', respectively.
                 day_left   day_right   night_right
day_left            34           7            2
day_right            6          26            4
night_right          0           0           41
(rows: true class, columns: predicted class)
The first row, first column element (34) represents the True Positives for class 'day_left': 34 'day_left' images were correctly identified. The first row's second and third elements are the images that are 'day_left' but were predicted as 'day_right' (7) and 'night_right' (2), respectively. The second row, first column (6) represents the images that are 'day_right' but were predicted as 'day_left'. The second row, second column (26) gives the True Positives for class 'day_right': 26 'day_right' images were correctly identified. The last element in the second row (4) counts the images that are 'day_right' but were predicted as 'night_right'. The third row represents the same information for 'night_right': here, no 'night_right' images were misclassified and all 41 were correctly identified. Therefore, by looking at the confusion matrix, you can get a detailed understanding of how well the model performs for each class and where the misclassifications happen most often.
The classification report provided precision, recall, and F1-score for the three classes, summarizing how well the model performed on the image classification task.
- For the 'day_left' class, the precision is 0.85, suggesting that when the model predicts an image as 'day_left', it is correct 85% of the time. The recall, or sensitivity, is 0.79, meaning the model correctly identified 79% of the 'day_left' images. The F1-score, which is a harmonic mean of precision and recall, is 0.82 for this class.
- For the 'day_right' class, the precision is 0.79, meaning that when the model predicts an image as 'day_right', it is correct 79% of the time. The recall is 0.72, suggesting that the model correctly identified 72% of 'day_right' images. The F1-score for this class is 0.75.
- For the 'night_right' class, the precision is 0.87 and the recall is 1. This suggests that not only were all 'night_right' images correctly identified (recall=1), but when the model labeled an image as 'night_right', it was correct 87% of the time. The F1-score for 'night_right' is an impressive 0.93.
Overall, the model achieved 84% accuracy, indicating it classified the images correctly 84% of the time across all classes. The average F1-score, which can be interpreted as a balanced measure of the model's precision and recall, is also 84%. It is worth noting that the 'night_right' class has the highest F1-score, suggesting that the model performed best on this class. Conversely, 'day_right' has the lowest F1-score, indicating room for improvement in future iterations of the model.
The precision-recall curve plots Precision (Y-axis) against Recall (X-axis) for every possible classification threshold. Recall measures the ability of a model to find all the relevant cases within a dataset, while Precision expresses the proportion of the data points the model identifies as relevant that actually are relevant. The curve starts at the top-left (Recall=0, Precision=1) because, at a threshold of 1, the model predicts very few or no positive instances; any that are predicted are likely to be true positives, but a large number of true positives are not picked up.
As the recall increases from 0.2 to 0.7, precision starts to drop from 1.0 to 0.8. This decline happens because, as the threshold decreases, the model begins to classify more instances as positive, and inevitably some of these are false positives, causing precision to fall. In the diagram there is a rapid drop in precision from 0.7 to 0.3 when recall is between 0.8 and 1. This sharp reduction could be due to the model starting to classify many negative instances as positive (false positives), causing a drop in precision, while recall grows because the model is also predicting more true positives.
This behavior points out the trade-off between precision and recall. If I try to increase recall, I may end up increasing the number of false positives, thereby reducing precision. Similarly, if I try to increase precision, I could end up missing some positive cases, thereby reducing recall. The conclusion is that the choice of threshold depends on the relative importance of precision and recall for the specific problem.
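A per-class precision-recall curve can be drawn with scikit-learn as in the sketch below; using the predicted probabilities of class 0 ('day_left') in a one-vs-rest fashion is an assumption for illustration.

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

probs = model.predict(X_test)                        # per-class probabilities
precision, recall, thresholds = precision_recall_curve(
    (y_test == 0).astype(int), probs[:, 0])          # one-vs-rest for 'day_left'

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-recall curve ('day_left')")
plt.show()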
Lastly, a function was defined to retrieve similar images from the KNN model. It takes the first image from the test dataset and returns the most similar image from the corresponding location in the training set, as in the sketch below.
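A hedged sketch of such a retrieval helper, reusing the fitted NearestNeighbors model from the earlier sketch; the function name is an assumption, and the training images are assumed to be stored in a displayable 0-255 range.

import matplotlib.pyplot as plt

def retrieve_similar_image(query_flat, knn_model, train_images):
    # Find the index of the training image closest to the flattened query
    _, indices = knn_model.kneighbors(query_flat.reshape(1, -1))
    similar = train_images[indices[0][0]]
    plt.imshow(similar.astype("uint8"))
    plt.title("Most similar training image")
    plt.axis("off")
    plt.show()
    return similar

# Example: retrieve the closest match for the first test image
retrieve_similar_image(X_test_flat[0], knn, X_train)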
In conclusion, the experiments and results showed that the model performs promisingly, with potential for further improvement.
Conclusion
The purpose of this project was to design an image recognition model capable of accurate place recognition by identifying the location of a queried image by comparing it to a set of referenced images. Through extensive testing and experimentation using the GardensPointWalking dataset, the model was able to address the challenging task of distinguishing minor changes between similar yet distinct images, proving robust and adaptable to various visual place recognition tasks.
For development, the pre-trained VGG16 network was used, which significantly influenced the study and proved efficient in handling deep learning tasks due to its depth and excellent performance on large-scale datasets. Moreover, its ability to capture more complex features from images made it a favorable choice over MobileNet for the categorization and recognition of high-resolution images, and the various methodologies used in the processing and training stages revealed the potential of the model. Its capabilities were expanded with a Nearest Neighbors model, which enabled the identification and retrieval of similar images. This practical feature could be useful for myriad real-world applications, including object recognition, image categorization, and face recognition, among others.
Despite some initial fluctuations between the training and testing accuracy, the model showed positive trends in accuracy over time. The precision-recall curve used in the evaluation depicted the trade-off between precision and recall, which provided insight into how to balance these two metrics for a more accurate model.
In conclusion, the results obtained from the experiment were encouraging, with the model achieving an overall accuracy of 84% across all classes in categorizing images as 'day_left', 'day_right', and 'night_right'. Nevertheless, there is room for improvement, particularly in the classification of 'day_right' images. Future research could explore more sophisticated models or techniques to build on the promising results already obtained in this study and further improve the model's performance.