LUNG CANCER PREDICTION (CNN APPROACH)
Santiago Reyes Chávez
GCID AIOps Engineer (SRE) at SAP | Master's Student in Applied Artificial Intelligence at Tecnológico de Monterrey | Innovation and Development Engineer | Specialization in Renewable Energy and Advanced Data Analytics
Contents
Introduction:
About The Data:
Approach to Solve The Problem:
Exploratory Data Analysis (EDA):
Approach To Transform Data:
Data Engineering:
Data Separation:
Defining our Metrics:
First Model:
Class Labeling for the First Model:
First Batch of Data:
Validation Results:
Second Batch of Data:
Validation Results:
Best Batch Selection for Model One:
Test Results for Model One:
Second Model:
Discussion:
Conclusion:
References:
Introduction:
In this report, we build a model to analyze computed tomography (CT) scans of lungs. The dataset groups the scans into three categories: benign cancer, malignant cancer, and normal lungs.
In clinical practice, radiology departments use CT images to identify abnormalities, in our case in the lungs. Cancer is one of the diseases with the highest mortality rate in the world, and lung cancer is among the most harmful because it attacks a vital system. In pathology, lung cancer can be divided into the following sections:
With these four classifications of lung cancer defined, radiologists can identify cancerous cases, which are then followed by further studies such as biopsies depending on the case.
Our data comes from an oncology hospital center in Iraq, where CT scans were collected in 2019 over a period of 3 months from patients diagnosed with various cases of cancer as well as patients without cancer.
Throughout our project, we will seek ways to train a neural network to extract the relevant characteristics from the images and predict whether a patient has malignant cancer, benign cancer, or normal lungs.
About The Data:
The data are images obtained online from the IQ-OTH/NCCD dataset, which was created for studies whose hypothesis was the prediction of different categories of cancer. Since the data are labeled as Normal, Benign, and Malignant, we decided to build a model focused on these three categories.
The data consists of 1,190 representative images from 110 patients, where 40 patients have been diagnosed as malignant, 15 as benign, and 55 as normal.
Approach to Solve The Problem:
As a collaborative project, we have decided to create two deep learning models that can correctly classify CT scans to determine if a patient has normal, benign, or malignant lungs.
A study of the images will be conducted, including data separation to prevent information leakage. Data transformations and data engineering will also be covered.
The two proposed models will be compared on the validation data, and the best one will then be evaluated on the test data.
Exploratory Data Analysis (EDA):
In this section, a study was conducted to understand the structure of the original images and to assess the balance among the different categories. Additionally, the need to address noise, if present in the images, was evaluated.
The dataset contains images with dimensions of 512px by 512px.
As shown, the images are small enough that scaling them down to a smaller size should not cause a significant loss of detail or data quality.
A class balance study was conducted to determine whether augmented or synthetic data would be necessary. A significant amount of data is available for all three categories.
We also evaluated whether the images contain noise. Since we focus specifically on the lungs, other body parts or objects in the image could be detrimental to the model.
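A balance check of this kind can be sketched as follows. This is a minimal illustration, not the code used in the study: the `class_balance` helper is a hypothetical name, and the example input reuses the patient-level counts quoted earlier (40 malignant, 15 benign, 55 normal) as a stand-in, because the report does not list per-image counts.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the data and the majority/minority ratio."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {name: count / total for name, count in counts.items()}
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return shares, imbalance_ratio


# Patient-level counts from the dataset description (assumed stand-in
# for the per-image counts, which are not given in the report).
labels = ["Malignant"] * 40 + ["Benign"] * 15 + ["Normal"] * 55
shares, ratio = class_balance(labels)
for name, share in sorted(shares.items()):
    print(f"{name}: {share:.1%}")
print(f"majority/minority ratio: {ratio:.2f}")
```

A ratio close to 1 would indicate balanced classes; larger values suggest that augmentation or resampling may be worth considering.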
As can be seen, the equipment where the patients lie down is visible, as well as what appears to be the patient's clothing or blanket when lying on the equipment.
Approach To Transform Data:
Regarding the transformation of the images, we opted to rescale every image to a standard size of 244 x 244 pixels to simplify image processing.
Regarding image noise, we chose two approaches: leaving the image unchanged, and adjusting the saturation of the image to reduce the level of noise caused by the machine or the patients' clothing.
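The two transformations can be sketched with the standard library alone. This is an illustrative sketch, not the pipeline used in the study: the helper names are hypothetical, nearest-neighbour resampling is an assumed choice, and the saturation and brightness factors are placeholder values. An image is represented here as a list of rows of 8-bit RGB tuples.

```python
import colorsys


def resize_nearest(pixels, new_h, new_w):
    """Nearest-neighbour resize of a 2-D grid of pixels (list of rows)."""
    old_h, old_w = len(pixels), len(pixels[0])
    return [
        [pixels[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


def adjust_pixel(rgb, saturation_factor, brightness_factor):
    """Scale the HSV saturation and value of one 8-bit RGB pixel."""
    r, g, b = (channel / 255.0 for channel in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    s = min(1.0, s * saturation_factor)
    v = min(1.0, v * brightness_factor)
    return tuple(round(channel * 255) for channel in colorsys.hsv_to_rgb(h, s, v))


def transform(pixels, size=244, saturation_factor=2.0, brightness_factor=0.5):
    """Resize to size x size, then adjust saturation and brightness.

    The default factors are assumptions chosen for illustration only.
    """
    resized = resize_nearest(pixels, size, size)
    return [
        [adjust_pixel(p, saturation_factor, brightness_factor) for p in row]
        for row in resized
    ]


# Hypothetical uniform gray 512 x 512 image standing in for a CT slice.
scan = [[(120, 120, 120)] * 512 for _ in range(512)]
out = transform(scan)
print(len(out), len(out[0]))
```

Note that a pure gray pixel has zero saturation, so in this sketch the saturation factor only changes pixels that carry some color.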
For image size, we have decided on:
For image noise:
Data Engineering:
Data engineering allows us to balance the data or extract new features from the images to help the model generalize better. If our goal is achieved without it, no further action will be taken. Otherwise, we propose data augmentation by randomly modifying the training images with changes such as rotation and saturation adjustments, among others.
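The proposed augmentation could look roughly like the sketch below. It is an assumption-laden illustration rather than the study's implementation: the function names are hypothetical, images are plain 2-D lists of 8-bit grayscale values, and the specific changes (90-degree rotations, a horizontal flip, and brightness scaling between 0.8 and 1.2) are stand-ins for the "rotation, saturation adjustments, among others" mentioned above.

```python
import random


def rotate90(pixels):
    """Rotate a 2-D grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*pixels[::-1])]


def flip_horizontal(pixels):
    """Mirror every row left to right."""
    return [row[::-1] for row in pixels]


def jitter_brightness(pixels, factor):
    """Scale 8-bit grayscale intensities, clamping to the 0..255 range."""
    return [[max(0, min(255, round(p * factor))) for p in row] for row in pixels]


def augment(pixels, rng):
    """Apply a random rotation, an optional flip, and a brightness change."""
    for _ in range(rng.randrange(4)):
        pixels = rotate90(pixels)
    if rng.random() < 0.5:
        pixels = flip_horizontal(pixels)
    return jitter_brightness(pixels, rng.uniform(0.8, 1.2))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
image = [[0, 64], [128, 255]]
augmented = augment(image, rng)
print(augmented)
```

Augmentation of this kind would be applied to the training images only, so that the validation and test sets remain untouched.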
Data Separation:
A random split was performed on the dataset, dividing the data into training, validation, and test sets.
The training set provides the model with enough information to learn from the images and generalize. The validation set is used to check this generalization, which allows us to modify the neural network architecture and to identify which dataset yields better performance for that architecture.
Once the model's architecture has been implemented and improved, the best combination of data and architecture will be selected for a final test on the test data. This final test concludes the investigation and determines whether the model generalized correctly.
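A random three-way split can be sketched as follows. The 70/15/15 proportions and the fixed seed are assumptions made for illustration, since the report does not state the ratios it used, and `split_dataset` is a hypothetical helper name.

```python
import random


def split_dataset(items, train_pct=70, val_pct=15, seed=42):
    """Shuffle items and split them into disjoint train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains after train and val
    return train, val, test


image_ids = range(1190)  # one id per image in the dataset
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))  # 833 178 179
```

When several slices come from the same patient, splitting by patient id instead of by image is the stricter way to keep near-identical slices from appearing on both sides of the split.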
Defining our Metrics:
Since this is a classification problem with a focus on health, it was determined that the recall metric would be appropriate for identifying cases. However, within this context and objective, the priorities are weighted as follows:
The reason for prioritizing recall is to correctly identify positive cases, particularly those with higher priority. For ethical reasons and to ensure practical quality of the model's use, it is preferable to identify a greater number of positive cases, even if it means increasing the rate of false positives in the data.
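For reference, per-class recall is the number of correctly predicted cases of a class divided by the number of actual cases of that class. The sketch below computes it from two label lists; the function name and the example labels are hypothetical and are not results from the study.

```python
def per_class_recall(y_true, y_pred, classes):
    """Recall per class: true positives / all actual members of the class."""
    recall = {}
    for cls in classes:
        actual = sum(1 for t in y_true if t == cls)
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recall[cls] = true_pos / actual if actual else 0.0
    return recall


# Hypothetical labels for illustration only (not results from the study).
y_true = ["malignant", "malignant", "benign", "normal", "normal", "benign"]
y_pred = ["malignant", "normal", "benign", "normal", "normal", "malignant"]
scores = per_class_recall(y_true, y_pred, ["benign", "normal", "malignant"])
print(scores)  # {'benign': 0.5, 'normal': 1.0, 'malignant': 0.5}
```

A missed malignant case lowers malignant recall, while a false alarm for malignant does not, which is why recall matches the priority described above.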
First Model:
As mentioned earlier, two approaches were taken for this model. In the first, the images were transformed by increasing the saturation by 100% and decreasing the brightness by 100% to highlight the features in the images; in the second, the original images were used. The variant with the best performance was selected.
Class Labeling for the First Model:
A number was assigned to each image class as follows:
0 - Benign Cancer Classification
1 - Normal Lungs Classification
2 - Malignant Cancer Classification
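In code, this labeling amounts to a small lookup table. The sketch below uses the numbering above; the class-name strings and helper names are assumptions made for illustration.

```python
# Numbering taken from the list above; the string keys are assumed names.
CLASS_TO_INDEX = {"Benign": 0, "Normal": 1, "Malignant": 2}
INDEX_TO_CLASS = {index: name for name, index in CLASS_TO_INDEX.items()}


def encode_labels(names):
    """Convert class names to the integer ids used as training targets."""
    return [CLASS_TO_INDEX[name] for name in names]


def decode_labels(indices):
    """Convert predicted integer ids back to class names."""
    return [INDEX_TO_CLASS[index] for index in indices]


print(encode_labels(["Normal", "Malignant", "Benign"]))  # [1, 2, 0]
print(decode_labels([2, 0]))  # ['Malignant', 'Benign']
```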
First Batch of Data:
The first dataset contains images with transformations made to saturation and brightness. They were scaled to 244px to facilitate filtering in the convolutional stages. The results were as follows:
Validation Results:
As observed, the model fit the training data almost perfectly; however, the large number of epochs creates a risk of overfitting. Nevertheless, for this model the best dataset will be selected for testing, unlike for the second model.
Second Batch of Data:
In the second dataset, the same architecture was used; however, the original data was utilized without any transformation in saturation and brightness.
Validation Results:
Best Batch Selection for Model One:
With the transformations applied, the first dataset achieved better results on the validation data. This indicates that the transformations significantly helped generalization by removing noise and focusing on the most important region, the lungs.
We therefore selected the first transformation methodology for evaluation on the test data, using the same architecture.
Test Results for Model One:
Once the best dataset was selected, the model was evaluated on the test set, yielding the following results:
The model generalizes well, considering that the lowest recall value is 70%, reported for malignant cancer. However, the model correctly identified all 10 positive cases of malignant cancer, with an overall accuracy of 87%. These are positive results for the first model architecture and show good performance in generalizing from the data.
Second Model:
In the second model, we wanted to analyze the features hidden in the images using only a deep learning model. The image is first normalized and then passed through five convolutional layers, five pooling layers, and two fully connected layers. The first two convolutional layers use a kernel size of 3 for finer feature extraction, and the last three use a kernel size of 5 to extract higher-level features with a larger receptive field. A pooling layer follows each convolutional layer to reduce the size of the feature map. After the convolutions, two fully connected layers summarize the extracted features and perform the final image classification. Training uses the RMSprop optimizer with a sparse categorical cross-entropy loss. The data is divided into training and test sets with a split ratio of 0.2, and the model is trained on the resulting training set.
Finally, 10 images from each lung cancer category were randomly selected and tested manually. All of them were classified correctly, which supports the effectiveness of the model.
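To see how five convolution and pooling stages shrink the feature map, the sketch below applies the standard output-size formula to the kernel sizes described above. Several details are assumptions, because the report does not state them: a 244 x 244 input, stride 1, no padding, and 2 x 2 pooling with stride 2. The function names are hypothetical.

```python
def conv_output(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution (standard output-size formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output(size, pool=2):
    """Spatial size after non-overlapping pooling (stride equal to pool size)."""
    return size // pool


def feature_map_sizes(input_size, kernels, pool=2):
    """Size of the feature map after each convolution + pooling stage."""
    sizes = []
    size = input_size
    for kernel in kernels:
        size = conv_output(size, kernel)
        size = pool_output(size, pool)
        sizes.append(size)
    return sizes


# Kernel sizes from the text: two layers with 3, then three layers with 5.
KERNELS = [3, 3, 5, 5, 5]
print(feature_map_sizes(244, KERNELS))  # [121, 59, 27, 11, 3]
```

Under these assumptions the last pooling layer outputs a 3 x 3 map per channel, which is what the first fully connected layer would receive after flattening.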
Discussion:
The results from the two models developed for this study underscore the importance of rigorous data preprocessing and model architecture design in achieving high accuracy in medical image classification tasks. Several key factors emerged as critical in influencing the performance of the models:
In conclusion, the study demonstrates the feasibility of using deep learning models to accurately classify lung CT scans, with potential applications in early cancer detection and diagnosis. Future work could focus on further optimizing the models, exploring additional preprocessing techniques, and validating the models on larger and more diverse datasets to enhance their robustness and applicability in real-world clinical settings.
Conclusion:
The development and evaluation of deep learning models for lung cancer prediction from CT scans have yielded promising results. The two models created for this study successfully classified CT images into normal, benign, and malignant categories with high accuracy. Key takeaways from this study include:
Overall, the study highlights the potential of deep learning in advancing medical diagnostics, particularly in the early detection of lung cancer. Future research should focus on further refining these models, exploring additional data augmentation techniques, and validating the findings on larger datasets to confirm their efficacy and reliability in diverse clinical environments.
References:
1 - H. F. Al-Yasriy, M. S. Al-Husieny, F. Y. Mohsen, E. A. Khalil, and Z. S. Hassan, "Diagnosis of Lung Cancer Based on CT Scans Using CNN," IOP Conference Series: Materials Science and Engineering, vol. 928, 2020.
2 - H. F. Kareem, M. S. Al-Husieny, F. Y. Mohsen, E. A. Khalil, and Z. S. Hassan, "Evaluation of SVM performance in the detection of lung cancer in marked CT scan dataset," Indonesian Journal of Electrical Engineering and Computer Science, vol. 21, no. 3, pp. 1731-1738, 2021, doi: 10.11591/ijeecs.v21.i3.pp1731-1738.
3 - Alyasriy, Hamdalla; AL-Huseiny, Muayed (2023), "The IQ-OTH/NCCD lung cancer dataset", Mendeley Data, V4, doi: 10.17632/bhmdr45bh2.4
4 - Wang, S., Dong, L., Wang, X., & Wang, X. (2020). Classification of Pathological Types of Lung Cancer from CT Images by Deep Residual Neural Networks with Transfer Learning Strategy. Open medicine (Warsaw, Poland), 15, 190-197. https://doi.org/10.1515/med-2020-0028