LUNG CANCER PREDICTION (CNN APPROACH)

Contents

Introduction:

About The Data:

Approach to Solve The Problem:

Exploratory Data Analysis (EDA):

Approach To Transform Data:

Data Engineering:

Data Separation:

Defining our Metrics:

First Model:

Class Labeling for the First Model:

First Batch of Data:

Validation Results:

Second Batch of Data:

Validation Results:

Best Batch Selection for Model One:

Test Results for Model One:

Second Model:

Discussion:

Conclusion:

References:


Introduction:

In this report, we build a model to perform our own analysis of computed tomography (CT) scans of lungs. The scans in our dataset are grouped into three categories: benign tumors, malignant tumors, and normal lungs.

In the medical context, CT images are used by radiology departments to identify abnormalities, in our case in the lungs. Cancer is one of the diseases with the highest mortality rate in the world, and among cancers, lung cancer is one of the most harmful because it attacks a vital system. Pathologically, lung cancer is divided into the following types:

  • SCLC (Small Cell Lung Cancer)
  • NSCLC (Non-Small Cell Lung Cancer)

Fig 1: CT images of lung cancer pathological types: from left to right are ISA (adenocarcinoma in situ), SCLC (small cell lung cancer), SCC (squamous cell cancer) and IA (invasive adenocarcinoma). The green box areas are ROI areas of tumors. Ref [4]
Fig 2: ROI areas of four types of tumors, from left to right are ISA (adenocarcinoma in situ), SCLC (small cell lung cancer), SCC (squamous cell cancer) and IA (invasive adenocarcinoma). Ref [4]

These four pathological classifications of lung cancer allow radiologists to identify cancerous cases, after which further studies such as biopsies are conducted depending on the case.

Our data comes from an oncology hospital center in Iraq, where CT scans were collected over a period of three months in 2019 from patients diagnosed with various types of cancer as well as from patients without cancer.

Throughout our project, we will seek ways to train a neural network to extract these characteristics from the images and predict whether a patient has malignant cancer, benign cancer, or normal lungs.

About The Data:

The images come from the publicly available IQ-OTH/NCCD dataset, assembled for several studies on predicting different categories of lung cancer. Since the data is already structured into Normal, Benign, and Malignant cases, we decided to build a model that focuses on these three categories.

The data consists of 1,190 representative images from 110 patients, where 40 patients have been diagnosed as malignant, 15 as benign, and 55 as normal.

Fig 3 - Lung Image Classification Examples

Approach to Solve The Problem:

As a collaborative project, we have decided to create two deep learning models that can correctly classify CT scans to determine if a patient has normal, benign, or malignant lungs.

A study of the images will be conducted, including data separation to prevent information leakage. Data transformations and data engineering will also be covered.

The two proposed models will be compared, and the best one will be selected based on the validation data to test on the test data.

Exploratory Data Analysis (EDA):

In this section, a study was conducted to understand the structure of the original images and to assess the balance among the different categories. Additionally, the need to address noise, if present in the images, was evaluated.

The dataset contains images with dimensions of 512px by 512px.

Fig 4 - Image Size of All Classifications

As shown, the images are consistent in size and can be scaled down to a smaller resolution without losing significant detail or data quality.

A class-balance study was conducted to determine whether augmented or synthetic data would be necessary; a substantial number of images is available for all three categories.

Fig 5: Category Distribution of Images
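As a rough illustration of how the size and balance checks above could be performed, the sketch below counts the images per class and records the distinct image dimensions. The folder layout, class folder names, and file extension are assumptions, not details taken from the report.

# Minimal EDA sketch: images per class and distinct image sizes.
# Directory layout, folder names, and ".jpg" extension are assumptions.
from pathlib import Path
from collections import Counter
from PIL import Image

DATA_DIR = Path("IQ-OTH-NCCD")                    # hypothetical dataset root
CLASSES = ["Benign", "Malignant", "Normal"]       # assumed one folder per class

counts = Counter()
sizes = Counter()
for cls in CLASSES:
    for img_path in (DATA_DIR / cls).glob("*.jpg"):
        counts[cls] += 1
        with Image.open(img_path) as img:
            sizes[img.size] += 1                  # (width, height), e.g. (512, 512)

print("Images per class:", dict(counts))
print("Distinct image sizes:", dict(sizes))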

For noise analysis, we evaluated whether the images contain unwanted content. Since we are focusing specifically on the lungs, other parts of the body or external objects in the scans could be detrimental to the model.

Fig 6: Possible Noise in the Images By Clothes and CT Machine

As can be seen, the table on which the patients lie is visible, as well as what appears to be the patient's clothing or blanket.

Approach To Transform Data:

Regarding the transformation of the images, we opted to rescale them to 244 x 244 pixels to facilitate processing.

Regarding image noise, we compare two approaches: keeping the original images, and adjusting saturation and brightness to reduce the noise introduced by the machine or the patients' clothing (a preprocessing sketch follows after the lists below).

For image size, we have decided on:

  • 244px by 244px

For image noise:

  • Original
  • Saturation
  • Brightness
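A minimal sketch of these preprocessing variants, using TensorFlow's tf.image utilities; the exact saturation and brightness factors are illustrative assumptions, and a 3-channel RGB input is assumed.

# Sketch of the preprocessing variants listed above. Factor values are assumptions.
import tensorflow as tf

IMG_SIZE = (244, 244)                               # target size used in this report

def preprocess(image, variant="original"):
    """Resize a decoded RGB image tensor and optionally adjust saturation/brightness."""
    image = tf.image.resize(image, IMG_SIZE)
    image = tf.cast(image, tf.float32) / 255.0      # scale pixel values to [0, 1]
    if variant == "saturation":
        image = tf.image.adjust_saturation(image, 2.0)    # boost saturation (assumed factor)
    elif variant == "brightness":
        image = tf.image.adjust_brightness(image, -0.3)   # darken the image (assumed delta)
    return image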

Data Engineering:

Data engineering allows us to balance the data or extract new features from the images to help the model generalize better. If our goal is achieved without it, no further action will be taken; otherwise, we propose data augmentation, randomly modifying the training images with transformations such as rotations and saturation adjustments (a sketch follows below).
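Below is a small sketch of this fallback augmentation pipeline, using Keras preprocessing layers; the specific transformations and their ranges are illustrative assumptions.

# Sketch of a training-only augmentation pipeline. Ranges are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

augmentation = tf.keras.Sequential([
    layers.RandomRotation(0.1),                     # rotate up to +/-10% of a full turn
    layers.RandomFlip("horizontal"),
    layers.RandomZoom(0.1),
])

def augment(image, label):
    image = augmentation(image, training=True)
    image = tf.image.random_saturation(image, 0.8, 1.2)   # mild saturation jitter (RGB input assumed)
    return image, label

# train_ds = train_ds.map(augment)                  # applied to the training split only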

Data Separation:

A random split was performed on the dataset, dividing the data into training, validation, and test sets.

The training set provides the model with enough information to learn from the images and generalize. The validation set is used to check this generalization, allowing us to adjust the neural network architecture and identify which data variant performs best with that architecture.

Once the model's architecture is implemented and improved, the best combination of data and architecture will be selected to conduct a final test on the test data. This final test will conclude the investigations and determine whether the data was correctly generalized.
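A minimal sketch of such a random split is shown below. The 70/15/15 ratio, the stratification by class, and the placeholder arrays are assumptions; the report does not state the exact split proportions.

# Sketch of a random train/validation/test split. Ratios and placeholders are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

images = np.arange(1190)                            # stands in for the image array / file indices
labels = np.random.randint(0, 3, size=1190)         # stands in for the class ids

# First carve out the test set, then split the remainder into train and validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    images, labels, test_size=0.15, stratify=labels, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15 / 0.85, stratify=y_temp, random_state=42)

print(len(X_train), len(X_val), len(X_test))        # roughly 70% / 15% / 15%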

Defining our Metrics:

Since this is a classification problem with a focus on health, it was determined that the recall metric would be appropriate for identifying cases. However, within this context and objective, the priorities are weighted as follows:

  1. Malignant Cancer Detection (Recall) - Priority One
  2. Benign Cancer Detection (Recall) - Priority Two
  3. Normal Lungs Detection (Recall) - Priority Three

The reason for prioritizing recall is to correctly identify positive cases, particularly those with higher priority. For ethical and practical reasons, it is preferable to catch a greater number of positive cases, even if this increases the rate of false positives. A small sketch of the recall computation follows below.
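The sketch computes per-class recall with scikit-learn; the example labels and predictions are purely illustrative, and the class order follows the labeling introduced for the first model (0 = benign, 1 = normal, 2 = malignant).

# Sketch of the per-class recall evaluation. Example labels are illustrative only.
from sklearn.metrics import recall_score, classification_report

y_true = [2, 2, 0, 1, 2, 0, 1, 1, 2, 0]             # illustrative ground-truth labels
y_pred = [2, 2, 0, 1, 1, 0, 1, 1, 2, 0]             # illustrative model predictions

# Recall computed separately for each class, in label order 0, 1, 2.
per_class_recall = recall_score(y_true, y_pred, average=None, labels=[0, 1, 2])
print(per_class_recall)                              # e.g. [1.0, 1.0, 0.75]

print(classification_report(
    y_true, y_pred, target_names=["benign", "normal", "malignant"]))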

First Model:

As mentioned earlier, two approaches were compared for this model. In the first, the images were transformed by increasing the saturation by 100% and decreasing the brightness by 100% to highlight the relevant features in the images; the variant with the best performance was then selected.

Fig 7: First Architecture of the Proposed Model 1

Class Labeling for the First Model:

A number was assigned to each image classification as follows:

0 - Benign Cancer

1 - Normal Lungs

2 - Malignant Cancer
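For illustration, this mapping can be kept in a small lookup structure such as the one below; the dictionary name and helper function are assumptions.

# Sketch of the class-id to class-name mapping used by the first model.
CLASS_NAMES = {
    0: "Benign cancer",
    1: "Normal lungs",
    2: "Malignant cancer",
}

def id_to_name(class_id: int) -> str:
    return CLASS_NAMES[class_id]

print(id_to_name(2))                                 # "Malignant cancer"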

First Batch of Data:

The first dataset contains images with saturation and brightness transformations applied. They were scaled to 244 x 244 px to facilitate filtering in the convolutional stages. The results were as follows:

Validation Results:

Fig 8: Accuracy comparison with Train and Validation set with model 1 batch 1
Tab 1: Results of Model 1 Batch 1 in Validation Data

As observed, training accuracy on this data was nearly perfect; however, there is a risk of overfitting due to the large number of epochs. The best-performing data variant will nevertheless be selected for testing, unlike in the second model.

Second Batch of Data:

In the second dataset, the same architecture was used; however, the original data was utilized without any transformation in saturation and brightness.

Validation Results:

Fig 9: Accuracy Comparison with Train and Validation set with Model 1 Batch 2
Tab 2: Results of Model 1 Batch 2 in Validation Data

Best Batch Selection for Model One:

With the transformations applied, the first dataset achieved better generalization on the validation data. This indicates that removing noise and focusing on the most important region, the lungs, significantly helped the model generalize.

We therefore selected the first transformation methodology, with the same architecture, for the final evaluation on the test data.

Test Results for Model One:

Once the best dataset was selected, the model was evaluated on the test set, yielding the following results:

Tab 3: Results of Model 1 Batch 1 in Test Data

The data generalization is good overall: the lowest per-class recall is 70%, the model correctly identified all 10 positive cases of malignant cancer, and overall accuracy is 87%. These are positive results for the first model architecture, demonstrating good generalization performance.

Second Model:

Fig 10: Architecture of the Proposed Model 2

For the second model, we wanted to analyze the hidden features in the images using only a deep learning model. The images are first normalized, and the network consists of five convolutional layers, five pooling layers, and two fully connected layers. The first two convolutional layers use a kernel size of 3 for finer feature extraction, while the last three use a kernel size of 5 to extract higher-level features with a larger receptive field. A pooling layer follows each convolutional layer to reduce the size of the feature maps. After the convolutional stages, the two fully connected layers summarize the extracted features and perform the final image classification. Training uses the RMSprop optimizer with a sparse categorical cross-entropy loss. The data was split into training and test sets with a split ratio of 0.2, and the model was trained on the training portion. A sketch of this architecture follows below.
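The sketch below follows that description: five convolutional layers (kernel size 3 for the first two, 5 for the last three), a pooling layer after each, and two fully connected layers, compiled with RMSprop and a sparse categorical cross-entropy loss. The filter counts, dense-layer width, activation functions, and 244 x 244 x 3 input shape are assumptions, since the report does not specify them.

# Sketch of the Model 2 architecture described above. Filter counts, dense width,
# activations, and input shape are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(244, 244, 3)),
    layers.Rescaling(1.0 / 255),                     # normalize pixel values

    layers.Conv2D(32, kernel_size=3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, kernel_size=3, activation="relu", padding="same"),
    layers.MaxPooling2D(),

    layers.Conv2D(64, kernel_size=5, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, kernel_size=5, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, kernel_size=5, activation="relu", padding="same"),
    layers.MaxPooling2D(),

    layers.Flatten(),
    layers.Dense(128, activation="relu"),            # first fully connected layer
    layers.Dense(3, activation="softmax"),           # three output classes
])

model.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(...) would then be called on the 80% training portion of the data.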

Finally, 10 images from each category were randomly selected and checked manually; all were classified correctly (100% accuracy), supporting the effectiveness of the model.

Fig 11: Accuracy Comparison with Train and Validation set with Model 2
Fig 12: Loss of Train and Validation set with Model 2
Tab 4: Results of Model 2 in Test Data

Discussion:

The results from the two models developed for this study underscore the importance of rigorous data preprocessing and model architecture design in achieving high accuracy in medical image classification tasks. Several key factors emerged as critical in influencing the performance of the models:

  1. Data Preprocessing and Augmentation: The first model demonstrated that enhancing image quality through saturation and brightness adjustments significantly improves the model's ability to generalize from the training data. This transformation reduced noise and emphasized critical features in the CT scans, leading to better performance in identifying cancerous tissues.
  2. Model Architecture: The architecture of the neural networks played a crucial role in their performance. Model 1 utilized a series of convolutional and pooling layers followed by fully connected layers, showing a good balance between complexity and computational efficiency. Model 2 incorporated a deeper architecture with more convolutional layers and larger kernel sizes in the later stages, which helped in capturing finer details and improving accuracy. The use of RMSprop optimizer and sparse cross-entropy loss function in Model 2 contributed to its high accuracy, achieving perfect results in the test data.
  3. Evaluation Metrics: Recall was chosen as the primary metric due to the critical need to correctly identify positive cancer cases. This decision is ethically and practically important in a medical context where missing a positive case could have severe consequences. The models' high recall rates for malignant and benign cancers validate this approach, ensuring that the models are reliable for clinical applications.
  4. Data Balance and Noise Reduction: Ensuring a balanced dataset and addressing noise were crucial steps. The dataset, although balanced in terms of category distribution, still required augmentation techniques to handle the inherent noise in medical images, such as artifacts from patient movement or scanning equipment. The transformations applied in Model 1 were effective in mitigating these issues.
  5. Validation and Testing: The rigorous validation process highlighted the models' strengths and potential areas of overfitting. The first model, despite its high accuracy, showed signs of overfitting due to the large number of epochs. However, by selecting the best-performing dataset and architecture, the final test results confirmed the model's ability to generalize well to unseen data.

In conclusion, the study demonstrates the feasibility of using deep learning models to accurately classify lung CT scans, with potential applications in early cancer detection and diagnosis. Future work could focus on further optimizing the models, exploring additional preprocessing techniques, and validating the models on larger and more diverse datasets to enhance their robustness and applicability in real-world clinical settings.

Conclusion:

The development and evaluation of deep learning models for lung cancer prediction from CT scans have yielded promising results. The two models created for this study successfully classified CT images into normal, benign, and malignant categories with high accuracy. Key takeaways from this study include:

  1. Effective Data Preprocessing: Enhancing image quality through transformations such as saturation and brightness adjustments significantly improves model performance by reducing noise and emphasizing important features.
  2. Robust Model Architectures: Utilizing deep convolutional networks with appropriate kernel sizes and pooling layers enables the extraction of critical features from medical images, leading to high classification accuracy.
  3. Importance of Recall: Prioritizing recall in the evaluation metrics ensures that the models are highly sensitive to positive cases, which is crucial in a medical context to avoid missing any potential cancer diagnoses.
  4. Comprehensive Validation: Rigorous validation and testing processes are essential to ensure that the models generalize well to new, unseen data, thereby demonstrating their practical applicability in clinical settings.

Overall, the study highlights the potential of deep learning in advancing medical diagnostics, particularly in the early detection of lung cancer. Future research should focus on further refining these models, exploring additional data augmentation techniques, and validating the findings on larger datasets to confirm their efficacy and reliability in diverse clinical environments.

References:

1 - H. F. Al-Yasriy, M. S. Al-Husieny, F. Y. Mohsen, E. A. Khalil, and Z. S. Hassan, "Diagnosis of Lung Cancer Based on CT Scans Using CNN," IOP Conference Series: Materials Science and Engineering, vol. 928, 2020.

2 - H. F. Kareem, M. S. Al-Husieny, F. Y. Mohsen, E. A. Khalil, and Z. S. Hassan, "Evaluation of SVM performance in the detection of lung cancer in marked CT scan dataset," Indonesian Journal of Electrical Engineering and Computer Science, vol. 21, no. 3, pp. 1731-1738, 2021, doi: 10.11591/ijeecs.v21.i3.pp1731-1738.

3 - H. Alyasriy and M. AL-Huseiny, "The IQ-OTH/NCCD lung cancer dataset," Mendeley Data, V4, 2023, doi: 10.17632/bhmdr45bh2.4.

4 - S. Wang, L. Dong, X. Wang, and X. Wang, "Classification of Pathological Types of Lung Cancer from CT Images by Deep Residual Neural Networks with Transfer Learning Strategy," Open Medicine (Warsaw, Poland), vol. 15, pp. 190-197, 2020, doi: 10.1515/med-2020-0028.
