Hive UseCase: Breast Cancer Data Analysis
This blog will help readers to learn data analytics using HIVE. The very common problem present in whole world coming since starting of humans and is still present today.
The dataset used in this blog is a clinical data released for the awareness of breast cancer.Few problems has been designed to perform analytics also with codes as solution for practice.
The most common type of breast cancer is ductal carcinoma, which begins in the cells of the ducts. Breast cancer can also begin in the cells of the lobules and in other tissues in the breast. Ductal carcinoma in situ is a condition in which abnormal cells are found in the lining of the ducts but they haven't spread outside the duct. Breast cancer that has spread from where it began in the ducts or lobules to surrounding tissue is called invasive breast cancer. In inflammatory breast cancer, the breast looks red and swollen and feels warm because the cancer cells block the lymph vessels in the skin.
In the U.S., breast cancer is the second most common cancer in women after skin cancer. It can occur in both men and women, but it is rare in men. Each year there are about 100 times more new cases of breast cancer in women than in men.
Breast Cancer
Dataset:https://drive.google.com/open?id=0B1QaXx7tpw3SSXZnbW16anA2MjQ
Dataset Description
Creating and Loading of Data
CREATE TABLE bcancer(Complete_TCGA_ID string,Gender string,Age_Initial_diag int,ER_Status string,PR_Status string,HER2_Final_Status string,Tumor string,Node string,Node_Coded string,Metastasis string,Metastasis_Coded string,AJCC_Stage string,Converted_Stage string,Survival_Data_Form string,Vital_Status string,Days_to_Date_of_Last_Contact int,Days_to_date_of_Death int )row format delimited FIELDS terminated BY ',' stored AS textfile;
Load data local inpath '/home/acadgild/Downloads/breast_cancer_clinical_data.csv' into table bcancer;
Problem Statement
Determine Average age of people when they got diagnosed for the 1st time
select AVG(Age_Initial_diag) from bcancer ;
Problem Statement
Find Patient’s ID and their gender who are still alive after IIIrd stage.
select Complete_TCGA_ID,Gender from bcancer where AJCC_Stage="Stage IIIC" OR AJCC_Stage="Stage IIIB" OR AJCC_Stage="Stage IIIA" OR AJCC_Stage="StageIII" OR AJCC_Stage="StageIV" AND Vital_Status="LIVING";
Problem Statement
Also find the AVG age when the above patients got diagnosed for first time.
select Gender,AVG(Age_Initial_diag) from bcancer where AJCC_Stage="Stage IIIC" OR AJCC_Stage="Stage IIIB" OR AJCC_Stage="Stage IIIA" OR AJCC_Stage="StageIII" OR AJCC_Stage="StageIV" AND Vital_Status="LIVING";
Problem Statement
Find out ID and tumor category related with initial diagnosis of people who already Died because of breast cancer.
select Complete_TCGA_ID,Tumor from bcancer where Vital_Status="DECEASED";
Problem Statement
After reaching last stage how many live.
Select count(Vital_Status) from bcancer where AJCC_Stage="Stage IV" OR Converted_Stage="Stage IV" AND Vital_Status="LIVING";
Problem Statement
Give the count how many of the people are in final stage of cancer and Metastasis is M1 and age is more than 50;
Select count(Vital_Status) from bcancer where AJCC_Stage="Stage IV" OR Converted_Stage="Stage IV" AND Vital_Status="LIVING" AND Metastasis="M1" AND Age_Initial_diag>50;
Problem Statement
Find the Average age where ER and PR are positive and node is N3.
Select AVG(Age_Initial_diag) from bcancer where ER_Status="Positive" AND PR_Status="Positive" AND Node="N3";
Problem Statement
Find the min and max age for people got diagnosed.
SELECT MIN(Age_Initial_diag) from bcancer;
SELECT MAX(Age_Initial_diag) from bcancer;
Problem Statement
Make a List of people with ID,Age where ER_Status,PR_Status,HER2_Final_Status as positive,and in what stage AJCC_Stage. and Find average survival rate for range 40-49 AGE.
SELECT Complete_TCGA_ID,Age_Initial_diag,ER_Status,PR_Status,HER2_Final_Status,AJCC_Stage from bcancer where AND ER_Status="Positive" AND PR_Status="Positive" AND HER2_Final_Status="Positive" ORDER BY Age_Initial_diag;
SELECT Complete_TCGA_ID,Age_Initial_diag,ER_Status,PR_Status,HER2_Final_Status,AJCC_Stage from bcancer where 40<Age_Initial_diag AND Age_Initial_diag<49 AND ER_Status="Positive" AND PR_Status="Positive" AND HER2_Final_Status="Positive" ;
Hope the readers take this blog as learning material and take the good info out of it.
For more hands-on practice on Hive for beginners readers can try Real-Estate data Analysis.