Hive UseCase: Breast Cancer Data Analysis

Hive UseCase: Breast Cancer Data Analysis

This blog will help readers to learn data analytics using HIVE. The very common problem present in whole world coming since starting of humans and is still present today.

The dataset used in this blog is a clinical data released for the awareness of breast cancer.Few problems has been designed to perform analytics also with codes as solution for practice.

The most common type of breast cancer is ductal carcinoma, which begins in the cells of the ducts. Breast cancer can also begin in the cells of the lobules and in other tissues in the breast. Ductal carcinoma in situ is a condition in which abnormal cells are found in the lining of the ducts but they haven't spread outside the duct. Breast cancer that has spread from where it began in the ducts or lobules to surrounding tissue is called invasive breast cancer. In inflammatory breast cancer, the breast looks red and swollen and feels warm because the cancer cells block the lymph vessels in the skin.

In the U.S., breast cancer is the second most common cancer in women after skin cancer. It can occur in both men and women, but it is rare in men. Each year there are about 100 times more new cases of breast cancer in women than in men.

Breast Cancer

Dataset:https://drive.google.com/open?id=0B1QaXx7tpw3SSXZnbW16anA2MjQ

Dataset Description


Creating and Loading of Data

CREATE TABLE bcancer(Complete_TCGA_ID string,Gender string,Age_Initial_diag int,ER_Status string,PR_Status string,HER2_Final_Status string,Tumor string,Node string,Node_Coded string,Metastasis string,Metastasis_Coded string,AJCC_Stage string,Converted_Stage string,Survival_Data_Form string,Vital_Status string,Days_to_Date_of_Last_Contact int,Days_to_date_of_Death int  )row format delimited FIELDS terminated BY ',' stored AS textfile;

Load data local inpath '/home/acadgild/Downloads/breast_cancer_clinical_data.csv' into table bcancer;


Problem Statement

Determine Average age of people when they got diagnosed for the 1st time

select AVG(Age_Initial_diag) from bcancer ;

Problem Statement

Find Patient’s ID and their gender who are still alive after IIIrd stage.

select Complete_TCGA_ID,Gender from bcancer where AJCC_Stage="Stage IIIC" OR AJCC_Stage="Stage IIIB" OR AJCC_Stage="Stage IIIA" OR AJCC_Stage="StageIII" OR AJCC_Stage="StageIV" AND Vital_Status="LIVING";

Problem Statement

Also find the AVG age when the above patients got diagnosed for first time.

select Gender,AVG(Age_Initial_diag) from bcancer where AJCC_Stage="Stage IIIC" OR AJCC_Stage="Stage IIIB" OR AJCC_Stage="Stage IIIA" OR AJCC_Stage="StageIII" OR AJCC_Stage="StageIV" AND Vital_Status="LIVING";

Problem Statement

Find out ID and tumor category related with initial diagnosis of people who already Died because of breast cancer.

select Complete_TCGA_ID,Tumor from bcancer where Vital_Status="DECEASED";

Problem Statement

After reaching last stage how many live.

Select count(Vital_Status) from bcancer where AJCC_Stage="Stage IV" OR Converted_Stage="Stage IV" AND Vital_Status="LIVING";

Problem Statement

Give the count how many of the people are in final stage of cancer and Metastasis is M1 and age is more than 50;

Select count(Vital_Status) from bcancer where AJCC_Stage="Stage IV" OR Converted_Stage="Stage IV" AND Vital_Status="LIVING" AND Metastasis="M1" AND Age_Initial_diag>50;

Problem Statement

Find the Average age where ER and PR are positive and node is N3.

Select AVG(Age_Initial_diag) from bcancer where ER_Status="Positive" AND PR_Status="Positive" AND Node="N3";

Problem Statement

Find the min and max age for people got diagnosed.

SELECT MIN(Age_Initial_diag) from bcancer;   
SELECT MAX(Age_Initial_diag) from bcancer;

Problem Statement

Make a List of people with ID,Age where ER_Status,PR_Status,HER2_Final_Status as positive,and in what stage AJCC_Stage. and Find average survival rate for range 40-49 AGE.

SELECT Complete_TCGA_ID,Age_Initial_diag,ER_Status,PR_Status,HER2_Final_Status,AJCC_Stage from bcancer where AND ER_Status="Positive" AND PR_Status="Positive" AND HER2_Final_Status="Positive" ORDER BY Age_Initial_diag;
SELECT Complete_TCGA_ID,Age_Initial_diag,ER_Status,PR_Status,HER2_Final_Status,AJCC_Stage from bcancer where 40<Age_Initial_diag AND Age_Initial_diag<49 AND ER_Status="Positive" AND PR_Status="Positive" AND HER2_Final_Status="Positive" ;

Hope the readers take this blog as learning material and take the good info out of it.

For more hands-on practice on Hive for beginners readers can try Real-Estate data Analysis.







要查看或添加评论,请登录

Prateek K.的更多文章

  • Why Becoming a Data Scientist is the Next Logical Move?

    Why Becoming a Data Scientist is the Next Logical Move?

    Amongst the variety of job profiles, Data Scientist might not be the first job to come to your mind. But big companies…

  • Frequently Asked Hadoop Interview Questions in 2017 Part - 2

    Frequently Asked Hadoop Interview Questions in 2017 Part - 2

    Before going through this Hadoop interview questions part-2, we recommend our users to go through our previous post on…

    2 条评论
  • Data Serialization with Avro in Hive

    Data Serialization with Avro in Hive

    This blog focuses on providing in-depth information of Avro in Hive. Here we have discussed the importance and…

  • File Formats in Apache HIVE

    File Formats in Apache HIVE

    This Blog aims at discussing the different file formats available in Apache Hive. After reading this Blog you will get…

  • Writable and WritableComparable in Hadoop

    Writable and WritableComparable in Hadoop

    This blog helps those people who want to build their own custom types in Hadoop which is possible only with Writable…

    1 条评论
  • Frequently Asked Hadoop Interview Questions in 2017 – Part 1

    Frequently Asked Hadoop Interview Questions in 2017 – Part 1

    In this first Part of Hadoop interview Questions, we would be discussing various questions related to Big Data Hadoop…

    1 条评论
  • Static vs dynamic partition in hive

    Static vs dynamic partition in hive

    In our Last post, we mentioned regarding Introduction to Hive Partition therein post we clearly mentioned the most…

    1 条评论
  • Strict Mode HIVE

    Strict Mode HIVE

    We have the partition in Hive to save lots of our time whereas performing analysis on the area only we want and not the…

    2 条评论
  • Solving the Unstructured Data Dilemma

    Solving the Unstructured Data Dilemma

    A web browser generates tons of data every day. All of which may not be stored in a structured format.

社区洞察

其他会员也浏览了