United States and Puerto Rico Cancer Statistics, 1999-2019 Incidence
Siddhant Kondekar
On a mission of sustained self-growth and success | Marketing Strategy Enthusiast | A content writer | AMFI |Management Trainee- JB Pharmaceuticals | PGDM-Healthcare Management-Welingkar |
BACKGROUND OF THE DATA
Cancer is one of the deadliest diseases of our time. Millions of people are affected by cancer, and many of those cases end in death. The data analyzed here comes from the United States and Puerto Rico Cancer Statistics, 1999-2019 Incidence. It records the incidence of various types of cancer by age, sex, race, and cancer site. This data can support cancer detection and treatment: because the number of people with each type of cancer is broken down by age group, it can be used to predict incidence, study the factors that lead to cancer, and support early diagnosis.
DATA DICTIONARY
YEAR: the year of diagnosis (integer)
STATE: the state or territory of residence (string)
COUNTY: the county of residence (string)
AGE_ADJUSTED_RATE: the age-adjusted incidence rate per 100,000 population (float)
AGE_ADJUSTED_CI_LOWER: the lower limit of the 95% confidence interval for the age-adjusted rate (float)
AGE_ADJUSTED_CI_UPPER: the upper limit of the 95% confidence interval for the age-adjusted rate (float)
COUNT: the number of cases (integer)
POPULATION: the population count (integer)
CRUDE_RATE: the crude incidence rate per 100,000 population (float)
CRUDE_CI_LOWER: the lower limit of the 95% confidence interval for the crude rate (float)
CRUDE_CI_UPPER: the upper limit of the 95% confidence interval for the crude rate (float)
RACE: the race/ethnicity of the patient (string)
SEX: the sex of the patient (string)
SITE: the site of cancer (string)
YEAR_ID: a unique identifier for the year (integer)
AGE_ADJUSTED_RATE_STD: the age-adjusted incidence rate standardized to the 2000 U.S. standard population (float)
CRUDE_RATE_STD: the crude incidence rate standardized to the 2000 U.S. standard population (float)
EVENT_TYPE: whether the data is incidence (I) or mortality (M) (string)
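To make the rate fields concrete, here is a minimal sketch of how a crude rate per 100,000 follows from COUNT and POPULATION, and how an age-adjusted rate can be obtained by direct standardization. The function names, case counts, and age-group weights are hypothetical and chosen only for illustration; they are not the actual 2000 U.S. standard population weights or values from this dataset.

```python
def crude_rate(count, population, per=100_000):
    """Cases per `per` residents (COUNT / POPULATION scaled to 100,000)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * per / population


def age_adjusted_rate(age_groups, standard_weights, per=100_000):
    """Direct standardization: weight each age-specific crude rate by the
    share of the standard population that falls in that age group."""
    if len(age_groups) != len(standard_weights):
        raise ValueError("one weight is needed per age group")
    if abs(sum(standard_weights) - 1.0) > 1e-9:
        raise ValueError("standard weights must sum to 1")
    return sum(
        weight * crude_rate(count, population, per)
        for (count, population), weight in zip(age_groups, standard_weights)
    )


# Hypothetical (COUNT, POPULATION) pairs for three age groups
age_groups = [(10, 100_000), (50, 100_000), (200, 50_000)]
# Hypothetical standard-population shares for the same three groups
standard_weights = [0.5, 0.3, 0.2]

total_count = sum(count for count, _ in age_groups)
total_population = sum(population for _, population in age_groups)

overall_crude = crude_rate(total_count, total_population)
adjusted = age_adjusted_rate(age_groups, standard_weights)

print("Crude rate per 100,000:", overall_crude)
print("Age-adjusted rate per 100,000:", round(adjusted, 2))
```

The two numbers differ because the adjusted rate replaces the population's own age mix with the standard one, which is what allows rates to be compared across years and states.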
DATA INTERPRETATION
I applied two models to interpret this data.
1. Logistic Regression - Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one binary dependent variable and one or more nominal, ordinal, interval, or ratio-level independent variables. Logistic regressions can be difficult to interpret; the Intellectus Statistics tool lets you conduct the analysis and then interprets the output in plain English.
2. ARIMA - An autoregressive integrated moving average, or ARIMA, is a statistical analysis model that uses time series data either to better understand the data set or to predict future trends. A statistical model is autoregressive if it predicts future values based on past values. For example, an ARIMA model might seek to predict a stock's future prices based on its past performance, or forecast a company's earnings based on past periods.
The data covers various cancer sites, such as Brain and Nervous System, Breast, and Cervix Uteri. This data was used to build a predictive model of cancer incidence.
The Model includes:
Logistic Regression using age: at alpha = 0.05 the model is statistically significant. The chi-square is 748.8194 with df = 4, and the accuracy is 0.9550.
[Figure: ROC curve for the logistic regression using age]
Logistic Regression using race and gender: for the logistic data, cases were divided by age group, counts were added by sex code, and race with leading cancer sites data was separated out before the regression was run.
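As a rough illustration of where figures like the chi-square, the significance decision at alpha = 0.05, and the accuracy come from, the sketch below fits a one-predictor logistic regression in plain Python. It is not the analysis reported above: the ten data points are hypothetical, the model has a single predictor (so df = 1 rather than 4), and the fit uses simple gradient ascent rather than a statistics package.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, xi):
    """Predicted probability of the positive class; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))


def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Fit weights by batch gradient ascent on the mean log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - predict_proba(w, xi)
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def log_likelihood(w, X, y):
    """Sum of log-probabilities the model assigns to the observed labels."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w, xi)
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total


# Hypothetical data: one predictor (age in decades, centred at 4.5)
# and a binary outcome. Not taken from the cancer statistics dataset.
X = [[-2.5], [-1.5], [-1.5], [-0.5], [-0.5], [0.5], [0.5], [1.5], [1.5], [2.5]]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w = fit_logistic(X, y)

# Null model: intercept only, predicting the overall positive rate
p_null = sum(y) / len(y)
ll_null = sum(math.log(p_null) if yi == 1 else math.log(1.0 - p_null) for yi in y)

# Likelihood-ratio chi-square; df equals the number of predictors
chi_square = 2.0 * (log_likelihood(w, X, y) - ll_null)
df = len(X[0])
# Chi-square upper tail probability; this closed form holds only for df = 1
p_value = math.erfc(math.sqrt(chi_square / 2.0))

# Share of cases classified correctly at a 0.5 probability cut-off
accuracy = sum(
    (predict_proba(w, xi) >= 0.5) == (yi == 1) for xi, yi in zip(X, y)
) / len(y)

print("chi-square:", round(chi_square, 4), "df:", df)
print("significant at alpha = 0.05:", p_value < 0.05)
print("accuracy:", accuracy)
```

With more predictors, df grows accordingly and the p-value would need a general chi-square distribution function, for example from a statistics library.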
An ARIMA forecasting model was also created. The year-wise total number of cancer cases was compiled into an ARIMA dataset, and the model was fitted with alpha = 0.05 and its model parameters. The total case data was used to forecast the next 5 years of cancer cases, so that prevention measures can be planned.
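The forecasting idea can be sketched in a few lines of plain Python. This is a deliberately simplified stand-in, not the model used above: it behaves like an ARIMA(1,1,0), differencing the yearly totals once and fitting a single autoregressive term by least squares, whereas a statistics package would normally estimate the parameters by maximum likelihood. The yearly totals are hypothetical and the function names are made up for this example.

```python
def difference(series):
    """First differences: year-over-year change in total cases."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(diffs):
    """Least-squares fit of d[t] = c + phi * d[t-1] on the differenced series."""
    x, y = diffs[:-1], diffs[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # If every change is identical there is nothing to regress on
    phi = sxy / sxx if sxx else 0.0
    c = mean_y - phi * mean_x
    return c, phi


def forecast(series, steps=5):
    """Forecast `steps` future values by predicting each next change
    and adding it back onto the last known level."""
    if len(series) < 4:
        raise ValueError("need at least 4 observations")
    diffs = difference(series)
    c, phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    predictions = []
    for _ in range(steps):
        last_diff = c + phi * last_diff
        level += last_diff
        predictions.append(level)
    return predictions


# Hypothetical yearly totals of cancer cases (illustrative numbers only)
yearly_cases = [1000, 1030, 1055, 1090, 1110, 1150, 1170, 1205, 1230, 1260]

next_five_years = forecast(yearly_cases, steps=5)
for offset, value in enumerate(next_five_years, start=1):
    print(f"year +{offset}: {value:.1f}")
```

Differencing first is what the "integrated" part of ARIMA refers to: the model predicts the change from one year to the next, and the forecast level is rebuilt by accumulating those predicted changes.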
The results showed that both models were applied successfully.
1. Overall, the incidence of cancer in the United States and Puerto Rico has remained stable from 2012 to 2019.
2. The most common types of cancer in both the United States and Puerto Rico are breast, lung, prostate, and colorectal cancer.
3. The incidence of lung cancer has been declining in both the United States and Puerto Rico, likely due to a decrease in smoking rates.
4. The incidence of liver cancer has been increasing in both the United States and Puerto Rico, likely due to a rise in the prevalence of hepatitis C and non-alcoholic fatty liver disease.
5. The incidence of thyroid cancer has been increasing in both the United States and Puerto Rico, but this may be due to increased detection rather than a true increase in the number of cases.
This data can be used by hospitals and medical platforms for cancer prevention, to track the total number of cancer cases, and to predict future cases in the country by age, sex, and race. The data could also be sold at a price of 10 lakh, using marketing strategies such as email and B2B outreach.
This data is very important, and interpretation and statistical tools can be applied to it to support the prevention and treatment of cancer. The models used were effective: the logistic regression reached an accuracy of 0.9550, which lends credibility to the analysis. Both the Logistic Regression and ARIMA models appear reliable and precise for this data.