Basics of data science
Jainendra Kumar, CPM, M.IOD
Member of Forbes Technology Council | Advisor | AI, ML, SaaS, Cloud, DevSecOps | Digital Transformation | Certified Independent Director
Data science draws on several fields and requires diverse skills. It also involves several stages: planning, data preparation, modeling, and follow-up.
Definition of data science:
Data science is the comprehensive analysis of diverse data. It combines coding, domain knowledge, and statistics in an applied setting to produce value. Coding is required for data gathering, formatting, and analysis. R and Python are well established for statistical coding, whereas SQL is used for database queries. A basic understanding of probability, algebra, regression, and related topics is required to diagnose issues as they surface. Data science also includes machine learning, in which coding and math skills matter more than domain knowledge.
Planning a data science project:
Every project starts with a well-defined goal (the value it should deliver), organized resources, coordination between people across functions, and a schedule.
Data acquisition and preparation:
This stage involves getting data from various sources, then cleaning, exploring, refining, and transforming it. The data scientist can use existing in-house data, open data, or third-party data, pull data through APIs, scrape it, or even generate new data.
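As a rough illustration of the cleaning step described above, here is a minimal sketch that uses only the Python standard library. The column names, the sample rows, and the helper functions `load_and_clean` and `impute_median` are all hypothetical, chosen for this example. Filling gaps with the median is just one of several reasonable ways to handle missing values.

```python
import csv
import io
import statistics

# Hypothetical raw data, as it might arrive from an export or an API dump.
RAW_CSV = """customer_id,age,monthly_spend
1,34,120.50
2,,80.00
3,29,not available
3,29,not available
4,41,200.25
"""


def load_and_clean(text):
    """Parse CSV text, drop exact duplicate rows, and coerce numeric fields.

    Values that cannot be parsed as numbers become None so that they can
    be handled explicitly in a later step.
    """
    def to_float(value):
        try:
            return float(value)
        except (TypeError, ValueError):
            return None

    seen = set()
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        key = tuple(record.values())
        if key in seen:  # skip exact duplicates
            continue
        seen.add(key)
        rows.append({
            "customer_id": int(record["customer_id"]),
            "age": to_float(record["age"]),
            "monthly_spend": to_float(record["monthly_spend"]),
        })
    return rows


def impute_median(rows, column):
    """Replace missing values in `column` with the median of the known values."""
    known = [row[column] for row in rows if row[column] is not None]
    median = statistics.median(known)
    for row in rows:
        if row[column] is None:
            row[column] = median
    return rows


clean_rows = impute_median(load_and_clean(RAW_CSV), "age")
for row in clean_rows:
    print(row)
```

The unparseable `monthly_spend` value is deliberately left as `None` here. Whether to impute it, drop the row, or go back to the source is a judgment call that depends on the project.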
Statistical modeling:
A statistical model is a compact summary that captures the gist of a larger population of data. It combines an algorithm, such as linear regression or classification, with a subset of the data used to fit it. Once the model is created, it is validated against a held-out test set. The model also has to be interpreted (what does it actually mean?) and then further refined.
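The fit-then-validate workflow described above can be sketched in a few lines of plain Python. The example below assumes synthetic data generated from a known line plus noise, an 80/20 split, and mean squared error as the validation metric. None of these choices come from the article, and the helper names `fit_line` and `mean_squared_error` are illustrative.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared difference between observed and predicted values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (an assumption for this sketch): y = 3x + 2 plus Gaussian noise.
rng = random.Random(0)
xs = [i / 10 for i in range(200)]
ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]

# Shuffle the indices and hold out 20% of the points as a test set.
indices = list(range(len(xs)))
rng.shuffle(indices)
split = int(0.8 * len(indices))
train_idx, test_idx = indices[:split], indices[split:]

train_x = [xs[i] for i in train_idx]
train_y = [ys[i] for i in train_idx]
test_x = [xs[i] for i in test_idx]
test_y = [ys[i] for i in test_idx]

# Fit on the training subset only, then validate on data the model has not seen.
slope, intercept = fit_line(train_x, train_y)
test_mse = mean_squared_error(test_x, test_y, slope, intercept)

print(f"slope={slope:.3f} intercept={intercept:.3f} test MSE={test_mse:.3f}")
```

A test error that is much larger than the training error is a common sign that the model needs further refinement.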
Follow-up:
Follow-up involves communicating and presenting the data science work to business stakeholders in a meaningful way that drives value. Before that, or perhaps after the concept is approved, the model is deployed to the production environment with the full data set. The model has to be revisited if it does not perform as well as it did on the test data set, or if conditions have changed over time. The model, ETL processes, and other assets should be documented for repeat analysis or future reuse.
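One simple way to decide when a deployed model should be revisited is to compare its recent production error with the error measured on the test set. The sketch below is a minimal illustration of that idea. The function name `needs_review`, the 25% tolerance, and the sample error values are all assumptions for this example, not an established standard.

```python
def needs_review(baseline_error, production_errors, tolerance=0.25):
    """Return True when the mean production error exceeds the baseline
    (test-set) error by more than `tolerance`, expressed as a fraction.
    """
    if not production_errors:
        raise ValueError("production_errors must not be empty")
    current = sum(production_errors) / len(production_errors)
    return current > baseline_error * (1 + tolerance)


# Hypothetical error measurements collected after deployment.
stable = needs_review(1.0, [0.9, 1.1, 1.05])   # mean stays close to the baseline
drifted = needs_review(1.0, [1.4, 1.6, 1.5])   # mean is well above the baseline

print(stable, drifted)
```

In practice the threshold and the monitoring window would be agreed with the business stakeholders, since they determine how much degradation is acceptable.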
Overall, data science is not just technical work, and it is definitely not only coding. Technology is just a means of doing data science. It is not just statistics either, although math and statistics form its foundation. So along with coding and math, it also involves domain knowledge, planning, analysis, communication, and, most importantly, the contextual skill of relating data science activities to business value.
Tools for data science:
R, RapidMiner, SQL, Python, Excel, KNIME, Hadoop, Tableau, Power Builder, Spark, Google BigQuery, Amazon SageMaker, and more.
Data science is an ethical practice. It is the responsibility of data scientists to maintain the privacy, confidentiality, and security of the data. Maintaining the anonymity of individuals and protecting proprietary data is critical. Data science is not bias-free, and analyses are not neutral. Good judgment is therefore vital to the quality and success of a data science project.