New $1 Million Biennial Prize to Revitalize Statistics
While nowadays everyone talks about machine learning, data science, and AI, few mention statistical science. Its professional body, the American Statistical Association, founded in 1839, is the second-oldest continuously operating professional society in the US, according to Wikipedia. The growth of this community was still strong a few decades ago.
A Bit of History
Of course, machine learning relies heavily on statistics. The disconnect started perhaps 20 years ago. I was myself a statistician back in 1990, working on exciting projects such as processing satellite images. Over time, it became clear that the work of people like me was becoming less and less relevant to the statistical community: it grew more computerized and automated, and less theoretical. Indeed, back then, the term for it was computational statistics. But that field was never absorbed into statistics. Instead, it became part of data mining, and then machine learning.
In the meantime, statistics became more and more associated with narrow fields, in particular biostatistics and the pharmaceutical industry. The drug industry was the major source of revenue for the American Statistical Association, and in turn the ASA started to heavily promote this field. Plenty of statistical methods were developed for tiny data sets (clinical trials). People working on big data became known as data scientists.
Read the full article here.
Retired at Space Applications Centre, ISRO
Is it FASHIONABLE to call oneself a DATA SCIENTIST when he/she has had NO science background for the last 15-20 years of schooling and higher education? Confusing mathematics (and allied subjects) with science is the biggest foolish act! Now I find myself, in the office, surrounded by nobody else but DATA SCIENTISTS (with no domain knowledge). Where are the people who know even a little bit of physics, chemistry, biology? Even the basics? (No recollection of anything learnt in school, etc.) AI, ML, deep learning? Are the promoters/sponsors/supporters of these suffering from a lack of human intelligence, a learning disability, shallow knowledge? With only some capability of writing programs in PYTHON (what a name for a computer language! It could have been BOA CONSTRICTOR as well, but that's two words), this sea of data scientists can "flood"/conquer/do anything [from advising neurosurgeons to forecasting earthquakes and other disasters like floods, tsunamis, avalanches, ...!]. Just sit down with these people and their AI/ML/DL outputs to discuss whether they make any sense in real life, and you'll make NO progress, only evasive and diversionary replies (laced with insults) fuelled by ego plus a false superiority complex. A new era of DIGO (data in, garbage out).
Researcher in neuropsychiatry and behavioral neurology. Applying DL/MS and advanced statistics/econometrics to discovery, modeling, and prediction of individual and cohort behavior.
Thank you very much, Vince, for highlighting the current and historical relationships and juxtapositions of statistical science and data science, and especially for taking us back to the history before Hastie and his colleagues first published The Elements of Statistical Learning. The target audience for that text seemed obviously narrow at the time, at least for those of us busy trying to build data science teams in industry. That audience seemed to be mathematicians and statisticians, and the goal of the book and others like it was to help solve problems encountered by generations of statisticians but as yet not solved (by statisticians). These included well-known problems with data (e.g., sheer database sizes, sparsity, multicollinearity, unknown/uncontrollable sources of random variance) and problems with parameters (e.g., deriving distributions of parameters, model fit to data, reducing sampling variance, predictive validity, and validation practices). Those problems were addressed by ad hoc methods such as shrinkage terms, latent variable terms, and hierarchical Bayes, to name a few.

In addition, of course, there were problems with finding relationships among variables, distinguishing true from spurious predictors, and iterating model fit with large numbers of parameters to improve outcomes (before very fast yet affordable computing power was available), which led to deep learning methods (e.g., neural networks) and some interesting non-NN models such as random forests, specialized gradient boosting methods, hierarchical Markov models, and more. The combination of innovations in data handling and in model structure has led to systems one can now unabashedly call Artificial Intelligence (with appropriate caveats, of course).

To ground statistical science in the work of the founders, such as Ronald Fisher and Max Weber in the early 20th century, one must take into account that the methods they developed, nearly all drawing upon parametric statistical tools, utilized small samples (tiny compared to those of today). Thus the problems detailed above (with data and parameters) need to be understood against the background of the contributions from such geniuses. The point is, the reason traditional statistics have become (perceptually) less utilized is not that we have evolved beyond them. It is true we now have massive data handling and fast computational modeling, but the original face of statistical science still looks out at us as we tweak the structures and parameters of our machine learning models unendingly.

Machine learning (including deep learning) has now become a set of operational data management tools implemented most often by computer scientists. It has become divorced from the need to understand the underlying statistical structure, as well as which problems in the statistical approaches need to be fixed. Some machine learning specialists have, in an entirely human way, created outcome heuristics for their models, and these heuristics do not necessarily tie back to those fixes. Recently, in fact, I heard the group leader of a vital data science group in one of America's largest corporations announce that LSTM was always better than HMM. To make that comparison, of course, requires ignoring basic principles of maximum likelihood, KL divergence for estimating parameter distributions, and the structure of latent variables/states. I am not at all trying to criticize anyone or their level of understanding.
But as a case in point, it tells us where computer scientists with no fundamental knowledge of statistics may eventually lead us. And despite data science being a young science (wearing old shoes), we already hear (a) outcries that it has produced no monetary value for stockholders and (b) continued hype about what AI can do (e.g., "solve business problems previously unsolvable"). So back to your original point, Vince: let's at least position machine learning as the science of finally addressing the data and modeling issues that have plagued statisticians for over 100 years, which means we need to understand what those issues actually are from the same, well-grounded statistical starting point.