Top Myths About Data Science, Pt. 2
This is the second part of my first LinkedIn article, Top Myths About Data Science, Big Data and Statistics (https://www.dhirubhai.net/in/a-s-wong-12181845/detail/recent-activity/posts/ca/post-analytics/urn:li:fs_profilePost:6159572136664014848/). As the literature and chatter on Data Science grow, the myths that make up a part of this literature inevitably grow in tandem. When something is popularized beyond its technical base, one has to be very careful in teasing truth apart from myth. It is especially encouraging to know that, along with the increasing misconceptions and myth-mongering in this field, there has also emerged an increasing but nonetheless very gradual backlash in the official and unofficial (social) media. One of the rare responsible pieces of mainstream journalism (https://www.forbes.com/sites/kalevleetaru/2017/07/30/from-subject-to-soundbite-results-to-hype-communicating-data-science/?c=0&s=trending#44ca28361365) urges, in its last paragraph, that professionals in Data Science move from shimmering soundbites to explaining nuance and illuminating truth. Beyond illuminating truth, I also hope to help women in Data Science by revealing how the Data Science hegemony favoring Computer Scientists and Engineers over Statisticians perpetuates itself and thus indirectly marginalizes women, who comprise about 40% of Statistics as opposed to anywhere from 12 to 26% of Computer Science or Engineering. Though maleness might never have been consciously instituted as a pillar to bolster the voice and influence of males in Data Science, and I have encountered many supportive males in Data Science myself, the traits most correlated with males (e.g. coming from a Computer Science or Engineering background) certainly have been. I would call this an unwitting, impersonal discrimination perpetuated by systems of (erroneous, as I will point out later) thought. In this article, I discuss the myth that statistical inference models are not applicable to predictive analytics, two myths surrounding the choice of computing languages and, lastly, the myth surrounding Computer Science algorithms and their relation to predictive algorithms. After exposing these myths, I will explain how continuing to be misled in these areas can stall our progress towards increased, or even maintained, gender diversity in the field of Data Science.
I would encourage non-technical readers, as well as technical ones, to read this article. Non-technical employees such as Product Managers have the managerial clout to influence recruiting and to shape team cultures that prevent gratuitous intellectual domination of talented employees who are not articulate or powerful enough to express the obstacles they face. It is therefore especially important for leaders to monitor the culture among their Data Science employees and check whether any of the myths highlighted below are preventing the team from realizing its full potential.
1) Myth 1: Statistical Inference Models Are Only for Statistical Inference, Not Predictive Modeling
Mechanism of Statistical Inference
To recap from Pt 1 of Top Myths, statistical inference tells you the expected change in the outcome when you change an input factor by one unit, everything else held constant. In its simplest and most familiar representation, below is a linear regression:
y = a + bX1 + cX2 + dX3...
This means that if you increase X1 by 1 unit, you expect y to change by b units, with the other inputs held constant. The interpretation is the same for all the other input factors, while a is a constant that is added no matter what. Fitting this model and interpreting its coefficients completes the statistical inference. After several checks for overfitting, i.e. for how well the model generalizes to other data sets drawn from the same distribution, you would test this model on a separate data set that you hope follows the same distribution as the original one.
For example,
1) If you have trained your regression model on 1 Oct subscribers, you could use 1 Nov subscribers' 7-day revenue data as a test set. You would take all the values of the input factors X1, X2, X3, etc. per subject, multiply each by the coefficients b, c, d, etc. found from training on the 1 Oct subscribers' data, add them up and add the constant a.
2) Then, you would use metrics such as the root mean squared error to see how much your predicted revenue y differs from the observed (real) 7-day revenue of these 1 Nov subscribers. If you still have good performance on your 1 Nov subscribers' data, you could assume that the relationship between the outcome variable and these input factors doesn't change much from the training period to the testing period. The lag between training and testing was a month. Hence, if you wanted to predict (into the future relative to today, in the sense of forecasting) 1 Dec subscribers' data and it is now 8 Nov, you would use the 1 Nov subscribers' 7-day revenue data to predict the 1 Dec subscribers' data. If the relationship between the outcome variable and the input variables held from Oct to Nov subscribers (assuming your Nov model held up to those objective methods of scrutiny), you could assume that a month out from your training on Nov, your results on Dec would be almost as accurate. A minimal sketch of this train-and-test workflow appears below.
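For readers who like to see the mechanics, here is a minimal sketch of that workflow in Python with scikit-learn. The file names and column names (X1, X2, X3, revenue_7d) are hypothetical placeholders, not from any real data set:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical cohort files: one row per subscriber
train = pd.read_csv("subscribers_oct01.csv")   # training cohort (1 Oct subscribers)
test = pd.read_csv("subscribers_nov01.csv")    # test cohort (1 Nov subscribers)

features = ["X1", "X2", "X3"]

# Fit y = a + b*X1 + c*X2 + d*X3 on the October cohort
model = LinearRegression()
model.fit(train[features], train["revenue_7d"])
print(model.intercept_, model.coef_)   # a and (b, c, d): expected change in y per unit change in each X

# Score the November cohort with the October coefficients and compare to observed revenue
pred = model.predict(test[features])
rmse = mean_squared_error(test["revenue_7d"], pred) ** 0.5
print("RMSE on 1 Nov cohort:", rmse)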
Notice that I deliberately did not break all of this up earlier into statistical inference vs. prediction, precisely because I wanted the reader to be hit with their own realization that the 'statistical inference' based on already observed data, i.e. the Oct and Nov data, uses exactly the same mechanism as predicting from present data into the future, i.e. training on Nov and predicting out into the future for Dec.
Note that this predictive modeling needn't be restricted to predictions into the future. One could use the training data and model from one neighborhood to form predictions for another, similar neighborhood. The assumptions behind this might be shakier than for the temporal leaps above: while we can infer from good performance on the validation set that a monthly lag from training to prediction doesn't pose a huge issue, it is harder to assume that just because a model trained on Cambridge works on Somerville, it will work on Back Bay as well. The 'gaps' between Cambridge, Somerville and Back Bay are not as well defined as the temporal lags described above.
Bias-variance tradeoff
Linear regressions (and, more generally, more simply represented or 'smoother' parametric regressions such as logistic regression) are often accused of having higher bias, i.e. being less accurate on the training data. Yet the examples above, where there is uncertainty about the prediction set's relation to the original training set, perfectly illustrate their advantage over more flexible machine learning models in achieving lower variance. That is, if a machine learning model gives you slightly higher accuracy but you are unsure whether your training set resembles what you suspect to be a significantly different data set, you should probably pick a linear regression over the machine learning model to hedge against unforeseen swings in performance on data sets that differ from the training data. Please see https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/ and the relevant section quoted in the appendix.
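To make the variance point concrete, here is a toy simulation in Python; the data are made up on the spot (a roughly linear relationship with noise), so treat it as a sketch of the phenomenon rather than a claim about any particular data set:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def make_data(n, shift=0.0):
    # Roughly linear relationship with noise; 'shift' nudges the input distribution
    X = rng.normal(loc=shift, scale=1.0, size=(n, 1))
    y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=n)
    return X, y

X_train, y_train = make_data(200)
X_test, y_test = make_data(200, shift=0.5)   # prediction set drawn from a slightly different distribution

for name, model in [("linear regression", LinearRegression()),
                    ("unpruned tree", DecisionTreeRegressor())]:
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(name, "test RMSE:", round(rmse, 2))

# In repeated runs of this toy setup, the high-variance tree tends to degrade more
# than the low-variance linear model when the test distribution drifts.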
So, if you know something has a supposed tendency to behave in a certain way, what do you do? In the case of the linear regression, would you just ignore the possibility that it will be the most accurate model, or would you write one extra line of code (resembling the above equation in all its simplicity) to test whether, for this particular dataset, your regression's predictive accuracy is actually the highest (and in the ways that matter, but that is beyond the scope of this article)? For those of you not familiar with coding up models, most of the work lies in checking data quality (e.g. do you have values outside the theoretically possible range?), cleaning the data and then merging and aggregating it in the ways that suit your model(s) best. Typically, one way of aggregating the data will work for multiple models, though it's useful to get as creative as possible with different aggregations, e.g. summing up all visits to a website per week rather than per day, so that a longitudinal model of web traffic and churn/engagement is insulated from possible overfitting to the random day-to-day fluctuations of a variable. After all of this data inspection and manipulation, which takes quite a while, I wonder why trying to squeeze one additional line of linear regression code out of many data scientists is like getting blood from a stone. It's as if, the minute you've typed it out, that one line of code becomes a plumbline straight to terrestrial dirt, whereas, before the advent of Data Science and Machine Learning hype, even a simple linear regression was treated with the reverence due a truth-bearing Platonic Form (especially for its simplicity), sucking all the points of the empirical world into its compelling vortex.
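To show how little that 'one extra line' really asks of you, here is a hedged sketch in Python with pandas: the bulk of the code is cleaning and weekly aggregation, and the regression itself genuinely is one line at the end. The file and column names (user_id, visit_date, visits, revenue_7d) are hypothetical:

import pandas as pd
from sklearn.linear_model import LinearRegression

raw = pd.read_csv("web_visits.csv")   # hypothetical: one row per user per day

# Data quality check: drop theoretically impossible values (e.g. negative visit counts)
raw = raw[raw["visits"] >= 0]

# Aggregate daily visits into weekly totals to smooth out day-to-day noise
raw["week"] = pd.to_datetime(raw["visit_date"]).dt.to_period("W")
weekly = (raw.groupby(["user_id", "week"])["visits"].sum()
             .groupby(level="user_id").mean()          # average weekly visits per user
             .rename("avg_weekly_visits")
             .reset_index())

# Merge on the outcome (7-day revenue per user), again hypothetical
outcomes = pd.read_csv("revenue.csv")
data = weekly.merge(outcomes, on="user_id", how="inner")

# After all that preparation, the model itself is one line:
fit = LinearRegression().fit(data[["avg_weekly_visits"]], data["revenue_7d"])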
Below is the cost-benefit analysis to illustrate this curious anthropological phenomenon characterizing Data Scientists and their contribution to dethroning the linear regression.
Cost(One additional line of code) < Probability(linear regression will outperform)*Amount of outperformance
However, the Data Scientist is after all human and, some might say, uniquely human!
Cost(One additional line of code) + Cost(freefall in reputation from people thinking linear regression was all you were aware of or capable of) + 10,000 lost ego-derived utils > Probability(linear regression will outperform)*Amount of outperformance
Yet make no mistake about it, you could hit paydirt with a linear regression. And even if you don't, any employer worth their salt will applaud you for having the rationality to privilege the first cost-benefit analysis over the second, just because there was a 'could' pitted against your one line of code. There's a reason paydirt is called paydirt - money doesn't flow only to pedestals and hoisted sedans, or at least it shouldn't. In our increasingly glitzy world of tech, with its deteriorating values of individualism, many in the profession have eschewed the honest living of paying their company back for their huge salaries by undertaking useful work, no matter how unglamorous, in favor of a resume star-studded with the trendiest algorithms. (I would be very curious and amused to mine a corpus of resumes to discover the frequency with which 'linear regression' or even 'regression' is listed. If a candidate who clearly has vast experience with methodologies beyond regression still lists it, you can safely bet that candidate has a head on their shoulders and isn't nouveau-data-sciencey, always with something to prove.) I have some sympathy for the branding that these data scientists have resorted to - in a Data Science world where it's sink-or-swim, mostly due to the unscientifically backed intellectual domination or aggression (which produced such stark methodological hierarchies to begin with) prevalent in the field, it is a more arduous task to write the sort of article I'm writing now than to simply press delete a few times and wipe 'linear regression' forever off the face of one's resume. In this respect, a Data Scientist can be an extremely rational genus.
I can sense a few raised eyebrows among legitimate practitioners - is it really just one line of code? No, it rarely is for a linear regression, for reasons to be explained later, but my point is that even if you didn't want to do any extra work beyond that one line testing a linear regression, it is still better to see if it might work, on the off-chance that no further processing is required for the linear regression to outperform. And if it does, it is certainly worth continuing with the data manipulations and small statistical tricks described in the appendix, which are not only often more useful in increasing predictive accuracy than any change of model but are also ego-fulfilling to those who display a proficiency in these tools. I have also attached a comment from KDNuggets as an example of good hiring requirements, which go beyond methodology to highlight the importance of the preparatory work that goes into the modeling.
Statistical Inference
One of the biggest and saddest ironies of the field of Data Science is that, just because statistical inference models can fulfill the dual purposes of prediction and statistical inference, they have become synonymous with the one thing other models cannot do, namely statistical inference, at the huge cost of being undervalued for their predictive uses. It certainly didn't help that many academic Statisticians, especially decades ago, questioned the usefulness of prediction or the validity of assuming extrapolability from a training data set to a prediction set. However, some Statisticians, such as Akaike, Deming, Leo Breiman and Jerome Friedman, the last two of whom are machine learning heavyweights, rooted for the use of statistical models for prediction. Regardless of what these social currents among Statisticians were or are, it is undeniable that the reluctance to use statistical inference models to form predictions is entirely a function of social convention rather than of anything 'biological' in the models' mechanisms for prediction. In the example given under 'Mechanism of Statistical Inference,' a machine learning model would have been implemented in exactly the same way. The only difference is that those of the statistical inference school might be more cautious about extrapolating from one data set to another. We should not penalize a field for its cautiousness and use that against it; rather, we should appreciate its telling it like it is and any useful warnings we can draw from it. This caveat should apply with equal force to machine learning models, as nothing inherent in their mechanisms exempts them from these Statisticians' concerns.
The growing marginalization of statistical inference would not be so egregious if people remembered to perform statistical inference at all in the tech world that has come to privilege predictive analytics over statistical inference. Most of this negligence stems from an undue preoccupation with machine learning where relationships between the outcome variable and input variables cannot be easily expressed and interpreted, so data scientists often forget that relationships between outcome variables and input variables should be explained where possible. 'Where possible' does not even mean that the statistical inference model has to perform the best in predictive accuracy - it just needs to have a decent fit to the observed data. One model can be used for statistical inference to recommend business strategies while a separate model, which may or may not be a statistical inference model, can perform the predictions if no one model performs best on both statistical inference and predictive accuracy.
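As a hedged illustration of that division of labor, the sketch below fits an interpretable OLS model for the inference side (coefficients, confidence intervals, p-values) and a gradient boosted model purely for prediction. The file and column names (revenue, price, emails, tenure) are hypothetical:

import pandas as pd
import statsmodels.formula.api as smf
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("subscribers.csv")   # hypothetical columns: revenue, price, emails, tenure
train, test = train_test_split(df, test_size=0.3, random_state=0)

# Model 1: statistical inference - which dials move revenue, and by how much?
inference_model = smf.ols("revenue ~ price + emails + tenure", data=train).fit()
print(inference_model.summary())      # coefficients, confidence intervals, p-values

# Model 2: prediction only - interpretability of coefficients is not required here
features = ["price", "emails", "tenure"]
predictive_model = GradientBoostingRegressor().fit(train[features], train["revenue"])
print("holdout R^2:", predictive_model.score(test[features], test["revenue"]))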
What the marginalization of statistical inference means to businesses is a huge untapped potential in actionable insights. When you know that turning the dial on one input factor will significantly increase your revenue, you can work on turning that dial. This is arguably even more important than prediction in many business use cases because it allows you to impact and not just predict (which is really just diagnosing your future without necessarily being able to do anything about it). Yet many competent Statisticians choose to steer away from statistical inference precisely because statistical inference is now classed in the less remunerative and less glamorous Analytics category, despite all the additional work (again, see the appendix) a statistical inference model may require to optimize it in comparison with a machine learning model. Yes, they are likely to earn less and have less visibility despite the fruits of such labor, which would fulfill the double purpose of adding yet another predictive model to be compared with contending models and providing rich actionable insights through statistical inference. Five years ago, when the term 'Data Science' barely existed, the term Analyst came to encompass two types of teams: Business Intelligence Analysts, who generated charts and tables, and modelers, a mix of those who performed both statistical inference and predictive analytics and those who performed only predictive analytics (having taken only machine learning classes, for example, rather than because of any real hegemony). This was a useful model for organizations, as it allowed Analysts to weave seamlessly between statistical inference and prediction, as only befits statistical inference modeling, since the mechanisms for statistical inference are exactly the same ones used for prediction. With the ever increasing reorganization of statistical modelers into Analytics teams that focus more on charts and visualizations, statistical modelers, many of whom are also completely fluent in machine learning (see 'Top Myths, Pt 1' for the surprising field, Statistics, in which most canonical machine learning algorithms arose), have to devote less of their time to predictive accuracy, and the business suffers a loss from not applying statisticians' skills to predictive modeling.
This oversight and waste of opportunity prevalent in Data Science could be one of the biggest reasons future generations will look back on Data Science (as it is currently practiced) as the biggest pyramid scheme of our generation.
2) Myth 2: If you don't know a programming language used often by computer scientists and engineers, you don't program.
This is also sometimes a euphemism for "You're a statistician by training. Do you code?" This seriously outdated belief makes me wonder whether those who ask me this were friends with my 70-year-old professors, who rarely coded, and certainly not in any living language. I wonder about that even though the askers are half those professors' age and far more likely to be using this modern thing called LinkedIn, which shows dozens of endorsements for multiple programming languages on each statistician's page. For the last few decades at least, every statistician who has come out of school has sweated through weekly assignments generating results from models using code. Add to that all the serious research output (at hospitals, for example) that often accompanies academia, and you wonder how this question exists, unless one assumes statisticians wave their hands and tell non-statisticians, "Can you please merge A and B by this variable using an inner join? And then run a RandomForest please."
Programming is rightly associated with software engineers and computer scientists, because it is a huge part, if not the overwhelming part, of what they do, and these two professions have also taken up an overwhelmingly large share of tech jobs and thus loom large in the public eye for all the skills they offer. This has obscured the fact that practitioners of less common fields, such as Statisticians, also program - more as an inevitable means to an end.
"Oh, but they code in R, not Python or C++!" First, there is increasing crossover into each language from the main practitioners of their fields, which is something I'd encourage if called for and discuss at greater length later. There seems to be a misconception that any language besides the ones software engineers and computer scientists use is not a programming language. Do you need further proof than the code snippets in R and Python below?
In R:
for (i in 1:10) {
  print(i + 1)
}
In Python:
for i in range(1, 11):
    print(i + 1)
3) Myth 3: There is a best software language for Data Science
I'd recommend a polyglot training. No one language is optimal for every major purpose in Data Science.
Some examples:
a) While Python, compared to R, has a more comprehensive natural language processing package, NLTK, it is greatly lacking in variations on mixed effects models (useful for longitudinal analyses) and survival analysis (useful for retention). The link below shows some as-yet-unimplemented proposals for expansion in these areas. https://github.com/statsmodels/statsmodels/wiki/GSoC-2015-Proposal:-Improvements-to-Mixed-Effects-Models
b) The link below describes the differences between R and SAS for their mixed effects models, perfectly illustrating that there will always be significant differences between languages even if, on the surface, they have the same broad categories of models. I've found in practice that where R (not run on an AWS server) sometimes crashes when fitting both random intercepts and slopes, which allow the estimation of individuals' (or groupings') trajectories over time, SAS handles this very useful feature better (a minimal sketch of such a model appears after the next paragraph). https://glmm.wikidot.com/pkg-comparison
Contrary to popular but outdated belief, SAS has a free edition, and again, those who deride it for its decreasing trendiness only draw attention to their own lack of range: methodologies such as mixed effects models, which span a much larger range in SAS than in Python, were developed by world-class Statisticians and Biostatisticians such as Nan Laird, who also developed methodologies that Computer Scientists are likely to encounter in Python, such as EM (not that EM doesn't exist in R). Did the same person suffer a concussion along the way? That would be the only plausible explanation for one of her methodologies being poorly developed and another fabulous simply because the latter exists in Python and appears in Computer Science textbooks. If you can't assume a concussion (and I'm almost certain she's never had one), then the SAS haters out there, who are often also religious Pythonistas and computer scientists not always exposed to the broadest range of methodologies, might have to concede that beyond the language hype lie much more crucial fundamentals, like sophistication in methodology and, more importantly, usefulness in real-world applications.
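For readers who have never fitted one, here is a minimal sketch of the kind of random intercept and slope model described in b), using Python's statsmodels (which does implement a basic linear mixed model, even if the variations lag behind R and SAS). The file name and column names are hypothetical placeholders:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical longitudinal data: repeated measurements of y over time for each subject
df = pd.read_csv("longitudinal.csv")          # columns: subject, time, y

# Random intercept AND random slope on time for each subject,
# i.e. each subject gets its own trajectory over time
model = smf.mixedlm("y ~ time", data=df, groups=df["subject"], re_formula="~time")
result = model.fit()
print(result.summary())

# The comparable R call with lme4 would be roughly: lmer(y ~ time + (time | subject), data = df)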
c) Regarding what's best suited to production, it pains me to observe that many in Data Science have forgone more optimal (e.g. more predictive) models in favor of sticking to their main production language. You can have your cake and eat it too - most of your code will be processing, such as aggregations of data, which you can perform in your main production language, and when you get down to the final algorithm command, you can use a different language. Simply output the results of the preceding step into a file that will be read into the other software. All of this can be automated with batch processing.
Most of the time, models need to be retrained only as often as, say, once a month. If that is the case, frankly, you can just train the model once a month and not worry about whether it takes 6 or 10 hours, making the choice of language a bit less crucial. The predictions using the trained parameters may need to occur on a daily, even hourly, basis though. If that's the case and your model is a parametric one such as a regression, save your monthly trained parameters or 'rules' into a 'playbook' and it is then easy as pie to calculate something like y = a + bX1 + cX2 ... in a fast production language like C++.
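Here is a hedged sketch of that 'playbook' hand-off in Python: train monthly in whatever language fits best, write the coefficients to a file, and let the production system apply y = a + bX1 + cX2 + ... on its own schedule. The file and column names are hypothetical, and the scoring step could just as easily be a few lines of C++ reading the same file:

import json
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["X1", "X2", "X3"]

# Monthly training job: fit the model and save the 'playbook' of coefficients
train = pd.read_csv("training_cohort.csv")                 # hypothetical
fit = LinearRegression().fit(train[FEATURES], train["y"])
playbook = {"intercept": float(fit.intercept_),
            "coefs": {c: float(v) for c, v in zip(FEATURES, fit.coef_)}}
with open("playbook.json", "w") as f:
    json.dump(playbook, f)

# Daily (or hourly) scoring job: no refitting, just y = a + b*X1 + c*X2 + d*X3
with open("playbook.json") as f:
    pb = json.load(f)
new_data = pd.read_csv("todays_subscribers.csv")           # hypothetical
score = pb["intercept"] + sum(pb["coefs"][c] * new_data[c] for c in FEATURES)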
In light of the above, we should always consider, for every use case, whether speed is a need or a want. Furthermore, when you choose to hire someone, should the relative speed of their language be a crucial factor, or even a tie-breaker? I believe it should be a tie-breaker only when two candidates are otherwise exactly the same and you need language as a determining factor. If you ever think two candidates are so close that language has to be the tie-breaker, perhaps you are not prioritizing the right things: solidity of methodological knowledge, the facility with which they match methodology to application, personality and a history of careful preparation of data. And I hate to suggest this, but your interview take-home assignment may also not be discriminating enough.
Please consider this cost-benefit analysis with easily quantifiable benefits and costs (rather than even more important ones like analytical performance): from real-world experience, I have observed that it takes about one day for a software engineer to translate a script for production (all aggregations, at least, should be done in a common-denominator language like SQL and are thus easily understood by the translator). Making the optimistic assumption that one puts four models into production every year, we have taken up four days of the software engineer's time. Now compare those four days with the far more than four days it typically takes to find the next candidate who fulfills all your criteria, after you pass on someone who is the best in all respects except your production language. Keep in mind, this cost-benefit analysis assumes both candidates are perfectly equal in all other respects.
The language religiosity seen so often in Data Science endangers those from Statistics backgrounds, who are more apt to use R than Python, for example. The marginalization of those using R is an indirect, unintended way of marginalizing Statisticians, and thus an unintended way of marginalizing women, who make up a greater proportion of Statistics than of Computer Science and Engineering, fields which tend not to use R. All this is not to say that we should lower our standards just to include more Statisticians (and, as a result, women) - we need to revisit our standards to see whether they were justifiable to begin with. If the concerns I pointed out regarding Python's deficiencies in some key statistical methodologies are valid, then we should embrace the methodological diversity that Statisticians, who tend to use other languages, bring.
4) Myth 4: Data Structures and similar Computer Science algorithms are an important part of Data Science and should be tested
By data structures, I mean algorithms like binary search, mergesort and quicksort. These three are the most familiar to non-CS students, not because they have to use them in their Data Science jobs (except in rare cases like implementing a locality-sensitive hashing algorithm from scratch for approximate nearest neighbor searches, though that algorithm is rarely known, let alone used) but simply because they are routinely tested. Once you move past these three, you can go down a rabbit hole of such algorithms, and it is impossible to cover every base unless you take a semester's worth of CS. (Oh, c'mon, take that notorious CS50 then! But why, is the question.) Yet Data Science teams continue to interview candidates on algorithms that I have never used in my more than five years in predictive analytics (not to forget statistical inference too). I thought perhaps that I just didn't know what I didn't know, so I finally asked an interviewer if the company actually used such data structure algorithms in Data Science. He said no. If no Data Science learning academy seems to teach them and no Data Scientist I know uses them in Data Science work, what could be the reason for such testing?
My speculation is that Data Scientists arose from lean startups where it was eventually decided that Software Engineers had to do some Analytics too as the next logical phase of the company. They still functioned primarily as Software Engineers, so when they interviewed candidates for these 'Data Scientist' positions, they tested for Software Engineering skills in addition to Data Science ones. Even when companies grew and could afford full-time Data Scientists focused on just Data Science, the tradition may have continued because it is so much easier to grab that old interview Q&A sheet than to formulate a new one. Another, more plausible, reason is simply that many Data Scientists fall under Engineering teams. Their interviewers will be Software Engineers (I'm not referring to those who went into Data Science, though a few of those are just as culpable) who don't know what Data Science is or requires, and will thus test you on Data Structures. If you can't code up a quicksort on the spot, they may kindly (if you are lucky) but nonetheless misplacedly ask whether you code at all. First, recursion is not all there is to coding, and quicksort is not the only recursive algorithm (think of the Fibonacci algorithm, for example). It is the algorithm itself, not the recursive coding, that made its inventor famous, through work that certainly wasn't completed overnight, so why should anyone expect those who haven't encountered these data structure algorithms to solve them on the spot without prior study? As for prior study, as I've said, beyond the first three I've mentioned you could really go down a rabbit hole. If you're concerned with principles, you may not even see a need to complete these exercises and may want to move on from a team that is looking only to hire its next selfie. Principles aside, even if you make the cut, you might face endless challenges trying to fit in with a team that sees things only from its own perspective, without recognizing your boundaries regarding what should be expected of you. Finally, an even larger part of the coding in Data Science that builds up towards the final command (which is usually a 'canned' command that executes something like a logistic regression or a support vector machine) hinges on merges of data sets, aggregations of variables and mathematical transforms of variables, all of which are functions executed in code. Thus, equating coding or programming with coding up data structure algorithms is completely unwarranted.
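To make that last point concrete, here is a hedged sketch of what the bulk of day-to-day Data Science code actually looks like: merges, aggregations and transforms leading up to a 'canned' final command. The table and column names are hypothetical:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

users = pd.read_csv("users.csv")            # hypothetical: user_id, signup_date, churned
events = pd.read_csv("events.csv")          # hypothetical: user_id, event_date, spend

# Merge: inner join of the two data sets on user_id
data = users.merge(events, on="user_id", how="inner")

# Aggregate: total spend and number of events per user
agg = data.groupby("user_id").agg(total_spend=("spend", "sum"),
                                  n_events=("event_date", "count")).reset_index()

# Transform: log-scale the skewed spend variable
agg["log_spend"] = np.log1p(agg["total_spend"])

# Only after all of that does the 'canned' final command appear
final = agg.merge(users[["user_id", "churned"]], on="user_id")
model = LogisticRegression().fit(final[["log_spend", "n_events"]], final["churned"])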
Yet another myth is that if you don't know data structure algorithms, or what are (too) loosely called Computer Science algorithms, you don't know algorithms and are thus unfit for predictive analytics. Under the umbrella of Computer Science algorithms we find two major types: those that aid in data management and processing, and those that are predictive algorithms. It is the height of irony when recruiters or even technical hiring managers assume that Statisticians do not know algorithms (and I'm not assuming this - I have it clearly in writing) if they do not know the non-predictive side of Computer Science algorithms, when many of the predictive algorithms found in Computer Science textbooks originated from Statistics. To name a few: classification and regression trees, gradient boosted trees, RandomForest, k-nearest neighbors and EM. Certain other algorithms that may have sprung from Computer Science, such as k-means clustering and hierarchical clustering, are widely familiar to Statisticians, just as models originating from Statisticians are familiar to CS majors who have taken a Machine Learning course or mastered them otherwise. Computer Science does not have a monopoly on predictive algorithms, and it is time for Statisticians to receive their due in this field and not have baseless assumptions about their ignorance of algorithms (the predictive ones, the ones that matter for Data Science) foisted upon them over and over again.
Now, much of the domination of computer scientists and engineers in the field of Data Science (maybe not in numbers, but in voice and social influence) no doubt arose from the media, where the likes of the HBO show Silicon Valley have propelled their professions into even greater view. Nevertheless, many Computer Scientists and Engineers have contributed to the domination of those fields in Data Science through their greater tendency to promote awareness of the algorithms they are inventing or trying out. I absolutely do not begrudge them this, and I think that promoting awareness of the right messages in the right way is healthy and can only contribute to progress in the field. The problem with the lack of diversity, and the attendant lack of diversity in views such as the ones I've just expressed in this article, lies not with the Computer Scientists and Engineers but with power systems influencing who speaks up and how much they say. Many women are afraid to deviate from the norm of indirectness or agreeability expected of their gender, and women make up 40% of the Statistics majors who face the relatively subtle, unintended yet ultimately harmful forms of discrimination that I have highlighted in this article. The greater representation of women in Statistics may be one reason that Statisticians in general tend to be less vocal about the obstacles they face in Data Science. Another reason is simply that many female Data Scientists have chosen to limit their range of industries to insurance, healthcare and marketing rather than the more vocal tech industry, to avoid the 'shiny new thing' syndrome that favors those who are always quick to showcase their talents and new projects. In tech, women have to choose between keeping up their image through self-promotion, and losing social support at work for their perceived lack of humility, or being humble and getting passed over for career advancement because their talents stay hidden. I don't believe the mechanism of discrimination in Data Science was ever sinister or ill-intended towards women, but I have pointed out too many ways in which Statisticians are marginalized in a field that they arguably created. If we penalize Statisticians for the reasons mentioned in this article, especially the most unfair one of being able to perform statistical inference modeling with a high degree of rigor, we will be weeding out a huge pool of women, all for no good reason. There is a difference between trying to include more women by accepting lower standards (that is affirmative action) and realizing that the standards by which we evaluate Statisticians (and thus many women) are not 'too high' but simply off-base, as in the case of data structures questions.
As has often been pointed out, women must not lose sight of the need for resilience and their own part in lifting themselves out of difficult situations. They must not stay the embittered and embattled victim. They must take a shot at carving out the life they owe themselves, and this is just what I hope to achieve for myself and others in writing this article.
Appendix:
Bias-variance tradeoff
https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
"Machine learning algorithms that have a high variance are strongly influenced by the specifics of the training data. This means that the specifics of the training have influence the number and types of parameters used to characterize the mapping function.
- Low Variance: Suggests small changes to the estimate of the target function with changes to the training dataset.
- High Variance: Suggests large changes to the estimate of the target function with changes to the training dataset.
Generally, nonparametric machine learning algorithms that have a lot of flexibility have a high variance. For example, decision trees have a high variance, that is even higher if the trees are not pruned before use.
Examples of low-variance machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression.
Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines."
Important Steps for a Data Scientist, Especially Before a Linear Regression
To quote user rk2153 from KDNuggets (https://www.kdnuggets.com/2016/02/21-data-science-interview-questions-answers.html/3) in response to Cheryl Howard's rather amusing comment on how to catch fake Data Scientists,
"Cheryl
I presume this is still a valid question after 1 year. I am ranking candidates:
-- by their ability with an actual data set -- not on the techniques that I am comfortable with, but any technique they choose to use with a good bit of rationale.
-- by their willingness to ask questions with other resources/departments
-- by their ability to explain their process and outcomes to all levels of colleagues
-- by their willingness to develop others around them
-- by their willingness to integrate diverse data sets and data sources
Unfortunately this is not rush job."
In addition to this excellent response to a sign of our times, may I add my own, more technical checklist:
- find optimal ways to bucket the many levels of certain categorical variables, e.g. with multiple correspondence analysis
- account for multicollinearity with ridge regression
- perform feature selection with methods such as LASSO, principal components analysis, ElasticNet, etc.
- apply splines to capture non-linear relationships
- use weighted least squares to account for differences in variance among groups
The list is potentially endless.
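For the curious, here is a hedged sketch of two items on this checklist (ridge regression for multicollinearity, LASSO for feature selection) using scikit-learn; the file name and the outcome column y are hypothetical placeholders:

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV, LassoCV

df = pd.read_csv("training_data.csv")            # hypothetical numeric features plus outcome y
X, y = df.drop(columns="y"), df["y"]

# Ridge regression: shrinks correlated coefficients to tame multicollinearity
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=[0.1, 1.0, 10.0])).fit(X, y)

# LASSO: drives some coefficients exactly to zero, acting as feature selection
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(X, y)
print("features kept by LASSO:",
      [c for c, w in zip(X.columns, lasso.named_steps["lassocv"].coef_) if w != 0])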