Datawashing and Datawishing
Jim Crompton
Professor of Practice, Petroleum Engineering Department at Colorado School of Mines
One of the biggest challenges I had in my petroleum data analytics courses was getting my hands on good data sets for the students to work on. Many academic data sets are simulated from known equations (some add a little random noise). But the ideal assignment was to work with an industry data set. That is where the problem lies: most companies (with a couple of exceptions) do not want to donate their data to a university. Sometimes the excuse is intellectual property; sometimes it is just that they do not have the time to gather the data together and send it to the school. Other times they are afraid of what the students might find in the way of operational mistakes (unwanted headlines). So, I often had to get creative.
Equinor is a welcome outlier in this picture, with its donation of the Volve field data set (and more), and if you are old enough you might remember the DOE's donation of the Teapot Dome data from Wyoming. I had a graduate student collect the donated data sets we could find and build a data foundation catalog that we shared with the department.
One data set I was always looking for was a time-series collection of multiple related variables, so I could give the students some practice with multi-variable regression analysis. The data set I used was not from the oil industry; it was from my own experience with blogging on LinkedIn (hopefully close enough to reality). The dependent variable was the number of views; the independent variables were the number of likes (or "reactions", as LinkedIn now calls them), the number of comments, the number of views by folks from Chevron (my old employer), the number of views by folks in Houston, and lastly the number of views by folks in Denver. The problem was to look at this data set (about 125 records) and develop an equation that helps me write a blog article (this one probably won't do that well) that gets more views (and makes me feel better).
Most students figured out that the best-fit model narrowed down to only two variables: likes and the number of views from Houston. Statistics gurus call this reducing the dimensionality of the model. The Chevron views and the Houston views were correlated, so you can throw one of them out. There were so few comments that that variable did not help the model much, and the number of folks in Denver was also a pretty small population (I need to work on that, since it has now been a decade since I retired and moved away from Houston). Anyway, this data set was adequate for the purpose of exercising their statistical muscles.
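For readers who want to try the exercise themselves, here is a minimal sketch of what the students did, written in Python with statsmodels (my choice of library here, not necessarily theirs). The numbers are synthetic, generated to roughly match the averages quoted below; the coefficients and the seed are invented for illustration, since the real LinkedIn records are not reproduced here.

```python
# A sketch of the homework: regress views on the engagement counts,
# inspect the correlation matrix, then fit the reduced model.
# All data below is synthetic and only loosely mimics the real set.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 125
likes = rng.poisson(29, n)
houston = rng.poisson(54, n)
chevron = (0.5 * houston + rng.poisson(5, n)).round()  # correlated with Houston
comments = rng.poisson(5, n)
denver = rng.poisson(10, n)
views = 40 + 3.5 * likes + 1.2 * houston + rng.normal(0, 25, n)

df = pd.DataFrame({"likes": likes, "comments": comments,
                   "chevron": chevron, "houston": houston,
                   "denver": denver, "views": views})

# Correlated predictors (here Chevron vs Houston) show up immediately.
print(df.corr().round(2))

# Full model, then the reduced two-variable model most students landed on.
full = sm.OLS(df["views"], sm.add_constant(df.drop(columns="views"))).fit()
reduced = sm.OLS(df["views"], sm.add_constant(df[["likes", "houston"]])).fit()
print(full.rsquared_adj, reduced.rsquared_adj)
```

If the reduced model's adjusted R-squared is essentially the same as the full model's, the extra variables were not earning their keep, which is the dimensionality-reduction lesson in miniature.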
Occasionally there was an outlier. Outliers are usually thrown out as random noise, but sometimes they are just a different response. My last blog (Goodbye Professor) was one of those outliers. So far, it has 719 views (the average across my 126 articles is 185 and the median is 123, so you know my data set is not a Gaussian distribution), 265 likes (average of 29), 67 comments (average of only 5), 97 views from Chevron (average of 28), 214 from Houston (average of 54) and 133 from Denver (average of 10). I have found the answer to my homework assignment: all I must do to get a big response is quit! Just kidding. The different response was a great outpouring of kind thoughts from friends (even those I have never met, the social media kind of friends) and colleagues over the years. I do not know how to top that, but it made me feel good, so thanks everyone.
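A mean sitting well above the median is the tell for that kind of skew. Here is a small sketch of the check, again in Python; the lognormal distribution and its parameters are my assumption, chosen only so the synthetic series lands near the averages quoted above.

```python
# Mean-versus-median check on a hypothetical right-skewed views series.
# When the mean is well above the median, a few big posts are pulling
# it up, which is the non-Gaussian pattern described in the text.
import numpy as np

rng = np.random.default_rng(7)
views = rng.lognormal(mean=4.8, sigma=0.9, size=126)  # skewed: mean > median
views = np.append(views, 719)  # the "Goodbye Professor" outlier

print(f"mean:   {views.mean():.0f}")
print(f"median: {np.median(views):.0f}")

# A simple IQR fence flags unusual posts rather than silently discarding them.
q1, q3 = np.percentile(views, [25, 75])
fence = q3 + 1.5 * (q3 - q1)
print("outliers:", views[views > fence].round(0))
```

Flagging rather than deleting matters here: as the paragraph above says, sometimes the outlier is not noise but a genuinely different response.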
I used to give a lecture on how to tell a good story with data. I quickly followed that talk with a lecture on how to lie with data. We get lied to a lot by pseudo-analysis (and marketers and political consultants), even if it is just by a biased-axis plot. The old saying goes that you are entitled to your own opinion but not your own facts (usually attributed to Daniel Patrick Moynihan). In today's world it seems like first you form your opinion and then you go looking for data that supports it. Often good data, good analysis, and good models still do not bring us together, even when the analyst is trying to do just that. Data has become an individual currency, not a universal one. Advanced analytics, machine learning and other statistical techniques often tend to confuse more than enlighten.
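The biased-axis trick is worth seeing once. This little matplotlib sketch (the values are invented) plots the same two numbers twice: once on a zero-based axis and once on a truncated one that makes a roughly 3% difference look like a landslide.

```python
# The "biased axis" trick: identical data, two very different stories.
import matplotlib.pyplot as plt

values = [52.0, 53.5]
labels = ["before", "after"]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(labels, values)
ax1.set_ylim(0, 60)          # honest: full scale, modest difference
ax1.set_title("Zero-based axis")
ax2.bar(labels, values)
ax2.set_ylim(51.5, 54)       # biased: the same gap now looks dramatic
ax2.set_title("Truncated axis")
plt.tight_layout()
plt.show()
```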
I get students all the time who rush to the programming technique or the technology platform, and I must pull them back into looking at the data. I cannot blame them; that is what the culture (and the marketers of high-tech digital technologies) urges them to do. Data is not cool, technology is. I learned my lesson by having to spend countless hours going through the company's (paper) well files in the company library to find what I needed for my interpretation project. Often I did not have enough data, so my interpretation was as much art and hope as real analysis (thanks to a mentor by the name of Bruce Baum, who taught me how to contour with sparse data and how to make a syncline look like a syncline and an anticline look like a drilling prospect).
In conclusion: in this digital world, we do not pay a lot of attention to data. Just pour whatever you can find into the artificial-neural-network machine learning subroutine in the Jupyter notebook of Python code, and magically out comes the answer you have always been looking for. If not, go back and wash your data a little bit better and try a second time (reduce the dimensionality). With some more programming you can get your data to tell the story you want it to. If that fails, try manipulating the charts and graphs from the data visualization lecture and see if that works out.
Ouch, did I really just write that? Actually, please spend a little extra time with the data. It can tell you a story that you might not be looking for. And thanks for all the kind words and memories from my retirement blog. It has been a good journey.