Datawashing and Datawishing
Jim Crompton
Professor of Practice, Petroleum Engineering Department at Colorado School of Mines
One of the biggest challenges I had in my petroleum data analytics courses was getting my hands on good data sets for the students to work on. Many academic data sets are simulated from known equations (some add a little random noise). But the ideal assignment was to work with an industry data set. That is where the problem lies: most companies (with a couple of exceptions) do not want to donate their data to a university. Sometimes the excuse is intellectual property; sometimes it is just that they do not have the time to gather the data together and send it to the school. Other times they are afraid of what the students might find in the way of operational mistakes (unwanted headlines). So, I often had to get creative.
Equinor is a welcome outlier in this picture, with its donation of the Volve field data set (and more), and if you are old enough you might remember the DOE's donation of the Teapot Dome data from Wyoming. I had a graduate student collect the donated data sets we could find and build a data foundation catalog that we shared with the department.
One data set I was always looking for was a time-series collection of multiple related variables, so I could give the students some practice with multi-variable regression analysis. The data set I used was not from the oil industry; it was from my own experience with blogging on LinkedIn (hopefully close enough to reality). The dependent variable was the number of views; the independent variables were the number of likes (or "reactions", as LinkedIn now calls them), the number of comments, the number of views by folks from Chevron (my old employer), the number of views by folks in Houston, and lastly the number of views by folks in Denver. The problem was to look at this data set (about 125 records) and develop an equation that helps me write a blog article (this one probably won't do that well) that gets more views (and makes me feel better).
Most students figured out that the best-fit model narrowed down to only two variables: likes and the number of views from Houston. Statistics gurus call this reducing the dimensionality of the model. The Chevron views and the Houston views were correlated, so you can throw one of them out. There were so few comments that that variable did not help the model much, and the number of folks in Denver was also a pretty small population (I need to work on that, since it has now been a decade since I retired and moved away from Houston). Anyway, this data set was adequate for the purpose of exercising their statistical muscles.
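For readers who want to try the exercise themselves, here is a minimal sketch of what the students did, written in Python with statsmodels (my choice of library here, not necessarily theirs). The numbers are synthetic, generated to roughly match the averages quoted below; the coefficients and the seed are invented for illustration, since the real LinkedIn records are not reproduced here.

```python
# A sketch of the homework: regress views on the engagement counts,
# inspect the correlation matrix, then fit the reduced model.
# All data below is synthetic and only loosely mimics the real set.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 125
likes = rng.poisson(29, n)
houston = rng.poisson(54, n)
chevron = (0.5 * houston + rng.poisson(5, n)).round()  # correlated with Houston
comments = rng.poisson(5, n)
denver = rng.poisson(10, n)
views = 40 + 3.5 * likes + 1.2 * houston + rng.normal(0, 25, n)

df = pd.DataFrame({"likes": likes, "comments": comments,
                   "chevron": chevron, "houston": houston,
                   "denver": denver, "views": views})

# Correlated predictors (here Chevron vs Houston) show up immediately.
print(df.corr().round(2))

# Full model, then the reduced two-variable model most students landed on.
full = sm.OLS(df["views"], sm.add_constant(df.drop(columns="views"))).fit()
reduced = sm.OLS(df["views"], sm.add_constant(df[["likes", "houston"]])).fit()
print(full.rsquared_adj, reduced.rsquared_adj)
```

If the reduced model's adjusted R-squared is essentially the same as the full model's, the extra variables were not earning their keep, which is the dimensionality-reduction lesson in miniature.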
Occasionally there was an outlier. Outliers are usually thrown out as random noise, but sometimes they are just a different response. My last blog (Goodbye Professor) was one of those outliers. So far, it has 719 views (the average across my 126 articles is 185 and the median is 123, so you know my data set is not a Gaussian distribution), 265 likes (average of 29), 67 comments (average of only 5), 97 views from Chevron (average of 28), 214 from Houston (average of 54) and 133 from Denver (average of 10). I have found the answer to my homework assignment: all I must do to get a big response is quit! Just kidding. The different response was a great outpouring of kind thoughts from friends (even those I have never met, the social media kind of friends) and colleagues over the years. I do not know how to top that, but it made me feel good, so thanks everyone.
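A mean sitting well above the median is the tell for that kind of skew. Here is a small sketch of the check, again in Python; the lognormal distribution and its parameters are my assumption, chosen only so the synthetic series lands near the averages quoted above.

```python
# Mean-versus-median check on a hypothetical right-skewed views series.
# When the mean is well above the median, a few big posts are pulling
# it up, which is the non-Gaussian pattern described in the text.
import numpy as np

rng = np.random.default_rng(7)
views = rng.lognormal(mean=4.8, sigma=0.9, size=126)  # skewed: mean > median
views = np.append(views, 719)  # the "Goodbye Professor" outlier

print(f"mean:   {views.mean():.0f}")
print(f"median: {np.median(views):.0f}")

# A simple IQR fence flags unusual posts rather than silently discarding them.
q1, q3 = np.percentile(views, [25, 75])
fence = q3 + 1.5 * (q3 - q1)
print("outliers:", views[views > fence].round(0))
```

Flagging rather than deleting matters here: as the paragraph above says, sometimes the outlier is not noise but a genuinely different response.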
I used to give a lecture on how to tell a good story with data. I quickly followed that talk with a lecture on how to lie with data. We get lied to a lot by pseudo-analysis (and marketers and political consultants), even if it is just by a biased-axis plot. The old saying goes that you are entitled to your own opinion but not your own facts (usually attributed to Daniel Patrick Moynihan). In today's world it seems like first you form your opinion and then you go looking for data that supports it. Often good data, good analysis, and good models still do not bring us together, even when the analyst is trying to do just that. Data has become an individual currency, not a universal one. Advanced analytics, machine learning and other statistical techniques often tend to confuse more than enlighten.
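The biased-axis trick is worth seeing once. This little matplotlib sketch (the values are invented) plots the same two numbers twice: once on a zero-based axis and once on a truncated one that makes a roughly 3% difference look like a landslide.

```python
# The "biased axis" trick: identical data, two very different stories.
import matplotlib.pyplot as plt

values = [52.0, 53.5]
labels = ["before", "after"]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(labels, values)
ax1.set_ylim(0, 60)          # honest: full scale, modest difference
ax1.set_title("Zero-based axis")
ax2.bar(labels, values)
ax2.set_ylim(51.5, 54)       # biased: the same gap now looks dramatic
ax2.set_title("Truncated axis")
plt.tight_layout()
plt.show()
```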
I get students all the time who rush to the programming technique or the technology platform, and I must pull them back into looking at the data. I cannot blame them; that is what the culture (and the marketers of high-tech digital technologies) urges them to do. Data is not cool, technology is. I learned my lesson by having to spend countless hours going through the company's (paper) well files in the company library to find what I needed for my interpretation project. Often I did not have enough data, so my interpretation was as much art and hope as real analysis (thanks to a mentor by the name of Bruce Baum, who taught me how to contour with sparse data and how to make a syncline look like a syncline and an anticline look like a drilling prospect).
In conclusion: in this digital world, we do not pay a lot of attention to data. Just pour whatever you can find into the artificial-neural-network machine learning subroutine in the Jupyter notebook of Python code, and magically out comes the answer you have always been looking for. If not, go back and wash your data a little bit better and try a second time (reduce the dimensionality). With some more programming you can get your data to tell the story you want it to. If that fails, try manipulating the charts and graphs from the data visualization lecture and see if that works out.
Ouch, did I really just write that? Actually, please spend a little extra time with the data. It can tell you a story that you might not be looking for. And thanks for all the kind words and memories from my retirement blog. It has been a good journey.