Zillow and the problem of 'average accuracy'
A live, cautionary tale for anyone who thinks that pharma will be fixed with the simple application of better data, simulation/ML/AI and ‘disruption’… I was so struck by this Twitter thread by Mark Tenenholtz, where he covers the disastrous Zillow project.
What is clear is that Zillow had better data on its market than pharma does. The process of buying and selling is less complex than pharmacology and biology (although as I am between houses right now, that’s hard to imagine!). They did it right, in terms of spending years testing their model against actual market data.
But they still got it wrong. As Mark writes, ‘average accuracy metrics’ hide big decision errors and privileged information. “The regression model was totally fine. Their decision analysis was not.”
Now, please, think about how your eNPVs corrupt your decision process… And, perhaps worse, the source of your PTS/ PRS algorithms, which are like the necrotic core of those eNPVs…
Read on:
Zillow’s home buying business lost them $500,000,000, 25% of their stock value, and 25% of their workforce.
How did this happen to a company with so much data on housing prices?
Bad model evaluation.
Here’s the fatal error they made that you must avoid when deploying models:
Anyone who has ever wanted to buy or sell a home knows how arduous a process it is.
It’s a difficult process with tons of back-and-forth, and usually takes months.
So what if someone could buy from impatient sellers and sell to impatient buyers?
Enter, Zillow:
Zillow is really good at pricing homes.
I mean really good.
Their Zestimate score reportedly has an average accuracy of 96%, and closer to 99% on homes up for sale.
With all this data available to them, they could carefully back-test through all sorts of market conditions.
However, they didn’t just thrust themselves into the market.
Over the course of ~3 years, they simulated their strategy.
Inspired by successful simulations, they began to purchase tens of thousands of homes.
If their simulation was so successful, though, how’d they fail?
The first part of their failure was a massive information disadvantage.
I know what you’re thinking:
“But Mark, you just said they have a huge information advantage and a super accurate price estimate for homes!”
Sure, on average, they’re going to be very accurate.
But this is the problem with average accuracy metrics — they mask big errors.
It’s inevitable that even a Zestimate with up to 99% accuracy will miss big on some homes.
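A toy simulation makes the point concrete (all numbers here are my own illustrative assumptions, not Zillow's data): if 99% of estimates are within a couple of percent of true value but 1% miss wildly, the headline accuracy still looks superb while the worst single miss is enormous in dollar terms.

```python
import random

random.seed(0)

# Hypothetical: 1,000 home-price estimates. 99% have small errors
# (sigma = 2%), 1% come from a fat tail (sigma = 30%).
true_prices = [300_000] * 1000
errors = [random.gauss(0, 0.02) if random.random() < 0.99 else random.gauss(0, 0.30)
          for _ in true_prices]
estimates = [p * (1 + e) for p, e in zip(true_prices, errors)]

abs_pct_errors = [abs(est - p) / p for est, p in zip(estimates, true_prices)]
avg_accuracy = 1 - sum(abs_pct_errors) / len(abs_pct_errors)
worst_miss = max(abs_pct_errors)

print(f"average accuracy:  {avg_accuracy:.1%}")  # the flattering headline number
print(f"worst single miss: {worst_miss:.1%}")    # the error the average hides
```

The average stays in the high 90s because the tail is rare; but at scale, that tail is exactly where the bids land wrong.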
How does this happen in the housing market?
Well, the homeowner and their real estate agent inevitably have more information on the home than Zillow.
What happens, for instance, if the house has a strong odor or big plumbing issues?
In the long run, this hurts Zillow a lot.
The second part of their disadvantage was an adversarial market.
Remember how I mentioned average accuracy metrics don’t capture the big misses?
Well, the big misses likely come in situations when the homeowner has a key piece of info that Zillow is missing.
So, if Zillow put in a bid that wasn’t high enough, the homeowner would reject it.
But if Zillow put in a bid that was way too high, the homeowner would definitely accept it.
Basically, Zillow was getting the worst case scenario on almost all of their purchases.
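This adverse-selection dynamic can be sketched in a few lines (a simplified model of my own, not Zillow's actual bidding process): even if the estimator is perfectly unbiased, sellers who know the true value accept only the bids that favor them, so the accepted deals overpay on average.

```python
import random

random.seed(1)

# Toy model: the buyer bids an unbiased estimate of true value (sigma = 4%);
# the seller, who knows the true value, accepts only bids at or above it.
n = 10_000
overpayments = []
for _ in range(n):
    true_value = 300_000
    bid = true_value * (1 + random.gauss(0, 0.04))  # unbiased on average
    if bid >= true_value:                           # seller filters out lowballs
        overpayments.append(bid - true_value)

avg_overpay = sum(overpayments) / len(overpayments)
print(f"accepted offers: {len(overpayments)} of {n}")
print(f"average overpayment on accepted offers: ${avg_overpay:,.0f}")
```

Roughly half the bids get accepted, and every accepted one is at or above true value: an unbiased model plus an adversarial counterparty yields a systematically money-losing portfolio.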
Finally, the death blow — all of their simulations took place during a market where housing prices were significantly rising.
This meant that if they screwed up a bid, they were probably still going to survive since their portfolio was constantly growing in value.
However, once the market cooled off, they were exposed.
A well-run house-flipping operation can still succeed in a cool market.
But in Zillow’s case, the cooldown simply uncovered the deficiencies that a rising market had masked.
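The rising-market mask is easy to see with illustrative numbers of my own (not from the thread): overpay by 5% on a purchase, and six months of appreciation can still turn the flip profitable; the same mistake in a flat market is a straight loss.

```python
# Illustrative: buy at 5% above true value, hold six months, sell at
# true value grown by market appreciation.
OVERPAYMENT = 0.05
HOLD_MONTHS = 6

def flip_return(monthly_appreciation: float) -> float:
    sale = (1 + monthly_appreciation) ** HOLD_MONTHS  # market moves the price
    cost = 1 + OVERPAYMENT                            # the bad bid
    return sale - cost

print(f"hot market (+1.5%/mo): {flip_return(0.015):+.1%}")  # mistake masked by appreciation
print(f"flat market (0%/mo):   {flip_return(0.0):+.1%}")    # same mistake, now a loss
```

A backtest run entirely inside the hot-market regime never sees the second case, which is why the simulations looked fine right up until the market cooled.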
This is why model evaluation is so difficult, and yet so incredibly important to get right.
As a field, we’re still in the early phases of understanding how to account for adversarial conditions.
I hope this thread drives home just how important they are to consider!
I hope you learned something!
Follow me @marktenenholtz for more high-signal ML content.
Let’s build more robust ML models together.
Okay folks, we need to talk.
If you think the problem here is that they needed a "human" factor, or that they did poor regression analysis, you're wrong.
The problem is their decision analysis.
THIS is what failed in backtesting, NOT the regression.