Algorithms to Live By - A Sneak Peek
SHIV KUMAR
Chief Marketing Officer (CMO) & Head of Passenger Experience, GMR Airports - Driving Growth, Commercial Excellence & Innovation - Champion of Digital Transformation & Growth Hacking - via MRF & Coca-Cola
Algorithms to Live By is one of the popular works of the modern world, in which the authors apply algorithms from computer science to common life problems. It helps us analyse complex life problems and simplify the decision-making process.
Optimal Stopping Problem
You can use solutions from computer science to solve problems in real life. These can be wide-ranging: from how to choose your soulmate, to when to choose a new restaurant vs your favourite.
This problem of deciding how long to scan - and whom to choose - is the Optimal Stopping Problem.
The optimal solution takes the form of Look-Then-Leap Rule: You set a predetermined amount of time for “looking” - that is, exploring your options, gathering data - in which you categorically don’t choose anyone, no matter how impressive. After that point, you enter the “leap” phase, prepared to instantly commit to anyone who outshines the best person you saw in the look phase.
This optimal number comes out to 37%.
The 37% Rule: look at the first 37% of the applicants, choosing none, then be ready to leap for anyone better than all those you’ve seen so far.
How do you measure the 37%? It can be based either on time or on the total number of candidates.
For example, searching for a soulmate with a preferred looking window between ages 20 and 32: the optimal age to leap comes at 24.5 (37% of the way through the window).
In this case, when you find someone better than anyone you’ve met after you’re 24.5 years old - marry them.
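Here is a minimal simulation sketch of the Look-Then-Leap Rule (my illustration, not the authors' code; the candidate pool and scores are invented):

```python
import random

def look_then_leap(candidates, look_fraction=0.37):
    """Reject everyone in the 'look' phase, then commit to the first
    candidate who beats the best seen so far."""
    cutoff = int(len(candidates) * look_fraction)
    best_seen = max(candidates[:cutoff], default=float("-inf"))
    for score in candidates[cutoff:]:
        if score > best_seen:
            return score
    return candidates[-1]  # no one outshone the look phase: take the last

# Over many random orderings of 100 candidates, the rule lands
# the single best candidate roughly 37% of the time.
n, trials = 100, 10_000
wins = sum(
    look_then_leap(random.sample(range(n), n)) == n - 1
    for _ in range(trials)
)
print(f"best candidate chosen in {wins / trials:.0%} of trials")
```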
Explore - Exploit Problem
Imagine walking into a casino full of different slot machines, each one with its own odds of a payoff. You don’t know the odds in advance. Until you start playing, you won’t have any idea which machines are the most lucrative and which ones are money sinks.
This scenario is the “multi-armed bandit problem.”
Choosing a restaurant or an album is all about deciding which arm to pull in life’s casino. You wouldn’t know how great they are till you experience them. You could rely on what other people tell you, but you can never know for sure.
A sobering property of trying new things is that the value of exploration, of finding a new favourite, can only go down over time, as the remaining opportunities to savour it dwindle.
Discovering an enchanting café on your last night in town doesn’t give you the opportunity to return.
The flip side is that the value of exploitation can only go up over time. The loveliest café that you know about today is at least as lovely as the loveliest café you knew about last month. So explore when you will have time to use the resulting knowledge, exploit when you’re ready to cash in. The interval makes the strategy.
A better than chance strategy is the Win-Stay, Lose-Shift Algorithm:
Choose an arm at random, and keep pulling it as long as it keeps paying off. If the arm doesn’t pay off after a particular pull, then switch to another.
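A quick sketch of Win-Stay, Lose-Shift (the payout probabilities here are invented for illustration):

```python
import random

def win_stay_lose_shift(payout_probs, pulls=1000):
    """Pull a random arm; stay while it pays, shift at random when it doesn't."""
    arm = random.randrange(len(payout_probs))
    winnings = 0
    for _ in range(pulls):
        if random.random() < payout_probs[arm]:
            winnings += 1                              # win: stay on this arm
        else:
            arm = random.randrange(len(payout_probs))  # lose: shift
    return winnings

print(win_stay_lose_shift([0.3, 0.5, 0.7]))
```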
But changing arms each time one fails is harsh. Would one disappointment at your favourite restaurant be enough to give up on it?
John Gittins came up with a solution to this problem.
For every slot machine we know little or nothing about, there is some guaranteed payout rate which, if offered to us in lieu of that machine, will make us quite content never to pull its handle again. This number - which Gittins called the “dynamic allocation index,” and which the world now knows as the Gittins index - suggests an obvious strategy on the casino floor: always play the arm with the highest index.
When to put things in order
Sorting
Err on the side of messiness.
Sorting something that you will never search is a complete waste; searching something you never sorted is merely inefficient.
The question becomes how to estimate ahead of time what your future usage will be.
“To lower costs per unit of output, people usually increase the size of their operations,” - J. C. Hosken. This is the economy of scale.
But with sorting, scaling up is expensive.
As a sort grows larger, the unit cost of sorting rises.
Cooking for two is typically no harder than cooking for one, and it’s certainly easier than cooking for one person twice. But sorting, say, a shelf of a hundred books will take you longer than sorting two bookshelves of fifty apiece: you have twice as many things to organize, and there are twice as many places each of them could go. The more you take on, the worse it gets. This is the first and most fundamental insight of sorting theory. Scale hurts.
This problem is also known as the Agony of Sorting.
What’s the solution? With pairwise comparisons, none. But if we can objectively rank everything in the sort, we don’t need the comparisons.
This move from “ordinal” numbers (which only express rank) to “cardinal” ones (which directly assign a measure to something’s caliber) naturally orders a set without requiring pairwise comparisons.[..]. The Fortune 500 list is one of these. To find the most valuable company in the United States, analysts don’t need to perform due diligence comparing Microsoft to General Motors, then General Motors to Chevron, Chevron to Walmart, and so forth. These seemingly apples-to-oranges contests (how many enterprise software installations equal how many oil futures?) become apples-to-apples in the medium of dollars. Having a benchmark—any benchmark—solves the computational problem of scaling up a sort.
The Agony of Sorting
With sorting, size is a recipe for disaster.
The first and most fundamental rule of sorting: scale hurts.
Record for sorting a deck of 52 cards is 36 seconds.
Decide how you are going to measure: best-case time or average sort time.
You also need to know the worst-case time, the worst-case scenario.
This chapter, and the book, discusses the worst-case scenario unless noted otherwise.
Computer science’s shorthand for algorithmic worst-case scenarios is “Big O” notation.
It sheds fine details and gives a schema for dividing problems into broad classes.
Bubble sort is simple but extremely inefficient. Quadratic time.
Bubble sort scans along the line, comparing neighbouring pairs side by side and swapping them when they are out of order, going back over the shelf again and again until it is sorted. An example is sorting a bookshelf of books into alphabetical order. The problem is that each scan through the shelf moves each book at most one position.
Insertion Sort: take every book off the shelf and put them back on one at a time, each into its proper place. It runs a bit faster than bubble sort in practice, but is still quadratic time.
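Rough sketches of both quadratic-time sorts, using book titles as the items (illustrative, not the authors' code):

```python
def bubble_sort(shelf):
    """Sweep the shelf repeatedly, swapping adjacent out-of-order pairs.
    Each pass moves a book at most one position: O(n^2) comparisons."""
    books = list(shelf)
    for end in range(len(books) - 1, 0, -1):
        for i in range(end):
            if books[i] > books[i + 1]:
                books[i], books[i + 1] = books[i + 1], books[i]
    return books

def insertion_sort(shelf):
    """Take every book off the shelf and slot each one back into its
    place among those already sorted. Still O(n^2) comparisons."""
    books = []
    for book in shelf:
        i = len(books)
        while i > 0 and books[i - 1] > book:
            i -= 1
        books.insert(i, book)
    return books

print(bubble_sort(["Moby-Dick", "Dracula", "Middlemarch"]))
```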
What is the minimum effort of time required to create order?
Merge sort - one of the legendary algorithms in computer science - sits between linear time and quadratic time.
Merge sort is the divide and conquer approach. You can collate two sorted stacks almost instantly.
In sorting a census level number of items, this is a difference between making 29 passes through the data set and 300 million.
It is the method of choice for large-scale industrial sorting problems.
It can easily be parallelized.
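A sketch of merge sort; the collation step is the "two sorted stacks" idea from above:

```python
def merge_sort(items):
    """Divide and conquer: sort each half, then collate the two sorted
    stacks. Roughly log2(n) levels of n-item merges: O(n log n)."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):   # collate the two stacks
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]      # one stack is empty now

print(merge_sort([5, 2, 9, 1, 7]))  # [1, 2, 5, 7, 9]
```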
Beyond comparison: outsmarting the logarithm.
The Preston Sort Center, one of the biggest and most efficient book-sorting facilities in the world, is run by the King County Library System.
Bucket Sort
Items are grouped into a number of general categories.
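A sketch of the bucket idea - group items into broad categories first, and sort within each small group only when needed (the categories and titles here are invented):

```python
def bucket_sort(titles):
    """Group items into broad categories (first letter), then sort
    within each small bucket. The grouping pass itself is only O(n)."""
    buckets = {}
    for title in titles:
        buckets.setdefault(title[0].upper(), []).append(title)
    ordered = []
    for letter in sorted(buckets):
        ordered.extend(sorted(buckets[letter]))  # each bucket is small
    return ordered

print(bucket_sort(["dracula", "Middlemarch", "Moby-Dick"]))
```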
Sorting is prophylaxis for searching.
Central trade-off between sorting and searching.
The effort expended on sorting materials is a preemptive strike against the effort to search them later.
Google, for example, presorts search results by machine so that a search takes only seconds.
Most domestic bookshelves do not need to be sorted.
Choosing what books to read in the library
Caching
“It is a mistake to think that that little room [the brain] has elastic walls and can distend to any extent. Depend upon it there comes a time when for every addition of knowledge you forget something that you knew before. It is of the highest importance, therefore, not to have useless facts elbowing out the useful ones.” - Sherlock Holmes
One way to think about forgetting is that our minds run out of space. But the problem might not be one of storage; it might be one of organization. According to Anderson’s theory, the mind has an essentially infinite capacity for memories, but we have only a finite amount of time in which to search for them. Anderson made the analogy to a library with a single long shelf: you can fit as many items as you want on that shelf, but the closer something is to the front, the faster it will be to find.
Something similar occurs in libraries. There’s a recently returned section - which makes the lives of librarians easier. They aren’t expected to immediately put the books back on the shelves. For us, this is an interesting opportunity to look at books that others are reading - instead of the showcase which tells us what to read. It’s the cache of recently read books.
Forgetting can be as important as remembering.
The Memory Hierarchy
A computer hard drive vs a solid-state drive: a small, fast memory and a large, slow one.
Memory access speed has not increased as fast as processing speed.
Most computers, phones and tablets have a six-layer memory hierarchy.
What do we do when memory gets full?
Eviction and clairvoyance.
There comes a time when for every addition of knowledge you forget something that you knew before.
The eviction policy is the heart of a caching algorithm.
The clairvoyant optimal policy - evict the item you will next need furthest in the future - is known today as Bélády’s Algorithm.
Approach options to managing the cache:
- Random: evict an item at random.
- FIFO: First-In, First-Out - evict whatever has been in the cache the longest.
- LRU: Least Recently Used - evict the item that has gone the longest untouched.
LRU method consistently performed the best.
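A minimal LRU cache sketch (a standard construction, not code from the book):

```python
from collections import OrderedDict

class LRUCache:
    """Keep the cache ordered by recency; evict the item that has
    gone the longest untouched."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)         # touched: now most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used

cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2); cache.get("a"); cache.put("c", 3)
print(list(cache.items))  # ['a', 'c'] - 'b' went longest untouched
```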
Temporal locality.
The thing we are least likely to need next is the thing we have gone longest without.
History repeats itself backwards.
Our best guide to the future is a mirror image of the past.
Caching physical items like library books, internet servers and files, Amazon warehouse items, etc.
Multi-level memory hierarchy.
Self-organizing lists (the subject of a classic research paper): always put an accessed item back at the front of the list. This move-to-front rule embodies the LRU principle.
The Forgetting Curve
Ebbinghaus study.
A big book is a big nuisance.
Forgetting things, and taking longer to recall them, is largely a result of knowing more - of having more memories to search through as we age.
How to get things done
Scheduling
Do the difficult things while they are easy and do the great things while they are small. - LAO TZU
Every scheduling algorithm needs some metric you’re optimizing for.
Here are a few.
Doing everything as soon as possible
Assume that every week you get a batch of fresh fruit all at once. Each fruit is set to spoil on a different date - so eating them by Earliest Due Date sounds reasonable. You’ll get to every item as soon as you can.
However, it’s not the end of the story. Earliest Due Date is optimal for reducing maximum lateness, which means it will minimize the rottenness of the single most rotten fruit you’ll have to eat. This may not be the most appetizing metric to eat by.
How do you fix this? Change what you’re optimizing for.
Doing something(s) on time
Let’s say, we want to minimize the number of fruits that spoil and we won’t eat spoilt fruits.
Here a strategy called Moore’s Algorithm gives us our best plan. Moore’s Algorithm says that we start out just like with Earliest Due Date—by scheduling out our fruits in order of spoilage date, earliest first, one item at a time.
But as soon as it looks like we won’t get to eating the next item in time, we pause, look back over the meals we’ve already planned, and throw out the biggest item (that is, the one that would take the most days to consume). For instance, that might mean forgoing the watermelon that would take a half dozen servings to eat; not even attempting it will mean getting to everything that follows a lot sooner. We then repeat this pattern, laying out the foods by spoilage date and tossing the largest already scheduled item any time we fall behind. Once everything that remains can be eaten in order of spoilage date without anything spoiling, we’ve got our plan.
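A sketch of Moore’s Algorithm on made-up fruit (the names, eating times and spoilage days are invented):

```python
import heapq

def moores_algorithm(fruits):
    """fruits: (name, days_to_eat, spoils_on_day) tuples.
    Schedule by Earliest Due Date; whenever the next fruit can't be
    finished before it spoils, toss the biggest item scheduled so far."""
    by_due = sorted(fruits, key=lambda f: f[2])
    heap, kept, tossed, day = [], set(), [], 0
    for name, days, due in by_due:
        heapq.heappush(heap, (-days, name))    # track the biggest so far
        kept.add(name)
        day += days
        if day > due:                          # we'd miss this deadline
            neg_days, biggest = heapq.heappop(heap)
            kept.discard(biggest)
            tossed.append(biggest)
            day += neg_days                    # reclaim its eating time
    plan = [name for name, _, _ in by_due if name in kept]
    return plan, tossed

plan, tossed = moores_algorithm(
    [("berries", 1, 2), ("watermelon", 6, 5), ("peaches", 2, 7)]
)
print(plan, tossed)  # ['berries', 'peaches'] ['watermelon']
```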
Doing as much as possible
Minimizing the sum of completion times leads to a very simple optimal algorithm called Shortest Processing Time: always do the quickest task you can.
Even if you don’t have impatient clients hanging on every job, Shortest Processing Time gets things done. (Perhaps it’s no surprise that it is compatible with the recommendation in Getting Things Done to immediately perform any task that takes less than two minutes.) Again, there’s no way to change the total amount of time your work will take you, but Shortest Processing Time may ease your mind by shrinking the number of outstanding tasks as quickly as possible. Its sum-of-completion-times metric can be expressed another way: it’s like focusing above all on reducing the length of your to-do list. If each piece of unfinished business is like a thorn in your side, then racing through the easiest items may bring some measure of relief.
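A sketch of Shortest Processing Time on an invented task list:

```python
def shortest_processing_time(tasks):
    """tasks: (name, duration) pairs. Doing the quickest task first
    minimizes the sum of completion times across all tasks."""
    order = sorted(tasks, key=lambda t: t[1])
    elapsed, total_completion = 0, 0
    for name, duration in order:
        elapsed += duration            # this task finishes at `elapsed`
        total_completion += elapsed
    return [name for name, _ in order], total_completion

order, total = shortest_processing_time(
    [("write report", 4), ("reply to email", 1), ("fix slides", 2)]
)
print(order, total)  # quickest first; completion times 1 + 3 + 7 = 11
```

The weighted variant the book also discusses sorts by importance divided by duration instead of by duration alone.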
Predicting the future - How long things will last
BAYES’ RULE gives us a means to predict the future. The only requirement? Have a strong sense of priors.
What are priors? What you know about what you’re trying to predict.
Good predictions require good priors. This has a number of important implications. Our judgments betray our expectations, and our expectations betray our experience. What we project about the future reveals a lot—about the world we live in, and about our own past.
Let’s dive in.
The Copernican Principle:
If we want to predict how long something will last, and have no other knowledge about it whatsoever, the best guess we can make is that it will continue just as long as it’s gone on so far. This is the Multiplicative Rule that arises under a power-law prior: multiply what has elapsed so far by a constant (here, two).
But there is a problem. If you meet a 90-year-old man, the Copernican Principle predicts he will live to 180. Every 6-year-old boy, meanwhile, is predicted to face an early death at the tender age of 12. To understand why the Copernican Principle works, and why it sometimes doesn’t, we need to return to Bayes’s Rule and the priors.
When we apply Bayes’s Rule with a normal distribution as a prior, we get an Average Rule: use the distribution’s “natural” average as your guide.
For instance, if somebody is younger than the average life span, then predict the average; as their age gets close to and then exceeds the average, predict that they’ll live a few years more. Following this rule gives reasonable predictions for the 90-year-old and the 6-year-old: 94 and 77, respectively. (The 6-year-old gets a tiny edge over the population average of 76 by virtue of having made it through infancy: we know he’s not in the distribution’s left tail.)
Between those two extremes, there’s a third category of things in life: those that are neither more nor less likely to end just because they’ve gone on for a while. Sometimes things are invariant. The Danish mathematician Agner Krarup Erlang, who studied such phenomena, formalized the spread of intervals between independent events into the function that now carries his name: the Erlang distribution.
The Erlang distribution gives us a third kind of prediction rule, the Additive Rule: always predict that things will go on just a constant amount longer.
- Power-law distribution: the Multiplicative Rule
- Erlang distribution: the Additive Rule
- Normal distribution: the Average Rule
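Toy versions of the three rules (the constants are illustrative only; the book’s full Bayesian calculation is what gives 77 and 94 for the 6- and 90-year-old):

```python
def multiplicative_rule(elapsed, factor=2.0):
    """Power-law prior (the Copernican case): predict a constant
    multiple of what has elapsed so far."""
    return factor * elapsed

def average_rule(elapsed, mean=76, a_bit_more=4):
    """Normal prior: predict the natural average; once near or past
    it, predict just a few more years."""
    return max(mean, elapsed + a_bit_more)

def additive_rule(elapsed, constant=10):
    """Erlang prior (memoryless events): always predict a constant
    amount longer, however long it has already been."""
    return elapsed + constant

for age in (6, 90):
    print(age, "->", average_rule(age))   # 6 -> 76, 90 -> 94
```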
News, social media, and other forms of story-telling can very easily skew our perceived distributions and likelihoods for a given phenomenon. Representation in the media does not equal frequency in the real world.
If you want to be a good intuitive Bayesian—if you want to naturally make good predictions, without having to think about what kind of prediction rule is appropriate—you need to protect your priors. Counterintuitively, that might mean turning off the news.
When to think less
Overfitting
The lesson is this: it is indeed true that including more factors in a model will always, by definition, make it a better fit for the data we have already. But a better fit for the available data does not necessarily mean a better prediction.
One way to combat overfitting: withhold some of the available data points. The limit can be conscious or built into the circumstances.
Consciously, give yourself five minutes to make a decision.
Built in, start making the decision five minutes before the due date.
Pro and con lists. Recommended by Benjamin Franklin.
There can be wisdom to deliberately thinking less in specific circumstances.
Cross-validate to prevent overfitting: use secondary data points to check the primary one.
Another way to combat overfitting is penalizing complexity.
If you can’t explain it simply, you don’t understand it well enough.
Occam’s Razor.
9 factor model vs 3 factor model.
Allowing more time can create more complexity and be counterproductive.
Early stopping.
Another way: Testing your model with data derived from some other form of evaluation. The use of proxy metrics—taste as a proxy for nutrition, number of cases solved as a proxy for investigator diligence—can also lead to overfitting.
In these cases, we’ll need to cross-validate the primary performance measure we’re using against other possible measures.
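A sketch of cross-validation by withholding data, contrasting a three-factor and a nine-factor polynomial model (the data and seed are invented; assumes numpy is available):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(0, 0.1, size=x.size)  # the truth is linear, plus noise

idx = rng.permutation(x.size)                # withhold 5 points for testing
train, test = idx[:15], idx[15:]
for degree in (3, 9):
    coeffs = np.polyfit(x[train], y[train], degree)
    train_err = np.mean((np.polyval(coeffs, x[train]) - y[train]) ** 2)
    test_err = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
    print(f"{degree}-factor model: train {train_err:.4f}, test {test_err:.4f}")
# More factors always fit the training data at least as well, but the
# withheld points tend to expose the complex model's worse predictions.
```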
How to solve difficult problems
Relaxation
And this is where computer science figured out something invaluable, something we can all learn from: how to best approach problems whose optimal answers are out of reach. How to relax.
It’s probably not the relaxation you’re thinking about. It’s about relaxing a problem.
- Constraint Relaxation removes some constraints altogether and makes progress on a looser form of the problem before coming back to reality.
- Continuous Relaxation turns discrete or binary choices into continua: when deciding between iced tea and lemonade, first imagine a 50–50 “Arnold Palmer” blend and then round it up or down.
- Lagrangian Relaxation turns impossibilities into mere penalties, teaching the art of bending the rules (or breaking them and accepting the consequences). A rock band deciding which songs to cram into a limited set, for instance, is up against what computer scientists call the “knapsack problem” - a puzzle that asks one to decide which of a set of items of different bulk and importance to pack into a confined volume. In its strict formulation the knapsack problem is famously intractable, but that needn’t discourage our relaxed rock stars.
What would you do if you weren’t afraid? What would you do if you could not fail? What would you do if you won the lottery? What would you do if all jobs paid the same? The idea behind such thought exercises is exactly that of Constraint Relaxation: to make the intractable tractable, to make progress in an idealized world that can be ported back to the real one. If you can’t solve the problem in front of you, solve an easier version of it—and then see if that solution offers you a starting point, or a beacon, in the full-blown problem. Maybe it does.
What to do the next time someone cancels on you
In human society, we tend to adopt a policy of giving people some finite number of chances in a row, then giving up entirely. Three strikes, you’re out. This pattern prevails by default in almost any situation that requires forgiveness, lenience, or perseverance. Maybe we’re doing it wrong.
Solution: Exponential Backoff on the invitation rate. Try to reschedule in a week, then two, then four, then eight. The rate of “retransmission” goes toward zero - yet you never have to completely give up.
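A sketch of the rescheduling intervals under Exponential Backoff (the wait times are invented):

```python
def backoff_schedule(first_wait_days=7, attempts=6):
    """Double the wait after each failed attempt: the retry rate heads
    toward zero, but you never give up entirely."""
    return [first_wait_days * 2 ** i for i in range(attempts)]

print(backoff_schedule())  # [7, 14, 28, 56, 112, 224]
```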
How to make sure your plans work out
Computational Kindness
Language like “Oh, I’m flexible” or “What do you want to do tonight?” has a dark computational underbelly that should make you think twice. It has the veneer of kindness about it, but it does two deeply alarming things.
- It passes the cognitive buck: “Here’s a problem, you handle it.”
- By not stating your preferences, it invites the others to simulate or imagine them.
The simulation of the minds of others is one of the biggest computational challenges a mind (or machine) can ever face. In such situations, computational kindness and conventional etiquette diverge. Politely withholding your preferences puts the computational problem of inferring them on the rest of the group.
In contrast, politely asserting your preferences (“Personally, I’m inclined toward x. What do you think?”) helps shoulder the cognitive load of moving the group toward resolution.
TOP 20 INSIGHTS
- The "37% rule" refers to a series of steps, or algorithms, that someone must follow to make the best decision within a set amount of time. Someone allots 37% of their time to research before they make a decision, then commits to the very next "best choice" they find.
- The "explore/exploit" trade-off refers to the need to balance the tried and tested with the new and risky. The payoff of this algorithm depends entirely on how much time you have to make decisions. People are more likely to visit their favorite restaurant on their last night in town than risk something new.
- Developed in 1952 by mathematician Herbert Robbins, the "Win-Stay, Lose-Shift" algorithm uses slot machines as a metaphor. Choose a machine at random and play it until you lose. Then switch to another machine; this method was proven to be more reliable than chance.
- A psychology study found that given choices, people often "over explore" rather than exploit a win. Given 15 opportunities to choose which slot machine would win, 47% used Win-Stay, Lose-Shift strategies, and 22% chose machines randomly instead of staying with a machine that paid out.
- Hollywood is a prime example of the exploit tactic. The number of movie sequels has steadily increased over the last decade. In both 2013 and 2014, seven of the Top 10 films were either sequels or prequels. The trend is likely to change if new movie ideas draw more box office dollars.
- The A/B test is similar to the two slot machine scenario in that you stick with the option that performs best. More than 90% of Google's $50 billion in annual revenue comes from paid advertisements, which means that explore/exploit algorithms power a large portion of the internet.
- The Gittins Index provides a framework of odds that assume you have an indefinite amount of time to achieve the best payoff, but the chances reduce the longer you wait. For example: choose a slot machine with a track record of one-to-one wins/losses (50%) over the machine that has won nine out of 18 times.
- "Upper Confidence Bound" algorithms offer more room for discovery than the "Win-Stay, Lose-Shift" method. This algorithm assigns a value based on what "could be" based on the information available. A new restaurant has a 50/50 chance to provide a good experience because you have never been there.
- The "Shortest Processing Time" algorithm requires that you complete the quickest tasks first. Divide the importance of the task by how long it will take. Only prioritize a task that takes two times as long if it is two times as important.
- Laplace's Law calculates the odds that something will occur with only small amounts of data. Count how many times that result has happened, add one, then divide by the number of opportunities plus two. For example: your softball team plays eight games per season and has won two of the six games it has played so far. (2+1)/(6+2) = 3/8, or a 37.5% chance you win the next game (see the sketch after this list).
- The Copernican Principle allows you to predict how long something will last without knowing much of anything about it. The solution is that it will go on as long as it has gone on so far. Based on this principle, Google will reasonably last until 2044 (23 years since 1998 + 23 from 2021).
- "Power-law distribution" considers that, in life, most things fall below the mean and a few rise above. Two-thirds of the US population makes less than the mean income, but the top 1% make almost ten times the mean. Few movies make "Titanic" level money in the box office, but some do.
- The "Nash Equilibrium" explores the phenomenon of two-player games and the way that players form strategies that neither wants to change based on what the other person does. This creates stability. In Rock-Paper-Scissors with three options, players adopt a 1/3-1/3-1/3 strategy unless the other person changes tactics, and the process starts again.
- Human brains have a nearly infinite capacity for memories, but we have a finite amount of time to access them. This results in the "forgetting curve." A study by Hermann Ebbinghaus found that he could recall nonsense syllables 60% of the time after he read them, but it declined to 20% after 800 hours.
- Ebbinghaus' "forgetting curve" was shown to closely match how often words are used in society. The recurrence of words found in headlines of The New York Times declined at a rate of 15% over 100 days and implied that human brains naturally tune their processes to the world around us.
- The stock market "flash crash" of May 6, 2010 was caused by an "information cascade." When one person does something different, then other people follow suit, assuming that the first person knows something they don't. This behavior causes people to panic buy or exhibit mob behavior.
- Sociologist Barry Glassner noted that murders in the United States declined by 20% throughout the 1990s, and yet the mention of gun violence on American news increased by 600%. An information cascade can be caused more by public information than private information.
- When authors Brian Christian and Tom Griffiths scheduled interviews for the book, they found that experts were more likely to accept a narrow, predetermined window than a wide-open one. It is less challenging to accommodate constraints than to reason through an unconstrained choice.
- Believe it or not, randomness is part of life's algorithm, too. Nobel prize-winner Salvador Luria realized that random mutations could produce viral resistance by watching his friend win the jackpot on a slot machine.
- The best-laid plans are often the simplest. Jason Fried and David Heinemeier Hansson, founders of software company 37signals, use a thick marker when they start to brainstorm because it limits space and forces them to keep it simple and focus on the big picture.
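The Laplace's Law sketch promised above, using the softball numbers from that insight:

```python
def laplaces_law(wins, games_played):
    """Estimate the chance of success next time from a small sample:
    (wins + 1) / (attempts + 2)."""
    return (wins + 1) / (games_played + 2)

print(laplaces_law(2, 6))  # 0.375 -> a 37.5% chance of winning the next game
```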