The countdown to launch begins
Episode 13: 19/12/2019
There have been two clear phases in the creation of El Toco. The first, lonely, years were spent on the proof of concept. There was a lot of concept that needed to be proven, which is why that phase took three years. We have explored some of the more interesting events during that time in prior episodes. Essentially though, during those three years, life consisted of getting up every day, eating a bowl of cereal in front of the computer, and programming.
Initially, those years felt like a race. Every single day, I lived with paranoia gently simmering in the background. Paranoia at the possibility that some other search engine would launch with the same idea before us, and steal the limelight. For those long years, working alone in the bedroom, what got me up every morning was equal parts fear and adrenalin.
I have been rock climbing exactly once in my life and it was the same experience. You dare not look up, and dream of what might happen if you get to the top, because that doesn’t help you with the here and now. Looking down, and imagining a horrible violent death, is similarly unhelpful. Trust me, I tried it. The trick is to focus just on the rocks in front of you. The rest takes care of itself.
El Toco entered the second phase in summer 2019, when we started working with the University of Greenwich. There is one singular goal in that phase, which is to launch.
From this point on, El Toco will always be a team, so it really is a we, not an I. I have found all the pieces of the puzzle and turned them the right way up. All we need to do, as a team, is assemble them into the finished picture.
Our new business plan is to launch in early 2020. I'm writing this entry having just finished six months working with Greenwich. Things have slipped a bit, but this still looks feasible.
In fact, the future me will discover that it’s not feasible at all. Like the shifting snow at the beginning of an avalanche, the tiny slippage that's appearing heralds a much bigger cascade to come.
It will eventually take four years for us to launch, making the second phase of El Toco’s life even longer than the first.
The rest of this series will explain how that avalanche happened. It started with the first quietly shifting bits of snow, during our work with Greenwich in 2019.
The drama with the lazy classifiers
Working with the University of Greenwich has been a fun mini project made difficult by some pretty aggressive goal setting on my part. We almost bit off more than we could chew. El Toco needs four main classifiers: genre, web page type, language, and whether the page contains adult content.
Creating each one of these classifiers would be a solid year’s work for a PhD student. Our slightly insane goal was to create four, during six months, then set them up in production, downloading and classifying web pages in the cloud. I think, in hindsight, if you have done this exact job before, that timeline is feasible. We have not done this exact job before.
Our graduate employee and I continued working at home, with meetings at the university once every few weeks to discuss progress. This let me continue my fantasy that the university's gorgeous buildings are our real office. During those meetings, we present our results over a cup of coffee, then go back to our real office at home to beaver away on the next series of experiments. With each round of experiments we tweak the classifiers, making them more and more accurate.
It is a testament to the hard work of our merry band of three that we have achieved the minor miracle of creating all four classifiers in six months. We also got a bit of help from the fact that somebody had published a language classifier publicly, which we could pilfer. If that sounds like cheating, remember that there are two kinds of startup: the quick and the dead.
At the end of the six month project, you can feed a page to our classifiers and they will tell you it is a family-friendly page, in English, about rainforest frogs, with lots of pictures.
However, in this blog we like to focus on the bits that have gone wrong. They are generally way more entertaining.
What we have not done is set those classifiers up in a production environment. The reason is that they are incredibly, tragically, slow.
The above story about our work with Greenwich is true. But it glosses over some of the details. After agreeing in each meeting what we would do, we would scamper off to write the necessary code. The graduate was particularly prolific at this, and after he got into his stride would churn out reams and reams of Python script with very little effort. I tend to spend more time sitting at the computer thinking, and less time writing. I have genuinely no idea if this is the mark of a more mature developer or somebody who just gets distracted more easily by that annoying man with the leaf blower outside.
After some time coding, we took the reams and reams of Python script and applied it to the data set. Yes, if you have read that entry, this is the part where we used the data collected during the first half of the year.
And then, we would wait. And wait. And wait.
A few days would pass. Then, one of us would casually get in touch with the other one with a message along the lines of "hey, how are those results coming along?". The reply would be, "Oh. You know, just waiting".
Sometimes, we got the results. Sometimes, we lost patience after thirty hours of waiting and just killed the operation. On investigation, what we would find is that the classifier was making a real meal out of some trivial task. Very often, the task was counting words. Counting words is painfully slow in Python for technical reasons, the upshot being that it took over a minute to classify a single page. Experiments involving thousands of pages lasted for days.
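For the curious, here is a minimal sketch of one way to count words in Python using only the standard library. It is not the code we wrote for the classifiers, and the names `WORD_RE` and `count_words` are made up for illustration. It also assumes a "word" is simply a run of letters, digits, or apostrophes, which a real classifier would probably want to refine.

```python
import re
from collections import Counter

# Assumption for illustration: a "word" is a run of lowercase letters,
# digits, or apostrophes. Real pages need more careful tokenising.
WORD_RE = re.compile(r"[a-z0-9']+")


def count_words(text: str) -> Counter:
    """Return a Counter mapping each lowercased word to its frequency."""
    return Counter(WORD_RE.findall(text.lower()))


counts = count_words("The frog sat. The frog jumped!")
print(counts.most_common(2))
```

Handing the whole job to `re.findall` and `Counter` in one go, rather than looping over characters or words by hand, is usually the first thing worth trying when a word count feels slow.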
Note that these were experiments. By definition, we didn't know what we were doing. Often, we would run them, wait for several days, notice an error in the code, and then have to wait for several days to run them again.
All that waiting adds up. Over the course of the six month project, literal weeks were spent sitting by the computer, reading a book, waiting for the classifiers to finish working.
We did learn something doing this. It reminded me of a phrase economists often use, when criticising central banks: "If you don’t know what you’re doing, don’t do too much of it".
As it applies to science, this means "For goodness sake don’t run the big experiment without getting the little experiment working first".
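In code, "getting the little experiment working first" might look something like the sketch below. It is only an illustration of the idea, not what we actually ran: `pilot_sample` and `run_experiment` are hypothetical names, and `classify` stands in for whichever classifier is being tested.

```python
import random


def pilot_sample(pages, k=20, seed=0):
    """Pick a small, repeatable random subset of pages for a trial run."""
    pages = list(pages)
    rng = random.Random(seed)  # fixed seed so a failing run can be repeated
    return rng.sample(pages, min(k, len(pages)))


def run_experiment(pages, classify):
    """Apply a classifier function to every page and collect the results."""
    return [classify(page) for page in pages]


# Hypothetical usage: try twenty pages before committing to thousands.
all_pages = [f"page {i}" for i in range(1000)]
trial = pilot_sample(all_pages, k=20)
trial_results = run_experiment(trial, classify=len)  # len is a stand-in classifier
```

If the trial run crashes or produces nonsense, you find out in seconds instead of after several days of waiting.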
Both of us fell victim to this. The result is that, by the time we finished working with Greenwich, we had barely scraped through writing the classifiers. They work just fine but are way too slow to use in production. They have to be completely rewritten, and because of this we haven’t had time to do anything else.
Rewriting the classifiers from the ground up will be our priority for 2020. That, ladies and gents, is the slipping snow heralding the avalanche of work headed our way. Rather than being a few months' work, it will take most of the year and snowball with other tasks which are also slipping. Eventually, four years will have passed doing something which we expected to finish in six months.
Two very busy interns
While three of us work on the classifiers, El Toco's ranks have been further bolstered by two graduate interns, helping with other areas of the project.
One intern is trying to sort out the shoddy design of our website. I designed it, so can use that word. Since testing it in 2018, I’ve known that our website needs a facelift before going live. It looked good on paper but the users found my attempt at bright, cheerful graphic design too distracting.
The design we’ve settled on is cleaner and simpler, through sparing use of the artwork we already own. Making the new design has been the same iterative process as when I attempted it alone the previous year. Just when you get it all looking good on desktop screen sizes, you go across to mobile and realise the design has to be completely different.
The other intern has been working out which results you see, and in what order, when you enter a search on our website.
It is a very thorny issue. When you have millions of pages, there has to be some order to the results. Google’s original insight was that more important websites have more links pointing to them, so should be displayed first.
At the same time as changing their slogan from “don’t be evil”, they’ve also been backpedalling on the way they rank pages. Are those two facts connected, we wonder. As of 2019, nobody outside Google is allowed to know what their algorithm is any more. This is amusing, because if there’s one thing most people can agree on, it’s that their rankings suck.
We won’t go into why the rankings suck, because I’m sure you already have your own views on the matter. Our solution has been to ditch all the secret complexity and go back to something daring, yet so simple that it might just work.
When you type in your search words, we find the web pages whose text best matches those words.
We won’t guess what you really meant. If you’ve not typed it in properly, you can fix that yourself. We won’t add extra words that you didn’t include. This annoys people as often as it helps them. We won’t adapt our rankings based on how popular a website is. This punishes smaller websites by making them invisible.
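To make the idea concrete, here is a toy sketch of ranking by text match alone. It is an illustration under stated assumptions, not El Toco's actual ranking code: the scoring rule (the fraction of a page's words that are query words), the function names, and the example pages are all invented for this example. Note what is missing on purpose: no spelling correction, no extra query words, and no popularity signal.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words (letters and digits only)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, page_text):
    """Hypothetical score: the fraction of the page's words that are query words."""
    words = tokenize(page_text)
    if not words:
        return 0.0
    counts = Counter(words)
    # Only the exact words the user typed are counted.
    return sum(counts[w] for w in set(tokenize(query))) / len(words)


def rank(query, pages):
    """Return the URLs of matching pages, best match first (ties broken by URL)."""
    scored = [(score(query, text), url) for url, text in pages.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for s, url in scored if s > 0]


# Made-up pages for illustration only.
pages = {
    "frogs.example": "rainforest frogs live in the rainforest",
    "cars.example": "fast cars and loud engines",
    "mixed.example": "frogs and cars and many other unrelated things here",
}
results = rank("rainforest frogs", pages)
print(results)  # ['frogs.example', 'mixed.example']
```

A page that never mentions the query words does not appear at all, and a small website scores exactly the same as a famous one with the same text.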
Looking back on the internships, both interns have made concrete contributions. Despite these contributions, I would caution other entrepreneurs to think very carefully before taking on interns. Especially in very small companies like circa-2019 El Toco, where managing them necessarily takes time away from other tasks.
You have to ask yourself as a manager: which jobs will we simply not do so that I can instead spend time managing the interns? This sounds harsh, but startup life is also harsh. Interns require almost constant attention and, unless they have already accumulated experience elsewhere, deliver the least benefit. I think, summing up what they have achieved and deducting from it the opportunity cost of my time, there was a small net gain. But it was a close-run thing, and with a different pair of interns we could easily have been less lucky.
Dark clouds on the horizon
In the Science Museum in London is a chunk of Google’s original server racks. It is called the corkboard server because they strung lots of cheap computers together, whose motherboards have lots of little pointy bits of metal which stick out underneath. Everybody who has handled the innards of a computer stresses about what happens if these pieces of metal get damaged. I strongly suspect the answer is nothing, but have never dared test this experimentally. Sergey and Larry’s solution was to mount them on thin sheets of cork.
Google quickly outgrew the capacity of a single physical machine, so spreading its load over lots of cheap computers was the only real way of solving the problem. Aside from cutting up bits of cork, it no doubt involved a lot of fiddling. Getting the computers to talk to each other, and creating a system which distributed the work, using the hardware available in the late nineties, are serious technical feats. As a search engine founder, I’d be more interested to learn how they achieved that than to see some dusty electronic components.
Twenty years later, nobody, apart from Chinese bitcoin miners, really hosts stuff on their own hardware any more. Everybody from small entrepreneurs to governments keeps their public facing systems in vast warehouses full of servers, which we call the cloud.
The unit economics of the cloud are undeniable, because it lets you gradually tack on computing power for a lot less than the expense of buying another physical machine and plugging it into your system. Amazon, one of the largest cloud providers, even lets you trial its service for free.
This is much less impressive than it sounds. They offer a tiny server with less power than your average laptop which you can use for a few months, at which point it starts incurring a fee. Our 2018 test website was hosted on the Windows version of this server. Despite its limited hardware, Amazon’s setup is so good that our test users found El Toco’s search as fast as Google’s.
There is one main downside to the cloud, which almost nobody talks about. It’s insanely complicated. There is really no hand holding. The learning curve is less like a curve and more like a vertical wall of some frictionless material which would challenge even the boldest gecko to climb.
The top three cloud providers are Amazon, Google, and Microsoft. Bearing in mind that El Toco is a search engine, it felt like a bit of a cheeky move to host our service on the competition’s server farm. So we went with the market leader, Amazon.
Early in 2019, I approached Amazon to ask if they could help with some cost projections for our service. That call contained a technobabble density of which even the writers of Star Trek would have been proud. The operator argued that they could achieve twenty terabits of throughput on a two point five megabits per second line by connecting it in a dedicated private subnet to something else I’d never heard of. But it was impossible for me to reply, having lost the ability to speak because I was trying so hard not to laugh at the lack of regular English in each sentence he delivered.
I was then shown something called the Simple Calculator. This is an online tool that aims to let potential customers price their service. It also has the secondary aim of confusing everybody who uses it, which it achieves with a very high success rate. After a few days wrestling with the Simple Calculator, I came to two conclusions. Firstly, that I felt sorry for anybody who had to use the Complex Calculator, if such a tool exists. Secondly, that we were going to need some help.
Most newcomers to cloud computing arrive at this point. To help them navigate the maze of Amazon's cloud, an extremely fast-growing industry of consulting companies has sprung up.
So it was that the last action of 2019 was a meeting with our chosen cloud consulting company, to outline what we want to do.
It took place in a WeWork, which was my first exposure to that company. Before being taken upstairs, we waited for a long time at a tiny reception desk with no receptionist. It would have been awkward but was actually ok because the reception formed a chokepoint to watch the flow of colourfully dressed miscreants passing in and out of the building, many of them led by dogs.
A touch screen reception system had been installed and then apparently abandoned before being set up properly. A queue formed behind us as other people arrived, tried, and failed to fathom how it worked. Eventually, a receptionist materialised and we were shown upstairs into a small meeting room with glass walls looking onto a kitchen. I hope the interior designer had intended to evoke the atmosphere of a bustling IKEA showroom, because that is the ambience they had achieved. I broke the ice by spilling coffee all over the wobbly table while outlining El Toco’s plan for global control of the search market. But the people from the cloud consulting company were friendly and receptive to our plans while they mopped up all the coffee.
They seemed happy, but the cloud consulting company has not been named, for reasons which will come to light in the next episode.
For now, here is a summary of our actual progress towards launch, as of December 2019 when we finished working with Greenwich:
This episode is dedicated to everyone involved in the KEEP+ project with the University of Greenwich, and our two interns.