The First Production Application on Hadoop
This article describes, to the best of my knowledge, the first production application deployed on the first production Hadoop cluster at Yahoo! in the summer of 2006.
I joined Yahoo! Web Search in January of 2006. My first assignment was to rewrite and scale an application called "Key Site Monitoring (KSM)", originally written in the Perl programming language by Lakis Karmirantzos, a software engineer at Inktomi and later at Yahoo! Web Search after Yahoo! had acquired Inktomi.
Before describing the details of KSM, it is a good idea to give a very high-level overview of a web search engine, one that resembles the Yahoo! web search engine of the time.
How a search engine works. A search engine can be thought of as a machine that takes queries from users and returns search results in response: the queries are the machine's inputs and the search results are its outputs. For this machine to work well, it needs to collect most of the relevant resources that will be used to answer most user queries. In addition, these resources need to be organized in such a way that a response can be produced in a fraction of a second.
The relevant resources consist of a copy of the Web, meaning a copy of (potentially) every webpage and document as well as how they refer or link to each other. This collection process is called "crawling", and it is carried out by a crawler, software that runs continuously on 1000s of servers and accesses the servers of websites to extract the required information. The crawled content needs to be comprehensive, relevant, high quality, and fresh. The crawler needs to minimize duplicate content to save computing resources. The crawler also needs to be "polite" to web servers, slowing down to a rate that websites can cope with while still serving their own users.
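To make the politeness and de-duplication ideas concrete, here is a minimal sketch (an illustration, not Yahoo!'s crawler) of per-host rate limiting and duplicate-URL avoidance; the delay value and the `fetch` helper are assumptions for the example.

```python
import time
from urllib.parse import urlparse

# Illustrative only: per-host politeness delay and duplicate-URL avoidance.
POLITENESS_DELAY_SECONDS = 2.0   # assumed minimum gap between hits to one host
last_hit = {}                    # host -> time of last fetch
seen_urls = set()                # URLs already fetched (simple de-duplication)

def polite_fetch(url, fetch):
    """Fetch a URL only if it is new and its host has not been hit too recently.

    `fetch` is a caller-supplied function (hypothetical here) that does the
    actual HTTP request and returns whatever the caller needs.
    """
    if url in seen_urls:
        return None                      # skip duplicates to save resources
    host = urlparse(url).netloc
    wait = POLITENESS_DELAY_SECONDS - (time.time() - last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)                 # be "polite": slow down per host
    seen_urls.add(url)
    last_hit[host] = time.time()
    return fetch(url)
```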
The crawled content is next put into a graph structure that mimics the Web itself, called the web graph or web map. This graph structure has nodes and edges, where each node can be a category, a website, a webpage, and the like, and each edge can be an actual hyperlink or another form of relationship between the connected nodes. The graph structure is key to identifying important or authoritative nodes, which can be given higher priority in responding to user queries. The graph structure is also key to detecting and avoiding spam.
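Authority over the web graph is typically computed with something PageRank-like. As a rough illustration only (not the algorithm Yahoo! used), here is a minimal power-iteration PageRank on a toy graph; the graph, damping factor, and iteration count are assumptions for the example.

```python
# Toy web graph: node -> list of nodes it links to (illustrative only).
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; no special handling of dangling nodes
    because this toy graph has none."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, targets in graph.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank

print(pagerank(graph))  # "c" ends up with the highest score in this toy graph
```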
Finally, the crawled content is indexed, that is, put into a structure called a search index. The search index is structured in such a way that the webpages and/or documents relevant to a given search query can be found very quickly. Moreover, the search index can be composed of many sub-indexes for different content types such as news content, shopping content, wiki content, etc.
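A search index is usually some form of inverted index: a map from each term to the documents containing it. Here is a minimal sketch (an illustration, not the production index format), with the toy documents and whitespace tokenization made up for the example.

```python
from collections import defaultdict

# Toy documents (illustrative only).
docs = {
    1: "hadoop runs mapreduce jobs",
    2: "the crawler fetches web pages",
    3: "mapreduce jobs aggregate crawler logs",
}

# Build a tiny inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Return the ids of documents containing every query term (AND semantics)."""
    term_sets = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

print(search("mapreduce jobs"))   # {1, 3}
print(search("crawler logs"))     # {3}
```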
Both the web map and the search index can be created on a big data platform running on 1000s of servers. In fact, some of the key big data techniques come from the need for efficient creation of these structures.
The crawler, web graph, and index are usually regarded as the backend of a search engine. A search engine also has a web tier as its frontend to receive and respond to user queries.
Since KSM was an application for the backend of the Yahoo! web search engine, I will skip the rest of the search engine details.
Key Site Monitoring (KSM). KSM relied on a set of key sites, derived from the web map at every build, every day. These sites, numbering in the 100s of 1000s, reflected the most important or key sites that we thought the crawler should always crawl well. The process to derive these sites used a PageRank-like algorithm.
Every action of the crawler was recorded in crawler logs. The crawler created a new log file every few minutes. Each line in the file roughly mapped to a timestamp, a URL, and the return code of the crawler on that URL. The return codes ranged from success to various error codes and messages. Each server or node in the crawler cluster had a historical list of many log files, compressed and stored in an easy-to-navigate directory structure on the local Linux file system.
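To make the log format concrete, here is a hedged sketch of parsing such a line; the real crawler log layout is not documented here, so the tab-separated format below is an assumption for illustration.

```python
from collections import namedtuple

# Assumed line layout: "<timestamp>\t<url>\t<return_code>" (illustrative only;
# the actual crawler log format is not specified in the article).
LogRecord = namedtuple("LogRecord", ["timestamp", "url", "return_code"])

def parse_log_line(line):
    """Parse one crawler log line; return None for lines that do not match."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return None
    timestamp, url, return_code = parts
    return LogRecord(timestamp, url, return_code)

record = parse_log_line("2006-07-14T10:02:33\thttp://example.com/page\t200")
print(record.url, record.return_code)   # http://example.com/page 200
```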
At a high level, KSM worked as follows (a sketch of the core aggregation step appears after this list):
- Pull in the day's log files from N crawler nodes in the crawler cluster. Here N was selected to ensure statistically significant estimates on the stats collected and presented.
- Scan each file line by line. Recall that each line contained a timestamp, a URL, and a return code for the crawling action.
- Extract the site part of each URL so that the site part can be matched against the key sites.
- For each key site, aggregate the stats (of successes and errors) over all its crawled URLs.
- Add the new stats to the existing time series for each key site.
- Display the time series on demand using a graphical user interface.
- Send an alert email to the crawler developers on the key sites that need attention.
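As referenced above, here is a minimal sketch of the core aggregation step: extract the site from each URL, keep only key sites, and tally successes and errors per site. The success/error classification, the helper names, and the sample key sites are assumptions for illustration, not KSM's actual code (which was written in Perl).

```python
from collections import defaultdict
from urllib.parse import urlparse

key_sites = {"example.com", "news.example.org"}   # stand-ins for the real key-site list

def site_of(url):
    """Reduce a URL to its site part (here simply the host name, as an assumption)."""
    return urlparse(url).netloc.lower()

def aggregate(records, key_sites):
    """Tally successes and errors per key site from parsed log records
    (each record is assumed to have .url and .return_code attributes)."""
    stats = defaultdict(lambda: {"success": 0, "error": 0})
    for rec in records:
        site = site_of(rec.url)
        if site not in key_sites:
            continue                                  # only key sites matter to KSM
        outcome = "success" if rec.return_code == "200" else "error"
        stats[site][outcome] += 1
    return stats
```

The per-site stats produced this way would then be appended to each key site's time series, which is what the display and alerting steps consume.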
KSM had to satisfy multiple requirements: It had to be highly available. It had to be robust against permanent or transient crawler node or site failures. It had to be scalable to many nodes and log files, usually large in size. It also had to be reliable as a tool with minimal false positive and false negative rates. Finally, it had to be easy to use and improve.
KSM rewrite. Within a month or so, KSM was rewritten in the Perl programming language and delivered on all its project requirements. KSM was deployed in production on multiple servers over two data centers for high availability and disaster recovery. Through iteration with the crawler team, the false positive rate of KSM was also reduced significantly. So far so good.
KSM Hadoop version, KSM/Hadoop. Towards the summer of 2006, I hired Ruey-Lung Hsiao as an intern. Ruey-Lung was a PhD student at the Computer Science department of the University of California, Los Angeles (UCLA). As the intern project, I assigned him the project of rewriting the data processing part of KSM on Hadoop.
Why Hadoop? Three reasons.
- At the time Yahoo! had an internal system similar to Hadoop; it was actually far more scalable than the earlier versions of Hadoop. However, one disadvantage of the internal system was that it was not as robust to failures as Hadoop; any failures involved multiple manual repairs and reruns. Hadoop promised to free the programmers from such issues.
- The Hadoop team's cubicles were just next door to ours at the Yahoo! offices in Mission College, Santa Clara, CA. From the beginning of the Hadoop project (led by Eric Baldeschwieler at Yahoo!), we were interacting with the Hadoop team and witnessing the scaling of Hadoop. We were able to overhear the details shared during their daily standups and see the fun that they were having while building it. At the time the Hadoop team was also eager to find users within Yahoo!, partly to get real-world feedback and partly to show the benefit to the company. Due to our proximity and interactions, we (with support from Amit Kumar) enthusiastically decided to be among the first users.
- Ruey-Lung, Hung-Chih Yang (an engineer in my team), and I were able to quickly see how MapReduce within Hadoop could be used for relational and bioinformatics data processing. Within a month or so of pondering these ideas, we were able to submit multiple patent applications (all granted) and a research paper (published in the SIGMOD conference in 2007) on our inventions. I think all our patents were later donated by Yahoo! to the Apache Foundation so that Hadoop could be open sourced via the Apache Foundation.
Due to the fit between the KSM data processing and MapReduce, Ruey-Lung was able to create a working program within a few short weeks. For the rest of his internship, he actually spent time generalizing beyond MapReduce, e.g., introducing pre- and post- functions for Map and Reduce steps. He also spent time contributing to the invention write-ups.
Before the end of the internship in the summer of 2006, I integrated part of Ruey-Lung's Hadoop code with my KSM production version. In the new version, the log files were pulled directly from the crawler cluster to the Hadoop cluster; mappers scanned the log files and generated partially aggregated stats while reducers produced the final stats. The results were written to files, which were far smaller than the original log files because they covered only the key sites. The rest of the KSM application pulled these stats files from the Hadoop cluster to the existing KSM cluster for alert generation, querying, and time series display.
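The actual KSM/Hadoop code was written against the early Hadoop Java API of the time and is not reproduced here. As a hedged, Hadoop-Streaming-style sketch of the same split, the mapper below emits partially aggregated per-site counts and the reducer sums them; the input field layout and the key-site list are assumptions carried over from the earlier sketches.

```python
#!/usr/bin/env python3
# Streaming-style sketch (illustrative only; not the original KSM/Hadoop code).
# Assumed input line format: "<timestamp>\t<url>\t<return_code>"
import sys
from collections import defaultdict
from urllib.parse import urlparse

KEY_SITES = {"example.com", "news.example.org"}   # stand-in for the real key-site list

def mapper(lines):
    """Partially aggregate per-site success/error counts inside the map task,
    then emit one "site<TAB>successes<TAB>errors" record per key site."""
    counts = defaultdict(lambda: [0, 0])          # site -> [successes, errors]
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        _, url, code = parts
        site = urlparse(url).netloc.lower()
        if site in KEY_SITES:
            counts[site][0 if code == "200" else 1] += 1
    for site, (ok, err) in counts.items():
        print(f"{site}\t{ok}\t{err}")

def reducer(lines):
    """Sum the partial counts for each site (input arrives sorted by site key)."""
    current, ok_sum, err_sum = None, 0, 0
    for line in lines:
        site, ok, err = line.rstrip("\n").split("\t")
        if site != current:
            if current is not None:
                print(f"{current}\t{ok_sum}\t{err_sum}")
            current, ok_sum, err_sum = site, 0, 0
        ok_sum += int(ok)
        err_sum += int(err)
    if current is not None:
        print(f"{current}\t{ok_sum}\t{err_sum}")

if __name__ == "__main__":
    # Run as "... mapper" or "... reducer" in a streaming-style pipeline.
    (mapper if sys.argv[1:] == ["mapper"] else reducer)(sys.stdin)
```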
Because Hadoop was in constant flux at the time, with its APIs changing every so often, maintaining the KSM/Hadoop application was not easy. I had to make the necessary changes in the KSM application to keep up with the API changes.
To the best of my knowledge, the KSM/Hadoop application was the first production application deployed on Hadoop. It was deployed towards the end of the summer of 2006.
The fate of KSM and KSM/Hadoop. The constant Hadoop API changes were a headache, especially for a production application that was required to provide daily reliable monitoring and alerting. So after some time, I realized a tighter integration with the crawler was needed to improve KSM further: What if I used the crawling cluster itself for the log file scanning and stats derivation? This way I did not have to move any large files at all. Also I could run the processing locally in every crawler cluster node.
I quickly rewrote the backend KSM application and asked Charles Thayer (who was the admin for the crawler) to include and run it on every crawler node. This actually improved the final accuracy because it eliminated the sampling. Since the application was localized and very fast, its impact on the crawling activity was minimal. The rest of KSM still ran on its own cluster; this time it had to pull in tiny summary files instead of large log files, and the final aggregation was a breeze. The success of this approach of course meant that there was no need for KSM/Hadoop any more, so I had to stop it.
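To picture the final architecture (a sketch under assumptions, not the actual code): each crawler node writes a tiny per-site summary of its own logs, and the central KSM cluster merges those small files. The file format and names below are made up for the example.

```python
from collections import defaultdict

# Illustrative only: merge tiny per-node summary files of the assumed form
#   "<site>\t<successes>\t<errors>"
# into one final per-site tally on the central KSM cluster.
def merge_summaries(paths):
    totals = defaultdict(lambda: [0, 0])
    for path in paths:
        with open(path) as f:
            for line in f:
                site, ok, err = line.rstrip("\n").split("\t")
                totals[site][0] += int(ok)
                totals[site][1] += int(err)
    return totals

# Example with hypothetical file names:
# final = merge_summaries(["node001.summary", "node002.summary"])
```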
Since the KSM application had a robust backend at the time, the rest of the improvements were focused more on the user interface and integration with other monitoring and alerting tools.
Fun times.
Note (Jan 1, 2024): I had to revise this article as it seems LinkedIn lost the original formatting.
Disclaimer: This article presents the opinions (and recollections) of the author. It does not necessarily reflect the views of the author's employer(s).