Incremental Computation on Hadoop and MapReduce at Scale
The MapReduce framework is not designed for incremental computation. Incremental computation arises when large-scale datasets evolve over time: new entries are continually added, while existing and historic entries are deleted or modified. Google's Percolator is one system built specifically for incremental computation, and Incoop is an early, generic extension of the MapReduce framework that supports it. Large-scale analytics tasks run by web search engines, such as crawling the web to build an index or running the PageRank algorithm, typically see only modest changes between successive runs; by processing just the delta between the old and new data, an incremental framework can speed up such tasks by a factor of 10 to 1000.
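The delta idea above can be illustrated with a minimal word-count sketch (the function names and data are illustrative, not from Percolator or Incoop): instead of reprocessing every record, an incremental update applies only the added and removed records to the previous result.

```python
from collections import Counter

def full_count(records):
    """Baseline: process every record from scratch."""
    counts = Counter()
    for rec in records:
        counts.update(rec.split())
    return counts

def incremental_count(prev_counts, added, removed):
    """Update a previous result using only the delta of records."""
    counts = Counter(prev_counts)
    for rec in added:
        counts.update(rec.split())
    for rec in removed:
        counts.subtract(rec.split())
    return +counts  # unary + drops zero and negative entries

old = ["the quick fox", "the lazy dog"]
new = ["the quick fox", "jumping dog"]  # one record removed, one added

base = full_count(old)
inc = incremental_count(base, added=["jumping dog"], removed=["the lazy dog"])
assert inc == full_count(new)  # same answer, but only 2 records touched
```

When the delta is small relative to the dataset, the incremental path processes proportionally less data, which is the source of the large speedups cited above.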
The incremental MapReduce framework can be applied in many fields, including web crawling, PageRank, life-science computing, graph processing, text processing, machine learning, data mining, and relational data processing. The IncMR framework lets developers build parallel algorithms against the original MapReduce APIs, so they need not redesign the APIs or write new application algorithms to gain incremental behavior. These algorithms support incremental data processing by detecting modifications in the inputs, reusing the intermediate states of the data, and augmenting the map and reduce functions. They also quickly detect new inputs and automatically trigger jobs on the master node.
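The detect-and-reuse behavior described above can be sketched as follows (a simplified illustration, not the real IncMR API): each input split's map output is cached under a content hash, so on a re-run only the splits whose content changed are mapped again, while unchanged splits reuse their stored intermediate state.

```python
import hashlib
from collections import Counter

# Cache of intermediate map outputs, keyed by a hash of the split's
# content; a matching hash means the split is unchanged and reusable.
map_cache = {}

def map_split(split):
    """Word-count mapper for one input split, with memoization."""
    key = hashlib.sha256(split.encode()).hexdigest()
    if key not in map_cache:               # only changed/new splits run
        map_cache[key] = Counter(split.split())
    return map_cache[key]

def reduce_all(intermediate):
    """Reduce phase: merge the per-split intermediate counts."""
    total = Counter()
    for counts in intermediate:
        total.update(counts)
    return total

run1 = reduce_all(map_split(s) for s in ["a b a", "c d"])
# Re-run after only the second split changed: the first split's map
# output is served from the cache instead of being recomputed.
run2 = reduce_all(map_split(s) for s in ["a b a", "c c"])
```

Here the reduce phase still merges all intermediate results; a fuller treatment would also reuse partial reduce state, which is closer to what the incremental frameworks above do internally.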