How to build a large image-processing analytics system in a Data Lake
Birendra Kumar Sahu
Senior Director Of Engineering | Head of Data Engineering and Science & integration platform, Ex-Razorpay, Ex-Teradata, Ex-CTO
Some use cases can be addressed with an image processing system without Big Data support. However, if you have thousands of images to process, it is not easy to do without Big Data. For example, suppose you want to build an application that auto-tags photographs, as Facebook does, or want to predict traffic in a city:
We have to address two problems here:
a) How fast can you store and read images in your data lake?
b) How to do image analytics on those images?
How fast can you store and read images from your data lake?
HIPI is an image processing library designed to be used with the Apache Hadoop MapReduce parallel programming framework. HIPI facilitates efficient and high-throughput image processing with MapReduce style parallel programs typically executed on a cluster. It provides a solution for how to store a large collection of images on the Hadoop Distributed File System (HDFS) and make them available for efficient distributed processing.
The primary input object to a HIPI program is a HipiImageBundle (HIB). A HIB is a collection of images represented as a single file on the HDFS. The HIPI distribution includes several useful tools for creating HIBs, including a MapReduce program that builds a HIB from a list of images downloaded from the Internet.
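As a concrete illustration, HIPI's documentation describes a command-line tool (hibImport) that bundles a folder of images into a HIB. The sketch below wraps such a call in Python; the script name, path, and arguments are assumptions taken from HIPI's published examples and should be verified against your installed HIPI version.
# Hypothetical sketch: bundle a local folder of images into a HIB with HIPI's
# hibImport tool (script path and arguments assumed, verify against your setup).
import subprocess
subprocess.run(
    ["./tools/hibImport.sh", "/data/sample_images", "sampleimages.hib"],
    check=True,  # raise an error if the import job fails
)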
The first processing stage of a HIPI program is a culling step that allows filtering the images in a HIB based on a variety of user-defined conditions like spatial resolution or criteria related to the image metadata. This functionality is achieved through the Culler class. Images that are culled are never fully decoded, saving processing time.
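To illustrate the idea behind culling (this is a conceptual Python sketch, not HIPI's Java Culler class, and it uses Pillow, which is not among the libraries used elsewhere in this article): an image can be rejected by inspecting only its header, so rejected images are never fully decoded.
# Conceptual sketch of culling: reject low-resolution images by reading only
# the file header, so rejected images are never fully decoded.
# Assumes Pillow (PIL) is installed; the resolution threshold is arbitrary.
from PIL import Image

MIN_WIDTH, MIN_HEIGHT = 640, 480

def survives_cull(path):
    with Image.open(path) as img:   # Pillow reads the header lazily
        width, height = img.size    # available without decoding pixel data
    return width >= MIN_WIDTH and height >= MIN_HEIGHT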
The images that survive the culling stage are assigned to individual map tasks in a way that attempts to maximize data locality, a cornerstone of the Hadoop MapReduce programming model. This functionality is achieved through the HibInputFormat class. Finally, individual images are presented to the Mapper as objects derived from the HipiImage abstract base class along with an associated HipiImageHeader object. For example, the ByteImage and FloatImage classes extend the HipiImage base class and provide access to the underlying raster grid of image pixel values as arrays of Java bytes and floats, respectively. These classes provide a number of useful functions like cropping, color space conversion, and scaling.
The records emitted by the Mapper are collected and transmitted to the Reducer according to the built-in MapReduce shuffle algorithm that attempts to minimize network traffic. Finally, the user-defined reduce tasks are executed in parallel and their output is aggregated and written to the HDFS.
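HIPI itself is a Java library, but the same map, shuffle, reduce flow can be sketched in Python with Hadoop Streaming. The sketch below is only an illustration under assumed inputs (one image path per line on stdin, readable by the worker, e.g. via a mounted filesystem) and is not the HIPI API: the mapper emits a key per image, the shuffle groups identical keys, and the reducer aggregates the counts.
# mapper.py -- emit (resolution_bucket, 1) for each image path read from stdin
import sys
from skimage.io import imread

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    height, width = imread(path).shape[:2]
    bucket = "large" if width * height > 1000000 else "small"
    print("%s\t%d" % (bucket, 1))

# reducer.py -- sum the counts per bucket (keys arrive sorted after the shuffle)
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, count))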
We can build a streaming system that continually sends new images to HIPI.
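As a rough sketch of such a feed (hypothetical paths; a production system would more likely use a dedicated ingestion tool such as Kafka or Flume), a small Python process could watch a local drop folder and push new images into HDFS, where they can then be bundled into a HIB:
# Hypothetical sketch: watch a local folder and copy new images into HDFS
# using the standard "hdfs dfs -put" command. Paths are placeholders.
import glob
import subprocess
import time

seen = set()
while True:
    for path in glob.glob("/data/incoming/*.jpg"):
        if path not in seen:
            subprocess.run(["hdfs", "dfs", "-put", path, "/datalake/images/"],
                           check=True)
            seen.add(path)
    time.sleep(10)   # poll every 10 seconds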
How to do image analytics on those images?
I will take a traffic management example of image processing. We will use Python to do the image processing.
Step 1: Import the required libraries
The skimage (scikit-image) package enables us to do image processing in Python. The language is extremely simple to understand but handles some of the most complicated tasks. Here are a few libraries you need to import to get started:
from matplotlib import pyplot as plt
from skimage import data
from skimage.feature import blob_dog, blob_log, blob_doh
from math import sqrt
from skimage.color import rgb2gray
import glob
from skimage.io import imread
Step 2: Import the image
input_file = glob.glob(r"<Hadoop HDFS file Path>")[0]
input_file1 = glob.glob(r"<Hadoop HDFS file Path>")[1]
…..
im = imread(input_file, as_gray=True)  # read the image as grayscale
plt.imshow(im, cmap='gray')
plt.show()
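The lines above read a single file. Since rgb2gray is already imported, a small extension (assuming the HDFS paths are visible to Python, for example through an NFS or FUSE mount, which is also what the glob calls above assume) could load every matched image and normalize it to grayscale:
# Sketch: load every matched image and convert color images to grayscale.
# Assumes the HDFS directory is visible as a mounted filesystem path.
images = []
for path in glob.glob(r"<Hadoop HDFS file Path>"):
    img = imread(path)
    if img.ndim == 3:          # color image -> grayscale
        img = rgb2gray(img)
    images.append(img)
print("Loaded", len(images), "images")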
Step 3: Find the number of Cars:
The code below searches for contiguous objects (blobs) in the picture. blob_log returns three values for each object found: the first two are the coordinates of the blob's center, and the third is the standard deviation (sigma) of the Gaussian kernel that detected it. The radius of each blob/object can be estimated from this third column as roughly sigma * sqrt(2).
blobs_log = blob_log(im, max_sigma=30, num_sigma=10, threshold=.1)
# Compute radii in the 3rd column.
blobs_log[:, 2] = blobs_log[:, 2] * sqrt(2)  # convert sigma to an estimated radius
numrows = len(blobs_log)
print("Number of cars: " ,numrows)