How to build a large image-processing analytics system in a Data Lake
Birendra Kumar Sahu
Senior Director Of Engineering | Head of Data Engineering and Science & integration platform, Ex-Razorpay, Ex-Teradata, Ex-CTO
Some use cases can be addressed with an image processing system without Big Data support. However, if you have thousands of images to process, it is not easy to do without Big Data. For example, suppose you want to build an application that auto-tags photographs, as Facebook does, or want to predict traffic in a city:
We have to address two problems here:
a) How fast can you store and read images in your data lake?
b) How to do image analytics on those images?
How fast can you store and read images from your data lake?
HIPI is an image processing library designed to be used with the Apache Hadoop MapReduce parallel programming framework. HIPI facilitates efficient and high-throughput image processing with MapReduce style parallel programs typically executed on a cluster. It provides a solution for how to store a large collection of images on the Hadoop Distributed File System (HDFS) and make them available for efficient distributed processing.
The primary input object to a HIPI program is a HipiImageBundle (HIB). A HIB is a collection of images represented as a single file on the HDFS. The HIPI distribution includes several useful tools for creating HIBs, including a MapReduce program that builds a HIB from a list of images downloaded from the Internet.
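As a concrete illustration, HIPI's documentation describes a command-line tool (hibImport) that bundles a folder of images into a HIB. The sketch below wraps such a call in Python; the script name, path, and arguments are assumptions taken from HIPI's published examples and should be verified against your installed HIPI version.
# Hypothetical sketch: bundle a local folder of images into a HIB with HIPI's
# hibImport tool (script path and arguments assumed, verify against your setup).
import subprocess
subprocess.run(
    ["./tools/hibImport.sh", "/data/sample_images", "sampleimages.hib"],
    check=True,  # raise an error if the import job fails
)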
The first processing stage of a HIPI program is a culling step that allows filtering the images in a HIB based on a variety of user-defined conditions like spatial resolution or criteria related to the image metadata. This functionality is achieved through the Culler class. Images that are culled are never fully decoded, saving processing time.
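To illustrate the idea behind culling (this is a conceptual Python sketch, not HIPI's Java Culler class, and it uses Pillow, which is not among the libraries used elsewhere in this article): an image can be rejected by inspecting only its header, so rejected images are never fully decoded.
# Conceptual sketch of culling: reject low-resolution images by reading only
# the file header, so rejected images are never fully decoded.
# Assumes Pillow (PIL) is installed; the resolution threshold is arbitrary.
from PIL import Image

MIN_WIDTH, MIN_HEIGHT = 640, 480

def survives_cull(path):
    with Image.open(path) as img:   # Pillow reads the header lazily
        width, height = img.size    # available without decoding pixel data
    return width >= MIN_WIDTH and height >= MIN_HEIGHT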
The images that survive the culling stage are assigned to individual map tasks in a way that attempts to maximize data locality, a cornerstone of the Hadoop MapReduce programming model. This functionality is achieved through the HibInputFormat class. Finally, individual images are presented to the Mapper as objects derived from the HipiImage abstract base class along with an associated HipiImageHeader object. For example, the ByteImage and FloatImage classes extend the HipiImage base class and provide access to the underlying raster grid of image pixel values as arrays of Java bytes and floats, respectively. These classes provide a number of useful functions like cropping, color space conversion, and scaling.
The records emitted by the Mapper are collected and transmitted to the Reducer according to the built-in MapReduce shuffle algorithm that attempts to minimize network traffic. Finally, the user-defined reduce tasks are executed in parallel and their output is aggregated and written to the HDFS.
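HIPI itself is a Java library, but the same map, shuffle, reduce flow can be sketched in Python with Hadoop Streaming. The sketch below is only an illustration under assumed inputs (one image path per line on stdin, readable by the worker, e.g. via a mounted filesystem) and is not the HIPI API: the mapper emits a key per image, the shuffle groups identical keys, and the reducer aggregates the counts.
# mapper.py -- emit (resolution_bucket, 1) for each image path read from stdin
import sys
from skimage.io import imread

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    height, width = imread(path).shape[:2]
    bucket = "large" if width * height > 1000000 else "small"
    print("%s\t%d" % (bucket, 1))

# reducer.py -- sum the counts per bucket (keys arrive sorted after the shuffle)
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, count))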
We can build a streaming system that continually sends new images to HIPI.
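As a rough sketch of such a feed (hypothetical paths; a production system would more likely use a dedicated ingestion tool such as Kafka or Flume), a small Python process could watch a local drop folder and push new images into HDFS, where they can then be bundled into a HIB:
# Hypothetical sketch: watch a local folder and copy new images into HDFS
# using the standard "hdfs dfs -put" command. Paths are placeholders.
import glob
import subprocess
import time

seen = set()
while True:
    for path in glob.glob("/data/incoming/*.jpg"):
        if path not in seen:
            subprocess.run(["hdfs", "dfs", "-put", path, "/datalake/images/"],
                           check=True)
            seen.add(path)
    time.sleep(10)   # poll every 10 seconds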
How to do image analytics on those images?
I will take a traffic management example of image processing. We will use Python to do the image processing.
Step 1: Import the required libraries
The skimage (scikit-image) package enables us to do image processing in Python. The language is extremely simple to understand but handles some of the most complicated tasks. Here are a few libraries you need to import to get started:
from matplotlib import pyplot as plt
from skimage import data
from skimage.feature import blob_dog, blob_log, blob_doh
from math import sqrt
from skimage.color import rgb2gray
import glob
from skimage.io import imread
Step 2: Import the image
input_file = glob.glob(r"<Hadoop HDFS file Path>")[0]
input_file1 = glob.glob(r"<Hadoop HDFS file Path>")[1]
…..
im = imread(input_file, as_gray=True)  # read the image as grayscale
plt.imshow(im, cmap='gray')
plt.show()
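The lines above read a single file. Since rgb2gray is already imported, a small extension (assuming the HDFS paths are visible to Python, for example through an NFS or FUSE mount, which is also what the glob calls above assume) could load every matched image and normalize it to grayscale:
# Sketch: load every matched image and convert color images to grayscale.
# Assumes the HDFS directory is visible as a mounted filesystem path.
images = []
for path in glob.glob(r"<Hadoop HDFS file Path>"):
    img = imread(path)
    if img.ndim == 3:          # color image -> grayscale
        img = rgb2gray(img)
    images.append(img)
print("Loaded", len(images), "images")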
Step 3: Find the number of Cars:
The code below searches for contiguous objects (blobs) in the picture. blob_log returns three values for each object found: the first two are the coordinates of the blob's center, and the third is the standard deviation (sigma) of the Gaussian kernel that detected it. The radius of each blob/object can be estimated from this third column as roughly sigma * sqrt(2).
blobs_log = blob_log(im, max_sigma=30, num_sigma=10, threshold=.1)
# Compute radii in the 3rd column.
blobs_log[:, 2] = blobs_log[:, 2] * sqrt(2)  # convert sigma to an estimated radius
numrows = len(blobs_log)
print("Number of cars: " ,numrows)