Applications of Deep Learning in Logistics
A deep learning model detecting people, part of a study to find high-traffic areas in a warehouse.

Neural networks work very well for detecting objects in images

For decades, engineers worked to develop software that could perform the computer vision tasks needed for a wide range of safety, security and industrial applications. For the most part, their success was limited to well-constrained tasks, such as inspecting machine-made parts with relatively simple shapes in well-lit settings, or surveillance systems that could detect that “something moved over there”. They could not handle the complex recognition tasks needed to find a range of very different things in a picture – cats, dogs, cars, coffee cups, people – let alone serve as the core perception system of self-driving cars or of safety systems in factories and warehouses.

The approach of humans writing software by hand to detect all of the important features of such different objects in the natural and man-made world just wasn’t getting us very far. In annual computer vision competitions using large sets of images [1][2][3], researchers would eke out slightly better metrics each year with their algorithms, but overall these systems were not accurate enough for most real-world applications.

Then, about ten years ago, a group of researchers who had been working on these tasks from a very different perspective had some striking successes. These researchers were inspired by how animals see, and their work was grounded in mathematics such as probability, calculus and linear algebra. This work went by names such as artificial neural networks [6] and, more recently, deep learning [7].

Instead of using hand-written software to detect the features that are important for recognizing something like a cat, neural networks learn the features that are most useful to perform a task well.

What learning means in neural networks is this: once trained, the neural network should output the correct answer when presented with an input. For example, if you feed a picture of a cat into a neural network that can recognize different animals, it should output “Cat”. To train this model you feed the network many photos, each containing one animal (or none), and the network produces its answer. If the answer is wrong, you make small modifications to the model to nudge its answer a bit closer to the correct one. You do this hundreds of times across sets of hundreds of thousands (or millions) of photos, and the model becomes better and better at the animal recognition task.
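
To make the idea concrete, here is a minimal sketch of that loop in PyTorch. Everything in it – the toy three-class network, the random stand-in photos, the learning rate – is illustrative, not a real production setup.

```python
# A toy version of the training loop described above (PyTorch).
# The tiny network and random "photos" are stand-ins for a real
# animal classifier and a real labeled dataset.
import torch
import torch.nn as nn

model = nn.Sequential(                          # a toy 3-class classifier
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 3),                           # scores: cat / dog / none
)
loss_fn = nn.CrossEntropyLoss()                 # "how wrong is the answer?"
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(8, 3, 64, 64)              # a batch of 8 fake photos
labels = torch.randint(0, 3, (8,))              # the correct answers

for step in range(100):                         # repeat many times
    predictions = model(images)                 # the network's answer
    loss = loss_fn(predictions, labels)         # compare to correct answer
    optimizer.zero_grad()
    loss.backward()                             # compute the small nudges
    optimizer.step()                            # apply them to the weights
```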

One important success came in 2012. Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton at the University of Toronto developed and trained a neural network they called AlexNet on the ImageNet dataset for the annual ImageNet object classification challenge [8]. Their model’s error rate was more than 10 percentage points better than that of the second-place entry, which used classical computer vision techniques. This was huge!

Machine learning researchers had been exploring neural networks for many years, but limited compute power meant they had only seen success on relatively small tasks. By 2012, the computing resources available at reasonable cost were finally sufficient to train neural networks for large-scale image classification tasks. Ever since then, these deep learning models have achieved much better metrics (accuracy, average precision) than other techniques. Of course, classical computer vision techniques are still important for many image processing tasks [9].

What can Deep Learning models do?

Over the last ten years this statistical approach has had more and more successes. We now have deep learning neural network models that can:

  • Determine whether a picture contains a particular thing like a cat, dog, person or car.
  • Detect exactly where that thing is in the image.
  • Find multiple different objects in the image.
  • Give you the exact outline of the object.
  • Draw a stick figure showing you the pose of a person in the image.
  • Generate text describing the image.

Here is one example of object detection done by a deep learning model called Faster R-CNN [10]. We see that the model has done an excellent job of finding the people, dog, horse and truck in this image. The researchers show the output of the model by displaying the class of each object it found and a bounding box marking its location in the image. This work is part of a series of progressively better models that these researchers developed for object detection. This was simply not possible ten years ago.

Object detection by the Faster R-CNN convolutional neural network.
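
If you want to try this yourself, torchvision ships a pretrained Faster R-CNN. Here is a minimal sketch; "scene.jpg" is a placeholder path, and the 0.8 score threshold is an arbitrary choice.

```python
# Run a pretrained Faster R-CNN from torchvision on one image.
import torch
from PIL import Image
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()

img = weights.transforms()(Image.open("scene.jpg").convert("RGB"))
with torch.no_grad():
    found = model([img])[0]              # one dict per input image

for box, label, score in zip(found["boxes"], found["labels"], found["scores"]):
    if score > 0.8:                      # keep only confident detections
        print(weights.meta["categories"][label], box.tolist())
```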


We can now detect the people and cars in scenes well enough for many self-driving car tasks, as shown in this next example. We see that this newer model even does a good job detecting all of the different people standing next to each other.

Example of deep learning model detecting objects in a traffic scene.


Applications of Deep Learning models for computer vision

All of these capabilities are extremely useful in a wide range of applications such as:

  • Inspecting parts in a factory for defects.
  • Inspecting bridges and power plants for cracks and other potentially catastrophic defects.
  • Enabling robots to function more flexibly in industrial or outdoor environments.
  • Finding cancer cells in microscopy images.
  • Being the eyes for self-driving cars.
  • Providing safety and security services.
  • Providing detailed analyses of athletes' movements to help them perform better.

For #logistics and #warehouse needs in particular, there are a number of applications, such as:

  • Identifying traffic patterns and bottlenecks in the warehouse.
  • Identifying inefficiencies in the movement of goods from the shelves to the packing areas.
  • Optimizing picking patterns.
  • Determining free space on the shelves.
  • Enabling automated warehouse robots to place goods on shelves and pick boxes from shelves.
  • Determining the optimal size of box to pack goods in.
  • Keeping people safe (from moving machinery, unsafe lifting of boxes).


The above is just a short list of applications where these deep learning models provide a core enabling technology [11][12].

These Models Share a Core Architecture

Interestingly, when you look deeper into these models you find that most share a core architecture, called a “backbone”: a convolutional neural network (CNN). This architecture was originally inspired by the way the vision system works in the brains of mammals. Machine learning researchers took this architecture and used statistics, linear algebra and calculus to understand it and make it work in real-world tasks.

This backbone generally does one task: object classification. The goal of this task is to state what kind of object (if any) is in the image. It doesn’t say where the object is in the image or how many of them there are, just whether one is there or not [13].

In the computer vision research community there are now standard sets of images that people use to train their neural network models. You need not just images, but images annotated with the kinds of objects they contain. For these backbone neural networks we use image sets in which each image has just one key object, labeled with that object's class, such as cat, dog, car or person. One commonly used image set is ImageNet. It has 1000 different classes of objects, and each class has approximately 1000 images [1]. The following figure shows examples of images from several of the object classes. Note that the classes are hierarchical, which is useful for training models at different levels of abstraction.

Example of images in the ImageNet labeled dataset.
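
As a quick illustration of what such a classification backbone does, here is a sketch that runs a torchvision network pretrained on ImageNet over one photo; "photo.jpg" is a placeholder, and ResNet-50 is just one convenient pretrained model.

```python
# Classify one image into one of the 1000 ImageNet classes.
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights).eval()

img = weights.transforms()(Image.open("photo.jpg").convert("RGB"))
with torch.no_grad():
    scores = model(img.unsqueeze(0))[0]   # one score per ImageNet class
best = scores.argmax().item()
print(weights.meta["categories"][best])   # e.g. "tabby"
```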


Over the last ten years researchers have invented a series of better and better convolutional neural networks for this object classification task. The convolutions in convolutional neural networks are the operations that extract features from the image that are useful for the particular computer vision task at hand. They are similar to the filters you have used in Photoshop or other photo apps to do things like reduce blur or soften images. The key difference is that instead of a fixed set of hand-written filters, a CNN learns the filters that it needs for the job.
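
The analogy is easy to see in code. Below, a single convolution is hand-set to a classic edge-detection kernel – exactly the kind of fixed filter a photo app might use; in a CNN, training would learn these nine numbers instead.

```python
# One convolutional filter, set by hand to detect edges. In a CNN the
# values of the kernel are learned rather than written down like this.
import torch
import torch.nn as nn

edge = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
edge.weight.data = torch.tensor([[[[-1., -1., -1.],
                                   [-1.,  8., -1.],
                                   [-1., -1., -1.]]]])  # classic edge kernel

image = torch.rand(1, 1, 64, 64)     # a fake grayscale image
feature_map = edge(image)            # bright where the image has edges
print(feature_map.shape)             # torch.Size([1, 1, 64, 64])
```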

These filters are organized into layers in a CNN. The output of one filter (called a kernel) is a new “image” that contains the features the filter has computed. These outputs are called feature maps. A new set of filters then uses these feature maps as input and produces a new set of feature maps. A model usually contains a number of these layers. The following figure shows the architecture of one well-known CNN for object classification called VGG-16, from the Visual Geometry Group at Oxford [14]. The original image is the first layer on the left. The output, shown on the far right, is a vector of numbers, one for each class of object this neural network was trained to classify.

Architecture of the VGG-16 convolutional neural network.
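
You can watch these feature maps shrink in width and height, and grow in depth, by pushing an input through VGG-16's convolutional layers, as in this sketch (the layer index here is just one convenient cut point):

```python
# Inspect the feature maps inside torchvision's pretrained VGG-16.
import torch
from torchvision.models import vgg16, VGG16_Weights

model = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).eval()
x = torch.rand(1, 3, 224, 224)              # VGG-16's expected input size

for i, layer in enumerate(model.features):  # the convolutional part
    x = layer(x)
    if i == 4:                              # end of the first conv block
        print("early feature maps:", x.shape)  # [1, 64, 112, 112]
print("final feature maps:", x.shape)          # [1, 512, 7, 7]
```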


In the early stages of a deep learning model the feature maps detect features such as edges at various angles in the image. In some of the next layers of these neural network models the features are things like corners. And later in the model the features more closely resemble parts of the objects the model is being trained to detect, such as pointed ears of a cat or the eyes of a person. From a computational perspective, the neural network is creating representations of the features of the image that are most useful for the task.

Using deep learning classification models as backbones for richer computer vision tasks

Beyond being useful models on their own, it turns out that in order to do this classification task well, these models had to learn features that are useful for detecting the wide range of objects in the ImageNet dataset. So we are able to use these trained models as “backbones” for more challenging tasks such as detecting multiple different kinds of objects in one image and determining the exact location of each object.
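
In code, reusing a trained classifier as a backbone can be as simple as chopping off its classification head, as in this sketch (ResNet-50 is used here purely as an example):

```python
# Turn a trained ImageNet classifier into a feature-extracting backbone.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

classifier = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
backbone = nn.Sequential(*list(classifier.children())[:-2])  # drop the head

features = backbone(torch.rand(1, 3, 224, 224))  # feature maps, not classes
print(features.shape)                            # torch.Size([1, 2048, 7, 7])
```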

Object Detection

Here is a high-level overview of how that can work for one of these tasks – determining where an object is located in an image [15] – with a code sketch following the list:

  • What you want is the coordinates of a box that tightly encloses the object of interest. You can use, for example, four numbers: the x and y coordinates of the bounding box's top left corner, and the width and height of the box.
  • You build a set of annotated images to train and test your model, where for each object in each image you store what kind of object it is (its class) and the coordinates of its bounding box. For years this was a very time consuming task – people used image annotation software to draw little boxes around each object in each image. Companies like Google, Facebook and Apple have spent millions of dollars on this.
  • So now you have specified the right answer – the answer you want your model to produce.
  • You start with the backbone model and (for the moment) ignore its last part that gives you the object classification results. You keep most of this model's neural network that computes the rich array of image features.
  • You bolt on a new, untrained neural network model that, once trained, will output the coordinates of the bounding box for the object.
  • The task is this: given all of these features computed from the image, output the correct bounding box coordinates.
  • When you are training the model you know the correct answer for this, so the task of going from image features to correct bounding box coordinates is a regression task.
  • And using calculus the neural network training algorithm slowly nudges the nascent neural network model into producing the correct bounding box coordinates.
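
Here is a minimal sketch of that recipe: a frozen pretrained backbone, a small untrained head that outputs four box numbers, and a regression loss that nudges the head toward the annotated answers. The dummy tensors stand in for a real annotated dataset, and Smooth L1 is one common choice of box-regression loss.

```python
# Bolt an untrained bounding-box "head" onto a frozen backbone and
# train it by regression, as outlined in the steps above.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = nn.Sequential(
    *list(resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).children())[:-1])
for p in backbone.parameters():
    p.requires_grad = False               # keep the learned features fixed

head = nn.Linear(2048, 4)                 # outputs (x, y, width, height)
loss_fn = nn.SmoothL1Loss()               # a common box-regression loss
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

images = torch.rand(8, 3, 224, 224)       # stand-in photos
boxes = torch.rand(8, 4)                  # their annotated boxes (normalized)

for step in range(100):
    features = backbone(images).flatten(1)   # [8, 2048] feature vectors
    predicted = head(features)               # predicted box coordinates
    loss = loss_fn(predicted, boxes)         # distance from correct boxes
    optimizer.zero_grad()
    loss.backward()                          # the calculus-driven nudges
    optimizer.step()
```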


Semantic Segmentation

Similarly, to determine all of the pixels that belong to an object in the image (called semantic segmentation) you append a new neural network to the end of the backbone model that is designed to output the class id for each pixel of the image. To train this model you need a set of labeled images where each pixel of each object of interest is annotated by a human. (There are now software tools to make this much easier.)

Here is an example of what a modern semantic segmentation model can do. It detects all of the pixels belonging to each class of object in the image. This capability has a wide range of applications. Since you can recover the outline of a person using semantic segmentation, one application in the warehouse is detecting when people are doing something unsafe, like lifting a large box above their head.

Deep learning models designed for semantic segmentation can determine which pixels in an image belong to each class of object.
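
Torchvision also ships pretrained segmentation models; here is a sketch with DeepLabV3 (one of several available architectures), with "scene.jpg" again a placeholder:

```python
# Assign a class id to every pixel with a pretrained DeepLabV3 model.
import torch
from PIL import Image
from torchvision.models.segmentation import (
    deeplabv3_resnet50, DeepLabV3_ResNet50_Weights)

weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = deeplabv3_resnet50(weights=weights).eval()

img = weights.transforms()(Image.open("scene.jpg").convert("RGB"))
with torch.no_grad():
    logits = model(img.unsqueeze(0))["out"][0]    # [classes, H, W]
pixel_labels = logits.argmax(0)                   # one class id per pixel
print(pixel_labels.shape, pixel_labels.unique())  # classes present in image
```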


Determining the pose of a person

Researchers have also trained deep learning models to predict the pose of each person in an image, in the form of the coordinates of important keypoints on the person such as their shoulders, elbows and hands. One application in logistics is detecting when people are doing something unsafe, like lifting a large box above their head.
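
A sketch of what that looks like with torchvision's pretrained Keypoint R-CNN, which returns 17 keypoints (shoulders, elbows, wrists, ...) per detected person; "worker.jpg" is a placeholder:

```python
# Extract person keypoints with a pretrained Keypoint R-CNN.
import torch
from PIL import Image
from torchvision.models.detection import (
    keypointrcnn_resnet50_fpn, KeypointRCNN_ResNet50_FPN_Weights)

weights = KeypointRCNN_ResNet50_FPN_Weights.DEFAULT
model = keypointrcnn_resnet50_fpn(weights=weights).eval()

img = weights.transforms()(Image.open("worker.jpg").convert("RGB"))
with torch.no_grad():
    people = model([img])[0]

for keypoints, score in zip(people["keypoints"], people["scores"]):
    if score > 0.8:                  # confident person detections only
        print(keypoints[:, :2])      # 17 (x, y) points describing the pose
```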

Generating text descriptions of images

One more important capability we now have: once you have a list of all of the objects in an image and where they are, you can use a natural language generation model to produce a text description of the picture. For the following image, one of these models produced the text “White and black cat sitting on a chair”. Not perfect, but not bad.

Caption from Image-to-Text model: White and black cat sitting on a chair.


In logistics operations this can be used to produce text summaries of noteworthy activities in a warehouse.
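
For the curious, publicly available image-to-text models make this easy to try. The sketch below uses the Hugging Face transformers library with one commonly used captioning model (not necessarily the model behind the caption above); "cat.jpg" is a placeholder path.

```python
# Generate a text caption for an image with an off-the-shelf model.
from transformers import pipeline

captioner = pipeline("image-to-text",
                     model="nlpconnect/vit-gpt2-image-captioning")
result = captioner("cat.jpg")           # placeholder image path
print(result[0]["generated_text"])      # e.g. "a cat sitting on a chair"
```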

Takeaways

Using one core technology – deep learning neural networks – we now have the building blocks for richer capabilities, such as identifying the classes of objects in an image and locating those objects precisely. These capabilities power a wide array of very useful applications that were simply not possible ten years ago.

In our next post we will describe how we at sSy.ai are developing these capabilities to help our warehouse and manufacturing customers understand and improve the operation of their facilities.

References:

[1] Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L. “ImageNet: A large-scale hierarchical image database”. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009.

https://ieeexplore.ieee.org/document/5206848

[2] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. “ImageNet Large Scale Visual Recognition Challenge”. IJCV, 2015.

[3] Jason Brownlee. “A Gentle Introduction to the ImageNet Challenge (ILSVRC)”. Machine Learning Mastery. 2019.

https://machinelearningmastery.com/introduction-to-the-imagenet-large-scale-visual-recognition-challenge-ilsvrc/

[4] Richard O. Duda, Peter E. Hart, David G. Stork. “Pattern Classification”, 2nd edition. Wiley, 2000.

https://www.wiley.com/en-gb/Pattern+Classification,+2nd+Edition-p-9780471056690

[5] Ayush Pant. “Introduction to Machine Learning for Beginners”. TowardsDataScience. 2019.

https://towardsdatascience.com/introduction-to-machine-learning-for-beginners-eed6024fdb08

[6] Andrey Kurenkov. “A Brief History of Neural Nets and Deep Learning”. Skynet Today. 2020.

https://www.skynettoday.com/overviews/neural-net-history

[7] Nilesh Barla. “A Gentle Introduction to Deep Learning”. V7 Labs. 2023.

https://www.v7labs.com/blog/deep-learning-guide

[8] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks”. Advances in Neural Information Processing Systems 25 (NIPS 2012).

https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf

[9] Niall O'Mahony, Sean Campbell, Anderson Carvalho, Suman Harapanahalli, Gustavo Velasco Hernandez, Lenka Krpalkova, Daniel Riordan, Joseph Walsh. “Deep Learning vs. Traditional Computer Vision”. 2019.

https://arxiv.org/pdf/1910.13796.pdf

[10] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. Advances in Neural Information Processing Systems 28 (NIPS 2015).

https://proceedings.neurips.cc/paper/2015/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf

[11] Gaudenz Boesch. “25+ Applications of Computer Vision in Logistics (2023 Guide)”. Viso AI. 2023.

https://viso.ai/applications/computer-vision-in-logistics/

[12] “Machine Vision in Intralogistics”. MVTec. 2022.

https://www.mvtec.com/application-areas/intralogistics-and-automated-warehouses

[13] Hmrishav Bandyopadhyay. “Image Classification Explained”. V7 Labs. 2023.

https://www.v7labs.com/blog/image-classification-guide

[14] Karen Simonyan, Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. ICLR 2015.

https://arxiv.org/abs/1409.1556

[15] Alberto Rizzoli. “The Ultimate Guide to Object Detection”. V7 Labs. 2023.

https://www.v7labs.com/blog/object-detection-guide
