A detailed understanding of crowd counting using CNNs.

Crowd counting is the task of estimating the number of people in an image or video. It has applications in industrial sites, hospitals, and crowd gatherings, as well as in automated public monitoring such as surveillance and traffic control. Unlike generic object detection, crowd counting must handle targets of arbitrary size across very different scenes, from sparse to heavily cluttered, sometimes within the same image.

Crowd counting tasks can be broadly categorized into two types:

  • Dense crowds: A large number of people are tightly packed in a specific area.
  • Sparse crowds: People are scattered, with significant gaps between them.

Counting sparse crowds is generally easier than counting dense crowds, which requires more sophisticated algorithms.

Techniques

Several techniques have been developed to address the challenges of crowd counting. Initially, researchers used classical machine learning and computer vision algorithms, namely detection-based, regression-based, and density-based approaches, to predict crowd counts and density maps. These methods struggled with scale and perspective variations, occlusions, and non-uniform density. Researchers then turned to Convolutional Neural Networks (CNNs), which had proved effective in many computer vision tasks, to build crowd counting algorithms.

Counting by Detection

This approach focuses on counting by detecting individual objects, specifically people. There are three types of crowd counting by detection, based on the features used to identify crowds in images and videos:

  • Integral-based Detection: This method uses the full-body appearance of people, extracting features such as edges, shapelets, textures, Haar wavelets, and histograms of oriented gradients (HOG). Learning approaches such as SVMs, boosting, random forests, or clustering are then used to detect or classify objects and thereby count people.
  • Part-based Detection: Instead of considering the entire human body, this technique focuses on specific parts, such as the head or shoulders, and applies classifiers to those parts. Estimating the presence of a person solely based on the head is not reliable, so combining the head and shoulders provides better results, particularly for dense crowds.
  • Shape Matching: This method uses ellipses to draw boundaries around humans and then employs a stochastic process to estimate the number and configuration of shapes.

Disadvantage: Counting by detection is not very accurate in dense crowds or in scenes with significant background clutter.
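As a rough illustration of the final counting step in a detection-based pipeline, the sketch below takes boxes and scores from some person detector, drops low-scoring boxes, merges heavily overlapping ones, and counts what is left. The detector output, the thresholds, and the use of greedy non-maximum suppression are all assumptions made for this example; the approaches above do not prescribe a specific post-processing step.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_by_detection(detections, score_thresh=0.5, iou_thresh=0.5):
    """Count people from (box, score) pairs produced by some detector.

    Low-scoring boxes are dropped, then greedy non-maximum suppression
    keeps one box per group of heavily overlapping boxes.
    """
    candidates = sorted(
        (d for d in detections if d[1] >= score_thresh),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, _score in candidates:
        if all(iou(box, k) < iou_thresh for k in kept):
            kept.append(box)
    return len(kept)


# Hypothetical detector output: two boxes on the same person, one box on a
# second person, and one low-confidence false positive.
detections = [
    ((10, 10, 50, 90), 0.95),
    ((12, 12, 52, 92), 0.80),
    ((100, 20, 140, 100), 0.90),
    ((200, 30, 240, 110), 0.20),
]
sparse_count = count_by_detection(detections)  # two people remain
```

The same overlap test hints at why dense scenes are hard for this family: two different people standing close together can overlap by more than the threshold and be merged into a single detection.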

Counting by Regression

Counting by regression does not involve segmenting or tracking individuals. Instead, it learns a mapping from image features to the number of individuals, which results in better performance, especially with dense crowds. Depending on the regression target, this crowd counting method can be divided into two groups:

  • Individual based regression: This technique extracts low-level features such as edge details and foreground pixels, and applies regression modeling to map these features to the count.
  • Density based regression: This approach estimates density by learning the mapping between local features and object density maps, which incorporates spatial information and avoids dependence on a detector. Instead of modeling each individual separately, it treats groups of individuals together. The mapping can be linear or nonlinear.
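To make the notion of a density map concrete, here is a minimal sketch of how a target map could be built from point annotations. The construction is an assumption for illustration: one Gaussian per annotated head, normalized so that the whole map sums to the number of people. The function name, the fixed `sigma`, and the sample coordinates are hypothetical.

```python
import math


def density_map(points, height, width, sigma=2.0):
    """Build a density map by placing one normalized Gaussian per annotated head.

    Each kernel is normalized over the image grid so it sums to 1, which
    makes the whole map sum to the number of annotated people.
    """
    dmap = [[0.0] * width for _ in range(height)]
    for px, py in points:
        kernel = [
            [math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(map(sum, kernel))
        for y in range(height):
            for x in range(width):
                dmap[y][x] += kernel[y][x] / total
    return dmap


heads = [(5, 5), (6, 5), (20, 12)]  # hypothetical (x, y) head annotations
dmap = density_map(heads, height=24, width=32)
estimated_count = sum(map(sum, dmap))  # sums to the number of heads, up to rounding
```

Because the count is recovered by summing the map, a model that predicts such a map gives both a total and a picture of where the people are.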

Disadvantage: Although these techniques perform better in both sparse and dense scenarios and reduce the dependency on a detector, they still rely heavily on handcrafted features. As a result, the feature extraction algorithm is the main limitation of regression-based methods.
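The reliance on handcrafted features can be seen in a very small example of individual-based regression. The sketch below assumes a single feature, the foreground pixel area of a frame, and fits an ordinary least squares line from that feature to the count. The training numbers are made up for illustration, and real systems would combine several features and often use a nonlinear regressor.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: foreground pixel area per frame and the true count.
foreground_pixels = [1200, 2500, 3700, 5100, 6300]
true_counts = [10, 21, 30, 42, 52]

slope, intercept = fit_line(foreground_pixels, true_counts)


def predict_count(pixels):
    """Map a foreground pixel area to an estimated head count."""
    return slope * pixels + intercept


estimate = predict_count(4000)
```

The quality of the estimate depends entirely on how well the chosen feature tracks the true count, which is the limitation described above.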

Counting using CNN

Thanks to their powerful feature extraction capabilities, CNNs can extract features automatically and be trained end-to-end to count individuals. These methods adapt to changes in various factors, predict the number of individuals more accurately, and achieve state-of-the-art results on many popular evaluation benchmarks. CNN-based methods outperform other approaches in scenarios involving a wide range of human head scales, non-uniform density distributions, and significant variations in perspective and scene.
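Many CNN-based counters predict a density map rather than a single number. Under that assumption, the count and a typical training objective can be written as below. The symbols are introduced here for illustration only: $\hat{D}$ is the predicted density map of size $H \times W$, $D_n$ is the ground-truth map for training image $X_n$, $\theta$ denotes the network parameters, and $N$ is the number of training images. The pixel-wise Euclidean loss is one common choice, not the only one.

```latex
\hat{C} = \sum_{i=1}^{H} \sum_{j=1}^{W} \hat{D}(i, j),
\qquad
L(\theta) = \frac{1}{2N} \sum_{n=1}^{N} \left\lVert \hat{D}(X_n; \theta) - D_n \right\rVert_2^2
```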

CNN-based crowd counting methods can be grouped into the following categories:


  1. Multi-scale fusion: Methods such as MCNN, CrowdNet, and SaCNN focus on fusing features of different scales to handle varying head scales and crowd sizes. They use multi-column convolutional networks or scale-adaptive convolutional neural networks for feature extraction and fusion.
  2. Attention-based: Approaches like MSAN, SCAR, and SFANet utilize attention mechanisms to address challenges such as changes in head scales and complex crowd scenes. Attention is used to guide the network to focus on important regions and improve counting accuracy.
  3. Patch-based: Hydra-CNN, Switching CNN, and IG-CNN divide images into patches and count them separately, addressing uneven crowd density. They employ various techniques such as adaptive patch response, selective network branching, and incremental learning.
  4. Multi-density map fusion: Methods like ASD and DecideNet fuse density maps of multiple scales or levels to handle varying conditions. They use weight information or adaptive calibration to combine density maps and improve counting accuracy.
  5. GAN-based: GAN-based approaches, such as MS-GAN, leverage adversarial networks to generate more accurate density maps. Generative and discriminative models compete to understand the distribution of crowd data, leading to improved counting performance.
  6. Context-based: CP-CNN and other context-based methods utilize contextual and semantic information to constrain density maps. They integrate global and local context information to generate high-quality density maps.
  7. Coarse-to-fine: Coarse-to-fine approaches, including DRSAN and ic-CNN, initially generate a coarse density map and then refine it for finer counting results. They use recurrent spatial-aware networks or multi-stage fusion to enhance the density map quality.
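The patch-based idea in the list above can be sketched in a few lines: split the image into patches, estimate a count for each patch, and add the results. This is only a schematic of the divide-and-sum structure, not an implementation of Hydra-CNN, Switching CNN, or IG-CNN. The function names are hypothetical, and `toy_patch_counter` is a stand-in for a trained per-patch network.

```python
def split_into_patches(image, patch_h, patch_w):
    """Yield non-overlapping patches (lists of rows) covering the image."""
    height, width = len(image), len(image[0])
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            yield [row[left:left + patch_w] for row in image[top:top + patch_h]]


def count_by_patches(image, patch_counter, patch_h=2, patch_w=2):
    """Sum per-patch estimates to obtain the count for the whole image."""
    return sum(patch_counter(p) for p in split_into_patches(image, patch_h, patch_w))


def toy_patch_counter(patch):
    """Stand-in for a trained per-patch network: sums a density patch."""
    return sum(map(sum, patch))


# Toy 4x4 "density map" with most of the mass in the top-left corner.
toy_density = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.25, 0.25],
    [0.0, 0.0, 0.25, 0.25],
]
total = count_by_patches(toy_density, toy_patch_counter)
per_patch = [toy_patch_counter(p) for p in split_into_patches(toy_density, 2, 2)]
```

Here `per_patch` shows the uneven density directly: one patch holds two people, one holds one, and two are empty. A method such as Switching CNN would use that kind of difference to decide which regressor handles each patch.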


