AUTOMATED GUN DETECTION WITH COMPUTER VISION

Published in: Scientific journal "Internauka" No. 2(272)
Author(s): Saakyan Argo Armenovich
Journal section: 3. Information Technology
Article DOI: 10.32743/26870142.2023.2.272.350761
Bibliographic description
Saakyan A.A. AUTOMATED GUN DETECTION WITH COMPUTER VISION // Internauka: electronic scientific journal. 2023. No. 2(272). URL: https://internauka.org/journal/science/internauka/272 (accessed: 03.01.2025). DOI:10.32743/26870142.2023.2.272.350761

Saakyan Argo Armenovich

Computer Vision Researcher at Diagnocat,

Armenia, Yerevan

ABSTRACT

Neural networks and computer vision have a wide range of uses in different fields. In this article, we discuss how to build an automated gun detection system with computer vision, based on an object detection algorithm and an image classifier used in a cascade.

 

Keywords: Gun detection, computer vision, deep learning, data science.

 

What is Computer Vision?

Let's begin with the basics. To understand what computer vision is, we first need to understand the more general term: machine learning. The core idea of machine learning is that we fit an algorithm to certain data in order to get predictions on new, similar, but previously unseen data.

Computer vision is a branch of machine learning whose main goal is analyzing images. Video also belongs here, as it is a sequence of frames (images). Deep neural networks are often used in computer vision, and they are what we are going to talk about. There are several main tasks in computer vision: classification, detection and segmentation.

Classification is the basic task. It is not that widespread on its own, but other tasks build on it. Its goal is to predict which class or classes are shown in the image. Given an image, the neural net returns a prediction of the classes found in it.

Object detection is a hugely important task in real-world applications. Its goal is to localize an object and to classify it. If there are several objects in the image, all of them are detected. Given an image, the neural net returns every object localized and classified. The result is usually shown with a bounding box - a rectangle around the object.

Segmentation is pixel-by-pixel classification, so it shows exactly where the object is. Given an image, the neural net returns every object painted over (the painted region is called a mask). Segmentation is important when we really need to know where all the edges of the object are, for example in medical images and 3D volumes.

What is our task?

Let's get closer to our task. We want to detect a gun (small or big) with security cameras, and that should happen automatically. Our neural net needs to process every frame (or close to that) from every camera and trigger an alarm when something suspicious appears. This system can be used anywhere: in a school, shopping center or hospital, on the streets, maybe around people's houses. The system is not going to be perfect, so in corporate usage we will need an operator to validate the alarms.

This system can be interesting for businesses, government and individuals. For business owners, it's good to have a stable security system without a ton of operators, especially if it's a private school.

For governments, it's important to keep streets and public places safe, especially in the US, where guns are a known problem.

For individuals, it's great to have more than an ordinary security system around the house.

So, the problem is important and relevant.

What challenges do we have?

It's obvious now that object detection should help with this task, but there are some difficulties:

- Our solution needs to be fast, meaning it should be able to process a lot of frames per second. The more cameras we have, the more frames we need to process. We can't throw an unlimited amount of hardware at the problem, because an expensive solution doesn't make sense for a business. The faster our neural net is, the more frames per second we can process on one server.

- A lot of false alerts. We need to process video streams 24/7. That's more than 2 million frames per day per camera (25 frames per second * 60 seconds * 60 minutes * 24 hours = 2,160,000). Even with a really low false positive rate (a false positive is when the model mistakes, for example, a cellphone for a handgun and raises a false alert), we are still going to get a ton of alerts with that many frames. And we can't afford a lot of operators to check whether every alert is real, because the solution would become too expensive. In 24/7 systems, it is critically important to keep false positives as low as possible.

- The first two problems don't play well with each other, as there is always a tradeoff between accuracy and speed. The faster the neural net, the less accuracy we get and the more false alerts we create. Usually, if we want more accurate models, we need more hardware to run bigger models at the required speed.
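To get a feel for the false-alert problem from the list above, here is a small sketch of the arithmetic. The frame rate comes from the text; the false positive rate of one per 100,000 frames is a made-up value for illustration, and the function names are hypothetical.

```python
FPS = 25  # frame rate used in the text


def frames_per_day(fps: int = FPS) -> int:
    """Number of frames one camera produces in 24 hours."""
    return fps * 60 * 60 * 24


def expected_false_alerts(false_positive_rate: float,
                          cameras: int = 1,
                          fps: int = FPS) -> float:
    """Expected false alerts per day, assuming each frame can
    independently raise a false alert with the given probability."""
    return frames_per_day(fps) * cameras * false_positive_rate


if __name__ == "__main__":
    print(frames_per_day())             # 2160000
    # Hypothetical rate: one false positive per 100,000 frames.
    print(expected_false_alerts(1e-5))  # about 21.6 per camera per day
```

Even at this assumed rate, one camera would produce roughly 22 false alerts a day, and the number grows linearly with the number of cameras.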

Cascade models as a solution

So, can we increase our performance without losing processing speed, using the same amount of computing power? We can get pretty close to that with a cascade. The idea is to use two neural nets in a cascade (one after another), with both nets tuned in a specific way to complement each other:

- The first neural net is a fast detector. It runs on the whole frame, downscaled to 640x640, and localizes and classifies objects of interest. This model is tuned for higher recall at the cost of precision, which means that it will try to always detect a gun, but will often give a false positive on a cellphone or something else. The detector was trained on images with guns that are close to what it sees at inference.

- The second neural net is a classifier. It runs on the crop made by the detector, and only if the detector finds anything in the frame. Specifically, we take the coordinates of the crop from the detector and cut that crop out of the original image, so the classifier gets a higher-resolution image. The classifier's task is to validate whether the object chosen by the detector is a gun or not. The classifier is trained on crops from the detector, and it has several classes: not only a gun, but also other objects that are often detected as a gun, for example a cellphone, an umbrella, a bag and others.

The goal of the detector is to detect any object that might be a gun. The classifier should then correctly validate the object. Both nets are trained with these goals in mind. I should mention that the classifier is easier to retrain on real cases after deployment, so that's a great thing to do.
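The hand-off between the two nets can be sketched as follows. This is only an illustration and not our production code: `detect` and `classify` are hypothetical stand-ins for the two models, the frame is a toy nested list, and the sketch assumes the frame is plainly resized to 640x640 (if the detector pads the image, the mapping back to the original frame has to account for the padding).

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
Frame = List[List[int]]          # toy image: rows of pixel values

DET_SIZE = 640  # side of the detector input, as in the text


def scale_box(box: Box, orig_w: int, orig_h: int,
              det_size: int = DET_SIZE) -> Box:
    """Map a box from detector coordinates back to the original frame.
    Assumes a plain resize to det_size x det_size without padding."""
    x1, y1, x2, y2 = box
    sx, sy = orig_w / det_size, orig_h / det_size
    return (int(x1 * sx), int(y1 * sy), int(x2 * sx), int(y2 * sy))


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the box out of the full-resolution frame."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in frame[y1:y2]]


def cascade(frame: Frame,
            detect: Callable[[Frame], List[Box]],
            classify: Callable[[Frame], str]) -> List[Box]:
    """Return the boxes (in original coordinates) that the classifier
    confirms as a gun. The classifier runs only on detector hits."""
    orig_h, orig_w = len(frame), len(frame[0])
    confirmed = []
    for box in detect(frame):  # boxes come in 640x640 coordinates
        full_res_box = scale_box(box, orig_w, orig_h)
        if classify(crop(frame, full_res_box)) == "gun":
            confirmed.append(full_res_box)
    return confirmed
```

For a 1280x720 frame, a detector box of (320, 320, 480, 480) maps to (640, 360, 960, 540), so the classifier receives a 320x180 crop instead of the 160x160 region the detector saw.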

 

Figure 1. Pipeline

 

Why does it work?

Let's first discuss why it is fast:

- Basically, the speed of our solution is tied to the speed of the first neural network - the detector. That's because the second net runs only when the first net finds something, so 99% of the time the second net is idle. When we do need to run the second net, we can do that in a separate process, so the detector does not lose fps. And since classifiers are fast and ours works on a crop, not much computing power is needed, because the image is pretty small.

- The second thing is that because the classifier takes care of accuracy, we can use a fast detector.

And now let's discuss why we gain accuracy:

- We use an ensemble of two models.

- The second model gets a full-resolution part of the image.

- We use different datasets for our models, and our classifier is tuned to distinguish guns from other similar objects.

- From the detector we need good detection, and from the classifier good classification. We can choose models based on that. We don't have to give up classification quality just because the detector must be fast. So, we end up with the best fit on the detection side and on the classification side.

- We can easily retrain the classifier after deployment and add new objects that are often detected as a gun, and this should not affect detection scores.

Metrics

Here are some metrics on real data (this data does not come from our dataset, not even from its test part; it was taken at random). As you can see on the right side, Recall [1] decreased slightly, but Precision [1] increased noticeably. That's good, because we don't want to spam operators.

 

Figure 2. Metrics
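As a reminder of how these metrics are computed, here is a minimal sketch using the standard definitions. The counts in the example are made up for illustration and are not the values from Figure 2.

```python
def precision(tp: int, fp: int) -> float:
    """Share of raised alerts that were real guns."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of real guns that raised an alert."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


if __name__ == "__main__":
    # Hypothetical counts: detector alone vs. detector + classifier.
    for name, tp, fp, fn in [("detector only", 95, 40, 5),
                             ("cascade", 92, 4, 8)]:
        print(f"{name}: precision={precision(tp, fp):.2f} "
              f"recall={recall(tp, fn):.2f}")
```

In this made-up example, the cascade loses a few true detections (lower recall) but removes most of the false alerts (much higher precision), which is the kind of trade we are after.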

 

Dataset

Now let's touch on the specifics of collecting the dataset for this technique. As there is no good open dataset for gun detection, we collected our own. Of course, we kept all the general recommendations in mind, like collecting a balanced, diverse and representative dataset, labeling it consistently and so on. But here are a couple of things that I want to mention.

For the detector, we collected data as close to inference as possible. If your inference is going to run on security cameras, you want to collect your dataset on cameras similar to those. We placed the cameras at the same positions and angles that the inference cameras were going to have. The objects in the frame should be close to inference, too. In the gun example, you don't need to hold a gun 10 centimeters in front of the camera - it's not realistic. Summing up, we tried to be as close to realistic scenarios as possible.

 

Figure 3. Bad example of image for detector training

 

Figure 4. Good example of image for detector training

 

For the classifier, we created a lot of crops with our detector. It's a good idea to run the detector with a low threshold to get a lot of false positives. Then we analyzed the images, and when a group of objects such as cellphones showed up often, we created a separate class for it. This way, we collected crops of guns and of some other objects. After that, we created a class named 'other' and put there all the images that don't belong to any of the created classes.
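The grouping step can be sketched like this. It assumes the crops have already been reviewed and tagged by hand with a free-form label; the `min_count` cutoff and the function name are hypothetical, since in practice we decided by looking at the images rather than by a fixed number.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def build_class_map(labels: Iterable[str],
                    min_count: int,
                    always_keep: Sequence[str] = ("gun",)) -> Dict[str, str]:
    """Map every raw label to its training class.

    Labels that show up at least min_count times (or are listed in
    always_keep) get a class of their own; the rest go to 'other'.
    """
    counts = Counter(labels)
    return {
        label: label if label in always_keep or n >= min_count else "other"
        for label, n in counts.items()
    }


if __name__ == "__main__":
    # Made-up review results for crops produced by the detector.
    raw = (["gun"] * 3 + ["cellphone"] * 5 + ["umbrella"] * 4
           + ["wrench", "remote"])
    print(build_class_map(raw, min_count=4))
```

With these made-up labels, 'cellphone' and 'umbrella' become separate classes next to 'gun', while the rare 'wrench' and 'remote' crops end up in 'other'.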

Models

Now let's get more specific about the models we used.

Our first model - the detector - ended up being YOLOv5s [2]. We tried some other models, but YOLOv5 was chosen based on accuracy and performance. It was trained with customized hyperparameters on ~30 thousand images for 45 epochs and reached a mAP [3] of 0.97 on the test set.

A classification model was chosen as the second neural net, although other solutions were tested (like having two detectors in a cascade). EfficientNetB0 [4] ended up being the best choice based on accuracy and performance. Our F1 score [5] was around 0.98. Besides known architectures, a custom CNN was tested too. For cases when the solution needs to be deployed on edge devices, we created a lightweight classifier. This model was 3 times smaller and 2 times faster, but its F1 score was around 0.94. So, EfficientNet is usually used, and the custom CNN is ready if speed is really important.

Metrics summary:

YOLOv5s - mAP: 0.97

EfficientNetB0 - F1: 0.98

Deployment

As I have mentioned before, we need our solution to be fast and scalable. That's why we used two important tools: TensorRT [6] and TritonServer [7].

TensorRT helps to optimize and speed up a neural net. Here are the key features of TensorRT:

- Horizontal and vertical merging of layers (for example, a convolution layer, a batch norm layer and an activation function merge into one block) and removal of unneeded layers.

- Quantization - lowering precision of weights to fp16 or int8.

- Automatic selection of the best kernel for exact operations and exact hardware.

In the end, we got a 2-5 times speedup and a noticeable decrease in model size.

TritonServer is serving software that helps to deploy your models in a scalable way. Here are some key features:
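To show why merging a convolution, a batch norm and an activation is possible at all, here is a minimal sketch of the arithmetic for a single channel of a 1x1 convolution. This is not TensorRT code and does not use its API; it only illustrates, under the assumption of inference-mode batch norm with fixed statistics, that the batch norm can be folded into the convolution's weight and bias. All names are hypothetical.

```python
import math


def conv_bn_relu(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Unfused block: three separate operations."""
    y = w * x + b                                          # 1x1 conv, one channel
    y = gamma * (y - mean) / math.sqrt(var + eps) + beta   # batch norm (inference)
    return max(y, 0.0)                                     # ReLU


def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch norm parameters into the conv weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta


def fused_block(x, w_fused, b_fused):
    """Fused block: one multiply-add followed by ReLU."""
    return max(w_fused * x + b_fused, 0.0)


if __name__ == "__main__":
    params = dict(w=0.5, b=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=0.04)
    w_f, b_f = fold_batch_norm(**params)
    for x in (-2.0, 0.0, 1.0, 3.0):
        # The two values in each row match up to rounding error.
        print(conv_bn_relu(x, **params), fused_block(x, w_f, b_f))
```

The fused block gives the same result with fewer operations and fewer memory reads, which is where part of the speedup comes from.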

- Runs your models as a backend, so you can send gRPC or HTTP requests with your frames to the model from different clients and get predictions.

- Supports different models (TensorRT, TensorFlow, PyTorch, ONNX...).

- GPU usage is optimized.

You can have several models in TritonServer and just send your frames for processing from several clients.

Pipeline

Here are the steps of our system's operation:

1) Grab the frame.

2) Preprocess the frame and send it to the detector.

3) If the detector finds an object, take its coordinates and crop the original frame. The detection loop then starts over with the next frame without waiting for the classifier.

4) For every crop from the step above, run the classification model in a parallel process.

5) If the classifier validates the object, send an alarm.

6) The operator gets the alarm in the UI and acts according to the protocol.
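The steps above can be sketched as a producer-consumer loop. This is a simplified illustration: it uses a worker thread and an in-memory queue instead of a separate process, and `detect`, `classify` and `send_alarm` are hypothetical callables standing in for the detector, the classifier and the alarm UI.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List


def run_pipeline(frames: Iterable[Any],
                 detect: Callable[[Any], List[Any]],
                 classify: Callable[[Any], str],
                 send_alarm: Callable[[int], None]) -> None:
    """Run detection on every frame and classification in the background."""
    crops: queue.Queue = queue.Queue()

    def classifier_worker() -> None:
        # Steps 4-5: validate crops without blocking the detection loop.
        while True:
            item = crops.get()
            if item is None:  # sentinel: no more crops will arrive
                break
            frame_id, crop = item
            if classify(crop) == "gun":
                send_alarm(frame_id)

    worker = threading.Thread(target=classifier_worker)
    worker.start()

    # Steps 1-3: grab a frame, detect, hand crops over, move on.
    for frame_id, frame in enumerate(frames):
        for crop in detect(frame):
            crops.put((frame_id, crop))

    crops.put(None)  # tell the worker to stop once the queue is drained
    worker.join()
```

The detection loop only puts crops on the queue and immediately continues with the next frame, so a slow classification never lowers the detector's frame rate.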

Conclusion

Automated systems like this one are really important to have, and systems of this kind are surely going to be deployed everywhere in the future as part of smart home and smart city projects.

 

References:

  1. Precision and Recall - https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall?hl=en
  2. YOLOv5 - https://github.com/ultralytics/yolov5
  3. Mean Average Precision - https://blog.paperspace.com/mean-average-precision/
  4. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks - https://arxiv.org/abs/1905.11946
  5. F1 score - https://www.educative.io/answers/what-is-the-f1-score
  6. TensorRT - https://developer.nvidia.com/tensorrt
  7. Triton Inference Server - https://github.com/triton-inference-server/server