Features extraction from multi-spectral remote sensing images based on multi-threshold binarization – Scientific Reports

admin
13 Min Read

In this paper, we propose a solution to resolve the limitation of deep CNN models in real-time applications. The proposed approach uses multi-threshold binarization over the whole multi-spectral remote sensing image to extract the vector of discriminative features for classification. We compare the classification accuracy and the training time of the proposed approach with ResNet and Ensemble CNN models. The proposed approach shows a significant advantage in accuracy for small datasets, while keeping very close recall score to both deep CNN models for larger datasets. On the other hand, regardless of the dataset size, the proposed multi-threshold binarization provides approximately 5 times lower training and inference time than both ResNet and Ensemble CNN models.

Remote sensing tasks are currently vital for a wide range of applications in our life. The workflow of remote sensing is typically connected to the analysis of large data flows by expert systems and deriving the corresponding inferences, evaluations and predictions. Naturally, such workflows heavily rely on artificial intelligence (AI) solutions. By using deep learning (DL), modern AI algorithms achieved a significant boost in terms of speed and accuracy of decision making, as well as in terms of the overall volume of the processed data in real time. Notwithstanding the already terrific abilities of state-of-the-art AI-empowered remote sensing solutions, which are able to process images in quasi-real-time with 15-20 fps, there is still a large subset of tasks that require processing time in the range of milliseconds. The most important aspect of the computational performance of the deep learning algorithm is the feature extraction process, which significantly affects the classification performance. According to the literature, feature extraction typically takes up to 80% of all computational loads. The most widely adopted DL solution for image and video processing is based on convolutional neural networks (CNNs), which use a mix of convolutional and dense (fully connected) layers. The convolutional layers work as feature extractors, while dense layers are used for classification or regression decisions based on the extracted features. Sometimes, feature extractors can use recurrent layers, which provide generative capabilities to the model. Thus, the internal structure and deepness of the CNN model is usually optimized for the specific task to achieve the best performance in terms of accuracy, precision, sensitivity and inference time.

By nature, deeper CNNs are typically more powerful and can be used for more complex tasks in remote sensing. However, the size of the CNN, or any other DL model, is directly connected to the computational requirements and processing time, which constrain its practical applicability. Thus, the alternative approach is to use models based on binary neural networks, i.e., neural networks, where all weights are binary, instead of floating or integers. Such an approach allows the use of much simpler hardware and provides much faster inference on segmentation tasks based on the generalization of image features.

Typically, CNNs are very likely to run into overfitting by lacking generalization abilities and being biased to the training data. There are several techniques to avoid overfitting, such as adding additive noise to the training data, using dropout layers and optimizing the ratio between model complexity, number of training samples and target accuracy. Feature extraction is a far more complicated process than classification and requires additional optimization steps for training. The training time for the model depends on the internal number of trainable parameters and the size of the dataset. Therefore, the initial optimization direction is to determine the optimal dataset, which is relatively small but provides the target variability, discriminative ability and statistical completeness.

The distinctive property of the typical remote sensing task compared to conventional computer vision tasks is the multi-spectral image processing. Multi-spectral images are composed of a set of monochrome subimages of the same scene from different sensors (cameras) with different wavelengths. The commonly used RGB images are the particular case of the multi-spectral image composed of subimages of red, green and blue colors. Industrial remote sensing satellites, such as Landsat 5, work with 7 subimages, which cover both the visible and infrared ranges from 450 to 1250 nm.

All known approaches for image processing and recognition can be applied for the multi-spectral images by processing each subimage separately. For example, the multi-spectral image can be considered a combination of monochrome images, which are processed separately to extract the distinctive features and landmarks and combine them within a single image. However, better results can be achieved by multi-spectral processing of the image as a whole. This is especially important for the classification tasks, when we need to extract the feature vectors and create a dataset for deep learning. To fully exploit the additional information from several spectral bands, we need to analyze a multi-spectral image as a whole rather than a combination of multiple grayscale images. Thus, each pixel of a multi-spectral image can be represented in a n-dimensional hyperspace as a vector of length k.

There are specific methods dedicated to the processing of the multi-spectral images. In the segmentation of multi-spectral images, each pixel in a different spectrum band forms a vector of features with different intensities, which represent its position in the k-dimensional feature space. The simplest approach to determine a class is to choose the upper and lower thresholds for each spectrum band to represent an m-dimensional hypercube in a feature space. If the feature vector of the pixel fits into the corresponding class localization in a hypercube, the pixel is classified to that class accordingly. In many tasks, segmentation can be simplified to binarization, which reduces the computational complexity and uncertainty of decisions.

Usually, the classification of the multi-spectral images by deep learning methods is implemented in the same way for any problem. The key difference is all the time hidden in the feature extraction part, which is implemented by the convolutional neural networks. As an alternative, multi-level approaches are sometimes used, such as decision trees and approaches based on gradient boosting.

From the perspective of processing and feature extraction of multispectral images, decomposition methods have proven to be effective, as they enable the detailing of potential hidden features. In particular, approaches utilizing processing in the spectral domain and methods based on the use of subpixel data for multispectral images can be distinguished. Feature formation technology based on threshold approaches works well for multichannel images, making it possible to apply it to multispectral images.

In the majority of cases, approaches to feature extraction of multispectral images rely on local descriptor methods that do not account for possible relationships in homogeneous regions. To address this issue, a method is proposed in that is based on the extraction of structural features with multiple characteristics of spectral-spatial structures of different objects. Here, local and global structural features are formed and combined into a general feature vector.

In, it is further suggested to utilize multiscale features, which enable effective representation in the feature space of image regions containing objects of arbitrary size or varying scales, thereby enhancing the ability to analyze multispectral images at different levels of detail. This approach allows for the efficient handling of multispectral images, catering to the diverse requirements of various applications in the realm of remote sensing, pattern recognition, and image processing.

The main problem, that occurs during the recognition of multi-spectral images, is that the overall data volume for processing is significantly higher, which negatively affects the processing time and memory consumption. In many tasks, it results in significant constraints for real-time computer vision applications. Recently, we observe numerous developments of smaller and faster CNN architectures such as MobileNet, SqueezeNet, ShuffleNet, etc, which are suitable for resource-constrained environments, such as microcontrollers and mobile devices. However, the main limitation of those models is that they still use convolutional layers for the feature extraction, but their number is significantly reduced and some layers may be simplified by replacing regular convolutions with depthwise convolutions. With depthwise convolution we separate all channels of the image, perform independent convolutions for each channel, and stack the result afterwards. This approach has been proven to be effective for many tasks on RGB images, where number of channels is limited to 3. Typically, those models are just 1-5% below their much larger counterparts in the key performance metrics. However, multi-spectral images sometimes can exceed 100 contiguous spectral bands (channels), which makes depthwise convolutions less favourable in this context. Moreover, a variable number of channels in multi-spectral images does not align well with highly determined architecture of any CNN, where even small change of image resolution has large impact on the overall network performance. Thus, such architectures are not yet suitable for the remote sensing applications.

Therefore, further research on feature extraction approaches for multi-spectral images to address real-time requirements is timely and relevant to the overall field of deep learning-based computer vision and remote sensing applications in particular.

To address the aforementioned problem, in this paper we propose a new approach for the features extraction from multi-spectral images. Proposed approach is aimed to decrease the complexity of the computer vision applications in remote sensing. In particular, we assess the training complexity of the deep learning models on multi-spectral images due to the convolutional layers. We also prove that employing preliminary feature extraction, based on multi-threshold binarization, allows to speed-up the model training process. In addition, the proposed approach is considerably faster and requires a smaller training set compared to conventional training of the convolutional neural networks. The main contribution of the paper are the following:

The remainder of this paper is organized as follows. In section “Introduction”, we explain in detail the proposed approach for feature extraction from multi-spectral images based on the multi-threshold binarization. In section “Multi-threshold binarization for feature extraction from multi-spectral images”, we provide an experimental evaluation and performance analysis. Finally, we conclude this article in section “Experimental evaluations and performance analysis”.

Share This Article
By admin
test bio
Please login to use this feature.