Semantic segmentation of terrace image regions based on lightweight CNN-Transformer hybrid networks

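Cite this article:	LIU Xi, YI Shi, LI Li, CHENG Xinghao, WANG Cheng. Semantic segmentation of terrace image regions based on lightweight CNN-Transformer hybrid networks[J]. Transactions of the Chinese Society of Agricultural Engineering, 2023, 39(13): 171-181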

Authors:	LIU Xi, YI Shi, LI Li, CHENG Xinghao, WANG Cheng

Affiliation:	1. School of Mechanical and Electrical Engineering, Chengdu University of Technology, Chengdu 610059

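Funding:	Open Fund of the Vehicle Measurement, Control and Safety Key Laboratory of Sichuan Province (QCCK2021-008); Chengdu University of Technology Higher Education Talent Training Quality and Teaching Reform Project (JG2130216)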

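Abstract:	Terracing is a traditional form of agricultural cultivation that stabilizes crop production and conserves soil and water, and building terraces is one of the important measures for developing agricultural production. Rapid and accurate collection of information on the distribution of terraced areas is important for raising grain yields, controlling soil erosion, and planning regional ecology. In UAV images, the boundaries of terrace paths are blurred and form long strip-like structures. To capture terrace edge information more accurately, and inspired by MobileViT, this study introduced an axial attention mechanism into the MobileViT block, adopted an encoder-decoder structure, and proposed a network model based on a lightweight CNN-Transformer hybrid architecture. The encoder consists of the improved MobileViT block, inverted residual modules that incorporate strip pooling, and an atrous spatial pyramid pooling module; the order in which these modules are arranged is designed so that local and global visual representations interact, yielding a complete global feature representation. The decoder applies sampling and convolution operations to the multi-scale feature maps extracted by the encoder to produce the semantic segmentation map. PSPNet, LiteSeg, BiSeNetV2, Deeplabv3Plus, and MobileViT were selected for comparative experiments on the same test set. The results show that the proposed model has certain advantages in both accuracy and speed: its pixel accuracy reaches 95.79%, its frequency-weighted intersection over union reaches 94.86%, and it has 8.32 M parameters. The model thus segments complex, irregular terrace regions in UAV images fairly accurately with few parameters and a simple method. Deployed on a UAV, it can further obtain the shape, position, and contour of terraces; timely and accurate terrace edge information provides an important basis for preventing damage to terraces and for repairing and reinforcing them, and also helps in compiling statistics on the planted area and extent of terraced regions. The study is intended to provide a reference for the development of terraces and of agriculture in dry-farming areas.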
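As a rough illustration of the axial attention named in the abstract, the following pure-Python sketch attends along the height axis and then along the width axis of a small feature map. It is not the paper's implementation: the helper names `attend_1d` and `axial_attention` are hypothetical, and the learned query/key/value projections, multiple heads, and positional terms used in practice are omitted (queries, keys, and values are simply the input vectors).

```python
import math


def attend_1d(seq):
    """Self-attention over a 1-D sequence of feature vectors.

    Assumption for illustration: Q = K = V = seq (no learned projections).
    """
    dim = len(seq[0])
    out = []
    for q in seq:
        # Scaled dot-product scores between this query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in seq]
        # Numerically stable softmax.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        # Weighted sum of the value vectors.
        out.append([
            sum(wt / norm * v[d] for wt, v in zip(weights, seq))
            for d in range(dim)
        ])
    return out


def axial_attention(fmap):
    """Apply 1-D attention along columns (height), then along rows (width).

    fmap[r][c] is the feature vector at row r, column c.
    """
    h, w = len(fmap), len(fmap[0])
    # Height axis: each column is treated as an independent sequence.
    cols = [attend_1d([fmap[r][c] for r in range(h)]) for c in range(w)]
    after_height = [[cols[c][r] for c in range(w)] for r in range(h)]
    # Width axis: each row is treated as an independent sequence.
    return [attend_1d(row) for row in after_height]


if __name__ == "__main__":
    # Toy 2 x 3 feature map with 2 channels per position.
    demo = [[[float(r), float(c)] for c in range(3)] for r in range(2)]
    for row in axial_attention(demo):
        print(row)
```

Because each position attends only within its own column and its own row, the number of pairwise scores is H·W·(H+W) rather than (H·W)² for full 2-D self-attention, which may be why this mechanism is attractive for long, strip-like terrace boundaries in a lightweight model.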
Keywords:	image processing; semantic segmentation; lightweight model; axial attention; terrace dataset
Received:	2023-04-05
Revised:	2023-05-26
Semantic segmentation of terrace image regions based on lightweight CNN-Transformer hybrid networks

LIU Xi, YI Shi, LI Li, CHENG Xinghao, WANG Cheng. Semantic segmentation of terrace image regions based on lightweight CNN-Transformer hybrid networks[J]. Transactions of the Chinese Society of Agricultural Engineering, 2023, 39(13): 171-181

Authors:	LIU Xi, YI Shi, LI Li, CHENG Xinghao, WANG Cheng

Affiliation:	1. School of Mechanical and Electrical Engineering, Chengdu University of Technology, Chengdu 610059, China

Abstract:	Terracing is widely used in conventional cultivation to stabilize crop production and to conserve soil and water, and constructing terraces is one of the most important measures for developing agricultural production. However, terraces often face the risk of damage because of the influence of construction quality during management and maintenance. There is therefore a strong demand for quick and accurate detection of the distribution of terraced areas, in support of higher food production, soil erosion control, and regional ecological planning. Unmanned aerial vehicle (UAV) camera systems are widely used to obtain high-resolution remote sensing images in intelligent agriculture, and deep-learning-based semantic segmentation has advanced several fields alongside the rapid development of information technology. Inspired by MobileViT, this study introduced an axial attention mechanism into the MobileViT block and proposed an encoder-decoder network model based on a lightweight CNN-Transformer hybrid architecture. The encoder consisted of the improved MobileViT block, inverted residual modules incorporating strip pooling, and an atrous spatial pyramid pooling (ASPP) module. The placement order of these modules was designed so that local and global visual representations interact, in order to obtain a complete global feature representation. Strip pooling was introduced to effectively capture long-range dependencies, so that high-level semantic feature maps could be extracted efficiently from a large amount of semantic information.
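The strip pooling idea in the sentences above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the module used in the paper: the function name `strip_pool_gate` is hypothetical, the feature map is a single channel stored as nested lists, and the 1-D convolutions that normally follow each pooled strip are left out.

```python
import math


def strip_pool_gate(feature):
    """Gate a single-channel feature map with row-strip and column-strip averages.

    Simplified sketch: each pooled strip is broadcast back over the map,
    the two are summed, passed through a sigmoid, and used to rescale
    the input. Learned 1-D convolutions are omitted (assumption).
    """
    h, w = len(feature), len(feature[0])
    # Horizontal strips: one average per row (pooling window 1 x W).
    row_mean = [sum(row) / w for row in feature]
    # Vertical strips: one average per column (pooling window H x 1).
    col_mean = [sum(feature[r][c] for r in range(h)) / h for c in range(w)]

    gated = []
    for r in range(h):
        gated_row = []
        for c in range(w):
            # Every position sees its whole row and its whole column.
            gate = 1.0 / (1.0 + math.exp(-(row_mean[r] + col_mean[c])))
            gated_row.append(feature[r][c] * gate)
        gated.append(gated_row)
    return gated


if __name__ == "__main__":
    # A horizontal band, loosely resembling a terrace edge.
    band = [
        [0.0, 0.0, 0.0, 0.0],
        [1.0, 1.0, 1.0, 1.0],
        [0.0, 0.0, 0.0, 0.0],
    ]
    for row in strip_pool_gate(band):
        print(row)
```

The long, narrow pooling windows let a position gather context from the full length of its row and column, which is the sense in which strip pooling captures long-range dependencies for band-shaped objects.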
The atrous spatial pyramid pooling (ASPP) module was introduced to capture contextual information at multiple scales, enlarging the receptive field of the model to obtain a denser semantic feature map. PSPNet, LiteSeg, BiSeNetV2, Deeplabv3Plus, and MobileViT were selected for comparison experiments on the same test set. The results show that the improved model performed best in terms of accuracy and speed. More importantly, it achieved more accurate recognition and region delineation of complex and irregular terraces in UAV images. Specifically, the lightweight CNN-Transformer hybrid network model reached a pixel accuracy of 95.79%, a mean pixel accuracy of 87.82%, a mean intersection over union of 80.91%, and a frequency-weighted intersection over union of 94.86%. Furthermore, the improved model had only 8.32 M parameters, a small size, and low computational complexity, and ran at 51.91 FPS, indicating a real-time, lightweight model. A comprehensive analysis of the performance indexes of each segmentation model found that the lightweight CNN-Transformer hybrid network model segmented more accurately and faster while keeping a small model size and low computational complexity. The improved model can therefore be expected to be deployed on UAVs and to meet the requirements of lightweight design, high accuracy, and low latency for mobile vision tasks. Semantic segmentation of the terrace area can further provide the shape, location, and outline of terraces. Timely and accurate terrace edge information supports the prevention of damage to terraces and their reinforcement. At the same time, statistics on the cultivated area and extent of terraced regions can be expected to promote the development of terraces and of agricultural construction in dry farming areas.
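The four accuracy figures quoted above are conventionally derived from a per-class confusion matrix. The sketch below uses the standard definitions of pixel accuracy (PA), mean pixel accuracy (mPA), mean intersection over union (mIoU), and frequency-weighted intersection over union (FWIoU). The function name and the toy matrix are illustrative assumptions; the paper's own evaluation code is not shown in the source.

```python
def segmentation_metrics(conf):
    """Compute PA, mPA, mIoU and FWIoU from a confusion matrix.

    conf[i][j] is the number of pixels whose true class is i and whose
    predicted class is j.
    """
    k = len(conf)
    total = sum(sum(row) for row in conf)
    true_count = [sum(conf[i]) for i in range(k)]                       # pixels labelled i
    pred_count = [sum(conf[r][i] for r in range(k)) for i in range(k)]  # pixels predicted i
    correct = [conf[i][i] for i in range(k)]

    # PA: share of all pixels that are classified correctly.
    pa = sum(correct) / total

    # mPA: per-class accuracy averaged over classes present in the labels.
    class_acc = [correct[i] / true_count[i] for i in range(k) if true_count[i] > 0]
    mpa = sum(class_acc) / len(class_acc)

    # IoU per class: intersection / (labelled + predicted - intersection).
    iou = {}
    for i in range(k):
        union = true_count[i] + pred_count[i] - correct[i]
        if union > 0:
            iou[i] = correct[i] / union
    miou = sum(iou.values()) / len(iou)

    # FWIoU: class IoU weighted by the class's share of labelled pixels.
    fwiou = sum(true_count[i] / total * iou[i] for i in iou)

    return {"PA": pa, "mPA": mpa, "mIoU": miou, "FWIoU": fwiou}


if __name__ == "__main__":
    # Toy two-class example (made-up counts, not data from the paper).
    toy = [[50, 10],
           [5, 135]]
    print(segmentation_metrics(toy))
```

FWIoU weights each class IoU by its pixel share, so a well-segmented dominant class pulls it above mIoU; this may explain the gap between the reported 94.86% and 80.91%.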

Keywords:	image processing; semantic segmentation; lightweight model; axial attention; terrace dataset
