Semantic segmentation model for greenhouse tomato images using RGB and depth bimodal data
Cite this article: ZHANG Yufeng, YANG Jing, DENG Hanbing, ZHOU Yuncheng, MIAO Teng. Semantic segmentation model for greenhouse tomato images using RGB and depth bimodal data[J]. Transactions of the Chinese Society of Agricultural Engineering, 2024, 40(2): 295-306
Authors: ZHANG Yufeng  YANG Jing  DENG Hanbing  ZHOU Yuncheng  MIAO Teng
Affiliation: College of Information and Electrical Engineering, Shenyang Agricultural University, Shenyang 110866, China; College of Information and Electrical Engineering, Shenyang Agricultural University, Shenyang 110866, China; Liaoning Engineering Research Center for Information Technology in Agriculture, Shenyang 110866, China
Funding: Sub-project of the National Key Research and Development Program of China (2022YFD2002303-01); General Program of the Basic Scientific Research Projects of the Education Department of Liaoning Province (JYTMS20231303); National Natural Science Foundation of China (31901399); Sub-project of the 14th Five-Year National Key Research and Development Program of China (2021YFD1500204)
Abstract: As an important technique in computer vision, image semantic segmentation has been widely applied in facility environments to plant phenotype detection, robotic harvesting, and scene parsing. In greenhouse environments, immature tomato fruits and the surrounding stems and leaves are similar in color, which lowers image segmentation accuracy. This study proposes DFST (depth-fusion semantic transformer), an "RGB + depth" (RGBD) bimodal semantic segmentation model based on a hybrid Transformer encoder. Depth images were captured under real greenhouse lighting, HHA-encoded, and fed into the model for training together with the corresponding color images; the HHA-encoded depth image acts as an auxiliary modality that is fused with the RGB image for feature extraction, and a lightweight multilayer-perceptron decoder decodes the feature maps to produce the final segmentation. Experiments show that DFST reaches a mean intersection over union (mIoU) of 96.99% on the test set, 1.37 percentage points higher than the same model without depth input and 2.43 percentage points higher than an RGBD semantic segmentation model that uses a convolutional neural network as its feature extraction backbone. The results demonstrate that depth information helps improve the semantic segmentation accuracy of color images and can markedly improve the accuracy and robustness of semantic segmentation in complex scenes, and that the Transformer architecture performs well as a feature extraction network for image semantic segmentation. The model offers a solution and technical support for tomato image semantic segmentation tasks in greenhouse environments.
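The HHA encoding used above converts each valid depth pixel into three channels (horizontal disparity, height above ground, and the angle between the local surface normal and the up direction), so a depth map can be handled like a three-channel color image. Below is a minimal NumPy sketch of this preprocessing step, assuming a pinhole camera with known focal length, a stereo-style baseline for the disparity channel, and a fixed camera up direction; the helper functions and default values are illustrative rather than taken from the paper.

```python
# Minimal sketch of HHA encoding of a depth map; not the authors' code.
# Assumes a pinhole camera with known focal length fx (pixels) and that the
# up direction in camera coordinates is known (camera y axis points down).
import numpy as np

def estimate_normals(points):
    """Very crude surface normals from image-space depth gradients
    (a real implementation would fit local planes)."""
    dzdy, dzdx = np.gradient(points[..., 2])
    n = np.dstack([-dzdx, -dzdy, np.ones_like(points[..., 2])])
    return n / (np.linalg.norm(n, axis=2, keepdims=True) + 1e-6)

def hha_encode(depth_m, fx, baseline_m=0.075, up=np.array([0.0, -1.0, 0.0])):
    """Encode a metric depth map (H, W) into a 3-channel HHA image:
    horizontal disparity, height above ground, angle to the up direction."""
    h, w = depth_m.shape
    depth = np.where(depth_m > 0, depth_m, np.nan)  # mask invalid pixels

    # Channel 1: horizontal disparity (fx * baseline / depth).
    disparity = (fx * baseline_m) / depth

    # Back-project pixels to 3-D camera coordinates.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    points = np.dstack([(u - w / 2) * depth / fx,
                        (v - h / 2) * depth / fx,
                        depth])

    # Channel 2: height along the up direction, shifted to start at 0
    # (stands in for "height above the ground plane").
    height = points @ up
    height = height - np.nanmin(height)

    # Channel 3: angle between the local surface normal and the up direction.
    cos_a = np.clip(estimate_normals(points) @ up, -1.0, 1.0)
    angle = np.degrees(np.arccos(cos_a))

    def to_u8(c):  # stretch each channel to 0..255 so it behaves like an image
        c = np.nan_to_num(c, nan=0.0)
        return (255.0 * (c - c.min()) / (np.ptp(c) + 1e-6)).astype(np.uint8)

    return np.dstack([to_u8(disparity), to_u8(height), to_u8(angle)])
```

In the DFST pipeline described above, the resulting three-channel image is fed to the network alongside the RGB frame as the auxiliary modality.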

Keywords: greenhouse|crop|semantic segmentation|attention mechanism|facility environment|tomato image|RGBD|Transformer
Received: 2023-09-22
Revised: 2024-01-08

Semantic segmentation model for greenhouse tomato images using RGB and depth bimodal data
ZHANG Yufeng, YANG Jing, DENG Hanbing, ZHOU Yuncheng, MIAO Teng. Semantic segmentation model for greenhouse tomato images using RGB and depth bimodal data[J]. Transactions of the Chinese Society of Agricultural Engineering, 2024, 40(2): 295-306
Authors:ZHANG Yufeng  YANG Jing  DENG Hanbing  ZHOU Yuncheng  MIAO Teng
Affiliation: College of Information and Electrical Engineering, Shenyang Agricultural University, Shenyang 110866, China; College of Information and Electrical Engineering, Shenyang Agricultural University, Shenyang 110866, China; Liaoning Engineering Research Center for Information Technology in Agriculture, Shenyang 110866, China
Abstract: Image semantic segmentation has been widely used in applications such as plant phenotyping, robotic harvesting, and facility scene analysis. Tomato is one of the most important vegetable crops in greenhouse environments, and phenotyping requires periodic information on fruit status, such as shape and color. Manual sampling and detection cannot meet the demands of high throughput and high precision, because they are time-consuming, labor-intensive, and inefficient. Computer vision has therefore been widely adopted for image semantic segmentation in recent years, and segmentation is frequently used to separate crop fruits (foreground) from the growth environment (background) in complex scenes. However, segmentation accuracy in complex greenhouse environments still needs to improve, owing to uneven lighting, overlap and occlusion between fruits and leaves, and the similarity in texture and color between immature fruits and foliage. Traditional deep convolutional segmentation networks are trained on the RGB modality alone, and even as deep learning models continue to evolve, the accuracy achievable with RGB-only training has reached a bottleneck. In this study, an "RGB + depth" multimodal semantic segmentation model, DFST (depth-fusion semantic transformer), was proposed on the basis of a hybrid Transformer encoder. The Mix Transformer encoder (MiT) was adopted as the main feature extraction network of DFST. MiT is a Transformer encoder backbone better suited to semantic segmentation than ordinary Vision Transformers (ViTs), for three reasons: 1) a hierarchical encoder structure outputs multi-scale features, which the decoder combines, capturing both high-resolution coarse features and low-resolution fine-grained features to optimize the segmentation; 2) computational complexity is reduced by applying sequence reduction in the self-attention instead of the ordinary self-attention structure; 3) positional embedding is removed and replaced by Mix-FFN, a feed-forward network with a 3×3 depthwise convolution that conveys positional information. Depth images were obtained under real greenhouse lighting and encoded into the HHA format (horizontal disparity, height above ground, angle) before training. The HHA-encoded depth images were fused with the RGB images as an auxiliary modality for feature extraction, and a lightweight multi-layer perceptron decoder was used to decode the feature maps and produce the segmentation. The experimental results show that: 1) the DFST model improved crop segmentation accuracy in greenhouse environments; introducing depth as an auxiliary modality alongside RGB raised the mean intersection over union (mIoU) by 1.37 percentage points; 2) provided that high-quality depth images can be obtained under suitable equipment, environment, and lighting conditions, encoding them into three-channel HHA images improved accuracy by 1.21 percentage points compared with un-encoded depth images; 3) using a Transformer rather than a traditional convolutional neural network as the feature extraction backbone counteracted the weak global modeling ability and tendency to overfit of convolutional networks; the MiT backbone improved the mIoU by 2.43 percentage points compared with ShapeConv. In summary, the DFST model can serve the semantic segmentation of tomato images in greenhouse environments, achieving rapid and accurate segmentation in complex scenes such as varied lighting conditions. The findings can provide theoretical support for crop detection and intelligent harvesting robots in greenhouse environments.
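To make points 2) and 3) concrete, the following is a minimal PyTorch sketch of the two MiT ingredients the abstract names: self-attention with sequence reduction, and the Mix-FFN whose 3×3 depthwise convolution stands in for positional embeddings. It follows the SegFormer-style design described above rather than the authors' implementation; module names, dimensions, and the reduction ratio are illustrative.

```python
# Sketch of MiT-style efficient self-attention and Mix-FFN; illustrative only.
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Self-attention whose keys/values are spatially reduced by a factor R,
    cutting attention cost from O(N^2) toward O(N^2 / R^2)."""
    def __init__(self, dim, heads=2, reduction=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Strided conv shrinks the K/V token grid by `reduction` per side.
        self.sr = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, h, w):                      # x: (B, N, C), N = h*w
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)   # back to a 2-D grid
        kv = self.sr(kv).flatten(2).transpose(1, 2)  # (B, N / R^2, C)
        kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv, need_weights=False)
        return out

class MixFFN(nn.Module):
    """Feed-forward block with a 3x3 depthwise conv, which supplies the
    positional cues that explicit positional embeddings would otherwise add."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, h, w):                      # x: (B, N, C)
        b, n, c = x.shape
        x = self.fc1(x)
        x = x.transpose(1, 2).reshape(b, -1, h, w)   # tokens -> feature map
        x = self.dw(x).flatten(2).transpose(1, 2)    # feature map -> tokens
        return self.fc2(self.act(x))

# Usage: one encoder "token mixing + FFN" step on a 64x64 feature map.
tokens = torch.randn(1, 64 * 64, 32)
attn, ffn = EfficientSelfAttention(32), MixFFN(32)
tokens = tokens + attn(tokens, 64, 64)
tokens = tokens + ffn(tokens, 64, 64)
print(tokens.shape)   # torch.Size([1, 4096, 32])
```

Because the depthwise convolution observes zero padding at the feature-map borders, it leaks enough positional information that explicit positional embeddings can be dropped, which also avoids interpolating embeddings when the test resolution differs from the training resolution.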
Keywords: greenhouse|crop|semantic segmentation|attention mechanism|facility environment|tomato|RGBD|Transformer