Journal of Northeastern University(Natural Science) ›› 2025, Vol. 46 ›› Issue (1): 1-8.DOI: 10.12068/j.issn.1005-3026.2025.20230204

• Information & Control •    

Segmentation Method for Glass-like Object Based on Cross-Modal Fusion

Ying-cai WAN, Li-jin FANG, Qian-kun ZHAO   

  1. School of Robot Science & Engineering, Northeastern University, Shenyang 110169, China. Corresponding author: FANG Li-jin, E-mail: ljfang@mail.neu.edu.cn
  • Received:2023-07-17 Online:2025-01-15 Published:2025-03-25

Abstract:

Due to their lack of distinct texture and shape cues, objects such as glass and mirrors pose challenges to traditional semantic segmentation algorithms and compromise the accuracy of downstream visual tasks. A Transformer-based RGBD cross-modal fusion method is proposed for segmenting glass-like objects. The method uses a Transformer network that extracts self-attention features from the RGB and depth modalities through a cross-modal fusion module and integrates the RGBD features with a multi-layer perceptron (MLP), fusing three types of attention features. The fused RGB and depth features are fed back to their respective branches to strengthen the network's feature extraction. Finally, a semantic segmentation decoder combines the features from the four stages to output the segmentation results for glass-like objects. Compared with the EBLNet method, the intersection over union (IoU) of the proposed method on the GDD, Trans10K, and MSD datasets is improved by 1.64%, 2.26%, and 7.38%, respectively. Compared with the PDNet method on the RGBD-Mirror dataset, the IoU is improved by 9.49%, verifying the method's effectiveness.
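The abstract describes fusing three kinds of attention features (RGB self-attention, depth self-attention, and a cross-modal term) with an MLP and feeding the result back to each branch. The paper's exact module is not reproduced here; the following is a minimal NumPy sketch under simplifying assumptions (single-head attention, identity Q/K/V projections, a one-layer MLP, residual feedback), with all function and variable names being illustrative rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over token sequences of shape (N, d).
    return softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1) @ v

def fusion_block(rgb, depth, w_mlp):
    # Three attention features: RGB self-attention, depth self-attention,
    # and an RGB-query/depth-key cross-attention term (one possible choice
    # of "third" feature; the paper may define it differently).
    a_rgb = attention(rgb, rgb, rgb)
    a_depth = attention(depth, depth, depth)
    a_cross = attention(rgb, depth, depth)
    # One-layer MLP fuses the concatenated attention features.
    fused = np.concatenate([a_rgb, a_depth, a_cross], axis=-1) @ w_mlp
    # Feed the fused features back to each branch as a residual update.
    return rgb + fused, depth + fused, fused

N, d = 16, 32                       # tokens per stage, channel dimension
rgb = rng.normal(size=(N, d))       # stand-ins for one stage's RGB tokens
depth = rng.normal(size=(N, d))     # and depth tokens
w_mlp = rng.normal(size=(3 * d, d)) / np.sqrt(3 * d)

rgb2, depth2, fused = fusion_block(rgb, depth, w_mlp)
print(rgb2.shape, fused.shape)  # (16, 32) (16, 32)
```

In the full network this block would run at each of the four encoder stages, and the decoder would combine the four fused outputs to predict the segmentation mask.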

Key words: attention, semantic segmentation, glass-like object (GLO), cross-modal, depth estimation
