1. School of Electronics and Information, Xi'an Polytechnic University, Xi'an 710600, Shaanxi, China
2. School of Information and Communication Engineering, Xi'an Jiaotong University, Xi'an 710049, Shaanxi, China
Tong Ao (2002- ), female, master's student at the School of Electronics and Information, Xi'an Polytechnic University. Her research interests include few-shot fine-grained image classification and deep learning.
Ren Jie (1984- ), female, associate professor at the School of Electronics and Information, Xi'an Polytechnic University. Her research interests include few-shot fine-grained image classification, interest point detection, hyperspectral image processing, and deep learning.
Meng Zongyang (2005- ), male, undergraduate student at the School of Electronics and Information, Xi'an Polytechnic University. His research interest is few-shot fine-grained image classification.
Lu Lei (1988- ), male, Ph.D., lecturer at the School of Information and Communication Engineering, Xi'an Jiaotong University. His research interests include computer vision, machine learning, and image processing.
Received: 2025-12-05; Revised: 2026-03-16; Accepted: 2026-03-25; Published in print: 2026-03-15
Tong A, Ren J, Meng Z Y, et al. Mamba-wavelet-based multi-scale modeling method for few-shot fine-grained image classification[J]. Chinese Journal of Intelligent Science and Technology, 2026, 8(1): 72-82. DOI: 10.11959/j.issn.2096-6652.202606.
Few-shot fine-grained image classification aims to recognize subtle inter-class differences from limited annotated samples and is widely applied in intelligent recognition, ecological monitoring, and autonomous driving. However, existing convolutional architectures are constrained by fixed receptive fields and local modeling schemes, leaving multi-scale feature relationships insufficiently characterized; attention-based and frequency-domain methods improve the discriminability of fine-grained features but remain limited in modeling cross-scale dependencies and in feature fusion. To strengthen the representation of multi-scale fine-grained features, a Mamba-wavelet-based multi-scale modeling method for few-shot fine-grained image classification was proposed, which constructs a multi-scale feature relation network (MSFRNet) based on Mamba state space modeling. The network comprises two core modules: a wavelet-guided dynamic Mamba multi-scale feature extraction (WDMFE) module and a cross-scale attention fusion (CAF) module. The WDMFE module employs a wavelet-guided, dynamically adaptive Mamba structure to enhance frequency perception and contextual modeling across scales, while the CAF module integrates multi-scale features through channel and spatial attention mechanisms to achieve cross-scale complementation. Experiments on the CUB-200-2011, Stanford Dogs, and Stanford Cars benchmark datasets achieved high classification accuracy with stable performance gains. The results indicate that the proposed network effectively enhances fine-grained feature representation and cross-task generalization, and provides an extensible framework for multi-scale modeling in few-shot fine-grained recognition.
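The abstract does not give implementation details of the WDMFE module, but the frequency separation underlying wavelet guidance can be illustrated with a minimal sketch: a single-level 2D Haar transform splits a feature map into a low-frequency approximation and three high-frequency detail subbands at half the spatial resolution. The function below is a hypothetical illustration, not code from the paper.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar wavelet transform of a 2D array.

    Returns four subbands at half the input resolution:
    LL (low-frequency approximation) and LH, HL, HH
    (high-frequency details), i.e. the kind of frequency
    decomposition used to guide multi-scale feature extraction.
    """
    # Pairwise averages (low-pass) and differences (high-pass) over rows.
    lo_r = (x[0::2, :] + x[1::2, :]) / 2.0
    hi_r = (x[0::2, :] - x[1::2, :]) / 2.0
    # Repeat the same filtering over columns.
    ll = (lo_r[:, 0::2] + lo_r[:, 1::2]) / 2.0
    lh = (lo_r[:, 0::2] - lo_r[:, 1::2]) / 2.0
    hl = (hi_r[:, 0::2] + hi_r[:, 1::2]) / 2.0
    hh = (hi_r[:, 0::2] - hi_r[:, 1::2]) / 2.0
    return ll, lh, hl, hh

feature_map = np.arange(64, dtype=float).reshape(8, 8)
ll, lh, hl, hh = haar_dwt2(feature_map)
print(ll.shape)  # (4, 4): each subband halves the spatial resolution
```

A smooth region yields near-zero detail subbands while edges concentrate energy in LH/HL/HH, which is why wavelet subbands are a natural cue for frequency-aware, scale-dependent modeling.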