3s-STNet: three-stream spatial–temporal network with appearance and skeleton information learning for action recognition
Department
Computer Science
Document Type
Article
Publication Date
1-1-2023
Abstract
Human action recognition (HAR) is one of the active research areas in computer vision. Although significant progress has been made in action recognition in recent years, most existing methods classify actions from a single type of data and leave spatial–temporal features insufficiently explored. Therefore, this paper proposes a three-stream spatial–temporal network with appearance and skeleton information learning for action recognition, abbreviated as 3s-STNet, which aims to learn action spatial–temporal features fully by extracting, learning, and fusing different types of data. The method consists of two consecutive stages. The first stage uses a spatial–temporal graph convolutional network (ST-GCN) and two Res2Net-101 networks to extract spatial–temporal action features from the spatial–temporal graph, the RGB appearance image, and the tree-structure-reference-joints image (TSRJI), respectively; the spatial–temporal graph and the TSRJI are both derived from human skeleton data. The second stage fine-tunes and fuses the features learned independently by the three streams to exploit the complementarity and diversity among the three outputs. The proposed method is evaluated on the challenging NTU RGB+D 60 and NTU RGB+D 120 datasets, achieving accuracies of 97.63% (cross-subject) and 99.30% (cross-view) on NTU RGB+D 60, and 95.17% (cross-subject) and 96.20% (cross-setup) on NTU RGB+D 120, which are state-of-the-art action recognition results in our experiments.
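The abstract's two-stage pipeline (three streams each producing a feature vector, followed by a fusion stage that combines them for classification) can be sketched in plain Python. This is a minimal illustrative sketch of late fusion by concatenation, not the paper's actual implementation; all function names, weights, and dimensions below are hypothetical.

```python
# Hypothetical sketch of the two-stage idea described in the abstract:
# stage one, each stream (ST-GCN or Res2Net-101 in the paper) maps its
# input to a feature vector; stage two, the three feature vectors are
# concatenated and scored by a linear classifier.
# All names and dimensions here are illustrative, not from the paper.

def extract_features(stream_input, weights):
    """Stand-in for one stream: a fixed linear map from the stream's
    input vector to a feature vector (one weight row per feature)."""
    return [sum(w * x for w, x in zip(row, stream_input)) for row in weights]

def fuse_and_classify(features_a, features_b, features_c, classifier):
    """Stage two: concatenate the three stream features and score each
    action class with a linear layer; the argmax is the prediction."""
    fused = features_a + features_b + features_c
    scores = [sum(w * f for w, f in zip(row, fused)) for row in classifier]
    return scores.index(max(scores))
```

In the paper the fusion stage is additionally fine-tuned end to end; this sketch only shows the structural point that the three streams are learned independently and combined afterward.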
Journal Title
Neural Computing and Applications
Journal ISSN
0941-0643
Volume
35
Issue
2
First Page
1835
Last Page
1848
Digital Object Identifier (DOI)
10.1007/s00521-022-07763-8