Abstract
Human action retrieval is a challenging research area with wide-ranging applications in surveillance, search engines, and human-computer interaction. Existing methods represent actions by building a model from global and local features. Because these methods ignore the semantics of actions when creating the model, their final retrieval results are unsatisfactory: an action is not treated as a sequence of sub-actions, and its model is built from scattered local or global features. Furthermore, current action retrieval methods avoid Convolutional Neural Networks (CNNs) in the representation step due to a lack of training data, even though CNNs could improve the final representation. In this paper, we propose a CNN-based human action representation method for retrieval applications. The video is first segmented into sub-actions so that each action is represented as a sequence of them, using keyframes extracted from the segments. The sequence of keyframes is then fed to a pre-trained CNN to extract deep spatial features of the action. Next, a 1D average pooling is designed to combine the sequence of spatial features and represent the temporal changes with a lower-dimensional vector. Finally, the Dynamic Time Warping technique is used to find the best match between the representation vectors of two videos. Experiments on real video datasets for both retrieval and recognition applications show that the resulting action models outperform other representation methods.
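To make the pipeline concrete, the sketch below illustrates the three representation steps named in the abstract: spatial feature extraction from keyframes with a pre-trained CNN, 1D average pooling over the temporal axis, and Dynamic Time Warping between two videos' representations. The abstract does not specify the backbone, the pooling window, or how sub-action segmentation and keyframe selection are performed, so ResNet-18, a window of 2, and pre-extracted keyframes are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained CNN as a spatial feature extractor; the backbone is not named
# in the abstract, so ResNet-18 is an assumed, illustrative choice.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # drop the classifier head; keep 512-d features
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def keyframe_features(keyframes):
    """Map a list of HxWx3 uint8 keyframes to an (n, 512) feature matrix."""
    batch = torch.stack([preprocess(f) for f in keyframes])
    return backbone(batch).numpy()

def temporal_pool(features, kernel=2):
    """Non-overlapping 1D average pooling along the temporal axis,
    compressing the spatial-feature sequence into fewer, lower-variance steps."""
    n = (len(features) // kernel) * kernel
    return features[:n].reshape(-1, kernel, features.shape[1]).mean(axis=1)

def dtw_distance(a, b):
    """Classic Dynamic Time Warping cost between two feature sequences,
    used here as the matching score between two video representations."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # deletion
                                 cost[i, j - 1],      # insertion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# Usage sketch: given keyframe lists for a query and a database video,
# a lower DTW cost indicates a better match for retrieval ranking.
# score = dtw_distance(temporal_pool(keyframe_features(query_frames)),
#                      temporal_pool(keyframe_features(gallery_frames)))
```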