Modeling Spatio-Temporal Human Track Structure for Action Localization
This paper addresses spatio-temporal localization of human actions in video.
To localize actions in time, we propose a recurrent localization
network (RecLNet) designed to model the temporal structure of actions at the
level of person tracks. Our model is trained to simultaneously recognize and
localize action classes in time and is based on two-layer gated recurrent units
(GRUs) applied separately to two streams, i.e., appearance and optical flow.
When used together with state-of-the-art person detection and
tracking, our model is shown to substantially improve spatio-temporal action
localization in videos. The gain is shown to be mainly due to improved temporal
localization. We evaluate our method on two recent datasets for spatio-temporal
action localization, UCF101-24 and DALY, demonstrating a significant
improvement over the state of the art.
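
To make the architecture described above concrete, the following is a minimal NumPy sketch of a two-layer GRU applied separately to appearance and optical flow features of one person track, producing per-frame class scores. It is an illustration under stated assumptions, not the RecLNet implementation: the names `GRULayer`, `StreamModel` and `score_track`, the feature and hidden dimensions, the extra background class, the random (untrained) weights, and the fusion of the two streams by averaging their per-frame scores are all hypothetical choices made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRULayer:
    """A single GRU layer with randomly initialised (untrained) weights."""

    def __init__(self, in_dim, hid_dim, rng):
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        self.W = rng.normal(0.0, 0.1, (3, hid_dim, in_dim))
        self.U = rng.normal(0.0, 0.1, (3, hid_dim, hid_dim))
        self.b = np.zeros((3, hid_dim))
        self.hid_dim = hid_dim

    def forward(self, xs):
        """xs: (T, in_dim) features along a track -> (T, hid_dim) states."""
        h = np.zeros(self.hid_dim)
        states = []
        for x in xs:
            z = sigmoid(self.W[0] @ x + self.U[0] @ h + self.b[0])
            r = sigmoid(self.W[1] @ x + self.U[1] @ h + self.b[1])
            n = np.tanh(self.W[2] @ x + self.U[2] @ (r * h) + self.b[2])
            h = (1.0 - z) * n + z * h
            states.append(h)
        return np.stack(states)


class StreamModel:
    """Two stacked GRU layers plus a per-frame linear classifier (assumed)."""

    def __init__(self, feat_dim, hid_dim, n_classes, rng):
        self.gru1 = GRULayer(feat_dim, hid_dim, rng)
        self.gru2 = GRULayer(hid_dim, hid_dim, rng)
        # One extra output for a hypothetical background class.
        self.V = rng.normal(0.0, 0.1, (n_classes + 1, hid_dim))

    def forward(self, track_feats):
        h = self.gru2.forward(self.gru1.forward(track_feats))
        logits = h @ self.V.T
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)  # per-frame softmax


def score_track(app_feats, flow_feats, app_model, flow_model):
    """Assumed late fusion: average the per-frame scores of both streams."""
    return 0.5 * (app_model.forward(app_feats) + flow_model.forward(flow_feats))


rng = np.random.default_rng(0)
T, D, H, C = 12, 16, 8, 24  # frames, feature dim, hidden dim, action classes
app_model = StreamModel(D, H, C, rng)
flow_model = StreamModel(D, H, C, rng)
scores = score_track(rng.normal(size=(T, D)), rng.normal(size=(T, D)),
                     app_model, flow_model)
print(scores.shape)
```

Each row of `scores` is a distribution over the action classes plus background for one frame of the track; temporal localization could then be obtained by post-processing these per-frame scores, a step this sketch deliberately leaves out.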