Video

Video Recognition with Sparse Points

Video Recognition with Space-Time Interest Points
- Space-time interest point (STIP)
  - STIP detector
  - Corner point in a spatio-temporal domain
  - Bag of visual words
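
A minimal sketch of the bag-of-visual-words step, assuming local descriptors have already been extracted around each STIP and a codebook has already been learned (for example with k-means). The function name and the toy data are illustrative only.

```python
import numpy as np

def bag_of_words_histogram(descriptors, codebook):
    """Quantize local descriptors against a codebook; return a normalized histogram."""
    # Squared Euclidean distance from every descriptor to every visual word.
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Each descriptor votes for its nearest visual word.
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Toy data: 2-D descriptors and a 3-word codebook (both made up for illustration).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
descriptors = np.array([[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.8, 5.1]])
hist = bag_of_words_histogram(descriptors, codebook)
```

The resulting histogram is a fixed-length video descriptor regardless of how many interest points were detected, which is what lets a standard classifier consume it.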

Dense Trajectory
- Sparse -> Dense
- Point -> Trajectory
- Trajectory extraction
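
Trajectory extraction can be sketched as repeatedly moving each sampled point by the optical flow at its current position. The sketch below assumes the dense flow fields are already computed and uses nearest-pixel lookup; smoothing of the flow field, which practical implementations typically apply, is omitted. All names are illustrative.

```python
import numpy as np

def track_points(points, flows):
    """Track (x, y) points through a list of dense flow fields.

    points: (N, 2) array of starting positions in (x, y) order.
    flows:  list of (H, W, 2) arrays; flows[t][y, x] is the (dx, dy)
            displacement from frame t to frame t + 1.
    Returns an array of shape (len(flows) + 1, N, 2).
    """
    trajectory = [points.astype(float)]
    for flow in flows:
        current = trajectory[-1]
        h, w = flow.shape[:2]
        # Look up the flow at the nearest pixel, clamped to the image bounds.
        xs = np.clip(np.rint(current[:, 0]).astype(int), 0, w - 1)
        ys = np.clip(np.rint(current[:, 1]).astype(int), 0, h - 1)
        trajectory.append(current + flow[ys, xs])
    return np.stack(trajectory)

# Toy example: a constant flow of (+1, +2) pixels per frame over 3 frame pairs.
flow = np.zeros((20, 20, 2))
flow[..., 0] = 1.0
flow[..., 1] = 2.0
grid = np.array([[2.0, 3.0], [5.0, 5.0]])  # dense sampling would use a full grid
traj = track_points(grid, [flow] * 3)
```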

3D Convolutional Neural Network
C3D
- A modern 3D CNN architecture
- VGGNet style network
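
To make the "3D" part concrete, here is a naive single-channel 3D convolution in NumPy. It is a sketch only: the 3x3x3 kernel reflects the small-kernel VGG style, and the clip size is a toy value rather than the network's actual input size.

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive single-channel 3D cross-correlation, no padding, stride 1.

    clip:   (T, H, W) array of frames.
    kernel: (kt, kh, kw) array of weights.
    """
    kt, kh, kw = kernel.shape
    t_out = clip.shape[0] - kt + 1
    h_out = clip.shape[1] - kh + 1
    w_out = clip.shape[2] - kw + 1
    out = np.zeros((t_out, h_out, w_out))
    for t in range(t_out):
        for y in range(h_out):
            for x in range(w_out):
                # The window spans several frames as well as a spatial patch.
                window = clip[t:t + kt, y:y + kh, x:x + kw]
                out[t, y, x] = (window * kernel).sum()
    return out

clip = np.ones((8, 6, 6))          # 8 frames of 6x6 pixels (toy size)
kernel = np.ones((3, 3, 3)) / 27   # 3x3x3 averaging kernel
features = conv3d_valid(clip, kernel)
```

Unlike a 2D convolution applied per frame, the output keeps a time axis, so later layers can continue to mix information across frames.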
Visual Information Fusion Across Time
- A video = A bag of short fixed-sized clips
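
A small sketch of the "bag of clips" view: split the video into fixed-size clips, score each clip, and pool the scores. Averaging the clip scores is an assumption made here for illustration, and the per-clip scores are hypothetical stand-ins for a model's output.

```python
def split_into_clips(num_frames, clip_len, stride):
    """Return (start, end) frame index ranges for fixed-size clips."""
    return [(s, s + clip_len)
            for s in range(0, num_frames - clip_len + 1, stride)]

def video_score(clip_scores):
    """Average per-clip class scores into one video-level score vector."""
    n = len(clip_scores)
    return [sum(col) / n for col in zip(*clip_scores)]

clips = split_into_clips(num_frames=40, clip_len=16, stride=8)
# Hypothetical per-clip scores over two classes, one row per clip.
scores = [[0.9, 0.1], [0.7, 0.3], [0.5, 0.5], [0.7, 0.3]]
prediction = video_score(scores)
```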
Time Information Fusion
- early fusion
- late fusion
- slow fusion
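
One way to compare the three schemes is to track how quickly the time axis collapses as layers are applied. The helper below is a sketch; the kernel and stride values are illustrative assumptions, not settings taken from these notes.

```python
def temporal_lengths(num_frames, layers):
    """Track how many time steps remain after each layer.

    layers: list of (kernel_t, stride_t) pairs, one per layer.
    """
    lengths = [num_frames]
    for kernel_t, stride_t in layers:
        lengths.append((lengths[-1] - kernel_t) // stride_t + 1)
    return lengths

# Early fusion: the first layer spans the whole 10-frame clip at once.
early = temporal_lengths(10, [(10, 1)])
# Late fusion: two frames are processed independently and merged only at the end.
late = temporal_lengths(2, [(1, 1), (1, 1), (2, 1)])
# Slow fusion: temporal extent grows gradually over three layers.
slow = temporal_lengths(10, [(4, 2), (2, 2), (2, 1)])
```

Early fusion reaches a single time step after one layer, late fusion keeps the frames separate until the last layer, and slow fusion sits in between.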
Explicit Motion Estimation and Utilization
- Two-Stream Convolutional Network
  - spatial stream
    - ImageNet pretrained network, whose architecture is similar to ZFNet
  - temporal stream
    - Multi-task learning
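
A sketch of the two data-handling steps around the two streams: stacking consecutive optical-flow fields into the temporal stream's input, and combining the two streams' predictions. Averaging softmax scores is one simple fusion choice assumed here; the logits are hypothetical, and the multi-task training of the temporal stream is not shown.

```python
import numpy as np

def stack_flow(flows):
    """Stack L flow fields of shape (H, W, 2) into one (H, W, 2L) input."""
    return np.concatenate(flows, axis=2)

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def fuse_streams(spatial_logits, temporal_logits):
    """Late fusion by averaging the two streams' softmax scores."""
    return (softmax(spatial_logits) + softmax(temporal_logits)) / 2

flows = [np.zeros((4, 4, 2)) for _ in range(10)]  # L = 10 consecutive flow fields
temporal_input = stack_flow(flows)                # shape (4, 4, 20)

# Hypothetical logits over 3 classes from each stream.
fused = fuse_streams(np.array([2.0, 1.0, 0.0]), np.array([0.0, 3.0, 0.0]))
```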

Video Representation Learning
SlowFast Networks
- Performs action recognition with two separate branches: a low frame rate branch and a high frame rate branch
- Explores how to treat the time axis
- Slow pathway
- Fast pathway
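
The difference between the two pathways is easiest to see in how they sample frames. In this sketch `tau` (the Slow pathway's temporal stride) and `alpha` (the frame-rate ratio between pathways) are illustrative defaults, and the function name is hypothetical.

```python
def sample_pathways(num_frames, tau=16, alpha=8):
    """Return frame indices for the Slow and Fast pathways.

    tau:   temporal stride of the Slow pathway.
    alpha: frame-rate ratio; the Fast pathway uses stride tau // alpha.
    """
    slow = list(range(0, num_frames, tau))
    fast = list(range(0, num_frames, tau // alpha))
    return slow, fast

slow_idx, fast_idx = sample_pathways(num_frames=64)
```

With these values the Slow pathway sees 4 frames of a 64-frame clip and the Fast pathway sees 32, so the Fast pathway observes motion at 8 times the temporal resolution.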
MoViNet
- Explores architectures that process video more efficiently.
- The video recognition counterpart of EfficientNet (ICML 2019)
MaskFeat
- A recent video recognition technique built on a Transformer and self-supervised representation learning
- Is a temporal sequence really necessary when learning representations for action recognition?
- Pretext task: predict the HOG features of the masked input
- HOG discards information such as the image's pixel intensity, which helps the model focus more on learning motion itself
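
The pretext task can be sketched as: compute a HOG target per patch, hide some patches, and penalize the prediction error only on the hidden ones. The HOG below is deliberately simplified (one unsigned orientation histogram per patch), the squared-error loss is an assumption, and the "prediction" is a stand-in rather than a real model output.

```python
import numpy as np

def patch_hog(patch, num_bins=9):
    """Unsigned gradient-orientation histogram of one grayscale patch, L2-normalized."""
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, pi).
    angle = np.arctan2(gy, gx) % np.pi
    bins = np.minimum((angle / np.pi * num_bins).astype(int), num_bins - 1)
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=num_bins)
    return hist / (np.linalg.norm(hist) + 1e-6)

def masked_hog_loss(predicted, target, mask):
    """Mean squared error computed over masked patches only."""
    per_patch = ((predicted - target) ** 2).sum(axis=1)
    return per_patch[mask].mean()

rng = np.random.default_rng(0)
patches = rng.random((6, 8, 8))                      # 6 toy 8x8 patches
targets = np.stack([patch_hog(p) for p in patches])  # (6, 9) HOG targets
mask = np.array([True, False, True, False, False, True])
# Stand-in for the model output: a perfect prediction gives zero loss.
loss = masked_hog_loss(targets.copy(), targets, mask)
```

Because the target is built from gradients and then normalized, a constant brightness offset drops out of it, which is consistent with the point above about discarding pixel intensity.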
Representation learning based on self-supervised learning currently works best for video recognition (MaskFeat).
It uses a Transformer architecture.
Understanding a video may not necessarily require sequential information.
Even a single key frame can be enough to learn or recognize some action information.