Neural Re-rendering for Full-frame Video Stabilization

Liu, Yu-Lun, et al. "Neural Re-rendering for Full-frame Video Stabilization." arXiv preprint arXiv:2102.06205 (2021).

Abstract

Existing video stabilization methods either crop the frame boundaries or produce severe distortion artifacts there. This paper proposes a full-frame video stabilization algorithm based on estimating dense warp fields. The main novelty is a learning-based hybrid-space fusion that reduces artifacts. Experiments on the NUS dataset and a selfie video dataset show that the method is effective.

Introduction

As more and more video is shot on phones, video stabilization has become an important task for video content. However, hand-held captured video suffers from motion blur, wobble artifacts, and similar problems.

Existing video stabilization methods consist of three steps.

1) motion estimation

2) motion smoothing

3) stable frame generation

Motion estimation uses methods such as 2D feature tracking, dense flow, or recovering camera motion and scene structure. Motion smoothing removes the high-frequency jitter from the estimated motion. It can also be viewed as computing, for each frame, the spatial transformation that maps it to a stable frame. Stable frame generation produces the stabilized image by applying the transform obtained above. However, because there is no information near the frame border, the usual remedy is to crop the original frame and zoom in, which reduces the resolution.
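The motion smoothing step can be illustrated with a minimal sketch. It assumes a purely horizontal camera path and a plain Gaussian low-pass filter; the path, the noise level, and the function name `gaussian_smooth` are made up for illustration and are not the smoothing used in the paper.

```python
import numpy as np

def gaussian_smooth(path, sigma=5.0):
    """Low-pass filter a 1D camera path with a Gaussian kernel (edge-padded)."""
    radius = int(3 * sigma)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(path, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

# Hypothetical shaky horizontal camera path: a slow pan plus random jitter.
rng = np.random.default_rng(0)
frames = np.arange(120)
real_path = 0.5 * frames + rng.normal(scale=3.0, size=frames.size)

smooth_path = gaussian_smooth(real_path)
# Per-frame translation that moves each real frame onto the smooth path.
correction = smooth_path - real_path
```

Applying `correction` shifts each frame, which is exactly what exposes empty regions at the frame border and motivates either cropping or full-frame synthesis.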

Full-frame video stabilization aims to produce a stabilized video with the same field of view, without cropping. In this paper, the stabilized video is computed first, and then a flow-based video completion method is used to fill in the missing content.

The core idea of this method is to fuse multiple neighboring frames. It proposes a hybrid fusion that combines the advantages of image-level fusion and feature-level fusion, which reduces sensitivity to flow inaccuracy. It further improves visual quality by using spatially varying blending weights and by removing blurry input frames. As a result, the method produces stabilized video with very few artifacts and distortions.

Related Work

Motion estimation and smoothing

Many video stabilization methods focus on estimating the motion between frames and smoothing that motion. For motion estimation they use feature tracking or dense optical flow. Besides 2D motion estimation, some methods perform 3D reconstruction and use projection. Deep-learning-based methods either estimate a warping field directly or estimate optical flow with a deep network.

This paper uses a state-of-the-art 2D motion estimation method and then warps the frames.

Image fusion and composition

The final stage of video stabilization is frame rendering. Many methods warp the image directly. However, doing so leaves the frame boundary regions empty, so the frame is cropped. To avoid this, the proposed method uses 2D motion inpainting. (This differs from ordinary inpainting.)

View synthesis

View synthesis renders a scene from a new viewpoint using a single image or multiple posed images. The proposed method involves view synthesis but does not estimate scene geometry. There has been much research on rendering novel views for dynamic scenes from a single video, and this could be applied to video stabilization. However, it requires per-video training and camera pose estimation.

Full-Frame Video Stabilization

Let $I_{t}$ denote a frame in the real (unstabilized) camera space and $I_{\hat{t}}$ a frame in the virtual (stabilized) camera space.

Preprocessing

Motion estimation and smoothing

Motion estimation here uses the method of [Yu and Ramamoorthi 2020]. It yields a backward dense warping field $F_{k \rightarrow \hat{k}}$. However, the result has irregular boundaries and missing pixels.

Optical flow estimation

To address the issues mentioned above, optical flow is computed between every frame and its neighboring frames. RAFT [Teed and Deng 2020] is used for this.

Blurry frames removal

Some input frames are blurry, and such frames lead to poor output. Using the sharpness measure of [Pech-Pacheco et al. 2000], only the neighbor frames in the top 50% by sharpness are used.
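A minimal sketch of this filtering step, assuming the sharpness measure is the variance of the Laplacian (the focus measure commonly attributed to Pech-Pacheco et al.). The 4-neighbor Laplacian stencil and the helper names `laplacian_variance` and `keep_sharpest` are illustrative choices, not taken from the paper's code.

```python
import numpy as np

def laplacian_variance(gray):
    """Sharpness score: variance of the 4-neighbor discrete Laplacian."""
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return lap.var()

def keep_sharpest(frames, keep_ratio=0.5):
    """Return indices of the sharpest frames, in their original order."""
    scores = np.array([laplacian_variance(f) for f in frames])
    n_keep = max(1, int(np.ceil(keep_ratio * len(frames))))
    top = np.argsort(scores)[::-1][:n_keep]
    return sorted(top.tolist())
```

A blurry frame has weaker second derivatives, so its Laplacian variance is lower and it falls out of the top half.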

Warping and Fusion

Warping

The steps above provide $F_{\hat{k} \rightarrow k}$ and $\{F_{k \rightarrow n } \}_{n \in \Omega_{k}}$. From these, $\{ F_{\hat{k} \rightarrow n } \}_{n \in \Omega_k}$ can be computed. With it, each neighboring frame can be warped to the stabilized frame. In addition, a visibility mask $\{ \alpha_n \}_{n \in \Omega_k}$ is computed after warping.
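The chaining of the two fields can be sketched as follows. This assumes each flow stores per-pixel (dx, dy) displacements, uses bilinear sampling with border clamping, and treats the visibility mask as "the sampled position falls inside the neighbor frame"; all function names are hypothetical and the paper's exact mask definition may differ.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample img (H, W, C) at float coordinates x, y, clamping to the border."""
    h, w = img.shape[:2]
    x = np.clip(x, 0, w - 1)
    y = np.clip(y, 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bottom = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bottom

def pixel_grid(h, w):
    """Return x and y coordinate arrays of shape (h, w)."""
    return np.meshgrid(np.arange(w, dtype=float), np.arange(h, dtype=float))

def chain_flows(flow_stab_to_key, flow_key_to_nbr):
    """Compose stabilized->key and key->neighbor flows into stabilized->neighbor."""
    h, w = flow_stab_to_key.shape[:2]
    gx, gy = pixel_grid(h, w)
    kx = gx + flow_stab_to_key[..., 0]
    ky = gy + flow_stab_to_key[..., 1]
    return flow_stab_to_key + bilinear_sample(flow_key_to_nbr, kx, ky)

def backward_warp(img, flow):
    """Warp img (H, W, C) with a backward flow; also return a visibility mask."""
    h, w = img.shape[:2]
    gx, gy = pixel_grid(h, w)
    sx = gx + flow[..., 0]
    sy = gy + flow[..., 1]
    visible = (sx >= 0) & (sx <= w - 1) & (sy >= 0) & (sy <= h - 1)
    return bilinear_sample(img, sx, sy), visible
```

Pixels where the mask is false have no valid content in that neighbor frame, which is why several neighbors are needed to cover the full stabilized frame.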

Fusion space

There are broadly two ways to build a single image using the warping fields. The first is image-level fusion, which warps all the images and then fuses the warped images. It is simple and widely used, but it can produce ghosting artifacts. The second is feature-level fusion, which uses an encoder and decoder to warp in a high-dimensional feature space and then fuse there. It can reduce artifacts, but the output tends to be blurry overall.

What this paper proposes is hybrid-space fusion. The lower part of the figure above illustrates hybrid fusion. As in feature-level fusion, a fused feature map is generated. However, the warped feature of each frame is then concatenated with the fused feature map and fed to the decoder. The decoder outputs, for each frame, an output frame and a corresponding confidence map. The output frames are then summed according to their confidence maps.

The figure below shows the results of each fusion method. Hybrid-space fusion preserves more detail while producing results without artifacts or distortion.

Fusion function

There are several possible fusion functions, and the paper introduces five: mean fusion, Gaussian-weighted fusion, argmax fusion, flow error-weighted fusion, and CNN-based fusion. Of these, CNN-based fusion is learning-based and can be expressed by the following function.

The decoder then outputs a color frame and a confidence map.

Finally, the predicted frame is generated by a weighted sum.
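The weighted sum, and how some of the simpler fusion functions fit the same pattern, can be sketched as below. This is an illustration under assumptions: the Gaussian weights are taken to depend on the temporal offset from the keyframe, argmax fusion is taken to keep only the highest-weighted visible frame per pixel, and every function name is hypothetical. In the learned variant, the confidence maps predicted by the decoder would play the role of `weights`.

```python
import numpy as np

def weighted_fusion(frames, weights, masks, eps=1e-8):
    """Blend frames (N, H, W, C) with per-pixel weights (N, H, W).

    masks (N, H, W) marks visible pixels; invisible pixels get zero weight.
    """
    w = weights * masks
    w = w / (w.sum(axis=0, keepdims=True) + eps)
    return (w[..., None] * frames).sum(axis=0)

def mean_weights(masks):
    """Uniform weights: every visible frame contributes equally."""
    return np.ones(masks.shape, dtype=float)

def gaussian_weights(offsets, shape, sigma=2.0):
    """Weights that decay with temporal distance from the keyframe (assumed)."""
    g = np.exp(-0.5 * (np.asarray(offsets, dtype=float) / sigma) ** 2)
    return np.broadcast_to(g[:, None, None], (len(offsets),) + tuple(shape)).copy()

def argmax_weights(weights, masks):
    """One-hot weights: keep only the highest-weighted visible frame per pixel."""
    w = weights * masks
    best = w.argmax(axis=0)
    onehot = np.arange(w.shape[0])[:, None, None] == best[None]
    return onehot.astype(float)
```

Spatially varying weights let the blend favor different neighbors in different regions, instead of averaging misaligned content everywhere.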

The overall process is shown in the figure below.

Implementation details

To preserve high-frequency detail, the paper proposes residual detail transfer. (See the figure below.)

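The feature $f_{k}$ obtained from the image encoder is passed directly through the image decoder to reconstruct the image. The difference between this reconstruction and the original image is kept as the residual detail and added back at the end, which preserves the high-frequency detail.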
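The idea can be sketched with stand-in components. Here the encoder and decoder are replaced by 2x2 average pooling and nearest-neighbor upsampling, which are assumptions chosen only because they lose fine detail the way a learned encoder-decoder pair tends to; the warping of the residual into the stabilized frame, which a full pipeline would presumably need, is omitted.

```python
import numpy as np

def toy_encode(img):
    """Stand-in encoder: 2x2 average pooling, which discards fine detail."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def toy_decode(feat):
    """Stand-in decoder: nearest-neighbor 2x upsampling."""
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def residual_detail(img):
    """Detail lost by the encode/decode round trip."""
    return img - toy_decode(toy_encode(img))

rng = np.random.default_rng(0)
image = rng.random((8, 8))

recon = toy_decode(toy_encode(image))   # blurry reconstruction
residual = residual_detail(image)       # high-frequency detail that was lost
restored = recon + residual             # detail added back at the end
```

Without warping, adding the residual recovers the input exactly; the point of the sketch is only that the residual carries the detail the decoder alone cannot reproduce.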

For training, a perceptual loss is added on top of the pixel loss so that the generated images look more natural.
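One common way to write such a combined objective is given below, purely as an illustration. The L1 norms, the balancing weight $\lambda$, the feature extractor $\phi_l$ (for example, layers of a pretrained classification network), and the symbols $\hat{I}$ (predicted frame) and $I^{gt}$ (ground-truth frame) are all assumptions; this note does not record the exact form used in the paper.

$$
\mathcal{L} = \lVert \hat{I} - I^{gt} \rVert_1 + \lambda \sum_{l} \lVert \phi_l(\hat{I}) - \phi_l(I^{gt}) \rVert_1
$$

The first term penalizes per-pixel color error, while the second compares the images in a feature space, which tends to penalize blur that a pixel loss alone tolerates.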

Experiment Results

Ablation study

Ablation studies were carried out on the fusion function, fusion space, and residual detail transfer described above.

Fusion function

Fusion space

Residual detail transfer

Quantitative evaluation

The paper uses the three metrics most common in video stabilization: cropping ratio, distortion value, and stability score. In addition, it uses the accumulated optical flow from [Yu et al.] as a metric. As the table below shows, the method performs well on most of the metrics. As with DIFRINT, the cropping ratio is always 1, because this is full-frame video stabilization.

User study

Because the metrics above do not directly reflect stabilization quality, a user study was conducted. The survey asked the following three questions.

1) Which video preserves the most content?

2) Which video contains fewer artifacts or distortions?

3) Which video is more stable?

As the graph below shows, the proposed method received many votes when compared against the three other methods.

Limitations

Wobble: when the camera or an object moves fast, rolling shutter wobble occurs. Results generated from such input look distorted.

Visible seams: when lighting variation is large, the output can look unnatural because several frames are composited and brightness differs within a single image.

Temporal flicker and distortion: occlusion or inconsistent foreground/background can cause distortion. This depends on the performance of motion inpainting.

Speed: the proposed algorithm goes through several stages and is heavy. It runs at about 10 seconds per frame, so its speed needs to be improved.

Conclusions

The core idea of this paper is as follows. It proposes learning-based fusion as a way to combine multiple warped neighboring frames. The design choices were made through experiments on fusion space, fusion function, and residual detail transfer. Full-frame video stabilization has not been studied much, and this paper suggests a direction for future research using deep learning.

Hygenie Study Note