Burst Photography for Learning to Enhance Extremely Dark Images

A new image enhancement method for extremely low-light images.

A sample result obtained with our proposed burst-based extremely low-light image enhancement method. The standard camera output and its scaled version are shown in the top-left corner. For comparison, zoomed-in details from the outputs of existing approaches are given in the subfigures. The results of the single-image enhancement models, denoted with (S), are shown on the right. The results of the multiple-image enhancement methods are presented at the bottom, with (B) denoting burst and (E) indicating ensemble models. Our single-image model recovers fine-scale details much better than its state-of-the-art counterparts, and our burst model gives the perceptually most satisfying result among all the compared methods.
Paper

Ahmet Serdar Karadeniz, Erkut Erdem, and Aykut Erdem. "Burst Photography for Learning to Enhance Extremely Dark Images", IEEE Transactions on Image Processing, in press.
Paper (high-res) | Paper (low-res) | Bibtex

Code: TensorFlow implementation

Abstract

Capturing images under extremely low-light conditions poses significant challenges for the standard camera pipeline. Images become too dark and too noisy, which makes traditional enhancement techniques almost impossible to apply. Recently, learning-based approaches have shown very promising results for this task, since their substantially greater expressive capacity allows for improved quality. Motivated by these studies, in this paper we aim to leverage burst photography to boost the performance and obtain much sharper and more accurate RGB images from extremely dark raw images. The backbone of our proposed framework is a novel coarse-to-fine network architecture that generates high-quality outputs progressively. The coarse network predicts a low-resolution, denoised raw image, which is then fed to the fine network to recover fine-scale details and realistic textures. To further reduce the noise level and improve the color accuracy, we extend this network to a permutation-invariant structure so that it takes a burst of low-light images as input and merges information from multiple images at the feature level. Our experiments demonstrate that our approach leads to perceptually more pleasing results than the state-of-the-art methods, producing more detailed and considerably higher-quality images.

Introduction

Capturing images in low-light conditions is a challenging task -- the main difficulty being that the level of the signal measured by the camera sensors is generally much lower than the noise in the measurements. The fundamental factors causing the noise are the variations in the number of photons entering the camera lens and the sensor-based measurement errors that occur when reading the signal. In addition, the noise present in a low-light image also affects various image characteristics such as fine-scale structures and color balance, further degrading the image quality.

While the previous methods obtain an RGB image from a single dark raw image, we further explore whether the results can be improved by integrating multiple observations of the scene. Despite the remarkable progress of previous studies, there is still large room for improvement regarding issues such as unwanted blur, noise, and color inaccuracies in the results, especially for extremely dark input images. In a nutshell, to alleviate these shortcomings, we propose a learning-based framework that takes a burst of extremely low-light raw images of a scene as input and generates an enhanced RGB image. The use of burst images has been investigated before. In our work, however, we develop a coarse-to-fine network architecture that processes a burst of dark raw images simultaneously to obtain a high-quality RGB image.

System Overview

Network architectures of the proposed single-frame coarse-to-fine model (left), and set-based burst model (right).

To recover fine-grained details from dark images, we propose to employ a two-step coarse-to-fine training procedure. Our coarse network outputs a denoised image in rawRGB space. We utilize the output of the coarse network not only to guide the fine network but also to approximate the noise, by computing the difference between the upsampled coarse prediction and the raw low-light input. The fine network takes the concatenation of the low-light raw input image, the output of the coarse network, and the noise approximation, and processes them to generate the final RGB output.
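The TensorFlow sketch below illustrates how this input could be assembled; tensor names, shapes, and the bilinear upsampling choice are assumptions made for illustration rather than details taken from the released code.

```python
import tensorflow as tf

def build_fine_input(raw_packed, coarse_pred):
    """Assemble the input of the fine network (illustrative sketch).

    raw_packed  : (B, H, W, 4)     packed low-light Bayer raw input
    coarse_pred : (B, H/2, W/2, 4) low-resolution denoised raw from the coarse network
    """
    # Upsample the coarse prediction to the resolution of the raw input.
    coarse_up = tf.image.resize(coarse_pred,
                                size=tf.shape(raw_packed)[1:3],
                                method="bilinear")

    # Approximate the noise as the residual between the noisy raw input
    # and the upsampled (denoised) coarse prediction.
    noise_approx = raw_packed - coarse_up

    # The fine network consumes the concatenation of all three tensors.
    return tf.concat([raw_packed, coarse_up, noise_approx], axis=-1)
```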

We extend our coarse-to-fine model to a novel permutation-invariant CNN architecture that takes multiple images of the scene as input and predicts an enhanced image. In particular, low-resolution coarse outputs are first obtained for each frame in the burst sequence using our coarse network. Our set-based network then accepts a set of tensors as input, each instance corresponding to the concatenation of one of the raw burst images, its noise approximation, and the upsampled version of its coarse prediction, and produces the final output.
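A minimal TensorFlow sketch of the permutation-invariant idea is given below: every frame's tensor is encoded by the same shared-weight layers, and the per-frame features are merged with an order-independent maximum. The layer sizes and names are illustrative assumptions, not the released architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Shared-weight encoder applied to every element of the input set.
frame_encoder = tf.keras.Sequential([
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
])

def set_fusion(burst_tensors):
    """burst_tensors: list of (B, H, W, C) tensors, one per burst frame.

    Each tensor is the concatenation of a raw burst frame, its noise
    approximation, and its upsampled coarse prediction.
    """
    # Encode each frame with the *same* encoder (weights are shared).
    feats = [frame_encoder(t) for t in burst_tensors]

    # Element-wise maximum over the set: the output is identical for any
    # ordering of the burst frames, i.e. the fusion is permutation invariant.
    return tf.reduce_max(tf.stack(feats, axis=0), axis=0)
```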

An example night photo captured with a 0.1 s exposure and its enhanced versions produced by the proposed coarse, fine, and burst networks. As the cropped images demonstrate, the fine network enhances both the color and the details of the coarse result. The burst network produces an even sharper and perceptually more pleasing output.

To achieve robustness to small motions, we apply max fusion between the features of the burst frames after the second convolution block. As the features are downsampled, their alignment becomes much easier and the network benefits from fusing the higher-level features. To deal with large motions in the scene, we can additionally utilize the outputs of our coarse network to estimate optical flows between consecutive frames. In our experiments, these flows are obtained with an existing optical flow method and are then used to compensate for motion by selectively performing fusion at the input level, only over regions with little or no motion. We also compare our model with the Seeing Motion in the Dark (SMID) method of Chen et al. on the DRV dataset.
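As a rough sketch of the motion-masked fusion described above (the flow estimator itself is external and not reproduced here; the threshold value and the blending rule are illustrative assumptions):

```python
import tensorflow as tf

def motion_masked_fusion(reference, other_frame, flow, thresh=1.0):
    """Fuse a burst frame into the reference only where the scene is static.

    reference, other_frame : (B, H, W, C) inputs (or features) of two frames
    flow                   : (B, H, W, 2) optical flow between the two frames,
                             estimated from the coarse predictions by an
                             external method
    thresh                 : flow-magnitude threshold in pixels (illustrative)
    """
    # Per-pixel motion magnitude.
    magnitude = tf.norm(flow, axis=-1, keepdims=True)        # (B, H, W, 1)

    # 1 where the region is (nearly) static, 0 where it moves.
    static_mask = tf.cast(magnitude < thresh, reference.dtype)

    # Fuse (element-wise max, as in the feature-level fusion) in static
    # regions; keep the reference frame wherever there is large motion.
    fused = tf.maximum(reference, other_frame)
    return static_mask * fused + (1.0 - static_mask) * reference
```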

Results

Single Image Results



Sony a7s II, ISO 12800, 1/10 s



iPhone 6s, ISO 400, 1/20 s

Burst Results



Sony a7s II, ISO 640, 1/10 s, 8 frames



Sony a7s II, ISO 1600, 1/10 s, 8 frames

Video Results




Acknowledgements

This work was supported in part by the GEBIP 2018 Award of the Turkish Academy of Sciences to E. Erdem and the BAGEP 2021 Award of the Science Academy to A. Erdem. We would like to thank the KUIS AI Center for letting us use their High Performance Computing Cluster and NVIDIA Corporation for the donation of GPUs used in this research.