Burst Photography for Learning to Enhance Extremely Dark Images

A new image enhancement method for extremely low-light images.

A sample result obtained with our proposed burst-based extremely low-light image enhancement method. The standard camera output and its scaled version are shown in the top-left corner. For comparison, zoomed-in details from the outputs of existing approaches are given in the subfigures. The results of the single-image enhancement models, denoted with (S), are shown on the right. The results of the multiple-image enhancement methods are presented at the bottom, with (B) denoting burst and (E) indicating ensemble models. Our single-image model recovers fine-scale details much better than its state-of-the-art counterparts, and our burst model gives the perceptually most satisfying result among all the compared methods.
Paper

Ahmet Serdar Karadeniz, Erkut Erdem, and Aykut Erdem. "Burst Photography for Learning to Enhance Extremely Dark Images", IEEE Transactions on Image Processing, in press.
Paper (high-res) | Paper (low-res) | Bibtex

Code: TensorFlow implementation

Abstract

Capturing images under extremely low-light conditions poses significant challenges for the standard camera pipeline. Images become too dark and too noisy, which makes traditional enhancement techniques almost impossible to apply. Recently, learning-based approaches have shown very promising results for this task, since their substantially greater expressive capacity allows for improved quality. Motivated by these studies, in this paper we aim to leverage burst photography to boost the performance and obtain much sharper and more accurate RGB images from extremely dark raw images. The backbone of our proposed framework is a novel coarse-to-fine network architecture that generates high-quality outputs progressively. The coarse network predicts a low-resolution, denoised raw image, which is then fed to the fine network to recover fine-scale details and realistic textures. To further reduce the noise level and improve the color accuracy, we extend this network to a permutation-invariant structure so that it takes a burst of low-light images as input and merges information from multiple images at the feature level. Our experiments demonstrate that our approach leads to perceptually more pleasing results than the state-of-the-art methods, producing more detailed and considerably higher-quality images.

Introduction

Capturing images in low-light conditions is a challenging task -- the main difficulty being that the level of the signal measured by the camera sensors is generally much lower than the noise in the measurements. The fundamental factors causing the noise are the variations in the number of photons entering the camera lens and the sensor-based measurement errors that occur when reading the signal. In addition, the noise present in a low-light image also affects various image characteristics such as fine-scale structures and color balance, further degrading the image quality.

While the previous methods obtain an RGB image from a single dark raw image, we further explore whether the results can be improved by integrating multiple observations of the scene. Despite the remarkable progress of previous studies, there is still large room for improvement regarding issues such as unwanted blur, noise, and color inaccuracies in the results, especially for extremely dark input images. In a nutshell, to alleviate these shortcomings, we propose a learning-based framework that takes a burst of extremely low-light raw images of a scene as input and generates an enhanced RGB image. The use of burst images has been investigated before. In our work, however, we develop a coarse-to-fine network architecture that processes a burst of dark raw images simultaneously to obtain a high-quality RGB image.

System Overview

Network architectures of the proposed single-frame coarse-to-fine model (left), and set-based burst model (right).

To recover fine-grained details from dark images, we propose to employ a two-step coarse-to-fine training procedure. Our coarse network outputs a denoised image in rawRGB space. We utilize the output of the coarse network not only to guide the fine network but also to approximate the noise, by computing the difference between the upsampled coarse prediction and the raw low-light input. The fine network takes the concatenation of the low-light raw input image, the output of the coarse network, and the noise approximation, and processes them to generate the final RGB output.
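The TensorFlow sketch below illustrates how this input could be assembled; tensor names, shapes, and the bilinear upsampling choice are assumptions made for illustration rather than details taken from the released code.

```python
import tensorflow as tf

def build_fine_input(raw_packed, coarse_pred):
    """Assemble the input of the fine network (illustrative sketch).

    raw_packed  : (B, H, W, 4)     packed low-light Bayer raw input
    coarse_pred : (B, H/2, W/2, 4) low-resolution denoised raw from the coarse network
    """
    # Upsample the coarse prediction to the resolution of the raw input.
    coarse_up = tf.image.resize(coarse_pred,
                                size=tf.shape(raw_packed)[1:3],
                                method="bilinear")

    # Approximate the noise as the residual between the noisy raw input
    # and the upsampled (denoised) coarse prediction.
    noise_approx = raw_packed - coarse_up

    # The fine network consumes the concatenation of all three tensors.
    return tf.concat([raw_packed, coarse_up, noise_approx], axis=-1)
```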

We extend our coarse-to-fine model to a novel permutation-invariant CNN architecture that takes multiple images of the scene as input and predicts an enhanced image. In particular, low-resolution coarse outputs are first obtained for each frame in the burst sequence using our coarse network. Our set-based network then accepts a set of tensors as input, each instance corresponding to the concatenation of one of the raw burst images, its noise approximation, and the upsampled version of its coarse prediction, and produces the final output.
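A minimal TensorFlow sketch of the permutation-invariant idea is given below: every frame's tensor is encoded by the same shared-weight layers, and the per-frame features are merged with an order-independent maximum. The layer sizes and names are illustrative assumptions, not the released architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Shared-weight encoder applied to every element of the input set.
frame_encoder = tf.keras.Sequential([
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
])

def set_fusion(burst_tensors):
    """burst_tensors: list of (B, H, W, C) tensors, one per burst frame.

    Each tensor is the concatenation of a raw burst frame, its noise
    approximation, and its upsampled coarse prediction.
    """
    # Encode each frame with the *same* encoder (weights are shared).
    feats = [frame_encoder(t) for t in burst_tensors]

    # Element-wise maximum over the set: the output is identical for any
    # ordering of the burst frames, i.e. the fusion is permutation invariant.
    return tf.reduce_max(tf.stack(feats, axis=0), axis=0)
```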

An example night photo captured with a 0.1 s exposure and its enhanced versions produced by the proposed coarse, fine, and burst networks. As the cropped images demonstrate, the fine network enhances both the color and the details of the coarse result. The burst network produces an even sharper and perceptually more pleasing output.

To achieve robustness to small motions, we apply max fusion between the features of the burst frames after the second convolution block. As the features are downsampled, their alignment becomes much easier and the network benefits from fusing the higher-level features. To deal with large motions in the scene, we can additionally utilize the outputs of our coarse network to estimate optical flows between consecutive frames. In our experiments, these flows are obtained with an existing optical flow method and are then used to compensate for motion by selectively performing fusion at the input level, only over regions with little or no motion. We also compare our model with the Seeing Motion in the Dark (SMID) method of Chen et al. on the DRV dataset.
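As a rough sketch of the motion-masked fusion described above (the flow estimator itself is external and not reproduced here; the threshold value and the blending rule are illustrative assumptions):

```python
import tensorflow as tf

def motion_masked_fusion(reference, other_frame, flow, thresh=1.0):
    """Fuse a burst frame into the reference only where the scene is static.

    reference, other_frame : (B, H, W, C) inputs (or features) of two frames
    flow                   : (B, H, W, 2) optical flow between the two frames,
                             estimated from the coarse predictions by an
                             external method
    thresh                 : flow-magnitude threshold in pixels (illustrative)
    """
    # Per-pixel motion magnitude.
    magnitude = tf.norm(flow, axis=-1, keepdims=True)        # (B, H, W, 1)

    # 1 where the region is (nearly) static, 0 where it moves.
    static_mask = tf.cast(magnitude < thresh, reference.dtype)

    # Fuse (element-wise max, as in the feature-level fusion) in static
    # regions; keep the reference frame wherever there is large motion.
    fused = tf.maximum(reference, other_frame)
    return static_mask * fused + (1.0 - static_mask) * reference
```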

Results

Single Image Results



Sony a7s II, ISO 12800, 1/10 s



iPhone 6s, ISO 400, 1/20 s

Burst Results



Sony a7s II, ISO 640, 1/10 s, 8 frames



Sony a7s II, ISO 1600, 1/10 s, 8 frames

Video Results




Acknowledgements

This work was supported in part by the GEBIP 2018 Award of the Turkish Academy of Sciences to E. Erdem and the BAGEP 2021 Award of the Science Academy to A. Erdem. We would like to thank the KUIS AI Center for letting us use their High Performance Computing Cluster and NVIDIA Corporation for the donation of GPUs used in this research.