Recent learning-based approaches have achieved impressive results in single-shot camera localization. However, how best to fuse multiple sensing modalities, and how to cope with degraded or missing input, remain less well studied. In this paper, we propose an end-to-end framework, termed VMLoc, that fuses different sensor inputs into a common latent space through a variational Product-of-Experts (PoE) followed by attention-based fusion. Unlike previous multimodal variational works that directly adapt the objective function of the vanilla variational auto-encoder, we show how camera localization can be accurately estimated through an unbiased objective function based on importance weighting.
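To make the PoE fusion step concrete, the sketch below shows the standard Gaussian Product-of-Experts combination, assuming each modality encoder outputs a diagonal Gaussian posterior over the shared latent space; the function name, tensor shapes, and epsilon term are illustrative assumptions rather than the paper's implementation, and the subsequent attention-based fusion stage is omitted.

```python
import torch

def poe_gaussian(mus, logvars, eps=1e-8):
    """Fuse diagonal-Gaussian experts q_i(z) = N(mu_i, var_i) by Product-of-Experts.

    mus, logvars: tensors of shape (num_experts, batch, latent_dim).
    Under PoE, precisions add and the joint mean is precision-weighted.
    """
    precision = 1.0 / (logvars.exp() + eps)               # per-expert precision 1/var_i
    joint_var = 1.0 / precision.sum(dim=0)                # joint variance: inverse of summed precisions
    joint_mu = joint_var * (mus * precision).sum(dim=0)   # precision-weighted mean
    return joint_mu, joint_var.log()                      # (mu, logvar) of the fused Gaussian
```

A convenient side effect of this formulation is that a degraded or missing modality can be handled by simply dropping its expert from the product, since the remaining precisions still define a valid Gaussian.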
The problem is formulated as Bayesian inference, in which we maximize the posterior probability of the camera pose conditioned on the input data. We find that directly optimizing the standard evidence lower bound (ELBO) as the objective function does not provide a tight estimate of the log-likelihood.
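To illustrate the contrast with the plain ELBO, the following is a minimal sketch of a k-sample importance-weighted bound in the style of IWAE (Burda et al., 2016), which the importance-weighting objective above builds on; the function names and tensor shapes are assumptions for illustration, not the paper's code.

```python
import math
import torch

def elbo(log_p, log_q):
    # Single-sample ELBO: E_q[ log p(x, z) - log q(z | x) ]
    return (log_p - log_q).mean()

def iw_bound(log_p, log_q, k):
    """k-sample importance-weighted lower bound.

    log_p, log_q: shape (batch, k), evaluated at samples z_k ~ q(z | x).
    The bound tightens monotonically in k and reduces to the ELBO at k = 1.
    """
    log_w = log_p - log_q                                  # log importance weights
    return (torch.logsumexp(log_w, dim=1) - math.log(k)).mean()
```

Averaging the importance weights inside the logarithm is what tightens the bound relative to the ELBO, at the cost of drawing k latent samples per input.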