Detecting and tracking moving objects is a challenging problem in visual scene understanding, and it becomes significantly harder when the camera itself is moving. In our project, we tackle this problem with self-supervised learning. Our detection model uses a multiview reconstruction error as its supervision signal to learn confidence scores and bounding boxes for moving objects. The demo videos show our models performing camera pose estimation and moving object detection on KITTI with thresholds of 0.15 and 0.01.
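To make the idea of a reconstruction error acting as a supervision signal concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the project's actual loss: the function names, the soft box mask, the `sparsity` penalty weight, and the reading of the thresholds as cutoffs on confidence scores are all hypothetical. The intuition it shows is that a view reconstructed from camera motion alone fits static regions well and moving objects poorly, so the model can lower its loss by placing confident boxes over high-error regions.

```python
import numpy as np

def photometric_error(target, reconstructed):
    """Per-pixel mean absolute difference between two (H, W, C) images."""
    diff = target.astype(np.float64) - reconstructed.astype(np.float64)
    return np.abs(diff).mean(axis=-1)

def box_mask(shape, boxes, scores):
    """Soft mask: each pixel takes the highest score of any box covering it."""
    mask = np.zeros(shape, dtype=np.float64)
    for (x0, y0, x1, y1), score in zip(boxes, scores):
        region = mask[y0:y1, x0:x1]          # view into mask
        np.maximum(region, score, out=region)
    return mask

def masked_reconstruction_loss(target, reconstructed, boxes, scores, sparsity=0.1):
    """Hypothetical loss: error outside confident boxes plus a mask-size penalty."""
    err = photometric_error(target, reconstructed)
    mask = box_mask(err.shape, boxes, scores)
    # Down-weight error under confident boxes; the penalty term keeps the
    # model from trivially masking the whole image.
    return float(((1.0 - mask) * err).mean() + sparsity * mask.mean())

def filter_detections(boxes, scores, threshold):
    """Keep boxes whose confidence score is at least the threshold (assumed meaning)."""
    return [box for box, score in zip(boxes, scores) if score >= threshold]

# Toy example: a 4x4 frame where only the central 2x2 patch is reconstructed badly.
target = np.zeros((4, 4, 3))
reconstructed = np.zeros((4, 4, 3))
reconstructed[1:3, 1:3] = 1.0

loss_without_box = masked_reconstruction_loss(target, reconstructed, [], [])
loss_with_box = masked_reconstruction_loss(target, reconstructed, [(1, 1, 3, 3)], [1.0])
```

In this toy case, covering the badly reconstructed patch with a confident box lowers the loss, which is the behavior a self-supervised detector of this kind would be rewarded for. Under the assumed reading of the thresholds, a lower cutoff such as 0.01 would keep more candidate boxes than 0.15.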