End-to-end autonomous driving systems must reliably interpret complex urban environments using both visual and geometric information. However, conventional models often struggle to fuse camera and LiDAR data effectively, leading to failures in dense traffic, under occlusion, or in low-light conditions.
Transfuser addresses this limitation by introducing a Transformer-based multi-sensor fusion architecture that jointly processes features extracted from RGB images and LiDAR point clouds. The model aligns spatial cues from both modalities through cross-attention, enabling richer scene understanding and more accurate waypoint prediction.
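To make the fusion step concrete, the sketch below shows a minimal cross-attention exchange between flattened image and LiDAR feature tokens, followed by a simple waypoint regression head. This is an illustration under stated assumptions, not the authors' implementation: the module name `CrossModalFusion`, the token counts, feature dimensions, and the pooling-based waypoint head are all hypothetical choices made for this example.

```python
# Minimal sketch of attention-based camera-LiDAR feature fusion (illustrative,
# not the authors' exact Transfuser code). Assumes pre-extracted feature maps
# from both modalities, flattened into token sequences of hypothetical sizes.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, num_waypoints: int = 4):
        super().__init__()
        # Each modality attends to the other to exchange complementary
        # appearance (camera) and geometry (LiDAR) cues.
        self.img_to_lidar = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lidar_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_lidar = nn.LayerNorm(dim)
        # Simple head regressing (x, y) offsets of future waypoints from the
        # pooled fused representation.
        self.waypoint_head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2 * num_waypoints)
        )
        self.num_waypoints = num_waypoints

    def forward(self, img_tokens: torch.Tensor, lidar_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens:   (B, N_img, dim)   flattened image feature map
        # lidar_tokens: (B, N_lidar, dim) flattened LiDAR BEV feature map
        img_fused, _ = self.img_to_lidar(img_tokens, lidar_tokens, lidar_tokens)
        lidar_fused, _ = self.lidar_to_img(lidar_tokens, img_tokens, img_tokens)
        img_fused = self.norm_img(img_tokens + img_fused)
        lidar_fused = self.norm_lidar(lidar_tokens + lidar_fused)
        # Pool each fused stream and predict future waypoints.
        pooled = torch.cat([img_fused.mean(dim=1), lidar_fused.mean(dim=1)], dim=-1)
        return self.waypoint_head(pooled).view(-1, self.num_waypoints, 2)


if __name__ == "__main__":
    fusion = CrossModalFusion()
    img = torch.randn(2, 64, 256)    # e.g. an 8x8 image feature grid
    lidar = torch.randn(2, 64, 256)  # e.g. an 8x8 BEV feature grid
    print(fusion(img, lidar).shape)  # torch.Size([2, 4, 2])
```

In practice, this kind of attention-based exchange can be applied at multiple feature resolutions before the fused representation is decoded into waypoints.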
In this study, Transfuser is trained and evaluated on the CARLA autonomous driving benchmark. Experimental results show that its attention-based fusion substantially improves performance in challenging scenarios, particularly those involving dynamic agents, complex intersections, or partial sensor degradation. Compared to single-modality models, Transfuser achieves higher route completion rates, better collision avoidance, and more stable lateral control.
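For context on how such closed-loop results are typically summarized, the sketch below illustrates one common pattern: a driving score computed as route completion scaled multiplicatively by per-infraction penalties. The penalty coefficients and the `Episode` record are hypothetical placeholders introduced for this example, not the official CARLA leaderboard values.

```python
# Illustrative aggregation of CARLA-style driving metrics (assumed scheme,
# placeholder coefficients; not the official leaderboard definition).
from dataclasses import dataclass, field

# Placeholder per-infraction multipliers (assumed values for illustration).
PENALTIES = {"collision_vehicle": 0.6, "collision_pedestrian": 0.5, "red_light": 0.7}


@dataclass
class Episode:
    route_completion: float                          # fraction of route finished, in [0, 1]
    infractions: dict = field(default_factory=dict)  # infraction name -> count


def driving_score(episode: Episode) -> float:
    """Route completion scaled down multiplicatively for every infraction."""
    penalty = 1.0
    for name, count in episode.infractions.items():
        penalty *= PENALTIES.get(name, 1.0) ** count
    return 100.0 * episode.route_completion * penalty


if __name__ == "__main__":
    ep = Episode(route_completion=0.92, infractions={"red_light": 1})
    print(f"driving score: {driving_score(ep):.1f}")  # 92% completion, one red-light infraction
```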
These findings highlight that Transformer-driven multi-modal fusion is a powerful approach for building reliable, real-world autonomous driving systems. Transfuser’s ability to integrate complementary sensor information makes it a strong foundation for safe, robust, and scalable autonomous driving pipelines.