End-to-end autonomous driving systems must reliably interpret complex urban environments using both visual and geometric information. However, conventional models often struggle to fuse camera and LiDAR data effectively, leading to failures in dense traffic, under occlusion, or in low-light conditions.
Transfuser addresses this limitation by introducing a Transformer-based multi-sensor fusion architecture that jointly processes RGB images and LiDAR point cloud features. The model aligns spatial cues from both modalities through cross-attention, enabling richer scene understanding and more accurate waypoint prediction.
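As a concrete illustration, the sketch below shows one way cross-attention fusion between image tokens and LiDAR bird's-eye-view tokens can be expressed in PyTorch. The module name, token shapes, and dimensions are illustrative assumptions, not the actual Transfuser implementation.

```python
# Minimal sketch (not the official Transfuser code): fusing image and LiDAR
# BEV feature tokens with cross-attention. Shapes and names are assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # image tokens attend to LiDAR tokens, and vice versa
        self.img_to_lidar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lidar_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_lidar = nn.LayerNorm(dim)

    def forward(self, img_tokens, lidar_tokens):
        # img_tokens:   (B, N_img, dim)  flattened image feature map
        # lidar_tokens: (B, N_bev, dim)  flattened BEV LiDAR feature map
        img_fused, _ = self.img_to_lidar(img_tokens, lidar_tokens, lidar_tokens)
        lidar_fused, _ = self.lidar_to_img(lidar_tokens, img_tokens, img_tokens)
        img_tokens = self.norm_img(img_tokens + img_fused)        # residual update
        lidar_tokens = self.norm_lidar(lidar_tokens + lidar_fused)
        return img_tokens, lidar_tokens

# usage with dummy feature maps
fusion = CrossModalFusion()
img = torch.randn(2, 64, 256)   # e.g. an 8x8 image feature grid, flattened
bev = torch.randn(2, 64, 256)   # e.g. an 8x8 BEV LiDAR grid, flattened
img_out, bev_out = fusion(img, bev)
```

The fused token streams can then be pooled and passed to a waypoint prediction head; the residual connections keep each modality's original features available alongside the attended ones.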
In this study, Transfuser is trained and evaluated on the CARLA autonomous driving benchmark. Experimental results show that its attention-based fusion significantly improves performance in challenging scenarios—especially those involving dynamic agents, complex intersections, or partial sensor degradation. Compared to single-modality models, Transfuser demonstrates higher route completion rates, better collision avoidance, and more stable lateral control.
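For context, route-level results of this kind are typically aggregated into a single driving score in CARLA-style evaluation, where route completion is scaled by a multiplicative infraction penalty. The sketch below follows that convention; the penalty coefficients are illustrative placeholders rather than official leaderboard values.

```python
# Hedged sketch of a CARLA-leaderboard-style driving score: route completion
# scaled by a multiplicative infraction penalty. Coefficients are illustrative.
from typing import Dict

PENALTY = {                      # assumed per-infraction multipliers
    "collision_pedestrian": 0.50,
    "collision_vehicle": 0.60,
    "collision_static": 0.65,
    "red_light": 0.70,
}

def driving_score(route_completion: float, infractions: Dict[str, int]) -> float:
    """route_completion in [0, 1]; infractions maps infraction type -> count."""
    penalty = 1.0
    for kind, count in infractions.items():
        penalty *= PENALTY.get(kind, 1.0) ** count
    return route_completion * penalty

# e.g. 90% of the route completed with one vehicle collision
print(driving_score(0.9, {"collision_vehicle": 1}))  # 0.54
```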
These findings highlight that Transformer-driven multi-modal fusion is a powerful approach for building reliable, real-world autonomous driving systems. Transfuser’s ability to integrate complementary sensor information makes it a strong foundation for secure, robust, and scalable autonomous driving pipelines.
End-to-end autonomous driving models have achieved promising performance by directly mapping raw sensor inputs to driving actions. However, many existing approaches remain limited in interpretability and robustness, as their fusion mechanisms often obscure how different sensor modalities and intermediate representations influence decision-making.
InterFuser is proposed to address this challenge by introducing an intermediate-aware sensor fusion architecture for end-to-end autonomous driving. Instead of performing fusion only at a single latent level, InterFuser explicitly integrates multi-modal features across multiple semantic stages, enabling more structured interaction between perception, reasoning, and control.
The architecture leverages Transformer-based fusion blocks to combine RGB camera inputs and LiDAR-derived features while preserving modality-specific information at intermediate layers. Through staged cross-attention and hierarchical feature alignment, InterFuser facilitates clearer information flow and improves the model’s ability to reason under ambiguous or partially degraded sensor conditions.
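The sketch below illustrates one way such staged fusion can be organized, with an attention block applied at each encoder stage while each modality keeps its own stream. The backbone layers, dimensions, and token layout are simplifying assumptions, not the InterFuser reference implementation.

```python
# Minimal sketch (assumptions, not the official InterFuser code) of staged
# fusion: attention is applied at several encoder stages while each modality
# retains its own backbone stream.
import torch
import torch.nn as nn

class StagedFusionEncoder(nn.Module):
    def __init__(self, dims=(64, 128, 256), heads=4):
        super().__init__()
        # per-modality convolutional stages (placeholders for real backbones)
        self.img_stages = nn.ModuleList(
            nn.Conv2d(d_in, d_out, 3, stride=2, padding=1)
            for d_in, d_out in zip((3,) + dims[:-1], dims))
        self.lidar_stages = nn.ModuleList(
            nn.Conv2d(d_in, d_out, 3, stride=2, padding=1)
            for d_in, d_out in zip((2,) + dims[:-1], dims))
        # one attention fusion block per stage
        self.fusers = nn.ModuleList(
            nn.MultiheadAttention(d, heads, batch_first=True) for d in dims)

    def forward(self, img, bev):
        for img_conv, lidar_conv, fuse in zip(self.img_stages,
                                              self.lidar_stages, self.fusers):
            img, bev = img_conv(img), lidar_conv(bev)
            b, c, h, w = img.shape
            # flatten both feature maps into token sequences and fuse jointly
            tokens = torch.cat([img.flatten(2).transpose(1, 2),
                                bev.flatten(2).transpose(1, 2)], dim=1)
            fused, _ = fuse(tokens, tokens, tokens)
            # write fused tokens back into each modality's spatial stream
            img = img + fused[:, :h * w].transpose(1, 2).reshape(b, c, h, w)
            bh, bw = bev.shape[2:]
            bev = bev + fused[:, h * w:].transpose(1, 2).reshape(b, c, bh, bw)
        return img, bev

# dummy inputs: an RGB image and a 2-channel BEV LiDAR raster of the same size
enc = StagedFusionEncoder()
img_out, bev_out = enc(torch.randn(1, 3, 64, 64), torch.randn(1, 2, 64, 64))
```

Because fusion happens at every stage rather than once at the end, modality-specific detail is still available to later layers, which is the property the intermediate-level design relies on.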
In this study, InterFuser is implemented and evaluated on the CARLA autonomous driving benchmark. Qualitative and quantitative results indicate that the model exhibits more stable driving behavior, improved handling of complex intersections, and increased resilience to occlusions and dynamic traffic compared to conventional single-stage fusion approaches. Notably, the intermediate fusion design also enables deeper analysis of internal representations, offering insights into how different sensor cues contribute to final control decisions.
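As an example of this kind of analysis, the sketch below reads the attention weights of a fusion block to estimate how much attention each fused query places on image versus LiDAR tokens. The token ordering and shapes are assumptions for illustration only.

```python
# Hedged sketch: inspecting which modality a fused query attends to by reading
# the attention weights of a fusion block. Token layout (image tokens first,
# then LiDAR tokens) is an assumption.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
img_tokens = torch.randn(1, 100, 64)    # 100 image tokens
lidar_tokens = torch.randn(1, 50, 64)   # 50 LiDAR tokens
tokens = torch.cat([img_tokens, lidar_tokens], dim=1)

# attention weights averaged over heads: shape (batch, queries, keys)
_, weights = attn(tokens, tokens, tokens, need_weights=True)

# share of attention mass each query places on image vs. LiDAR tokens
img_share = weights[:, :, :100].sum(dim=-1)    # (1, 150)
lidar_share = weights[:, :, 100:].sum(dim=-1)
print(img_share.mean().item(), lidar_share.mean().item())
```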
These results suggest that intermediate-level multi-modal fusion is a promising direction for building autonomous driving systems that are not only robust but also more interpretable. InterFuser provides a flexible research framework for exploring explainable end-to-end driving and serves as a strong foundation for future work on secure, reliable, and transparent autonomous driving systems.