SGAP-Gaze: Scene Grid Attention Based Point-of-Gaze Estimation Network for Driver Gaze

cs.CV updates on arXiv.org

Pavan Kumar Sharma, Pranamesh Chakraborty

Apr 23, 2026, 12:00 AM

arXiv:2604.19888v1 Announce Type: new Abstract: Driver gaze estimation is essential for understanding the driver's situational awareness of surrounding traffic. Existing gaze estimation models use driver facial information to predict the Point-of-Gaze (PoG) or the 3D gaze direction vector. We propose a benchmark dataset, Urban Driving-Face Scene Gaze (UD-FSG), comprising synchronized driver-face and traffic-scene images. The scene images provide cues about surrounding traffic, which can help improve the gaze estimation model, along with the face images. We propose SGAP-Gaze, Scene-Grid Attention based Point-of-Gaze estimation network, trained and tested on our UD-FSG dataset, which explicitly incorporates the scene images into the gaze estimation modelling. The gaze estimation network integrates driver face, eye, iris, and scene contextual information. First, the extracted features from facial modalities are fused to form a gaze intent vector. Then, attention scores are computed over the spatial scene grid using a Transformer-based attention mechanism fusing face and scene image features to obtain the PoG. The proposed SGAP-Gaze model achieves a mean pixel error of 104.73 on the UD-FSG dataset and 63.48 on LBW dataset, achieving a 23.5% reduction in mean pixel error compared to state-of-the-art driver gaze estimation models. The spatial pixel distribution analysis shows that SGAP-Gaze consistently achieves lower mean pixel error than existing methods across all spatial ranges, including the outer regions of the scene, which are rare but critical for understanding driver attention. These results highlight the effectiveness of integrating multi-modal gaze cues with scene-aware attention for a robust driver PoG estimation model in real-world driving environments.