ESGLens: An LLM-Based RAG Framework for Interactive ESG Report Analysis and Score Prediction

cs.CL updates on arXiv.org

Tsung-Yu Yang, Meng-Chi Chen

Apr 23, 2026, 12:00 AM

arXiv:2604.19779v1

Abstract: Environmental, Social, and Governance (ESG) reports are central to investment decision-making, yet their length, heterogeneous content, and lack of standardized structure make manual analysis costly and inconsistent. We present ESGLens, a proof-of-concept framework combining retrieval-augmented generation (RAG) with prompt-engineered extraction to automate three tasks: (1) structured information extraction guided by Global Reporting Initiative (GRI) standards, (2) interactive question-answering with source traceability, and (3) ESG score prediction via regression on LLM-generated embeddings. ESGLens is purpose-built for the domain: a report-processing module segments heterogeneous PDF content into typed chunks (text, tables, charts); a GRI-guided extraction module retrieves and synthesizes information aligned with specific standards; and a scoring module embeds extracted summaries and feeds them to a regression model trained against London Stock Exchange Group (LSEG) reference scores. We evaluate the framework on approximately 300 reports from companies in the QQQ, S&P 500, and Russell 1000 indices (fiscal year 2022). Among three embedding methods (ChatGPT, BERT, RoBERTa) and two regressors (Neural Network, LightGBM), ChatGPT embeddings with a Neural Network achieve a Pearson correlation of 0.48 ($R^{2} \approx 0.23$) against LSEG ground-truth scores, a modest but statistically meaningful signal given the roughly 300-report training set and restriction to the environmental pillar. A traceability audit shows that 8 of 10 extracted claims verify against the source document, with two failures attributable to few-shot example leakage. We discuss limitations including dataset size and restriction to environmental indicators, and release the code to support reproducibility.
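
As a rough illustration of how the reported evaluation figures relate, the sketch below computes a Pearson correlation between predicted and reference scores and squares it. It assumes that the $R^{2}$ quoted in the abstract is the squared Pearson correlation, which the abstract does not state explicitly. The function name `pearson_r` and the toy score lists are hypothetical and are not taken from the ESGLens code or data.

```python
import math


def pearson_r(predicted, reference):
    """Pearson correlation between two equal-length sequences of scores."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_p = sum(predicted) / len(predicted)
    mean_r = sum(reference) / len(reference)
    # Covariance and variances are left unnormalized; the 1/n factors cancel.
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_r)


# Hypothetical toy scores for illustration only, not data from the paper.
predicted_scores = [55.0, 62.0, 48.0, 70.0, 66.0]
reference_scores = [50.0, 68.0, 52.0, 61.0, 75.0]

r = pearson_r(predicted_scores, reference_scores)
print(f"toy r = {r:.2f}, r^2 = {r * r:.2f}")

# Under the stated assumption, the abstract's r = 0.48 gives
# r^2 = 0.2304, which rounds to the quoted 0.23.
reported_r = 0.48
print(f"reported r^2 = {reported_r ** 2:.4f}")
```

Read this way, a correlation of 0.48 corresponds to the predictions accounting for a little under a quarter of the variance in the LSEG reference scores.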