
Team publications

Weiqi Sun, PhD (19 publications)
Allan J. Pantuck, MD (268 publications)
Shubh Jaroria (1 publication)
Vaibhav Mavi (1 publication)

arXiv:2511.07364, 2025

Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection

Reliability and failure detection of large language models (LLMs) are critical for their deployment in high-stakes, multi-step reasoning tasks. Prior work explores confidence estimation for self-evaluating LLM-scorer systems, in which confidence scorers estimate the likelihood of errors in LLM responses. However, most methods focus on single-step outputs and overlook the challenges of multi-step reasoning. In this work, we extend self-evaluation techniques to multi-step tasks, testing two intuitive approaches: holistic scoring and step-by-step scoring. Using two multi-step benchmark datasets, we show that stepwise evaluation generally outperforms holistic scoring in detecting potential errors, yielding up to a 15% relative increase in AUC-ROC. Our findings demonstrate that self-evaluating LLM systems provide meaningful confidence estimates in complex reasoning, improving their trustworthiness and providing a practical framework for failure detection.
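The contrast between holistic and stepwise scoring can be sketched in a few lines. This is a hypothetical illustration, not the paper's method: the numeric confidences are made up, and the aggregation choices (minimum over steps for stepwise, mean as a stand-in holistic score) are assumptions, since in practice an LLM-based scorer would produce the scores.

```python
# Hypothetical sketch: stepwise vs. holistic confidence for failure
# detection. Scores here are invented; a real system would obtain them
# from an LLM-based confidence scorer.

def stepwise_confidence(step_scores):
    """A chain of reasoning is only as reliable as its weakest step,
    so aggregate per-step confidences with the minimum."""
    return min(step_scores)

def holistic_confidence(step_scores):
    """Single score over the whole response (mean used as a stand-in)."""
    return sum(step_scores) / len(step_scores)

def flag_failure(score, threshold=0.5):
    """Flag the response as a likely failure if confidence is low."""
    return score < threshold

# A four-step solution with one dubious intermediate step:
steps = [0.9, 0.85, 0.3, 0.9]
print(flag_failure(stepwise_confidence(steps)))  # True: weak step caught
print(flag_failure(holistic_confidence(steps)))  # False: averaged away
```

The toy example shows why stepwise scoring can help: a single bad step drags the minimum below the threshold, while a holistic average can mask it.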

July, 2022

Compositional Task-Oriented Parsing as Abstractive Question Answering

Task-oriented parsing (TOP) aims to convert natural language into machine-readable representations of specific tasks, such as setting an alarm. A popular approach to TOP is to apply seq2seq models to generate linearized parse trees. A more recent line of work argues that pretrained seq2seq models are better at generating outputs that are themselves natural language, so they replace linearized parse trees with canonical natural-language paraphrases that can then be easily translated into parse trees, resulting in so-called naturalized parsers. In this work we continue to explore naturalized semantic parsing by presenting a general reduction of TOP to abstractive question answering that overcomes some limitations of canonical paraphrasing. Experimental results show that our QA-based technique outperforms state-of-the-art methods in full-data settings while achieving dramatic improvements in few-shot settings.
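The shape of a reduction from TOP to question answering can be sketched as follows. This is a minimal illustration under assumptions, not the paper's actual reduction: the slot questions, the `qa` stub (a keyword lookup standing in for an abstractive QA model), and the flat frame output are all invented for exposition.

```python
# Hypothetical sketch: reduce task-oriented parsing to QA by asking one
# question per slot of the target frame. The qa() stub below is a
# stand-in for an abstractive QA model.

def qa(question, utterance):
    """Stub QA model: a fixed lookup in place of a seq2seq QA system."""
    answers = {
        "What time should the alarm be set for?": "7 am",
        "What is the alarm for?": "my meeting",
    }
    return answers.get(question)

def parse_via_qa(utterance, slot_questions):
    """Fill a frame by asking one question per slot; drop empty slots."""
    frame = {slot: qa(q, utterance) for slot, q in slot_questions.items()}
    return {slot: val for slot, val in frame.items() if val is not None}

utterance = "set an alarm for 7 am for my meeting"
slots = {
    "time": "What time should the alarm be set for?",
    "purpose": "What is the alarm for?",
}
print(parse_via_qa(utterance, slots))
# {'time': '7 am', 'purpose': 'my meeting'}
```

Because both questions and answers are natural language, the QA model's pretraining transfers directly; the filled frame is then trivially convertible into a parse tree.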

April, 2022

Unfreeze with Care: Space-Efficient Fine-Tuning of Semantic Parsing Models

Semantic parsing is a key NLP task that maps natural language to structured meaning representations. As in many other NLP tasks, state-of-the-art performance in semantic parsing is now attained by fine-tuning a large pretrained language model (PLM). While effective, this approach is inefficient in the presence of multiple downstream tasks, as a new set of values for all parameters of the PLM needs to be stored for each task separately. Recent work has explored methods for adapting PLMs to downstream tasks while keeping most (or all) of their parameters frozen. We examine two such promising techniques, prefix tuning and bias-term tuning, specifically on semantic parsing. We compare them against each other on two different semantic parsing datasets, and we also compare them against full and partial fine-tuning, both in few-shot and conventional data settings. While prefix tuning is shown to do poorly for semantic parsing tasks off the shelf, we modify it by adding special token embeddings, which results in very strong performance without compromising parameter savings.
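The storage argument behind parameter-efficient tuning can be made concrete with a little arithmetic. This is a hypothetical sketch with made-up layer shapes (loosely modeled on a 768-dimensional transformer block), not figures from the paper: with bias-term tuning, only the unfrozen bias vectors need to be stored per task, not a full copy of the PLM.

```python
# Hypothetical sketch: fraction of parameters stored per task when only
# bias terms are unfrozen. Shapes below are illustrative, not from the
# paper.

def trainable_fraction(param_shapes, unfrozen):
    """param_shapes: name -> shape tuple; unfrozen: predicate on name."""
    def size(shape):
        n = 1
        for d in shape:
            n *= d
        return n
    total = sum(size(s) for s in param_shapes.values())
    tuned = sum(size(s) for name, s in param_shapes.items() if unfrozen(name))
    return tuned / total

# Toy transformer layer: weight matrices dominate, biases are tiny.
params = {
    "attn.weight": (768, 768), "attn.bias": (768,),
    "ffn1.weight": (768, 3072), "ffn1.bias": (3072,),
    "ffn2.weight": (3072, 768), "ffn2.bias": (768,),
}
frac = trainable_fraction(params, lambda name: name.endswith(".bias"))
print(f"{frac:.4%}")  # well under 0.1% of parameters stored per task
```

The point generalizes: whether via prefix vectors or bias terms, per-task storage shrinks from the full PLM to a sliver of it, which is what makes serving many downstream tasks from one frozen backbone practical.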

Urol Oncol 33(5):204.e25-33, 2015.

Carbonic anhydrase-IX score is a novel biomarker that predicts recurrence and survival for high-risk, nonmetastatic renal cell carcinoma: Data from the phase III ARISER clinical trial.

This study, the largest multicenter prospective analysis of patients with high-risk nonmetastatic ccRCC, demonstrates the utility of the CAIX score as a statistically significant prognostic biomarker for survival. It recommends that the CAIX score be quantified for all patients with high-risk disease after nephrectomy.

Eur Urol. 61(5):888-95, 2012.

Clinical, molecular, and genetic correlates of lymphatic spread in clear cell renal cell carcinoma.

A predictive model consisting of smoking history (p=0.040), T stage (p<0.0001), Fuhrman grade (p<0.0001), Eastern Cooperative Oncology Group performance status (p<0.0001), and microvascular invasion (p<0.0001) was independently associated with lymphatic spread. After adjustment for these clinical variables, low carbonic anhydrase IX (CAIX) (p=0.043) and high epithelial vascular endothelial growth factor receptor 2 (p=0.033) protein expression were associated with a higher risk of lymphatic spread, and loss of chromosome 3p (p<0.0001) with a lower risk.

Cancer, 115(7):1448-58, 2009.

Carbonic anhydrase IX in bladder cancer: a diagnostic, prognostic, and therapeutic molecular marker.

CAIX was expressed differentially in noninvasive versus invasive tumors, in low-grade versus high-grade bladder cancer, and in primary tumors versus metastases. The current results indicated that CAIX is a strong predictor of recurrence, progression, and overall survival of patients with bladder cancer; and the integration of CAIX expression into conventional prognostic models significantly improved their predictive accuracy. The data suggest a tripartite role of CAIX as a diagnostic, prognostic, and therapeutic molecular marker in bladder cancer.
