Weiqi Sun, PhD Chief AI Officer Publications 19
Team publications
arXiv:2511.07364, 2025
Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection
Reliability and failure detection of large language models (LLMs) is critical for their deployment in high-stakes, multi-step reasoning tasks. Prior work explores confidence estimation for self-evaluating LLM-scorer systems, with confidence scorers estimating the likelihood of errors in LLM responses. However, most methods focus on single-step outputs and overlook the challenges of multi-step reasoning. In this work, we extend self-evaluation techniques to multi-step tasks, testing two intuitive approaches: holistic scoring and step-by-step scoring. Using two multi-step benchmark datasets, we show that stepwise evaluation generally outperforms holistic scoring in detecting potential errors, with up to 15% relative increase in AUC-ROC. Our findings demonstrate that self-evaluating LLM systems provide meaningful confidence estimates in complex reasoning, improving their trustworthiness and providing a practical framework for failure detection.
July, 2022
Compositional Task-Oriented Parsing as Abstractive Question Answering
Task-oriented parsing (TOP) aims to convert natural language into machine-readable representations of specific tasks, such as setting an alarm. A popular approach to TOP is to apply seq2seq models to generate linearized parse trees. A more recent line of work argues that pretrained seq2seq models are better at generating outputs that are themselves natural language, so they replace linearized parse trees with canonical natural-language paraphrases that can then be easily translated into parse trees, resulting in so-called naturalized parsers. In this work we continue to explore naturalized semantic parsing by presenting a general reduction of TOP to abstractive question answering that overcomes some limitations of canonical paraphrasing. Experimental results show that our QA-based technique outperforms state-of-the-art methods in full-data settings while achieving dramatic improvements in few-shot settings.
April, 2022
Unfreeze with Care: Space-Efficient Fine-Tuning of Semantic Parsing Models
Semantic parsing is a key NLP task that maps natural language to structured meaning representations. As in many other NLP tasks, SOTA performance in semantic parsing is now attained by fine-tuning a large pretrained language model (PLM). While effective, this approach is inefficient in the presence of multiple downstream tasks, as a new set of values for all parameters of the PLM needs to be stored for each task separately. Recent work has explored methods for adapting PLMs to downstream tasks while keeping most (or all) of their parameters frozen. We examine two such promising techniques, prefix tuning and bias-term tuning, specifically on semantic parsing. We compare them against each other on two different semantic parsing datasets, and we also compare them against full and partial fine-tuning, both in few-shot and conventional data settings. While prefix tuning is shown to do poorly for semantic parsing tasks off the shelf, we modify it by adding special token embeddings, which results in very strong performance without compromising parameter savings.
Dimitrios Iliopoulos, PhD, MBA Senior Scientific Advisor & Board Member Publications 135
Team publications
J Crohns Colitis. jjab051, 2021.
Results of the Seventh Scientific Workshop of ECCO: Precision medicine in IBD – what, why, and how.
Many diseases that affect modern humans fall in the category of complex diseases, thus called because they result from a combination of multiple aetiological and pathogenic factors. Regardless of the organ or system affected, complex diseases present major challenges in diagnosis, classification, and management. Current forms of therapy are usually applied in an indiscriminate fashion based on clinical information, but even the most advanced drugs only benefit a limited number of patients and to a variable and unpredictable degree. This ‘one measure does not fit all’ situation has spurred the notion that therapy for complex disease should be tailored to individual patients or groups of patients, giving rise to the notion of ‘precision medicine’ [PM]. Inflammatory bowel disease [IBD] is a prototypical complex disease where the need for PM has become increasingly clear. This prompted the European Crohn’s and Colitis Organisation to focus the 7th Scientific Workshop on this emerging theme. The articles in this special issue of the Journal address the various complementary aspects of PM in IBD, including what is PM; why it is needed and how it can be used; how PM can contribute to prediction and prevention of IBD; how IBD PM can aid in prognosis and improve response to therapy; and the challenges and future directions of PM in IBD. This first article of this series is structured on three simple concepts [what, why, and how] and addresses the definition of PM, discusses the rationale for the need of PM in IBD, and outlines the methodology required to implement PM in IBD in a correct and clinically meaningful way.
Gut 68(7):1271-1286, 2019.
Lysine methyltransferase 2D regulates pancreatic carcinogenesis through metabolic reprogramming.
We define a new antitumorous function of the histone lysine (K)-specific methyltransferase 2D (KMT2D) in pancreatic cancer. KMT2D is transcriptionally repressed in human pancreatic tumours through DNA methylation. Clinically, lower levels of this methyltransferase associate with poor prognosis and significant weight alterations. RNAi-based genetic inactivation of KMT2D promotes tumour growth and results in loss of H3K4me3 mark. In addition, KMT2D inhibition increases aerobic glycolysis and alters the lipidomic profiles of pancreatic cancer cells. Further analysis of this phenomenon identified the glucose transporter SLC2A3 as a mediator of KMT2D-induced changes in cellular, metabolic and proliferative rates.
Nat Rev Gastroenterol Hepatol 14(12):739-749, 2017.
The IBD interactome: an integrated view of aetiology, pathogenesis and therapy.
Crohn’s disease and ulcerative colitis are prototypical complex diseases characterized by chronic and heterogeneous manifestations, induced by interacting environmental, genomic, microbial and immunological factors. These interactions result in an overwhelming complexity that cannot be tackled by studying the totality of each pathological component (an ‘-ome’) in isolation without consideration of the interaction among all relevant -omes that yield an overall ‘network effect’. The outcome of this effect is the ‘IBD interactome’, defined as a disease network in which dysregulation of individual -omes causes intestinal inflammation mediated by dysfunctional molecular modules. To define the IBD interactome, new concepts and tools are needed to implement a systems approach; an unbiased data-driven integration strategy that reveals key players of the system, pinpoints the central drivers of inflammation and enables development of targeted therapies. Powerful bioinformatics tools able to query and integrate multiple -omes are available, enabling the integration of genomic, epigenomic, transcriptomic, proteomic, metabolomic and microbiome information to build a comprehensive molecular map of IBD. This approach will enable identification of IBD molecular subtypes, correlations with clinical phenotypes and elucidation of the central hubs of the IBD interactome that will aid discovery of compounds that can specifically target the hubs that control the disease.
Inflamm Bowel Dis 21(11):2533-9, 2015.
Assessment of Circulating MicroRNAs for the Diagnosis and Disease Activity Evaluation in Patients with Ulcerative Colitis by Using the Nanostring Technology.
We have identified a signature of 12 circulating microRNAs that differentiate patients with UC from control subjects. Moreover, 6 of these microRNAs significantly correlated with UC disease activity. Importantly, a set of 4 microRNAs (hsa-miR-4454, hsa-miR-223-3p, hsa-miR-23a-3p, and hsa-miR-320e), which correlated with UC disease activity, were found to have higher sensitivity and specificity values than C-reactive protein. Circulating microRNAs provide a novel diagnostic and prognostic marker for patients with UC. The use of an FDA-approved platform could accelerate the application of microRNA screening in a gastrointestinal clinical setting. When used in combination with current diagnostic and disease activity assessment modalities, microRNAs could improve both IBD screening and care management.
Gastroenterology 145(4):842-52.e2, 2013.
MicroRNA-124 regulates STAT3 expression and is down-regulated in colon tissues of pediatric patients with ulcerative colitis.
miR-124 appears to regulate the expression of STAT3. Reduced levels of miR-124 in colon tissues of children with active UC appear to increase expression and activity of STAT3, which could promote inflammation and the pathogenesis of UC in children. Thus, the activation of miR-124 could have a therapeutic potential in IBD patients.
Allan J. Pantuck, MD Senior Medical Advisor Publications 268
Team publications
N Engl J Med 375(23):2246-2254, 2016.
Adjuvant Sunitinib in High-Risk Renal-Cell Carcinoma after Nephrectomy.
Among patients with locoregional clear-cell renal-cell carcinoma at high risk for tumor recurrence after nephrectomy, the median duration of disease-free survival was significantly longer in the sunitinib group than in the placebo group, at a cost of a higher rate of toxic events.
Urol Oncol 33(5):204.e25-33, 2015.
Carbonic anhydrase-IX score is a novel biomarker that predicts recurrence and survival for high-risk, nonmetastatic renal cell carcinoma: Data from the phase III ARISER clinical trial.
The largest, multicenter, prospective analysis of patients with high-risk nonmetastatic ccRCC demonstrates the utility of CAIX score as a statistically significant prognostic biomarker for survival. It recommends that CAIX score be quantified for all patients with high-risk disease after nephrectomy.
Eur Urol. 61(5):888-95, 2012.
Clinical, molecular, and genetic correlates of lymphatic spread in clear cell renal cell carcinoma.
A predictive model consisting of smoking history (p=0.040), T stage (p<0.0001), Fuhrman grade (p<0.0001), Eastern Cooperative Oncology Group performance status (p<0.0001), and microvascular invasion (p<0.0001) was independently associated with lymphatic spread. After adjustment with these clinical variables, low carbonic anhydrase IX (CAIX) (p=0.043) and high epithelial vascular endothelial growth factor receptor 2 (p=0.033) protein expression were associated with a higher risk of lymphatic spread, and loss of chromosome 3p (p<0.0001) with a lower risk.
Cancer 118(7):1795-802, 2012.
Smoking negatively impacts renal cell carcinoma overall and cancer-specific survival.
In patients with RCC, a history of smoking was associated with worse pathologic features and survival outcomes and with an increased risk of having mutated p53. Further investigation of the genetic and molecular mechanisms associated with decreased CSS in patients with RCC who have a history of smoking is indicated.
Cancer, 115(7):1448-58, 2009.
Carbonic anhydrase IX in bladder cancer: a diagnostic, prognostic, and therapeutic molecular marker.
CAIX was expressed differentially in noninvasive versus invasive tumors, in low-grade versus high-grade bladder cancer, and in primary tumors versus metastases. The current results indicated that CAIX is a strong predictor of recurrence, progression, and overall survival of patients with bladder cancer; and the integration of CAIX expression into conventional prognostic models significantly improved their predictive accuracy. The data suggest a tripartite role of CAIX as a diagnostic, prognostic, and therapeutic molecular marker in bladder cancer.
Shubh Jaroria NLP Applied Scientist Publications 1
Team publications
arXiv:2511.07364, 2025
Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection
Vaibhav Mavi Sr. NLP Applied Scientist Publications 1
Team publications
arXiv:2511.07364, 2025
Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection