Publications

An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems

Published in Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI 26), 2026

Large Language Models (LLMs) are transforming scholarly tasks like search and summarization, but their reliability remains uncertain. Current approaches to evaluating LLM reliability are primarily automated; they prioritize efficiency and scalability but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practice. We developed and validated a schema for evaluating LLM errors in scholarly question-answering systems that reflects the assessment strategies of practicing scientists. In collaboration with domain experts, we identified 20 error patterns across seven categories through thematic analysis of 68 question-answer pairs. We validated this schema through contextual inquiries with 10 additional scientists, which revealed not only which errors experts naturally identify but also how structured evaluation schemas can help them detect previously overlooked issues. Domain experts use systematic assessment strategies, including technical precision testing, value-based evaluation, and meta-evaluation of their own practices. We discuss implications for supporting expert evaluation of LLM outputs, including opportunities for personalized, schema-driven tools that adapt to individual evaluation patterns and expertise levels.
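
The schema itself is qualitative, but as a rough sketch of how such a schema might be represented programmatically, the Python below models error categories, patterns, and annotated QA pairs as simple dataclasses. The category and pattern names are hypothetical placeholders, not the seven categories or 20 patterns defined in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class ErrorPattern:
    name: str          # a recurring failure mode an expert can recognize
    description: str   # how it shows up in an LLM answer

@dataclass
class ErrorCategory:
    name: str
    patterns: list[ErrorPattern] = field(default_factory=list)

@dataclass
class AnnotatedQAPair:
    question: str
    answer: str
    observed_patterns: list[str] = field(default_factory=list)

# Hypothetical placeholder categories -- the paper's actual seven differ.
schema = [
    ErrorCategory("factual", [
        ErrorPattern("unsupported_claim", "assertion not backed by any cited source"),
    ]),
    ErrorCategory("omission", [
        ErrorPattern("missing_caveat", "a relevant limitation is left out of the answer"),
    ]),
]

pair = AnnotatedQAPair(
    question="Which methods estimate sea-ice thickness?",
    answer="Only satellite altimetry is used.",
    observed_patterns=["unsupported_claim", "missing_caveat"],
)
print(pair.observed_patterns)
```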

Recommended citation: Anna Martin-Boyle, William Humphreys, Martha Brown, Cara Leckey, and Harmanpreet Kaur. (2026). "An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems." In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI 26), April 13–17, 2026, Barcelona, Spain. ACM.

PaperTrail: A Claim-Evidence Interface for Grounding Provenance in LLM-based Scholarly Q&A

Published in Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI 26), 2026

Large language models (LLMs) are increasingly used in scholarly question-answering (QA) systems to help researchers synthesize vast amounts of literature. However, these systems often produce subtle errors (e.g., unsupported claims, errors of omission), and current provenance mechanisms like source citations are not granular enough for the rigorous verification that scholarly domains require. To address this, we introduce PaperTrail, a novel interface that decomposes both LLM answers and source documents into discrete claims and evidence, mapping them to reveal supported assertions, unsupported claims, and information omitted from the source texts. We evaluated PaperTrail in a within-subjects study with 26 researchers who performed two scholarly editing tasks using PaperTrail and a baseline interface. Our results show that PaperTrail significantly lowered participants’ trust compared to the baseline. However, this increased caution did not translate into behavioral changes, as people continued to rely on LLM-generated scholarly edits to avoid a cognitively burdensome task. We discuss the value of claim-evidence matching for understanding LLM trustworthiness in scholarly settings, and present design implications for cognition-friendly communication of provenance information.
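
A minimal sketch of the claim-evidence matching idea, assuming claims and evidence have already been extracted as plain strings: the token-overlap similarity and threshold below are illustrative stand-ins, not PaperTrail's actual matching method.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercased token sets (illustrative similarity only)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def match_claims(answer_claims, source_evidence, threshold=0.4):
    """Label each answer claim as supported or unsupported, and flag source
    evidence that no claim covers (candidate omissions)."""
    supported, unsupported, covered = [], [], set()
    for claim in answer_claims:
        best = max(source_evidence, key=lambda ev: token_overlap(claim, ev))
        if token_overlap(claim, best) >= threshold:
            supported.append((claim, best))
            covered.add(best)
        else:
            unsupported.append(claim)
    omitted = [ev for ev in source_evidence if ev not in covered]
    return supported, unsupported, omitted

answer_claims = ["The model improves F1 by 10 points.",
                 "The dataset covers twelve languages."]
source_evidence = ["Our model improves F1 by 10 points over the baseline.",
                   "All experiments use English data only."]
print(match_claims(answer_claims, source_evidence))
```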

Recommended citation: Anna Martin-Boyle, Cara Leckey, Martha Brown, and Harmanpreet Kaur. (2026). "PaperTrail: A Claim-Evidence Interface for Grounding Provenance in LLM-based Scholarly Q&A." In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI 26), April 13–17, 2026, Barcelona, Spain. ACM.

Complex Mathematical Symbol Definition Structures: A Dataset and Model for Coordination Resolution in Definition Extraction

Published in arXiv, 2023

Mathematical symbol definition extraction is important for improving scholarly reading interfaces and scholarly information extraction (IE). However, the task poses several challenges: math symbols are difficult to process as they are not composed of natural language morphemes; and scholarly papers often contain sentences that require resolving complex coordinate structures. We present SymDef, an English language dataset of 5,927 sentences from full-text scientific papers where each sentence is annotated with all mathematical symbols linked with their corresponding definitions. This dataset focuses specifically on complex coordination structures such as “respectively” constructions, which often contain overlapping definition spans. We also introduce a new definition extraction method that masks mathematical symbols, creates a copy of each sentence for each symbol, specifies a target symbol, and predicts its corresponding definition spans using slot filling. Our experiments show that our definition extraction model significantly outperforms RoBERTa and other strong IE baseline systems by 10.9 points with a macro F1 score of 84.82. With our dataset and model, we can detect complex definitions in scholarly documents to make scientific writing more readable.
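
The per-symbol preprocessing described above can be sketched roughly as follows; the mask tokens and the word-boundary symbol matching are simplifications assumed for illustration, not the exact formatting the paper feeds to its slot-filling model.

```python
import re

def make_slot_filling_inputs(sentence, symbols):
    """For each symbol, emit one copy of the sentence with every symbol
    masked and the target symbol specially marked, mirroring the
    per-symbol, slot-filling setup described in the abstract."""
    inputs = []
    for target in symbols:
        masked = sentence
        for sym in symbols:
            mask = "[TARGET]" if sym == target else "[SYM]"
            masked = re.sub(rf"\b{re.escape(sym)}\b", mask, masked)
        inputs.append({"target": target, "text": masked})
    return inputs

sentence = "Let n and m denote the number of rows and columns, respectively."
for example in make_slot_filling_inputs(sentence, ["n", "m"]):
    print(example["target"], "->", example["text"])
```

On the example "respectively" sentence, the model would then have to link n to "the number of rows" and m to "the number of columns", which is exactly the overlapping-span coordination the dataset targets.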

Recommended citation: Anna Martin-Boyle, Andrew Head, Kyle Lo, Risham Sidhu, Marti A. Hearst, and Dongyeop Kang. (2023). "Complex Mathematical Symbol Definition Structures: A Dataset and Model for Coordination Resolution in Definition Extraction." arXiv:2305.14660. https://arxiv.org/pdf/2305.14660.pdf

Decoding the End-to-end Writing Trajectory in Scholarly Manuscripts

Published in The Second Workshop on Intelligent and Interactive Writing Assistants at CHI 2023, 2023

Scholarly writing presents a complex space that generally follows a methodical procedure to plan and produce both rationally sound and creative compositions. Recent works involving large language models (LLMs) demonstrate considerable success in text generation and revision tasks; however, LLMs still struggle to provide the structural and creative feedback at the document level that is crucial to academic writing. In this paper, we introduce a novel taxonomy that categorizes scholarly writing behaviors according to intention, writer actions, and the information types of the written data. We also provide ManuScript, an original dataset annotated with a simplified version of our taxonomy to show writer actions and the intentions behind them. Motivated by cognitive writing theory, our taxonomy for scientific papers includes three levels of categorization in order to trace the general writing flow and identify the distinct writer activities embedded within each higher-level process. ManuScript aims to provide a complete picture of the scholarly writing process by capturing the linearity and non-linearity of the writing trajectory, so that writing assistants can provide stronger feedback and suggestions on an end-to-end level.
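
One way to picture a single annotated step in such a trajectory, with hypothetical label values standing in for the taxonomy's actual categories:

```python
from dataclasses import dataclass

@dataclass
class WritingEvent:
    """One annotated revision step. The label values used below are
    hypothetical placeholders, not the taxonomy's actual category names."""
    intention: str         # high-level goal behind the edit
    writer_action: str     # concrete operation the writer performed
    information_type: str  # kind of content the edit touches
    text_before: str
    text_after: str

trajectory = [
    WritingEvent("planning", "outline_section", "structure",
                 "", "1. Introduction  2. Methods  3. Results"),
    WritingEvent("revision", "rewrite_sentence", "claim",
                 "Our model is the best.",
                 "Our model outperforms the baseline by 3 points."),
]
for event in trajectory:
    print(f"{event.intention}: {event.writer_action} ({event.information_type})")
```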

Recommended citation: Ryan Koo, Anna Martin, Linghe Wang, and Dongyeop Kang. (2023). "Decoding the End-to-end Writing Trajectory in Scholarly Manuscripts." The Second Workshop on Intelligent and Interactive Writing Assistants at CHI 2023. https://cdn.glitch.global/d058c114-3406-43be-8a3c-d3afff35eda2/paper34_2023.pdf

Duluth at SemEval-2021 Task 11: Applying DeBERTa to Contributing Sentence Selection and Dependency Parsing for Entity Extraction

Published in Proceedings of the Fifteenth Workshop on Semantic Evaluation (SemEval-2021), 2021

This paper describes the Duluth system that participated in SemEval-2021 Task 11, NLPContributionGraph. It details the extraction of contribution sentences and scientific entities and their relations from scholarly articles in the domain of Natural Language Processing. Our solution uses DeBERTa for multi-class sentence classification to extract the contributing sentences and their types, and dependency parsing to outline each sentence and extract subject-predicate-object triples. Our system ranked fifth of seven for Phase 1 (end-to-end pipeline), sixth of eight for Phase 2 Part 1 (phrases and triples), and fifth of eight for Phase 2 Part 2 (triples extraction).
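
As an illustration of the dependency-parsing half of the pipeline, the sketch below pulls naive subject-predicate-object triples out of a parse using spaCy. This is a stand-in under assumed tooling (spaCy's `en_core_web_sm` model), not the rules the Duluth system actually used.

```python
import spacy

# Assumes the model has been installed via: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def naive_spo_triples(sentence):
    """Extract rough (subject, predicate, object) triples from the dependency
    parse; far cruder than the submitted system's extraction rules."""
    doc = nlp(sentence)
    triples = []
    for token in doc:
        if token.dep_ in ("nsubj", "nsubjpass"):
            predicate = token.head
            objects = [child for child in predicate.children
                       if child.dep_ in ("dobj", "attr", "dative")]
            for obj in objects:
                triples.append((token.text, predicate.lemma_, obj.text))
    return triples

print(naive_spo_triples("We introduce a dataset for contribution extraction."))
# Expected output along the lines of: [('We', 'introduce', 'dataset')]
```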

Recommended citation: Anna Martin and Ted Pedersen. (2021). "Duluth at SemEval-2021 Task 11: Applying DeBERTa to Contributing Sentence Selection and Dependency Parsing for Entity Extraction." In Proceedings of the Fifteenth Workshop on Semantic Evaluation (SemEval-2021), pages 490–501, Online. Association for Computational Linguistics. https://aclanthology.org/2021.semeval-1.60.pdf