Document Type



This paper explores new methods for locating the sources used to write a text by fine-tuning a variety of language models to rerank candidate sources. These methods promise to shed new light on traditions with complex citational practices, such as medieval Arabic, where citations are ambiguous and the boundaries of quotation are poorly defined. After retrieving candidate sources with a baseline BM25 retrieval model, we test a variety of reranking methods to see how effective they are at the task of source attribution. We conduct experiments on two datasets, English Wikipedia and medieval Arabic historical writing, and employ a variety of retrieval- and generation-based reranking models. In particular, we seek to understand how the degree of supervision required affects the performance of various reranking models. We find that semi-supervised methods can be nearly as effective as fully supervised methods while avoiding potentially costly span-level annotation of the target and source documents.

Publication (Name of Journal)

CEUR Workshop Proceedings