Using an Advanced Text Index Structure for Corpus Exploration in Digital Humanities

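In this paper we show that the symmetric compacted directed acyclic word graph (SCDAWG), a refinement of the suffix tree, offers an ideal basis for corpus exploration and helps to answer many of the questions raised in DH research in an elegant way.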

Authors: Tobias Englmeier, CIS, Ludwig-Maximilians University, Munich, Germany
Marco Büchler, Institute of Computer Science, University of Göttingen, Göttingen, Germany
Stefan Gerdjikov, FMI, University of Sofia “St. Kliment Ohridski”, Sofia, Bulgaria
Klaus U. Schulz, CIS, Ludwig-Maximilians University, Munich, Germany

Reposted from: Digital Humanities Quarterly, 2021, Volume 15 Number 1, http://www.digitalhumanities.org/dhq/vol/15/1/000526/000526.html

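With suitable index structures, many corpus exploration tasks can be solved efficiently without rescanning the text repository online. In this paper we show that symmetric compacted directed acyclic word graphs (SCDAWGs), a refinement of suffix trees, offer an ideal basis for corpus exploration and help to answer many of the questions raised in DH research in an elegant way. From a simplified point of view, the advantages of SCDAWGs rest on two properties. First, the index can be computed in linear time and offers a joint view of the similarities (in terms of common substrings) and differences among all texts. Second, structural regularities of the index help to mine interesting portions of texts (such as phrases and concept names) and their relationships in a language-independent way, without using prior linguistic knowledge. To demonstrate the power of these principles, we look at text alignment, text reuse in distinct texts or between distinct authors, automated detection of concepts, the temporal distribution of phrases in diachronic corpora, and related problems.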
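As a rough illustration of how an index can expose common substrings without rescanning a text, the sketch below builds a plain suffix automaton over one text and then streams a second text through it to find their longest shared passage. It is not the SCDAWG described in the article: the automaton here is uncompacted, reads in one direction only, and indexes a single text. The class name `SuffixAutomaton` and its method names are hypothetical and chosen only for this example.

```python
class SuffixAutomaton:
    """Minimal suffix automaton (an uncompacted, one-directional DAWG).

    Illustrative stand-in only; this is NOT the SCDAWG from the article.
    """

    def __init__(self, text):
        # State 0 is the initial state. For every state we keep its
        # outgoing transitions, its suffix link, and the length of the
        # longest substring that leads to it.
        self.next = [{}]
        self.link = [-1]
        self.length = [0]
        last = 0
        for ch in text:
            last = self._extend(last, ch)

    def _new_state(self, length, transitions, link):
        self.next.append(transitions)
        self.link.append(link)
        self.length.append(length)
        return len(self.length) - 1

    def _extend(self, last, ch):
        # Standard online construction: add one character at a time.
        cur = self._new_state(self.length[last] + 1, {}, -1)
        p = last
        while p != -1 and ch not in self.next[p]:
            self.next[p][ch] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][ch]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q by cloning it so that state lengths stay consistent.
                clone = self._new_state(
                    self.length[p] + 1, dict(self.next[q]), self.link[q]
                )
                while p != -1 and self.next[p].get(ch) == q:
                    self.next[p][ch] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        return cur

    def longest_common_substring(self, other):
        """Return the longest substring shared by the indexed text and `other`."""
        state, matched = 0, 0
        best_len, best_end = 0, 0
        for i, ch in enumerate(other):
            # Follow suffix links until the current match can be extended.
            while state != 0 and ch not in self.next[state]:
                state = self.link[state]
                matched = self.length[state]
            if ch in self.next[state]:
                state = self.next[state][ch]
                matched += 1
            else:
                matched = 0  # no match possible; we are at the initial state
            if matched > best_len:
                best_len, best_end = matched, i + 1
        return other[best_end - best_len:best_end]


index = SuffixAutomaton("the quick brown fox")
shared = index.longest_common_substring("a quick brown dog")
```

The index is built once and the second text is scanned once, left to right, which mirrors the abstract's point that search tasks can be answered from the index instead of rescanning the repository. Extending this to many texts at once, to both reading directions, and to compacted edges is where the structure discussed in the article goes beyond this sketch.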

About the authors:

Tobias Englmeier

Tobias Englmeier is a PhD candidate at the Centrum für Informations- und Sprachverarbeitung (CIS) at the Ludwig Maximilians University of Munich. His PhD project centers on string matching and OCR postcorrection. Additionally, he has been involved in the conception and implementation of numerous Digital Humanities projects coordinated by the IT Gruppe Geisteswissenschaften (ITG) at the Ludwig Maximilians University of Munich.

Marco Büchler 

Marco Büchler holds a Diploma in Computer Science. From 2006 to 2014 he worked as a Research Associate in the Natural Language Processing Group at Leipzig University. From April 2008 to March 2011 Marco served as the technical Project Manager for the eAQUA project and continued to work in that capacity for the following eTRACES project. In March 2013 he received his PhD in eHumanities. Since May 2014 he has led a Digital Humanities Research Group at the Göttingen Centre for Digital Humanities. His research includes Natural Language Processing on Big Humanities Data. Specifically, he works on Historical Text Reuse Detection and its application in the business world. In addition to his primary responsibilities, Marco manages the Medusa project (Big Scale co-occurrence and NGram framework) as well as the TRACER machine for detecting historical text reuse.

Stefan Gerdjikov 

Stefan Gerdjikov is an Assistant Professor at the Faculty for Informatics and Mathematics at the University of Sofia. He holds a PhD degree in Mathematics from the University of Sofia. His primary research area is Natural Language Processing, where he studies approximate search techniques and index structures for text mining.

Klaus U. Schulz 

Klaus U. Schulz is Professor of Computational Linguistics and, since 1992, the technical director of the Centrum für Informations- und Sprachverarbeitung (CIS) at the Ludwig Maximilians University of Munich. The work of Professor Schulz concentrates on Semantic Search, Construction of Ontologies and Taxonomies, Digital Libraries, Language Technology for Optical Character Recognition and Document Analysis, and Finite-State Technology.
