EncycNet, funded by the German Research Foundation (DFG) and begun in 2019, aims to create a new semantic resource for historical German in the form of a richly annotated knowledge graph (a network of nodes and edges). By linking the knowledge contained in thousands of encyclopedia entries via machine learning classification and semantic web annotation, EncycNet provides a unique resource for a variety of historical, cultural, and computational linguistic research goals.
Publications and Code
- Thora Hagen, Erik Ketzan, Fotis Jannidis and Andreas Witt, “Twenty-Two Historical Encyclopedias Encoded in TEI: A New Resource for the Digital Humanities”, LaTeCH-CLfL 2020: 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, December 2020.
- TEI-transformed corpus and ODD
- Original XML corpus
- TEI transformation with XSLT
What we have done so far (as of April 2021)
- ingested 22 historical German encyclopedias from the early 18th to the early 20th century (over 49,300,000 word tokens) in XML format, provided by TextGrid and Zeno.org
- transformed the corpus to TEI Lex-0, to unify and disambiguate the original XML encoding, and to suggest changes to TEI methodology for encyclopedic texts
- published the TEI corpus and transformation code
- performed corpus annotation, including Part-of-Speech tagging, lemmatization, orthographic normalization (see Acknowledgements below) and Named Entity Recognition, and transformed the TEI to CSV files
What we are currently working on
- splitting encyclopedia entries that contain multiple concepts
- developing Hearst patterns (lexico-syntactic patterns originally devised for recognizing hyponymy; for EncycNet, several further relation types are of importance) over the encyclopedia texts to aid triple extraction
- creating a list of concepts from the encyclopedia entries by linking the entries to Wikipedia pages (including semantic disambiguation)
- linking the concepts to hypernyms using Wikidata
- extracting triples from the encyclopedia entries by utilizing explicitly annotated references, Hearst patterns, and typography tags
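To illustrate the Hearst-pattern step above, here is a minimal sketch in Python. The two patterns shown (hyponymy via "ist ein/eine" and enumeration via "wie ... und") are simplified illustrative assumptions; the actual EncycNet patterns cover more relation types and operate on annotated text rather than raw sentences.

```python
import re

# Illustrative Hearst-style patterns for German encyclopedia prose.
# These are simplified assumptions for demonstration, not EncycNet's
# actual pattern inventory.
PATTERNS = [
    # "Der Adler ist ein Vogel" -> (Adler, hyponym_of, Vogel)
    (re.compile(r"(?:Der|Die|Das)\s+(\w+)\s+ist\s+ein(?:e)?\s+(\w+)"),
     "hyponym_of"),
    # "Metalle wie Eisen und Kupfer" -> (Eisen, hyponym_of, Metalle), ...
    (re.compile(r"(\w+)\s+wie\s+(\w+)\s+und\s+(\w+)"),
     "enumeration"),
]

def extract_triples(sentence):
    """Return (subject, relation, object) triples found in a sentence."""
    triples = []
    for pattern, relation in PATTERNS:
        for match in pattern.finditer(sentence):
            groups = match.groups()
            if relation == "hyponym_of":
                triples.append((groups[0], "hyponym_of", groups[1]))
            else:
                # Enumeration: each listed item is a hyponym of the head noun.
                head = groups[0]
                for item in groups[1:]:
                    triples.append((item, "hyponym_of", head))
    return triples
```

For example, `extract_triples("Der Adler ist ein Vogel.")` yields the single triple `("Adler", "hyponym_of", "Vogel")`.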
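The Wikidata hypernym linking could, for instance, be approached via the public SPARQL endpoint. The sketch below only builds a query string (no request is sent); approximating hypernymy with the "subclass of" property P279 is an assumption for illustration, and the QID in the usage note is a placeholder.

```python
# Hedged sketch: construct a SPARQL query against Wikidata to fetch
# hypernyms of a concept, approximated here by P279 ("subclass of").
# The query is only built, not executed.
def hypernym_query(qid: str) -> str:
    """Return a SPARQL query string for the hypernyms of a Wikidata item."""
    return f"""
    SELECT ?hypernym ?hypernymLabel WHERE {{
      wd:{qid} wdt:P279 ?hypernym .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "de,en". }}
    }}
    """
```

Calling `hypernym_query("Q42")` (placeholder QID) produces a query that could be posted to the Wikidata Query Service endpoint.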
The EncycNet knowledge graph also aims to meet the following requirements:
- each triple can be traced back to its origin, so that temporal information is retained
- as the triples are extracted automatically, each triple is annotated with a confidence score
- the graph will be linked to Wikipedia and the GND, and can potentially be connected to other Linked Open Data resources as well
Lastly, tutorials for different use cases will be made available in the future to demonstrate how to work with EncycNet, with special attention to improving word embeddings with newfound historical-semantic knowledge.
Project Team
- Fotis Jannidis, University of Würzburg
- Andreas Witt, IDS Mannheim & University of Cologne
- Thora Hagen, University of Würzburg
- Erik Ketzan, University of Cologne
- Corinna Keupp
- Maximilian Supplieth
- Nicolas Werner
Acknowledgements
Thanks to the German Text Archive (Deutsches Textarchiv), especially Bryan Jurish, who generously performed the orthographic normalization, lemmatization and Part-of-Speech tagging of all encyclopedia texts using CAB.