EncycNet, funded by the German Research Foundation (DFG) and begun in 2019, aims to create a new semantic resource for historical German in the form of a richly annotated knowledge graph (a network of nodes and edges). By linking the knowledge contained within thousands of encyclopedia entries via machine learning classification and semantic web annotation, EncycNet provides a unique resource for a variety of historical, cultural, and computational linguistic research goals.

Publications and Code

What we have done so far (as of April 2021)

  • ingested 22 historical German encyclopedias from the early 18th to the early 20th century (over 49,300,000 word tokens), provided in XML format by TextGrid and Zeno.org
  • transformed the corpus to TEI Lex-0, to unify and disambiguate the original XML encoding, and to suggest changes to TEI methodology for encyclopedic texts
  • published the TEI corpus and transformation code
  • performed corpus annotation, including Part-of-Speech tagging, lemmatization, and orthographic normalization (see Acknowledgements, below) as well as Named Entity Recognition, and transformed the TEI to CSV files
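
The TEI-to-CSV step in the last item could look roughly like the following sketch. It is not the project's transformation code: the sample fragment is invented and heavily simplified, the column names (`entry_id`, `headword`, `text`) are assumptions, and only the element names `entry`, `form`, `orth`, `sense` and `def` are taken from TEI Lex-0.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, heavily simplified TEI Lex-0 fragment (not taken from the corpus).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <entry xml:id="e1">
      <form type="lemma"><orth>Apfel</orth></form>
      <sense><def>die Frucht des Apfelbaums</def></sense>
    </entry>
    <entry xml:id="e2">
      <form type="lemma"><orth>Birne</orth></form>
      <sense><def>die Frucht des Birnbaums</def></sense>
    </entry>
  </body></text>
</TEI>"""


def element_text(element):
    """Return the whitespace-normalized text of an element, or '' if it is missing."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def tei_to_rows(tei_xml):
    """Collect one row (id, headword, definition text) per <entry> element."""
    root = ET.fromstring(tei_xml)
    rows = []
    for entry in root.iterfind(".//tei:entry", TEI_NS):
        rows.append({
            "entry_id": entry.get(XML_ID, ""),
            "headword": element_text(entry.find("tei:form/tei:orth", TEI_NS)),
            "text": element_text(entry.find("tei:sense/tei:def", TEI_NS)),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to a CSV string with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["entry_id", "headword", "text"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(tei_to_rows(SAMPLE_TEI)))
```

A flat table of this kind is convenient for the annotation layers mentioned above, since each row can later be extended with further columns; the real files may of course be organized differently.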

Ongoing steps

  • splitting encyclopedia entries that contain multiple concepts
  • developing Hearst patterns from the encyclopedia texts to aid with triple extraction (Hearst patterns are lexico-syntactic patterns originally defined for recognizing hyponymy; for EncycNet, several other relation types are important as well)
  • creating a list of concepts from the encyclopedia entries by linking the entries to Wikipedia pages (including semantic disambiguation)
  • linking the concepts to hypernyms using Wikidata
  • extracting triples from the encyclopedia entries using explicitly annotated references, Hearst patterns, and typographic markup more generally
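
To illustrate what a Hearst pattern does, here is a minimal sketch of a single pattern for the German construction "X wie Y, Z und W" ("X such as Y, Z and W"). The regular expression, the relation label `hyponym_of` and the example sentence are assumptions made for this illustration; they are not the patterns developed in the project, and a realistic version would work on the lemmatized and Part-of-Speech-tagged corpus rather than on raw strings.

```python
import re

# Rough approximation of a German noun: a capitalized word.
NOUN = r"[A-ZÄÖÜ][a-zäöüß]+"

# "<hypernym> wie <hyponym>(, <hyponym>)* ((und|oder) <hyponym>)?"
SUCH_AS = re.compile(
    rf"\b(?P<hyper>{NOUN}) wie "
    rf"(?P<hypos>{NOUN}(?:, {NOUN})*(?: (?:und|oder) {NOUN})?)"
)


def extract_hyponym_triples(sentence):
    """Return (hyponym, 'hyponym_of', hypernym) triples found in a sentence."""
    triples = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group("hyper")
        # Split the enumeration on its separators to get the single hyponyms.
        for hyponym in re.split(r", | und | oder ", match.group("hypos")):
            triples.append((hyponym, "hyponym_of", hypernym))
    return triples


if __name__ == "__main__":
    example = "Man kennt viele Früchte wie Äpfel, Birnen und Pflaumen."
    for triple in extract_hyponym_triples(example):
        print(triple)
```

Patterns for other relation types would follow the same scheme with a different trigger phrase and relation label. Surface patterns like this one over-generate (for instance, "wie" also introduces comparisons), which is one reason automatically extracted triples need a confidence score.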

The EncycNet knowledge graph also aims to meet the following requirements. Each triple can be traced back to its origin, so that temporal information is retained. Because the triples will be extracted automatically, each triple will also be annotated with a confidence score. The graph will be linked to Wikipedia and the GND (Gemeinsame Normdatei, the Integrated Authority File) and can potentially be connected to other Linked Open Data resources as well. Lastly, tutorials for different use cases will be made available in the future to demonstrate how to work with EncycNet, with special attention to improving word embeddings with newfound historical-semantic knowledge.
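
As a sketch of what these requirements imply for the data model, a triple could carry its provenance and confidence score alongside subject, predicate and object. The class, its field names and the sample values below are hypothetical and chosen only for illustration; they do not describe the actual EncycNet schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """A triple with provenance and confidence (hypothetical field names)."""
    subject: str
    predicate: str
    obj: str
    source_encyclopedia: str   # which encyclopedia the triple comes from
    source_entry_id: str       # which entry it was extracted from
    publication_year: int      # retains the temporal information
    confidence: float          # score of the automatic extraction, in [0, 1]

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


def filter_triples(triples, min_confidence=0.0, first_year=None, last_year=None):
    """Keep triples above a confidence threshold and inside a time window."""
    selected = []
    for triple in triples:
        if triple.confidence < min_confidence:
            continue
        if first_year is not None and triple.publication_year < first_year:
            continue
        if last_year is not None and triple.publication_year > last_year:
            continue
        selected.append(triple)
    return selected


if __name__ == "__main__":
    # Invented sample data, not taken from the corpus.
    sample = [
        Triple("Apfel", "hyponym_of", "Frucht", "Beispiel-Lexikon A", "e1", 1750, 0.92),
        Triple("Apfel", "related_to", "Baum", "Beispiel-Lexikon B", "e7", 1905, 0.41),
    ]
    for triple in filter_triples(sample, min_confidence=0.5):
        print(triple)
```

Keeping the source and year on every triple makes it possible to restrict the graph to a period of interest, and the confidence score lets users trade coverage against precision.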

Team

Research assistants

  • Corinna Keupp
  • Maximilian Supplieth
  • Nicolas Werner

Contact

Thora Hagen

thora.hagen@uni-wuerzburg.de

Acknowledgements

Thanks to the German Text Archive (Deutsches Textarchiv), especially Bryan Jurish, who generously performed the orthographic normalization, lemmatization and Part-of-Speech tagging of all encyclopedia texts using CAB.