Estonian Digital Humanities Conference 2018

Over the last decade, work on automatic detection of word sense change has focused primarily on detecting the main changes in a word's meaning. Most current methods rely on new, powerful embedding technologies but do not differentiate between the senses of a word, a distinction that is needed in many applications in the digital humanities. Treating each word as a single unit radically reduces the complexity of the problem, but it often fails to answer questions such as: what changed, how did it change, and when did the change occur?

In this talk, I will present methods for automatically detecting sense change from large amounts of diachronic data. I will focus on a study of a historical Swedish newspaper corpus, the Kubhist dataset, which contains digitized Swedish newspapers from 1749 to 1925. I will present our work on detecting and correcting OCR errors, normalizing spelling variation, and creating representations for individual words using a popular neural embedding method, Word2Vec.

Methods for creating (neural) word embeddings are the state of the art in sense change detection and in many other areas of study, and they have mainly been studied on English corpora, where the datasets are sufficiently large. I will discuss the limitations of such methods in this particular context: fairly small datasets with a high error rate, as is common for historical text in most languages. In addition, I will discuss the particularities of text mining methods for the digital humanities and what is needed to bridge the gap between computer science and the digital humanities.

Tartu, Estonia
Nina Tahmasebi
Researcher in Natural Language Processing