Recently, scholars have begun to draw on the massive amounts of text data made available through Google’s large-scale book digitization project in order to track the development of cultural concepts and words over the last two centuries, announcing a new field of research named “culturomics” by its originators.
However, these initial studies have been (rightly) criticized for not referring to relevant work in linguistics and language technology. Nevertheless, the basic premise of this endeavor is eminently timely. Even if we restrict ourselves to the written language, vast amounts of new Swedish text are produced every year and older texts are being digitized apace in cultural heritage projects. This abundance of text contains enormous riches of information. However, the volumes of Swedish text available in digital form have grown far beyond the capacity of even the the fastest reader, leaving automated semantic processing of the texts as the only realistic option for accessing and using this information.
The main aim of this research program is to advance the state of the art in language technology resources and methods for semantic processing of Swedish text, in order to provide researchers and others with more sophisticated tools for working with the information contained in large volumes of digitized text, e.g., by being able to correlate and compare the content of texts and text passages on a large scale.
The project will focus on the theoretical and methodological advancement of the state of the art in extracting and correlating information from large volumes of Swedish text using a combination of knowledge-based and statistical methods. One central aim of this project will be to develop methodology and applications in support of research in disciplines where text is an important primary research data source, primarily the humanities and social sciences (HSS).
The innovative core of the project will be the exploration of how to best combine knowledge-rich but sometimes resource-consuming LT processing with statistical machine-learning and data-mining approaches.