Web content plays an increasingly important role in the knowledge-based society, and the preservation and long-term accessibility of Web history have high value (e.g., for scholarly studies, market analyses, and intellectual property disputes). Interest in its preservation is growing strongly among library and archival organizations as well as emerging industrial services. The characteristics of Web content (high dynamics, volatility, and variety of contributors and formats) make adequate Web archiving a challenge.

LiWA will look beyond merely “freezing” Web content snapshots for long periods, transforming pure snapshot storage into a “Living” Web Archive. “Living” refers to a) long-term interpretability as archives evolve, b) improved archive fidelity through filtering out irrelevant noise, and c) consideration of a wide variety of content. LiWA will extend the current state of the art and develop the next generation of Web content capture, preservation, analysis, and enrichment services to improve the fidelity, coherence, and interpretability of Web archives. By developing methods that improve archive fidelity, the project will contribute to adequate preservation of complete and high-quality content. By developing methods for improved archive coherence and interpretability, the project will help ensure the long-term usability of that content. LiWA RTD will focus on innovative methods for content capturing, filtering out spam and other noise, improving temporal archive coherence, and dealing with semantic and terminology evolution. Two exemplary LiWA applications, focusing on audiovisual streams and social Web content, respectively, will show the benefits of advanced Web archiving to interested stakeholders. To ensure demand-driven RTD development and broad, sustained project impact, the LiWA consortium will work closely with the International Internet Preservation Consortium (IIPC) as well as important library and archiving organizations, two of which are members of LiWA.

The LiWA project was funded by the European Commission under Project No. 21626.

Project pages

Nina Tahmasebi
Associate Professor in Natural Language Processing