Overview
Machine Learning; Tools for data enrichment; Digital Humanities; Newspapers and magazines; Platform for data aggregation or retrieval
NewspAIper: AI-Based Metadata Enrichment of Historical Newspaper Collections
Historical newspaper collections are among the richest sources of information for humanities research. In recent years, many of these collections have been digitized and automatically transcribed via OCR to allow digital access to historical material. However, extracting specific articles from collections, finding related images, and linking related articles is often time-consuming and labor-intensive. Automatic article extraction and semantic enrichment of these historical newspaper collections would greatly improve their accessibility. This is exactly what we investigate in the DATA-KBR-BE project: AI-based methods to facilitate article-level search in historical collections.
Our annotation pipeline aims to enhance searchability through semantic data enrichment and cross-collection linkage of information. The NewspAIper demo, which is based on this pipeline, allows the user to query the collection interactively. Starting from the given OCR results of the collection, our pipeline performs article segmentation, named entity recognition, and semantic linking.
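The three stages named above can be pictured as a simple chain of functions. The following Python sketch is purely illustrative and is not the project's implementation: the `Article` schema, the function names, and the in-memory `knowledge_base` are all assumptions, and each stage is a deliberately naive placeholder (blank-line splitting, a capitalized-word pattern, exact lookup) standing in for the trained models an actual pipeline would use.

```python
from dataclasses import dataclass, field
import re


@dataclass
class Article:
    """One segmented article plus its enrichment results (hypothetical schema)."""
    text: str
    entities: list = field(default_factory=list)
    links: dict = field(default_factory=dict)


def segment_articles(ocr_text):
    # Placeholder segmentation: treat blank-line-separated blocks as articles.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", ocr_text)]
    return [Article(text=b) for b in blocks if b]


def recognize_entities(article):
    # Placeholder NER: runs of capitalized words stand in for a trained model,
    # so sentence-initial words such as "The" are (wrongly) picked up as well.
    pattern = r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*"
    article.entities = sorted(set(re.findall(pattern, article.text)))
    return article


def link_entities(article, knowledge_base):
    # Placeholder linking: exact lookup in a small in-memory knowledge base.
    article.links = {
        e: knowledge_base[e] for e in article.entities if e in knowledge_base
    }
    return article


def enrich(ocr_text, knowledge_base):
    # Run the stages in the order described: segment, recognize, link.
    return [
        link_entities(recognize_entities(a), knowledge_base)
        for a in segment_articles(ocr_text)
    ]
```

Calling `enrich(page_text, knowledge_base)` on the OCR text of a page would return one `Article` per detected block, each carrying its entity list and any identifiers found in the knowledge base. Keeping the stages as separate functions means each placeholder can be swapped for a real model without touching the others.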
Using the NewspAIper platform, humanities researchers can more easily extract information relevant to their research interests from the collection. Moreover, it provides interactive filters based on article date, language, and recognized entities, and allows users to browse similar articles and illustrations. The pipeline will be improved and used to construct corpora for subsequent research. As future work, we want to improve the semantic text enrichment by providing both topic detection and trend analysis (detecting subsequent articles on the same topic). Furthermore, toponyms found in the articles could be used to geolocate the historical images. These improvements will allow for even more fine-grained filtering capabilities.
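The filtering by date, language, and entities described above could, in its simplest form, look like the following Python sketch. It is an assumption-laden illustration rather than the platform's actual query interface: the `ArticleRecord` fields and the `filter_articles` function are hypothetical names, and a real deployment would more plausibly delegate such queries to a search index.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ArticleRecord:
    # Hypothetical metadata for one enriched article.
    title: str
    published: date
    language: str
    entities: frozenset


def filter_articles(records, start=None, end=None, language=None, entity=None):
    """Return records matching every filter that is set.

    A filter left as None imposes no constraint; the date range is inclusive.
    """
    selected = []
    for record in records:
        if start is not None and record.published < start:
            continue
        if end is not None and record.published > end:
            continue
        if language is not None and record.language != language:
            continue
        if entity is not None and entity not in record.entities:
            continue
        selected.append(record)
    return selected
```

For instance, `filter_articles(records, language="fr", entity="Bruxelles")` would keep only the French-language records whose entity set contains that name. Because every filter is optional and the conditions are combined with a logical AND, filters can be added or removed one at a time, which matches the interactive narrowing the platform describes.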