DIVID-DJ: Data Extraction and Interactive Visualization of Unexplored Textual Datasets for Investigative Data-Driven Journalism


Journalists investigate novel and relevant stories in order to report on them in published articles. Often, such stories need to be discovered within large text document collections with different levels of confidentiality (e.g. public parliament records, collections like WikiLeaks, or secret documents from an informer). Typically, those resources largely consist of unstructured textual content with additional metadata such as date, location, sender and receiver of messages, etc.
Investigating unstructured document collections is a laborious task: the sheer amount of content can be vast, e.g., the WikiLeaks PlusD dataset released in 2013 contains around 1.7 million cables. The plain-text parts mostly contain uninteresting content that conceals the crucial storylines, and if journalists do not know in advance what to look for in the document collections, they can only vaguely target all people and organizations ("named entities") of public interest. There is a plethora of statistical tools used in data journalism that automatically reveal and visualize interesting correlations hidden within large structured datasets. For unstructured text, by contrast, there are hardly any automated tools that reveal relevant named entities or give any other aggregate view of them.
As an additional constraint, the texts have to be analyzed under time pressure, because the journalistic value of each story decreases rapidly if other media publish it first. Moreover, the workload for processing confidential documents is frequently allocated to the few journalists who are allowed to access them.
To meet these challenges, we identified automated support for immediate extraction of valuable information from large text document collections as vital for journalists.


In this project, we will combine NLP technology with advanced InfoVis interfaces into a novel tool for investigative data-driven journalism that addresses research questions regarding the combination of NLP and InfoVis techniques.
After document conversion, provided by the Spiegel IT department, NLP tools will extract named entities and their relationships from document collections. This is the only language-dependent part, because it requires a named entity recognizer for our target languages, English and German. Further, relevant keywords for entities and groups of entities will be extracted using statistical NLP techniques based on word co-occurrence.
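To make the co-occurrence idea concrete, the following is a minimal, self-contained sketch (not the project's actual pipeline) that ranks words co-occurring with a given entity by pointwise mutual information, one common statistical association measure; the sentence data and function name are invented for illustration:

```python
import math
from collections import Counter

def keywords_for_entity(sentences, entity, top_n=3):
    """Rank words co-occurring with `entity` by pointwise mutual information.

    `sentences` is a list of token lists (already tokenized text).
    A toy stand-in for the statistical co-occurrence measures in the text.
    """
    word_freq = Counter()   # sentences containing each word
    cooc_freq = Counter()   # sentences containing both word and entity
    total = 0
    for tokens in sentences:
        total += 1
        seen = set(tokens)
        word_freq.update(seen)
        if entity in seen:
            cooc_freq.update(seen - {entity})
    scores = {}
    for word, joint in cooc_freq.items():
        # PMI = log( p(entity, word) / (p(entity) * p(word)) )
        p_joint = joint / total
        p_entity = word_freq[entity] / total
        p_word = word_freq[word] / total
        scores[word] = math.log(p_joint / (p_entity * p_word))
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

In a realistic setting one would add frequency thresholds and a significance measure such as the log-likelihood ratio, since raw PMI over-ranks rare words.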
The results of the natural language processing stage will be used as input for interactive visualization, developed specifically for this purpose in this project. The aggregate view of the extracted entities forms the basic structure of the entity graph, which serves as visual access to the document collection. The novel visualization will need to show diverse information about entities and their relationships, such as data sources, data frequencies or data changes over time.
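The aggregation step behind such an entity graph can be sketched as follows; this is a toy illustration, and the document structure and entity names are invented, not the project's actual data model:

```python
from collections import Counter
from itertools import combinations

def build_entity_graph(doc_entities):
    """Aggregate per-document entity lists into a weighted co-occurrence graph.

    `doc_entities` maps document ids to the named entities found in them.
    The edge weight between two entities is the number of documents that
    mention both, which a visualization layer can render as link thickness.
    """
    edges = Counter()
    for entities in doc_entities.values():
        # Every unordered pair of entities in a document adds one edge count.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges
```

Time-sliced views (entity networks over time, as mentioned below) would simply compute one such graph per time bucket of the document metadata.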

Added Value for Science and Journalism

The project will push forward both science and investigative journalism. On the scientific side, we will cover both language technologies and information visualization. The main contribution of this project is the connection of these technologies for use in investigative journalism.
In the information visualization area, we will contribute novel guidelines for effective visual designs and interaction techniques, specifically for the purposes of investigative journalism. This includes the visualization of very large collections, the visualization of the progression of entity networks over time, and interaction with the visualization for effective information access.
The key contribution on the natural language processing side is the effective combination of named entity recognition and keyword extraction for use in collection visualization and information access. Further research questions include efficient indexing for entering the local view, the ranking and selection of keywords, and the integration of tools that identify and resolve temporal references within the documents.
The added value for journalism is threefold: First, journalists will get their hands on a tool that does not require extensive training and allows them to quickly spot interesting parts of the dataset. Second, the tool supports sharing particular views, observations and comments within a team of journalists. Third, the tool supports publishing interesting data views along with the story article, allowing readers to explore the relevant parts of the data themselves, which increases transparency.

Follow our blog at http://newsleak.io/ for updates!


Project Data

Funding Body: Volkswagenstiftung

Project volume: 96K Euro

Project Duration: Nov 2015 - Oct 2016

Project Partners

1. Spiegel-Verlag Rudolf Augstein GmbH & Co. KG, Hamburg, Germany

Principal Investigator: Marcel Rosenbach
Coordinator: Dr. Michaela Regneri
Main competence and project contribution: Journalist, Knowledge Transfer and Media Partner

2. Technische Universität Darmstadt, FG Language Technology, Darmstadt, Germany

Principal Investigator: Prof. Dr. Chris Biemann
Executive Staff: Dr. Alexander Panchenko
Main competence and project contribution: Natural language processing of large text corpora

3. Technische Universität Darmstadt, Interactive Graphics Systems Group, Darmstadt, Germany

Principal Investigator: Dr. Tatiana von Landesberger
Executive Staff: Kathrin Ballweg
Main competence and project contribution: Visual analysis of large datasets, incl. interactive data visualization
