TAXI: a Taxonomy Induction System

This page contains an implementation of TAXI, a taxonomy induction method that ranked first in the SemEval 2016 challenge on taxonomy extraction evaluation. The method builds a taxonomy from a domain vocabulary: it extracts hypernyms from substrings and from large domain-specific corpora bootstrapped from the input vocabulary. Evaluations on the SemEval taxonomy extraction datasets, covering four languages and three domains, show state-of-the-art performance. The implementation includes all resources needed to reproduce the experiments described in:

Panchenko A., Faralli S., Ruppert E., Remus S., Naets H., Fairon C., Ponzetto S. P., and Biemann C. "TAXI: a Taxonomy Induction Method based on Lexico-Syntactic Patterns, Substrings and Focused Crawling". Proceedings of the 10th International Workshop on Semantic Evaluation, Association for Computational Linguistics, San Diego, CA, USA, 2016.

To cite the system in a scientific publication, please use the following BibTeX reference:

@inproceedings{panchenko2016taxi,
title={TAXI: a Taxonomy Induction Method based on Lexico-Syntactic Patterns, Substrings and Focused Crawling},
author={Panchenko, Alexander and Faralli, Stefano and Ruppert, Eugen and Remus, Steffen and Naets, Hubert and Fairon, Cedrick and Ponzetto, Simone Paolo and Biemann, Chris},
booktitle={Proceedings of the 10th International Workshop on Semantic Evaluation},
year={2016},
organization={Association for Computational Linguistics}
}

Motivation

We consider taxonomy induction a process that should, as much as possible, be driven solely by raw text processing. While some labeled examples might be used to tune the extraction and induction process, we avoid relying on structured lexical resources such as WordNet or BabelNet. Instead, we envision a situation where a taxonomy must be induced for a new domain or a new language for which such resources do not exist. Where they do exist, there is little need for induction, and in application-based scenarios induced and existing taxonomies can still be merged. In the paper, we demonstrate our methodology by running hyponymy pattern extraction on general-domain and domain-specific corpora for four languages.

Method

Our approach favors scalability and simplicity, on the assumption that being able to process more input data matters more than the sophistication of the method. It takes as input a set of domain terms and general-domain text corpora, and it outputs a taxonomy. It consists of four steps. First, we crawl domain-specific corpora based on the terminology of the target domain. These complement general-purpose corpora, such as the text of Wikipedia articles. Second, we extract candidate hypernyms based on substrings and lexico-syntactic patterns. Third, we prune these candidates so that each term keeps only its few most salient hypernyms. Fourth, we optimize the overall taxonomy structure by removing cycles and linking disconnected components to the root.
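To make the substring and pruning steps more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the released TAXI implementation: the function names, the length-ratio score, the `top_k` parameter and the English-style "hypernym is the final word(s) of the hyponym" match are all chosen for this example. Crawling, pattern-based extraction and cycle removal are left out; with substring matches alone no cycle can arise, because a hypernym is always strictly shorter than its hyponym.

```python
from collections import defaultdict


def substring_hypernyms(terms):
    """Score (hyponym, hypernym) pairs where the hypernym is the head of the hyponym.

    Assumption for this sketch: the score is the length ratio len(hypernym) / len(hyponym),
    so longer (more specific) matching hypernyms score higher.
    """
    vocab = {t.lower() for t in terms}
    scores = {}
    for hypo in vocab:
        for hyper in vocab:
            # e.g. "computer science" ends with " science"
            if hypo.endswith(" " + hyper):
                scores[(hypo, hyper)] = len(hyper) / len(hypo)
    return scores


def prune(scores, top_k=1):
    """Keep only the top_k best-scoring hypernyms for each hyponym."""
    candidates = defaultdict(list)
    for (hypo, hyper), score in scores.items():
        candidates[hypo].append((score, hyper))
    edges = set()
    for hypo, cands in candidates.items():
        for _, hyper in sorted(cands, reverse=True)[:top_k]:
            edges.add((hypo, hyper))
    return edges


def attach_to_root(edges, terms, root):
    """Link every term that has no hypernym to the root, so the result is connected."""
    has_parent = {hypo for hypo, _ in edges}
    result = set(edges)
    for term in {t.lower() for t in terms}:
        if term != root and term not in has_parent:
            result.add((term, root))
    return result


def induce_taxonomy(terms, root, top_k=1):
    """Run the three sketched steps and return a set of (hyponym, hypernym) edges."""
    scores = substring_hypernyms(terms)
    return attach_to_root(prune(scores, top_k), terms, root)
```

For example, calling `induce_taxonomy(["science", "computer science", "theoretical computer science", "physics"], root="science")` links "theoretical computer science" to "computer science" rather than directly to "science", because the longer match has the higher length ratio, and attaches "physics" to the root because no substring hypernym is found for it.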

Downloads

An implementation of the method, including all resources, is available for download at http://tudarmstadt-lt.github.io/taxi.
