WebCorpus in a Nutshell

The WebCorpus execution chain enables the calculation of statistics on large corpora extracted from web crawls by using the massively distributable MapReduce framework Hadoop.


WebCorpus is a Hadoop-based Java tool chain for processing large corpora extracted from web crawls and computing statistics over them. It aims to generate information such as n-gram counts, cooccurrence counts, or isolated sentences from a large corpus of web pages for a language of choice.

Parallel processing of such tasks can lead to a huge performance benefit over serial processing. The MapReduce paradigm provides a programming model for parallel processing, and Hadoop is a massively distributable execution framework for such MapReduce algorithms. The system is designed as a pipeline of various Hadoop MapReduce jobs.
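
To illustrate the MapReduce programming model, the following is a minimal in-memory sketch of n-gram counting in plain Java. It is not the WebCorpus implementation and does not use the Hadoop API. The class name NGramCountSketch, its method names, and the whitespace tokenization are assumptions made for this example only. The three phases (map, shuffle, reduce) run sequentially here; in Hadoop, the framework would distribute them across machines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of WebCorpus or Hadoop.
class NGramCountSketch {

    // Map phase: emit one (n-gram, 1) pair for every n-gram in a sentence.
    // Assumption: tokens are separated by whitespace.
    static List<Map.Entry<String, Long>> map(String sentence, int n) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i + n <= tokens.length; i++) {
            String ngram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
            out.add(new SimpleEntry<>(ngram, 1L));
        }
        return out;
    }

    // Shuffle phase: group all emitted values by their key.
    static Map<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map, shuffle, and reduce one after another in a single process.
    static Map<String, Long> countNGrams(List<String> sentences, int n) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String sentence : sentences) {
            pairs.addAll(map(sentence, n));
        }
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sentences = List.of("the cat sat", "the cat ran");
        // Prints {cat ran=1, cat sat=1, the cat=2}
        System.out.println(countNGrams(sentences, 2));
    }
}
```

Because each call to map depends only on its own sentence and each call to reduce depends only on the values for one key, both phases can be split across many workers without coordination, which is where the performance benefit over serial processing comes from.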

Project Subtasks

  • Extract documents from the raw data and provide them for further tasks in a standardized format aligned with metadata such as document URL and crawl date. We will refer to this job as DocumentJob.
  • Web crawling usually leads to a lot of noise, so some basic cleanup tasks need to be performed:

    • Deduplication of documents, as a document might occur multiple times due to recrawling or context variations, such as a print page for a document that already occurred as a normal page. (DeduplicationJob, DeduplicationByHostJob)
    • Filtering of documents with malformed encodings. (UTF8Job)

  • Inner segmentation of documents for further processing:

    • Detection of paragraphs in documents. (DocumentJob)
    • Detection of sentences in paragraphs. (SentenceJob)

  • Filtering of sentences with a language other than the chosen language. (LanguageJob)
  • Annotate sentences with tokens and parts of speech. (SentenceAnnotateJob)
  • Generate corpus of n-grams. (NGramCountJob, POSNGramCountJob, NGramWithPOSCountJob)
  • Generate corpus of cooccurrences. (CooccurrenceJob)
  • Extract sentences with clearly detected language in a standardized format. (SentenceExtractJob)
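
The early cleanup and segmentation subtasks listed above can be sketched as a chain of simple stages. The following plain-Java example is only an illustration under stated assumptions: the class CleanupPipelineSketch and its methods are hypothetical stand-ins, not the actual UTF8Job, DeduplicationJob, or SentenceJob. Deduplication here means exact match after whitespace normalization, and sentence splitting uses a naive punctuation rule; the real jobs may use different criteria.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of WebCorpus.
class CleanupPipelineSketch {

    // Encoding filter: decode strictly as UTF-8, return null if the bytes are malformed.
    static String decodeStrict(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Deduplication (assumption): keep the first document of each group that is
    // identical after collapsing runs of whitespace.
    static List<String> deduplicate(List<String> docs) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String doc : docs) {
            String key = doc.trim().replaceAll("\\s+", " ");
            if (seen.add(key)) {
                unique.add(doc);
            }
        }
        return unique;
    }

    // Sentence detection (assumption): split after '.', '!' or '?' followed by whitespace.
    static List<String> splitSentences(String doc) {
        List<String> sentences = new ArrayList<>();
        for (String part : doc.split("(?<=[.!?])\\s+")) {
            String sentence = part.trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    // Chains the stages: encoding filter, then deduplication, then sentence splitting.
    static List<String> run(List<byte[]> rawDocs) {
        List<String> decoded = new ArrayList<>();
        for (byte[] raw : rawDocs) {
            String text = decodeStrict(raw);
            if (text != null) {
                decoded.add(text);
            }
        }
        List<String> sentences = new ArrayList<>();
        for (String doc : deduplicate(decoded)) {
            sentences.addAll(splitSentences(doc));
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<byte[]> raw = List.of(
                "Hello world. How are you?".getBytes(StandardCharsets.UTF_8),
                "Hello  world. How are you?".getBytes(StandardCharsets.UTF_8), // near-duplicate
                new byte[] {(byte) 0xC3, (byte) 0x28});                         // malformed UTF-8
        // Prints [Hello world., How are you?]
        System.out.println(run(raw));
    }
}
```

Each stage consumes the output of the previous one, which mirrors the idea of a pipeline in which the result of one job becomes the input of the next.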

See Figure 1 for a visualization of the proposed execution chain.


Figure 1: Visualization of the WebCorpus Pipeline

Further Information

Please see http://sourceforge.net/p/webcorpus for further information.
