The Web as a Parallel Corpus

Philip Resnik (University of Maryland), Noah A. Smith (Johns Hopkins University)
Parallel corpora have become an essential resource for work in multilingual natural language
processing. In this article, we report on our work using the STRAND system for mining parallel
text on the World Wide Web, first reviewing the original algorithm and results and then presenting
a set of significant enhancements. These enhancements include the use of supervised learning
based on structural features of documents to improve classification performance, a new content-based
measure of translational equivalence, and adaptation of the system to take advantage of the
Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these
techniques is demonstrated in the construction of a significant parallel corpus for a low-density
language pair.

Philip Resnik, Noah A. Smith. 2007. Computational Linguistics.
