What is a parallel corpus and what can you do with it?

This blog has been running for nearly six years and we’re coming up to 20,000 visits (25,000 if you count our former incarnation, https://ponyingtheslovos.wordpress.com/), yet in all this time we still haven’t explained in any detail what a parallel corpus is or what you can do once you have one. This post gives a few pointers.

What is a parallel corpus?

A parallel corpus consists of a text (or collection of texts, e.g. works by the same author) together with its translation(s) into one or more other languages. Once you have this, you can use parallel corpus software to help you identify how particular words, phrases etc. have been translated into the other language(s) and how consistent the translation has been.

To make a parallel corpus, you need an original text (e.g. A Clockwork Orange) and its translated text(s) (e.g. L’Orange Mecanique and/or La Naranja Mecanica) in computer-readable form. Once you have them, they need to be aligned by ‘translation unit’. To do this, you divide the original book into ‘units’ (often sentences or clauses) and then match each one to the equivalent ‘unit’ in the translation (which of course may not itself be a sentence or clause, as translation isn’t that simple). This is perhaps easiest to conceptualise with an example (unfortunately we are unable to share the full aligned versions of these texts for rights reasons).

The figure below shows the translation units for the first few lines of A Clockwork Orange aligned with the equivalent units in L’Orange Mecanique and in La Naranja Mecanica. This has to be done for the entire work before you can carry out parallel concordance analysis and find out how target items (words, phrases etc.) have been translated. Just looking at the first few lines of the book throws up some interesting differences between the French and Spanish translations. Why is milk-plus simply leche-plus in Spanish but du lait gonflé in French? Why is the Korova Milkbar rendered as El bar lácteo Korova in Spanish but Le Korova Milkbar in French?

Extract of aligned texts (beginning of ACO, LNM and LOM) in Excel
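For readers who prefer code to spreadsheets, the same row-per-unit layout can be sketched in Python. This is only an illustration: the names TranslationUnit and align_units are our own inventions, not part of any corpus tool, and the sketch assumes the hard part (segmenting each text and checking by hand that unit 1 matches unit 1, and so on) has already been done. The sample data are just the short fragments quoted above, not full translation units.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TranslationUnit:
    """One aligned row: a source unit plus its equivalent in each translation."""
    unit_id: int
    source: str
    targets: Dict[str, str]


def align_units(source_units: List[str],
                translations: Dict[str, List[str]]) -> List[TranslationUnit]:
    """Pair units by position (hypothetical helper).

    Assumes every translation has already been segmented so that the
    unit at position i corresponds to the source unit at position i.
    """
    for lang, units in translations.items():
        if len(units) != len(source_units):
            raise ValueError(
                f"{lang}: {len(units)} units, expected {len(source_units)}"
            )
    return [
        TranslationUnit(
            unit_id=index + 1,
            source=source,
            targets={lang: units[index] for lang, units in translations.items()},
        )
        for index, source in enumerate(source_units)
    ]


# Short fragments quoted in this post, not full translation units.
english = ["the Korova Milkbar", "milk-plus"]
translations = {
    "fr": ["Le Korova Milkbar", "du lait gonflé"],
    "es": ["El bar lácteo Korova", "leche-plus"],
}

aligned = align_units(english, translations)
for unit in aligned:
    print(unit.unit_id, unit.source, unit.targets)
```

The length check is deliberately strict: if one translation has more or fewer units than the original, the alignment is almost certainly off somewhere and needs fixing by hand before any analysis.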

Having translations in different languages available in this way allows us to adopt a rigorous procedure for comparing translations, as we have done for Nadsat in French and Spanish (covered on this blog here and here) and as Sophia Malamatidou did when looking at how Nadsat nouns were rendered in French, Spanish, Greek and German. Another possibility, taken up by Pat Corness, is to look at differences between translations carried out by the same translator (Robert Stiller, who translated A Clockwork Orange into Polish twice).

In a first for Ponying the Slovos, here’s a YouTube video showing how you create a parallel corpus on Sketch Engine:

What can you do with a parallel corpus?

Once you have fully aligned versions of the same text, you’re ready to analyse uses of words, phrases and other features in the text and how they compare across various translations (the results of which can be seen here, here and here). A number of different programmes can be used to do this. We have at various times used Laurence Anthony’s AntPConc (freeware) and David Woolls’ Multiconcord, but mostly we have relied on the subscription-based Sketch Engine, which is really user-friendly and has certain advantages, such as the option of searching by part of speech, not just by word. This allows for more varied and more powerful searches. I’ve covered some basic options in the video below.

These are just some of the options, of course – don’t rely on this blog for complete guidance. There are many great books out there on how to explore corpora.
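If you’re curious what a parallel concordance search boils down to, here is a minimal Python sketch. It is not how Sketch Engine, AntPConc or Multiconcord work internally; the function name parallel_concordance and the dictionary layout are assumptions made up for this example, and it only does whole-word matching on the source text, with no part-of-speech searching. The data are again just the fragments quoted earlier in this post.

```python
import re

# Each dict is one aligned translation unit (fragments quoted in this post).
ALIGNED_UNITS = [
    {"en": "the Korova Milkbar", "fr": "Le Korova Milkbar", "es": "El bar lácteo Korova"},
    {"en": "milk-plus", "fr": "du lait gonflé", "es": "leche-plus"},
]


def parallel_concordance(units, query, source_lang="en"):
    """Return every aligned unit whose source text contains the query.

    Hypothetical helper: matching is case-insensitive and anchored on
    word boundaries, so the query must appear as a whole word or phrase.
    """
    pattern = re.compile(rf"\b{re.escape(query)}\b", re.IGNORECASE)
    return [unit for unit in units if pattern.search(unit[source_lang])]


if __name__ == "__main__":
    # Show each hit alongside its French and Spanish equivalents.
    for hit in parallel_concordance(ALIGNED_UNITS, "milk-plus"):
        print(hit["en"], "|", hit["fr"], "|", hit["es"])
```

Because each hit carries its translations with it, you can read off how the same item was rendered in each language, which is the basic move behind the comparisons described above.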

What we’ve found is that using parallel translation corpora, especially in relation to Nadsat, has allowed us to examine the strategies individual translators adopt when faced with the same task of translating an invented language. The translators themselves may not always have been consciously aware of these strategies, but corpus methodologies have allowed us to make them visible.
