By Adam Kilgarriff

12 March 2014 - 16:18

Corpora can help us find out how words are used in language. Photo © Stinging Eyes, licensed under CC BY-SA 2.0 and adapted from the original.

What verb does 'negotiation' go with? Do you 'make' or 'conduct' a negotiation? Adam Kilgarriff, Director of Lexical Computing, explains how 'corpora' can help us answer such questions where dictionaries meet their limits. He presented a British Council seminar on the subject yesterday.

What is a corpus and how does it differ from a dictionary?

A corpus is a collection of texts. We call it a corpus (plural: corpora) when we use it for language research.

That makes your class's essays a corpus - a small one. It also makes the internet a corpus - a big one.

People writing dictionaries are in the vanguard of corpus linguistics. If you are writing a dictionary, the biggest crime is to miss things: to miss words, to miss phrases or idioms, to miss meanings of words. Lexicographers (the people who write dictionaries) have known for a long time that the best way to avoid missing things is to have a big corpus, and a computer. The computer can then find all the words (ordered by frequency) so a lexicographer can check the list to make sure that words are not missed.
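
As a rough illustration of the frequency list described above, here is a minimal Python sketch. It is not how any dictionary publisher's software actually works: the function name `frequency_list`, the toy corpus and the very simple rule for what counts as a word are all assumptions made for the example.

```python
import re
from collections import Counter


def frequency_list(texts):
    """Return (word, count) pairs for a list of texts, most frequent first."""
    counts = Counter()
    for text in texts:
        # Assumption: a "word" is a run of letters, optionally with
        # internal apostrophes (e.g. "class's"). Real tools tokenise
        # far more carefully.
        counts.update(re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()))
    return counts.most_common()


# A toy corpus, invented for this example.
toy_corpus = [
    "The talks stalled, and the talks collapsed.",
    "The negotiations dragged on.",
]

for word, count in frequency_list(toy_corpus)[:3]:
    print(word, count)
```

With a real corpus the list would run to many thousands of words, and the lexicographer would scan it for anything the dictionary does not yet cover.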

Words in context: Finding out how words are used in a language

The computer can also show them all the examples of a word in context. This is called a concordance. By running their eye over the concordance, lexicographers can find all the meanings of the word, and the phrases it occurs in.

If it is a big corpus, or a common word (or both), there might be thousands of examples of the word. Then, the computer can go one step further, and prepare a 'word sketch', a summary of the contexts, collocations and phraseology for the word.
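
A real word sketch depends on grammatical analysis of the corpus and on statistics that pick out the most salient collocations. The Python sketch below does something much cruder, purely to show the idea of summarising contexts: it counts the words that occur within a couple of positions of the target word. The function name `collocates`, the `window` parameter, the short stopword list and the example sentences are all assumptions made for this illustration.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked list of function words to ignore.
STOPWORDS = frozenset({"the", "a", "an", "of", "to", "and", "on", "in"})


def collocates(texts, keyword, window=2):
    """Count words appearing within `window` positions of `keyword`."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for i, token in enumerate(tokens):
            if token != keyword:
                continue
            # Take up to `window` words on each side of the keyword.
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(w for w in neighbours if w not in STOPWORDS)
    return counts.most_common()


# Toy sentences, invented for this example.
toy_corpus = [
    "They agreed to resume negotiations on Monday.",
    "Ministers will resume negotiations next week.",
    "The two sides conduct negotiations in secret.",
]

for word, count in collocates(toy_corpus, "negotiations")[:3]:
    print(word, count)
```

Raw counts like these favour words that are simply common, which is one reason a proper word sketch groups collocates by grammatical relation and ranks them by a statistical score rather than by frequency alone.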

This is how contemporary lexicography works. Lexicographers start from the word sketch, which gives them a good idea of what they must not miss. They then work out what different meanings, grammar and phraseology are shown by the collocations in the word sketch, and write definitions for them. They can also use the corpus as a source for example sentences.

When I say 'the computer', of course I mean an app that indexes the corpus and lets users make concordances and word sketches. Google is one app that does something like that (with the internet as its corpus). However, it is not designed for people doing language research. One that is widely used for making dictionaries, has lots of corpora in it, and made the screenshots above, is the Sketch Engine.

Dictionary-makers were leaders in corpus use. Following on were people writing language courses. They wanted to make sure that the facts they were teaching about the language were in fact true (!), and to teach common patterns before rare ones, and to use authentic examples of the patterns.

Should teachers use corpora?

So, in English language teaching, there is plenty of indirect corpus use, via dictionaries and course books. What about direct corpus use, by teachers, even students? Should you use corpora?

My answer is: yes - if the dictionary does not tell you enough. If you want to find out what 'negotiation' or 'secede' means, you could start from a corpus but it will be long and slow: better to look it up in your favourite dictionary. But if you know 'negotiation' and want to use it, but are not sure what verb to use it with, then the leading learners' dictionaries give little help. The word sketch, on the other hand, promptly shows you that people resume, (re)start, (re)open and conduct negotiations, and that negotiations stall, fail, get bogged down, drag on and even collapse (each item can be clicked to see examples of the collocation in use).

A second situation is that of a teacher marking work, whose English is good but who is not a native speaker. A student's essay has 'seceding out of Ukraine' - is that OK? A quick check of the concordance for 'secede' shows that a region secedes from a country.

Another consideration is always student motivation. If a class is currently engaged with volcanoes, it would be nice for them to look at the English of volcanoes (I've felt an affinity for volcanoes ever since my big end-of-primary-school project). So can we have a volcanic corpus? Yes! The Sketch Engine has an instant corpus tool, where text on a topic is gathered from the web in a few minutes (by a teacher or, as a class exercise, by the students) and this is then the data for a mini research project.
