Katana VentraIP

Corpus linguistics

Corpus linguistics is an empirical method for the study of language by way of a text corpus (plural corpora).[1] Corpora are balanced, often stratified collections of authentic, "real world" texts of speech or writing that aim to represent a given linguistic variety.[1] Today, corpora are generally machine-readable data collections.

Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field, the natural context ("realia") of that language, with minimal experimental interference. Large collections of text (though corpora may also be small in terms of running words) allow linguists to run quantitative analyses on linguistic concepts that may be difficult to test in a qualitative manner.[2]
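
As a minimal illustration of the kind of quantitative analysis a corpus makes possible, the following Python sketch counts word frequencies and normalises them per million running words. The two sample sentences, the naive regular-expression tokeniser and the function names are assumptions made for illustration only; real corpus work would normally rely on a dedicated tokeniser and a far larger collection.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (deliberately naive)."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequencies(texts, per=1_000_000):
    """Return (counts, freq), where freq gives each word's rate per `per` running words."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    total = sum(counts.values())
    freq = {word: n * per / total for word, n in counts.items()}
    return counts, freq

# Invented sample data, not drawn from any real corpus.
sample_corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]
counts, freq = relative_frequencies(sample_corpus)
```

Normalising by corpus size is what allows frequencies from corpora of different sizes to be compared, which raw counts do not.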

The text-corpus method uses the body of texts in any natural language to derive the set of abstract rules which govern that language. Those results can be used to explore the relationships between that subject language and other languages which have undergone a similar analysis. The first such corpora were manually derived from source texts, but now that work is automated.

Corpora have not only been used for linguistics research; since 1969 they have also been increasingly used to compile dictionaries (starting with The American Heritage Dictionary of the English Language in 1969) and reference grammars, the first of which was A Comprehensive Grammar of the English Language, published in 1985.

Experts in the field have differing views about the annotation of a corpus. These views range from John McHardy Sinclair, who advocates minimal annotation so texts speak for themselves,[3] to the Survey of English Usage team (University College, London), who advocate annotation as allowing greater linguistic understanding through rigorous recording.[4]

Corpus linguistics has generated a number of research methods, which attempt to trace a path from data to theory. Wallis and Nelson (2001)[21] first introduced what they called the 3A perspective: Annotation, Abstraction and Analysis.

Annotation consists of the application of a scheme to texts. Annotations may include structural markup, tagging, parsing, and numerous other representations.

Abstraction consists of the translation (mapping) of terms in the scheme to terms in a theoretically motivated model or dataset. Abstraction typically includes linguist-directed search but may include, e.g., rule-learning for parsers.

Analysis consists of statistically probing, manipulating and generalising from the dataset. Analysis might include statistical evaluations, optimisation of rule-bases or knowledge discovery methods.
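
The three stages of the 3A perspective can be sketched as a toy pipeline in Python. This is only an illustration under stated assumptions, not Wallis and Nelson's own procedure: the tiny tag lexicon, the research question (how often a noun is directly preceded by a determiner) and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a real annotation scheme.
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN", "mat": "NOUN",
    "sat": "VERB", "chased": "VERB",
    "on": "PREP",
}

def annotate(tokens):
    """Annotation: apply the scheme to the text, here one tag per token."""
    return [(tok, TOY_LEXICON.get(tok, "UNK")) for tok in tokens]

def abstract(tagged):
    """Abstraction: map annotated tokens onto a dataset for one research question.

    Each noun becomes one case, recording whether a determiner precedes it.
    """
    cases = []
    for i, (tok, tag) in enumerate(tagged):
        if tag == "NOUN":
            preceded = i > 0 and tagged[i - 1][1] == "DET"
            cases.append({"noun": tok, "det_before": preceded})
    return cases

def analyse(cases):
    """Analysis: generalise from the dataset, here a simple proportion."""
    outcome = Counter(case["det_before"] for case in cases)
    total = sum(outcome.values())
    return outcome[True] / total if total else 0.0

# Invented sample text, already split into lowercase tokens.
tokens = "the cat sat on the mat a dog chased cat".split()
tagged = annotate(tokens)
cases = abstract(tagged)
proportion = analyse(cases)
```

Keeping the stages as separate functions mirrors the point of the 3A perspective: the same annotated text can feed several different abstractions, and each resulting dataset can be analysed in more than one way.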

Most lexical corpora today are part-of-speech-tagged (POS-tagged). However, even corpus linguists who work with 'unannotated plain text' inevitably apply some method to isolate salient terms. In such situations annotation and abstraction are combined in a lexical search.
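
A lexical search over unannotated plain text can be illustrated with a simple keyword-in-context (KWIC) concordance. The sketch below is a minimal example under assumptions: the sample text, the `kwic` function name and the word-level regular expression are illustrative choices, not part of any particular corpus manager.

```python
import re

def kwic(text, term, width=3):
    """Return keyword-in-context tuples: up to `width` words either side of each hit."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == term:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample text with no annotation of any kind.
plain_text = "The cat sat on the mat. The dog chased the cat."
hits = kwic(plain_text, "cat", width=2)
```

Even here the choice of tokeniser and search term acts as an implicit annotation scheme, which is the sense in which annotation and abstraction are combined in a lexical search.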

The advantage of publishing an annotated corpus is that other users can then perform experiments on the corpus (through corpus managers). Linguists whose interests and perspectives differ from the originators' can exploit this work. By sharing data, corpus linguists are able to treat the corpus as a locus of linguistic debate and further study.[22]



A Linguistic Atlas of Early Middle English


Collostructional analysis



Keyword (linguistics)

Linguistic Data Consortium

List of text corpora

Machine translation

Natural Language Toolkit

Pattern grammar

Search engines: they access the "web corpus"

Semantic prosody

Speech corpus

Text corpus

Translation memory


Word list


