Corpus linguistics

Corpus linguistics is an empirical method for the study of language by way of a text corpus (plural corpora).^[1] Corpora are balanced, often stratified collections of authentic, "real world", text of speech or writing that aim to represent a given linguistic variety.^[1] Today, corpora are generally machine-readable data collections.

Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field, in the natural context ("realia") of that language, with minimal experimental interference. Large collections of text (though corpora may also be small in terms of running words) allow linguists to run quantitative analyses on linguistic concepts that may be difficult to test in a qualitative manner.^[2]
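As an illustration of the kind of quantitative analysis meant here, the following minimal sketch computes a word's relative frequency, normalised per million running words so that corpora of different sizes can be compared. The tokeniser, the function names and the toy text are assumptions made for this example only; real corpus tools use far more careful tokenisation.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def per_million(tokens, word):
    """Relative frequency of `word`, normalised per million running words."""
    if not tokens:
        raise ValueError("corpus contains no tokens")
    counts = Counter(tokens)
    return counts[word] * 1_000_000 / len(tokens)


# Toy text standing in for a corpus (hypothetical data).
toy_corpus = "The cat sat on the mat. The dog sat too."
tokens = tokenize(toy_corpus)

print(len(tokens))                 # number of running words
print(per_million(tokens, "the"))  # normalised frequency of "the"
```

Normalising by corpus size is what makes counts from a small corpus comparable with counts from a large one.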

The text-corpus method uses the body of texts in any natural language to derive the set of abstract rules which govern that language. Those results can be used to explore the relationships between that subject language and other languages which have undergone a similar analysis. The first such corpora were manually derived from source texts, but now that work is automated.

Corpora have not only been used for linguistics research; since 1969 they have also been increasingly used to compile dictionaries (starting with The American Heritage Dictionary of the English Language in 1969) and reference grammars, with A Comprehensive Grammar of the English Language, published in 1985, as the first.

Experts in the field have differing views about the annotation of a corpus. These views range from John McHardy Sinclair, who advocates minimal annotation so texts speak for themselves,^[3] to the Survey of English Usage team (University College, London), who advocate annotation as allowing greater linguistic understanding through rigorous recording.^[4]

Corpus linguistics has generated a number of research methods, which attempt to trace a path from data to theory. Wallis and Nelson (2001)^[21] first introduced what they called the 3A perspective: Annotation, Abstraction and Analysis.

Annotation consists of the application of a scheme to texts. Annotations may include structural markup, part-of-speech tagging, parsing, and numerous other representations.

Abstraction consists of the translation (mapping) of terms in the scheme to terms in a theoretically motivated model or dataset. Abstraction typically includes linguist-directed search but may include, for example, rule-learning for parsers.

Analysis consists of statistically probing, manipulating and generalising from the dataset. Analysis might include statistical evaluations, optimisation of rule-bases or knowledge discovery methods.
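The three stages can be sketched in a few lines of Python. This is only an illustrative toy, not Wallis and Nelson's own procedure: the lexicon-based tagger, the tag names and the choice of "nouns and the tag of the preceding word" as the cases of interest are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy tagging scheme; a real corpus would use a trained tagger.
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN", "mat": "NOUN",
    "sat": "VERB", "chased": "VERB",
    "on": "PREP",
}


def annotate(tokens):
    """Annotation: apply the scheme, pairing each token with a tag."""
    return [(tok, TOY_LEXICON.get(tok, "UNK")) for tok in tokens]


def abstract(tagged):
    """Abstraction: map annotations onto a dataset of cases of interest,
    here every noun together with the tag of the word before it."""
    cases = []
    for i, (tok, tag) in enumerate(tagged):
        if tag == "NOUN":
            prev_tag = tagged[i - 1][1] if i > 0 else "START"
            cases.append({"noun": tok, "preceded_by": prev_tag})
    return cases


def analyse(cases):
    """Analysis: generalise from the dataset, here by simple proportions."""
    counts = Counter(case["preceded_by"] for case in cases)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


tokens = "the cat sat on the mat a dog chased cat".split()
tagged = annotate(tokens)
cases = abstract(tagged)
result = analyse(cases)
print(result)
```

Keeping the stages as separate functions mirrors the point of the 3A perspective: the annotation can be reused while the abstraction and analysis are swapped out for a different research question.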

Most lexical corpora today are part-of-speech-tagged (POS-tagged). However, even corpus linguists who work with 'unannotated plain text' inevitably apply some method to isolate salient terms. In such situations annotation and abstraction are combined in a lexical search.
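A lexical search over plain text can be as simple as a keyword-in-context (KWIC) listing. The sketch below is a minimal illustration under stated assumptions: the `kwic` function, its `width` parameter and the sample sentence are hypothetical, and the tokenisation is deliberately naive.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, node, right context) for each hit of `keyword`,
    with up to `width` tokens of context on either side."""
    tokens = re.findall(r"\w+", text.lower())
    node = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample text standing in for an unannotated corpus.
sample = "The cat sat on the mat. Then the cat slept."
hits = kwic(sample, "cat", width=2)
for left, node, right in hits:
    print(f"{left:>12} [{node}] {right}")
```

Here the search term itself does the work of both annotation (deciding what counts as a token of interest) and abstraction (collecting the matching cases into a dataset).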

The advantage of publishing an annotated corpus is that other users can then perform experiments on the corpus (through corpus managers). Linguists whose interests and perspectives differ from the originators' can exploit this work. By sharing data, corpus linguists are able to treat the corpus as a locus of linguistic debate and further study.^[22]

A Linguistic Atlas of Early Middle English

Collocation

Collostructional analysis

Concordance (KWIC)

Keyword (linguistics)

Linguistic Data Consortium

List of text corpora

Machine translation

Natural Language Toolkit

Pattern grammar

Search engines: they access the "web corpus"

Semantic prosody

Speech corpus

Text corpus

Translation memory

Treebank

Word list

Biber, D., Conrad, S., Reppen R. Corpus Linguistics, Investigating Language Structure and Use, Cambridge: Cambridge UP, 1998. ISBN 0-521-49957-7

McCarthy, D., and Sampson G. Corpus Linguistics: Readings in a Widening Discipline, Continuum, 2005. ISBN 0-8264-8803-X

Facchinetti, R. Theoretical Description and Practical Applications of Linguistic Corpora. Verona: QuiEdit, 2007. ISBN 978-88-89480-37-3

Facchinetti, R. (ed.) Corpus Linguistics 25 Years on. New York/Amsterdam: Rodopi, 2007. ISBN 978-90-420-2195-2

Facchinetti, R. and Rissanen M. (eds.) Corpus-based Studies of Diachronic English. Bern: Peter Lang, 2006. ISBN 3-03910-851-4

Lenders, W. Computational lexicography and corpus linguistics until ca. 1970/1980, in: Gouws, R. H., Heid, U., Schweickard, W., Wiegand, H. E. (eds.) Dictionaries – An International Encyclopedia of Lexicography. Supplementary Volume: Recent Developments with Focus on Electronic and Computational Lexicography. Berlin: De Gruyter Mouton, 2013. ISBN 978-3112146651

Fuß, Eric et al. (eds.) Grammar and Corpora 2016. Heidelberg: Heidelberg University Publishing, 2018. doi:10.17885/heiup.361.509 (digital open access).

Stefanowitsch, A. Corpus linguistics: A guide to the methodology. Berlin: Language Science Press, 2020. ISBN 978-3-96110-225-9, doi:10.5281/zenodo.3735822. Open Access: https://langsci-press.org/catalog/book/148.

Bookmarks for Corpus-based Linguists – very comprehensive site with categorized and annotated links to language corpora, software, references, etc.

Corpora discussion list

Freely-available, web-based corpora (100 million – 400 million words each): American (COCA, COHA), British (BNC), Time, Spanish, Portuguese

Manuel Barbera's overview site

AskOxford.com: the composition and use of the Oxford Corpus

DMCBC.com

Datum Multilanguage Corpora Based on Chinese, free sample download

Corpus4u Community: a Chinese online forum for corpus linguistics

McEnery and Wilson's Corpus Linguistics Page

Corpus Linguistics with R mailing list

Research and Development Unit for English Studies

Survey of English Usage

The Centre for Corpus Linguistics at Birmingham University

Tools for Corpus Linguistics (annotated list)

Gateway to Corpus Linguistics on the Internet: an annotated guide to corpus resources on the web

Biomedical corpora

Linguistic Data Consortium, a major distributor of corpora

Penn Parsed Corpora of Historical English

Corsis (formerly Tenka Text): an open-source (GPLed) corpus analysis tool written in C#

ICECUP and Fuzzy Tree Fragments

Discussion group: text mining

MAG 2017, a corpus linguistics related conference: information and events related to Metadiscourse Across Genres can be found on the MAG 2017 website

Corpus of Political Speeches: free access to political speeches by American and Chinese politicians, developed by Hong Kong Baptist University Library

LightTag: a text annotation tool for machine learning corpora, focused on team management

LIVAC Synchronous Corpus
