Available Corpora

Written corpora (synchronic)

corpus name | size (# of words) | lemmatisation | morphological tags | publication date | short description
SYN | 2 232 mil. | YES | YES | 2010 | non-reference; unification of all the SYN-series synchronic written corpora
SYN2013PUB | 935 mil. | YES | YES | 2013 | corpus of newspapers and magazines from 2005-2009
SYN2010 | 100 mil. | YES | YES | 2010 | balanced corpus, most of the texts are from 2005-2009
SYN2009PUB | 700 mil. | YES | YES | 2010 | corpus of newspapers and magazines from 1995-2007
SYN2006PUB | 300 mil. | YES | YES | 2006 | corpus of newspapers and magazines from 1989-2004
SYN2005 | 100 mil. | YES | YES | 2005 | balanced corpus, most of the texts are from 2000-2004
SYN2000 | 100 mil. | YES | YES | 2000 | balanced corpus, most of the texts are from 1990-1999
FSC2000 | 100 mil. | YES | NO | 2004 | modified SYN2000, source of the Frequency Dictionary of Czech
CZESL-PLAIN | 2 mil. | NO | NO | 2012 | non-reference; learner corpus of non-native Czech speakers
CZESL-SGT | 960 000 | YES | YES | 2014 | non-reference; corpus of non-native speakers' Czech with automatic annotation
KSK-DOPISY | 800 000 | NO | NO | 2006 | transcriptions of handwritten correspondence from 1990-2004
JEROME | 69 mil. | YES | YES | 2013 | monolingual comparable corpus for translation studies
LINK | 1.8 mil. | YES | YES | 2010 | non-reference; corpus of linguistic texts
ORWELL | 80 000 | YES | YES | 2003 | Orwell's "1984", manually annotated
SKRIPT2012 | 590 000 | YES | YES | 2013 | corpus of school essays

Spoken corpora (synchronic)

corpus name | size (# of words) | lemmatisation | morphological tags | publication date | short description
SPEECHES | 215 000 | YES | YES | 2015 | corpus of presidential speeches
ORAL2013 | 2.79 mil. | NO | NO | 2013 | representative corpus of informal spoken Czech
ORAL2008 | 1 mil. | NO | NO | 2008 | sociolinguistically balanced corpus of informal spoken Czech
ORAL2006 | 1 mil. | NO | NO | 2006 | corpus of informal spoken Czech
SCHOLA2010 | 790 000 | NO | NO | 2010 | corpus of school lessons
PMK | 675 000 | NO | NO | 2001 | Prague spoken corpus
BMK | 490 000 | NO | NO | 2002 | Brno spoken corpus

Diachronic corpora

corpus name | size (# of words) | lemmatisation | morphological tags | publication date | short description
DIAKORP | 1.95 mil. | NO | NO | 2005 | non-reference; corpus of the diachronic section of the CNC

Foreign language corpora

corpus name | size (# of words) | lemmatisation | morphological tags | publication date | short description
Aranea | 1 000 mil. (each) | YES | YES | 2014 | non-reference; comparable web corpora for several European languages
DOTKO | 12 mil. | NO | NO | 2010 | non-reference; corpus of Lower Sorbian, most of the texts are from 1848-1933
HOTKO | 36 mil. | NO | NO | 2013 | non-reference; corpus of Upper Sorbian
lEstRepublicain | 120 mil. | YES | YES | 2013 | corpus of the French newspaper L'Est Républicain
deWaC | 1 350 mil. | YES | YES | 2013 | web corpus of German
frWaC | 1 350 mil. | YES | YES | 2013 | web corpus of French
itWaC | 1 600 mil. | YES | YES | 2013 | web corpus of Italian
ukWaC | 1 900 mil. | YES | YES | 2013 | web corpus of British English

Parallel corpus

corpus name | size (# of words) | lemmatisation | morphological tags | publication date | short description
InterCorp | 138 mil. | YES (partial) | YES (partial) | 2008 | non-reference; parallel corpus being compiled as a part of the InterCorp project