Web Corpus Wordlist Based Language Filter


  • Separates documents and paragraphs by language using word frequency lists.
  • Every language to be recognised has to be specified together with its frequency wordlist.

The method

  • A score (a logarithm of relative corpus frequency) is calculated for each word form and language.
  • For each language, the scores of all words are summed per paragraph and per document.
  • If the ratio of the scores of the two top scoring languages is above a threshold, the top scoring language is recorded in the header of the respective paragraph/document.
  • A multi-language document is split into separate documents, each containing just the paragraphs in one recognised language.
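
The scoring steps above can be sketched roughly as follows. This is an illustration only, not the script's actual code: the base-10 logarithm, the per-billion scaling (`PER`) and the function names `word_scores` and `score_tokens` are assumptions made for the example.

```python
import math

# Assumption: relative frequency is expressed per billion tokens and the
# logarithm is base 10. The source only says "a logarithm of relative
# corpus frequency", so both choices are hypothetical.
PER = 1_000_000_000


def word_scores(wordlists):
    """Turn {lang: {word: count}} into {lang: {word: score}}."""
    scores = {}
    for lang, counts in wordlists.items():
        total = sum(counts.values())
        scores[lang] = {
            # Clamp at 0 so that a listed word never scores below an unknown one.
            word: max(0.0, math.log10(count / total * PER))
            for word, count in counts.items()
        }
    return scores


def score_tokens(tokens, scores):
    """Sum the per-word scores of a paragraph or document for each language."""
    totals = {lang: 0.0 for lang in scores}
    for token in tokens:
        key = token.lower()  # the wordlist comparison is case insensitive
        for lang, table in scores.items():
            totals[lang] += table.get(key, 0.0)  # unknown words score 0
    return totals


if __name__ == "__main__":
    toy = {"english": {"the": 900, "dog": 100}, "czech": {"to": 500, "pes": 500}}
    print(score_tokens(["The", "dog", "to"], word_scores(toy)))
```

The language with the highest total is the candidate; whether it is recorded depends on the ratio to the runner-up, as described in the third bullet.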

Frequency wordlists

  • Frequency wordlists from big web corpora for more than 40 languages are included with the script.
  • The wordlists included with the script are limited in size. Larger, user-built wordlists without a size limit can be used to improve the performance, especially in the case of very similar languages.
  • It is important to tokenise all wordlists (used in a single run of the filter) the same way.


Sample English frequency wordlist (top 10 lines)

the     789476980
and     438853153
of      408834726
to      357762038
a       275739809
in      271813858
for     150888766
is      149294535
that    122527476
on      102044349
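
A wordlist in the word TAB count format (plain, gzipped or xzipped) could be loaded along these lines. This is a sketch under assumptions: the helper names `open_text` and `load_wordlist` are made up for the example, compression is guessed from the file extension, and case variants are merged by lowercasing, which may differ from what the script does.

```python
import gzip
import lzma


def open_text(path):
    """Open a plain, gzipped or xzipped text file (chosen by extension; an assumption)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".xz"):
        return lzma.open(path, "rt", encoding="utf-8")
    return open(path, encoding="utf-8")


def load_wordlist(path):
    """Read '<word>TAB<count>' records into a {lowercased word: count} dict."""
    counts = {}
    with open_text(path) as handle:
        for line in handle:
            word, _, count = line.rstrip("\n").rpartition("\t")
            if not word or not count.isdigit():
                continue  # skip empty or malformed records
            key = word.lower()
            counts[key] = counts.get(key, 0) + int(count)
    return counts
```

For instance, `load_wordlist("wl/english.frqwl.gz")` would return a dictionary keyed by lowercased word form.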

Sample input

<doc source="">

Sample output

<doc source="" lang="english"
     lang_scores="english: 49.56, czech: 19.86, slovak: 20.15">
<par_langs lang="english" lang_scores="english: 49.56, czech: 19.86, slovak: 20.15"/>
#wordform  English  Czech Slovak   #score for each word
Linnaeus      0.00   0.00   0.00   #unknown to all sample wordlists
considered    5.18   0.00   0.00   #English only
the           7.82   5.26   5.33   #English word, ~100 x more frequent in the English wl
dog           4.89   0.00   0.00
to            7.48   7.05   7.15   #a valid word in all three languages
be            6.77   0.00   0.00
a             7.37   7.56   7.66   #a valid word in all three languages
separate      4.91   0.00   0.00
species       5.14   0.00   0.00
.             0.00   0.00   0.00   #punctuation is omitted from wordlists
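
For downstream processing, the `lang_scores` attribute shown in the sample output can be parsed back into numbers. The snippet below is a minimal sketch; the function names `parse_lang_scores` and `top_two_ratio` are hypothetical and not part of the tool.

```python
import re


def parse_lang_scores(header_line):
    """Extract lang_scores from a <doc> or <par_langs> tag as {language: score}."""
    match = re.search(r'lang_scores="([^"]*)"', header_line)
    if not match:
        return {}
    scores = {}
    for item in match.group(1).split(","):
        lang, _, value = item.partition(":")
        scores[lang.strip()] = float(value)
    return scores


def top_two_ratio(scores):
    """Ratio of the best score to the second best score."""
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) < 2 or ranked[1] <= 0:
        return float("inf")
    return ranked[0] / ranked[1]


line = '<par_langs lang="english" lang_scores="english: 49.56, czech: 19.86, slovak: 20.15"/>'
scores = parse_lang_scores(line)
print(scores, round(top_two_ratio(scores), 2))
```

For the sample paragraph the ratio of English to Slovak is about 2.46, far above a threshold such as 1.01 or 1.1.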


tar -xzvf wcwb_lang_filter_1.0.tar.gz
cd wcwb_lang_filter_1.0
make test/out.vert.lang_czech

Requirements: pypy3 or python3



E.g. to see all score information in a single output file:
./ Czech wl/czech.frqwl.gz Slovak wl/slovak.frqwl.gz \
    English wl/english.frqwl.gz Czech,Slovak out_rejected.vert 1.01 < in.vert > out.vert

Or just to output to separate documents:
./ Czech wl/czech.frqwl.gz Slovak wl/slovak.frqwl.gz \
    English wl/english.frqwl.gz Czech,Slovak out_rejected.vert 1.01 < in.vert \
    | grep -v '^<par_langs' | cut -f1 \
    | ./ doc lang out.vert.lang_

LANGUAGE language name

FRQ_WORDLIST corpus frequency wordlist
    format: <word>TAB<count>, one record per line, can be gzipped or xzipped

ACCEPTED_LANGS comma separated list of accepted languages (ALL to accept all)

REJECTED_OUT a path to write rejected data to; files with the following suffixes are created:
    "lang"  ... rejected since the content is in an unwanted recognised language
    "mixed" ... rejected since the content is below the threshold of discerning languages
                (too similar to two top scoring recognised languages)
    "small" ... rejected since the content is too short to reliably determine its language

LANG_RATIO_THR threshold ratio of wordlist scores of top/second most scoring languages
    (NONE not to filter mixed language at all)
    1.1 is ok for different languages, 1.01 is better for very close languages
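
How ACCEPTED_LANGS and LANG_RATIO_THR interact can be illustrated with a small decision function. This is a sketch under assumptions, not the script's code: the function name `classify` and its return values are hypothetical, language names are compared in lowercase, and the "small" case is left out because the source does not state how short is too short.

```python
def classify(scores, accepted_langs=None, ratio_thr=None):
    """Return (top_language, verdict) for one paragraph or document.

    scores         -- {language: summed score}
    accepted_langs -- set of language names, or None for ALL
    ratio_thr      -- threshold for top/second score, or None for NONE
    verdict        -- "accepted", "lang" (unwanted language) or "mixed"
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top_lang, top = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0.0
    # The ratio top/second has to be above the threshold; written as a
    # multiplication to avoid dividing by a zero second score.
    if ratio_thr is not None and top <= ratio_thr * second:
        return top_lang, "mixed"
    if accepted_langs is not None:
        wanted = {lang.lower() for lang in accepted_langs}
        if top_lang.lower() not in wanted:
            return top_lang, "lang"
    return top_lang, "accepted"


if __name__ == "__main__":
    sample = {"english": 49.56, "czech": 19.86, "slovak": 20.15}
    print(classify(sample, {"Czech", "Slovak"}, 1.01))
```

With the sample scores and Czech,Slovak as the accepted languages, the paragraph is clearly English and would therefore end up in the rejected "lang" output.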

Input format: Vertical (tokenised text, one token per line), <doc/> and <p/> structures.

Output format: The same as input with the following additions:
    Attribute doc.lang -- the language with the top score
    Attribute doc.lang_scores -- scores for each recognised language
    Structure par_langs with attributes lang and lang_scores -- the same for paragraphs
    New token attributes (columns): scores for recognised languages (command line order)

A word form is expected in the first column of the input.
The script keeps any other token attributes (e.g. lemma, tag).
The wordlist comparison is case insensitive.
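
Reading the vertical input described above might look like the following sketch. It assumes that structures appear as whole-line tags such as `<doc ...>`, `<p>` and `</p>`; the name `iter_paragraphs` and the tag-matching regular expression are choices made for this example only.

```python
import re

# A whole line that looks like an opening or closing tag is treated as a structure.
STRUCT = re.compile(r"^</?[A-Za-z_][^<>]*>$")


def iter_paragraphs(lines):
    """Yield (doc_header, [lowercased word forms]) for each <p> in a vertical stream."""
    doc_header, words = None, None
    for raw in lines:
        line = raw.rstrip("\n")
        if STRUCT.match(line):
            if line.startswith("<doc"):
                doc_header = line
            elif re.match(r"<p[\s>]", line):
                words = []  # a new paragraph starts
            elif line == "</p>" and words is not None:
                yield doc_header, words
                words = None
            continue
        if words is not None and line:
            # The word form is in the first column; other attributes are ignored here.
            words.append(line.split("\t", 1)[0].lower())


if __name__ == "__main__":
    vert = ['<doc source="x">', "<p>", "The\tthe\tDT", "dog\tdog\tNN", "</p>", "</doc>"]
    for header, words in iter_paragraphs(vert):
        print(header, words)
```

Lowercasing the word form at this point mirrors the case insensitive wordlist comparison; a real filter would also keep the original lines so that the remaining token attributes can be written back unchanged.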

To build your own frequency wordlist

#Get corpus frequencies of lowercased words from a corpus compiled by Sketch Engine
lsclex -f /corpora/registry/english_web_corpus lc | cut -f2,3 | ./ | perl -pe 's, (\d+)$,\t$1,' > en.wl1
lsclex -f /corpora/registry/czech_web_corpus   lc | cut -f2,3 | ./ | perl -pe 's, (\d+)$,\t$1,' > cs.wl1
lsclex -f /corpora/registry/slovak_web_corpus  lc | cut -f2,3 | ./ | perl -pe 's, (\d+)$,\t$1,' > sk.wl1

#Or get the same from a vertical file
cut -f1 english_web_corpus.vert | grep -v '^<' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | perl -pe 's,^\s*(\d+) (.*)$,$2\t$1,' > en.wl1
cut -f1 czech_web_corpus.vert   | grep -v '^<' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | perl -pe 's,^\s*(\d+) (.*)$,$2\t$1,' > cs.wl1
cut -f1 slovak_web_corpus.vert  | grep -v '^<' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | perl -pe 's,^\s*(\d+) (.*)$,$2\t$1,' > sk.wl1

#Allow words with characters occurring in the language and a reasonable length
grep '[abcdefghijklmnopqrstuvwxyz]' en.wl1                  | grep -v -P "['.-]{2}" | ./ "[#@]?[abcdefghijklmnopqrstuvwxyzéè0-9'][abcdefghijklmnopqrstuvwxyzéè0-9'.-]{0,29}"                               > en.wl2
grep '[aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž]' cs.wl1   | grep -v -P "['.-]{2}" | ./ "[#@]?[aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž0-9'][aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž0-9'.-]{0,29}"     > cs.wl2
grep '[aáäbcčdďeéfghiíjklĺľmnňoóôpqrŕsštťuúvwxyýzž]' sk.wl1 | grep -v -P "['.-]{2}" | ./ "[#@]?[aáäbcčdďeéfghiíjklĺľmnňoóôpqrŕsštťuúvwxyýzž0-9'][aáäbcčdďeéfghiíjklĺľmnňoóôpqrŕsštťuúvwxyýzž0-9'.-]{0,29}" > sk.wl2

#Sort (not necessary) and pack
for l in en cs sk; do sort -k2,2rg -k1,1 ${l}.wl2 | gzip > ${l}.frqwl.gz; done
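
As an alternative to the shell pipeline for a vertical file, the counting, filtering and packing steps can be sketched in Python, where `str.lower()` handles accented letters regardless of locale. The function names `build_wordlist` and `write_wordlist` and the way the allowed-character pattern is passed in are assumptions of this example; it does not reproduce the helper scripts used above.

```python
import gzip
import re
from collections import Counter


def build_wordlist(vert_lines, allowed, max_len=30):
    """Count lowercased word forms (first column) that fully match `allowed`."""
    pattern = re.compile(allowed)
    counts = Counter()
    for line in vert_lines:
        if line.startswith("<"):
            continue  # skip structure lines, like grep -v '^<'
        word = line.rstrip("\n").split("\t", 1)[0].lower()
        if 0 < len(word) <= max_len and pattern.fullmatch(word):
            counts[word] += 1
    return counts


def write_wordlist(counts, path):
    """Write '<word>TAB<count>' records, most frequent first, gzipped."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            out.write(f"{word}\t{count}\n")
```

A call such as `write_wordlist(build_wordlist(open("czech_web_corpus.vert", encoding="utf-8"), r"[a-záčďéěíňóřšťúůýž0-9'.-]+"), "cs.frqwl.gz")` would then produce a file usable as FRQ_WORDLIST; the character class here is only illustrative.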


Language Filter is licensed under GNU General Public License Version 3.


Discriminating Between Similar Languages Using Large Web Corpora. In Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2019.

Last modified on 06/18/20 21:04:37
