= Web Corpora Wordlist Based Language Filter =

== Summary ==
 * Separates documents and paragraphs by language using word frequency lists.
 * All languages to be recognised have to be specified and the respective frequency wordlists supplied.

== The method ==
 * A score (the logarithm of the relative corpus frequency) is calculated for each word form and language; the scoring is illustrated by the sketch below this list.
 * The sum of the scores of all words in a paragraph or document is calculated for each language.
 * If the ratio of the scores of the two top scoring languages is above a threshold, the top scoring language is recorded in the header of the respective paragraph/document.
 * A multi-language document is split into separate documents, each containing just the paragraphs in a single recognised language.
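A minimal sketch of the scoring in Python follows. This is not the actual lang_filter.py implementation: the exact score normalisation (the scaling that keeps scores positive), the handling of short content and all names are illustrative assumptions.

{{{#!python
#A sketch of the method, NOT the code of lang_filter.py.
import math

def load_scores(wordlist_path):
    #Wordlist format: word form TAB corpus frequency, one record per line.
    freqs = {}
    with open(wordlist_path, encoding='utf-8') as f:
        for line in f:
            word, freq = line.rstrip('\n').split('\t')
            freqs[word.lower()] = int(freq)
    total = sum(freqs.values())
    #Score: logarithm of the relative corpus frequency; scaling it to a
    #positive range (per billion tokens here) is an assumption.
    return {w: max(0.0, math.log10(n * 1e9 / total)) for w, n in freqs.items()}

def classify(words, scores_by_lang, threshold):
    #Sum the scores of all words for each language.
    totals = {lang: sum(scores.get(w.lower(), 0.0) for w in words)
              for lang, scores in scores_by_lang.items()}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    top_lang, top = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0.0
    #Accept the top language only if it wins by at least the threshold ratio.
    if second == 0.0 or top / second >= threshold:
        return top_lang
    return None  #mixed: the two top scoring languages are too close
}}}

Word forms unknown to a wordlist (including punctuation, which is omitted from the wordlists) contribute 0.0, as in the sample output below.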
== Frequency wordlists ==
 * Frequency wordlists from big web corpora for more than 40 languages are included with the script.
 * The size of the wordlists included with the script is limited. Unlimited user-produced wordlists can be used to improve the performance, especially in the case of very similar languages.
 * It is important to [Unitok tokenise] all wordlists (used in a single run of the filter) the same way -- for example, a wordlist containing "don't" as a single token will never match input tokenised as "do" + "n't".

== Examples ==

=== Sample English frequency wordlist (top 10 lines) ===
{{{
the	789476980
and	438853153
of	408834726
to	357762038
a	275739809
in	271813858
for	150888766
is	149294535
that	122527476
on	102044349
}}}
=== Sample input ===
{{{
Linnaeus
considered
the
dog
to
be
a
separate
species
.
}}}

=== Sample output ===
{{{
#wordform	English	Czech	Slovak	#score for each word
Linnaeus	0.00	0.00	0.00	#unknown to all sample wordlists
considered	5.18	0.00	0.00	#English only
the	7.82	5.26	5.33	#English word, ~100 x more frequent in the English wl
dog	4.89	0.00	0.00
to	7.48	7.05	7.15	#a valid word in all three languages
be	6.77	0.00	0.00
a	7.37	7.56	7.66	#a valid word in all three languages
separate	4.91	0.00	0.00
species	5.14	0.00	0.00
.	0.00	0.00	0.00	#punctuation is omitted from wordlists
}}}
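Summing the scores per language over this sentence gives English 49.56, Czech 19.87 and Slovak 20.14. The ratio of the two top scoring languages is 49.56 / 20.14 ≈ 2.46, well above a threshold such as 1.01, so the sentence is recognised as English.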

== Installation ==
{{{
wget http://corpus.tools/raw-attachment/wiki/Downloads/wcwb_lang_filter_1.0.tar.gz
tar -xzvf wcwb_lang_filter_1.0.tar.gz
cd wcwb_lang_filter_1.0
make test/out.vert.lang_czech
}}}

== Usage ==
{{{
./lang_filter.py (LANGUAGE FRQ_WORDLIST[.gz|.xz])+ ACCEPTED_LANGS REJECTED_OUT LANG_RATIO_THRESHOLD

E.g. to see all score information in a single output file:
./lang_filter.py Czech wl/czech.frqwl.gz Slovak wl/slovak.frqwl.gz \
    English wl/english.frqwl.gz Czech,Slovak out_rejected.vert 1.01 < in.vert > out.vert

Or just to output to separate documents:
./lang_filter.py Czech wl/czech.frqwl.gz Slovak wl/slovak.frqwl.gz \
    English wl/english.frqwl.gz Czech,Slovak out_rejected.vert 1.01 < in.vert > out.vert \
    | grep -v '^

FRQ_WORDLIST
    word form and frequency separated by TAB, one record per line, can be gzipped or xzipped
ACCEPTED_LANGS
    comma separated list of accepted languages (ALL to accept all)
REJECTED_OUT
    a path to write rejected data to; three files with the following suffixes are created:
    "lang"  ... rejected since the content is in an unwanted recognised language
    "mixed" ... rejected since the content is below the threshold of discerning languages
                (too similar to two top scoring recognised languages)
    "small" ... rejected since the content is too short to reliably determine its language
LANG_RATIO_THRESHOLD
    threshold ratio of the wordlist scores of the top/second top scoring language
    (NONE not to filter mixed language content at all);
    1.1 is ok for different languages, 1.01 is better for very close languages

Input format:
    Vertical (tokenised text, one token per line), <doc> and <p> structures.
Output format:
    The same as the input with the following additions:
    Attribute doc.lang -- the language with the top score
    Attribute doc.lang_scores -- scores for each recognised language
    Structure par_langs with attributes lang and lang_scores -- the same for paragraphs
    New token attributes (columns): scores for all recognised languages (in the command line order)
A word form is expected in the first column of the input; the script keeps any other token attributes (e.g. lemma, tag).
The wordlist comparison is case insensitive.
}}}
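If one output file per recognised language is needed, the annotated output can be post-processed by grouping documents on the doc.lang attribute written by the filter. A minimal sketch follows; it is not part of the tool, the script name split_by_lang.py is hypothetical, and double-quoted attribute values (e.g. <doc ... lang="English">) are an assumption:

{{{#!python
#split_by_lang.py -- a post-processing sketch, not part of lang_filter.
#Writes each <doc> to out_LANG.vert according to its doc.lang attribute.
import re
import sys

lang_re = re.compile(r'<doc[^>]*\slang="([^"]+)"')
outputs = {}   #language name -> open output file
current = None
for line in sys.stdin:
    match = lang_re.match(line)
    if match:
        lang = match.group(1)
        if lang not in outputs:
            outputs[lang] = open('out_%s.vert' % lang, 'w', encoding='utf-8')
        current = outputs[lang]
    if current is not None:
        current.write(line)
for f in outputs.values():
    f.close()
}}}

E.g. ./lang_filter.py ... < in.vert | python3 split_by_lang.py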
== To build your own frequency wordlist ==
{{{
#Get corpus frequencies of lowercased words from a corpus compiled by [https://nlp.fi.muni.cz/trac/noske Sketch Engine]
lsclex -f /corpora/registry/english_web_corpus lc | cut -f2,3 | ./uninorm_4.py | perl -pe 's, (\d+)$,\t$1,' > en.wl1
lsclex -f /corpora/registry/czech_web_corpus lc | cut -f2,3 | ./uninorm_4.py | perl -pe 's, (\d+)$,\t$1,' > cs.wl1
lsclex -f /corpora/registry/slovak_web_corpus lc | cut -f2,3 | ./uninorm_4.py | perl -pe 's, (\d+)$,\t$1,' > sk.wl1

#Or get the same from a vertical file
cut -f1 english_web_corpus.vert | grep -v '^<' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | perl -pe 's,^\s*(\d+) (.*)$,$2\t$1,' > en.wl1
cut -f1 czech_web_corpus.vert | grep -v '^<' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | perl -pe 's,^\s*(\d+) (.*)$,$2\t$1,' > cs.wl1
cut -f1 slovak_web_corpus.vert | grep -v '^<' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | perl -pe 's,^\s*(\d+) (.*)$,$2\t$1,' > sk.wl1

#Filter the wordlist -- allow just characters valid for the language and a reasonable word length
grep '[abcdefghijklmnopqrstuvwxyz]' en.wl1 | grep -v -P "['.-]{2}" | ./wl_grep.py "[#@]?[abcdefghijklmnopqrstuvwxyzéè0-9'][abcdefghijklmnopqrstuvwxyzéè0-9'.-]{0,29}" > en.wl2
grep '[aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž]' cs.wl1 | grep -v -P "['.-]{2}" | ./wl_grep.py "[#@]?[aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž0-9'][aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž0-9'.-]{0,29}" > cs.wl2
grep '[aáäbcčdďeéfghiíjklĺľmnňoóôpqrŕsštťuúvwxyýzž]' sk.wl1 | grep -v -P "['.-]{2}" | ./wl_grep.py "[#@]?[aáäbcčdďeéfghiíjklĺľmnňoóôpqrŕsštťuúvwxyýzž0-9'][aáäbcčdďeéfghiíjklĺľmnňoóôpqrŕsštťuúvwxyýzž0-9'.-]{0,29}" > sk.wl2

#Sort (not necessary) and pack
for l in en cs sk; do sort -k2,2rg -k1,1 ${l}.wl2 | gzip > ${l}.frqwl.gz; done
}}}
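If the shell tools above are not available, the counting step can also be done with a short Python script. A sketch equivalent to the vertical-file pipeline above (the script name build_frqwl.py is hypothetical):

{{{#!python
#build_frqwl.py -- a sketch equivalent to the cut|grep|tr|sort|uniq
#pipeline above: count lowercased word forms from the first column
#of a vertical file and print "word TAB frequency" records.
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    if line.startswith('<'):
        continue   #skip structure lines such as <doc> and <p>
    word = line.rstrip('\n').split('\t')[0].lower()
    if word:
        counts[word] += 1
for word, freq in counts.most_common():   #descending frequency
    print('%s\t%d' % (word, freq))
}}}

E.g. python3 build_frqwl.py < english_web_corpus.vert | gzip > en.frqwl.gz (the character filtering step with wl_grep.py still applies).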
== Get Language Filter ==
See [wiki:Downloads] for the latest version.

== Licence ==
Language Filter is licensed under the [https://www.gnu.org/licenses/old-licenses/gpl-2.0.txt GNU General Public License Version 2].