Changes between Version 2 and Version 3 of languagefilter
Timestamp: 04/25/20 09:46:12
languagefilter
v2    v3
14    14     * Frequency wordlists from big web corpora for more than 40 languages are included with the script.
15    15     * The size of wordlists included with the script is limited. User produced unlimited wordlists can be used to improve the performance, esp. in the case of very similar languages.
16           * It is important to [Unitok tokenise] all wordlists (used in a single run of the filter) the same way.
      16     * It is important to [[Unitok|tokenise]] all wordlists (used in a single run of the filter) the same way.
17    17
18    18    == Examples ==
…     …
84    84    == Usage ==
85    85    {{{
86          ./lang_filter.py (LANGUAGE FRQ_WORDLIST[.gz|.xz])+ ACCEPTED_LANGS REJECTED_OUT LANG_RATIO_THRESHOLD
      86    ./lang_filter.py (LANGUAGE FRQ_WORDLIST[.gz|.xz])+ ACCEPTED_LANGS REJECTED_OUT LANG_RATIO_THR
87    87
88    88    E.g. to see all score information in a single output file:
…     …
103   103   ACCEPTED_LANGS   comma separated list of accepted languages (ALL to accept all)
104   104
105         REJECTED_OUT     a path to write rejected data to; three files with the following suffixes are created:
      105   REJECTED_OUT     a path to write rejected data to; files with the following suffixes are created:
106   106       "lang"  ... rejected since the content is in an unwanted recognised language
107   107       "mixed" ... rejected since the content is below the threshold of discerning languages
…     …
109   109       "small" ... rejected since the content is too short to reliably determine its language
110   110
111         LANG_RATIO_THRESHOLD   threshold ratio of wordlist scores of top/second most scoring languages
      111   LANG_RATIO_THR   threshold ratio of wordlist scores of top/second most scoring languages
112   112       (NONE not to filter mixed language at all)
113   113       1.1 is ok for different languages, 1.01 is better for very close languages
…     …
119   119   Attribute doc.lang_scores -- scores for each recognised language
120   120   Structure par_langs with attributes lang and lang_scores -- the same for paragraphs
121         New token attributes (columns): scores for all recognised languages (in the command line order)
      121   New token attributes (columns): scores for recognised languages (command line order)
122   122
123   123   A word form is expected in the first column of the input.
…     …
128   128   == To build your own frequency wordlist ==
129   129   {{{
130         #Get corpus frequencies of lowercased words from a corpus compiled by [https://nlp.fi.muni.cz/trac/noske Sketch Engine]
      130   #Get corpus frequencies of lowercased words from a corpus compiled by Sketch Engine
131   131   lsclex -f /corpora/registry/english_web_corpus lc | cut -f2,3 | ./uninorm_4.py | perl -pe 's, (\d+)$,\t$1,' > en.wl1
132   132   lsclex -f /corpora/registry/czech_web_corpus lc | cut -f2,3 | ./uninorm_4.py | perl -pe 's, (\d+)$,\t$1,' > cs.wl1
…     …
138   138   cut -f1 slovak_web_corpus.vert | grep -v '^<' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | perl -pe 's,^\s*(\d+) (.*)$,$2\t$1,' > sk.wl1
139   139
140         # Filter the wordlist -- allow just characters valid for the language and a reasonable word length
      140   #Allow words with characters occurring in the language and a reasonable length
141   141   grep '[abcdefghijklmnopqrstuvwxyz]' en.wl1 | grep -v -P "['.-]{2}" | ./wl_grep.py "[#@]?[abcdefghijklmnopqrstuvwxyzéè0-9'][abcdefghijklmnopqrstuvwxyzéè0-9'.-]{0,29}" > en.wl2
142   142   grep '[aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž]' cs.wl1 | grep -v -P "['.-]{2}" | ./wl_grep.py "[#@]?[aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž0-9'][aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž0-9'.-]{0,29}" > cs.wl2
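To make the v3 usage line concrete, here is a hedged sketch of one possible invocation that reuses the en.wl2 and cs.wl2 wordlists built above. Only the argument order comes from the synopsis; the file names input.vert and accepted.vert, the rejected output prefix, and the stdin/stdout redirection are illustrative assumptions, not taken from the page.

{{{
# Illustrative only: recognise English and Czech, accept English, write
# rejected content to files derived from the "rejected" path with the
# documented suffixes ("lang", "mixed", "small"), and treat paragraphs
# whose top/second wordlist score ratio falls below 1.1 as mixed.
# Piping the vertical through stdin/stdout is an assumption.
./lang_filter.py English en.wl2 Czech cs.wl2 English rejected 1.1 < input.vert > accepted.vert
}}}

Under this reading of LANG_RATIO_THR, a paragraph scoring 520 for Czech and 500 for English (ratio 1.04) would fall below 1.1 and be routed to the "mixed" output; the page's advice to lower the threshold to about 1.01 is aimed at very close pairs such as Czech and Slovak, whose wordlist scores stay close even on clean single-language text.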