Changes between Version 2 and Version 3 of languagefilter
Timestamp: 04/25/20 09:46:12
languagefilter
v2    v3
14    14     * Frequency wordlists from big web corpora for more than 40 languages are included with the script.
15    15     * The size of wordlists included with the script is limited. User produced unlimited wordlists can be used to improve the performance, esp. in the case of very similar languages.
16           * It is important to [Unitok tokenise] all wordlists (used in a single run of the filter) the same way.
      16     * It is important to [[Unitok|tokenise]] all wordlists (used in a single run of the filter) the same way.
17    17
18    18    == Examples ==
…     …
84    84    == Usage ==
85    85    {{{
86          ./lang_filter.py (LANGUAGE FRQ_WORDLIST[.gz|.xz])+ ACCEPTED_LANGS REJECTED_OUT LANG_RATIO_THRESHOLD
      86    ./lang_filter.py (LANGUAGE FRQ_WORDLIST[.gz|.xz])+ ACCEPTED_LANGS REJECTED_OUT LANG_RATIO_THR
87    87
88    88    E.g. to see all score information in a single output file:
…     …
103   103   ACCEPTED_LANGS   comma separated list of accepted languages (ALL to accept all)
104   104
105         REJECTED_OUT     a path to write rejected data to; three files with the following suffixes are created:
      105   REJECTED_OUT     a path to write rejected data to; files with the following suffixes are created:
106   106       "lang"  ... rejected since the content is in an unwanted recognised language
107   107       "mixed" ... rejected since the content is below the threshold of discerning languages
…     …
109   109       "small" ... rejected since the content is too short to reliably determine its language
110   110
111         LANG_RATIO_THRESHOLD   threshold ratio of wordlist scores of top/second most scoring languages
      111   LANG_RATIO_THR   threshold ratio of wordlist scores of top/second most scoring languages
112   112       (NONE not to filter mixed language at all)
113   113       1.1 is ok for different languages, 1.01 is better for very close languages
…     …
119   119   Attribute doc.lang_scores -- scores for each recognised language
120   120   Structure par_langs with attributes lang and lang_scores -- the same for paragraphs
121         New token attributes (columns): scores for all recognised languages (in the command line order)
      121   New token attributes (columns): scores for recognised languages (command line order)
122   122
123   123   A word form is expected in the first column of the input.
…     …
128   128   == To build your own frequency wordlist ==
129   129   {{{
130         #Get corpus frequencies of lowercased words from a corpus compiled by [https://nlp.fi.muni.cz/trac/noske Sketch Engine]
      130   #Get corpus frequencies of lowercased words from a corpus compiled by Sketch Engine
131   131   lsclex -f /corpora/registry/english_web_corpus lc | cut -f2,3 | ./uninorm_4.py | perl -pe 's, (\d+)$,\t$1,' > en.wl1
132   132   lsclex -f /corpora/registry/czech_web_corpus lc | cut -f2,3 | ./uninorm_4.py | perl -pe 's, (\d+)$,\t$1,' > cs.wl1
…     …
138   138   cut -f1 slovak_web_corpus.vert | grep -v '^<' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | perl -pe 's,^\s*(\d+) (.*)$,$2\t$1,' > sk.wl1
139   139
140         # Filter the wordlist -- allow just characters valid for the language and a reasonable word length
      140   #Allow words with characters occurring in the language and a reasonable length
141   141   grep '[abcdefghijklmnopqrstuvwxyz]' en.wl1 | grep -v -P "['.-]{2}" | ./wl_grep.py "[#@]?[abcdefghijklmnopqrstuvwxyzéè0-9'][abcdefghijklmnopqrstuvwxyzéè0-9'.-]{0,29}" > en.wl2
142   142   grep '[aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž]' cs.wl1 | grep -v -P "['.-]{2}" | ./wl_grep.py "[#@]?[aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž0-9'][aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž0-9'.-]{0,29}" > cs.wl2
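To make the v3 usage line concrete, here is a hedged sketch of one possible invocation that reuses the en.wl2 and cs.wl2 wordlists built above. Only the argument order comes from the synopsis; the file names input.vert and accepted.vert, the rejected output prefix, and the stdin/stdout redirection are illustrative assumptions, not taken from the page.

{{{
# Illustrative only: recognise English and Czech, accept English, write
# rejected content to files derived from the "rejected" path with the
# documented suffixes ("lang", "mixed", "small"), and treat paragraphs
# whose top/second wordlist score ratio falls below 1.1 as mixed.
# Piping the vertical through stdin/stdout is an assumption.
./lang_filter.py English en.wl2 Czech cs.wl2 English rejected 1.1 < input.vert > accepted.vert
}}}

Under this reading of LANG_RATIO_THR, a paragraph scoring 520 for Czech and 500 for English (ratio 1.04) would fall below 1.1 and be routed to the "mixed" output; the page's advice to lower the threshold to about 1.01 is aimed at very close pairs such as Czech and Slovak, whose wordlist scores stay close even on clean single-language text.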