Changes between Version 2 and Version 3 of languagefilter


Timestamp:
04/25/20 09:46:12 (5 years ago)
Author:
admin
Comment:

--

  • languagefilter

    v2 v3  
    1414* Frequency wordlists from big web corpora for more than 40 languages are included with the script.
    1515* The size of wordlists included with the script is limited. User produced unlimited wordlists can be used to improve the performance, esp. in the case of very similar languages.
    16 * It is important to [Unitok tokenise] all wordlists (used in a single run of the filter) the same way.
     16* It is important to [[Unitok|tokenise]] all wordlists (used in a single run of the filter) the same way.
    1717
    1818== Examples ==
     
    8484== Usage ==
    8585{{{
    86 ./lang_filter.py (LANGUAGE FRQ_WORDLIST[.gz|.xz])+ ACCEPTED_LANGS REJECTED_OUT LANG_RATIO_THRESHOLD
     86./lang_filter.py (LANGUAGE FRQ_WORDLIST[.gz|.xz])+ ACCEPTED_LANGS REJECTED_OUT LANG_RATIO_THR
    8787
    8888E.g. to see all score information in a single output file:
     
    103103ACCEPTED_LANGS comma separated list of accepted languages (ALL to accept all)
    104104
    105 REJECTED_OUT a path to write rejected data to; three files with the following suffixes are created:
     105REJECTED_OUT a path to write rejected data to; files with the following suffixes are created:
    106106    "lang"  ... rejected since the content is in an unwanted recognised language
    107107    "mixed" ... rejected since the content is below the threshold of discerning languages
     
    109109    "small" ... rejected since the content is too short to reliably determine its language
    110110
    111 LANG_RATIO_THRESHOLD threshold ratio of wordlist scores of top/second most scoring languages
     111LANG_RATIO_THR threshold ratio of wordlist scores of top/second most scoring languages
    112112    (NONE not to filter mixed language at all)
    113113    1.1 is ok for different languages, 1.01 is better for very close languages
     
    119119    Attribute doc.lang_scores -- scores for each recognised language
    120120    Structure par_langs with attributes lang and lang_scores -- the same for paragraphs
    121     New token attributes (columns): scores for all recognised languages (in the command line order)
     121    New token attributes (columns): scores for recognised languages (command line order)
    122122
    123123A word form is expected in the first column of the input.
     
    128128== To build your own frequency wordlist ==
    129129{{{
    130 #Get corpus frequencies of lowercased words from a corpus compiled by [https://nlp.fi.muni.cz/trac/noske Sketch Engine]
     130#Get corpus frequencies of lowercased words from a corpus compiled by Sketch Engine
    131131lsclex -f /corpora/registry/english_web_corpus lc | cut -f2,3 | ./uninorm_4.py | perl -pe 's, (\d+)$,\t$1,' > en.wl1
    132132lsclex -f /corpora/registry/czech_web_corpus   lc | cut -f2,3 | ./uninorm_4.py | perl -pe 's, (\d+)$,\t$1,' > cs.wl1
     
    138138cut -f1 slovak_web_corpus.vert  | grep -v '^<' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | perl -pe 's,^\s*(\d+) (.*)$,$2\t$1,' > sk.wl1
    139139
    140 #Filter the wordlist -- allow just characters valid for the language and a reasonable word length
     140#Allow words with characters occurring in the language and a reasonable length
    141141grep '[abcdefghijklmnopqrstuvwxyz]' en.wl1                  | grep -v -P "['.-]{2}" | ./wl_grep.py "[#@]?[abcdefghijklmnopqrstuvwxyzéè0-9'][abcdefghijklmnopqrstuvwxyzéè0-9'.-]{0,29}"                               > en.wl2
    142142grep '[aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž]' cs.wl1   | grep -v -P "['.-]{2}" | ./wl_grep.py "[#@]?[aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž0-9'][aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž0-9'.-]{0,29}"     > cs.wl2
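
The LANG_RATIO_THR rule described under Usage (ratio of the top to the second most scoring language, with NONE disabling the mixed-language check) can be sketched as follows. This is only an illustration under stated assumptions, not the code of lang_filter.py: the function name `classify`, the returned labels, and the exact comparison (`top / second < threshold` means "mixed") are hypothetical, and the length-based "small" rejection is left out.

```python
from typing import Dict, Optional, Set


def classify(scores: Dict[str, float],
             accepted: Set[str],
             ratio_thr: Optional[float]) -> str:
    """Return 'accept', 'lang' or 'mixed' for one document (hypothetical helper).

    scores    -- wordlist score per recognised language
    accepted  -- accepted language codes, or {"ALL"} to accept all
    ratio_thr -- threshold ratio of top/second scores; None stands for NONE
    """
    if not scores:
        raise ValueError("at least one language score is required")

    # Rank languages from the highest to the lowest score.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top_lang, top_score = ranked[0]
    second_score = ranked[1][1] if len(ranked) > 1 else 0.0

    # Assumed rule: the top language must beat the runner-up by the ratio.
    if ratio_thr is not None and second_score > 0:
        if top_score / second_score < ratio_thr:
            return "mixed"

    if "ALL" in accepted or top_lang in accepted:
        return "accept"
    return "lang"


if __name__ == "__main__":
    # Very close languages: 10.0 / 9.95 is below 1.01, so the text is "mixed".
    print(classify({"cs": 10.0, "sk": 9.95}, {"cs"}, 1.01))
    # Clearly different languages pass a 1.1 threshold.
    print(classify({"cs": 10.0, "en": 2.0}, {"cs"}, 1.1))
```

Under this reading, a lower threshold such as 1.01 rejects fewer texts as mixed, which fits the note that it suits very close languages whose scores tend to lie near each other.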