Changes between Version 1 and Version 2 of languagefilter


Timestamp:
04/25/20 09:40:15 (5 years ago)
Author:
admin
Comment:

--

  • languagefilter

    v1 v2  
    11= Web Corpora Wordlist Based Language Filter =
    22
     3Summary
    34* Separates documents and paragraphs by language using word frequency lists.
    45* All languages to recognise have to be specified and respective frequency wordlists supplied.
     6
     7The method
    58* A score (a logarithm of relative corpus frequency) is calculated for each word form and language.
    69* The sum of scores of all words in paragraphs and documents is calculated for all languages.
    710* If the ratio of the scores of the two top-scoring languages is above a threshold, the top-scoring language is recorded in the header of the respective paragraph/document.
    811* A multi-language document is split into separate documents containing just paragraphs in recognised languages.
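
The scoring steps listed above can be sketched in Python as follows. This is a minimal illustration, not the code of `lang_filter.py`: the function names, the base-10 logarithm, the `scale` factor that keeps scores positive, and the lowercasing of tokens before lookup are all assumptions made for the example.

```python
import math


def word_scores(wordlists, scale=1e9):
    """Turn {lang: {word: count}} into {lang: {word: score}}.

    Assumption: score = log10 of the relative frequency, scaled by `scale`
    and clipped at zero. The real script may use another base or scaling.
    """
    scores = {}
    for lang, counts in wordlists.items():
        total = sum(counts.values())
        scores[lang] = {
            word: max(0.0, math.log10(count / total * scale))
            for word, count in counts.items()
        }
    return scores


def classify(tokens, scores, ratio_threshold):
    """Sum word scores per language and apply the ratio threshold.

    Returns (language or None, per-language totals). Unknown words score 0.
    """
    if len(scores) < 2:
        raise ValueError("at least two languages are needed to form a ratio")
    totals = {
        lang: sum(table.get(tok.lower(), 0.0) for tok in tokens)
        for lang, table in scores.items()
    }
    ranked = sorted(totals.values(), reverse=True)
    best, second = ranked[0], ranked[1]
    best_lang = max(totals, key=totals.get)
    # Accept the top language only if it beats the runner-up by the ratio.
    confident = best > 0 and (second == 0 or best / second > ratio_threshold)
    return (best_lang if confident else None), totals
```

With toy wordlists, `classify(["The", "dog"], word_scores(wordlists), 1.5)` returns the top language only when its total exceeds the runner-up's total by more than the given ratio. Otherwise it returns `None`, which corresponds to leaving the paragraph unrecognised.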
     12
     13Frequency wordlists
    914* Frequency wordlists from big web corpora for more than 40 languages are included with the script.
    1015* The size of the wordlists included with the script is limited. User-produced wordlists of unlimited size can be used to improve the performance, especially in the case of very similar languages.
     16* It is important to [Unitok tokenise] all wordlists (used in a single run of the filter) the same way.
    1117
    12 === Installation ===
    13 {{{
    14 wget http://corpus.tools/raw-attachment/wiki/Downloads/wcwb_lang_filter_1.0.tar.gz
    15 tar -xzvf wcwb_lang_filter_1.0.tar.gz
    16 cd wcwb_lang_filter_1.0
    17 make test/out.vert.lang_czech
    18 }}}
     18== Examples ==
    1919
    20 === Examples ===
    21 English frequency wordlist (top 10 lines)
     20=== Sample English frequency wordlist (top 10 lines) ===
    2221{{{
    2322the     789476980
     
    3332}}}
    3433
    35 Sample output:
     34=== Sample input ===
    3635{{{
    37 <doc source="https://en.wikipedia.org/wiki/Dog" lang="english" lang_scores="czech: 408.43, slovak: 415.74, english: 1359.91">
    38 #wordform     English Czech   Slovak score for each word
    39 The           5.26    5.33    7.82
    40 dog           0.00    0.00    4.89
    41 was           0.00    0.00    6.73
    42 the           5.26    5.33    7.82
    43 first         0.00    0.00    6.14
    44 species       0.00    0.00    5.14
    45 to            7.05    7.15    7.48
    46 be            0.00    0.00    6.77
    47 domesticated  0.00    0.00    0.00
    48 [...]
     36<doc source="https://en.wikipedia.org/wiki/Dog">
     37<p>
     38Linnaeus
     39considered
     40the
     41dog
     42to
     43be
     44a
     45separate
     46species
     47<g/>
     48.
     49</p>
    4950</doc>
    5051}}}
    5152
    52 Usage:
     53=== Sample output ===
     54{{{
     55<doc source="https://en.wikipedia.org/wiki/Dog" lang="english"
     56     lang_scores="english: 49.56, czech: 19.86, slovak: 20.15">
     57<par_langs lang="english" lang_scores="english: 49.56, czech: 19.86, slovak: 20.15"/>
     58<p>
     59#wordform  English  Czech Slovak score for each word
     60Linnaeus      0.00   0.00   0.00   #unknown to all sample wordlists
     61considered    5.18   0.00   0.00   #English only
     62the           7.82   5.26   5.33   #English word, ~100 x more frequent in the English wl
     63dog           4.89   0.00   0.00
     64to            7.48   7.05   7.15   #a valid word in all three languages
     65be            6.77   0.00   0.00
     66a             7.37   7.56   7.66   #a valid word in all three languages
     67separate      4.91   0.00   0.00
     68species       5.14   0.00   0.00
     69<g/>
     70.             0.00   0.00   0.00   #punctuation is omitted from wordlists
     71</p>
     72</doc>
     73}}}
     74
     75
     76== Installation ==
     77{{{
     78wget http://corpus.tools/raw-attachment/wiki/Downloads/wcwb_lang_filter_1.0.tar.gz
     79tar -xzvf wcwb_lang_filter_1.0.tar.gz
     80cd wcwb_lang_filter_1.0
     81make test/out.vert.lang_czech
     82}}}
     83
     84== Usage ==
    5385{{{
    5486./lang_filter.py (LANGUAGE FRQ_WORDLIST[.gz|.xz])+ ACCEPTED_LANGS REJECTED_OUT LANG_RATIO_THRESHOLD
     
    94126}}}
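
The usage line accepts frequency wordlists that are plain, `.gz` or `.xz` files. A minimal sketch of reading such a file in Python is shown below, assuming the tab-separated `word<TAB>count` layout of the sample wordlist. The helper names `open_wordlist` and `load_wordlist` are hypothetical and not part of the script.

```python
import gzip
import lzma


def open_wordlist(path):
    """Open a plain, gzip or xz compressed wordlist as UTF-8 text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".xz"):
        return lzma.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def load_wordlist(path):
    """Read 'word<TAB>count' lines into a dict, skipping malformed lines."""
    counts = {}
    with open_wordlist(path) as handle:
        for line in handle:
            # Split on the last tab so the word itself may contain spaces.
            word, sep, count = line.rstrip("\n").rpartition("\t")
            if not sep or not count.isdigit():
                continue
            counts[word] = counts.get(word, 0) + int(count)
    return counts
```

Silently skipping malformed lines is a design choice of this sketch. A stricter loader could raise an error instead.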
    95127
     128== To build your own frequency wordlist ==
     129{{{
     130#Get corpus frequencies of lowercased words from a corpus compiled by [https://nlp.fi.muni.cz/trac/noske Sketch Engine]
     131lsclex -f /corpora/registry/english_web_corpus lc | cut -f2,3 | ./uninorm_4.py | perl -pe 's, (\d+)$,\t$1,' > en.wl1
     132lsclex -f /corpora/registry/czech_web_corpus   lc | cut -f2,3 | ./uninorm_4.py | perl -pe 's, (\d+)$,\t$1,' > cs.wl1
     133lsclex -f /corpora/registry/slovak_web_corpus  lc | cut -f2,3 | ./uninorm_4.py | perl -pe 's, (\d+)$,\t$1,' > sk.wl1
     134
     135#Or get the same from a vertical file
     136cut -f1 english_web_corpus.vert | grep -v '^<' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | perl -pe 's,^\s*(\d+) (.*)$,$2\t$1,' > en.wl1
     137cut -f1 czech_web_corpus.vert   | grep -v '^<' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | perl -pe 's,^\s*(\d+) (.*)$,$2\t$1,' > cs.wl1
     138cut -f1 slovak_web_corpus.vert  | grep -v '^<' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | perl -pe 's,^\s*(\d+) (.*)$,$2\t$1,' > sk.wl1
     139
     140#Filter the wordlist -- allow just characters valid for the language and a reasonable word length
     141grep '[abcdefghijklmnopqrstuvwxyz]' en.wl1                  | grep -v -P "['.-]{2}" | ./wl_grep.py "[#@]?[abcdefghijklmnopqrstuvwxyzéè0-9'][abcdefghijklmnopqrstuvwxyzéè0-9'.-]{0,29}"                               > en.wl2
     142grep '[aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž]' cs.wl1   | grep -v -P "['.-]{2}" | ./wl_grep.py "[#@]?[aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž0-9'][aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž0-9'.-]{0,29}"     > cs.wl2
     143grep '[aáäbcčdďeéfghiíjklĺľmnňoóôpqrŕsštťuúvwxyýzž]' sk.wl1 | grep -v -P "['.-]{2}" | ./wl_grep.py "[#@]?[aáäbcčdďeéfghiíjklĺľmnňoóôpqrŕsštťuúvwxyýzž0-9'][aáäbcčdďeéfghiíjklĺľmnňoóôpqrŕsštťuúvwxyýzž0-9'.-]{0,29}" > sk.wl2
     144
     145#Sort (not necessary) and pack
     146for f in {en,cs,sk}.wl2; do sort -k2,2rg -k1,1 "$f" | gzip > "${f}.frqwl.gz"; done
     147}}}
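
For readers without the shell tools at hand, the "vertical file" variant of the counting step can be approximated in Python. This is a sketch under assumptions: the function names are hypothetical, empty lines are skipped (the shell pipeline would count them), and the character filtering done by `wl_grep.py` is not reproduced.

```python
from collections import Counter


def count_wordforms(vertical_lines):
    """Count lowercased word forms from the first column of a vertical file.

    Lines starting with '<' are structure tags (e.g. <doc>, <p>, <g/>) and
    are skipped, mirroring `grep -v '^<'` in the shell version.
    """
    counts = Counter()
    for line in vertical_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("<"):
            continue
        wordform = line.split("\t", 1)[0]  # same as `cut -f1`
        counts[wordform.lower()] += 1
    return counts


def format_wordlist(counts):
    """Return 'word<TAB>count' lines, most frequent first, ties by word."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [f"{word}\t{count}" for word, count in ordered]
```

Calling `format_wordlist(count_wordforms(open("english_web_corpus.vert", encoding="utf-8")))` would yield the lines to write to `en.wl1`.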
     148
    96149
    97150== Get Language Filter ==