Changes between Initial Version and Version 1 of languagefilter


Timestamp:
04/24/20 18:20:48 (5 years ago)
Author:
admin
Comment:

created

= Web Corpora Wordlist Based Language Filter =

* Separates documents and paragraphs by language using word frequency lists.
* All languages to be recognised have to be specified, and a frequency wordlist has to be supplied for each of them.
* A score (a logarithm of relative corpus frequency) is calculated for each word form and language.
* For each language, the scores of all words in a paragraph or document are summed.
* If the ratio of the scores of the two top scoring languages is above a threshold, the top scoring language is recorded in the header of the respective paragraph/document.
* A multi-language document is split into separate documents, each containing just the paragraphs in one recognised language.
* Frequency wordlists from big web corpora for more than 40 languages are included with the script.
* The size of the wordlists included with the script is limited. User-produced wordlists without a size limit can be used to improve the performance, especially for very similar languages.
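The scoring steps listed above can be sketched in a few lines of Python. This is only an illustration, not the code of `lang_filter.py`: the function names, the per-billion scaling and the clipping at zero are assumptions made here so that the scores come out positive, and the script's exact formula may differ.

```python
import math


def word_scores(wordlist):
    """Map each word form to a log-scaled relative corpus frequency."""
    total = sum(wordlist.values())
    # Assumption: scale to occurrences per billion tokens and clip at zero;
    # the real script may use a different scaling.
    return {w: max(0.0, math.log10(c / total * 1e9)) for w, c in wordlist.items()}


def detect_language(tokens, scores_by_lang, ratio_threshold=1.1):
    """Return (language or None, per-language score sums) for a token list.

    Expects at least two languages in scores_by_lang.
    """
    sums = {
        lang: sum(scores.get(tok.lower(), 0.0) for tok in tokens)
        for lang, scores in scores_by_lang.items()
    }
    ranked = sorted(sums.items(), key=lambda kv: kv[1], reverse=True)
    (top_lang, top), (_, second) = ranked[0], ranked[1]
    if top == 0.0:
        return None, sums  # no token was found in any wordlist
    if second > 0.0 and top / second <= ratio_threshold:
        return None, sums  # the two best languages are too close to call
    return top_lang, sums
```

Words missing from a wordlist contribute a score of zero, so a language only accumulates score from the words it actually knows.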

=== Installation ===
{{{
wget http://corpus.tools/raw-attachment/wiki/Downloads/wcwb_lang_filter_1.0.tar.gz
tar -xzvf wcwb_lang_filter_1.0.tar.gz
cd wcwb_lang_filter_1.0
make test/out.vert.lang_czech
}}}

=== Examples ===
English frequency wordlist (top 10 lines):
{{{
the     789476980
and     438853153
of      408834726
to      357762038
a       275739809
in      271813858
for     150888766
is      149294535
that    122527476
on      102044349
}}}
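A wordlist in this format (one `<word>TAB<count>` record per line, optionally gzipped or xzipped) could be read along the following lines. This is a minimal sketch: `open_text` and `load_wordlist` are hypothetical helpers, UTF-8 encoding is assumed, and merging the counts of case variants is one possible way to get a case-insensitive comparison, not necessarily what the script does.

```python
import gzip
import lzma


def open_text(path):
    """Open a plain, gzipped or xzipped text file by its suffix."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".xz"):
        return lzma.open(path, "rt", encoding="utf-8")
    return open(path, encoding="utf-8")


def load_wordlist(path):
    """Read <word>TAB<count> records into a dict of lowercased word -> count."""
    counts = {}
    with open_text(path) as f:
        for line in f:
            word, sep, count = line.rstrip("\n").rpartition("\t")
            if not sep or not word:
                continue  # skip empty or malformed lines
            key = word.lower()  # assumption: merge case variants
            counts[key] = counts.get(key, 0) + int(count)
    return counts
```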

Sample output:
{{{
<doc source="https://en.wikipedia.org/wiki/Dog" lang="english" lang_scores="czech: 408.43, slovak: 415.74, english: 1359.91">
#wordform     Czech   Slovak  English   (score of each word for each language)
The           5.26    5.33    7.82
dog           0.00    0.00    4.89
was           0.00    0.00    6.73
the           5.26    5.33    7.82
first         0.00    0.00    6.14
species       0.00    0.00    5.14
to            7.05    7.15    7.48
be            0.00    0.00    6.77
domesticated  0.00    0.00    0.00
[...]
</doc>
}}}
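The `lang_scores` attribute in the header can be read back to inspect how clear the decision was. The sketch below assumes the attribute always has the `name: value, name: value` shape shown in the sample; `parse_lang_scores` and `top_two_ratio` are hypothetical helper names, not part of the script.

```python
import re


def parse_lang_scores(header):
    """Extract {language: score} from a header line with a lang_scores attribute."""
    m = re.search(r'lang_scores="([^"]*)"', header)
    if m is None:
        return {}
    scores = {}
    for item in m.group(1).split(","):
        if not item.strip():
            continue  # tolerate an empty attribute value
        lang, _, value = item.partition(":")
        scores[lang.strip()] = float(value)
    return scores


def top_two_ratio(scores):
    """Ratio of the best score to the second best (inf if it cannot be computed)."""
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) < 2 or ranked[1] <= 0:
        return float("inf")
    return ranked[0] / ranked[1]
```

For the sample header above, the English score is roughly 3.27 times the Slovak one, far above a threshold such as 1.01 or 1.1.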

Usage:
{{{
./lang_filter.py (LANGUAGE FRQ_WORDLIST[.gz|.xz])+ ACCEPTED_LANGS REJECTED_OUT LANG_RATIO_THRESHOLD

E.g. to see all score information in a single output file:
./lang_filter.py Czech wl/czech.frqwl.gz Slovak wl/slovak.frqwl.gz \
    English wl/english.frqwl.gz Czech,Slovak out_rejected.vert 1.01 < in.vert > out.vert

Or to write each language to a separate file:
./lang_filter.py Czech wl/czech.frqwl.gz Slovak wl/slovak.frqwl.gz \
    English wl/english.frqwl.gz Czech,Slovak out_rejected.vert 1.01 < in.vert \
    | grep -v '^<par_langs' | cut -f1 \
    | ./vertsplit_by_attr.py doc lang out.vert.lang_

LANGUAGE language name

FRQ_WORDLIST corpus frequency wordlist
    format: <word>TAB<count>, one record per line, can be gzipped or xzipped

ACCEPTED_LANGS comma separated list of accepted languages (ALL to accept all)

REJECTED_OUT a path to write rejected data to; three files with the following suffixes are created:
    "lang"  ... rejected since the content is in an unwanted recognised language
    "mixed" ... rejected since the content is below the threshold of discerning languages
                (too similar to two top scoring recognised languages)
    "small" ... rejected since the content is too short to reliably determine its language

LANG_RATIO_THRESHOLD threshold ratio of the wordlist scores of the top and the second highest scoring languages
    (NONE to not filter mixed-language content at all)
    1.1 is ok for different languages, 1.01 is better for very close languages

Input format: Vertical (tokenised text, one token per line), <doc/> and <p/> structures.

Output format: The same as input with the following additions:
    Attribute doc.lang -- the language with the top score
    Attribute doc.lang_scores -- scores for each recognised language
    Structure par_langs with attributes lang and lang_scores -- the same for paragraphs
    New token attributes (columns): scores for all recognised languages (in the command line order)

A word form is expected in the first column of the input.
The script keeps any other token attributes (e.g. lemma, tag).
The wordlist comparison is case insensitive.
}}}
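The post-processing stage of the second example (`grep -v '^<par_langs' | cut -f1`, followed by a split on the `doc.lang` attribute) can be mirrored in Python when the output is consumed by another script. This is a rough sketch under stated assumptions: `strip_scores` reproduces only the `grep`/`cut` step, and `group_by_doc_lang` is a guess at the kind of grouping `vertsplit_by_attr.py` performs, not its implementation.

```python
import re
from collections import defaultdict

# Matches an opening <doc ...> tag and captures its lang attribute.
DOC_LANG = re.compile(r'^<doc\b[^>]*\blang="([^"]*)"')


def strip_scores(lines):
    """Drop lines starting with <par_langs and keep only the first column."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("<par_langs"):
            continue
        yield line.split("\t", 1)[0]


def group_by_doc_lang(lines):
    """Collect the lines of each document under the value of its doc.lang."""
    groups = defaultdict(list)
    current = None
    for line in lines:
        m = DOC_LANG.match(line)
        if m:
            current = m.group(1)
        if current is not None:
            groups[current].append(line)
        if line.startswith("</doc>"):
            current = None
    return dict(groups)
```

Structure lines contain no TAB, so taking the first column leaves them unchanged, just as `cut -f1` does.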


== Get Language Filter ==
See [wiki:Downloads] for the latest version.


== Licence ==
Language Filter is licensed under [https://www.gnu.org/licenses/old-licenses/gpl-2.0.txt GNU General Public License Version 2].