| 1 | = Web Corpora Wordlist Based Language Filter = |
| 2 | |
| 3 | * Separates documents and paragraphs by language using word frequency lists. |
| 4 | * Every language to be recognised must be specified, together with its frequency wordlist. |
| 5 | * A score (a logarithm of relative corpus frequency) is calculated for each word form and language. |
| 6 | * The sum of scores of all words in paragraphs and documents is calculated for all languages. |
| 7 | * If the ratio of the scores of the two top scoring languages is above a threshold, the top scoring language is recorded in the header of the respective paragraph/document. |
| 8 | * A multi-language document is split into separate documents, each containing only the paragraphs in one recognised language. |
| 9 | * Frequency wordlists from big web corpora for more than 40 languages are included with the script. |
| 10 | * The size of the wordlists included with the script is limited. User-produced, unlimited wordlists can be used to improve performance, especially for very similar languages. |
| 11 | |
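The scoring steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code of `lang_filter.py`: the function names `word_scores` and `classify` are hypothetical, and the exact scaling of the logarithm (here log10 of the frequency per billion tokens, floored at zero) is an assumption, since the page only states that the score is a logarithm of relative corpus frequency. The handling of very short content (the "small" case) is not modelled.

```python
import math


def word_scores(wordlists):
    """Turn {language: {word: count}} into {language: {word: score}}."""
    scores = {}
    for lang, counts in wordlists.items():
        total = sum(counts.values())
        # Assumed scaling: log10 of the frequency per billion tokens, floored at 0.
        scores[lang] = {
            word: max(0.0, math.log10(count / total * 1e9))
            for word, count in counts.items()
        }
    return scores


def classify(tokens, scores, ratio_threshold):
    """Sum word scores per language and compare the two top scoring languages.

    Expects at least two languages. Returns (language, totals); language is
    None when nothing matched or when the top/second ratio is not above the
    threshold.
    """
    totals = {
        lang: sum(table.get(token.lower(), 0.0) for token in tokens)
        for lang, table in scores.items()
    }
    ranked = sorted(totals, key=totals.get, reverse=True)
    top, second = ranked[0], ranked[1]
    if totals[top] == 0:
        return None, totals
    if totals[second] > 0 and totals[top] / totals[second] <= ratio_threshold:
        return None, totals  # too close to call ("mixed")
    return top, totals


# Toy wordlists, invented for this example only.
toy = {
    "english": {"the": 900, "dog": 100},
    "czech": {"pes": 500, "je": 500},
}
toy_scores = word_scores(toy)
print(classify(["The", "dog"], toy_scores, 1.1))
```

With real wordlists the totals are dominated by frequent function words, which is why a ratio close to 1 (such as 1.01) can still separate closely related languages.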
| 12 | === Installation === |
| 13 | {{{ |
| 14 | wget http://corpus.tools/raw-attachment/wiki/Downloads/wcwb_lang_filter_1.0.tar.gz |
| 15 | tar -xzvf wcwb_lang_filter_1.0.tar.gz |
| 16 | cd wcwb_lang_filter_1.0 |
| 17 | make test/out.vert.lang_czech |
| 18 | }}} |
| 19 | |
| 20 | === Examples === |
| 21 | English frequency wordlist (top 10 lines) |
| 22 | {{{ |
| 23 | the 789476980 |
| 24 | and 438853153 |
| 25 | of 408834726 |
| 26 | to 357762038 |
| 27 | a 275739809 |
| 28 | in 271813858 |
| 29 | for 150888766 |
| 30 | is 149294535 |
| 31 | that 122527476 |
| 32 | on 102044349 |
| 33 | }}} |
| 34 | |
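A wordlist in this format could be loaded with a few lines of Python. The sketch below is an assumption about how one might read such a file, not the loader used by the script: `open_wordlist` and `load_wordlist` are hypothetical names, UTF-8 encoding is assumed, and folding case variants together by summing their counts is one possible reading of the case-insensitive comparison described under Usage. The Usage section specifies a TAB between word and count.

```python
import gzip
import lzma


def open_wordlist(path):
    """Open a plain, gzipped or xz-compressed wordlist as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".xz"):
        return lzma.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def load_wordlist(path):
    """Read <word>TAB<count> records into a {lowercased word: count} dict."""
    counts = {}
    with open_wordlist(path) as f:
        for line in f:
            line = line.rstrip("\n")
            # Split on the last TAB so the count is always the final field.
            word, _, count = line.rpartition("\t")
            if not word:
                continue  # skip empty or malformed lines
            key = word.lower()
            # Assumption: case variants are merged by summing their counts.
            counts[key] = counts.get(key, 0) + int(count)
    return counts
```

For example, `load_wordlist("wl/english.frqwl.gz")` would return a dictionary mapping `"the"` to its corpus count.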
| 35 | Sample output: |
| 36 | {{{ |
| 37 | <doc source="https://en.wikipedia.org/wiki/Dog" lang="english" lang_scores="czech: 408.43, slovak: 415.74, english: 1359.91"> |
| 38 | #wordform Czech Slovak English (score for each word, in the command line order) |
| 39 | The 5.26 5.33 7.82 |
| 40 | dog 0.00 0.00 4.89 |
| 41 | was 0.00 0.00 6.73 |
| 42 | the 5.26 5.33 7.82 |
| 43 | first 0.00 0.00 6.14 |
| 44 | species 0.00 0.00 5.14 |
| 45 | to 7.05 7.15 7.48 |
| 46 | be 0.00 0.00 6.77 |
| 47 | domesticated 0.00 0.00 0.00 |
| 48 | [...] |
| 49 | </doc> |
| 50 | }}} |
| 51 | |
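The `lang_scores` attribute in the header above can be checked against a threshold after the fact. The following is a small sketch for post-processing the output, with hypothetical helper names `parse_lang_scores` and `top_two_ratio`; it assumes the attribute always has the `name: value, name: value` shape shown in the sample.

```python
import re


def parse_lang_scores(header):
    """Extract {language: score} from a header line with a lang_scores attribute."""
    match = re.search(r'lang_scores="([^"]*)"', header)
    if match is None:
        return {}
    scores = {}
    for item in match.group(1).split(","):
        if not item.strip():
            continue
        name, _, value = item.partition(":")
        scores[name.strip()] = float(value)
    return scores


def top_two_ratio(scores):
    """Ratio of the top score to the second best score (inf if undefined)."""
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) < 2 or ranked[1] == 0:
        return float("inf")
    return ranked[0] / ranked[1]


header = ('<doc source="https://en.wikipedia.org/wiki/Dog" lang="english" '
          'lang_scores="czech: 408.43, slovak: 415.74, english: 1359.91">')
scores = parse_lang_scores(header)
print(max(scores, key=scores.get), round(top_two_ratio(scores), 2))
```

For the sample document the English total is more than three times the Slovak one, far above thresholds such as 1.1 or 1.01.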
| 52 | Usage: |
| 53 | {{{ |
| 54 | ./lang_filter.py (LANGUAGE FRQ_WORDLIST[.gz|.xz])+ ACCEPTED_LANGS REJECTED_OUT LANG_RATIO_THRESHOLD |
| 55 | |
| 56 | E.g. to see all score information in a single output file: |
| 57 | ./lang_filter.py Czech wl/czech.frqwl.gz Slovak wl/slovak.frqwl.gz \ |
| 58 | English wl/english.frqwl.gz Czech,Slovak out_rejected.vert 1.01 < in.vert > out.vert |
| 59 | |
| 60 | Or to split the output into separate files by language: |
| 61 | ./lang_filter.py Czech wl/czech.frqwl.gz Slovak wl/slovak.frqwl.gz \ |
| 62 | English wl/english.frqwl.gz Czech,Slovak out_rejected.vert 1.01 < in.vert > out.vert \ |
| 63 | | grep -v '^<par_langs' | cut -f1 \ |
| 64 | | ./vertsplit_by_attr.py doc lang out.vert.lang_ |
| 65 | |
| 66 | LANGUAGE language name |
| 67 | |
| 68 | FRQ_WORDLIST corpus frequency wordlist |
| 69 | format: <word>TAB<count>, one record per line, can be gzipped or xzipped |
| 70 | |
| 71 | ACCEPTED_LANGS comma separated list of accepted languages (ALL to accept all) |
| 72 | |
| 73 | REJECTED_OUT a path to write rejected data to; three files with the following suffixes are created: |
| 74 | "lang" ... rejected since the content is in an unwanted recognised language |
| 75 | "mixed" ... rejected since the score ratio is not above the threshold |
| 76 | (the two top scoring recognised languages cannot be told apart reliably) |
| 77 | "small" ... rejected since the content is too short to reliably determine its language |
| 78 | |
| 79 | LANG_RATIO_THRESHOLD threshold ratio of wordlist scores of top/second most scoring languages |
| 80 | (NONE not to filter mixed language at all) |
| 81 | 1.1 is ok for different languages, 1.01 is better for very close languages |
| 82 | |
| 83 | Input format: Vertical (tokenised text, one token per line), <doc/> and <p/> structures. |
| 84 | |
| 85 | Output format: The same as input with the following additions: |
| 86 | Attribute doc.lang -- the language with the top score |
| 87 | Attribute doc.lang_scores -- scores for each recognised language |
| 88 | Structure par_langs with attributes lang and lang_scores -- the same for paragraphs |
| 89 | New token attributes (columns): scores for all recognised languages (in the command line order) |
| 90 | |
| 91 | A word form is expected in the first column of the input. |
| 92 | The script keeps any other token attributes (e.g. lemma, tag). |
| 93 | The wordlist comparison is case insensitive. |
| 94 | }}} |
| 95 | |
| 96 | |
| 97 | == Get Language Filter == |
| 98 | See [wiki:Downloads] for the latest version. |
| 99 | |
| 100 | |
| 101 | == Licence == |
| 102 | Language Filter is licensed under [https://www.gnu.org/licenses/old-licenses/gpl-2.0.txt GNU General Public License Version 2]. |