Changes between Initial Version and Version 1 of languagefilter

Timestamp:: 04/24/20 18:20:48 (5 years ago)
Author:: admin
Comment:: created

+= Web Corpora Wordlist Based Language Filter =
+* Separates documents and paragraphs by language using word frequency lists.
+* Every language to be recognised has to be specified, and a frequency wordlist supplied for each.
+* A score (a logarithm of relative corpus frequency) is calculated for each word form and language.
+* The sum of scores of all words in paragraphs and documents is calculated for all languages.
+* If the ratio of the scores of the two top-scoring languages is above a threshold, the top-scoring language is recorded in the header of the respective paragraph or document.
+* A multi-language document is split into separate documents containing only paragraphs in the recognised languages.
+* Frequency wordlists from big web corpora for more than 40 languages are included with the script.
+* The size of the wordlists included with the script is limited. User-produced wordlists of unlimited size can be used to improve performance, especially for very similar languages.
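The scoring steps listed above can be sketched roughly as follows. This is an illustration only, not the code of lang_filter.py: the function names, the base-10 logarithm, the `SCALE` constant and the clamping of scores at zero are all assumptions made so that the example is self-contained.

```python
import math

# Assumption: relative frequency is scaled (here per billion tokens) so that
# the logarithm is positive for attested words. The real script may differ.
SCALE = 1_000_000_000

def word_scores(wordlist):
    """Map each word form to the log of its scaled relative corpus frequency."""
    total = sum(wordlist.values())
    return {
        word: max(0.0, math.log10(count / total * SCALE))
        for word, count in wordlist.items()
        if count > 0
    }

def score_text(tokens, scores_by_lang):
    """Sum the per-word scores for every language; unknown words score 0."""
    return {
        lang: sum(scores.get(tok.lower(), 0.0) for tok in tokens)
        for lang, scores in scores_by_lang.items()
    }

def pick_language(totals, threshold):
    """Return the top language if top/second exceeds the threshold, else None.

    Expects scores for at least two languages.
    """
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    (top_lang, top), (_, second) = ranked[0], ranked[1]
    if top <= 0:
        return None  # no evidence for any language
    if second <= 0 or top / second > threshold:
        return top_lang
    return None  # the two best languages are too close to call
```

Calling `score_text` on the tokens of a paragraph and then `pick_language` on the result mirrors the paragraph-level decision; the same two calls on all tokens of a document mirror the document-level one.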
+=== Installation ===
+{{{
+wget http://corpus.tools/raw-attachment/wiki/Downloads/wcwb_lang_filter_1.0.tar.gz
+tar -xzvf wcwb_lang_filter_1.0.tar.gz
+cd wcwb_lang_filter_1.0
+make test/out.vert.lang_czech
+}}}
+=== Examples ===
+English frequency wordlist (top 10 lines)
+{{{
+the     789476980
+and     438853153
+of      408834726
+to      357762038
+a       275739809
+in      271813858
+for     150888766
+is      149294535
+that    122527476
+on      102044349
+}}}
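A wordlist in this format could be read along the following lines. This is a minimal sketch, assuming TAB-separated records as described in the usage text; `open_wordlist` and `parse_wordlist` are hypothetical helper names, and merging the counts of case variants is an assumption motivated by the case-insensitive comparison.

```python
import gzip
import lzma

def open_wordlist(path):
    """Open a plain, gzipped or xzipped wordlist as a text stream."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".xz"):
        return lzma.open(path, "rt", encoding="utf-8")
    return open(path, encoding="utf-8")

def parse_wordlist(lines):
    """Parse '<word>TAB<count>' records into a dict of lowercased word -> count."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, count = line.rsplit("\t", 1)
        key = word.lower()
        # Assumption: counts of case variants ("The", "the") are added up.
        counts[key] = counts.get(key, 0) + int(count)
    return counts
```

With these helpers, `parse_wordlist(open_wordlist("wl/english.frqwl.gz"))` would load the English list used in the usage examples.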
+Sample output:
+{{{
+<doc source="https://en.wikipedia.org/wiki/Dog" lang="english" lang_scores="czech: 408.43, slovak: 415.74, english: 1359.91">
+#wordform     Czech   Slovak  English  (score for each word, in the command line order)
+The           5.26    5.33    7.82
+dog           0.00    0.00    4.89
+was           0.00    0.00    6.73
+the           5.26    5.33    7.82
+first         0.00    0.00    6.14
+species       0.00    0.00    5.14
+to            7.05    7.15    7.48
+be            0.00    0.00    6.77
+domesticated  0.00    0.00    0.00
+[...]
+</doc>
+}}}
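The `lang_scores` attribute of such a header can be turned back into numbers to inspect how clear the decision was. The sketch below is illustrative; the helper names are hypothetical and the parsing assumes the attribute looks exactly like the one in the sample.

```python
import re

def parse_lang_scores(header):
    """Extract {language: score} from the lang_scores attribute of a <doc> header."""
    match = re.search(r'lang_scores="([^"]*)"', header)
    if match is None:
        return {}
    scores = {}
    for item in match.group(1).split(","):
        lang, _, value = item.partition(":")
        scores[lang.strip()] = float(value)
    return scores

def top_two_ratio(scores):
    """Ratio of the best score to the second best (needs at least two languages)."""
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] / ranked[1]
```

For the sample header, the ratio is 1359.91 / 415.74, roughly 3.27, far above thresholds such as 1.01.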
+Usage:
+{{{
+./lang_filter.py (LANGUAGE FRQ_WORDLIST[.gz|.xz])+ ACCEPTED_LANGS REJECTED_OUT LANG_RATIO_THRESHOLD
+E.g. to see all score information in a single output file:
+./lang_filter.py Czech wl/czech.frqwl.gz Slovak wl/slovak.frqwl.gz \
+    English wl/english.frqwl.gz Czech,Slovak out_rejected.vert 1.01 < in.vert > out.vert
+Or just to output to separate documents:
+./lang_filter.py Czech wl/czech.frqwl.gz Slovak wl/slovak.frqwl.gz \
+    English wl/english.frqwl.gz Czech,Slovak out_rejected.vert 1.01 < in.vert > out.vert \
+    | grep -v '^<par_langs' | cut -f1 \
+    | ./vertsplit_by_attr.py doc lang out.vert.lang_
+LANGUAGE language name
+FRQ_WORDLIST corpus frequency wordlist
+    format: <word>TAB<count>, one record per line, can be gzipped or xzipped
+ACCEPTED_LANGS comma separated list of accepted languages (ALL to accept all)
+REJECTED_OUT a path to write rejected data to; three files with the following suffixes are created:
+    "lang"  ... rejected since the content is in an unwanted recognised language
+    "mixed" ... rejected since the content is below the threshold of discerning languages
+                (too similar to two top scoring recognised languages)
+    "small" ... rejected since the content is too short to reliably determine its language
+LANG_RATIO_THRESHOLD threshold ratio of wordlist scores of top/second most scoring languages
+    (NONE not to filter mixed language at all)
+    1.1 is ok for different languages, 1.01 is better for very close languages
+Input format: Vertical (tokenised text, one token per line), <doc/> and <p/> structures.
+Output format: The same as input with the following additions:
+    Attribute doc.lang -- the language with the top score
+    Attribute doc.lang_scores -- scores for each recognised language
+    Structure par_langs with attributes lang and lang_scores -- the same for paragraphs
+    New token attributes (columns): scores for all recognised languages (in the command line order)
+A word form is expected in the first column of the input.
+The script keeps any other token attributes (e.g. lemma, tag).
+The wordlist comparison is case insensitive.
+}}}
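The effect of splitting the output by the `doc.lang` attribute can be illustrated with a few lines of Python. This is not the implementation of vertsplit_by_attr.py; it is a hypothetical in-memory stand-in that assumes every document starts with a `<doc ...>` line carrying a `lang` attribute and ends with `</doc>`.

```python
import re
from collections import defaultdict

# Matches the lang attribute of a <doc> header, but not lang_scores.
DOC_LANG = re.compile(r'^<doc\b[^>]*\blang="([^"]*)"')

def group_docs_by_lang(lines):
    """Group the lines of a vertical file by the lang attribute of each document."""
    groups = defaultdict(list)
    current = None
    for line in lines:
        match = DOC_LANG.match(line)
        if match:
            current = match.group(1)
        if current is not None:
            groups[current].append(line)
        if line.startswith("</doc>"):
            current = None
    return dict(groups)
```

Writing each group to a file named with the prefix `out.vert.lang_` followed by the language would give one vertical file per language, as in the pipeline above.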
+== Get Language Filter ==
+See [wiki:Downloads] for the latest version.
+== Licence ==
+Language Filter is licensed under [https://www.gnu.org/licenses/old-licenses/gpl-2.0.txt GNU General Public License Version 2].