#wordform English Czech Slovak score for each word
Linnaeus 0.00 0.00 0.00 #unknown to all sample wordlists
considered 5.18 0.00 0.00 #English only
the 7.82 5.26 5.33 #English word, ~100 x more frequent in the English wl
dog 4.89 0.00 0.00
to 7.48 7.05 7.15 #a valid word in all three languages
be 6.77 0.00 0.00
a 7.37 7.56 7.66 #a valid word in all three languages
separate 4.91 0.00 0.00
species 5.14 0.00 0.00
. 0.00 0.00 0.00 #punctuation is omitted from wordlists
}}}
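The sample scores above can be totalled per language to see why the sentence as a whole looks English. The following is a minimal sketch, not the actual logic of `lang_filter.py`: the helper names `sum_scores` and `top_language` are hypothetical, and the decision rule (the best total must reach the ratio threshold times the runner-up total) is an assumption about how a value such as `LANG_RATIO_THRESHOLD` might be applied.

```python
# Sample rows copied from the score table above: word, English, Czech, Slovak.
SAMPLE = """\
Linnaeus 0.00 0.00 0.00
considered 5.18 0.00 0.00
the 7.82 5.26 5.33
dog 4.89 0.00 0.00
to 7.48 7.05 7.15
be 6.77 0.00 0.00
a 7.37 7.56 7.66
separate 4.91 0.00 0.00
species 5.14 0.00 0.00
. 0.00 0.00 0.00
"""
LANGS = ("English", "Czech", "Slovak")


def sum_scores(table, langs=LANGS):
    """Add up the per-word scores of each language column (hypothetical helper)."""
    totals = dict.fromkeys(langs, 0.0)
    for line in table.splitlines():
        # Drop trailing '#' comments, then split into word + one score per language.
        fields = line.split("#", 1)[0].split()
        if len(fields) != len(langs) + 1:
            continue  # skip blank or malformed rows
        for lang, value in zip(langs, fields[1:]):
            totals[lang] += float(value)
    return totals


def top_language(totals, ratio_threshold):
    """Return the best-scoring language, or None if it does not clearly win.

    ASSUMPTION: 'clearly wins' means best >= ratio_threshold * runner-up.
    The real filter may use a different rule.
    """
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    if best_score > 0 and best_score >= ratio_threshold * runner_up_score:
        return best
    return None


if __name__ == "__main__":
    totals = sum_scores(SAMPLE)
    print(totals)
    print(top_language(totals, 1.01))
```

Under these assumptions the English total is far above the Czech and Slovak totals, even though "to" and "a" score similarly in all three wordlists.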
== Installation ==
{{{
wget http://corpus.tools/raw-attachment/wiki/Downloads/wcwb_lang_filter_1.0.tar.gz
tar -xzvf wcwb_lang_filter_1.0.tar.gz
cd wcwb_lang_filter_1.0
make test/out.vert.lang_czech
}}}
== Usage ==
{{{
./lang_filter.py (LANGUAGE FRQ_WORDLIST[.gz|.xz])+ ACCEPTED_LANGS REJECTED_OUT LANG_RATIO_THRESHOLD
E.g. to see all score information in a single output file:
./lang_filter.py Czech wl/czech.frqwl.gz Slovak wl/slovak.frqwl.gz \
    English wl/english.frqwl.gz Czech,Slovak out_rejected.vert 1.01 < in.vert > out.vert
Or just to output to separate documents:
./lang_filter.py Czech wl/czech.frqwl.gz Slovak wl/slovak.frqwl.gz \
    English wl/english.frqwl.gz Czech,Slovak out_rejected.vert 1.01 < in.vert > out.vert \
| grep -v '^