= Onion =

onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts.

== Installation ==

=== Prerequisites ===

 * 64-bit CPU architecture
 * libjudy (>=1.0.5)

=== Configuration and installation ===

 1. Download the sources:
    {{{
    wget -O onion-1.2.tar.gz 'http://corpus.tools/raw-attachment/wiki/Downloads/onion-1.2.tar.gz'
    }}}
 2. Extract the downloaded file:
    {{{
    tar xzvf onion-1.2.tar.gz
    }}}
 3. Configure the package by editing onion-1.2/Makefile.config:
    * set PREFIX (or INSTALL_BIN and INSTALL_DATA) according to where you want the executables and data (docs) installed
    * if you have libjudy installed in a non-standard path you need to:
      * set JUDY_INC to where Judy.h is located
      * set JUDY_LIB to where libJudy.a is located
 4. Install the package (you may need sudo or a root shell for the last command):
    {{{
    cd onion-1.2/
    make
    make install
    }}}

== Quick start ==

{{{
onion -s deduplicated_documents.vert
}}}

There's also an [Onion/UsageExample usage example] on a sample input.

For usage information see:

{{{
onion -h
man onion
}}}

== Usage ==

{{{onion [OPTIONS] [FILE]}}}

Mark duplicate text parts in the input vertical file.

{{{
 -f FILE  hashes of duplicate n-grams
 -n NUM   n-gram length (default: 5)
 -t NUM   duplicate content threshold (default: 0.5)
 -d STR   document tag (default: doc)
 -p STR   paragraph tag (default: p)
 -s       strip duplicate parts (rather than mark)
 -m       no smoothing
 -T NUM   trim n-gram hashes to NUM bits (default: 64)
 -l NUM   max stub length (default: 20)
 -b NUM   buffer size, in bytes (default: 16777216)
 -q       quiet; suppress all output except for errors
 -V       print version information and exit
 -h       display this help and exit
}}}

With no FILE, or when FILE is -, read standard input. Output is written to standard output.
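
To give a rough intuition for how the n-gram length ({{{-n}}}) and the duplicate content threshold ({{{-t}}}) might interact, here is a minimal Python sketch. It is not onion's implementation: the function names {{{ngrams}}} and {{{mark_duplicates}}} are hypothetical, the strict {{{>}}} comparison is an assumption, and smoothing, stubs, hash trimming and buffering are left out entirely. It only assumes that a paragraph counts as a duplicate when the share of its n-grams already seen in earlier text exceeds the threshold.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list (empty if too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mark_duplicates(paragraphs: Iterable[List[str]], n: int = 5,
                    threshold: float = 0.5) -> List[bool]:
    """Flag each paragraph whose share of already-seen n-grams exceeds threshold.

    Illustrative only: the defaults mirror the documented -n and -t defaults,
    but the decision rule itself is an assumption, not onion's actual code.
    """
    seen: Set[Tuple[str, ...]] = set()
    flags: List[bool] = []
    for tokens in paragraphs:
        grams = ngrams(tokens, n)
        # Paragraphs shorter than n tokens have no n-grams and are never flagged.
        ratio = sum(g in seen for g in grams) / len(grams) if grams else 0.0
        flags.append(ratio > threshold)
        # Assumption: n-grams are remembered whether or not the paragraph was flagged.
        seen.update(grams)
    return flags


paragraphs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "a completely different sentence about corpus cleaning tools".split(),
    "the quick brown fox jumps over the lazy cat".split(),
]
flags = mark_duplicates(paragraphs)
print(flags)  # [False, False, True]
```

In this toy input, four of the five 5-grams in the third paragraph already occur in the first, so its ratio is 0.8 and it is flagged at the default threshold of 0.5. Raising the threshold to 0.9 would leave it unflagged.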

== Acknowledgements ==

This software has been developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of [http://www.muni.cz/ Masaryk University in Brno] with financial support from [http://presemt.eu PRESEMT] and [http://www.sketchengine.co.uk Lexical Computing Ltd.] It also relates to Jan Pomikálek's [http://is.muni.cz/th/45523/fi_d/phdthesis.pdf PhD research].

== Licence ==

Onion is licensed under the [http://opensource.org/licenses/BSD-3-Clause BSD 3-Clause License].