= Onion =
onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts.
== Installation ==
=== Prerequisites ===
* 64-bit CPU architecture
* libjudy (>=1.0.5)
=== Configuration and installation ===
1. Download the sources:
{{{
wget -O onion-1.2.tar.gz 'http://corpus.tools/raw-attachment/wiki/Downloads/onion-1.2.tar.gz'
}}}
2. Extract the downloaded file:
{{{
tar xzvf onion-1.2.tar.gz
}}}
3. Configure the package by editing onion-1.2/Makefile.config:
  * set PREFIX (or INSTALL_BIN and INSTALL_DATA) according to where you want the executables and data (docs) installed
  * if you have libjudy installed in a non-standard path, you also need to:
    * set JUDY_INC to where Judy.h is located
    * set JUDY_LIB to where libJudy.a is located
4. Install the package (you may need sudo or a root shell for the last command):
{{{
cd onion-1.2/
make
make install
}}}
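To check that the installation worked, one option is to look the executable up on the PATH and ask it for its version with the {{{-V}}} option listed under Usage. The Python sketch below is only an illustration: the helper name {{{onion_version}}} is hypothetical, and it assumes the binary was installed under the name {{{onion}}}. The exact text printed by {{{-V}}} is not specified here.

```python
import shutil
import subprocess
from typing import Optional


def onion_version(executable: str = "onion") -> Optional[str]:
    """Return the text printed by `<executable> -V`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        # Not installed, or INSTALL_BIN is not on PATH.
        return None
    result = subprocess.run([path, "-V"], capture_output=True, text=True)
    # Assumption: the version text may go to stdout or stderr, so accept either.
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    version = onion_version()
    print(version if version is not None else "onion not found on PATH")
```

If the function returns {{{None}}}, the directory chosen via PREFIX or INSTALL_BIN is probably missing from the PATH.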
== Quick start ==
{{{
onion -s documents.vert > deduplicated_documents.vert
}}}
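The same invocation can be driven from a script. The Python sketch below writes a tiny sample input and runs onion on it only if the executable is found. It rests on several assumptions: the vertical format is taken to be one token per line with structure tags on their own lines, the tags use the default names from the {{{-d}}} and {{{-p}}} options, and the names {{{SAMPLE_VERT}}}, {{{onion_command}}}, {{{deduplicate}}} and the file names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# Assumed vertical layout: one token per line, <doc> and <p> on their own lines.
PARAGRAPH = "<p>\nThis\nsentence\nappears\ntwice\nin\nthe\nsample\n.\n</p>\n"
SAMPLE_VERT = "<doc>\n" + PARAGRAPH * 2 + "</doc>\n"


def onion_command(input_path: str, strip: bool = True) -> List[str]:
    """Build the argument list for running onion on one input file."""
    cmd = ["onion"]
    if strip:
        cmd.append("-s")  # strip duplicate parts rather than mark them
    cmd.append(input_path)
    return cmd


def deduplicate(input_path: str, output_path: str) -> bool:
    """Run onion if it is installed; return False when it is not on PATH."""
    if shutil.which("onion") is None:
        return False
    # onion writes its result to standard output, so redirect it to a file.
    with open(output_path, "wb") as out:
        subprocess.run(onion_command(input_path), stdout=out, check=True)
    return True


if __name__ == "__main__":
    Path("sample.vert").write_text(SAMPLE_VERT, encoding="utf-8")
    deduplicate("sample.vert", "sample.dedup.vert")
```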
There's also an [Onion/UsageExample usage example] on a sample input.
For usage information see:
{{{
onion -h
man onion
}}}
== Usage ==
{{{onion [OPTIONS] [FILE]}}}
Mark duplicate text parts in the input vertical file.
{{{
 -f FILE   hashes of duplicate n-grams
 -n NUM    n-gram length (default: 5)
 -t NUM    duplicate content threshold (default: 0.5)
 -d STR    document tag (default: doc)
 -p STR    paragraph tag (default: p)
 -s        strip duplicate parts (rather than mark)
 -m        no smoothing
 -T NUM    trim n-gram hashes to NUM bits (default: 64)
 -l NUM    max stub length (default: 20)
 -b NUM    buffer size, in bytes (default: 16777216)
 -q        quiet; suppress all output except for errors
 -V        print version information and exit
 -h        display this help and exit
}}}
With no FILE, or when FILE is -, read standard input. Output is written to standard output.
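To give an intuition for how the {{{-n}}} and {{{-t}}} options interact, the following Python sketch flags a paragraph as duplicate when the share of its n-grams already seen earlier in the input exceeds the threshold. This is a simplified illustration, not onion's implementation: it ignores smoothing, stubs, hash trimming and buffering. The function names, the strict {{{>}}} comparison and the treatment of paragraphs shorter than n tokens are all assumptions.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list (empty if it is too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mark_duplicates(paragraphs: Iterable[List[str]], n: int = 5,
                    threshold: float = 0.5) -> List[bool]:
    """Flag each paragraph whose share of already-seen n-grams exceeds threshold."""
    seen: Set[Tuple[str, ...]] = set()
    flags: List[bool] = []
    for tokens in paragraphs:
        grams = ngrams(tokens, n)
        if grams:
            ratio = sum(g in seen for g in grams) / len(grams)
        else:
            ratio = 0.0  # assumption: paragraphs shorter than n are never flagged
        flags.append(ratio > threshold)
        seen.update(grams)  # assumption: remember the n-grams of every paragraph
    return flags


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog".split()
    b = "an entirely different sentence with no shared word sequences".split()
    print(mark_duplicates([a, b, a]))  # [False, False, True]
```

In this sketch a smaller n makes chance matches between unrelated paragraphs more likely, while a higher threshold requires more of a paragraph to be repeated before it is flagged.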
== Acknowledgements ==
This software has been developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of [http://www.muni.cz/ Masaryk University in Brno] with financial support from [http://presemt.eu PRESEMT] and [http://www.sketchengine.co.uk Lexical Computing Ltd.] It is also related to Jan Pomikálek's [http://is.muni.cz/th/45523/fi_d/phdthesis.pdf PhD research].
== Licence ==
Onion is licensed under the [http://opensource.org/licenses/BSD-3-Clause BSD 3-Clause License].