= Onion =
onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts.
== Installation ==
=== Prerequisites ===
* 64-bit CPU architecture
* libjudy (>=1.0.5)
=== Configuration and installation ===
1. Download the sources:
{{{
wget -O onion-1.4.tar.gz 'http://corpus.tools/raw-attachment/wiki/Downloads/onion-1.4.tar.gz'
}}}
2. Extract the downloaded file:
{{{
tar xzvf onion-1.4.tar.gz
}}}
3. Configure the package by editing onion-1.4/Makefile.config:
  * set PREFIX (or INSTALL_BIN and INSTALL_DATA) to the directories where the executables and data (docs) should be installed
  * if libjudy is installed in a non-standard path:
    * set JUDY_INC to the directory containing Judy.h
    * set JUDY_LIB to the directory containing libJudy.a
4. Install the package (you may need sudo or a root shell for the last command):
{{{
cd onion-1.4/
make
make install
}}}
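Before running make, it can help to confirm the prerequisites. The following is a minimal sketch, not part of the onion distribution: it reports the word size and looks for Judy.h in a few common include directories. The directory list is an assumption; adjust it to your system.

```shell
#!/bin/sh
# Report the word size of the C environment (64 is expected for onion).
bits=$(getconf LONG_BIT 2>/dev/null || echo unknown)
echo "word size: $bits"

# Look for Judy.h in a few common include directories.
# This list is an assumption, not something onion itself checks.
judy_inc=""
for dir in /usr/include /usr/local/include /opt/local/include; do
    if [ -f "$dir/Judy.h" ]; then
        judy_inc=$dir
        break
    fi
done

if [ -n "$judy_inc" ]; then
    echo "Judy.h found in $judy_inc"
else
    echo "Judy.h not found in the checked directories; set JUDY_INC and JUDY_LIB in Makefile.config"
fi
```

If Judy.h is reported as missing, install libjudy or point JUDY_INC and JUDY_LIB at its location as described in step 3.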
== Quick start ==
{{{
onion -s input.vert > deduplicated_documents.vert
}}}
There's also an [Onion/UsageExample usage example] on a sample input.
For usage information see:
{{{
onion -h
man onion
}}}
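To try the tool without a real corpus, a tiny input can be built by hand. The sketch below assumes the usual vertical layout (one token per line, with the default doc and p tags as structural markup) and writes two documents that contain the same paragraph. The file names sample.vert and sample.dedup.vert are made up for this illustration, and onion is only run if it is found on the PATH.

```shell
#!/bin/sh
# One token per line; the same ten-token paragraph is used in both documents.
para='The
quick
brown
fox
jumps
over
the
lazy
dog
.'

# Wrap the paragraph in <doc> and <p> tags, twice, to create a duplicate.
{
    printf '<doc>\n<p>\n%s\n</p>\n</doc>\n' "$para"
    printf '<doc>\n<p>\n%s\n</p>\n</doc>\n' "$para"
} > sample.vert

# Run onion only if it is installed; -s strips duplicate parts instead of marking them.
if command -v onion >/dev/null 2>&1; then
    onion -s sample.vert > sample.dedup.vert
fi
```

Comparing sample.vert with sample.dedup.vert (for example with diff) shows which parts onion treated as duplicates.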
== Usage ==
{{{onion [OPTIONS] [FILE]}}}
Mark duplicate text parts in the input vertical file.
{{{
-f FILE   hashes of duplicate n-grams
-n NUM    n-gram length (default: 5)
-t NUM    duplicate content threshold (default: 0.5)
-d STR    document tag (default: doc)
-p STR    paragraph tag (default: p)
-s        strip duplicate parts (rather than mark)
-m        no smoothing
-T NUM    trim n-gram hashes to NUM bits (default: 64)
-l NUM    max stub length (default: 20)
-b NUM    buffer size, in bytes (default: 16777216)
-q        quiet; suppress all output except for errors
-V        print version information and exit
-h        display this help and exit
}}}
With no FILE, or when FILE is -, read standard input. Output is written to standard output.
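Because onion reads standard input and writes standard output, it fits into a shell pipeline. The following sketch combines several of the options listed above; the helper name dedup_stream, the option values, and the file corpus.vert.gz are all hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# dedup_stream: hypothetical helper, not part of onion.
# Reads a vertical file on stdin and writes the stripped version to stdout.
dedup_stream() {
    # -s      strip duplicate parts instead of marking them
    # -n 7    use 7-grams instead of the default 5
    # -t 0.8  duplicate content threshold of 0.8 instead of the default 0.5
    # -       read from standard input
    onion -s -n 7 -t 0.8 -
}

# Run the pipeline only if onion and the (hypothetical) input file are present.
if command -v onion >/dev/null 2>&1 && [ -f corpus.vert.gz ]; then
    gzip -dc corpus.vert.gz | dedup_stream | gzip > corpus.dedup.vert.gz
fi
```

Keeping the options in one function makes it easy to apply the same settings to several input files.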
== Acknowledgements ==
This software has been developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of [http://www.muni.cz/ Masaryk University in Brno] with financial support from [http://presemt.eu PRESEMT] and [http://www.sketchengine.co.uk Lexical Computing Ltd.] It also relates to Jan Pomikálek's [http://is.muni.cz/th/45523/fi_d/phdthesis.pdf PhD research].
== Licence ==
Onion is licensed under the [http://opensource.org/licenses/BSD-3-Clause BSD 3-Clause License].