Onion
onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts.
Installation
Prerequisites
- 64-bit CPU architecture
- Google sparsehash library (https://github.com/sparsehash/sparsehash)
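On Debian-based systems, the sparsehash headers are typically available as the libsparsehash-dev package:
sudo apt-get install libsparsehash-dev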
Configuration and installation
- Download the sources:
wget -O onion-1.4.tar.gz 'http://corpus.tools/raw-attachment/wiki/Downloads/onion-1.4.tar.gz'
- Extract the downloaded file:
tar xzvf onion-1.4.tar.gz
- Configure the package by editing onion-1.4/Makefile.config:
- set PREFIX (or INSTALL_BIN and INSTALL_DATA) according to where you want the executables and data (docs) installed
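For example, to install everything under /usr/local, the relevant line might read (the path is illustrative):
PREFIX = /usr/local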
- Install the package (you may need sudo or a root shell for the last command):
cd onion-1.4/
make
make install
Quick start
onion -s <documents.vert >deduplicated_documents.vert
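The input must be in vertical format: one token per line, with documents and paragraphs delimited by structural tags (by default doc and p; see the options below). A minimal sketch of such an input, with an illustrative document attribute:
<doc id="1">
<p>
Hello
world
</p>
</doc>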
There's also a usage example on a sample input.
For usage information see:
onion -h
man onion
Usage
onion [OPTIONS] [FILE]
Mark duplicate text parts in the input vertical file.
-f FILE   hashes of duplicate n-grams
-n NUM    n-gram length (default: 5)
-t NUM    duplicate content threshold (default: 0.5)
-d STR    document tag (default: doc)
-p STR    paragraph tag (default: p)
-s        strip duplicate parts (rather than mark)
-m        no smoothing
-T NUM    trim n-gram hashes to NUM bits (default: 64)
-l NUM    max stub length (default: 20)
-b NUM    buffer size, in bytes (default: 16777216)
-q        quiet; suppress all output except for errors
-V        print version information and exit
-h        display this help and exit
With no FILE, or when FILE is -, read standard input. Output is written to standard output.
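For example, to mark (rather than strip) duplicate parts using 7-grams and a higher duplicate content threshold, reading from a file instead of standard input (file names are illustrative):
onion -n 7 -t 0.8 documents.vert >marked_documents.vert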
Acknowledgements
This software has been developed at the Natural Language Processing Centre of Masaryk University in Brno with financial support from PRESEMT and Lexical Computing Ltd. It also relates to Jan Pomikálek's PhD research.
Licence
Onion is licensed under the BSD 3-Clause License.