Onion ¶
onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts.
Licence ¶
Onion is licensed under the BSD 3-Clause License.
Installation ¶
Prerequisites ¶
- 64-bit CPU architecture
- libjudy (>=1.0.5)
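The two prerequisites can be checked before building with a short script. This is only a sketch: the helper names `check_arch` and `find_judy_header` and the list of candidate include directories are assumptions made for illustration, and the script does not verify the libjudy version.

```shell
#!/bin/sh
# Sketch: check onion's stated prerequisites before building.

# Report whether the system's long type is 64 bits wide.
check_arch() {
    bits=$(getconf LONG_BIT 2>/dev/null || echo unknown)
    if [ "$bits" = "64" ]; then
        echo "arch: ok (64-bit)"
    else
        echo "arch: not confirmed as 64-bit (LONG_BIT=$bits)"
    fi
}

# Print the first directory among the arguments that contains Judy.h.
# Returns non-zero if none of them does.
find_judy_header() {
    for dir in "$@"; do
        if [ -f "$dir/Judy.h" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

check_arch

# The candidate directories below are assumptions; adjust them to your system.
if inc=$(find_judy_header /usr/include /usr/local/include /opt/local/include); then
    echo "Judy.h found in $inc"
else
    echo "Judy.h not found in the default locations; JUDY_INC may need to be set"
fi
```

If the header turns up outside the compiler's default search path, that directory is the value to use for JUDY_INC in Makefile.config.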
Configuration and installation ¶
- Download the sources:
wget -O onion-1.2.tar.gz 'https://corpus.tools/attachment/wiki/Downloads/onion-1.2.tar.gz'
- Extract the downloaded file:
tar xzvf onion-1.2.tar.gz
- Configure the package by editing onion-1.2/Makefile.config:
  - set PREFIX (or INSTALL_BIN and INSTALL_DATA) according to where you want the executables and data (docs) installed
  - if you have libjudy installed in a non-standard path you need to:
    - set JUDY_INC to where Judy.h is located
    - set JUDY_LIB to where libJudy.a is located
- Install the package (you may need sudo or a root shell for the last command):
cd onion-1.2/
make
make install
Quick start ¶
onion -s <documents.vert >deduplicated_documents.vert
There's also a usage example on a sample input.
For usage information see:
onion -h
man onion
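To try the quick-start command on something small, the following sketch writes a tiny vertical file (one token per line) containing the same paragraph twice and runs onion on it. The `<doc>` and `<p>` markup, the file names and the helper `write_sample` are assumptions made for illustration; check `man onion` for the input format your version expects.

```shell
#!/bin/sh
# Sketch: build a tiny vertical file with a repeated paragraph and run onion on it.
# The <doc>/<p> structure is an assumption about the expected input.

sample=sample.vert

# Write one document holding two identical paragraphs, one token per line.
write_sample() {
    {
        echo '<doc id="1">'
        for copy in 1 2; do
            echo '<p>'
            printf '%s\n' The quick brown fox jumps over the lazy dog .
            echo '</p>'
        done
        echo '</doc>'
    } > "$1"
}

write_sample "$sample"

if command -v onion >/dev/null 2>&1; then
    # Same invocation as in the quick start, applied to the sample file.
    onion -s < "$sample" > deduplicated_sample.vert
    # Compare line counts of input and output.
    wc -l "$sample" deduplicated_sample.vert
else
    echo "onion is not on PATH; install it first" >&2
fi
```

Whether a paragraph this short is actually treated as a duplicate depends on onion's settings, which `onion -h` lists.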
Acknowledgements ¶
This software has been developed at the Natural Language Processing Centre of Masaryk University in Brno with financial support from PRESEMT and Lexical Computing Ltd. It also relates to Jan Pomikálek's PhD research.