Version 2 (modified by admin, 10 years ago)

Onion

onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts.

Installation

Prerequisites

  • 64-bit CPU architecture
  • libjudy (>=1.0.5)
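
Before building, it can help to confirm both prerequisites. The following is a rough pre-flight sketch, not part of the onion distribution: the function names are made up for illustration, the two header directories are only common defaults, and the script does not verify the libjudy version.

```shell
#!/bin/sh
# Rough pre-flight check for the two prerequisites listed above.
# Function names and search directories are illustrative assumptions.

# Report whether the environment is 64-bit; getconf LONG_BIT prints 32 or 64.
check_64bit() {
    bits=$(getconf LONG_BIT 2>/dev/null || echo unknown)
    if [ "$bits" = "64" ]; then
        echo "ok: 64-bit environment"
    else
        echo "warning: expected 64-bit, got: $bits"
    fi
}

# Print the first of the given directories that contains Judy.h.
# Returns non-zero if none of them does.
find_judy_header() {
    for dir in "$@"; do
        if [ -f "$dir/Judy.h" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

check_64bit
if inc=$(find_judy_header /usr/include /usr/local/include); then
    echo "ok: Judy.h found in $inc"
else
    echo "warning: Judy.h not found; install libjudy or set JUDY_INC"
fi
```

If the header lives elsewhere, pass that directory to `find_judy_header` and use the printed path as the value of JUDY_INC during configuration.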

Configuration and installation

  1. Download the sources:
    wget -O onion-1.2.tar.gz 'https://docs.google.com/uc?authuser=0&id=0B4SxKw5O_gLHUXZhOHBzUDNwcXM&export=download'
    
  2. Extract the downloaded file:
    tar xzvf onion-1.2.tar.gz
    
  3. Configure the package by editing onion-1.2/Makefile.config:
    • set PREFIX (or INSTALL_BIN and INSTALL_DATA) according to where you want the executables and data (docs) installed
    • if you have libjudy installed in a non-standard path, you need to:
      • set JUDY_INC to where Judy.h is located
      • set JUDY_LIB to where libJudy.a is located
  4. Install the package (you may need sudo or a root shell for the last command):
    cd onion-1.2/
    make
    make install
    

Quick start

onion -s <documents.vert >deduplicated_documents.vert
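
To try the command above without a real corpus, a small input can be generated by hand. This is a minimal sketch under stated assumptions: the input is taken to be vertical text (one token per line) with documents wrapped in `<doc>` and paragraphs in `<p>`, and the file names and the `write_paragraph` helper are hypothetical. Check `man onion` for the structure your build actually expects.

```shell
#!/bin/sh
# Build a tiny sample in vertical format and run onion on it if available.
# Assumption: <doc> marks documents and <p> marks paragraphs.
workdir=$(mktemp -d)
sample="$workdir/sample.vert"        # hypothetical file names
output="$workdir/deduplicated.vert"

# Emit each argument as one token line inside a <p> element.
write_paragraph() {
    echo "<p>"
    for token in "$@"; do
        echo "$token"
    done
    echo "</p>"
}

{
    echo '<doc id="1">'
    write_paragraph The quick brown fox jumps over the lazy dog .
    echo "</doc>"
    echo '<doc id="2">'
    # Same paragraph again, so the input contains a repeated part.
    write_paragraph The quick brown fox jumps over the lazy dog .
    write_paragraph This paragraph appears only once .
    echo "</doc>"
} > "$sample"

if command -v onion >/dev/null 2>&1; then
    onion -s <"$sample" >"$output"
    # Compare line counts of the input and the deduplicated output.
    wc -l "$sample" "$output"
else
    echo "onion not found on PATH; wrote $sample only"
fi
```

Whether the repeated paragraph is actually dropped depends on onion's settings, so compare the two files rather than relying on a particular line count.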

There is also a usage example on a sample input.

For usage information see:

onion -h
man onion

Acknowledgements

This software was developed at the Natural Language Processing Centre of Masaryk University in Brno with financial support from PRESEMT and Lexical Computing Ltd. It also relates to Jan Pomikálek's PhD research.