wiki2corpus
- downloads articles from Wikipedia for a given language id (URL prefix)
- works with the Wikipedia API (HTML output) as it is not straightforward to turn MediaWiki syntax into plain text
- HTML files are converted into plain text using jusText
- texts are tokenized using unitok
- title, categories, translations, number of paragraphs and number of characters are put as attributes on the <doc> structure (see the sketch after this list)
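The core of the tool amounts to two steps: fetching the rendered HTML of an article through the MediaWiki API and stripping boilerplate with jusText. Below is a minimal sketch of those steps in Python; it is not the actual wiki2corpus code, and the function names, the use of the requests library, the chosen stoplist and the <doc> attribute names are all illustrative assumptions.

import requests
import justext

def fetch_article_html(langcode, title):
    # Ask the MediaWiki API for the rendered HTML of a single article.
    url = "https://%s.wikipedia.org/w/api.php" % langcode
    params = {"action": "parse", "page": title, "prop": "text", "format": "json"}
    return requests.get(url, params=params).json()["parse"]["text"]["*"]

def clean_paragraphs(html, stoplist_name="Belarusian"):
    # jusText keeps only the paragraphs that do not look like boilerplate.
    paragraphs = justext.justext(html, justext.get_stoplist(stoplist_name))
    return [p.text for p in paragraphs if not p.is_boilerplate]

if __name__ == "__main__":
    paras = clean_paragraphs(fetch_article_html("be", "Мінск"))
    print('<doc title="Мінск" paragraphs="%d" chars="%d">' % (len(paras), sum(len(p) for p in paras)))
    print("\n".join(paras))
    print("</doc>")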
Requirements
Get wiki2corpus
See Downloads for the latest version.
Usage
usage: wikidownloader.py [-h] [--cache CACHE] [--wait WAIT] [--newest] [--links LINKS] langcode

Wikipedia downloader

positional arguments:
  langcode       Wikipedia language prefix

optional arguments:
  -h, --help     show this help message and exit
  --cache CACHE  Directory with previously downloaded pages and data
  --wait WAIT    Time interval between GET requests
  --newest       Download the newest versions of articles (do not use cache)
  --links LINKS  Gather external links from Wikipedia (Reference section)
Example
Let us say you want to download the Belarusian Wikipedia. Its ISO 639-1 language code is "be", so the articles will be downloaded from be.wikipedia.org. You can use this command:
python wikidownloader.py be --wait 7 --links bewiki.links > bewiki.prevert
The bewiki.prevert file can be used to feed a pipeline for further processing of the data. It is in a simple XML-like format built around the <doc> element.
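Concretely, each article in bewiki.prevert is wrapped in a <doc ...> ... </doc> block whose opening tag carries the attributes listed above (title, categories, and so on), with the cleaned text in between. The snippet below is an illustrative reader for such a file; the assumption that each line between the tags is one unit of text is mine, not a specification of the format.

import re

def iter_docs(path):
    # Yield (header_line, body_lines) for every <doc> ... </doc> block.
    header, body = None, []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith("<doc"):
                header, body = line, []
            elif line.startswith("</doc"):
                yield header, body
                header = None
            elif header is not None:
                body.append(line)

for header, body in iter_docs("bewiki.prevert"):
    match = re.search(r'title="([^"]*)"', header)
    print(match.group(1) if match else "(no title)", len(body), "lines")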
Licence
wiki2corpus is licensed under the MIT licence.