
=== Changelog 0.77 → 0.81 ===
* Improvements proposed by Vlado Benko or Nikola Ljubešić:
 - escape square brackets and backslashes in URLs
 - doc attributes: timestamp with hours, IP address, meta/chared encoding
 - doc ID added to the arc output
 - MAX_DOCS_CLEANED limit per domain
 - create the temporary directory if needed
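The changelog does not show how the URL escaping is done; a minimal sketch of percent-encoding square brackets and backslashes might look like the following (the function name and the choice of percent-encoding are assumptions, not the crawler's actual code):

```python
def escape_url(url):
    # Percent-encode backslashes and square brackets so the URL is safe
    # in contexts where these characters are special (e.g. regexes or
    # bracket-delimited fields). Backslash must be handled first so the
    # escapes themselves are not re-escaped.
    return (url.replace('\\', '%5C')
               .replace('[', '%5B')
               .replace(']', '%5D'))

print(escape_url('http://example.com/a[1]\\b'))
# -> http://example.com/a%5B1%5D%5Cb
```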
* Support for processing doc/docx/ps/pdf (not working well yet; URLs of documents are saved to a separate file for manual download and processing)
* Crawling multiple languages (inspired by Nikola's contribution; not tested yet)
* Stop crawling by sending SIGTERM to the main process
* Domain distances (distance of web domains from the seed web domains; will be used in scheduling in the future)
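One natural way to compute such distances is a breadth-first search over the domain link graph, measuring hops from the nearest seed domain; this sketch assumes a simple adjacency-dict representation and is not the crawler's actual implementation:

```python
from collections import deque

def domain_distances(seed_domains, links):
    # links: dict mapping a domain to the set of domains it links to.
    # Returns the hop distance of every reachable domain from the
    # nearest seed domain (seeds themselves have distance 0).
    dist = {d: 0 for d in seed_domains}
    queue = deque(seed_domains)
    while queue:
        d = queue.popleft()
        for nxt in links.get(d, ()):
            if nxt not in dist:
                dist[nxt] = dist[d] + 1
                queue.append(nxt)
    return dist

print(domain_distances(['seed.org'],
                       {'seed.org': {'a.com'}, 'a.com': {'b.net'}}))
# -> {'seed.org': 0, 'a.com': 1, 'b.net': 2}
```

A scheduler could then prefer URLs from low-distance domains, on the assumption that domains close to the seeds are more likely to be relevant.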
* Config values tweaked
 - MAX_URL_QUEUE and MAX_URL_SELECT greatly increased
 - better spread of domains in the crawling queue => faster crawling
* Python --> !PyPy
 - scheduler and crawler processes are dynamically compiled by pypy
 - saves approx. 1/4 of RAM
 - the CPU efficiency gain is less visible (time is spent waiting for host/IP timeouts and for doc processors)
 - process.py requires lxml, which does not work with pypy (will be replaced by lxml-cffi in the future)
* Readme updated (more information, known bugs)
* Bug fixes