16 | | <a href="/wiki/Unitok"> |
17 | | Unitok is a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (“vertical” format), while preserving XML-like tags containing metadata. |
| 22 | <p><a href="/wiki/Unitok"> |
| 23 | Unitok is a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (“vertical” format), while preserving XML-like tags containing metadata.</a></p> |
| 24 | <p> |
| 25 | <a class="lnk" href="http://nlp.fi.muni.cz/raslan/raslan14.pdf#page=79">Paper</a> |
| 26 | | |
| 27 | <a class="lnk" href="">Cite</a> |
| 28 | | |
| 29 | <a class="lnk" href="https://www.mozilla.org/MPL/2.0/">Licence</a> |
| 30 | </p> |
| 31 | </td> |
22 | | <a href="/wiki/Justext"> |
23 | | JusText is an HTML boilerplate removal tool. It can strip navigation links, headers, footers, etc. from HTML pages and leave just regular text containing full sentences. |
| 36 | <p><a href="/wiki/Justext"> |
| 37 | JusText is an HTML boilerplate removal tool. It can strip navigation links, headers, footers, etc. from HTML pages and leave just regular text containing full sentences.</a></p> |
| 38 | <p> |
| 39 | <a class="lnk" href="http://is.muni.cz/th/45523/fi_d/phdthesis.pdf">Paper</a> |
| 40 | | |
| 41 | <a class="lnk" href="">Cite</a> |
| 42 | | |
| 43 | <a class="lnk" href="http://opensource.org/licenses/BSD-3-Clause">Licence</a> |
| 44 | </p> |
| 45 | </td> |