Changes between Version 12 and Version 13 of WikiStart
- Timestamp: 02/22/15 14:10:10
WikiStart
--- v12
+++ v13
@@ -3,21 +3,20 @@
 <table style="color:white; border-spacing: 1em"><tr>
 
-<td style="background-color:#008000 ; background-image:url('/chrome/site/onion_nb.png') ; background-repeat:no-repeat ; background-position: 6px 6px; width:50% ; border-radius:20px ; padding:1em 1em 1em 60px ">
+<td style="background-color:#008000 ; background-image:url('/chrome/site/onion_nb.png') ; background-repeat:no-repeat ; background-position: 6px 6px; width:50% ; border-radius:20px ; padding:1em 1em 1em 60px; vertical-align:top">
 <a href="/wiki/Onion" style="color:white; border:none; font-size:120%">
-onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts
-</a>
-</td>
+Onion (ONe Instance ONly) is a de-duplicator for large collections of texts. It can measure the similarity of paragraphs or whole documents and drop duplicate ones based on the threshold you set.
 
-<td style="background-color:#800000 ; background-image:url('/chrome/site/unitok_nb.png') ; background-repeat:no-repeat ; background-position: 6px 6px; width:50% ; border-radius:20px ; padding:1em 1em 1em 60px">
-unitok is a universal text tokeniser
-</td>
+</a></td>
 
-</tr><tr>
+<td style="background-color:#800000 ; background-image:url('/chrome/site/unitok_nb.png') ; background-repeat:no-repeat ; background-position: 6px 6px; width:50% ; border-radius:20px ; padding:1em 1em 1em 60px; vertical-align:top">
+<a href="/wiki/Unitok" style="color:white; border:none; font-size:120%">
+Unitok is a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (“vertical” format), while preserving XML-like tags containing metadata.
 
-<td style="background-color:#0080ff ; background-image:url('/chrome/site/justext_nb.png') ; background-repeat:no-repeat ; background-position: 6px 6px; width:50% ; border-radius:20px ; padding:1em 1em 1em 60px">
-justext is a tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages. It is designed to preserve mainly text containing full sentences and it is therefore well suited for creating linguistic resources such as Web corpora
-</td>
-<td></td>
-</tr>
-</table>
+</a></td></tr><tr>
+
+<td style="background-color:#0080ff ; background-image:url('/chrome/site/justext_nb.png') ; background-repeat:no-repeat ; background-position: 6px 6px; width:50% ; border-radius:20px ; padding:1em 1em 1em 60px; vertical-align:top">
+<a href="/wiki/Justext" style="color:white; border:none; font-size:120%">
+JusText is a HTML boilerplate removal tool. It can strip navigation links, headers, footers, etc. from HTML pages and leave just regular text containing full sentences.
+
+</a></td><td></td></tr></table>
 }}}