MODNLP/TC: an API and tools for text categorisation
modnlp/tc is an API and a set of tools for text categorisation, including functionality for XML parsing, term set reduction (and basic keyword extraction), probabilistic classifier induction, two sample classification tools, and evaluation modules. The software is distributed under the GNU General Public License and is fully compatible with GNU Classpath. It has been tested on a number of JVMs, including kaffe (v1.1.5), sablevm (v1.1.6), jamvm (v1.3) and JDK 1.4+.
The functionality supported by the API includes:
- general mechanisms for parsing and document storage (currently implemented modnlp.tc.parser.Parsers include a parser for my XML version of David Lewis’ Reuters-21578 collection, and another for Ion Androutsopoulos’ Lingspam corpus for spam filtering),
- a feature selection module implementing several term set reduction (filtering) metrics, and a sample utility that illustrates its use,
- a basic probabilistic classifier induction program, and
- modules for classification and evaluation.
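
To give a rough idea of what a term set reduction (filtering) metric does, here is a minimal, self-contained sketch that scores terms by information gain for a single category and keeps the k highest-scoring ones. It does not use the modnlp.tc API: the class TermSetReductionSketch, its methods, and the toy counts are hypothetical and exist only for illustration, and information gain is assumed here as one typical filtering metric rather than taken from the library's documentation. The sketch also assumes Java 8 or later for brevity, whereas the library itself targets JDK 1.4+.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not part of the modnlp.tc API. */
public class TermSetReductionSketch {

    /** One summand of the score: p(x,y) * log2(p(x,y) / (p(x) * p(y))), with 0 * log 0 = 0. */
    private static double summand(double joint, double px, double py) {
        if (joint == 0.0) {
            return 0.0;
        }
        return joint * (Math.log(joint / (px * py)) / Math.log(2.0));
    }

    /**
     * Information gain of a term for one category, from document counts:
     * tc   = documents in the category that contain the term,
     * tnc  = documents outside the category that contain the term,
     * ntc  = documents in the category that lack the term,
     * ntnc = documents outside the category that lack the term.
     */
    public static double informationGain(int tc, int tnc, int ntc, int ntnc) {
        double n = tc + tnc + ntc + ntnc;
        if (n == 0.0) {
            return 0.0;
        }
        double pT = (tc + tnc) / n; // probability that a document contains the term
        double pC = (tc + ntc) / n; // probability that a document is in the category
        return summand(tc / n, pT, pC)
             + summand(tnc / n, pT, 1.0 - pC)
             + summand(ntc / n, 1.0 - pT, pC)
             + summand(ntnc / n, 1.0 - pT, 1.0 - pC);
    }

    /** Returns the (at most) k terms with the highest information gain, best first. */
    public static List<String> selectTop(Map<String, int[]> tables, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, int[]> e : tables.entrySet()) {
            int[] t = e.getValue();
            scores.put(e.getKey(), informationGain(t[0], t[1], t[2], t[3]));
        }
        List<String> terms = new ArrayList<>(tables.keySet());
        // Sort in descending order of score.
        terms.sort((x, y) -> Double.compare(scores.get(y), scores.get(x)));
        int limit = Math.max(0, Math.min(k, terms.size()));
        return new ArrayList<>(terms.subList(0, limit));
    }

    public static void main(String[] args) {
        // Made-up counts for a 10-document toy corpus (5 spam, 5 legitimate);
        // each array is {tc, tnc, ntc, ntnc} for the category "spam".
        Map<String, int[]> tables = new LinkedHashMap<>();
        tables.put("the", new int[] {5, 5, 0, 0});
        tables.put("viagra", new int[] {4, 0, 1, 5});
        tables.put("meeting", new int[] {0, 3, 5, 2});
        for (String term : selectTop(tables, 2)) {
            int[] t = tables.get(term);
            System.out.println(term + "\t" + informationGain(t[0], t[1], t[2], t[3]));
        }
    }
}
```

A term that occurs in every document (like "the" in the toy counts) carries no information about the category and scores zero, which is why filtering of this kind tends to discard it first.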
See also the developer’s web page at Sourceforge.net for the Git repository.