MODNLP-IDX: A corpus indexer
modnlp-idx is an API, library and set of tools for (inverted) indexing, storage and retrieval of large amounts of text, with XML-based handling of metadata. The library has been used by the European Parliamentary Comparable and Parallel Corpora (ECPC) project and by the Translational English Corpus (TEC) project.
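For readers unfamiliar with the term, an inverted index maps each word to the places where it occurs, so lookups do not need to scan the full text. The Python sketch below illustrates the general idea only. It is not the modnlp-idx API, and the function names (tokenize, build_index, lookup) and the two sample documents are hypothetical.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each token to {doc_id: [token positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, token in enumerate(tokenize(text)):
            index[token].setdefault(doc_id, []).append(position)
    return index


def lookup(index, word):
    """Return the postings for a word, or an empty dict if it never occurs."""
    return index.get(word.lower(), {})


# Hypothetical two-document corpus, for illustration only.
documents = {
    "doc1": "The committee adopted the report.",
    "doc2": "The report was translated into English.",
}
index = build_index(documents)
print(lookup(index, "report"))  # {'doc1': [4], 'doc2': [1]}
```

Storing token positions alongside document identifiers is what makes concordance-style retrieval possible, since each hit can be traced back to its surrounding context. A real indexer for large corpora would keep these postings in a persistent store rather than in memory.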
A beta version is currently available. It uses the Java version of Sleepycat’s Berkeley DB as its back-end. This beta distribution can be downloaded from the project’s download area. The files ending in bin-tec.tar.gz, bin-ep.tar.gz and bin-hc.tar.gz contain property files configured for indexing the TEC corpus, the European Parliament Corpus and the House of Commons Corpus, respectively. The latter two corpora are being collected by the ECPC project.
Instructions on compiling and running the indexer can be found in the README files contained in the distribution’s archives.
Development snapshots can be downloaded through the Developer’s web page at SourceForge.net.