The corpus analysis software under development for this project is available as a Java WebStart tool. This software connects a modnlp-based concordance browser to the most recent update of the Genealogies corpus. You can start the tool by clicking the launch button on this page.
If you have not already done so, we recommend that you install the latest version of Java.
Additionally, the software code and plugins are available for download at: https://sourceforge.net/projects/modnlp/
Note: The first time that you download and attempt to launch the software, you may receive a system security warning. Instructions on how to resolve this are provided below:
- Apple users: Go to ‘System Preferences’, click on ‘Security & Privacy’, then click ‘Open anyway’.
- Windows users: Go to the ‘Start’ menu and search for ‘Configure Java’. In the panel that opens, choose the ‘Security’ tab and click ‘Edit Site List’. Click ‘Add’ and add the following web address to the list: http://genealogiesofknowledge.net/. Click ‘OK’ and you should now be able to access the software.
Because modnlp and these plugins are still under development and are frequently updated, we recommend that you regularly delete all existing versions of the software from your workstation and re-download the browser from this page. This ensures that you continue to work with the latest available version of the software as the project evolves.
Should you encounter any software bugs or other technical problems when using these tools, please create a ticket detailing the nature of the issue on our SourceForge project page: https://sourceforge.net/p/modnlp/tickets/
MODNLP: Modular Suite of NLP Tools
modnlp aims to provide a modular architecture and tools for natural language processing written (mainly) in Java. These tools are being developed in connection with the Genealogies of Knowledge project.
The following modnlp modules are currently available:
- idx: an API and tools for (inverted) indexing, storage and retrieval of large amounts of text, with (XML-based) handling of meta-data.
- tc: an API and tools for text categorisation, including functionality for XML parsing, term set reduction (and basic keyword extraction), probabilistic classifier induction, two sample classification tools, and evaluation modules.
- tec-tools (v2): a pair of tools consisting of tec-server, a corpus indexer and server for corpus access and analysis over the web, and tec-client, a corpus analysis client. Unlike the (now obsolete) version 1 of these tools, which was originally developed for the TEC project and written in Perl, C (server side) and Java, the version on this site (v2) is written entirely in Java.
This new version of the tools forms the basis of software support for text analysis and visualisation in the Genealogies of Knowledge project.
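To illustrate the kind of (inverted) indexing that the idx module is described as providing, here is a minimal sketch in Java. It is not the modnlp API: the class name `TinyInvertedIndex` and its methods `add` and `lookup` are hypothetical, and the tokenisation (lower-casing and splitting on anything that is not a letter or digit) is a simplifying assumption. The sketch maps each term to the documents and token positions where it occurs, which is the information a concordance browser needs in order to locate a keyword in context.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of an inverted index; not part of modnlp. */
public class TinyInvertedIndex {

    // term -> (document id -> token positions of the term in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Tokenises the text (assumed scheme) and records each token's position. */
    public void add(int docId, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        int position = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split can yield a leading empty string
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position);
            position++;
        }
    }

    /** Returns document id -> positions for the term, or an empty map. */
    public Map<Integer, List<Integer>> lookup(String term) {
        Map<Integer, List<Integer>> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.emptyMap() : Collections.unmodifiableMap(hits);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Knowledge is power");
        index.add(2, "The power of knowledge, and the limits of power");
        System.out.println(index.lookup("power")); // prints {1=[2], 2=[1, 8]}
    }
}
```

Storing token positions rather than only document ids is what lets a client retrieve the words on either side of each hit. A production index would also need persistent storage and metadata handling, which this sketch leaves out.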
The modnlp/tec tools have also been used by the European Parliamentary Comparable and Parallel Corpora project (ECPC), coordinated by Dr. Calzada Pérez (Universitat Jaume I, Spain), and by the Translational English Corpus. The Translational English Corpus has been collected and maintained under Prof. Mona Baker’s supervision at the University of Manchester, and is made available on the Internet through the Genealogies of Knowledge project website, in a collaboration between The University of Edinburgh and The University of Manchester.
Also available is the documentation of the modnlp suite (for developers).