The corpus analysis software under development for this project is available as a Java WebStart tool. This software connects a modnlp-based concordance browser to the most recent update of the Genealogies corpus. You can start the tool by clicking on this button:
If you have not already done so, we recommend that you install the latest version of Java.
Additionally, the software and plugins are available for download at: https://sourceforge.net/projects/modnlp/
Note: The first time that you download and attempt to launch the software, you may receive a system security warning. Instructions on how to resolve this are provided below:
- Apple users: Go to ‘System Preferences’, click on ‘Security & Privacy’, then click ‘Open anyway’.
- Windows users: Go to the ‘Start’ menu and search for ‘Configure Java’. Having opened this panel, choose the ‘Security’ tab and click ‘Edit Site List’. Click ‘Add’ and add the following web address to the list: http://genealogiesofknowledge.net/. Click ‘OK’ and you should now be able to access the software.
Because modnlp and these plugins are still under development and are frequently updated, we recommend that users regularly delete all existing versions of the software from their workstation and re-download the browser from this page. This ensures that you continue to work with the latest available version of the software as the project evolves.
Using the Corpus
Guidance on how to use the corpus can be accessed here:
Information about the content of our corpora can be accessed here:
Extra functionality is included through the plugin interface. Experimental plugins such as Concordance Mosaic and Concordance Tree enable exploratory analysis of the corpus through visual interfaces.
Concordance Mosaic currently displays concordance lists in a manner which emphasizes positional information such as frequency and collocation strength. This may aid in identification of important patterns within the concordance under analysis.
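As a rough illustration of the positional information mentioned above, the following sketch counts how often each word occurs at each position relative to a keyword. It is not the plugin's implementation: the class name PositionalCounts, the method count and the sample sentence are all hypothetical, and collocation strength (which would need corpus-wide frequencies) is left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: per-position word frequencies around a keyword.
public class PositionalCounts {

    /**
     * For every offset in [-window, window] except 0, counts how often each
     * (lower-cased) word occurs at that offset from an occurrence of keyword.
     */
    public static Map<Integer, Map<String, Integer>> count(
            List<String> tokens, String keyword, int window) {
        Map<Integer, Map<String, Integer>> byPosition = new TreeMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equalsIgnoreCase(keyword)) {
                continue; // not a keyword occurrence
            }
            for (int offset = -window; offset <= window; offset++) {
                int j = i + offset;
                if (offset == 0 || j < 0 || j >= tokens.size()) {
                    continue; // skip the keyword itself and out-of-range positions
                }
                byPosition.computeIfAbsent(offset, k -> new TreeMap<>())
                          .merge(tokens.get(j).toLowerCase(), 1, Integer::sum);
            }
        }
        return byPosition;
    }

    public static void main(String[] args) {
        // Made-up sample text, used only for illustration.
        List<String> tokens = Arrays.asList(
            "the state of nature and the state of war are not the same state".split(" "));
        count(tokens, "state", 2)
            .forEach((position, words) -> System.out.println(position + " " + words));
    }
}
```

Each map entry corresponds to one column of context around the keyword; a visual rendering could then scale each word by its count within that column.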
Concordance Tree displays the branching of a concordance’s single-sided context to help identify patterns within the concordance.
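The branching of a one-sided context can be pictured as a prefix tree over the words that follow the keyword. The sketch below builds such a tree for the right-hand context; it is an assumption about the general idea rather than the plugin's actual code, and the names ContextTree, Node, build and print are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a prefix tree of the words following a keyword.
public class ContextTree {

    static final class Node {
        int count; // number of concordance lines passing through this node
        final Map<String, Node> children = new TreeMap<>();
    }

    /** Builds a tree of the right-hand contexts of keyword, up to depth words deep. */
    public static Node build(List<String> tokens, String keyword, int depth) {
        Node root = new Node();
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equalsIgnoreCase(keyword)) {
                continue;
            }
            root.count++;
            Node node = root;
            // Walk the words to the right of the keyword, sharing common prefixes.
            for (int j = i + 1; j <= i + depth && j < tokens.size(); j++) {
                node = node.children.computeIfAbsent(
                    tokens.get(j).toLowerCase(), k -> new Node());
                node.count++;
            }
        }
        return root;
    }

    /** Prints the tree with two spaces of indentation per level. */
    public static void print(Node node, String label, String indent) {
        System.out.println(indent + label + " (" + node.count + ")");
        node.children.forEach((word, child) -> print(child, word, indent + "  "));
    }

    public static void main(String[] args) {
        // Made-up sample text, used only for illustration.
        List<String> tokens = Arrays.asList(
            "the state of nature and the state of war are not the same state".split(" "));
        print(build(tokens, "state", 2), "state", "");
    }
}
```

Contexts that begin with the same words share a path from the root, so frequent continuations show up as heavily counted branches.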
The design, structure and motivations for the TEC/ECPC tools are described in the following paper:
- S. Luz. Web-based corpus software. In A. Kruger, K. Wallmach, and J. Munday, editors, Corpus-based Translation Studies – Research and Applications, chapter 5, pages 124–149. Continuum, 2011.
The concordance mosaic is described in:
- S. Luz and S. Sheehan. A graph based abstraction of textual concordances and two renderings for their interactive visualisation. In Proceedings of the International Working Conference on Advanced Visual Interfaces, AVI ’14, pages 293–296, New York, NY, USA, 2014. ACM.
If you use modnlp or the Genealogies of Knowledge/TEC/ECPC tools in your research, please consider citing these papers.
Modnlp and these plugins are still under construction; please forward any issues to Ralph Brown (firstname.lastname@example.org).
MODNLP: Modular Suite of NLP Tools
modnlp aims to provide a modular architecture and tools for natural language processing written (mainly) in Java. These tools are being developed in connection with the Genealogies of Knowledge project.
The following modnlp modules are currently available:
- idx: an API and tools for (inverted) indexing, storage and retrieval of large amounts of text, with (XML-based) handling of meta-data.
- tc: an API and tools for text categorisation, including functionality for XML parsing, term set reduction (and basic keyword extraction), probabilistic classifier induction, two sample classification tools, and evaluation modules.
- tec-tools (v2): consisting of tec-server, a corpus indexer and server for corpus access and analysis over the web, and tec-client, a corpus analysis client. Unlike the now-obsolete version 1 of these tools, which was originally developed for the TEC project and written in Perl, C (server side) and Java, the version on this site (v2) is written entirely in Java.
This new version of the tools forms the basis of software support for text analysis and visualisation in the Genealogies of Knowledge project.
The modnlp/tec tools have also been used by the European Parliamentary Comparable and Parallel Corpora project (ECPC), coordinated by Dr. Calzada Pérez (Universitat Jaume I, Spain), and by the Translational English Corpus. The latter has been collected and maintained under Prof. Mona Baker’s supervision at the University of Manchester and is made available on the Internet through the Genealogies of Knowledge project website, in a collaboration between the University of Edinburgh and the University of Manchester.
Also available is the documentation of the modnlp suite (for developers).
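For readers unfamiliar with the inverted indexing that the idx module is described as providing, a minimal in-memory sketch follows. It only illustrates the general data structure (a map from each term to the documents and token positions where it occurs) and does not use or reproduce the modnlp API; the class TinyInvertedIndex and its methods are hypothetical, and the real module additionally handles storage and XML-based metadata.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of an in-memory positional inverted index.
public class TinyInvertedIndex {

    // term -> (document id -> token positions within that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Tokenises text on non-letter, non-digit characters and indexes every token. */
    public void add(int docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position);
            position++;
        }
    }

    /** Returns document id -> positions for term, or an empty map if it is absent. */
    public Map<Integer, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptyMap());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        // Made-up sample documents, used only for illustration.
        index.add(1, "Knowledge is power");
        index.add(2, "The power of knowledge");
        System.out.println(index.lookup("knowledge"));
    }
}
```

Storing token positions alongside document ids is what allows a concordancer to fetch the words around each hit without rescanning the full text.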
- Saturnino Luz
- Shane Sheehan
- Michael Davy (contributed to the TC module)
- Daniel Kelleher (contributed to the IDX module)
- Noel Skehan (contributed to an earlier version of the teccli/tecser modules)