Genealogies of Knowledge Corpus

The analysis tool under development for this project is available as a Java Web Start application, which you can start by clicking on this button:

Additionally, the software and plugins are available for download at:

This software connects a modnlp-based concordance browser to the most recent update of the Genealogies of Knowledge corpus.
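For readers unfamiliar with concordances, the following is a minimal sketch of a keyword-in-context (KWIC) listing in plain Java. It is an illustration only and does not use the modnlp API: the class name `KwicSketch`, the method `concordance` and the `window` parameter are hypothetical, and the tokenisation is a simple split on non-word characters that only handles ASCII text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; none of these names come from modnlp.
final class KwicSketch {

    /**
     * Returns one keyword-in-context line per occurrence of the keyword,
     * with up to `window` tokens of context on each side.
     */
    static List<String> concordance(String text, String keyword, int window) {
        // Naive tokenisation: lower-case, split on runs of non-word characters.
        String[] tokens = text.toLowerCase().split("\\W+");
        String target = keyword.toLowerCase();
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(target)) {
                continue;
            }
            int from = Math.max(0, i - window);
            int to = Math.min(tokens.length, i + window + 1);
            String left = String.join(" ", Arrays.copyOfRange(tokens, from, i));
            String right = String.join(" ", Arrays.copyOfRange(tokens, i + 1, to));
            lines.add(left + " [" + tokens[i] + "] " + right);
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "The corpus is large. The corpus is indexed.";
        for (String line : concordance(text, "corpus", 2)) {
            System.out.println(line);
        }
    }
}
```

Each output line shows one occurrence of the keyword in brackets, surrounded by its left and right context, which is the kind of listing a concordance browser presents.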

The corpus may be filtered in a number of ways using the inbuilt sub-corpus selection tool. Metadata for the corpus files is also made available through the tool.

Extra functionality is included through the plugin interface. Experimental plugins such as Concordance Mosaic and Concordance Tree enable exploratory analysis of the corpus through visual interfaces.

Concordance Mosaic currently displays concordance lists in a manner which emphasizes positional information such as frequency and collocation strength. This may aid in identification of important patterns within the concordance under analysis.
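As a rough illustration of the positional information involved, the sketch below counts how often each word occurs at each offset from the keyword. It is not the Concordance Mosaic implementation: the names `PositionalFrequencySketch` and `positionalCounts` are hypothetical, and only raw frequencies are computed. A collocation strength measure would additionally need corpus-wide word frequencies, which are left out here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not part of modnlp or Concordance Mosaic.
final class PositionalFrequencySketch {

    /**
     * For each offset from the keyword (-window..-1 and 1..window), counts
     * how often each word occurs at that offset across all occurrences.
     */
    static Map<Integer, Map<String, Integer>> positionalCounts(
            List<String> tokens, String keyword, int window) {
        Map<Integer, Map<String, Integer>> counts = new TreeMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(keyword)) {
                continue;
            }
            for (int offset = -window; offset <= window; offset++) {
                int j = i + offset;
                // Skip the keyword itself and positions outside the text.
                if (offset == 0 || j < 0 || j >= tokens.size()) {
                    continue;
                }
                counts.computeIfAbsent(offset, k -> new TreeMap<>())
                      .merge(tokens.get(j), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "the", "corpus", "is", "large", "the", "corpus", "was", "indexed");
        System.out.println(positionalCounts(tokens, "corpus", 1));
    }
}
```

Each offset corresponds to one column of the concordance, and the counts within a column are the kind of quantity a positional display can use to size or order words.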

Concordance Tree displays the branching of a concordance's single-sided context to help identify patterns within the concordance.
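The branching of a one-sided context can be pictured as a prefix tree over the words that follow (or precede) the keyword. The sketch below builds such a tree for right-hand contexts. It is an assumption about the general idea rather than the plugin's actual data structure, and the names `ContextTreeSketch`, `Node`, `rightContextTree` and `depth` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not the Concordance Tree plugin's code.
final class ContextTreeSketch {

    /** A node records how many concordance lines pass through it. */
    static final class Node {
        int count;
        final Map<String, Node> children = new TreeMap<>();
    }

    /** Builds a prefix tree from the `depth` words following each keyword occurrence. */
    static Node rightContextTree(List<String> tokens, String keyword, int depth) {
        Node root = new Node();
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(keyword)) {
                continue;
            }
            root.count++;
            Node node = root;
            // Walk down the tree, one level per context word.
            for (int j = i + 1; j <= i + depth && j < tokens.size(); j++) {
                node = node.children.computeIfAbsent(tokens.get(j), w -> new Node());
                node.count++;
            }
        }
        return root;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "the", "corpus", "is", "large", "the", "corpus", "is", "indexed");
        Node root = rightContextTree(tokens, "corpus", 2);
        System.out.println("occurrences: " + root.count);
        System.out.println("branches after 'is': " + root.children.get("is").children.keySet());
    }
}
```

Contexts that share their first words share a path in the tree, so frequent continuations show up as nodes with high counts and points of variation show up as branches.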

The design, structure and motivations for the TEC/ECPC tools are described in the following paper:

S. Luz. ‘Web-based corpus software’. In A. Kruger, K. Wallmach, and J. Munday, editors, Corpus-based Translation Studies – Research and Applications, chapter 5, pages 124-149. Continuum, 2011. [bib | .pdf ]

The concordance mosaic is described in:

S. Luz and S. Sheehan. A graph based abstraction of textual concordances and two renderings for their interactive visualisation. In Proceedings of the International Working Conference on Advanced Visual Interfaces, AVI ’14, pages 293–296, New York, NY, USA, 2014. ACM. [bib|.pdf]

If you use modnlp or the Genealogies of Knowledge/TEC/ECPC tools in your research, please consider citing these papers.

modnlp and these plugins are still under construction; please forward any issues to Ralph Brown (

MODNLP: Modular Suite of NLP Tools

modnlp aims to provide a modular architecture and tools for natural language processing written (mainly) in Java. These tools are being developed in connection with the Genealogies of Knowledge project.

The following modnlp modules are currently available:

  • idx: an API and tools for (inverted) indexing, storage and retrieval of large amounts of text, with (XML-based) handling of meta-data.
  • tc: an API and tools for text categorisation, including functionality for XML parsing, term set reduction (and basic keyword extraction), probabilistic classifier induction, two sample classification tools, and evaluation modules.
  • tec-tools (v2): consists of tec-server, a corpus indexer and server for corpus access and analysis over the web, and tec-client, a corpus analysis client. Unlike the now obsolete version 1 of these tools, which was originally developed for the TEC project and written in Perl, C (server side) and Java, the version on this site (v2) is written entirely in Java.
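To illustrate the kind of inverted indexing the idx module is concerned with, here is a minimal in-memory positional index in Java. This is a sketch under simplifying assumptions and does not reflect the idx API or its storage format: the names `InvertedIndexSketch`, `add` and `lookup` are hypothetical, everything is held in memory, and tokenisation is a naive split on non-word characters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not the modnlp idx API.
final class InvertedIndexSketch {

    // word -> (document id -> token positions of that word in the document)
    private final Map<String, Map<String, List<Integer>>> postings = new TreeMap<>();

    /** Tokenises the text and records the position of every word for this document. */
    void add(String docId, String text) {
        String[] tokens = text.toLowerCase().split("\\W+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) {
                continue; // split can yield an empty first token
            }
            postings.computeIfAbsent(tokens[pos], w -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Returns the documents containing the word, with its positions in each. */
    Map<String, List<Integer>> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(),
                Collections.<String, List<Integer>>emptyMap());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("a.xml", "The corpus is large");
        index.add("b.xml", "A large corpus");
        System.out.println(index.lookup("corpus"));
    }
}
```

Storing positions rather than just document identifiers is what lets a concordancer jump straight to each occurrence of a word and retrieve its surrounding context without scanning whole files.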

This new version of the tools forms the basis of software support for text analysis and visualisation in the Genealogies of Knowledge project.

The modnlp/tec tools have also been used by the European Parliamentary Comparable and Parallel Corpora project (ECPC), coordinated by Dr. Calzada Pérez (Universitat Jaume I, Spain), and by the Translational English Corpus, which has been collected and maintained under Prof. Mona Baker's supervision at the University of Manchester and the University of Edinburgh. The corpus is made available on the Internet through the Genealogies of Knowledge project website, in a collaboration between the University of Edinburgh and the University of Manchester.


Also available is the documentation of the modnlp suite (for developers).


Current developers
  • Saturnino Luz
  • Shane Sheehan
Past contributors
  • Michael Davy (contributed to the TC module)
  • Daniel Kelleher (contributed to the IDX module)
  • Noel Skehan (contributed to an earlier version of the teccli/tecser modules)