Manual for using the Genealogies Corpus Analysis Software

This page aims to provide a user-friendly guide to the corpus analysis software that is currently being developed for the Genealogies of Knowledge project.

Information about the content of our corpora can be accessed here:



 


Concordance Browser

 

Once you have downloaded and launched the software, a screen similar to the one shown below will be presented:

Click on ‘File’ to choose the language corpus you wish to work with. ‘English’ is the default corpus unless you choose another corpus from the drop-down menu. For more information about the content and design of each of the corpora, please click here.

It is also possible to access other remote corpora via the Genealogies interface. For example, in order to access the Translational English Corpus (TEC), go to ‘File->New remote corpus…’ and enter genealogiesofknowledge.net:1240 as the IP address of the new corpus server.

The ‘Options’ menu allows you to make a number of choices relating to how you access the corpora. ‘Plugins’, on the other hand, are additional tools (mostly tools of analysis) which facilitate working with the corpus and provide further information on the texts, such as frequency lists and details about the content of each corpus. All of these features are explained below.

A keyword search will populate the concordance browser window, displaying the Filename, Left Context, Keyword and Right Context for each line of text in the corpus that matches this search string.

The width of each column can then be adjusted manually by clicking and dragging the divider bars separating each column header. This is often useful when working with unusually long or short keywords.

The total number of ‘hits’ for a search query is displayed at the foot of the corpus browser window. In the screenshot below, for example, the software tells us that there are 6110 instances of the keyword democracy in the corpus selected.

Individual concordances can be saved to your computer as a CSV file via the ‘File’ menu (‘File->Save concordances…’). This can be imported into a spreadsheet and manipulated using software such as Microsoft Excel.


 

Note: The following description of search options is based on using the Genealogies English corpus. Because Latin shares the same script, the description will also be relevant to the use of the Latin corpus. In order to view more specific information on using the Greek and Arabic corpora, please click on the links below:

 

Single Keywords

Single keywords may be typed into the search bar to retrieve a concordance of all lines in the corpus containing the keyword.

The search function is by default not case-sensitive so a search for democracy will retrieve both democracy and Democracy. In order to make a search case-sensitive, click ‘Options->Case sensitive’.

Wildcards

The * symbol may be used to represent any string of characters of any length. For example, searching test* will retrieve a concordance containing all words which start with test (e.g. test, tests, testament, testability, etc.).
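As a rough illustration of how such a wildcard behaves, the sketch below translates a pattern containing * into a regular expression and applies it to a small word list. The function name `wildcard_to_regex` and the sample words are assumptions made for this example; it is not the software's own implementation.

```python
import re

def wildcard_to_regex(pattern, case_sensitive=False):
    """Turn a pattern such as 'test*' into a compiled regex for whole words."""
    # Escape every literal piece, then let each * stand for any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(body, flags)

words = ["test", "Tests", "testament", "contest", "attest"]
matcher = wildcard_to_regex("test*")
matches = [w for w in words if matcher.fullmatch(w)]
print(matches)  # ['test', 'Tests', 'testament']
```

Matching is case-insensitive by default here, mirroring the default search behaviour described above; passing `case_sensitive=True` corresponds to ticking ‘Options->Case sensitive’.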

Sequences

You can also specify sequences of keywords and/or wildcards, and the maximum number of intervening words you wish to allow between each element in the sequence.

For example, entering seen+before will return every instance in which these two words appear next to one another (and in this order):

Entering seen+[1]before finds, in addition, …seen her before…, …seen him before…, i.e., all sequences in which there is at most one word between ‘seen’ and ‘before’:

Entering seen+[2]before finds, in addition, …seen to fall before…, and all sequences in which there are at most two words between ‘seen’ and ‘before’:

Combinations of words and wildcards are also allowed in sequences, so entering know+before* will find …know before…, …know beforehand…, etc.

Sequences can also be used to find exact combinations of keywords, and a wildcard standing on its own as one element of the sequence can be used to find patterns with an exact number of intervening words. However, these queries may take a long time to run.

For example, the query good+man+*+bad produces the concordance shown below.
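The sequence syntax can be sketched in a few lines of Python. The semantics below are inferred from the descriptions above (elements joined by +, an optional [n] prefix allowing up to n intervening words, * standing for any run of characters); the function names are hypothetical and this is not the software's own matching code.

```python
import re

def parse_query(query):
    """Split 'seen+[1]before' into [(0, 'seen'), (1, 'before')]: (max gap, term)."""
    elements = []
    for part in query.split("+"):
        m = re.fullmatch(r"(?:\[(\d+)\])?(.+)", part)
        gap = int(m.group(1)) if m.group(1) else 0
        elements.append((gap, m.group(2)))
    return elements

def term_matches(term, word):
    # '*' inside a term stands for any run of characters; matching ignores case.
    regex = ".*".join(re.escape(p) for p in term.split("*"))
    return re.fullmatch(regex, word, re.IGNORECASE) is not None

def match_from(tokens, pos, elements):
    """True if the remaining elements can be matched starting at tokens[pos]."""
    if not elements:
        return True
    (gap, term), rest = elements[0], elements[1:]
    for skip in range(gap + 1):
        i = pos + skip
        if (i < len(tokens) and term_matches(term, tokens[i])
                and match_from(tokens, i + 1, rest)):
            return True
    return False

def find_sequence(tokens, query):
    """Return the token indices at which the query sequence starts."""
    elements = parse_query(query)
    # The first element anchors the match, so any gap on it is ignored.
    anchored = [(0, elements[0][1])] + elements[1:]
    return [i for i in range(len(tokens)) if match_from(tokens, i, anchored)]

tokens = "I had seen her before and seen before that".split()
print(find_sequence(tokens, "seen+before"))     # [6]
print(find_sequence(tokens, "seen+[1]before"))  # [2, 6]
```

Because a lone * matches any single word, a query such as good+man+*+bad requires exactly one word between ‘man’ and ‘bad’ in this sketch.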

Regular expressions

Finally, searching with ‘regular expressions’ allows you to retrieve any string that matches a specified pattern. Regular expressions need to be enclosed in double quotation marks (e.g. "regex"). A selection of example ‘regex’ searches is shown below:

  1. "(man|men)" retrieves a concordance of lines containing EITHER man OR men (i.e. the vertical bar is used to separate alternatives; the set of alternatives must be placed within parentheses);
  2. democracy+"(is|as|means)" retrieves a concordance of sequences containing democracy immediately followed by EITHER is OR as OR means;
  3. "labou?r" retrieves a concordance of labour AND labor (i.e. the question mark is used to indicate that the ‘u’ in this search string is optional);
  4. "citizens?(hip)?" finds instances of citizen, optionally followed by an ‘s’ AND optionally followed by ‘hip’ (i.e. the parentheses group the characters in ‘hip’ together as a suffix that can be treated as optional. This regex can thus be used to generate a concordance of citizen AND citizens AND citizenship);
  5. "pl..." retrieves a concordance of all five-letter words beginning with ‘pl’ contained in the corpus (i.e. each full stop is used here to indicate any single character, not including white space);
  6. "dem.{2,7}" returns a concordance of all words starting with ‘dem’ followed by a minimum of two characters and a maximum of seven characters (i.e. the curly braces {} are used in combination with the full stop to indicate a minimum and maximum number of characters {min,max}).
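The single-word patterns above can be tried out with Python's `re` module. This is only a sketch: it assumes that each pattern must match a whole word (as the descriptions above suggest), the word list is invented, and the software's own regex engine may differ in detail.

```python
import re

words = ["man", "men", "labour", "labor", "citizen", "citizens",
         "citizenship", "place", "plant", "plan", "democracy", "demand", "demo"]

patterns = ["(man|men)", "labou?r", "citizens?(hip)?", "pl...", "dem.{2,7}"]

# For each pattern, keep the words that it matches in full.
results = {p: [w for w in words if re.fullmatch(p, w)] for p in patterns}

for pattern, hits in results.items():
    print(pattern, "->", hits)
```

In this word list, "pl..." keeps place and plant but not the four-letter plan, and "dem.{2,7}" keeps democracy and demand but not demo, which has only one character after ‘dem’.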

A brief guide to regular expression syntax is also available here.


Sort

 

Searched concordances can be sorted at a position relative to the keyword. Select the position using the dropdown menus for either the right or left context, and click the corresponding sort button to reorganise the concordance list. The words at the sorted position will be highlighted in red.

By clicking ‘Sort by Filename’, the concordance lines can also be grouped according to the different texts from which they have been extracted.
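Positional sorting of this kind can be sketched as follows. The concordance lines are invented, each line is assumed to be a (left context, keyword, right context) tuple, and `context_word` is a hypothetical helper rather than part of the software.

```python
def context_word(line, position):
    """Return the word at `position` relative to the keyword.

    Positive positions count rightwards from the keyword, negative ones
    leftwards; an empty string is returned when the context is too short.
    """
    if position == 0:
        raise ValueError("position must be non-zero")
    left, _keyword, right = line
    words = right.split() if position > 0 else left.split()[::-1]
    index = abs(position) - 1
    return words[index].lower() if index < len(words) else ""

lines = [
    ("the future of", "democracy", "in Europe depends"),
    ("a defence of", "democracy", "as a principle"),
    ("we say that", "democracy", "is a process"),
]

# Sort on the first word to the right, then on the second word to the left.
by_right_1 = sorted(lines, key=lambda line: context_word(line, 1))
by_left_2 = sorted(lines, key=lambda line: context_word(line, -2))

print([context_word(line, 1) for line in by_right_1])   # ['as', 'in', 'is']
print([context_word(line, -2) for line in by_left_2])   # ['defence', 'future', 'say']
```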


Extract

 

Clicking on a concordance line and then pressing the ‘Extract’ button will bring up a window containing an expanded context for that concordance line, as shown below:


Metadata

 

Clicking on a concordance line and then pressing the ‘Metadata’ button will bring up a window containing metadata for the file which contains that concordance line, as shown below:


Delete Line

 

Selecting a concordance line and then clicking on the ‘Delete Line’ button will remove this line from the concordance. You can also select and delete a block of consecutive lines. This can be a useful feature if you wish to declutter the display in order to focus in on a particular collocation or set of collocations.

Running the same search again will bring any deleted lines back into the concordance displayed.


Subcorpus Selection

 

The subcorpus selection tool allows you to restrict the results of concordance queries and the contents of frequency tables to files matching certain criteria. These criteria can be, for example, the subcorpus (e.g. Internet or Modern subcorpus) to which they belong, their publication date, the author, the translator, etc.

In order to select a subcorpus, choose ‘Options->Select subcorpus…’

A window similar to the one shown below should appear:

The menu boxes allow you to select one or more parameters for texts to be included in or excluded from the desired subcorpus. In order to select more than one item (e.g. a range of dates) within a single menu box, hold the CTRL key on your keyboard (CMD on Apple computers) as you click on each.

Criteria in multiple menu boxes can also be connected so as to form the logical expressions which ultimately determine what gets included or excluded. You can, for example, choose a subcorpus of translations from French published after 1990 by selecting each of these criteria within the interface:

By default, selecting an item in the menu boxes will include files that meet this criterion in the subcorpus; in order to exclude certain texts from your subcorpus selection, select the relevant criterion and tick the ‘Exclude’ checkbox below this menu.

For example, if you want to run a search on only the translated texts contained in the corpus (i.e. no original writings), you should select ‘Translation’ within the ‘Translation status’ menu box. This will include only those texts that have been tagged as translations in their metadata within your subcorpus selection. If, on the other hand, you want to run a search on non-translated works only, you should instead select ‘Translation’ within the ‘Translation status’ menu box and tick the ‘Exclude’ checkbox below this menu.

Clicking ‘Apply’ and then ‘OK’ activates the subcorpus selection. In order to de-activate it (that is, allow a search on the full corpus or another subcorpus), choose ‘Options’ and de-select ‘Activate subcorpus’.

Sometimes the information in the metadata will not be sufficient to define the required subcorpus. In this case, we recommend building the subcorpus manually by selecting the required filenames individually, then saving the resulting textual query so that it can be reused later.

For example, the articles currently contained in the Genealogies of Knowledge Internet Corpus are sourced from a number of different outlets. In order to explore the content published by one particular outlet only (e.g. Left Flank), you will need to define this subcorpus by selecting all the filenames associated with this outlet. This will generate a textual query similar to the following which you can copy and paste into a text file for future reference:

($s/../document/@filename='int000265' or $s/../document/@filename='int000266' or $s/../document/@filename='int000267' or $s/../document/@filename='int000268' or $s/../document/@filename='int000269' or $s/../document/@filename='int000270' or $s/../document/@filename='int000271' or $s/../document/@filename='int000272')

To reuse a textual query such as this, simply tick the ‘Use Textual Query?’ checkbox and paste the query into the Textual Query box at the bottom of the Subcorpus Selection window.
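If you keep a list of filenames outside the software, a textual query in this format can also be assembled with a short script. The sketch below assumes that straight single quotes are what the Textual Query box expects and reuses the eight filenames from the example above; `build_textual_query` is a hypothetical helper, not a feature of the software.

```python
def build_textual_query(filenames):
    """Join one filename test per file with 'or', mirroring the format above."""
    clauses = [f"$s/../document/@filename='{name}'" for name in filenames]
    return "(" + " or ".join(clauses) + ")"

# The eight filenames from the example: int000265 to int000272.
filenames = [f"int{number:06d}" for number in range(265, 273)]
query = build_textual_query(filenames)
print(query)
```

The printed string can then be saved to a text file and pasted into the Textual Query box when needed.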

Some useful subcorpora selections which require textual queries are available here.


Options and Preferences

 

The preferences panel can be opened by selecting the menu option ‘Options->Preferences…’.

The following window should appear:

  • Concordance context changes the number of characters displayed to the left and right of the keyword.
  • File extract context changes the number of characters displayed in the ‘Extract’ window.
  • Font Size changes the font size in the browser, extract and metadata windows.

Plugins

 

Several plugins have been developed to enhance corpus analysis. These plugins are mature prototypes and as such are under continued development:

Frequency List

Select ‘Plugins->Word Frequency List’. The following window will appear:

Select the range of ranked items to display (the default is to display the 500 most common terms) and click on ‘Get List’ to retrieve their frequency table. This table can be saved to a CSV file which you can manipulate through spreadsheet software or other external tools.

This list can also be sorted alphabetically or numerically by frequency within the window by clicking on the headers ‘Type’ and ‘Frequency’ respectively.

Note: as shown above, set 0 as the print option to retrieve the entire frequency list.

Corpus Description Browser

By selecting ‘Plugins->Corpus Description Browser’, a window will appear which contains a list of each file in the corpus or subcorpus selection. Also displayed is the metadata associated with these files, the number of tokens they contain and their type-token ratios.

At the bottom of the window you will see the total number of tokens in the corpus (or subcorpus selection) and the overall type-token ratio.
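For readers unfamiliar with the measure, the type-token ratio is the number of distinct word forms (types) divided by the total number of running words (tokens). The sketch below uses a deliberately simple tokenisation (lower-cased runs of word characters), which is an assumption; the software's own tokenisation rules are not described here and its figures may differ.

```python
import re

def type_token_ratio(text):
    """Return (tokens, types, ratio), treating case variants as one type."""
    tokens = re.findall(r"\w+", text.lower())
    types = set(tokens)
    ratio = len(types) / len(tokens) if tokens else 0.0
    return len(tokens), len(types), ratio

tokens, types, ratio = type_token_ratio("The people rule and the people decide")
print(tokens, types, round(ratio, 3))  # 7 5 0.714
```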

Mosaic Visualization

Unlike the Frequency List and Corpus Description Browser, the Mosaic and the Concordance Tree plugins generate positional word statistics based on a concordance you have already generated.

Within the Mosaic plugin, four modes of operation are available:

[A] Frequency Mosaic

This view shows word frequency at positions to the left and right of a keyword. Each box (tile) represents a different word at each position. The height of each box is directly proportional to its frequency.

For example, the Mosaic shown below was produced by searching for the keyword muscles. Looking at the visualisation, it should be clear that:

  • ‘of’ is the most frequent word one position to the right of muscles
  • ‘of’ is approximately four times more frequent than the second most frequent word (‘that’) in this same position relative to the keyword muscles.
  • The word ‘the’ occurs more frequently one position to the left of muscles than in any other word-position relative to this keyword.
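The counts behind a Frequency Mosaic can be sketched as one frequency table per word position. The concordance lines below are invented for illustration, each line is assumed to be a (left context, keyword, right context) tuple, and `positional_frequencies` is a hypothetical name rather than the plugin's own code.

```python
from collections import Counter

def positional_frequencies(lines, span=2):
    """Count words at each position from -span to +span around the keyword."""
    columns = {pos: Counter() for pos in range(-span, span + 1) if pos != 0}
    for left, _keyword, right in lines:
        left_words = left.lower().split()[::-1]   # nearest word first
        right_words = right.lower().split()
        for offset in range(1, span + 1):
            if offset <= len(left_words):
                columns[-offset][left_words[offset - 1]] += 1
            if offset <= len(right_words):
                columns[offset][right_words[offset - 1]] += 1
    return columns

lines = [
    ("contraction of the", "muscles", "of the arm"),
    ("weakness in the", "muscles", "of the leg"),
    ("the intercostal", "muscles", "that lift the ribs"),
]
columns = positional_frequencies(lines)
print(columns[1].most_common())   # [('of', 2), ('that', 1)]
print(columns[-1].most_common())  # [('the', 2), ('intercostal', 1)]
```

Each counter corresponds to one column of the Mosaic, with tile heights proportional to the counts.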

[B] Frequency Mosaic (No Stopwords)

This mode simply removes words which occur with a frequency above a certain threshold. It thus aims to help the researcher by omitting very common ‘stopwords’ such as ‘the’, ‘of’, ‘and’, etc.

The figure below shows the second Mosaic mode for the keyword muscles:

[C] Collocation Strength (Global)

The Collocation Strength Mosaic scales the height of the word boxes using a collocation strength statistic. This statistic is calculated as the positional frequency of a word relative to the keyword divided by its absolute frequency in the selected corpus or subcorpus. This visually enhances words that occur more frequently with the keyword at a particular position than would be expected based on the word’s frequency in the (sub)corpus as a whole.

For example, in the screenshot below, ‘masticatory’ is calculated as having a high collocation strength (20.45) one position to the left of the keyword muscles. This is because, out of a total three times that ‘masticatory’ appears in the English corpus, it is found in two cases immediately to the left of muscles.

‘Intercostal’, on the other hand, appears much more frequently (26 times) than ‘masticatory’ one position to the left of the keyword muscles. However, because it is significantly more frequent in the corpus as a whole (a search for ‘intercostal’ on its own returns 50 hits), this collocation is less significant and its collocation strength is calculated as just 10.23.

For reference, it can be shown that the collocation strength measure used is equivalent to log(Pointwise mutual information). More information about this feature of the Mosaic plugin can also be found here.
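The ratio described above (positional frequency divided by frequency in the corpus as a whole) can be worked through for the two examples. Note that this sketch computes only the raw ratio: the scores displayed by the plugin (20.45 and 10.23) are evidently on a different scale, whose exact scaling is not specified here, so the numbers below should be compared with each other rather than with the tool-tip values.

```python
def raw_collocation_ratio(positional_frequency, corpus_frequency):
    """Positional frequency next to the keyword divided by overall corpus frequency."""
    return positional_frequency / corpus_frequency

masticatory = raw_collocation_ratio(2, 3)    # 2 of its 3 occurrences precede 'muscles'
intercostal = raw_collocation_ratio(26, 50)  # 26 of its 50 occurrences precede 'muscles'
print(round(masticatory, 3), round(intercostal, 3))  # 0.667 0.52
```

Although ‘intercostal’ is far more frequent next to muscles in absolute terms, its raw ratio is lower, which matches the ordering of the two collocation strengths reported above.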

[D] Collocation Strength (Local)

In this last view the collocation strength boxes are scaled up so that each column is full height. This distorts the values so that comparisons across word positions are now invalid. However, this improves the researcher’s ability to inspect the word positions individually and view collocation patterns which would have been too difficult to identify in the global view:

[E] Mosaic Interactions

The Mosaic and the concordance window are linked and interactions with the Mosaic are mirrored in the concordance window. Clicking on a tile in the Mosaic sorts the appropriate column in the concordance window and scrolls to the clicked word where it appears in that position.

All concordance lines containing the clicked word in the relevant position are highlighted in purple. In addition, the clicked word is also highlighted in the concordance where it occurs in positions other than the one selected in the Mosaic. This allows users to capture occurrences of a given collocation, such as ‘good’ + ‘citizen’, within an expanded collocation span, as can be seen in the following screenshot.

Hovering over a tile in the Mosaic pops up a tool-tip which displays the word and its collocation frequency in this word-position. The tool-tip also displays the calculated collocation strength statistic when using either of the Collocation Strength views.

Hovering over a tile that is too small to read expands it, along with a number of the tiles around it.

Right clicking on a context word will attempt to search for concordances of the keyword and the clicked word at the chosen position. Warning: These searches may take a long time to finish.

If you find a tile marked *null*, this simply means that no collocations can be found at this word-position. This is most commonly because the keyword occurs at the very beginning or end of a corpus text.

Concordance Tree

The Concordance Tree builds a tree of either the left or right context of the concordance. The tree is rooted at the keyword and expands outwards, allocating a separate column to each word position. Thus, unlike the Mosaic visualisation, the Concordance Tree maintains the sentence structure within each line, albeit only on one side of the search term at a time.

You can use the scroll function on your mouse to zoom in and out, and click and drag using the hand tool to move the tree around the screen. Clicking on a word expands/collapses the branch of the tree associated with that word.
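The underlying data structure can be sketched as a prefix tree in which each level corresponds to one word position. The right contexts below are invented fragments, and `build_tree` and `render` are hypothetical helpers; this is not the plugin's own code.

```python
def build_tree(contexts):
    """Build a nested-dict tree with one level per word position and a count per node."""
    root = {"count": 0, "children": {}}
    for context in contexts:
        node = root
        node["count"] += 1
        for word in context.lower().split():
            node = node["children"].setdefault(word, {"count": 0, "children": {}})
            node["count"] += 1
    return root

def render(node, label, depth=0):
    """Return the tree as indented lines, most frequent branches first."""
    lines = [f"{'  ' * depth}{label} ({node['count']})"]
    ordered = sorted(node["children"].items(),
                     key=lambda item: (-item[1]["count"], item[0]))
    for word, child in ordered:
        lines.extend(render(child, word, depth + 1))
    return lines

right_contexts = ["of citizens", "of people", "that surrounds", "that was ready"]
tree = build_tree(right_contexts)
print("\n".join(render(tree, "community")))
```

For a left-context tree, the same functions could be applied after reversing the word order of each left context so that the word nearest the keyword comes first.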


Example Usage for the English Corpus

 

The Genealogies of Knowledge English corpus can be used to explore a wide range of research questions: for instance, users might choose to compare multiple retranslations of a single source text or alternatively to contrast the use of a particular concept at different moments in its historical evolution by focusing on much larger selections of texts.

Examples illustrating the ways in which members of the Genealogies of Knowledge team are using the resource and software are provided below and more will be added over time as our investigations progress.

 

  • Using the Mosaic plugin to investigate Benjamin Jowett’s (1881) translation of Thucydides’ History of the Peloponnesian War 

The above Mosaic visualisation has been produced by selecting Benjamin Jowett’s translation of Thucydides’ History of the Peloponnesian War (Filename: mod000019) in the subcorpus selector and searching for the keyword assembly in the corpus browser. The Mosaic tool was launched by clicking ‘Plugins->Mosaic’ and the ‘Collocation Strength (Local)’ view was then chosen.

Previous analyses of Jowett’s translation have suggested this translator made significant interventions in the text as part of his interpretation of the History, in some key instances dramatically downplaying the political agency of ordinary Athenian citizens in the representation of classical democracy that Thucydides provides. Corpus tools allow the researcher to support, complicate or challenge such hypotheses by greatly facilitating the investigation of Jowett’s target text as a whole and the identification of patterns within it.

This Mosaic visualisation tool provides a useful starting point for this corpus-based approach. Specifically, and with reference to the keyword assembly, it indicates we might productively begin our analysis by examining the extent to which this translator shows an unusual preference for the verbs ‘summon’ and ‘summoned’ in connection with this noun (perhaps implying the existence of a higher political authority above and beyond popular democratic structures), instead of less marked choices such as ‘hold’ or ‘held’. This can then be followed by close reading of particular concordance lines extracted from Jowett’s version and qualitative comparison with other translations of the History (e.g. Richard Crawley’s – Filename: mod000020).

For more information about this study, see Jones, H. (forthcoming) ‘Jowett’s Thucydides: A corpus-based analysis of translation as political intervention’.

  • Using the Concordance Tree plugin to approach ROAR Magazine’s conceptualisation of ‘community’

The above Concordance Tree visualisation was produced by first using the text-query option in the concordance browser’s subcorpus selector to capture all the articles from ROAR Magazine that are included in the Genealogies of Knowledge Internet Corpus. Secondly, a search for the keyword community was conducted in the browser. Finally, the visualisation tool was launched by clicking ‘Plugins’, then ‘Concordance Tree’, then ‘Grow Tree ->’.

The screenshot above only shows the central part of the right-side variant of the tree, which in this case foregrounds several instances of postmodification (either through prepositional addition or through the introduction of a relative clause). Although restricted in scope, this limited view serves as a useful entry point for examining the patterning of ‘community’ in ROAR Magazine, a publication situated on the radical left on the political spectrum.

‘Community of’, for instance, can be seen to be followed not only by human agents (‘citizens’, ‘people’), but also legal attributes (‘rights’) and belief systems (‘nationalism’). The medium responsible for binding members of a certain community together might then be understood as the concrete physical distribution of human bodies, as a transcendental unifying principle, or both. In the case of ‘community of citizens’, for example, what holds this community together is simultaneously derived from a measure of geographical proximity and an appeal to shared rights and responsibilities.

Moreover, in the relative clauses, ‘community’ appears to be used in ways that implicate this keyword in a process of generating and consolidating boundaries. We find that the community ‘which includes’ is also the one ‘that surrounds’. Following the final branch of the tree, which starts with ‘that’, reveals a less conspicuous instance of delineation in the clause ‘community that was ready to defend itself’ (as seen below). Self-defence is generally defined as resistance to or protection from hostile external elements, and in this sense ‘community’ would seem to be characterized here as constitutive of the etymologically opposite mechanism of ‘immunity’.

In conclusion, these patterns suggest that ROAR Magazine’s contributors perform a tentative balancing act that situates belonging between tangible reality and principled thought. Furthermore, they hint at a conception of community determined by the reciprocity of acts of inclusion and exclusion. The side to which the balance tilts can be further examined by trailing the Concordance Tree’s other branches, or by turning to the relevant concordance lines in search of further specifications.


Using the Greek corpus

 

In order to search the Greek corpus, you must enter your query in the Greek script. The browser does not recognise transliterated Greek. Thus, searching for a)/nqrwpos (as one might when using e.g. the Perseus Digital Library or the Thesaurus Linguae Graecae) will not return any results. You must instead enter ἄνθρωπος in order to obtain the desired results.

Also unlike search query tools implemented by the Perseus Digital Library and the Thesaurus Linguae Graecae, the Genealogies of Knowledge software is currently sensitive to diacritical marks. Queries are, therefore, literal, in the sense that all diacritical marks must be correctly entered by the user.
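The practical consequence of literal matching can be seen by comparing two forms of the same noun character by character. The sketch below uses Python's standard `unicodedata` module; `strip_diacritics` is a hypothetical helper shown only to make the contrast visible and is not something the software does.

```python
import unicodedata

def strip_diacritics(text):
    """Remove accents and breathings by dropping combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

nominative = unicodedata.normalize("NFC", "ἄνθρωπος")
genitive = unicodedata.normalize("NFC", "ἀνθρώπου")

# Compare the first six letters (the stem) literally and without diacritics.
literal_match = nominative[:6] == genitive[:6]
stripped_match = strip_diacritics(nominative)[:6] == strip_diacritics(genitive)[:6]
print(literal_match, stripped_match)  # False True
```

Because the software compares queries literally, the two stems count as different strings even though they differ only in their diacritical marks.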

 

Wildcards vs. Regular Expressions

Owing to the fact that queries are literal, you may need to conduct some searches in the Greek corpus using regular expressions.

For example, in order to find all the declined forms of ἄνθρωπος in the corpus, you cannot simply add a wildcard * to the stem and search for ἄνθρωπ*, because in the genitive and dative singular, as well as in the accusative, genitive and dative plural, the accent moves to the penultimate syllable (onto the ώ):

Gen. sing. ἀνθρώπου
Dat. sing. ἀνθρώπῳ
Acc. pl. ἀνθρώπους
Gen. pl. ἀνθρώπων
Dat. pl. ἀνθρώποις

Thus, querying ἄνθρωπ* in the concordance will not provide the desired results since it will only return the nominative, accusative and vocative singular and the nominative plural.

 

The regular expression "ἀνθρώπ.{1,3}|ἄνθρωπ.{1,3}", on the other hand, yields better (but not perfect) results. This returns all words that begin with ἀνθρώπ- followed by a minimum of one and a maximum of three other characters OR those that begin with ἄνθρωπ- (i.e. with the accent on the first syllable rather than on the ω) followed by a minimum of one and a maximum of three other characters:

Note: In addition to returning the desired results, this query also returned possibly undesired results in the concordance, such as the adjectives ἀνθρώπεια and ἀνθρώπινα. The Delete Line button in the concordance browser is helpful for removing unwanted results from the concordance display.
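The behaviour of this regular expression can be checked against a small list of forms using Python's `re` module. This is a sketch under two assumptions: the pattern is required to match a whole word, and both pattern and words are normalised to NFC so that visually identical accented letters compare equal. The software's own engine may behave differently in detail.

```python
import re
import unicodedata

def nfc(text):
    # Normalise so that visually identical accented letters compare equal.
    return unicodedata.normalize("NFC", text)

pattern = re.compile(nfc("ἀνθρώπ.{1,3}|ἄνθρωπ.{1,3}"))

forms = ["ἄνθρωπος", "ἄνθρωπον", "ἄνθρωποι", "ἀνθρώπου", "ἀνθρώπῳ",
         "ἀνθρώπους", "ἀνθρώπων", "ἀνθρώποις",
         "ἀνθρώπινα", "ἀνθρώπεια", "ἀνθρωπίνως"]

hits = [form for form in forms if pattern.fullmatch(nfc(form))]
missed = [form for form in forms if form not in hits]
print(len(hits), missed)
```

In this list, every declined form of the noun is matched, the adjectives ἀνθρώπινα and ἀνθρώπεια are matched as well (the undesired results noted above), and only ἀνθρωπίνως is left out because neither alternative fits its accentuation.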

 

Querying Verbs

Verbs are generally more challenging to search for than nouns. In addition to the problem that the internal vowels in contract verbs change, ancient Greek forms its tenses by adding elements to the verb stem: the augment ε-, reduplication (e.g. πε-) and the tense markers -κ- and -σ-. Inflectional endings are then used to conjugate the verb.

It is therefore recommended to use the principal parts as a way of structuring the query, using, for example, the tense stems with wildcards. The following table may be helpful:

PRINCIPAL PART | TENSE STEM (FORM) | TENSE STEM (NAME) | VERB FORMS DERIVED FROM STEM

I. παιδεύω | παιδευ- | present tense stem | present indicative active, middle, passive; present subjunctive active, middle, passive; present optative active, middle, passive; imperfect indicative active, middle, passive

II. παιδεύσω | παιδευσ- | future active and middle tense stem | future indicative active, middle

III. ἐπαίδευσα; ἔλιπον | παιδευσ-; λιπ- | first aorist active and middle tense stem; second aorist active and middle tense stem | aorist indicative active, middle; aorist subjunctive active, middle; aorist optative active, middle; aorist infinitive active, middle

IV. πεπαίδευκα | πεπαιδευκ- | perfect active tense stem | perfect indicative active; perfect infinitive active; pluperfect indicative active

V. πεπαίδευμαι | πεπαιδευ- | perfect middle and passive tense stem | perfect indicative middle, passive; perfect infinitive middle, passive; pluperfect indicative middle, passive

VI. ἐπαιδεύθην | παιδευθ-; παιδευθησ- | aorist passive tense stem; future passive tense stem | aorist indicative passive; aorist subjunctive passive; aorist optative passive; aorist infinitive passive; future indicative passive

For example, the query παιδεύ* will return most of the following verb forms derived from Principal Part I, their corresponding infinitives, and several of the present active participles, but none of the middle/passive participles.

In the following table, forms that do not begin with the accented stem παιδεύ- (for example παιδευόμεθα, παιδευούσης or παιδεῦον) will not be included in the concordance because of the shifts in accent. These would have to be searched for separately:

Present indicative active παιδεύω, παιδεύεις, παιδεύει

παιδεύομεν, παιδεύετε, παιδεύουσιν

Present indicative middle/passive παιδεύομαι, παιδεύῃ/παιδεύει, παιδεύεται

παιδευόμεθα, παιδεύεσθε, παιδεύονται

Present subjunctive active παιδεύω, παιδεύῃς, παιδεύῃ

παιδεύωμεν, παιδεύητε, παιδεύωσιν

Present optative active παιδεύοιμι, παιδεύοις, παιδεύοι

παιδεύοιμεν, παιδεύοιτε, παιδεύοιεν

Present subjunctive middle/passive παιδεύωμαι, παιδεύῃ, παιδεύηται

παιδευώμεθα, παιδεύησθε, παιδεύωνται

Present optative middle/passive παιδευοίμην, παιδεύοιο, παιδεύοιτο

παιδευοίμεθα, παιδεύοισθε, παιδεύοιντο

Present infinitive active παιδεύειν
Present infinitive middle/passive παιδεύεσθαι
Present active participle παιδεύων, παιδεύουσα, παιδεῦον

παιδεύοντος, παιδευούσης, παιδεύοντος

παιδεύοντι, παιδευούσῃ, παιδεύοντι

παιδεύοντα, παιδεύουσαν, παιδεῦον

 

παιδεύοντες, παιδεύουσαι, παιδεύοντα

παιδευόντων, παιδευουσῶν, παιδευόντων

παιδεύουσι(ν), παιδευούσαις, παιδεύουσι(ν)

παιδεύοντας, παιδευούσας, παιδεύοντα

Present middle/passive participle παιδευόμενος, παιδευομένη, παιδευόμενον

παιδευομένου, παιδευομένης, παιδευομένου

παιδευομένῳ, παιδευομένῃ, παιδευομένῳ

παιδευόμενον, παιδευομένην, παιδευόμενον

 

παιδευόμενοι, παιδευόμεναι, παιδευόμενα

παιδευομένων, παιδευομένων, παιδευομένων

παιδευομένοις, παιδευομέναις, παιδευομένοις

παιδευομένους, παιδευομένας, παιδευόμενα
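Which forms a literal query such as παιδεύ* would miss can be checked mechanically. The sketch below assumes that the wildcard query amounts to a literal prefix test on the accented stem, with both sides normalised to NFC; the forms are a small sample taken from the table above and `matches_prefix_query` is a hypothetical helper.

```python
import unicodedata

def nfc(text):
    # Normalise so that visually identical accented letters compare equal.
    return unicodedata.normalize("NFC", text)

def matches_prefix_query(form, stem="παιδεύ"):
    """Mimic the literal query παιδεύ*: the form must start with the accented stem."""
    return nfc(form).startswith(nfc(stem))

forms = ["παιδεύω", "παιδεύομεν", "παιδευόμεθα", "παιδεύειν",
         "παιδεῦον", "παιδευούσης", "παιδευόμενος"]

found = [form for form in forms if matches_prefix_query(form)]
missed = [form for form in forms if not matches_prefix_query(form)]
print(len(found), len(missed))  # 3 4
```

The four missed forms are those in which the accent has moved off the stem vowel or changed to a circumflex, so each would need its own query or a regular expression.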

It goes without saying that a query that finds the conjugations of irregular verbs such as εἰμί will require an extremely complex regular expression. Practically speaking, however, such a query is rare.


 

Using the Arabic corpus

 

The software used to search the Arabic corpus is currently under development. Please check back soon for updates on its progress.