Corpus Semiotics: Reassessing Context – Abstracts

Conference Room (C1.18), Ellen Wilkinson Building, The University of Manchester

5 June 2017

In association with DH@Manchester


(listed in order of appearance)


  • Martin Thomas (University of Leeds, UK)
  • Apostolos Antonacopoulos (University of Salford, UK)
  • Christian Clausner (University of Salford, UK)

Since Elinor Ochs’ seminal paper (1979), much serious thought has been given to the representation of spoken-language data and to how to accommodate non-verbal communication. More recently, attention has turned to the potential of incorporating video data to capture kinesic features (see, e.g., Knight et al. 2009).

Meanwhile, corpus-based approaches to handling written language have not only neglected non-verbal features: they have often actively sought to isolate linguistic elements from their natural graphic contexts (e.g., the CleanEval workshops of the mid-to-late 2000s). This ‘cleansing’ removes context necessary for the comprehension of many genres of text, notably those which make extensive use of tables, graphs, images and other graphic features. Even for texts which are less visually informative (Bernhardt 1985), such as most novels, the formatting of words is highly constitutive of genre and supports particular reading strategies (Waller 1987). Moreover, the writing systems of different languages afford diverse typographic realizations and layout choices, which add an under-researched aspect to comparative discourse and translation studies. Thus, for a wide range of research questions about patterns in meaning-making, corpora which account for the graphic features of texts and their ‘documentness’ would seem invaluable.

Creating such corpora is complicated fundamentally by the fact that writing exists in two-dimensional space. This raises a series of questions, not least around segmentation and the linearity assumed in most corpus annotation. Mindful of these challenges, Bateman et al. (2004) developed an annotation scheme, known as GeM, which provides a means of describing many features of the graphic realization of documents. Crucially, the scheme uses stand-off layers of XML which allow the description of formal features, such as layout structure, to remain entirely independent of semantic features, such as rhetorical relationships. Thomas developed a concordancer for GeM (2007) and tools to support semi-automated annotation by exploiting OCR output (2009). Hiippala has since introduced additional techniques from computer vision (2016). Nevertheless, the need for manual intervention in the annotation process has so far inhibited the construction of large corpora. Moreover, the GeM scheme’s assumption of rectangular blocks causes significant problems when accounting for many layouts.
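The stand-off arrangement can be sketched minimally in code. In the sketch below, all element names, attributes and ids are invented for illustration and do not reproduce the actual GeM DTDs; the point is only that each layer is a separate XML document pointing back to shared base units by id, so formal and semantic descriptions never have to mention one another.

```python
import xml.etree.ElementTree as ET

# Base layer: minimal units of verbal content, each with a stable id.
base = ET.fromstring("""
<base>
  <unit id="u1">Serving suggestion</unit>
  <unit id="u2">Ingredients</unit>
</base>""")

# Layout layer: formal features only, linked to base units by id.
layout = ET.fromstring("""
<layout>
  <block xref="u1" font-weight="bold"/>
  <block xref="u2" font-weight="normal"/>
</layout>""")

# Rhetorical layer: semantic relations only, again via ids,
# entirely independent of the layout description.
rst = ET.fromstring("""
<rst>
  <relation name="elaboration" nucleus="u2" satellite="u1"/>
</rst>""")

# A query can join the layers through the shared ids.
units = {u.get("id"): u.text for u in base.iter("unit")}
for block in layout.iter("block"):
    print(units[block.get("xref")], block.get("font-weight"))
```

Because the layers share nothing but ids, one layer can be regenerated (for instance, after re-running OCR) without touching the annotation in the others.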

Alongside this work on multimodal document corpora, significant advances have been made in the computer image analysis community, with large-scale applications in the digital humanities. Specifically, the PAGE format (Pletschacher and Antonacopoulos 2010) combines accurate description of physical layout and content elements. Layout regions are described in terms of polygons with an arbitrary number of edges rather than rectangles. PAGE supports a wealth of metadata that makes it useful for real-world applications, including nuanced reading order specifications and multi-layered, flexible descriptions and typing of regions.
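The difference between rectangles and arbitrary polygons can be made concrete with a small sketch; the class below is an illustrative stand-in, not the actual PAGE schema, whose region elements carry far richer metadata.

```python
from dataclasses import dataclass

@dataclass
class Region:
    """Illustrative stand-in for a layout region: a typed polygon
    with an arbitrary number of vertices, not just a rectangle."""
    region_type: str   # e.g. "text", "image", "table"
    points: list       # [(x, y), ...] in page coordinates

def area(points):
    """Shoelace formula: exact area of any simple polygon."""
    total = 0.0
    for i in range(len(points)):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# An L-shaped text block that no single rectangle can describe exactly.
l_shape = Region("text", [(0, 0), (4, 0), (4, 2), (2, 2), (2, 5), (0, 5)])
print(area(l_shape.points))  # 14.0, versus 20.0 for its bounding box
```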

This presentation draws on the experience of GeM and PAGE. It proposes a new take on the representation of static documents in corpora, which explicitly prioritizes the graphic over the linguistic, not least because humans perceive visual displays holistically (Waller 2012). Thus, rather than starting with transcription of verbal elements and then encoding their graphic realization, the proposed approach describes graphic features, including verbal elements, in terms of variation in values in two-dimensional space. The basic hypothesis is that combining such features in (frequently occurring) configurations will provide a way in to the identification of design patterns, thus bringing the descriptive power offered by linguistic corpora to documents as they occur in the wild.


Bateman, J., J. Delin & R. Henschel (2004) ‘Multimodality and empiricism: Preparing for a corpus-based approach to the study of multimodal meaning-making’, in E. Ventola, C. Charles & M. Kaltenbacher (eds) Perspectives on multimodality, Amsterdam: John Benjamins, 65-87.

Bernhardt, S. A. (1985) ‘Text structure and graphic design: the visible design’, in J. Benson & W. Greaves (eds) Systemic perspectives on discourse, Vol. 2, Norwood, NJ: Ablex, 18-38.

Hiippala, T. (2016) ‘Semi-automated annotation of page-based documents within the Genre and Multimodality framework’, in Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH), Berlin: 84-89.

Knight, D., D. Evans, R. Carter & S. Adolphs (2009) ‘HeadTalk, HandTalk and the corpus: towards a framework for multi-modal, multi-media corpus development’, Corpora, 4(1): 1-32.

Ochs, E. (1979) ‘Transcription as Theory’, in E. Ochs & B. B. Schieffelin (eds) Developmental Pragmatics, New York: Academic Press, 43-72.

Pletschacher, S. & A. Antonacopoulos (2010) ‘The PAGE (Page Analysis and Ground-Truth Elements) format framework’, in Proceedings of the 20th International Conference on Pattern Recognition (ICPR2010), Istanbul: 257-260.

Thomas, M. (2007) ‘Querying multimodal annotation: A concordancer for GeM’, in Proceedings of the Linguistic Annotation Workshop at the 45th Annual Meeting of the Association for Computational Linguistics, Prague: 57-60.

Thomas, M. (2009) Localizing pack messages: A framework for corpus-based cross-cultural multimodal analysis. PhD thesis, University of Leeds.

Waller, R. (1987) The typographic contribution to language. PhD thesis, University of Reading.

Waller, R. (2012) ‘Graphic literacies for a digital age: The survival of layout’, The Information Society, 28(4): 236-252.



  • Sofia Malamatidou (University of Birmingham, UK)

In recent years, corpus methods have been expanded to encompass the interrogation of multi-modal (text-image-sound) material (Adolphs et al. 2011; Baldry and O’Halloran 2010). Although such attempts are significant in driving the field forward, they still primarily focus on the important role played by text (whether spoken or written), and no attempt has been made so far to develop a corpus methodology (and corresponding corpus tools) that focuses exclusively on the interrogation of visual material. This presentation aims to address this gap by acknowledging the importance of images as meaning-making mechanisms and proposing a new methodological framework for their investigation using corpora.

In the modern globalised world, a typical example of the powerful meaning-making potential of images is magazine covers. Despite the apparent universality of images, the central image of a magazine cover is often adapted when a new audience needs to be addressed, even though the idea depicted remains the same. Such is the case with the international editions of Scientific American. This is a clear indication that meaning relies heavily on visual material, even for topics that are expected to be relevant to a large number of people, irrespective of the culture to which they belong (e.g. space, biology, physics). Thus, it is important to investigate how meaning interacts with, and is defined and limited by, visual material in such cases, as well as to identify larger patterns of meaning-making across cultures.

To achieve this, a new methodological framework is required, which can be used in conjunction with corpus-based methods for the interrogation of large electronic collections of visual material. Relying on principles of visual content analysis (Bell 2001) and visual semiotics (Kress and van Leeuwen 2006), it is possible to annotate images to make them ‘searchable’. According to this methodological framework, each relevant aspect of an image is converted into a variable, which can take certain values. Images can then be tagged with these variables and values, which can ultimately be converted into search parameters for corpus tools. For instance, a possible variable can be the type of participant in an image, which can be assigned one of the following values: human, human-like, animal, or no participant. Depending on the nature and aim of each individual project, variables and values can be selected accordingly, while it is also possible to develop universal guidelines for the annotation of images, similar to the Text Encoding Initiative (TEI) guidelines.
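A minimal sketch of how such tagging could become searchable follows; the variable names, values and image ids are invented for illustration and do not represent a fixed annotation scheme.

```python
# Each image is tagged with variable -> value pairs drawn from a
# predefined annotation scheme; all tags here are invented examples.
covers = [
    {"id": "cover-01", "participant": "human",          "topic": "biology"},
    {"id": "cover-02", "participant": "human-like",     "topic": "biology"},
    {"id": "cover-03", "participant": "no participant", "topic": "space"},
]

def query(images, **constraints):
    """Return images whose tags satisfy every variable=value constraint."""
    return [img for img in images
            if all(img.get(var) == val for var, val in constraints.items())]

hits = query(covers, topic="biology")
print([img["id"] for img in hits])  # ['cover-01', 'cover-02']
```

Once images are tagged this way, cross-cultural comparison reduces to running the same query over each edition’s sub-collection and comparing the frequencies of the returned values.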

Data from the magazine covers of four international editions of Scientific American (French, Spanish, Russian and Brazilian) will be used to demonstrate how this new framework can be applied to the study of cross-cultural visual material. Results suggest that certain aspects of cross-cultural communication are particularly elusive and can only be captured if appropriate techniques for the corpus analysis of images are developed.


Adolphs, S., D. Knight and R. Carter (2011) ‘Capturing Context For Heterogeneous Corpus Analysis: Some First Steps’, International Journal of Corpus Linguistics 16(3): 305-324.

Baldry, A.P. & K.L. O’Halloran (2010) ‘Research into the Annotation of a Multimodal Corpus of University Websites: An illustration of multimodal corpus linguistics’, in T. Harris (ed.) Corpus Linguistics in Language Teaching, Bern: Peter Lang, 177-210.

Bell, P. (2001) ‘Content Analysis of Visual Images’, in T. Van Leeuwen and C. Jewitt (eds) Handbook of Visual Analysis, Thousand Oaks, CA: Sage, 10-34.

Kress, G. R. & T. van Leeuwen (2006) Reading Images: The grammar of visual design, Abingdon & New York: Routledge.



  • Svenja Adolphs (University of Nottingham, UK)

Heterogeneous corpora are emergent multi-modal datasets which comprise a variety of different records of everyday communication, from SMS/MMS messages to interactions in virtual environments, and from location-based data to phone and video calls. By tracking a person’s specific (inter)actions over time and place, the analysis of such “ubiquitous” corpora enables more detailed investigations of the interface between different communicative modes.

In this talk I will outline some of the ways in which multi-modal, heterogeneous corpora can be used in corpus-based analyses, and how we can construct richer descriptions of language in relation to different aspects of context gathered from multiple sensors (e.g. position, movement and time).



  • Ben Clarke (University of Portsmouth, UK)

This presentation reports on methodological lessons learned as a result of undertaking a past project (Clarke, 2012; 2013) which empirically tested the predictive hypotheses of Hallidayan text-context relations (Halliday, 1977; 1985a; 1985b; Martin, 1992; Hasan, 1995). The chief finding of the aforementioned project was support for the Hallidayan hypothesis at a broad level of generality; in relation to the specific textual and contextual features under focus in the case study, it was found that there was (i) an increased occurrence of ellipsis the more ancillary a text’s context, and (ii) a greater proportion of instances of situationally-recoverable ellipsis to instances of textually-recoverable ellipsis the more ancillary the text’s context.

A corpus-based, big-data, quantitative methodology was adopted for this project. It proceeded on an assumption of co-variation between textual and contextual variables. That is, if the occurrence of ellipsis in datasets differentiated by their contextual mode of discourse was not found to be statistically significantly different, the null hypothesis would be retained. If, instead, the occurrence of ellipsis in the different datasets was found to differ to a statistically significant degree, the alternative hypothesis could be supported. However, a potential confounding factor in the project design was the decision to assign whole texts to corpora and to the contextual characteristics such corpora were intended to represent. Doing so calls into question the reliability of the methodology adopted and the validity of the consequent results (Clarke, 2013: 292-295; cf. Biber, 1993; Williams, 2002). In this presentation, a revised methodology, more sensitive particularly to dynamic shifts in contextual instantiations as a text proceeds, is proposed and illustrated.
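The co-variation logic described above amounts to a standard test of independence. As an illustration only (the counts below are invented and are not Clarke’s data), a 2x2 chi-squared test compares elliptical against non-elliptical clauses across two corpora differentiated by contextual mode:

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Invented counts: elliptical vs. non-elliptical clauses in two corpora.
# Corpus A (more ancillary context): 120 elliptical out of 1000 clauses.
# Corpus B (less ancillary context):  60 elliptical out of 1000 clauses.
stat = chi_squared_2x2(120, 880, 60, 940)

# With 1 degree of freedom, a statistic above 3.841 rejects the null
# hypothesis of no association at p < 0.05.
print(round(stat, 2))  # 21.98
```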


Biber, D. (1993) ‘Representativeness in corpus design’, Literary and Linguistic Computing, 8(4): 243–57.

Clarke, B. P. (2012) Do Patterns of Ellipsis in Text Support Systemic Functional Linguistics’ ‘Context-metafunction Hook-up’ Hypothesis? A corpus-based approach,  PhD thesis, Cardiff: Cardiff University.

Clarke, B. P. (2013) ‘The Differential Patterned Occurrence of Ellipsis in Texts Varied for Contextual Mode: Some support for the ‘mode of discourse – textual metafunction’ hook-up’, in G. O’Grady, T. Bartlett & L. Fontaine (eds) Choice in Language: Applications in Text Analysis, Sheffield: Equinox, 269-297.

Halliday, M. A. K. (1977) ‘Text as Semantic Choice in Social Contexts’, in T. van Dijk & J. Petofi (eds) Grammars and Descriptions, Berlin: Walter de Gruyter, 176-225.

Halliday, M. A. K. (1985a) ‘Context of Situation’, in M. A. K. Halliday and R. Hasan Language, Context and Text: Aspects of language in a social semiotic perspective, Oxford: Oxford University Press, 3-14.

Halliday, M. A. K. (1985b) ‘Register Variation’, in M. A. K. Halliday & R. Hasan Language, Context and Text: Aspects of language in a social semiotic perspective, Oxford: Oxford University Press, 29-49.

Hasan, R. (1995) ‘The Conception of Context in text’, in P. H. Fries & M. Gregory (eds) Discourse in Society: Systemic functional perspectives, London: Edward Arnold, 183-283.

Martin, J. R. (1992) English Text: System and structure, Amsterdam: John Benjamins.

Williams, G. (2002) ‘In Search of Representativity in Specialised Corpora: Categorisation through collocation’, International Journal of Corpus Linguistics 7(1): 43–64.



  • Natalia Pavliuk (B. Grinchenko Kyiv University, Ukraine)

Connotative proper nouns that signal extralinguistic information in their inner structure change their semantics when used in different contexts, or when used to point to different referents. To study the semantics of such names, especially those that communicate some extralinguistic information about their referents – for instance, historical, mythological, Biblical and literary names – we need to process a considerable number of texts that feature their use in different contexts.

To analyse changes in the semantics of well-known literary names, I constructed a parallel corpus of English-Ukrainian texts featuring the names of the main characters in Shakespeare’s Hamlet and Tom Stoppard’s Rosencrantz and Guildenstern Are Dead, as well as the English scripts and Ukrainian subtitles of the film versions of both plays: Hamlet (1990, directed by Franco Zeffirelli) and Rosencrantz and Guildenstern Are Dead (1990, directed by Tom Stoppard). The purpose was to study collocations of the proper names in question and trace the changes in their semantics as they move chronologically from Shakespeare’s play to Stoppard’s, from plays to movies, and from English to Ukrainian.
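The collocation step can be sketched with a toy window-based extractor; the corpus lines below are short fragments adapted from Hamlet for illustration, and the window size is an arbitrary choice.

```python
from collections import Counter

# Toy "corpus": lowercased lines adapted from Hamlet, for illustration.
lines = [
    "good hamlet cast thy nighted colour off",
    "hamlet thou hast thy father much offended",
    "poor hamlet speaks again",
]

def collocates(lines, target, window=2):
    """Count words occurring within +/-window positions of the target."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for i, w in enumerate(words):
            if w == target:
                lo = max(0, i - window)
                hi = min(len(words), i + window + 1)
                counts.update(words[lo:i] + words[i + 1:hi])
    return counts

coll = collocates(lines, "hamlet")
print(coll["good"], coll["thy"], sum(coll.values()))  # 1 1 8
```

Running the same extraction over the source texts and their Ukrainian counterparts would yield the aligned collocate lists whose differences signal the semantic shifts discussed below.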

Analysis of this corpus demonstrates changes in the collocations of the proper names, resulting in shifts in their semantics, with the images of the characters acquiring new hues each time the referent, setting or audio-visual effects change in the context of films. A similar pattern is observed in the English scripts and their translations.



  • Marie-France Rooney (University of Ottawa, Canada)

Before the publication of La vie est d’hommage, the importance of French in Jack Kerouac’s creative process was largely a matter of myth. Little did we know that Jack Kerouac first wrote in French, only to self-translate for publication purposes, since Kerouac is among the “writers belonging to traditional linguistic minorities because of the multilingual make-up of the State of which they are citizens” (Grutman 2013: 188). It appears that the famous author of On the Road wrote in an oral French, which we might qualify as a “proto-joual”. Unlike Quebec authors (mainly playwrights), who assume that their readers already know the basic rules of the French-Canadian oral language known as joual, Jack Kerouac wrote French as he spoke it, giving his writing a prescriptive dimension.

Did he follow scriptural rules he imposed upon himself? Did he use a vocabulary and idioms according to each theme related to his Franco-American reality (family, childhood and teenage years, religion) or without any thematic distinctions? To what extent does the French language influence the English language?

A terminological analysis of the French-Canadian sections from three chapters of the preliminary version of Maggie Cassidy will help us answer these questions. Using a concordancer, we will draw up a list of all the words, including articles, prepositions and verbs (which might be a tedious step, given their formal variability). Upon completion of this inventory, we will be able to conduct quantitative and qualitative analyses and thus draw insightful conclusions.
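The word-list step can be sketched as a simple frequency count; the sample sentence below is invented, and real verb forms would still need the manual regularization noted above.

```python
import re
from collections import Counter

# Invented sample in an oral French register, for illustration only.
text = "Tu sais pas, tu sais jamais ce que tu sais"

# Tokenize on word characters (this also handles accented letters)
# and count every form, including articles, prepositions and verbs.
tokens = re.findall(r"\w+", text.lower())
freq = Counter(tokens)
print(freq["tu"], freq["sais"], freq["jamais"])  # 3 3 1
```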

By defining the parameters of Jack Kerouac’s French heritage, it will be possible to evaluate the validity of self-translation in the Franco-American author’s creative process as a means to preserve his culture and his language. Any conclusions – even partial – might contribute to a better framing of the analysis of self-translation as a survival tool for languages and cultures in a minority position.


Grutman, R. (2013) ‘Beckett and Beyond: Putting Self-Translation in Perspective’, Orbis Litterarum, 68(3): 177-289.

Kerouac, J. (2016) La vie est d’hommage, Montreal: Boréal.



  • Jitka Zehnalová (Palacký University, Czech Republic)
  • Helena Kubátová (Palacký University, Czech Republic)

The contribution discusses the methodology used for a research project called The Analysis of the Czech Field of Literary Translation and Translation Strategies after 2000, whose aim is to explore mutual interdependencies between the Czech field of literary translation and the translation strategies preferred by literary translators working between English and Czech and between Hebrew and Czech. The methodology has to come to terms with these questions: How can translational and sociological methods be effectively combined? Is it feasible and rewarding to make use of corpus methods? The first question is answered positively by including context in semiotic and in sociological terms, the second by designing a methodology for compiling small-scale, purpose-built corpora of literary texts.

Semiotically speaking, the research investigates context in terms of the ever-continuing process of semiosis: the text (ST and/or TT), conceived of as a sign, enters the context of the author’s/translator’s work and style, the context of a particular literary movement/period/genre, a specific socio-cultural-temporal context, as well as the world system of translation (Heilbron 2010). These textual and contextual aspects are investigated both quantitatively and qualitatively, relying on the concept of the world system of translation and on Toury’s account of translation norms and strategies (Toury 1995). The methodology for compiling small-scale, purpose-built corpora of literary texts includes one part that has already been tested (Zehnalová 2016) and that makes use of the CAT tool MemsourceCloud. A newly added part incorporates the corpus manager Sketch Engine and the multi-level annotation tool GraphAnno.

Sociologically speaking, “a conceptualization of norms, of collective structures by definition, requires a conceptualization of the translator, of the agency ‘behind’ the norms, and indeed of the relationships between the two” (Meylaerts 2008: 92; italics added). The translators, their habituses and the pressures of the field of literary translation are covered by the sociological part of the proposed research, which employs Bourdieusian concepts.

The goal of the contribution is to focus on the semiotic aspects of context and to discuss (1) the suggested methods of quantitative and qualitative textual analysis and (2) the tools to be used for conducting them, i.e. MemsourceCloud, Sketch Engine and GraphAnno, and the way these tools can be made to work jointly in a methodology effective for literary ST/TT analyses.


Heilbron, J. (2010) ‘Structure and Dynamics of the World System of Translation’. Available online at:

Jettmarová, Z. (2016) Mozaiky překladu. Translation Mosaics. K devadesátému výročí narození Jiřího Levého, Praha: Karolinum.

Meylaerts, R. (2008) ‘Translators and their Norms: Towards a sociological construction of the individual’, in A. Pym, M. Shlesinger & D. Simeoni (eds) Beyond Descriptive Translation Studies, Amsterdam/Philadelphia: John Benjamins, 91-102.

Toury, G. (1995) Descriptive Translation Studies and Beyond, Amsterdam /Philadelphia: John Benjamins.

Vorderobermeier, G. M. (ed.) (2014) Remapping habitus in translation studies, Amsterdam/New York: Rodopi/Brill.

Yannakopoulou, V. (2014) ‘The Influence of the Habitus on Translatorial Style: Some Methodological Considerations Based on the Case of Yorgos Himonas’ Rendering of Hamlet into Greek’, in G. M. Vorderobermeier (ed.) Remapping habitus in translation studies, Amsterdam/New York: Rodopi/Brill, 163-84.

Zehnalová, J. (2015) ‘The Czech Structuralist Tradition and a Model of Translation-Related Semiotic Analysis’, Folia Translatologica, 3: 149-72.

Zehnalová, J. (2016) ‘Literary Style and Its Transfer in Translation: Bohumil Hrabal in English’, in The Art of Translation: Jiří Levý (1926–1967) y la otra historia de la Traductología, Mutatis Mutandis, 9(2): 418-444.
