Using Corpora to Trace the Cross-Cultural Mediation of Concepts through Time: An interview with the coordinators of the Genealogies of Knowledge Research Network

Mona Baker, Shanghai International Studies University, China & University of Oslo, Norway
Jan Buts, Trinity College Dublin, Ireland
Henry Jones, Aston University, UK

Interview and translation by Zhao Wenjing, Zhengzhou Shengda University and Henan Normal University, and Yang Guosheng, Henan Polytechnic


(1) Can you say something about the remit and scope of the Genealogies of Knowledge project? How does it differ from other projects that involve building corpora in several languages and developing accompanying software?

Many readers of this journal will readily associate corpus-based translation studies (CTS), a very popular strand of research in China, with the monolingual Translational English Corpus project which was launched at the University of Manchester in the late 1990s, and with a series of early publications that set the agenda for corpus-based studies in the field. The characteristic focus on identifying distinctive features of translation in this early work, later extended to identifying features of the style of individual translators, continues to inform research in CTS (Wang and Li 2012, Dirdal 2014, Saldanha 2014, Redelinghuys and Kruger 2015, Kruger and De Sutter 2018). Some studies have also drawn on bilingual or multilingual corpora either to pursue a similar agenda (Calzada Pérez 2017, De Sutter and Lefer 2020) or to identify strategies of and shifts in translation (Moropa 2011, Doms 2015) or interpreting (Hu and Tao 2013, Bendazzoli et al. 2018). More recently, some scholars in China have begun to develop new approaches, partly inspired by the Genealogies of Knowledge project and in tune with its remit. In particular, Zhu and Kim (2019) have examined the historical evolution of the concept of individualism since its introduction to Chinese society in the early 1900s until today, and the role that translation played in redefining and adapting it to the new environment of reception.

The Genealogies of Knowledge project, funded by the Arts and Humanities Research Council in the UK (2016-2020), is interested not in linguistic features of translation per se, nor in strategies of translation as such, but in the broader and more complex role that translation and other forms of mediation play in guiding our understanding of key aspects of social and political life. This includes but is not restricted to examining the active intervention of individual translators, with their own world views, and how they might influence the course through which concepts such as civil society or fact evolve over long stretches of time (Jones 2020b). Rather than focusing on a series of unconnected individual concepts, the project examines two constellations of concepts that have been central to Anglophone and European societies since the medieval period and are usually traced back to the ancient Greeks. One constellation relates to the body politic, understood as a group of people sharing a system of governance – you can think, for instance, of what we call a nation today. This constellation might include not only individuals or groups within the body politic (statesman, citizen, commoner, taxpayer) but also those specifically excluded from the polity (migrant, foreigner, non-citizen, slave, asylum seeker). It also includes concepts that are central to the organization of political life at different points in time and in different societies (meritocracy, democracy, law, reform, socialism). The second constellation relates to scientific discourse and includes concepts such as expertise, evidence, proof, causality and validity. These are terms that underpin epistemological assumptions throughout society, but they may mean different things to different people, depending on factors such as one’s profession or political affiliation. The two constellations are therefore set to overlap, as they both represent negotiations of truth within a broad discursive environment.

Concepts are abstractions, expressed by different lexical items in different languages, and although correspondences can be established in translation, items presented as equivalent will each have their own history of use and interpretation within complex cultural spaces. At heart, it could be argued, the essence of human experience is shared across languages and cultures, but even though the basic divisions and categories through which we interpret the world emerge from a common core, different environments require the elaboration of different concepts. The ancient Greeks, for instance, had a category of people called metoikoi or metics, which is peculiar to that society at that point in time and has no equivalent in English or other languages. Metics were a special category of migrants in today’s terms: they were foreigners who lived in Athens and enjoyed certain privileges of citizenship. The lexical item metoikoi is therefore part of the constellation relating to the body politic in ancient Greek, but not necessarily anywhere else. This example brings to the foreground the diachronic dimension of the project. Over time, vocabularies change, conceptual clusters evolve in different directions, and languages come to reflect the unique circumstances of their users. Part of the Genealogies research agenda involves capturing some of these processes as they take shape in different temporal and cultural moments.

The project has two main strands, one focusing on evolution and the other on contestation. The first involves tracing shifts in the meaning and use of a given lexical item like democracy or nation in English – corresponding roughly to dēmokratía or pólis in Greek, or dīmuqrātiyya or umma in Arabic, for instance – over time and across different geographical spaces. The second involves examining how these concepts or specific interpretations of them are contested by various individuals or groups, especially in digital space. These two strands are not completely separate in practice, because we do not understand evolution as unfolding in a linear fashion but rather dynamically, with different and conflicting interpretations co-existing and competing for acceptance at any point in time. In this respect we are partly inspired by Foucault’s genealogical practice, which aims to identify the processes through which a certain behaviour or set of values comes to be presented as self-evident. Of course, scholars such as Foucault are now themselves part of history and can be subjected to the same critical treatment they adopted in their own work, this time with the help of large electronic databases that reveal discursive patterns within and across authorial and translational output. The Genealogies corpus of Modern English features several translations of important works by Foucault: Madness and Civilization, The Birth of the Clinic and The Order of Things.

Unlike the earlier agenda of CTS, which is associated with the Translational English Corpus, the corpora built by the Genealogies of Knowledge team and the methodology elaborated in the various studies undertaken so far are intended to engage scholars in the humanities at large rather than translation studies scholars alone. This reflects a broader commitment on our part to work across disciplines and to raise awareness of the importance of translation and other forms of mediation in intellectual and social life. The Genealogies of Knowledge project is also far more ambitious and more complex in terms of the resources it has created and the methodology it aims to elaborate. Specifically, it is not restricted to a single language like English, or two languages as in many corpus-based studies, nor to a limited time span such as the 19th or 20th century. Instead, the project attempts to capture important aspects of the interaction among a set of languages with a shared history, and across several centuries.

(2) How many corpora is the Genealogies of Knowledge team building to pursue this agenda, and what do they consist of?

As we have already indicated, one of the key aims of the Genealogies of Knowledge project is to trace the cross-cultural path of mediation that two clusters of related concepts have followed throughout the ages, as they are passed on from language to language, each time undergoing particular transformations according to the situated needs and aspirations of different social, cultural and political communities. To this end, we have built a series of non-parallel but carefully interlinked corpora as part of an attempt to capture the successive stages through which the constellations of concepts we are interested in developed in ancient Greek, medieval Arabic, Latin, and modern English:

  • The Ancient Greek corpus (currently totalling 3.3 million tokens) features a selection of scientific, philosophical and political treatises and commentaries, written between the 5th century BC and the 2nd/3rd century AD by key authors in the history of ideas such as Plato, Aristotle, Galen, Hippocrates and Isocrates;
  • The Medieval Arabic corpus (3.3 million tokens) consists of translations of and commentaries on ancient Greek texts produced between the 8th and 13th centuries AD by prolific translators such as Hunayn Ibn Ishaq, in addition to a number of original texts by important authors such as Al-Farabi, Averroes and Avicenna;
  • The Latin corpus (1.5 million tokens) includes the works of classical Roman thinkers such as Cicero (106-43 BC), alongside a number of Latin (re)translations and commentaries produced by figures such as Thomas Aquinas and Robert Grosseteste, who played a central role in the development and dissemination of Greek philosophy in Europe from the 13th century onwards;
  • Totalling 21 million tokens at the time of writing, the Modern English corpus is currently the largest of the Genealogies of Knowledge corpora and is mainly made up of translations and retranslations of political and scientific texts published in the UK and US during the 19th, 20th and 21st centuries. This corpus not only features multiple retranslations of works by classical Greek and Roman authors such as Plato, Thucydides and Plutarch, but also translations of works by more modern writers such as Ludwig Wittgenstein, René Descartes, Karl Popper, Karl Marx, Michel Foucault, Étienne Balibar and G.W.F. Hegel.
  • The Internet English corpus (4.2 million tokens at the time of writing) includes a diversity of blog posts, articles and opinion pieces published online by alternative media and news outlets across the political spectrum, from Indymedia and Discover Society on the political left, to Newsmax andorg on the right.

The decision to begin by building a corpus of ancient Greek texts and thus to trace the concepts we study back to the works of Aristotle, Plato, Hippocrates and their contemporaries is an orthodox one. Idealised visions of classical Athens have occupied a central position in what is appealed to as Western identity since at least the Renaissance. Modern politicians and social commentators make frequent reference to ancient Greek authors such as Aristotle, Plato and Thucydides as continuing sources of cultural authority, while to this day many medical doctors around the world swear an adapted version of Hippocrates’ Oath on graduating from university (Hardwick 2003, King 2019). Similarly, the decision to construct corpora of texts written in medieval Arabic and Latin reflects the influential role these languages have come to serve as lingua francas for the production and circulation of knowledge at various times over the past two millennia. Previous research has amply highlighted the powerful contribution of Abbasid-era translators such as Hunayn ibn Ishaq to the history of science and to the refinement and development through translation and commentary of ideas first conceived in the ancient Greek world (Saliba 2007); later, in medieval Europe, it was largely through Latin translations of these Arabic translations and commentaries that scholars came to take renewed interest in key concepts drawn from classical philosophy (Burnett 2001). Finally, the construction of our Modern English corpus acknowledges the current status of English as the dominant language for international politics and science, and thus the influence this language now exercises as the temporary guardian of the scientific and political categories that inform the worldviews of many hundreds of millions of people around the globe.

Despite our selective focus on these four historical lingua francas and the diachronic design of the corpora, we are very much aware that these languages have not been the only currencies used in the marketplace of ideas, nor is global English the only true heir of our cultural inheritance from past civilisations. Modern Mandarin Chinese, 18th-century French and today’s global Spanish constitute equally interesting sites of exchange and transformation. Indeed, studying patterns of interaction among these languages would likely suggest quite different histories for the concepts we aim to study. We are keen to encourage other researchers to explore the evolution of these or alternative sets of concepts through other languages and cultural traditions at different moments in time. To this end, we have made a point of documenting the diverse methodologies we apply to investigating the Genealogies corpora, and are disseminating them internationally by means of workshops, and through publications such as this one. Research transparency ensures that those working in CTS and interested in conceptual studies do not have to reinvent the wheel but can emulate, adapt and refine the approach we seek to illustrate.

The reader might have wondered by now how corpora consisting of several different languages, and more importantly, several different scripts, can be accessed. Although many of the texts held in the corpora are translations of each other, the different languages cannot be queried simultaneously, nor can source and target texts be consulted side by side in an alignment interface. The corpora are accessed separately. While the lack of alignment precludes certain kinds of analysis, the benefits of not building parallel aligned corpora are numerous. As demonstrated in the various studies based on the Translational English Corpus or a similar resource, this methodology allows the researcher to confront translated text on its own terms, rather than being tempted to constantly compare it with a source in search of supposed inaccuracies. Furthermore, it facilitates placing translation within a richer historical context, among other works of transmission and response such as commentaries and critical editions. This explains why we set out to build separate corpora that illustrate how translational discourse governs the reception of inherited conceptual regimes. The Genealogies of Knowledge project is heavily informed by the now well-established belief that translation is not just a derivative procedure, but a creative process in its own right.

The final important feature of the corpora to highlight at this stage is the fact that the English corpus is divided into a ‘Modern’ and an ‘Internet’ component. These can be queried together but consist of very different material. The Modern English corpus is strongly connected to the Greek, Arabic, and Latin corpora, as it incorporates numerous retranslations from ancient works, alongside more recent works in the philosophy of science and politics. All these corpora may be considered canonical, as they contain material that has been passed on with care, from authors whose names carry weight: Galen, Averroes, Al-Farabi, Cicero, Aquinas, Marx, and Wittgenstein, for instance, are familiar across many disciplines and cultures. The Internet English corpus is quite different in make-up, as it covers primarily digital-born, alternative media and blog posts, written by little known and sometimes even anonymous authors. These texts are usually shorter, more ephemeral, and more topical than the printed works included in the other corpora. Importantly, they also tend to be much more contentious at this point in history. This is partly because the Internet English corpus consists of very recent texts, mostly from 2010 onwards. Arguably, what has reached us from earlier civilisations may have been equally controversial at its time, though it may not seem so to us now.

Nevertheless, the internet as a medium has profoundly impacted the shape of our knowledge society. Throughout human history, information has been relatively scarce. Today, there is too much of it. People must constantly negotiate not only their beliefs, but also the terms they use to express them, as they find themselves confronted with an overload of often conflicting information. In that sense, there is a case to be made for regarding today’s internet discourse as unprecedented in terms of the degree of conceptual contestation that characterises it. We have, moreover, sought to mainly cover material in this corpus that does not fit the label of mainstream media, and to investigate voices that operate on the fringes of debates about the nature and role of political and scientific knowledge in today’s society. Finally, it should be noted that the Internet corpus, unlike its Modern counterpart, mainly consists of works originally written in English rather than translations.

(3) Could you elaborate some more on that? How exactly does translation feature in the project?

Since its inception, the Genealogies of Knowledge project has sought to promote a broad conception of translation which can account for a wide spectrum of mediation practices operating across languages, cultures and historical periods. In practice, this means that the corpora built by the project contain not only texts which have been overtly labelled as translations but also numerous commentaries, critical editions and original writings whose production is here understood to have involved a comparable process of interpreting, reformulating and adapting other texts, concepts or systems of thought for new audiences, often – but not exclusively – across some form of linguistic or cultural barrier. Attempts to impose and maintain strict distinctions between translated and non-translated texts when building the corpora would have made little sense given our interest in the evolution and contestation of key cultural concepts over time and space. Our primary interest is not in the features of translated text as opposed to non-translated text, but in the variety of processes through which political and scientific ideas are transformed as they are imported into new cultural contexts. Thus, the design of these corpora has been based on the conviction that the still predominant focus in translation studies on the integral transmission of texts tends to obscure from view countless translational practices involved in the circulation and communication of knowledge within and between different communities – practices which are not necessarily based on clear correspondences between specific texts, but on more complex, dispersed encounters between languages, cultures and epistemologies.

An example may help illuminate the value and implications of this perspective. One of the proper nouns that occur most frequently throughout the Internet English corpus is the name (Karl) Marx. A keyword search for marx currently retrieves 466 concordance lines, 306 of which derive from texts originally published in Viewpoint Magazine, a finding which is hardly surprising given this outlet’s stated aim “to understand the struggles that define our conjuncture, critically reconstruct radical history, and reinvent Marxism for our time” (Viewpoint Magazine n.d.; emphasis added). Scanning through the concordance quickly reveals another intriguing pattern: the item is repeatedly embedded within larger structures such as “according to Marx…” (5 lines), “as Marx wrote/said/pointed out/stressed…” (15 lines), “for Marx…” (7 lines) and “…what Marx called/meant by/described as…” (8 lines). Such clauses are in many cases followed by a term, phrase or occasionally a full sentence enclosed within quotation marks; indeed, the concordance is littered with inverted commas. By citing Marx verbatim, the authors of the texts reveal that they attach particular significance and authority to the choice of words and phrasing produced by this 19th-century philosopher-historian; it is not merely Marx’s ideas that they wish to communicate but also the precise terminology and forms of expression he deployed. Yet, in not one of these lines is Marx cited in his original German. Rather, the authors either cite the published English translations of canonical texts such as Das Kapital, Manifest der Kommunistischen Partei and Der 18te Brumaire des Louis Napoleon in order to scrutinise the translators’ linguistic choices, or provide their own ad hoc English retranslations.
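The kind of concordance-based pattern counting described above can be illustrated with a minimal Python sketch. It is not the Genealogies of Knowledge corpus browser, nor does it reproduce its query syntax; the function names, the framing patterns and the toy sample text are all assumptions introduced here purely for illustration.

```python
import re
from collections import Counter

def concordance(text, keyword, width=40):
    """Return keyword-in-context lines as (left, match, right) tuples.

    Matching is case-insensitive and restricted to whole words, so a
    search for 'marx' does not retrieve 'Marxism' (an assumption of this
    sketch, not a claim about any particular corpus tool).
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines

# Hypothetical framing structures, loosely modelled on those listed above.
# Each entry is (pattern at the end of the left context,
#                pattern at the start of the right context).
FRAMES = {
    "according to X": (r"according to\s+$", r""),
    "as X wrote/said/...": (r"\bas\s+$", r"^\s+(wrote|said|pointed out|stressed)\b"),
    "for X": (r"\bfor\s+$", r""),
    "what X called/...": (r"\bwhat\s+$", r"^\s+(called|meant by|described as)\b"),
}

def count_frames(lines):
    """Count how many concordance lines embed the keyword in each frame."""
    counts = Counter()
    for left, _, right in lines:
        for label, (left_pat, right_pat) in FRAMES.items():
            if re.search(left_pat, left, re.IGNORECASE) and \
               re.search(right_pat, right, re.IGNORECASE):
                counts[label] += 1
    return counts

# Invented toy text; it is not drawn from the Internet English corpus.
SAMPLE = (
    "According to Marx, value is social. As Marx wrote, history matters. "
    "For Marx, labour is central. This is what Marx called alienation. "
    "Later readers of Marx disagreed."
)

if __name__ == "__main__":
    hits = concordance(SAMPLE, "marx")
    for left, kw, right in hits:
        print(f"{left:>40} [{kw}] {right}")
    print(count_frames(hits))
```

On a real corpus, the counts produced this way would only be a first pass: as in the analysis above, the retrieved lines still need to be read in context to establish whether a frame actually introduces a quotation and whose translation is being cited.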

These lines thus provide clear evidence of two important patterns that have not received sufficient attention in the literature. The first concerns the extent to which texts which cannot easily be categorised as translations may still rely extensively on the mediation of translators in order to construct and deliver their argument, even if the fact of translation is only rarely explicitly acknowledged. The second concerns the paradoxical power of translation to shape the reception of key political concepts in the receiving culture and to influence political debate (perhaps even “reinvent Marxism for our time”), not primarily through the transmission of whole texts but instead through their subsequent re-dissemination in fragmentary form among new audiences, most of whom will never read Das Kapital in full. Viewed from this perspective, it also becomes clear that the authors of online magazine articles and other secondary texts such as these play a similarly decisive mediating role in this process of translation: the Marxist terminology they seek to highlight, the passages of the Communist Manifesto they choose to reproduce, and the ways in which they frame and adapt Marx’s arguments in order to make them relevant for modern readers constitute a crucial series of decisions that influence readers’ understandings of Marxist philosophy. These kinds of interventions are all too often overlooked in translation studies; corpus-based approaches of the type being developed by the Genealogies of Knowledge team aim to foreground them and provide a methodology for their analysis (Buts 2020).

A similarly inclusive understanding of translation has informed our decision to consider texts such as modern critical editions of medieval manuscripts for inclusion in the Greek, Latin and Arabic corpora. While such printed editions ostensibly present themselves as objective reconstructions of historically significant works, physical copies of which may only otherwise exist as fragile and often incomplete manuscripts, closer investigation reveals their hybridity as texts shaped just as much by the ideals and ideologies of their modern editors as by the material evidence of the manuscripts on which they are based. Our colleague and Genealogies of Knowledge team member Kamran Karimullah has effectively demonstrated the importance of the role played by editors in the language-mediated evolution of political concepts in a recently published case study (Karimullah 2020a). This compares two separate critical editions – both of which are included in the Genealogies of Knowledge corpus – of a medieval Arabic translation of Aristotle’s Nicomachean Ethics, first created in Baghdad in the 9th century. The first modern critical edition of this text was produced in 1979 by the famous Egyptian philosopher Abd al-Rahman Badawi (1917–2002); later, an alternative edition was published in 2005 by two German classical scholars, Anna Akasoy and Alexander Fidora. Through both quantitative and qualitative analyses contrasting these two versions, Karimullah highlights a diverse range of paratextual and textual interventions made by Badawi which can clearly be seen to derive from his own philosophical programme for further developing Arab existentialism in the post-war Middle East.
Unlike Akasoy and Fidora, Badawi did not see the Arabic Nicomachean Ethics as a textual specimen of mainly philological interest, providing evidence of the historical evolution of classical Arabic in the 9th century; rather, through a series of glosses, revisions and omissions, he sought to ensure that his edition faithfully communicated what he interpreted as Aristotle’s philosophical insights and that his text could serve as a vehicle for educating modern Arabic-speaking audiences about Aristotle’s philosophy. Thus, much as translation scholars have repeatedly argued with regard to interlingual forms of translation, the process of creating a critical edition is shown to be fundamentally shaped by the needs and interests of the receiving context.

(4) To what extent can the Genealogies of Knowledge resources be used to pursue the type of research supported by the Translational English Corpus and similar corpora?

As mentioned at the beginning of this interview, the Genealogies of Knowledge corpora have not been developed to pursue the same research for which the Translational English Corpus and other similar corpora were built. The corpora have not, for example, been designed to investigate whether linguistic features such as explicitation are more or less common in translated language as opposed to non-translated language, nor have they been created primarily to investigate phenomena such as translator style. As a result, they are not likely to meet the criteria of size, balance or representativeness required to serve as suitable datasets for pursuing such research agendas. That is not to say, however, that these resources may not be used to address key questions central to the interests of translation scholars today. Most notably, we would suggest that the Modern English corpus constitutes a substantial new resource for the study of retranslation, a phenomenon which has proved a recurring focus of attention among translation scholars for over three decades (Tahir Gürçağlar 2020). Retranslations of texts originally produced in Greek and Latin account for just over half of the 350 texts currently included in this corpus, and two thirds of the total word-count at the time of writing. In some cases, such as Thucydides’ History of the Peloponnesian War, ten different English versions of a single source text are available for analysis via the Genealogies of Knowledge corpus browser interface, each produced at different points in time over the past 400 years.

Henry Jones’ (2019a, 2019b, 2020a, 2020b) research has made particular use of this feature of the Modern English corpus. Building on the work of Susam-Sarajeva (2003), Brownlie (2006) and Deane-Cox (2014), Jones has developed a corpus-based methodology with which to further emphasise the complexity of retranslation as a cultural phenomenon and the powerful insights into the agency of the translator that can be derived from examining multiple (re)interpretations of one source text. By comparing three near-contemporary English retranslations of Thucydides’ History, for example, his analysis demonstrates the importance of recognizing the ideological heterogeneity of the societies in which translations are produced (2020a:6). Despite the fact that these three English versions were all produced in the same city (Oxford, UK) in the same decade (the 1870s), Benjamin Jowett’s, Henry Musgrave Wilkins’ and Richard Crawley’s interpretations of their ancient Greek source differ quite substantially, especially in terms of their presentation of the role of leaders within classical Athenian democracy. While Jowett’s retranslation seeks to draw the readers’ attention to the importance of effective leaders within democratic structures of governance, Crawley’s and Wilkins’ versions appear much more ambiguous in their renderings of key passages describing the political decision-making processes at the heart of the Athenian state. Thus, Jones’ investigations make clear the need to consider Jowett’s renarration of Thucydides’ History not simply as a product of the general ideological climate of late Victorian Britain as a whole, but also as part of a series of interventions by a highly politically engaged individual with a specific set of political objectives in mind. 
Other case studies by Jones explore the recharacterisation of ‘the common people’ in two further English retranslations of Thucydides (2019b), the role of Thucydides’ retranslators in constructing him as the founder of scientific historiography in the 19th century (2020b), and the rise and fall of the discourse of statesmanship across a large collection of English retranslations of classical Greek authors such as Aristotle, Plato and Xenophon (2019a).

Apart from the study of retranslation, it is also possible to use the Modern English and Medieval Arabic corpora to examine the output of an individual translator, whether in terms of their distinctive style or their ideological preferences. These two corpora feature many works by the same translator. The Arabic corpus, for instance, includes numerous translations by Hunayn Ibn Ishaq of authors such as Hippocrates, Galen and Aristotle. However, many of the translators featured in the Modern English corpus tend to specialise in a single author. Thus, for instance, all seven translations by Harold Fowler in the corpus are of works by Plato, and all eleven translations by Bernadotte Perrin are of works by Plutarch. This may make it difficult to study aspects of these translators’ style, but it supports the much richer agenda of examining the influence exercised by an individual translator on shaping the history of reception of key authors.

(5) Can you give us examples of some research that can be conducted across two or more of these corpora, other than research on retranslations into modern English?

An example that immediately comes to mind is the study of paratext, which constitutes an expanding field in translation studies today (Batchelor 2018). Paratext consists of explicit signs of mediation, such as footnotes and introductions, which may or may not have been produced by the author of a text. Translating often involves a paratextual intervention, for instance the insertion of a translator’s foreword. Painstaking forms of mediation bring paratextual features to the foreground, as in the case of critical editions. A common mode of analysis in this field is to focus on a single or small number of texts, sketch the source and target contexts, and demonstrate how features ranging from book covers to acknowledgments aim to make a text fit for reception in its new intended environment. The Genealogies of Knowledge corpora make it possible to broaden the scope of this kind of research by addressing additional questions. For instance, what concepts and textual patterns recur in paratextual material produced with reference to vastly different circumstances, and subject to different regimes of writing?

In Baker’s research on the concept of evidence, conducted for a presentation at SISU as part of the launch in November 2019 of China’s first dedicated corpus research institute, it became apparent that internal evidence and external evidence were particularly strong collocates in the paratexts of the Modern English corpus, with the pattern almost fully restricted to introductions to ancient texts. In the field of textual criticism, the distinction between internal and external evidence refers to indicators, either textual or material, that help to establish which reading of a manuscript is most reliable. For centuries, scholars interested in the textual history of a given document have debated such issues, which formed an intrinsic part of their engagement with the classics. Corpora can reveal what kind of material is most often characterised as evidence in philological criticism, and can also establish when this type of discussion ceased to be relevant due to more stable lines of textual transmission. Researchers who read Greek or Latin can use the corpora to trace the material cited as internal evidence, and to sketch differences and similarities between the treatment of sources in different languages and traditions of mediation. In addition, a corpus is a treasure trove of serendipitous encounters, and given that the Genealogies of Knowledge corpora incorporate many different disciplinary and historical perspectives, it is often surprising to see a shared conceptual sphere emerging across very different environments. The distinction between external and internal evidence, for instance, also functions in the modern scientific paradigm of evidence-based medicine, but the specific meaning attached to each element of the distinction is very different in this sphere.
In evidence-based medicine, internal evidence consists of knowledge that is acquired through education and experience, while external evidence is derived from scientific research, including randomised controlled trials. Despite these differences in conceptualisation, the basic template of enquiry is the same, and we would argue that acknowledging commonalities between distinct discursive spheres can shed light on the process through which knowledge is constructed within and across different areas of social practice. The last few decades have witnessed increased interest in inter-, trans- and cross-disciplinarity in the humanities and beyond. The real challenge, in this regard, is not to temporarily overcome the boundaries that separate, but to show the contingency of the separation itself.

That being said, scholars have specialisms, and most studies based on the Genealogies of Knowledge corpora will be restricted to the use of one or two corpora. The software environment through which such studies are conducted, however, is the same, and they will all be informed by a very similar methodology. The Genealogies team has thus published a series of articles to illustrate this methodology as a special collection in the humanities journal Palgrave Communications (Baker and Jones 2020). Some of the case studies make use of multiple corpora, others focus on a single one. It is also possible to combine the Genealogies of Knowledge corpora with external datasets, in order to obtain a variety of comparative results (Karimullah 2020b). Whether our studies are conducted within or across GoK corpora, we aim to be transparent about the methods used and the results obtained, and to situate our research clearly within the broader field of CTS. This allows for the gradual dissemination of a methodological framework that strengthens the ties between translation studies, conceptual history and discourse analysis, is supported by the carefully designed interaction between the various corpora, and is applicable well beyond the confines of the resources we created.

(6) What would you consider the main legacy of this project?

First, the corpora themselves, as a resource. These are freely accessible to anyone, with or without institutional affiliation, are easy to work with, and contain, as we hope to have illustrated by now, a wide variety of material. General-purpose corpora exist, of course, and may offer larger datasets. Ad-hoc corpora may require less effort to build provided the researcher knows how to crawl or scrape the web. In any case, as mentioned above, access to massive quantities of information is seldom a problem today. The difficulty is how to navigate such a wealth of readily available information, and how to get involved without losing critical distance. The Genealogies of Knowledge corpora can help in this respect. Given their impressive historical scope, they place the issues of the day in perspective and shed new light on persistent political and scientific debates, precisely because they are not archives but carefully designed collections of texts that are focused on the polity and scientific discourse. They are also heavily annotated in terms of metadata. Information about translators, authors, publishers, publication dates and languages is included, as is a short summary of the main content of every text in the corpus. This is no easy task. The Internet English corpus in particular holds thousands of texts, often shorter blogposts or news flashes, and summarising this content is labour-intensive. But providing these summaries means that the corpus adequately contextualises, records and curates items of information that are ephemeral, or at least unstable, in their natural online environment. We do not claim to capture everything in the corpora, but what we capture is made accessible for researchers from a variety of backgrounds, regardless of specific technical or linguistic skills.

At our workshops, people who are inspired by the use of corpora often see an immediate relevance to their own research topic and ask whether it is possible to add their own texts of choice to the Genealogies corpus. It is not. Copyright restrictions and the care with which we ensure consistency among the files included preclude this option. At the same time, while, for ease of comprehension, we sometimes speak about the corpora in monolithic terms, in fact they are highly customisable. The software can be used to query the input of single files, authors, online magazines, translators, time periods, and so on. Every file in the corpus is categorised in multiple ways, and individual researchers can draw on these variables to select a personalised dataset. Full access to the corpus files is restricted, however, meaning users may only view concordances or short extracts of text and download the output of various types of analysis, and the overall corpus contents cannot be altered by users. Neither can the interface, but software developers can consult and repurpose elements from the open source modular architecture behind the Genealogies of Knowledge software, namely Modnlp, a package that was originally developed for the TEC and also supports the European Parliamentary Comparable and Parallel Corpora (ECPC) (Luz 2011).

In addition to the corpora, then, the Genealogies suite of open source software tools is part of the project’s legacy. The foundations for this software were laid two decades ago, and because the KWIC concordance is still the most widely used form of visualisation in CTS research, adaptations to the basic interface have been minor. However, CTS is an evolving discipline, and in the last decade the importance of statistical analysis and alternative visualisation techniques has increased markedly. The project is part of this transition and has heavily invested in experimental visualisation tools to enhance interaction with the corpus. The software made available through the Genealogies website includes a range of plugins at varying stages of development (Sheehan and Luz 2019). These tools are purpose-built for linguistic analysis, and take into account the effectiveness of a range of visual variables for representing quantitative aspects of textual data. Furthermore, they respond to needs expressed by corpus researchers today, as well as principles laid out in general works of corpus didactics (Luz and Sheehan 2020, Sinclair 2003). We consider both the product and process of software development to be a major part of our legacy. Apart from the various visualisation tools that are now available for analysing the corpora, as demonstrated in the studies published in Baker and Jones (2020), working closely with computer scientists has afforded us different perspectives on the nature of language and mediation, and has alerted us to important differences among various strands of humanities research. We believe this experience is well reflected in the research we have published so far.

The Genealogies of Knowledge project has brought together researchers with vastly different backgrounds whose pattern of cooperation had to be accommodated by the design of the corpus and software interface. This convergence of disciplinary interests has been disseminated to the wider research community through the organisation of events. In Manchester, the project’s base of operations, we have organised lectures on particular concepts such as rights and liberty. Larger events have focused on conceptual clusters in combination with selected time periods, as in the case of symposia such as Constructing the Public Intellectual in the Premodern World (September 2019) and Mutations of Citizenship: Activist and translational perspectives on migration and mobility in the age of globalization (March 2018). At the project’s inaugural conference (December 2017), centred on the translation of scientific and political concepts, an international group of scholars exchanged perspectives on matters ranging from quantitative methods for lexical analysis to the principles and paradoxes of cultural history. We have also worked closely with the University of Perugia in Italy on the organisation of a major conference in May 2019 on the interface of politics and translation. Finally, our international workshop series, which has so far delivered hands-on sessions in China, Norway, Qatar and the UK, has aimed to bring the latest developments in CTS in general, and Genealogies of Knowledge in particular, to a broad audience of researchers. Such dissemination activities are meant to consolidate and expand the legacy of the project.

(7) What are your plans for future development and extensions of the project?

The project formally came to an end in the spring of 2020, but through the creation of a dedicated research network, the team continues to extend its activities in this promising new area of enquiry by supporting the development of further corpora, software tools and analytical methodologies relevant to this and related lines of investigation. Coordinated by Mona Baker, Jan Buts and Henry Jones, the Network seeks to connect scholars across the humanities in order to promote greater interdisciplinary collaboration of mutual benefit to fields as diverse as translation studies, classics, cultural studies, linguistics, intellectual history, digital culture and computer science. We have already begun collaborating with colleagues at Hamad Bin Khalifa University in Doha, Qatar, who are working to develop corpora of translations into and out of Modern Standard Arabic in order to examine Arab discourses on women’s position in the polity and the evolving role of women in Arab societies. We are additionally working closely with a group at the Centre for Advanced Studies in Oslo who specialise in the medical humanities. The collaboration here is centred on an investigation of the historical development of the discursive regime of evidence-based medicine, the central paradigm in contemporary health science.

The Genealogies of Knowledge Research Network will be organising a further series of events at different locations to promote this new strand of CTS research among local and international audiences. These will include not only conferences and seminars, but also hands-on workshops intended to introduce postgraduate and early career researchers to a new generation of corpus-based translation studies, to equip them with the knowledge and tools needed to develop their own research projects. All Network activities are publicised through a dedicated website. Fellow researchers at different stages of their career development are also warmly invited to subscribe to the Genealogies mailing list to receive updates on recent publications, upcoming events and other news.


Baker, M. and H. Jones (eds) (2020) ‘Genealogies of Knowledge’, Special Collection for Palgrave Communications. Available at:

Batchelor, K. (2018) Translation and Paratexts, London and New York: Routledge.

Bendazzoli, C., M. Russo and B. Defrancq (2018) ‘Corpus-based Interpreting Studies: A booming research field’, introduction to special issue of inTRAlinea: New Findings in Corpus-based Interpreting Studies. Available at:

Brownlie, S. (2006) ‘Narrative theory and retranslation theory’, Across Languages and Cultures 7(2): 145–170.

Burnett, C. (2001) ‘The Coherence of the Arabic-Latin Translation Program in Toledo in the Twelfth Century’, Science in Context 14(1-2): 249-288.

Buts, J. (2020) ‘Community and Authority in ROAR Magazine’, in M. Baker and H. Jones (eds) ‘Genealogies of Knowledge’, special collection for Palgrave Communications 6(16): 1-12.

Calzada Pérez, M. (2017) ‘Corpus-based Methods for Comparative Translation and Interpreting Studies’, Translation and Interpreting Studies 12(2): 231–252.

Deane-Cox, S. (2014) Retranslation: Translation, literature and reinterpretation, London: Bloomsbury.

De Sutter, G. and M.-A. Lefer (2020) ‘On the Need for a New Research Agenda for Corpus-based Translation Studies: A multi-methodological, multi-factorial and interdisciplinary approach’, Perspectives 28(1): 1-23.

Dirdal, H. (2014) ‘Individual Variation between Translators in the Use of Clause Building and Clause Reduction’, in S. Ebeling Oksefjell, A. Grønn, K.R. Hauge and D. Santos (eds) Corpus-based Studies in Contrastive Linguistics 6(1): 119–142.

Doms, S. (2015) ‘Non-human Agents in Subject Position: Translation from English into Dutch: A corpus-based translation study of “give” and “show”’, in C. Fantinuoli and F. Zanettin (eds) New directions in corpus-based translation studies, Berlin: Language Science Press, 1–11.

Hardwick, L. (2003) Reception Studies: Greece and Rome, Oxford: Oxford University Press.

Hu, K. and Q. Tao (2013) ‘The Chinese-English Conference Interpreting Corpus: Uses and limitations’, Meta 58(3): 626–642.

Jones, H. (2019a) ‘Searching for Statesmanship: A corpus-based analysis of a translated political discourse’, Polis: The Journal for Ancient Greek and Roman Political Thought 36(2): 216-241. DOI:

Jones, H. (2019b) ‘Shifting characterizations of the ‘common people’ in modern English retranslations of Thucydides’ History of the Peloponnesian War’, in M. Baker and H. Jones (eds) ‘Genealogies of Knowledge’, special collection for Palgrave Communications 5(135). DOI:

Jones, H. (2020a) ‘Jowett’s Thucydides: A corpus-based analysis of translation as political intervention’, Translation Studies. Online first. DOI:

Jones, H. (2020b) ‘Retranslating Thucydides as a Scientific Historian: A corpus-based analysis’, Target 32(1): 59-82.

Karimullah, K. (2020a) ‘Editions, translations, transformations: refashioning the Arabic Aristotle in Egypt and metropolitan Europe, 1940–1980’, in M. Baker and H. Jones (eds) ‘Genealogies of Knowledge’, special collection for Palgrave Communications 6(3). DOI:

Karimullah, K. (2020b) ‘Sketching Women: A corpus-based approach to representations of women’s agency in political Internet corpora in Arabic and English’, Corpora 15(1): 21-53.

King, H. (2019) Hippocrates Now: The ‘Father of Medicine’ in the Internet Age, London: Bloomsbury.

Kruger, H. and G. De Sutter (2018) ‘Alternations in Contact and Non-contact Varieties: Reconceptualising that-omission in translated and non-translated English using the MuPDAR approach’, Translation, Cognition and Behavior 1(2): 251–290.

Luz, S. (2011) ‘Web-Based Corpus Software’, in A. Kruger, K. Wallmach and J. Munday (eds) Corpus-Based Translation Studies: Research and Applications, London and New York: Bloomsbury, 124-149.

Luz, S. and S. Sheehan (2020) ‘Methods and visualization tools for the analysis of medical, political and scientific concepts in Genealogies of Knowledge’, in M. Baker and H. Jones (eds) ‘Genealogies of Knowledge’, special collection for Palgrave Communications 6(49): 1-20.

Moropa, K. (2011) ‘A Link Between Simplification and Explicitation in English-Xhosa Parallel Texts: Do the morphological complexities of Xhosa have an influence?’, in A. Kruger, K. Wallmach and J. Munday (eds) Corpus-Based Translation Studies: Research and Applications, London: Continuum, 259–81.

Redelinghuys, K. and H. Kruger (2015) ‘Using the Features of Translated Language to Investigate Translation Expertise. A corpus-based study’, International Journal of Corpus Linguistics 20(3): 293–325.

Saldanha, G. (2014) ‘Style in, and of, Translation’, in S. Bermann and C. Porter (eds) A Companion to Translation Studies, Chichester: Wiley-Blackwell, 95-106.

Saliba, G. (2007) Islamic Science and the Making of the European Renaissance, Cambridge, MA: MIT Press.

Sheehan, S. and S. Luz (2019) ‘Text Visualization for the Support of Lexicography-Based Scholarly Work’, in Proceedings of the eLex 2019 conference on electronic lexicography in the 21st century, Sintra, Portugal, 694-725.

Sinclair, J. (2003) Reading Concordances: An introduction, London: Pearson Longman.

Susam-Sarajeva, Ş. (2003) ‘Multiple Entry Visa to Travelling Theory: Retranslations of literary and cultural theories’, Target 15(1): 1–36.

Tahir Gürçağlar, Ş. (2020) ‘Retranslation’, in M. Baker and G. Saldanha (eds) Routledge Encyclopedia of Translation Studies, London and New York: Routledge, 484-490.

Viewpoint Magazine (n.d.) ‘About’. Available at:

Wang, Q. and D. Li (2012) ‘Looking for Translators’ Fingerprints: A corpus-based study on Chinese translations of Ulysses’, Literary and Linguistic Computing 27(1): 81-93.

Zhu, Y. and K. Kim (2019) ‘The Individual on the Move: Redefining ‘individualism’ in China’, Translation and Interpreting Studies. Online First. DOI: