Call for papers

Building long-diachrony corpora:

Methodologies, goals and linguistic or stylistic research

13-14 October 2022, Grenoble

In recent decades, the digitisation of printed works and progress in NLP have significantly changed the way corpora are created and the way research on them is carried out. It is now possible to obtain vast amounts of quantitative data which allow for fine-grained analysis and identification of linguistic or stylistic phenomena in written corpora of historical states of language. Digital corpora created over the last quarter century make it easier to appreciate the dynamics of French over the long term: the Grande Grammaire Historique du Français (Marchello-Nizia et al., 2020), completed after many years of work, is a shining example. We define long-diachronic corpora as periodised corpora containing texts chosen for their representativeness of certain states of language (from Old French to Contemporary French, for example) corresponding to the time periods covered by the corpus.

Since the 1980s, researchers have been able to benefit from Frantext, the first corpus of French-language texts, which allowed for research on literary texts over a very long timespan. The pioneering work of the Base de Français Médiéval (1989) led to the creation of a corpus of literary and non-literary texts, albeit limited, as its name suggests, to Old and Middle French. Numerous further corpora restricted to specific genres have since been created (for example, the Condé project’s corpus of Norman coutumiers spanning six centuries, and the Sermo project’s corpus of 16th- to 18th-century Protestant sermons).

As Reppen (2010: 31) and Nelson (2010: 53) underline, the first step in building a corpus is defining the goal the corpus serves. For example, selecting comparable sources to allow for homogeneous quantitative analyses is essential, and the timespan examined depends on the phenomena to be investigated (GGHF 2020: 43). The building of a corpus is thus the product of a set of reasoned decisions seeking to satisfy the representativeness principle, according to which a corpus is "a collection of texts assumed to be representative of a given language put together so that it can be used for linguistic analysis" (Tognini-Bonelli, 2001: 2). This representativeness principle also takes account of the variety of purposes for which corpora can be constructed: representativeness requirements will differ between lexicographers seeking to account for the meaning of lexical units on the one hand, and stylisticians characterising a textual genre on the other. For some, it is essential that analyses should only be made of texts in their entirety (Rastier, 2011: 33), whereas for others, a corpus can only ever be a sample of the phenomena analysed and can thus be built on samples of texts (Renouf, 1987; Biber, 1993). The goal of this conference is to question not only the choice of sources a long-diachronic corpus should include, but also the linguistic, stylistic or literary objectives that determine the contents of the corpus.

The proposed themes of the conference are both retrospective (What have diachronic corpora shown? How can corpora built in recent decades be put to better use?) and forward-looking (What theoretical and methodological challenges await research based on diachronic corpora in the era of digital humanities and corpus-tool platforms?). Contributions can be based on corpora of French or of other languages.

Theme 1: Building a corpus

Creating corpora capable of providing data on long timescales poses new questions of homogeneity of tools and formats at all stages of the preparation of the corpus, from the selection of texts to the precise manner in which they are processed. For example, in the presentation of the criteria used to construct the corpus for the GGHF, Prévost (2020: 42-43) distinguishes between texts selected according to paratextual criteria "which have more to do with the modern speaker’s point of view on the texts" and which involve the choice of reference texts such as la Chanson de Roland or the Queste del Saint Graal, and texts selected based on descriptive criteria which have more to do with the period specific to each linguistic phenomenon. In particular, contributions may focus on:

  • the diversity or homogeneity of texts, at different hierarchical levels (domains, discourses, genres; on these categories, see inter alios, Malrieu & Rastier, 2001; Marchello-Nizia et al., 2020) or different types of variation (diatopic, diastratic)
  • the origin of texts to be included in the corpus, depending on whether the corpus itself is based on secondary sources (previously published texts) or primary sources (yet-unpublished texts). When corpora are based on previously published texts, what can be done to compensate for the inevitable array of different editorial decisions? In the case of primary sources, what changes should be made at the written level, given the evolution of philological editorial practices over the centuries with regard to matters such as the segmentation of words, spelling, accents, punctuation and capitalisation?
  • types of coding and annotation to be added to the chosen texts (what types of additional information have been and should be preferred when enriching texts? How many layers of annotation were added?)

Theme 2: Doing research with corpora after building them

As the goal of a corpus influences its composition and construction, questions should be raised about the data that is to be extracted from the corpus.

  • What type of research do long-diachronic corpora allow, be it on the linguistic level (lexicon, syntax, morphology, orthography, pragmatics…), the stylistic level (identifying changes in stylistic features and phraseological units…) or the literary level (identification of narrative topics or motifs)?
  • Which ways of consulting the corpus were chosen from among the multiple possibilities offered by the chosen tools?
  • What methods and specific tools have been developed to facilitate the analysis of long-diachronic corpora? Proposals can address techniques automating the division of corpora into stages (Gries & Hilpert, 2008), trend detection and measurement (Herman & Kovář, 2013; Hilpert & Gries, 2009: 388-390), specific chronological characteristics (Salem, 2021; Lebart et al., 1998: 155-161; Diwersy et al., 2021), as well as new textometric methods for the study of diachronic corpora. Proposals may additionally analyse novel corpus exploration and visualisation tools.
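As an illustration of what automating the division of a corpus into stages can look like, here is a minimal sketch in the spirit of variability-based neighbour clustering (Gries & Hilpert, 2008). It is not the authors' implementation: the function name vnc_stages, the sample frequencies and the choice of distance (absolute difference between the mean frequencies of two neighbouring clusters) are assumptions made for this example. The only constraint taken from the technique itself is that merging is restricted to temporally adjacent periods.

```python
from statistics import mean


def vnc_stages(freqs, n_stages):
    """Group chronologically ordered frequencies into n_stages contiguous stages.

    freqs: one (hypothetical) relative frequency per period, in time order.
    Returns a list of stages, each a list of period indices.
    """
    if not 1 <= n_stages <= len(freqs):
        raise ValueError("n_stages must be between 1 and the number of periods")

    # Start with one cluster per period; each cluster keeps (index, value) pairs.
    clusters = [[(i, f)] for i, f in enumerate(freqs)]

    while len(clusters) > n_stages:
        # Assumed distance: absolute difference between the means of two
        # neighbouring clusters. Only adjacent clusters are ever compared,
        # so the chronological order of the periods is preserved.
        gaps = [
            abs(mean(v for _, v in clusters[i]) - mean(v for _, v in clusters[i + 1]))
            for i in range(len(clusters) - 1)
        ]
        closest = gaps.index(min(gaps))
        # Merge the two most similar neighbours into a single cluster.
        clusters[closest:closest + 2] = [clusters[closest] + clusters[closest + 1]]

    return [[idx for idx, _ in cluster] for cluster in clusters]


# Six consecutive periods with invented frequencies, grouped into three stages.
stages = vnc_stages([1.0, 1.2, 1.1, 5.0, 5.2, 9.0], 3)
print(stages)
```

Other distance measures (for instance one based on the standard deviation within the merged cluster) could be substituted for the difference of means without changing the overall structure of the procedure.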

 

Paper submission

Conference languages: French and English

Length of papers: 30 minutes, followed by 10 minutes of discussion.

Format: hybrid.

Two versions of each abstract (300-500 words, not including bibliographic references) must be submitted in the language of the proposed paper: one anonymous version and one version indicating the name and affiliation of the authors/corresponding author. All submissions must be made on the conference’s website: https://concordial2022.sciencesconf.org

 

 

References

 

Biber D. (1993). Representativeness in Corpus Design. Literary and Linguistic Computing, 8(4): 243-257.

Diwersy S., Jackiewicz A., Luxardo G. & Steuckardt A. (2021). Les sens de « numérique » : émergence d’emplois et dynamique du changement sémantique. Linx, 82. https://doi.org/10.4000/linx.8153

Galleron I., Fatiha I., Lavrentiev A., Demonet M.-L. & Réach-Ngô A. (2021). Décrire les textes dans le cadre d’une édition numérique : Le thésaurus “Typologie textuelle” du Consortium CAHIER.

Glikman J. & Verjans T. (eds.) (2021). Regards linguistiques sur les éditions de textes médiévaux. Diachroniques, 8: 7-16.

Gries S. Th. & Hilpert M. (2008). The identification of stages in diachronic data: variability-based neighbour clustering. Corpora, 3: 59–81.

Herman O. & Kovář V. (2013). Methods for Detection of Word Usage over Time. In Seventh Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2013: 79–85.

Hilpert M. & Gries S. Th. (2009). Assessing frequency changes in multistage diachronic corpora: Applications for historical corpus linguistics and the study of language acquisition. Literary and Linguistic Computing, 24(4): 385–401.

Lavrentiev A., Guillot-Barbance C. & Heiden S. (2021). Enjeux philologiques, linguistiques et informatiques de la philologie numérique : l’exemple de la segmentation des mots. Diachroniques, 8: 76-102.

Lebart L., Salem A. & Berry L. (1998). Exploring Textual Data. Kluwer Academic Publishers.

Malrieu D. & Rastier F. (2001). Genres et variations morphosyntaxiques. Traitement automatique des langues, 42(2): 547-577.

Marchello-Nizia C., Combettes B., Scheer T. & Prévost S. (2020). Grande Grammaire Historique du Français (GGHF). De Gruyter.

Martineau F. (2008). Un corpus pour l’analyse de la variation et du changement linguistique. Corpus, 7. https://doi.org/10.4000/corpus.1508

Martineau F. & Séguin M.-C. (2016). Le Corpus FRAN : réseaux et maillages en Amérique française. Corpus, 15. https://doi.org/10.4000/corpus.2925

McEnery T. & Wilson A. (eds.) (2001). Corpus linguistics. Edinburgh University Press.

Nelson M. (2010). Building a written corpus. In A. O’Keeffe & M. McCarthy (eds.), The Routledge Handbook of Corpus Linguistics (pp. 53-65). Routledge.

Prévost S. (2015). Diachronie du français et linguistique de corpus : une approche quantitative renouvelée. Langages, 197: 23-45. https://doi.org/10.3917/lang.197.0023

Rastier F. (2011). La mesure et le grain. Sémantique de corpus. Honoré Champion.

Reppen R. (2010). Building a corpus. What are the key considerations? In A. O’Keeffe & M. McCarthy (eds.), The Routledge Handbook of Corpus Linguistics (pp. 31-37). Routledge.

Salem A. (2021). Le temps lexical. Histoire & Mesure, Vol.XXXVI-2.

Tognini-Bonelli E. (2001). Corpus Linguistics at Work. John Benjamins Publishing Company.

Zufferey S. (2020). Introduction à la linguistique de corpus. ISTE Editions.
