We exploit the information-theoretic measure of surprisal to analyse the formulaicity of lexical sequences. We first show the prevalence of individual lexical bundles, then argue that surprisal, as an information-theoretic measure of lexical bundleness, formulaicity and non-creativity, is an appropriate operationalisation of the idiom principle, as it expresses reader expectations and text entropy. Because strong and gradient formulaic, idiomatic and selectional preferences prevail on all levels, we argue for the abstraction step from individual bundles to measures of bundleness. We use surprisal to analyse differences between genres of native language use and learner language at different levels: (a) spoken and written genres of native language (L1); (b) spoken and written learner language (L2), across selected written genres; (c) learner language (L2) as compared with native language (L1). We thus test Pawley and Syder's (1983) hypothesis that native speakers know best how to play the tug-of-war between formulaicity (Sinclair's idiom principle) and expressiveness (Sinclair's open-choice principle), which can be measured with Levy and Jaeger's (2007) uniform information density (UID), a principle of minimising comprehension difficulty. Our goal of abstracting away from word sequences also leads us to language models as models of processing, first in the form of a part-of-speech tagger, then in the form of a syntactic parser. While our hypotheses are largely confirmed, we also observe that advanced learners bundle most, and that scientific language may show lower surprisal than spoken language.
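As a minimal illustration of the measure this abstract names (not the chapter's actual model, which uses a tagger and parser): surprisal is the negative log probability of a word given its context, here estimated from raw bigram counts over a toy token sequence. All names and data below are illustrative assumptions.

```python
import math
from collections import Counter

def bigram_surprisal(corpus_tokens, w_prev, w):
    # Surprisal of w given the preceding word: -log2 P(w | w_prev),
    # estimated from unsmoothed bigram counts. Highly predictable
    # (formulaic) continuations receive low surprisal values.
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    contexts = Counter(corpus_tokens[:-1])
    p = bigrams[(w_prev, w)] / contexts[w_prev]
    return -math.log2(p)

tokens = "the cat sat on the mat the cat ran".split()
# "the" is followed by "cat" twice and "mat" once, so P(cat|the) = 2/3
print(bigram_surprisal(tokens, "the", "cat"))  # ≈ 0.585 bits
print(bigram_surprisal(tokens, "the", "mat"))  # ≈ 1.585 bits
```

Under UID, the intuition is that fluent speakers distribute such per-word surprisal values evenly across an utterance rather than concentrating information in spikes.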
This chapter has two major aims. First, it attempts to extend earlier research on recurrent phraseologies used in the pharmaceutical field (Grabowski 2015) by exploring the use, distribution and functions of lexical bundles found in English texts describing drug-drug interactions. Conducted from an applied perspective, the study uses 300 text samples extracted from the DrugDDI Corpus, originally collected in the DrugBank database (Segura-Bedmar et al. 2010). Apart from presenting new descriptive data, the second aim of the chapter is to reflect on the ways lexical bundles have typically been explored across different text types and genres. The problems discussed in the chapter concern the methods used to deal with structurally incomplete bundles, filter out overlapping bundles, and select, for the purposes of qualitative analyses, a representative sample of bundles other than the most frequent ones. This chapter is therefore meant to help researchers fine-tune the methodologies used to explore lexical bundles depending on the specificity of the research material, research questions and scope of the analysis.
This paper explores a new methodology for extracting, from large corpora, forms that were once common but are now obsolete. It proceeds from the relatively under-researched problem of lexical mortality, or obsolescence in general, to the formulation of two closely related procedures for querying the n-gram data of the Google Books project in order to identify the best word and lexical expression candidates that may have become lost or obsolete in the course of the last three centuries, from the Late Modern era to Present-day English (1700–2000). After describing the techniques used to process large-scale unigram and trigram data, this chapter offers a selective analysis of the results and proposes ways the methodology may be of help to corpus linguists as well as historical lexicographers.
This paper describes the use of a corpus-driven methodology, the retrieval of part-of-speech-grams (PoS-grams), which is extremely effective for the discovery of phraseologies that might otherwise remain hidden. The PoS-gram is a string of part-of-speech categories (Stubbs 2007: 91), the tokens of which are strings of words that have been annotated with these PoS tags. A list of PoS-grams retrieved from a sample corpus can be compared with that from a reference corpus, and statistically significant items are further analysed to identify recurrent patterns and potential phraseologies. The utility of PoS-grams will be illustrated by way of analysis of a one-million-token corpus composed of texts from ten sections of The Guardian, the Sassari Newspaper Article Corpus (SNAC).
The article investigates the link between lexical and meaning patterns in the specialized discourse of judicial opinions. It presents an analysis of the N that pattern in a corpus of US Supreme Court opinions. The analysis looks at the distribution of a selection of nouns found in the pattern across different discourse functions. It is shown that judicial opinions use a range of status-indicating nouns in the N that pattern to perform five main functions: evaluation, cause, result, confirmation and existence. Yet evaluation plays a central role in judicial writing, and most status-indicating nouns are used to signal sites of contention, i.e. challenged propositions are likely to be labelled as arguments, assumptions, notions or suggestions. By drawing on the concept of semantic sequence (Hunston 2008), the analysis illustrates how corpus-based and corpus-driven approaches can complement one another to build a picture of common epistemological practices in this corpus of legal texts.
This chapter analyses three-word sequences in Early Modern and Present-day English legal writing by defining their grammatical and functional distribution in Acts of Parliament. The method follows a corpus-driven approach: the lexical bundles are retrieved automatically from the corpus using frequency as the criterion. The study indicates that lexical bundles in acts extend to the textual level and reveals consistent word combinations on the level of the lexis. The study illustrates that the acts are established as a genre, and the overall distribution of both grammatical types and functions of bundles is rather similar in all the analysed periods. Nevertheless, textual organisation is more important in contemporary acts, and textual links become more specific, although early modern bundles already show textual patterning. Noun phrases and prepositional phrases also increase in contemporary acts, indicating a change towards nominal writing conventions.
Wikipedia is widely used by academics and students in higher education, but research on the linguistic characteristics of this genre is scarce (Kuteeva 2016). This paper explores the usefulness of lexical bundles as an analytical tool to describe disciplinary variation within Wikipedia articles, and to contrast Wikipedia writing with two neighbouring genres, student essays and research articles. The results indicate that the occurrence of lexical bundles in Wikipedia varies between disciplines, which is in broad agreement with previous studies on other academic genres. The analysis of bundles also suggests that a credible authorial persona is less crucial to Wikipedia articles. Indicative of this is the low frequency of stance and engagement bundles, which are characteristic of professional academic writing (e.g. Hyland 2008a).
This paper examines the lexical bundles of email marketing texts targeted at lawyers, with the goal of investigating the repetitive nature of email marketing. The study uses a corpus of email marketing texts targeted at lawyers, legal case decisions, and blog posts written by and for labor and employment lawyers. The results show that the email marketing texts do not borrow lexical bundles from either of the other text types and that much of their language is predetermined by a template. The paper also presents the advantages of using range rather than frequency to analyze lexical bundles.
Blogs are one of the most prominent genres of Web 2.0; yet research on their linguistic characteristics is limited. This study contributes to addressing this research gap by investigating lexical bundles in American blogs. Lexical bundles are units of discourse structure which can reveal a great deal about the unique linguistic characteristics and communicative functions shaping registers. Extraction of four-word bundles in a corpus of American blogs reveals, firstly, that lexical bundles are relatively uncommon in blog writing. Analyses of discourse function and grammatical patterns show that blogs rely mainly on stance expressions, which often encapsulate first person reference (e.g., I don’t want to), thus reflecting the focus on self-expression and subjectivity which characterizes this register. As in conversation, bundles in blogs tend to be verb-phrase based. But blogs also rely substantially on referential (e.g., a lot of people) and narrative expressions (e.g., I got to see), and thus share characteristics of literate registers and fiction writing. In sum, lexical bundles in blog writing are characterized by a unique combination of features which reflect two underlying forces: mode and communicative purpose.
The borderless nature of blogging raises the question of whether the traditional regionally defined varieties of English continue to hold (see Crystal 2011). In order to investigate the extent to which language published online without external intervention is similar around the world, this chapter investigates repetitive patterns, or 3-grams, found in blogs in the 583-million-word GloWbE corpus (Davies 2013). The data show two types of repetitive word sequences: universal, or those that are frequent in all or most of the nineteen geographic locations represented in the corpus, and localised, or those unique to specific regions. We explore multiple ways of approaching the regional distribution of universal and localised 3-grams, such as statistical similarity measures (Jaccard coefficient and hierarchical clustering) and network visualisations. Three correlated research issues are addressed by this study: (1) the ratio of 3-grams in blogs from various World Englishes, which will shed light on the degree of formulaicity in Web Englishes around the world; (2) the overlaps between various locations in terms of preferred sequences, which may point to local or global standardization hubs on the level of sentence and text construction; (3) finally, the status of model-providing varieties for internet communication, especially American English, in view of the most frequent 3-grams from other locations (cf. Mair 2013).
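As a minimal sketch of the similarity measure this abstract names (toy sentences, not GloWbE data; all names and examples below are illustrative assumptions): the Jaccard coefficient compares two regions' 3-gram inventories as the size of their intersection over the size of their union.

```python
def trigrams(tokens):
    # All contiguous 3-grams of a token sequence.
    return list(zip(tokens, tokens[1:], tokens[2:]))

def jaccard(a, b):
    # Jaccard coefficient: |A ∩ B| / |A ∪ B|, ranging from 0 (disjoint
    # 3-gram inventories) to 1 (identical inventories).
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

us = trigrams("i am going to the store".split())
gb = trigrams("i am going to the shops".split())
print(jaccard(us, gb))  # 3 shared 3-grams out of 5 distinct -> 0.6
```

Pairwise coefficients of this kind can then feed hierarchical clustering or network visualisations of the regional varieties.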