Code-switching and types of multilingual communities

Developed the principles that a corpus of texts containing code-mixing should have and built a working prototype of Udmurt/Russian Code-Mixing Corpus. Discussed different approaches to studying code-mixing and various classifications of code-mixing.

Category: Programming, computers and cybernetics
Type: diploma thesis
Language: English
Date added: 30.12.2015

Code-switching has recently become a very popular research topic in linguistics. However, because there is no tool for analyzing the phenomenon on large amounts of data, many questions remain unanswered. This work focuses on creating a set of rules for the automatic annotation of texts produced by multilingual speakers, in order to develop a prototype of a corpus that allows more precise and extensive analysis of data containing code-switching. The project consists of reviewing existing papers on code-switching, working out the main features of the annotation, building and annotating a code-switching corpus based on data collected for the corpus of the Udmurt language, and assessing its quality by conducting a series of tests.

1. Introduction

1.1 Background

There are around 6500 languages on the planet. Over half of the Earth's population is at least bilingual; many people are trilingual or multilingual. There are a number of reasons for this. Many are geographical, as people in one settlement learn the languages of the nearest villages in order to communicate; it is also common for parents to speak different languages to their children, many children speak one language at home and another at school, and so on. It is no exaggeration to claim that bilingualism is the norm rather than the exception (Golovko 2001).

Evidently, there are many communities where people speak the same few languages. Language contact in these conditions causes many changes in grammar and vocabulary, but it also leads speakers to mix those languages in their speech, intentionally or unintentionally, for instance by starting a sentence in one language and finishing it in another. Such code-mixing or code-switching is a very common phenomenon that has been widely studied for many different languages. This paper uses the term `code-mixing' rather than `code-switching' for the reasons suggested in (Muysken 2000) and discussed in the next part of this work.

Just as there are various situations in which people learn more than one language, there are several ways in which they do so, including bilingual acquisition with differing areas of application, second language learning (possibly leading to incipient bilingualism) at any age, and balanced bilingualism (a child learns both languages to an equal level, often because the parents speak different languages). In addition to these favorable conditions, in which people mix languages out of `surplus', there are conditions that actually force people to insert words from another language: one of the languages may be on the verge of extinction and lack many words and expressions, or the speaker may be affected by a condition such as aphasia or dementia.

Although none of these conditions is new, code-mixing is a relatively new area of research.

In the 1960s Meri Lehtinen and Michael Clyne made the first attempt, on the basis of a small Finnish/English corpus, to find out whether there are certain patterns in the way the speaker chooses the language; they also assumed that a switch can only occur when there are similarities in the `surface grammar' of the two languages and that only words belonging to open classes could be switched (Lehtinen 1966; Clyne 1967). Up until the 1970s alternations between languages in the course of a discourse were mainly considered `linguistic rubbish' and dismissed as random (Labov 1972; Lance 1975; Weinreich 1953/1968), although today code-mixing is universally recognized as grammatically constrained. A few early works describe various restrictions on switches within particular grammatical constructions (Gumperz 1976/1982; Timm 1975), but it was at the end of the 1970s that a few papers (Pfaff 1975, 1976; Poplack 1978/1981) revealed some regular code-mixing constraints, which led to more definite rules and limitations being worked out for code-mixing in different language pairs. Soon this topic was taken up by other linguists.

Despite the newness of this area of research, quite a lot of work has already been done. Code-mixing is traditionally studied using one of three approaches: psycholinguistic (Grosjean 1982; Kolers 1966; Lipski 1978), sociolinguistic (Gumperz 1982; Finlayson, Calteaux, Myers-Scotton 1998; Heller 1992) or linguistic (Poplack 1980; Myers-Scotton 1993; Muysken 2000; Sridhar, Sridhar 1980). In my research I will mostly concentrate on the linguistic approach, although hopefully the results will be helpful for every aspect of studying code-mixing.

The linguistic approach addresses structural questions, such as those of morphology and syntax. The main goal is to find out whether code-mixing obeys any rules. For instance, (Poplack 1980, 1981) and (Sankoff, Poplack 1981) deduce two constraints on code-mixing. One states that the morphology of different languages cannot be mixed within the boundaries of one word. The other suggests that the syntactic structures of the two languages have to be equivalent for a switch to occur. However, although they were first assumed to be universal, both constraints turned out not to be.

It may seem paradoxical, but despite the considerable amount of research conducted, very little data has been collected for such purposes. Most papers are therefore only able to describe the situation for two (or more) particular languages, and there is no way to analyze the bigger picture. There are a few rather big corpora of examples for various language pairs (Spanish/English (Poplack 1980), Italian/French (DiSciullo, Muysken, Singh 1986), Moroccan Arabic/French (Bentahila, Davies 1983), etc.). However, the corpus method is usually dismissed by most researchers of code-mixing (Milroy, Muysken 1994), mostly because of the cost of collecting data. Because tools for the automatic processing of texts with code-mixing are lacking, most of the research is conducted manually; with no way to analyze the phenomenon on a large scale, many questions remain unanswered.

The objective obstacle is, obviously, that code-mixing mostly occurs in spontaneous speech, while the cases that appear in fiction are based entirely on the authors' intuition; thus the generally accepted idea is that corpora have to consist of spoken conversations. Furthermore, most code-mixing hypotheses cannot be confirmed or rejected by informants, as they cannot be checked against intuition (Muysken 2000).

Nevertheless, the Internet makes it possible to collect enormous amounts of data, such as public blogs, Twitter posts, etc. (Dorleijn, Nortier 2009). Such texts are much closer to spoken conversation than traditional literature, often include large numbers of code-mixing examples, and can therefore be used for research. It is important to note that I in no way claim written blogs to be equal to spoken conversation, but I believe that they are more than worthy data whose analysis could be a huge step towards understanding the rules of code-mixing in general.

In this paper I have tried to review most of the major studies describing various rules and constraints on code-mixing. I went over the types of code-switching, in particular insertion, alternation and congruent lexicalization, and discussed what should fall under each type. In order to create a unified system for analyzing code-mixing phenomena I have worked out basic principles for annotation which, with very few modifications, can be applied to any language pair (or to more than two languages). I have considered which conditions might influence the choice of code-mixing patterns and pointed out the problems that might come up when studying certain phenomena in different languages. Based on the developed principles I have created the first version of an Udmurt/Russian online annotated corpus. The corpus consists of Internet blogs and has both morphological and code-mixing annotation. Based on the obtained data I have been able to determine the main strategies of Udmurt -> Russian and some Russian -> Udmurt code-mixing, and I have pointed out some problems that occur in automatic code-mixing annotation when it is applied to languages that have been in contact for a long time. I have also checked whether any of the known code-mixing constraints were violated.

In addition, I have developed a plan for possible future research based on this new annotation, which could significantly extend the potential of the typological approach to code-mixing and maybe even become a step towards figuring out how to induce, manipulate, and replicate natural code-mixing (Gullberg, Indefrey, Muysken 2009).

1.2 What is code-mixing?

The term code-switching came from the physical sciences (Fano 1950) and then shifted to political anthropology (Gal 1987, 1995); the meaning of the notion has changed and multiplied. In research on bilingualism, and on bilingual behavior in particular, however, it appeared as `switching code'. This was the term first used for what we call code-switching or code-mixing today. The topic was taken up in structural phonology, information theory, and research on bilingualism. In 1952 Jakobson began the synthesis of these strands (Jakobson, Fant and Halle 1952). His work is based on (Fano 1950), a paper in information theory, and on (Fries & Pike 1949), a paper on phonemic systems, which suggests that `two or more phonemic systems may coexist in the speech of a monolingual' (1949:29).

At about the same time, Hoijer (1948) introduced a concept of `phonemic alteration' (parallel to what today is called borrowing) and `phonemic alternation' (parallel to code-mixing).

(Jakobson, Fant and Halle 1952) and later (Jakobson 1961) describe the notion of `switching code' in terms of the decoding that a bilingual speaker must do to understand another person's code or to produce their own. As an example they present the Russian aristocracy of the 19th century, who switched between Russian and French constantly, sometimes within a single sentence (Jakobson, Fant and Halle 1952:603-604).

The work also states that `Two styles of the same language may have divergent codes and be deliberately interlinked within one utterance or even one sentence' (Jakobson, Fant and Halle 1952:604). Interestingly, it holds that a language is not a code, but rather has a code (Alvarez-Cáccamo 1998).

Therefore, code-switching is conceptualized as the alternation not only of languages, but also of dialects, styles, prosodic registers, paralinguistic cues, etc., subjects later discussed in (Gumperz 1982), (Gumperz 1992) and (Auer 1992).

Later (Muysken 2000) proposes the term code-mixing for the general notion of alternation of various codes in speech and suggests reserving code-switching for the rapid succession of several languages in a single speech event. He does, however, use `switch' and `switching' when referring to a particular co-occurrence of elements from different languages in a sentence. I adopt his terminology because of its additional accuracy and transparency.

The question of choosing the correct term for this phenomenon is important for understanding the boundaries of what is being discussed. However, even after deciding on the term there are still very different opinions on what code-mixing represents. There are suggestions that code-mixed fragments of speech should be considered a new single code, a sort of new language. This is not unreasonable, because the speakers do not treat any grammatical distinctions between the languages as significant (Gardner-Chloros 1991). However, since switching is only possible under particular conditions, the grammars should be taken into account. I would also suppose that for particular types of languages these conditions are very similar. Still, I have to agree with Gardner-Chloros that we are dealing with a single speech flow. Switches may occur many times in one sentence, leaving us to wonder whether a bilingual person has a certain `bilingual system' in their mind which allows them to switch between languages so easily.

There are a few works that support this hypothesis. (Swigart 1992) describes the bilingual situation in Dakar, Senegal, where people speak Wolof and French and where, interestingly, these languages are almost never used separately:

(1) ...xam nga weeru benn jour, quelques minutes lay def, quelques minutes rekk et puis c'est petit, un tout petit kii la! Boo gaawul, doo ko men a gis.

know the first day's moon, it's only there for a couple of minutes, just a couple of minutes and then it's small, it's a really small thing! If you are not quick, you won't be able to see it.

(Swigart 1992: 89-90)

The examples and the translations are taken from (Swigart 1992), the italics are mine.

(2) Me?n naa lakk olof sans lakk 'faranse'.

I can speak Wolof without speaking French.

The last example shows a person trying to prove that he can speak pure Wolof, but still using the French sans `without'. The irony of the example shows that sometimes the speaker cannot avoid code-mixing even when he tries to. Obviously, there are fewer switches, but it seems hard to avoid them altogether. (Golovko 2001) states that not every code switch is determined by the listener, especially in a bilingual community. Golovko therefore suggests either considering code-mixing from a point of view in which orientation towards the listener is not obligatory, or ceasing to view such phenomena as code-switching at all. To resolve this he proposes an opposition of motivated vs. unmotivated mixing.

A few papers (Backus 1993: 233; Sarhimaa 1999: 237) support the same approach. They claim that bilingual communities are characterized by fluid code-mixing, which is due to the unflagged (unmotivated) code-mixing.

Another argument for the existence of a single mixed code was offered by Yael Maschler in a study of the Hebrew-English language alternations of one bilingual speaker (Maschler 1998). The article demonstrates that some elements of the discourse make it a mixed code rather than code-switching, because they prove to have exclusive functions not typical of either language (to varying degrees depending on the context; not all of the alternations were of this kind, but any deviation proves the existence of some level of transformation). It, however, only says something about the speech of one particular person and does not say whether this transition is characteristic of Hebrew-English mixing in general. I believe such research has not been conducted yet, although a comparison with some other `transitioned' code of a different language pair would definitely be a very strong argument for a more global distinction between code-switching and mixed code.

1.3 What is there to study?

Code-mixing, studied as a sociolinguistic phenomenon, is naturally influenced by many extra-linguistic factors. Therefore, apart from strict pattern description, there are questions of whether anything other than the restrictions of the languages is involved in the choice of these mixing patterns, and why certain communities show one pattern rather than another.

There are two major ways to work on code-mixing. We can use a descriptive method, working on particular mixing strategies and looking at the constraints on when they can occur in a certain language pair. The other approach is explanatory and requires an attempt to account for why and where mixing is possible. Although I would like to strike a balance between the two, I will mostly use the first approach, with the goal of making it a path towards a constructive theory explaining different features of code-mixing.

Studies taking the linguistic approach traditionally focus on whether there are any rules governing how the switching between languages happens and, if there are, whether any of them are universal.

2. Code-mixing

2.1 Classification

There are many ways to classify the types of code-mixing. One of the first classifications was proposed in (Clyne 1967), which divides code-mixing into three forms based on the notion of a trigger-word that forces the speaker to switch to another language unintentionally: following (the switch occurs after the trigger-word), preceding (the switch takes place before the trigger-word) and combinative (the switch is realized between two trigger-words) (Clyne 1980; 2003).

However, a more common distinction is that of (van Hout, Muysken 1994). They proposed partitioning code-switches into three other categories: insertion (switches that are preceded and followed by elements of another language), alternation (switches from one language into another for more than one word) and congruent lexicalization (a few words in different languages that do not form one or multiple constituents).
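As a rough illustration, the three categories can be approximated from nothing more than a per-token language tag sequence. The sketch below is a simplification and not the annotation scheme of this work: the function names and the labels `A' and `B' are hypothetical, and constituent structure, on which the definitions above rely, is ignored.

```python
from itertools import groupby


def language_runs(tags):
    """Collapse a tag sequence into (language, run_length) pairs."""
    return [(lang, len(list(group))) for lang, group in groupby(tags)]


def classify_pattern(tags):
    """Very rough pattern guess from the shape of the language runs."""
    runs = language_runs(tags)
    if len(runs) <= 1:
        return "monolingual"
    if len(runs) == 2:
        # A ... B: the sentence starts in one language and ends in another.
        return "alternation"
    if len(runs) == 3 and runs[0][0] == runs[2][0]:
        # A B A: material of another language nested inside language A.
        return "insertion"
    # Repeated back-and-forth switching; only a candidate, because the
    # real definition depends on constituent structure.
    return "congruent lexicalization (candidate)"


print(classify_pattern(["A", "A", "B", "A"]))
print(classify_pattern(["A", "A", "B", "B"]))
print(classify_pattern(["A", "B", "A", "B", "A"]))
```

A heuristic of this kind can only pre-sort sentences for closer inspection; deciding between the categories in earnest requires syntactic information.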

Another classification can be found in (Poplack 2000). She offers four types of code-mixing. The first type consists of single-word insertions, which she classifies as nonce borrowings. The second type includes bigger, established constituents, such as exclamations, particular phrases and idioms. The third type describes longer alternations of more than one word inside one sentence, and the fourth type characterizes switches between whole sentences.

There are more approaches to describing and working with code-mixing, including the generative approach (MacSwan 1999a, 1999b) and the classification of types of mixes in (Myers-Scotton 1988; 1989).

We are only going to analyze code-switching within one sentence, and therefore we will follow (van Hout, Muysken 1994).

2.2 Insertion

2.2.1 Definition

Let's first consider insertion. The schematic way to describe it would be the following:

In this and subsequent diagrams A and B are simply different languages. The scheme thus shows a clause in language A with one or more constituents in language B inserted into it.

So, the reasonable step would be to admit that any switch to another language and back should be considered insertion. However, this is not exactly the case. First of all, not every constituent gets involved in switches, and if it does, then there is the most controversial question of whether it is a code-mixing insertion, a borrowing, or what is commonly termed a `nonce borrowing'.

Under the equivalence constraint, which I discuss later, code-mixing is only allowed within grammatical structures that exist in both languages. Therefore, almost any code-mixing involving noun phrases is of the insertional type. NPs are very well-defined constituents and mostly syntactically inert, and because of that they are easily insertable. (Muysken 2009) suggests the following insertional types in terms of nominal constituents:

He deduced that insertions most often involve single constituents; they exhibit an `A B A' structure, so that the fragment preceding and following the insertion are related grammatically:

These are also mostly content words rather than function words (van Hout & Muysken 1994). Inserted items are mostly nouns, adjectives and verbs.

Thus, insertions are usually single, nested, morphologically integrated constituents consisting of content words, and the grammar of the base language determines the overall structure of the sentence.

2.2.2 Code-mixing vs borrowing

One of the most controversial topics in code-mixing, especially while describing insertion, is what we should consider switched elements and what is rather just a borrowed word or an expression.

(Poplack 1980) suggests that there are nonce borrowings, where a word is borrowed for one occasion only; code-switching, where more than one word is switched; and established borrowings, which can be found in a dictionary.

Here are the features distinguishing between borrowing and code-switching according to (Muysken 2000):

As we can see, the main criteria for distinguishing between code-mixing and borrowing reflect the level of adaptation of the word to the system of the main language, which includes phonetic, morphological and syntactic adaptation. They are, however, not absolute; here is the division found in (Poplack 1980a) for Spanish/English code-mixing:

When only partial adaptation takes place, we can talk about the `borrowability' of the inserted word, or the degree to which it becomes part of the matrix language. This allows the line between loan words and code-mixing to be drawn in very different places. Therefore, as one of the main goals of this work is a unified description of code-mixing, I will try to draw this line in the way that best reflects the difference under automatic processing.

Thus, embedding an element from another language into a clause is code-mixing, but adding it to the lexicon is borrowing (Muysken 2000). However, if we consider the matter from the speaker's perspective, the distinction might be different. As speakers operate with the lexicons and grammars of both languages, they might view the lexicons as two intersecting subsets, so that the process can be viewed as lexical sharing rather than as borrowing items from that intersection.

Construction Grammar, however, suggests looking at the borrowing versus code-mixing question along the dimensions of listedness (the degree to which the word has been adapted within the language) and lexicality (supra-lexical/sublexical) (Goldberg 1995). It can be represented as follows:




                 not listed                 listed
supra-lexical    spontaneous code-mixing    conventionalized code-mixing
sublexical       nonce loans                established loans

Although code-mixing is mostly spontaneous, there are certain patterns of mixing that are more common in one community than in another, even if they speak the same languages (Poplack & Sankoff 1988). Such code-mixing is classified as conventionalized. Established loans are naturally the ones that have long held a firm position in the language. Nonce loans, a term introduced in (Haugen 1950), are elements that are borrowed spontaneously and do not have any status in the receiving speech community.

As we have discussed before, nouns and noun phrases are the most easily borrowed, and these borrowed noun phrases can be complex (Sankoff, Poplack, and Vanniarajan 1990: 80). Although they can still be lexicalized, such combinations are not borrowed into the language as easily (we have already seen NPs in the hierarchy). But it is not only the length of the borrowed element that influences `borrowability': for instance, plural nouns tend to fall under code-mixing, and N-insertions in particular, but rarely under nonce borrowing.

Based on Haugen's theory on loan words, (Muysken 2009) puts forward the following hierarchy of borrowability:

nouns > adjectives > verbs > prepositions > coordinating conjunctions > quantifiers > determiners > free pronouns > clitic pronouns > subordinating conjunctions

This hierarchy is derived through statistics only, and no explanation is currently available. Moreover, it seems that the hierarchy is not universal across language pairs.
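For corpus queries, such a hierarchy can be treated as an ordered list and used to rank the categories of inserted items. The sketch below is purely illustrative: the variable and function names are hypothetical, and the ordering simply copies the hierarchy quoted above, which, as noted, is not claimed to be universal.

```python
# Categories ordered from most to least borrowable, as in the hierarchy above.
BORROWABILITY = [
    "nouns",
    "adjectives",
    "verbs",
    "prepositions",
    "coordinating conjunctions",
    "quantifiers",
    "determiners",
    "free pronouns",
    "clitic pronouns",
    "subordinating conjunctions",
]

# Lower rank means the category is borrowed more easily.
RANK = {category: position for position, category in enumerate(BORROWABILITY)}


def more_borrowable(first, second):
    """True if `first` stands higher in the hierarchy than `second`."""
    return RANK[first] < RANK[second]


def sort_by_borrowability(categories):
    """Order the given categories from most to least borrowable."""
    return sorted(categories, key=RANK.__getitem__)


print(more_borrowable("nouns", "verbs"))
print(sort_by_borrowability(["determiners", "verbs", "nouns"]))
```

Comparing the observed frequencies of inserted categories in a given language pair against this ordering is one way to see where the pair departs from the hierarchy.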

What we can say for certain, however, is that code-mixing involving agglutinative languages is more predisposed to borrowing, as such languages are defined by the absence of lexical selection by affixes: there are no conjugation classes, special morphophonemic rules, etc. As the affixes are non-selective, they always fall under equivalence, because a lexical base in one language is equivalent to one in another language.

For the same reasons fusional languages are highly resistant to borrowing (Budzhak-Jones 1998; Budzhak-Jones and Poplack 1997). There are also very extreme noun/verb asymmetries in borrowability: nouns are often borrowed uninflected, but verbs are not (Nortier & Schatz 1992). The asymmetry exists in agglutinative languages as well, but many verbs are still borrowed.

It is important to point out that insertion is a different process for different languages. This is primarily because there is no agreement on what should be considered insertion and what borrowing, but also because it depends on the types of the languages, as we have already discussed for the fusional/agglutinative distinction. At the moment we cannot simply choose a universal way to describe insertion in different languages; however, if we want to create a corpus we will have to accept some generalization, and naturally it is going to be in the direction that simplifies automatic processing.

2.2.3 Determining matrix language

(Haugen 1956:39) states the following:

Any item that occurs in speech must be a part of some language if it is to convey any meaning to the hearer...The real question is whether a given stretch of speech is to be assigned to one language or the other.

Therefore, when studying insertion, it is common to divide the languages of the discourse into a matrix language and an embedded language (or languages). It is important to understand which elements are inserted into which language and which language the person is speaking at that moment, so that we can, first of all, make sure that we are dealing with the insertional type of code-mixing, and also analyze the occurring switch more precisely.

There are five approaches used to determine the base language. First of all, we can simply regard it as the language of the conversation (Berk-Seligson 1956:323). This idea seems intuitively plausible; however, when the languages are heavily mixed it can be hard to determine, especially automatically, and sometimes even the speakers themselves cannot say which language was the main one in their speech.

The second approach, which is more statistical, is to count the morphemes of the words that are uttered (Myers-Scotton 1993b: 68), on the assumption that the matrix language contributes the most words and morphemes. This model, however, does not take into account that some languages naturally have more morphemes in general. If the language pair consists of a polysynthetic or agglutinative language and an isolating one, the number of morphemes will mean almost nothing.

A more psycholinguistic approach is to see the matrix language as the language in which the speaker is more proficient. Proficiency is not a very reliable criterion, though. In cases of balanced bilingualism it can be hard to determine even for the speakers themselves. Moreover, different situations might prompt the speaker to use one language or another, depending, for instance, on who they are talking to.

Another approach is to go from left to right (Doron 1983). Despite its disturbing simplicity, it might be a good way of determining the matrix language. The point is that the base language only needs to be determined when analyzing insertion, and an inserted word cannot be the first one in the sentence: if the rest of the sentence is in another language, then it is alternation, and if it is a mix of the two languages, then it is congruent lexicalization. This, however, is only a good approach if we look at separate sentences and examples rather than at the conversation as a whole.

A much more elaborate approach was introduced in (Milroy, Muysken 1994). They suggested using a structurally oriented model, where some element or set of elements determines the matrix language.
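Two of these approaches are simple enough to sketch in a few lines. The following is only an illustration under stated assumptions: tokens are assumed to be already tagged with a language label (the labels `udm' and `rus' are used for illustration), words are counted as a stand-in for morphemes, and the function names are hypothetical.

```python
from collections import Counter


def matrix_by_count(tags):
    """Statistical heuristic: the language with the most tokens wins.

    Returns None for an empty sentence or a tie, where the count is
    uninformative.
    """
    ranked = Counter(tags).most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def matrix_left_to_right(tags):
    """Left-to-right heuristic: the language of the first token wins."""
    return tags[0] if tags else None


sentence = ["rus", "udm", "udm"]
print(matrix_by_count(sentence))
print(matrix_left_to_right(sentence))
```

On a tag sequence such as the one above the two heuristics give different answers, which illustrates why the choice of strategy, or of a compromise between strategies, matters for annotation.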

There is also another approach, which is based on a government model (DiSciullo, Muysken, and Singh 1986). It suggests that there is no single matrix language for a particular clause, but that every governing element in the sentence establishes a matrix structure. It follows that, unless the chain of government is broken, the language of the tree is determined by its highest element, which is usually a finite verb or, in the case of a subordinate clause, the complementizer (Klavans 1985; Treffers-Daller 1994).

After choosing a strategy, or some compromise between them, there is still another issue: determining which language a word belongs to. If the languages are similar or have been in contact for a long time, many words may be very similar; often the morphology points to one language or the other, but if the languages are morphophonemically similar, then it can be hard or even impossible to assign a word to a particular language.
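For the Udmurt/Russian pair one cheap cue, offered here as an assumption rather than as the method used for the corpus, is orthographic: the Udmurt alphabet adds the letters ӝ, ӟ, ӥ, ӧ and ӵ to the Russian one, so a token containing any of them is not a standard Russian spelling. The sketch below (hypothetical function name and labels) shows how little such a cue decides: every token without these letters remains ambiguous and still needs morphological or dictionary evidence.

```python
# Letters of the Udmurt alphabet that are absent from the Russian alphabet.
UDMURT_ONLY_LETTERS = set("ӝӟӥӧӵ")


def guess_language(token):
    """Return 'udm' when the spelling alone rules out Russian.

    Any other token is reported as 'ambiguous': the absence of
    Udmurt-specific letters says nothing about the language.
    """
    if any(char in UDMURT_ONLY_LETTERS for char in token.lower()):
        return "udm"
    return "ambiguous"


for word in ["ӟеч", "шунды", "книга"]:
    print(word, guess_language(word))
```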

3. Congruent lexicalization

Similar to insertion is another type of code-mixing, congruent lexicalization (van Hout and Muysken 1995). Its structure can be visualized as follows:

Unlike insertion, congruent lexicalization involves several mixed-in constituents, sometimes so many that it is hard to determine the main language of the discourse or to which language the syntactic structure belongs, as the grammatical relations of the two languages interlace too tightly. This code-mixing pattern is common among second-generation immigrants and bilinguals speaking closely related languages (Vakhtin, Golovko 2004:28). The reason is that congruent lexicalization results from frequent trigger words; an overabundance of homophonous words (especially in related languages) can therefore cause code-mixing. But even if there is no lexical correspondence, categorial and linear equivalence can also cause congruent lexicalization. This is easily understandable, as this type of code-mixing is made possible by grammatical convergence: the vocabulary comes from two languages and the grammatical structure belongs to both at the same time. Not the whole grammar has to be shared by both languages; often there is just alignment of the major constituents, but not of all the internal structure of these constituents.

For many bilingual communities some structural convergence is commonplace, which raises many issues of language contact and language change and of whether there is a causal link between them. The controversy raises the question of whether, if there is a connection, code-mixing is the cause of the convergence or the other way around, and whether this convergence always means reduction and simplification of both languages.

(Muysken 2009) gives a hierarchy representing the degree to which congruent lexicalization occurs in various communities with respect to Dutch:



Dutch in Australia/English

Moluccan Malay/Dutch


Moroccan Arabic/Dutch


As we can see the pairs that are higher in the hierarchy can be regarded as intralinguistic variation.

4. Alternation

Another very common strategy of code-mixing is alternation. It can be represented by the following scheme:

Although two languages coexist in one clause, they remain separate. A good example of this type of code-mixing is the title of Poplack's article (Poplack 1980):

Sometimes I'll start a sentence in Spanish y termino en español

`Sometimes I'll start a sentence in Spanish and finish it in Spanish' (the mistake was made by one of Poplack's informants)

Unlike insertion, where most embedded elements are nouns and adjectives, alternation is often provoked by particles and adverbs (Muysken 2009).

Moreover, there are syntactic differences to be considered. Alternations are more likely to appear at the boundary of a major clause.

(Treffers-Daller 1994) contains a corpus of French/Dutch code-mixing in Brussels, which is characterized by a high number of alternations. Two important points can be made on the basis of her data: alternation only occurs where the word order is the same, and it usually happens at the border of two major clauses (Muysken 2000). Based on her corpus, and using a probabilistic approach, she also proposes a hierarchy of how likely various constituents are to take part in alternation:

coordinated NPs/PPs > dislocated NPs/PPs > adverbial PPs/NPs > before subordinate clauses > predicative NPs/APs/possessive PPs > subject or object NPs and clauses > indirect questions

Alternation is tightly tied to syntax. Switches of any type can occur either in the center of the clause or on the periphery. The switch usually involves a left- or right-dislocated element or is found at the beginning of the second of two conjoined clauses.

Although linguists discussing alternation mostly mean sentence-internal switching, mixing between utterances is also entirely possible, and marking it as alternation is theoretically justified: between clauses the alternation occurs at the boundary, and once the language is switched it remains the same:

(4) Adios, amigos! See you tomorrow.

`Goodbye, friends! See you tomorrow'

Although alternation usually takes place at a boundary, it is still possible for it to be found in a connected structure under equivalence, as with congruent lexicalization (Naït M'Barek & Sankoff 1988), (Poplack & Meechan 1995).

5. Constraints

5.1 Equivalence Constraint

As I have mentioned before, there is an equivalence constraint (S.Poplack, D.Sankoff 1981), the existence of which is supported by many linguists. There are at least two ways to look at it: some believe that code-mixing should not violate the syntactic structures of either language (also mentioned as the switch-alpha constraint in (Choi 1999)), while others believe that one language assimilates into the other (the dual structure principle in (S.N.Sridhar, K.K.Sridhar 1980); the matrix language model in (C.Myers Scotton 1989); the matrix language principle in (Kamwangamalu 1998)). The latter view follows the theory that the adopting language sets the aspect, tense, agreement, etc. (Bhatt R.M., 1997). Whatever the mechanism is, it is clear that the speaker tries to avoid grammatical conflicts when producing utterances.

Consequently, this constraint states that a switch cannot occur within a constituent generated by a rule of one language if this rule does not exist in the other; a switch therefore cannot violate the syntactic rules of either language.

This constraint may be demonstrated with the classic examples (5) and (6), which were constructed by Gingras (1974) and then tested on a group of Chicano bilinguals for acceptability.

(5) El MAN que CAME ayer WANTS JOHN comprar A CAR nuevo.

Spanish: El hombre que vino ayer quiere que John compre un coche nuevo

English: 'The man who came yesterday wants John to buy a new car'.


Spanish: `Dile a Larry que se calle la boca'

English: 'Tell Larry to shut his mouth'.

These sentences have very similar structures: both contain a verb phrase and a verb phrase complement. In English both verbs require the infinitive complementizer rule to apply, but in Spanish the same construction takes a subjunctive complementizer. Although (5) switches at almost every word and (6) has a single switch between two constituents, their biggest difference concerns linear equivalence (Gingras 1974). By using the infinitive complementizer the first sentence violates the constraint, as this is not a Spanish construction. The first half of the sentence is composed of constituents that do not go against any rules; English and Spanish map onto each other perfectly there:

El MAN que CAME ayer WANTS…

The man who came yesterday wants …

El hombre que vino ayer quiere…

Within this stretch the switch may occur anywhere, but not beyond it. Thus, all Gingras' informants found the full sentence unacceptable.

In addition, A CAR nuevo does not follow the English adjective-noun word order, and although some Spanish adjectives may precede the noun, nuevo is not one of them. When structures are not equivalent in the two languages, the constituents tend to be uttered in one of the languages, as in the second example. That sentence was found acceptable by 94% of Gingras' informants.
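The word-order side of the equivalence constraint can be illustrated with a small sketch. It is only an idealization: it assumes a one-to-one word alignment between the two monolingual versions of a phrase (the `alignment` list is a hypothetical input, not something described in the sources cited here), and it ignores the complementizer mismatch discussed above. A boundary counts as a possible switch point only if the words before it in one language correspond exactly to the words before it in the other.

```python
from typing import List


def equivalent_switch_points(alignment: List[int]) -> List[int]:
    """Return boundaries where both languages agree on linear order.

    alignment[i] is the position, in the second language, of the word
    that corresponds to word i of the first language (assumed to be a
    one-to-one mapping). Boundary k lies between word k-1 and word k.
    """
    points = []
    highest_seen = -1
    for k in range(1, len(alignment)):
        highest_seen = max(highest_seen, alignment[k - 1])
        # The first k words map onto the first k words of the other
        # language exactly when the highest aligned position is k - 1.
        if highest_seen == k - 1:
            points.append(k)
    return points


# "un coche nuevo" vs. "a new car": un-a, coche-car, nuevo-new
noun_phrase = equivalent_switch_points([0, 2, 1])

# "el hombre que vino ayer quiere" vs. "the man who came yesterday wants"
first_half = equivalent_switch_points([0, 1, 2, 3, 4, 5])
```

Under these assumptions the noun phrase allows a switch only after the determiner, so a switch between the noun and the adjective (as in A CAR nuevo) falls outside the permitted points, while every boundary in the first half of (5) is available.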

Equivalence has been verified as a tendency in many language pairs: Spanish/English (Poplack 1980), Finnish/English (Poplack et al. 1987), French/Arabic (Naït M'Barek & Sankoff 1988), English/Tamil (Sankoff et al. 1990), Wolof/French and Fongbe/French (Poplack & Meechan 1995), Ukrainian/English (Budzhak-Jones 1995), French/English (Turpin 1998) and possibly more.

Nevertheless, (Di Sciullo, Muysken and Singh 1986) disagree with this constraint. They argue that it does not include any notions of structural or hierarchical relations (which most grammatical principles are built on) and relies only on linear sequence.

Thus (Di Sciullo, Muysken and Singh 1986) suggest an alternative description of the equivalence constraint that involves the notion of government. Consider verbs, adpositions, etc. as governing elements and noun phrases as governed elements. The governed elements (e.g. noun phrases) must be perceived by the speakers as equivalent in category. Linear equivalence is then simply a subcase of categorial equivalence under government theory, since, for instance, a verb that governs rightward is not exactly equivalent to one that governs leftward. According to this government constraint, switching is possible only between elements that are not related by government (for example, in a PP the preposition governs the NP, and in a VP the verb governs the object). They claim that this constraint takes priority over every other. It has been demonstrated for French/English/Italian code-mixing, as well as Hindi/English.

However, it has also been argued against by (Klavans 1985), since it predicts that simple and certainly frequently occurring examples are impossible, such as switching between V and object NP:

(7) Los hombres comieron the sandwiches

`The men ate (Spanish) the sandwiches (English).'

At the same time it allows very rare examples such as:

(8) La plupart des canadiens scrivono `c'.

`The majority of (the) Canadians (French) write (Italian) `c'.'

In their defense, (Di Sciullo, Muysken and Singh 1986) state that although they claim the government constraint is universal, it can be supplemented with additional constraints.

5.2 Free-Morpheme Constraint

Another constraint that seems to be much less frequently violated is the free-morpheme constraint, which basically precludes formations such as:

*EAT - iendo

The example shows an English root `eat' combined with a Spanish bound morpheme -iendo, an analogue of the English -ing. So, basically, what this constraint restricts is code-switching within one word. This, however, implies a full switch, including a phonological change. No violation of the constraint at this full level seems to have been documented yet. But I believe it might be a problem to distinguish between nonce-loans adapted to the morphology of the matrix language and code-switching in languages that have similar phonology and have been in close contact for a very long time. I will say more about this later in regard to Udmurt/Russian code-mixing.

So what the constraint states is that the affix governs the root and therefore has to be in the same language as the root. But there is a situation in which the restriction can be obviated: if the base of the word is categorially equivalent in the affix language. This brings us back to the equivalence constraint; it is just that in this case we have it within a single word. This means that the free-morpheme constraint can be violated in cases where the rules of morphology concern general categories that exist in both languages, such as nouns, verbs, etc., when, for example, we deal with two agglutinative languages, but it has to be obeyed in regard to language-specific categories such as conjugation types, declension classes, etc.

Although there are reports of counter-examples (Eliasson 1989), (Myers-Scotton 1993), (Bentahila and Davies 1991), (Backus 2003), they all concern insertional switches. No cases of alternational switches (such as L1 L1 L1/L2 L2) have ever been reported. The insertional cases are usually dismissed as borrowings. There are examples mentioned in (Clyne 1980) that come very close, but they concern closely related languages and are therefore still problematic.
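As a rough illustration of how candidate violations of the free-morpheme constraint could be flagged, here is a minimal sketch. It assumes that every word has already been split into morphemes and that each morpheme carries a language tag; the `Morpheme` type and its field names are hypothetical. It does not model the categorial-equivalence exception or the phonological criterion, so its output is a list of candidates for manual inspection rather than a verdict.

```python
from typing import List, NamedTuple


class Morpheme(NamedTuple):
    form: str
    lang: str    # language tag, e.g. "eng" or "spa"
    bound: bool  # True for affixes, False for free stems


def free_morpheme_candidates(word: List[Morpheme]) -> List[Morpheme]:
    """Return bound morphemes whose language differs from every stem."""
    stem_langs = {m.lang for m in word if not m.bound}
    if not stem_langs:
        # No free stem found: nothing to compare the affixes against.
        return []
    return [m for m in word if m.bound and m.lang not in stem_langs]


# The starred formation *EAT-iendo from the text above.
eat_iendo = [Morpheme("eat", "eng", False), Morpheme("iendo", "spa", True)]
flagged = free_morpheme_candidates(eat_iendo)

# A monolingual word is not flagged.
eating = [Morpheme("eat", "eng", False), Morpheme("ing", "eng", True)]
not_flagged = free_morpheme_candidates(eating)
```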

(Clyne 1980) cites a few instances that show free-morpheme violation, even though Clyne does state that these examples are very rare in their corpus.

(9) That's what Papschi mein -s to say.

`That's what Papschi means to say.'

The name Papschi is pronounced with German morphology. It can be a trigger, but either way this utterance contains two switches from English to German and then from German back to English (possibly because of noting the previous switch). The second switch occurs within the word.

A similar situation can be found in the following example (10); with another morpheme:

(10) in meine Mutter -s car.

`In my mother's car.'

A very similar switch at possessive morpheme can be seen in this English/Dutch example (13).

(13) naar mijn vriendin's place

`At my girlfriend's place'

Or an even more interesting switch of just a single morpheme, repeated each time it occurs:

(14) Es waren hundert-s und hundert-s of Leute.

`It was hundreds and hundreds of people'

Clyne gives another very unusual example:

(15) Dan somstimes go voor'n hour nog in bed.

`Then sometimes go for an hour to bed.'

Interestingly, Dutch som(s) already means sometimes. However, the switch is probably triggered by this word, which proves the possibility of triggering by the `ambivalence' of the word not only for a separate word, but within one as well.

The existence of this constraint is widely discussed: some argue for it (Poplack 1980), some against (Clyne 1987; 2003), (Berruto 2005). It seems, however, that even if it is violated at times, this is not the norm, and as it happens in the speech of particular speakers we cannot claim that it is typical of any particular community or language pair.

5.3 Closed-Class Constraint

In addition to all the constraints above, there is a constraint formulated in (Joshi 1983) and soon taken up by (Doron 1983), which limits the switching of closed-class elements, such as quantifiers, tense morphemes, complementizers, pronouns, prepositions, determiners and others, if they exist in the language. The constraint was worked out on the basis of Marathi/English code-mixing, but to some extent it has been confirmed for many other language pairs.

5.4 Language-Specific Constraints

In addition to constraints carried over to supposedly every language pair, there are also a few language-specific constraints. For example, the widely studied Spanish/English data suggest that there is a constraint that prohibits switching between a noun and a following modifying adjective (Woolford 1983).

(16) *the casa big `the house big'

However, it seems that the switch of a clitic pronoun can sometimes occur:

(17) Yo it compré. `I it bought.'

She also suggests a restriction on switches that involve verbs with empty subjects and auxiliaries with some negatives.

(18) *Was training para pelear. ` fight'

(19)*I am no terca. `...stubborn'

Woolford supposes that this constraint exists due to language-specific transformations rather than language-specific phrase-structure rules, which is another reason why corpora of more language pairs are needed, as Spanish and English are both SVO. And according to (Klavans 1985), conflicts in code-mixing cannot be explained through constraints when it comes to differently structured language pairs.

Nevertheless, the work on Hindi/English (SOV vs SVO) code-mixing (Di Sciullo, Muysken and Singh 1986) suggests that such a pair might be constrained due to the more `Hinglish' structure of the discourse (Hindi with lexical transfer from English), something that we have already discussed in this paper in regard to English-Ukrainian code-mixing. They also state that language-specific constraints are complementary to the general constraints and do not override them.

For Hindi/English they observe the following constraints:

- switching occurs differently between subject and verb and verb and object, plus the second is much rarer

- complements of a preposition must be in the same language as the preposition, as in `sonata for two violins'

- phrases inside a phrase structure tree must be in the same language

More constraints can be found in (Pfaff 1976). She, as well as (Wentz and McClure 1977) and (Timm 1975), states that in Spanish/English code-mixing a clitic pronoun object must always be in the same language as the governing verb. She notes that mixes between Determiner + Noun are very rare (they are found ungrammatical in (Wentz and McClure 1977)), as are full clause switches, which are found to be frequent in (Gumperz 1976). She also disagrees with Gumperz, who claims that conjunctions are always in the same language as the conjoined sentence. She also found switches of prepositions to be impossible and of full pronoun phrases to be very rare.

Pfaff also formulates a semantic constraint (no one has supported it yet, but no one seems to have objected to it either), although semantic issues are often discussed in regard to code-mixing. She claims that PPs can switch if they are temporal or figurative, but not locative.

5.5 Summary

As we can see, there are many controversial topics among scholars. These include the government and free-morpheme constraints, whether code-mixing is a surface or deep structure phenomenon, and whether some communities have a mixed grammar or whether it is always switching between separate grammars.

With regard to the non-language-specific constraints, the evidence that we have looked at suggests that even if there are cases in which they are violated, the general tendency is to follow them. Our current aim is to create a prototype of a resource that will help in determining the conditions under which the constraints do not hold, and that will allow more accuracy in distinguishing between patterns specific to a language pair and patterns specific to a speaker. On the issue of `mixed grammar vs. two separate grammars' I want to point out that most bilingual environments (whether a big community or just a family, for instance) presume convergence of languages through contact (even if this contact exists in only one person's mind); therefore the `mixed grammar' analysis is more plausible. This argument is strengthened by triggering, syntactic convergence, and syntactic transference. I have looked over all the constraints and discussed the arguments for and against them. After analyzing the data I cannot myself take any one of them as universal, but I can certainly accept the tendency that they represent. I will let the reader decide whether to accept or reject each of them, but I hope that the principles I have developed for the annotation of multilingual texts and the model of the corpus that has been built will be of help in testing any of the reader's theories regarding these or any other constraints.

Many difficulties in the discussion of code-mixing constraints are, however, due to the unclear division between code-mixing and borrowing/transference/interference, as well as to the use of the term `ungrammatical' for what is just a tendency. To create annotation principles I had to decide on the distinction between the terms and on their boundaries myself. The decisions and the reasons for them will be discussed in the next part.

6. Annotation principles

Based on what we know about code-mixing now, I have compiled a list of things that should be annotated. This compilation is based on the assumption that only two languages are being switched in the text that is being annotated. This approach is chosen simply to keep the explanation simple. With a few minor modifications the principles can be applied when the author (or speaker, if the corpus is recorded) uses more languages.

First of all, it is important to annotate every single word with its language. The language of the word can be determined with the use of a grammar dictionary; however, it is false to assume that we can determine the language by looking solely at morphology. The word can be a nonce-borrowing and have a stem of one language but the morphology of another. If one considers nonce-loans a part of the language whose morphology they have acquired, then this point may not make too much sense; however, it is still important to make sure that all morphological markers belong to one language. Moreover, relying on the morphology of the word when determining its language is sufficient only under the assumption that the free-morpheme constraint is not violated. This, however, as we have seen, is not always the case. This work is based on the annotation format of UniParser (Arkhangelskiy et al. 2012), in which only separate words are annotated, but not sentences or clauses. Nonetheless, this should not stop us from marking phenomena that involve multiple elements. If the switch involves several constituents, the first constituent should be marked as such, and the rest should be annotated as following that one. Describing which elements are `mixed-in' should also make it possible to distinguish different `directions' of code-switching, for example if one wants to search separately for alternation switches from L1 to L2 and from L2 to L1.
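The word-level marking described above can be sketched as follows. This is not the UniParser format itself: the `Token` class and its field names are hypothetical and only illustrate the idea of putting the switch information on the first word of a switched run and marking the remaining words as following it. The sketch assumes that every word already carries a language tag.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    word: str
    lang: str                          # language tag of the word
    switch_from: Optional[str] = None  # set only on the first word of a switched run
    run_length: int = 0                # length of the run, set on its first word
    continues_switch: bool = False     # True for non-initial words of a switched run


def mark_switches(tokens: List[Token]) -> List[Token]:
    """Mark the first word of every switched run with direction and length."""
    i = 0
    prev_lang = None
    while i < len(tokens):
        # Find the end of the run of words in the same language.
        j = i
        while j < len(tokens) and tokens[j].lang == tokens[i].lang:
            j += 1
        if prev_lang is not None:
            tokens[i].switch_from = prev_lang
            tokens[i].run_length = j - i
            for follower in tokens[i + 1:j]:
                follower.continues_switch = True
        prev_lang = tokens[i].lang
        i = j
    return tokens


# Example (4) from the text: Adios, amigos! See you tomorrow.
utterance = mark_switches([
    Token("Adios", "spa"), Token("amigos", "spa"),
    Token("See", "eng"), Token("you", "eng"), Token("tomorrow", "eng"),
])

# Searching for one direction only: switches from Spanish into English.
spa_to_eng = [t for t in utterance if t.switch_from == "spa" and t.lang == "eng"]
```

Because the direction and the length are stored on the first word of the run, a search for one direction of switching, or for runs longer than one word, only needs to look at word-level annotation.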

When annotating insertion, several strategies can be chosen, depending on one's understanding of what the term stands for. For this work the approach was chosen on the basis of our capabilities for automatic annotation. It is, however, also one of the most popular approaches today. We consider an insertion to be any occurrence of one or more consecutive words in a language different from the matrix language that is placed inside the sentence, meaning that it is neither at the beginning nor at the end of that sentence, provided that there is only a single insertion in the sentence (otherwise it would qualify as congruent lexicalization). However, if the inserted word or words exist in both languages, then this should be considered borrowing rather than insertion. The tag marking the length of the switched segment should also be placed on the first word, so that anyone who considers all single-word insertions to be borrowings can disregard them. As for the matrix language, determining it is only needed for annotating insertion; we have decided to choose the left-to-right approach. It seems to be the most suitable strategy in our situation, as the first word of the sentence in our annotation is always in the matrix language if the sentence contains an insertion. When an element is inserted at the beginning of the sentence, it can only be considered alternation or congruent lexicalization. Thus, if inserted elements get the `insertion' marker, the matrix language is naturally the opposite one.
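The insertion rule above can be written down as a small decision procedure. The sketch below is only an approximation under stated assumptions: the sentence is given as (word, language) pairs, there are only two languages, and `shared_words` is a hypothetical lookup of word forms that exist in both of them. The label `other` covers the cases that the rule leaves to alternation or congruent lexicalization without deciding between them.

```python
from typing import List, Optional, Set, Tuple


def classify_sentence(tokens: List[Tuple[str, str]],
                      shared_words: Optional[Set[str]] = None) -> str:
    """Apply the left-to-right matrix-language rule to one sentence."""
    shared = shared_words or set()
    if not tokens:
        return "empty"
    matrix = tokens[0][1]  # left-to-right: the first word sets the matrix language
    runs = []              # (start, end) of each run in the other language
    i, n = 0, len(tokens)
    while i < n:
        if tokens[i][1] != matrix:
            j = i
            while j < n and tokens[j][1] != matrix:
                j += 1
            runs.append((i, j))
            i = j
        else:
            i += 1
    if not runs:
        return "monolingual"
    if len(runs) > 1:
        return "congruent_lexicalization"
    start, end = runs[0]
    if end == n:
        return "other"  # the run reaches the end of the sentence
    if all(word.lower() in shared for word, _ in tokens[start:end]):
        return "borrowing"
    return "insertion"


# Placeholder words; L1 and L2 stand for the two languages of the corpus.
example = [("w1", "L1"), ("w2", "L2"), ("w3", "L1")]
label = classify_sentence(example)
label_if_shared = classify_sentence(example, {"w2"})
```

Since the first word is by definition in the matrix language, a run of the other language can never start the sentence under this rule, which is why only the end of the sentence has to be checked explicitly.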
