
French Idiom Transducer
Michelle Gregory, Patrick Kellogg, and David Mankin

0. Introduction
1. Methodology
2. Experiments
3. General Discussion
4. Conclusion
References
Appendix A: List of French Idioms
Appendix B: French Paragraph
Appendix C: List of programs and associated files

0. Introduction.

Machine Translation is perhaps one of the most difficult tasks in Natural Language Processing. Machine translation (MT) involves various sub-fields of NLP, including tagging, parsing, and generation. The goal of MT is to convert a text in one language, the source language, to a text in another language, the target language, with the meaning preserved. The input to an MT program is, thus, a string of words or sentences in the source language. The input must be parsed both semantically and syntactically. These parses are sent to the generator, which decodes the parses in the target language to construct an output that is similar in meaning to the input. This process is exemplified in Figure 1 (adapted from Wehrli 1998, pg. 1390). (1)

In order for MT to be successful, a machine translator must have knowledge of words, some sort of semantic representation, structural information, and pragmatic knowledge in both the source and the target languages. If languages differed only in their symbolic representations of words, the task of MT would be easy. In fact, MT does quite well on translation tasks in which a single word in each language corresponds to the same object or concept, as exemplified in (1) (translated on Babel Fish, http://babelfish.altavista.com/): (2)

(1) French    English
    bois      wood

Similarly, if the translation task involves only short clauses in which there is a one-to-one mapping between the source and the target (even if that mapping involves changes in word order), MT systems do fairly well:

(2) J'ai construit les étagères en bois.
1SG-pp build-pastdef-det-p shelves of wood
'I built the wood shelves'

However, translation tasks are generally more complex than exemplified here. One of the most difficult problems in MT is the translation of expressions that cannot be translated compositionally from their parts (Storrer & Schwall 1995). Such expressions include idioms (kick the bucket), phrasal verbs (look up), and colloquial and jargon-like terms. Consider the following translation from Babel Fish:

(3) French Idiom         English Translation    Babel Fish Translation
    manger le morceau    to spill the beans     to eat the piece

In this example, the French is literally translated as 'eat the piece', but the idiomatic meaning 'spill the beans' was the intended target. (3)

As one might expect, non-compositional expressions need to be added to the lexicon of the source (and target) language if they are to be correctly translated. However, many non-compositional expressions can undergo syntactic transformations, especially if they involve a verb. Thus, any realistic attempt to augment an MT system with non-compositional expressions must be able to identify these expressions while still allowing for their productivity (Wehrli 1998). Translation of non-compositional expressions has been investigated for Korean-English translators (Sung & Yung 1993) and French-English translators (Wehrli 1998), among others. However, to our knowledge, there has been no commercial implementation of an MT system that includes detailed knowledge of non-compositional expressions. Our evidence for this is that we could not find a translation program available on the web that could accurately translate the list of French idioms we gave it. We should note that plenty of stock phrases are translated correctly; it is just that none of the programs seems to be augmented with a dictionary of traditional idiomatic phrases.

In this paper, we propose a French Idiom Transducer (FIT) that works in conjunction with Systran web translators. Systran has been in the translation business for over 40 years and is the technology behind several web translators, such as Alta Vista's Babel Fish (Jurafsky & Martin 2000). FIT is a wrapper program that takes as input strings in the source language, identifies the idiomatic strings, and sends the text to a Systran web translator. It then uses its knowledge of the input idioms to change the output from the literal translations to the correct idiomatic phrases. We describe the methodology and development of FIT in §1. In §2 we detail two experiments run with FIT, providing a discussion of the results. In §3 we provide a general discussion, with a conclusion following in §4.

1. Methodology

Examples such as 3, above, demonstrate the need to augment MT with non-compositional expressions. In this section we describe the tools and methodology involved in our French Idiom Transducer. We used a variety of computational tools to translate non-compositional expressions, including both existing tools and ones developed specifically for this task. Two major tools were developed for this task: the first was FIT itself, and the second was an interface through which to use it.

1.1. French Idiom Transducer. The French Idiom Transducer was developed to wrap an existing web interface program, as shown in figure 2, below.

The FIT program consists of several information tables and a program to process text. The information tables are:

  1. Idiomatic Lexical Database (idioms_full.txt). This consists of a list of idioms with four canonical forms for each: the French source, the way the Babel Fish translator translates the idiom literally, the correct English idiomatic translation of the French, and the way the translator translates this English text back into French. (This last value is not currently used by FIT, but would be needed to implement the reverse system.)
  2. French Verb List (full_conj_fr.txt). This list contains 296 French verbs, conjugated in 45 forms each. (The list was built from a list of verbs available here: http://www.arts.uwaterloo.ca/FREN/pleiade/dico-c.htm)
  3. English Verb List (full_conj_en.txt). This list contains 15 English verbs conjugated in 33 (non-unique) forms each. This list was built by hand, and provides coverage of the verbs that were actually needed for our FIT translation program.
  4. French Form to English Form Mapping (map.txt). This list contains a mapping from each French conjugated verb form to the matching English conjugated form.

The FIT program processes text in two stages. The first stage locates idiomatic phrases in the French source text. The second stage replaces the poorly translated English in the output of the Babel Fish translator.

In the first stage, FIT used the idiomatic database and the French verb list to construct a regular expression that would match all forms of each idiom.  For instance, the idiom in 4, repeated from 3, above, generated the following expression:

(4) French Idiom         English Translation    Babel Fish Translation
    manger le morceau    to spill the beans     to eat the piece

\s*(mangais|mangait|mangez|mangèrent|manga|mange|mangâmes|mangerons|mangeront|mangiez|a mangé|mangâtes|avons mangé|mangerions |mangaient|aimangé|mangerais|mangerait|mangera|mangions|as mangé| avez mangé |mangeriez|mangerai|mangent|mangeraient|mangai |mangeras|mangons|mangas|mangerez|manger|manges|ont mangé)(?:\b|\s+) (le morceau)(?:\b|\s+)
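As a rough illustration of how such an expression can be assembled, the following Python sketch builds a pattern of the same shape from a list of conjugated forms and the literal remainder of the idiom. FIT itself was written in Perl; the function name build_idiom_pattern and the short list of verb forms are assumptions made for this example, not the contents of full_conj_fr.txt.

```python
import re

def build_idiom_pattern(verb_forms, literal):
    """Build a regex that matches any listed verb form followed by the literal part.

    Hypothetical helper: mirrors the shape of the expression shown above.
    """
    # Put longer forms first so multi-word forms such as "a mangé" are tried
    # before shorter alternatives.
    ordered = sorted(verb_forms, key=len, reverse=True)
    verbs = "|".join(re.escape(form) for form in ordered)
    return re.compile(
        r"\s*(" + verbs + r")(?:\b|\s+)(" + re.escape(literal) + r")(?:\b|\s+)"
    )

# Illustrative subset of forms of "manger" (not the full 45-form table).
pattern = build_idiom_pattern(["mange", "a mangé", "manger", "mangent"], "le morceau")
match = pattern.search("L'homme a mangé le morceau.")
if match:
    print(match.group(1), "|", match.group(2))
```

Group 1 holds the conjugated verb that was found and group 2 the literal part, which is the information a later stage would need in order to choose the English verb form.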

The source text was then searched one word at a time looking for matches to each of the idioms in the idiom list. When an idiom was found, several pieces of information about it were saved on the idiom stack where they could be retrieved by stage two. 
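The search-and-record step might be sketched in Python as follows. This is only an illustration under stated assumptions: the IdiomHit record, the find_idioms function, and the two hand-written patterns are hypothetical, and the sketch collects regex matches and sorts them by position rather than stepping through the text one word at a time as FIT does.

```python
import re
from dataclasses import dataclass

@dataclass
class IdiomHit:
    canonical: str   # French idiom in canonical (infinitive) form
    verb_form: str   # conjugated verb actually found, "" for non-verb idioms
    position: int    # character offset of the match in the source text

def find_idioms(text, patterns):
    """Return the idioms found in text, ordered by where they occur.

    patterns maps a canonical idiom to a compiled regex; a named group
    'verb' (optional) captures the conjugated verb form.
    """
    hits = []
    for canonical, pattern in patterns.items():
        for m in pattern.finditer(text):
            verb = m.groupdict().get("verb") or ""
            hits.append(IdiomHit(canonical, verb, m.start()))
    hits.sort(key=lambda hit: hit.position)
    return hits

# Hypothetical patterns with only a few verb forms each.
patterns = {
    "manger le morceau": re.compile(r"(?P<verb>a mangé|mange|manger)\s+le morceau"),
    "dans le bain": re.compile(r"dans le bain"),
}
stack = find_idioms("Il m'a mis dans le bain, puis il a mangé le morceau.", patterns)
for hit in stack:
    print(hit)
```

Keeping the hits in source order lets the second stage look for the expected literal translations in the same order in the English output.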

The canonical forms of an idiom included markings of which parts were the verb phrases, and which parts were to be used literally (no need for conjugation or agreement changes).  The canonical format is extensible to be able to mark adjectives that must agree or other similar issues, though we did not use any of these in the current version of FIT.  In the canonical form the verbs were all listed in their infinitive forms.  With this available, it was easy to both expand to all forms in order to make expressions to match verbs, and to conjugate to create a specific form.

With this information recorded, FIT proceeds to stage two. The first step in stage two is to translate the French source to English using Babel Fish. The goal of stage two is to find the literal translations of the idioms in the English text and replace them with properly formed idiomatic phrases. To accomplish this, FIT starts with the idiom stack and looks for the expected literal English translation of the first idiom from the original source. As in stage one, the expected literal canonical form is expanded into a regular expression used to search through the translated output. For example, the literal translation of idiom 4, above, produced the following expression:

\s*(are eating|have eaten|may eat|will eat|is eating|am eating|eating|has eaten|had eaten|eat|ate|eats|to eat)(?:\b|\s+)(the piece)(?:\b|\s+)

Using the expression of the next expected idiom, the string is searched word by word until the literal text is found. When it is found, the forms of the French verbs and the forms of the English literal verbs are unified to find the appropriate form(s) of the verbs in the idiomatic phrase. This is challenging because the verb form cannot be determined uniquely from the conjugated form in either French or English. For instance, several forms of some French verbs (e.g. 1SG-IND, 1SG-SBJ, 3SG-IND, 3SG-SBJ) share the same conjugation, yet the corresponding English forms do not. Similarly, some English verbs share a conjugation between forms that other verbs do not (e.g. cut is both 1SG-PR and PAS, while take is 1SG-PR and took is PAS). However, using the possible French forms together with the possible English forms produces closer to unambiguous results that can be used to conjugate the idiomatic phrase.
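One simple way to picture the unification step is as a set intersection: the forms compatible with the French verb, mapped into English form labels, are intersected with the forms compatible with the literal English verb. The Python sketch below illustrates that idea only. The tables, the form labels, and the function name unify are all invented for the example and do not reproduce the contents of full_conj_fr.txt, full_conj_en.txt, or map.txt.

```python
# Hypothetical tables: each conjugated form maps to every form label it could be.
FRENCH_FORMS = {
    "mange": {"1SG-IND", "3SG-IND", "1SG-SBJ", "3SG-SBJ"},
}
ENGLISH_FORMS = {
    "eats": {"3SG-PR"},
    "eat": {"1SG-PR", "2SG-PR", "1PL-PR", "3PL-PR", "INF"},
}
# Hypothetical French-label to English-label mapping (the role played by map.txt).
FR_TO_EN = {
    "1SG-IND": "1SG-PR",
    "3SG-IND": "3SG-PR",
    "1SG-SBJ": "1SG-PR",
    "3SG-SBJ": "3SG-PR",
}

def unify(french_verb, english_verb):
    """Return the English form labels consistent with both observed verbs."""
    from_french = {
        FR_TO_EN[label]
        for label in FRENCH_FORMS.get(french_verb, set())
        if label in FR_TO_EN
    }
    from_english = ENGLISH_FORMS.get(english_verb, set())
    return from_french & from_english

print(unify("mange", "eats"))
print(unify("mange", "eat"))
```

Neither verb alone pins down the form in this toy example, but the intersection leaves a single label in each case, which could then be used to conjugate the verb of the idiomatic phrase.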

The last step in stage two is to conjugate the idiom and replace the literal translation of the idiom in the translated string with the properly conjugated idiom. Finally, the improved translation is returned from FIT. The process is exemplified in Figure 3 for the sentence L'homme a mangé le morceau.
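The replacement step can be sketched in Python as a regular-expression substitution. This is a simplified illustration, not FIT's Perl code: the table LITERAL_TO_IDIOM_VERB stands in for the result of conjugating the idiom's verb, the function name replace_literal is hypothetical, and the English input sentences are invented rather than actual Babel Fish output.

```python
import re

# Hypothetical table pairing each literal verb form with the idiom's verb
# in the same form (the outcome of the conjugation step).
LITERAL_TO_IDIOM_VERB = {
    "ate": "spilled",
    "eats": "spills",
    "eat": "spill",
    "has eaten": "has spilled",
}

def replace_literal(english, literal_tail="the piece", idiom_tail="the beans"):
    """Replace '<literal verb> the piece' with '<idiom verb> the beans'."""
    # Longer verb forms first so "has eaten" is preferred over shorter forms.
    ordered = sorted(LITERAL_TO_IDIOM_VERB, key=len, reverse=True)
    verbs = "|".join(re.escape(v) for v in ordered)
    pattern = re.compile(r"\b(" + verbs + r")\s+" + re.escape(literal_tail) + r"\b")
    return pattern.sub(
        lambda m: LITERAL_TO_IDIOM_VERB[m.group(1)] + " " + idiom_tail,
        english,
    )

print(replace_literal("The man ate the piece."))
print(replace_literal("The man has eaten the piece."))
```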

1.2. User Interface. We created a web interface to make FIT accessible and user friendly. To use the web-based translation, one can call the Perl routine "idiom.pl" using the URL http://csel.cs.colorado.edu/~kelloggp/cgi-bin/idiom.pl . This URL produces the interface diagrammed in Figure 4.

Figure 4: Web Interface for FIT

The page will translate paragraphs of text, or the user can enter a fully-defined URL to translate an entire web page.

Since web pages are written in HTML, we need to parse the original web page in order to translate it. HTML (hypertext markup language) is made up of "text" and "tags" that describe how to format the text. We want to translate the text on the web page without disturbing the tags, since English-language browsers will not understand tags that have been translated. (For example, turning the tag <BODY> into the French <CORPS> would cause browsers to fail to recognize it.)
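A minimal Python sketch of separating text from tags is given below. It assumes a naive definition of a tag (anything between "<" and the next ">"), which is not a full HTML parser and would mishandle comments or scripts containing ">"; the function name tokenize_html is hypothetical and this is not the parsing code used in webpage.pl.

```python
import re

# Naive tag pattern: "<", anything except ">", then ">".
TAG = re.compile(r"(<[^>]*>)")

def tokenize_html(html):
    """Split html into ("tag", ...) and ("text", ...) tokens in document order."""
    tokens = []
    for piece in TAG.split(html):
        if not piece:
            continue  # re.split yields empty strings between adjacent tags
        kind = "tag" if TAG.fullmatch(piece) else "text"
        tokens.append((kind, piece))
    return tokens

for token in tokenize_html("<BODY><P>Bonjour le monde</P></BODY>"):
    print(token)
```

Only the "text" tokens would be sent for translation; the "tag" tokens are kept untouched so that the page can be rebuilt afterwards.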

Figure 5.  Michelle's lovely homepage in its original form.

Figure 6. Michelle's lovely homepage after running through the web page translation interface.  (Note, it's in French now.  Oui!)

However, translating the text between tags "one-at-a-time" is also a bad idea, since each translation requires a call to http://babelfish.altavista.com. The web site is often busy and can take 15-30 seconds to respond. If we translate over twenty individual fragments, the translation interface will often "time out" and return with an error. So, we decided to pull the text out of the web page and pass it to http://babelfish.altavista.com as a single string of the form "Text fragment one . XXX . text fragment two . XXX . etc". The " . XXX . " separators are ignored by Babel Fish, and the periods force each text fragment to be treated as a separate sentence. The tags are saved separately in an array, to be used later.
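The packing and unpacking of fragments around the separator can be sketched as follows in Python. The function names pack and unpack are hypothetical, and the tolerant splitting pattern is an assumption about how the translator might alter whitespace around the separator, not something the original program is known to do.

```python
import re

SEPARATOR = " . XXX . "

def pack(fragments):
    """Join text fragments into one string for a single translation request."""
    return SEPARATOR.join(fragments)

def unpack(translated):
    """Split the translated string back into fragments.

    Whitespace around the periods is treated as optional, in case the
    translator changes the spacing of the separator.
    """
    parts = re.split(r"\s*\.\s*XXX\s*\.\s*", translated)
    return [part.strip() for part in parts]

packed = pack(["Bonjour", "le monde"])
print(packed)
print(unpack("Hello . XXX . the world"))
```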

There are three problems with this approach. First, authors can write web pages in many different ways. A sentence in HTML can span several different lines, with tags in the middle of the sentence. For example, "This is<BR>a legal<BR>HTML sentence" is valid. However, our code will treat the text between tags as individual sentences, leading to a strange translation.  Identifying sentence boundaries across several lines might be an intractable problem.

After the text string comes back from the translator, we merge it back into the web page, alternating the translated text with the saved tag array. Second, Babel Fish limits how many words can be translated at one time, so sometimes the returned text is truncated and half of the final web page will be missing. This could be fixed by making repeated calls to the translator, but we have not added that functionality to our current version. Third, punctuation is sometimes deleted by our program, since we use periods in our separator ( . XXX . ).
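The two ideas in this paragraph, rebuilding the page from saved tags and splitting the work into several calls, might look like the following Python sketch. Neither function is part of FIT: reassemble and batch_fragments are hypothetical names, the (kind, value) token format is an assumption, and max_words is a placeholder because the actual Babel Fish word limit is not stated here.

```python
def reassemble(tokens, translated):
    """Rebuild a page from (kind, value) tokens and translated text fragments.

    If the translation was truncated, the remaining text tokens keep their
    original (untranslated) value instead of disappearing.
    """
    remaining = iter(translated)
    out = []
    for kind, value in tokens:
        if kind == "tag":
            out.append(value)
        else:
            out.append(next(remaining, value))
    return "".join(out)

def batch_fragments(fragments, max_words):
    """Group fragments into batches of at most max_words words each.

    A single fragment longer than max_words still gets its own batch.
    """
    batches, current, count = [], [], 0
    for fragment in fragments:
        n = len(fragment.split())
        if current and count + n > max_words:
            batches.append(current)
            current, count = [], 0
        current.append(fragment)
        count += n
    if current:
        batches.append(current)
    return batches

tokens = [("tag", "<P>"), ("text", "Bonjour"), ("tag", "<BR>"),
          ("text", "le monde"), ("tag", "</P>")]
print(reassemble(tokens, ["Hello", "the world"]))
print(batch_fragments(["a b", "c d e", "f"], max_words=4))
```

Each batch would be translated with its own call, at the cost of more requests to a site that is already slow to respond.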

Our full list of English verb conjugations was written by hand, but for French conjugations, we used a web page from http://www.arts.uwaterloo.ca/FREN/pleiade/dico-c.htm.  To convert this from an HTML document to a machine-readable text file took one more Perl program, conjhtml2txt.pl. We also wrote numerous test programs and other Perl scripting files to assist development.

2. Experiments

In order to test the implementation of FIT, we first composed a list of French idioms and phrasal verbs. The list was compiled from common English phrasal verbs and from French idioms in a French idiom dictionary (Denoeu, Sices, & Sices 1996). The common English phrasal verbs were extracted from several English websites. To ensure that they had French counterparts, we used only the phrasal verbs that appeared in the French-English idiom dictionary. (4) Because most of the translation tasks of FIT are from French to English, we also extracted various French idioms from the dictionary. Our list totals 17 French phrasal verbs and idioms. (5)

We translated each idiom (in list format) using Babel Fish to test whether the idioms were in fact mistranslated. As is evident from Appendix A, each of the idiomatic expressions was mistranslated (Babel Fish translated the idiomatic expressions literally). We then ran FIT on the list to be sure that the program worked. Running the list of French non-compositional expressions through FIT gave us a good opportunity to debug our program and revisit basic issues. Once the program was fixed, FIT correctly translated all 17 French expressions.

However, finding idioms or phrasal verbs in isolation is a pretty simple task, requiring only a regular expression and a dictionary. In more natural translation tasks, the non-compositional expressions will be used in context. This means that the verbs in these expressions will be conjugated differently depending on the subject (both person and number) and the tense used. Additionally, the translation will have to sound natural in the English output at the other end of the translation program. We conducted two experiments to test the usefulness of FIT in real translation tasks.

2.1. Experiment 1. The first test we employed was to see whether some of the idioms listed in Appendix A could be found when used in a natural language context. We had a native French speaker (Franck Biasca, personal communication) write a short paragraph using some of the idioms on our list (see Appendix B). The issues we had to deal with for this experiment included: ensuring that FIT recognized the idioms even with conjugated verbs; ensuring that it knew what to do with the non-verb idioms; ensuring that the conjugated verbs were expanded correctly into both the literal English translation and then the idiomatic English translation; and ensuring that the literal translations in the idiom list matched those actually produced by Babel Fish in the contexts where we expected to find them.

The results of experiment one were successful in most cases. The original text is given in 5a, the Alta Vista translation in 5b, and FIT plus Alta Vista in 5c. As is evident from 5b and 5c, translation performance improved with the addition of FIT. Five of the six idioms were correctly identified and translated. One idiom was not identified at all (It let to me know in 5c). This is because laissé savoir was translated literally by Alta Vista with a pronoun placed inside the verb phrase, so the literal string was not identified by FIT. Of course, the problem of idiom productivity will have to be dealt with; see §3 for more details. (6)

(5a) Original

Hier soir, j'ai recu un coup de fil de mon ami, Philippe.  Il voulait m'inviter voir un film, mais il ne voulait pas faire la queue.  Au contraire, il m'a mis tout de suite dans le bain--trop de queue---pas de film. Il m'a laissé savoir que il me donnerait un coup de coude, sans faire trainer la chose.

(5b) Alta Vista

Yesterday evening, I received a phone call of my friend, Philippe. He wanted to invite me to see a film, but he did not want to make the tail. On the contrary, it put to me immediately in the bath -- too much film tail-not. It let to me know that it would give me a blow of elbow for the bank, without making trainer the thing.

(5c) FIT + Alta Vista

Yesterday evening, I received a ring of my friend, Philippe. He wanted to invite me to see a film, but he did not want to line up. It put to me immediately in the know, too much tail, not of film. It let to me know that it would give me a poke in the ribs, without drawing out the thing.  [Capitalization and punctuation fixed after processing]

As is evident in 5c, there are still some very awkward pieces in the translation. For example, un coup de fil de mon ami is used in the first sentence of the French original. De is a preposition used to express possession in French, but it also corresponds to the English preposition from. The Alta Vista translator used the possessive meaning of de, creating the awkward English phrase a ring of my friend. Note also that the French pronoun il is ambiguous between English he and it. Alta Vista apparently does not have the capability to track reference across sentence boundaries, and thus renders later uses of il as it rather than he. FIT does not address such issues, thus our output is still limited by Alta Vista. This example demonstrates that much more work is needed in MT.

2.2. Planned Experiment 2. Experiment two was designed to test the application of FIT on a practical task: translating a web page. As mentioned above, non-compositional expressions are very colloquial; thus, even a French-English idiom dictionary does not always help one to identify all of these expressions. In order to identify the non-compositional expressions in a specific domain, we had planned to have a French speaker read through a French web page and identify the non-compositional terms and expressions. The domain we planned to use was movie reviews. After identifying the idioms, we planned to add them to our dictionary along with their English translations and conjugations, and then to run the web page through FIT. We are confident that this would work, given that all of the pieces work.

While it is not surprising that FIT would successfully translate the web page, there are still a number of problems that need to be considered. The major one, of course, is that due to the limitations of Babel Fish, we are not able to translate the whole web page, only the first part (see the discussion above in §1.2). Despite these problems, FIT did accurately translate the five French non-compositional expressions.

2.3. Planned Experiment 3. Translating only one web page, and the one that was used to develop the vocabulary, would greatly bias our test in our favor. To ameliorate this, we had planned to choose another domain and look at 3-5 web pages in that domain, and build our non-compositional dictionary based on those web pages. Then we had planned to translate a new web page in the same domain and rate the performance of the system on that page. We considered having a native speaker give "naturalness" ratings on the web page translated by both FIT and Babel Fish, expecting of course that FIT produces a more natural translation. However, due to unexpected technical difficulties, difficulties in scheduling time with native French speakers, and a project due date, we were not able to conduct these final two experiments as planned.

3. General Discussion

Several technical issues are worth discussing with regard to the implementation and performance of FIT. We do not currently have the capability to deal with idioms that are split up, although this could be added.

3.1 Matching particular literal translations. One of the largest problems we had in our system is the dependence of FIT on the expected literal translation of French idioms. It turns out that Babel Fish translates the same string of French words differently depending on the context. While this statement seems intuitive, the extent to which it varies is quite surprising. For instance, Babel Fish translates "entre par effraction dans" as "enter by effraction in", while "Entre par effraction dans." (note capitalization and punctuation) is translated as "Enter by effraction." The translator is translating differently based on whether there is a capital letter or a period. The fact that we are trying to match the predicted literal translation with the actual literal translation makes our program very fragile both to changes in the remote translation engine and to translation in contexts other than those we have examined already. Without a way to reliably pass markers in the text through the translation untouched, we cannot do much better than to find more of the contexts that produce differing literal translations and try to match those.

3.2 Matching verb forms that did not overlap. A second problem we had, which we were able to overcome, was our use of both the French forms of the verbs and the English forms to try to determine the correct conjugation for the idiom. As explained above, the unification step was necessary to correctly conjugate the English form in ambiguous situations. However, there were also times when the unification failed. For example, it was often the case that the Babel Fish translation produced a literal verb form that did not match the possible French verb forms we identified for a particular verb. When unification fails, FIT cannot disambiguate; instead, it uses the first form that matches the literal conjugation that came from the translator. Thus, it is not perfect, but it is close.
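The fallback behaviour described here can be sketched in a few lines of Python. The function name choose_form and the form labels are hypothetical; the sketch assumes both inputs are ordered lists of English form labels, with the English list in the order in which forms match the literal conjugation.

```python
def choose_form(french_candidates, english_candidates):
    """Pick a form label for conjugating the idiom.

    Prefer a label supported by both the French and the English evidence;
    if the two lists share nothing (unification fails), fall back to the
    first label that matched the literal English conjugation.
    """
    french = set(french_candidates)
    shared = [label for label in english_candidates if label in french]
    if shared:
        return shared[0]
    return english_candidates[0] if english_candidates else None

print(choose_form(["3SG-PR", "1SG-PR"], ["PAS", "3SG-PR"]))
print(choose_form(["1SG-PR"], ["PAS", "PP"]))
```

In the second call the lists do not overlap, so the first English candidate is used even though the French evidence does not support it.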

At one time, we had hypothesized that our results would be better if we were to use a tagger on the French to determine the actual form instead of an ambiguous list of possible forms.  These results, however, show that even when our tagger would think it knows the form, it would be a different form that comes out of the translator.

3.3 Partial idiomatic efforts by translation engines. Another interesting effect that we observed is that sometimes the translation engine already tries to perform some idiom replacement. For instance, Babel Fish translates un coup de fil as a phone call. Our idiom list had it labeled as a ring, while the literal French to English translation is a blow of wire. Our substitution of "a ring" for "a phone call" in example 5c above seems to be a waste of effort, despite the fact that it was successful. However, in most other cases Babel Fish makes no such effort. For example, our substitution of a poke in the ribs for un coup de coude is clearly more natural than the literal a blow of elbow, which the translator makes no effort to replace.

3.4 Automating and improving performance. There are two (mutually exclusive) ways we have thought of to improve the automation of FIT. One is to use only the constrained forms of the verbs that are expected in order to match the idioms. We have no evidence to suggest that this would improve performance, but if there are cases where having extra forms of the verb slows down the matching or makes the conjugations more ambiguous, having a list of possible verb forms for each idiom might help. The second way would be to have the system somehow learn to identify the idioms on its own, rather than having them hard-coded for each file of inputs. This could possibly be accomplished by reading web pages that have both French and English versions available. While this is a hard task, statistical methods could be used to spot common replacements that differ significantly from the translator's output. These may point to the correct idioms to replace.

3.5 Results.  While we were relatively successful at accomplishing what we set out to accomplish, our results are not spectacular.  The deficiencies in the MT systems that already exist (especially Babel Fish, with which we have the most experience) are large enough that simply replacing idioms, no matter how successfully, provides little improvement. 

4. Conclusion

While we clearly have not addressed all of the issues involved in the machine translation of non-compositional expressions, we have provided insight into how one might go about it. Perhaps the most glaring problem with our approach is that, as of yet, we have no way to deal with productivity within these expressions beyond changes in verb conjugation. As discussed in Nunberg, Sag & Wasow (1994), idiomatic expressions can be productive in a variety of ways, including modification of words within the idiom (a hard poke in the ribs) and substitution of the words within the expression, as in constructional idioms (the more the merrier; the bigger the better). Thus, it seems that idiom translation could be greatly aided by the use of a parser, as suggested by Wehrli (1998). Nonetheless, we have provided a solution for the machine translation of non-compositional expressions that not only has promise, but actually improves the accuracy of existing commercial MT programs, such as Babel Fish.

The reader may  notice that our literature review is a bit sparse. This is mainly due to the fact that we were focussing on the task at hand. Additionally, many relevant articles were not available at CU, and we lacked the foresight to order them in advance from inter-library loan.

[1] All examples are translations from French to English, unless otherwise noted.

[2] It is important to realize that generally context will be the deciding factor in whether the literal or the idiomatic meaning is the target translation. This will be discussed further.

[3] Of course, it would have been ideal to have a native French speaker pick out frequent French idioms and phrasal verbs from French web pages, and if this were actually implemented, that is what we would do.

[4] Any actual implementation would of course have a much larger dictionary. This small dictionary was developed for test purposes only.

[5] The actual English translation is given in Appendix B. The translation demonstrates that the paragraph is slightly awkward. This is due to the constraint we placed on the French consultant. We asked him to use 3-5 of the idioms in a short paragraph. However, the natural grammatical context for these idioms is what we were after, and this paragraph fulfilled that requirement.

References

Denoeu, F., Sices, D. & Sices, J. (1996). 2001 French and English Idioms. Hauppauge, New York: Barron's.

Jurafsky, D. & Martin, J. H. (2000). Speech and Language Processing. Upper Saddle River: Prentice Hall.

Storrer, A. & Schwall, U. (1995). Description and Acquisition of Multiword Lexemes. Machine Translation and the Lexicon. Third International EAMT Workshop. 35-50.

Sung, H.Y. & Yung, T.K. (1995). Idiom-based Analysis of Natural Language for Machine Translation. Journal of the Korea Information Science Society 20. 1148-58.

Nunberg, G., Sag, I. A. & Wasow, T. (1994). Idioms. Language 70. 491-538.

Wehrli. (1998).

Appendix A: List of French Idioms

French Idiom                      English Translation           Babel Fish Translation
triste comme un bonnet de nuit    as dull as dishwater          sad like a night-cap
manger le morceau                 to spill the beans            to eat the piece
faire la queue                    to line up                    to make the tail
defendre                          stand up for                  defendre
entre par effraction dans         break into                    enter by effraction in
protester comme tous les diables  to storm                      to protest like all the devils
un coup de coude                  a poke in the ribs            a blow of elbow
dans le bain                      in the know                   in the bath
prendre parti pour                to side with                  to take party for
un coup de fil                    a ring (as in 'give a ring')  a phone call
a la petite semaine               short term                    has the small week
laisser savoir                    to let on                     to let know
laisser en panne                  to let down                   to leave broken down
a dessein                         on purpose                    has intention
abattre                           to break down                 to cut down
se briser                         to break up                   to break
du bas                            the bottom line               bottom

Appendix B: French Paragraph

Hier soir, j'ai recu un coup de fil de mon ami, Philippe.  Il voulait m'inviter voir un film, mais il ne voulait pas faire la queue.  Au contraire, il m'a mis tout de suite dans le bain--trop de queue, pas de film. A propos du boulot, il m'a laisse savoir que il me donnerait un coup de coude, sans faire trainer la chose.

English Translation:

Yesterday evening, I received a ring from my friend, Philippe. He wanted to invite me to a movie, but he didn't want to stand in line. To the contrary, he immediately put me in the know--long line, no movie. Then he let on that he gave me a poke in the ribs, without drawing it out.  (he let on that he was joking ….)

Appendix C: List of programs and associated files

idiom.pl recursively calls itself to translate paragraphs of text, or can call the package "webpage.pl" to translate entire URL-defined web pages. In addition, both routines include several Perl packages to perform the actual translation.

conjtable_en.pl
conjtable_fr.pl
tif.pl
translation.pl

and several auxiliary text files:

full_conj_en.txt
full_conj_fr.txt
map.txt
idioms_full.txt