[osmt] German -- English MT baby system

Sun Jul 16 19:35:13 CEST 2006

Stephan Oepen wrote:
> hi again,
>
> at last, here is a summary of what happened last week while micha was
> visiting.  the overall goal of his thesis research is the acquisition
> of transfer rules from sentence-aligned sets of MRSs, for example the
> results of parsing aligned EuroParl sentences with suitable grammars.
> this technique would be applicable, in principle, for LOGON too.
>
> in part when working with francis already, we made some adaptations to
> increase our multi-lingual abilities, including:
>
>   - i moved bits of the NoEn transfer grammar that can be shared across
>     language pairs into the `$LOGONROOT/uio/' directory; for now, large
>     parts of the LOGON `mrs.tdl', `predicates.tdl', and `mtr.tdl' have
>     been promoted to this status: (candidate) language-independent core
>     transfer elements.  furthermore, there are `predicates.erg.tdl' and 
>     `predicates.gg.tdl' in that directory, which are ERG-specific (and
>     GG-specific, respectively) but can still be shared across transfer
>     grammars involving english (and german, respectively).
>
>   - i designed a mechanism called variable property mapping (`VPM'), so
>     as to replace what used to be semi-procedural within transfer with
>     a more powerful and declarative mechanism.  the renaming of things
>     like TENSE, NUM, and PERS into the ERG-internal E.TENSE and PNG.PN
>     now happens inside of the ERG; see `$LOGONROOT/lingo/erg/semi.vpm'.
>     the ERG on-line demo at `http://erg.emmtee.net' now reflects these
>     changes.  i expect dan will find things to fine-tune in my mapping
>     over time :-).
>
>   - now with VPMs available, i purged the former
>
>       %transfer-input-defaults%
>       %transfer-properties-accumulator%
>       %transfer-properties-defaults%
>       %transfer-properties-filter%
>       %transfer-values-filter%
>
>     from the code and (i think) all transfer grammars.  also, i set the
>     following to nil in the ERG:
>
>       %mrs-extras-defaults%
>       %mrs-extras-filter%
>
>     both un-filling and defaulting can now be achieved in VPMs.
>
>   - in NoEn, i eliminated ERG-specific values (e.g. `no_tense'); there
>     are two new VPMs in the transfer grammar, defined as :in and :out
>     parameters to read-transfer-rules() calls (see the `script').  for
>     the time being, the :in and :out VPMs mostly pass through, though
>     there still is a tiny amount of renaming to bridge between LOGON
>     best practice and various DELPH-IN conventions: i hope to further
>     harmonize over time and will make a few concrete proposals.
>
>   - to get started on german, we imported the DFKI open-source grammar
>     (GG) into the LOGON CVS repository as `$LOGONROOT/dfki/gg'.  based
>     on some informed guessing about the range of properties and values
>     in german, i constructed an initial `semi.vpm' for GG too.  i know
>     berthold will find many things to correct here!
>
>   - in order to get a training MRS bank for micha, he translated the
>     MRS test suite into german, we created an [incr tsdb()] skeleton,
>     and then parsed it (using the LKB, such that the VPM could apply)
>     and treebanked (hastily).  the initial training corpus, thus, had
>     some 80 sentences.  exporting MRSs in [incr tsdb()] provides the
>     starting point to the transfer rule acquisition script.
>
>   - to get a stub DeEn transfer grammar, we copied NoEn and purged most
>     of its content.  micha ran his acquisition script, and we included
>     the result as `mrs.mtr' in `$LOGONROOT/uio/deen' (1233 rules).  at
>     this point, we had a first working system and started fine-tuning.
>     the graph attached to this message shows incremental improvements
>     over the first 24 or so hours :-).
>
>   - it seems there are a number of strategic choices to be made during
>     transfer rule generation, many of them related to how to `carve up'
>     aligned MRSs, and how much mis-alignment to tolerate.  i hear micha
>     has continued problem analysis and fine-tuning since he left oslo,
>     and i look forward to final results in his thesis :-).
>
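
To make the VPM bullets above a bit more concrete, here is a toy Python sketch of a declarative, first-match property mapping with un-filling and defaulting. It is only an illustration under stated assumptions: it does not reproduce the `semi.vpm' file format or the LKB code, the names RULES and map_properties() are hypothetical, and apart from `no_tense' all values (and the MOOD property) are invented placeholders.

```python
ANY = "*"      # left side: any value that is present; right side: copy it over
ABSENT = None  # left side: property is missing; right side: leave it unset

# Each rule: (source properties, target properties, ordered value pairs).
# Hypothetical table for illustration; not the contents of any real VPM.
RULES = [
    (("E.TENSE",), ("TENSE",), [
        (("no_tense",), (ABSENT,)),     # un-filling: drop a grammar-internal value
        ((ANY,), (ANY,)),               # everything else passes through
    ]),
    (("PNG.PN",), ("PERS", "NUM"), [    # one property split into two
        (("3s",), ("3", "sg")),         # invented placeholder values
        (("3p",), ("3", "pl")),
    ]),
    (("MOOD",), ("MOOD",), [            # invented property, to show defaulting
        ((ABSENT,), ("indicative",)),   # defaulting: fill in when missing
        ((ANY,), (ANY,)),
    ]),
]


def matches(pattern, value):
    """ANY matches every present value; anything else must be equal."""
    if pattern == ANY:
        return value is not ABSENT
    return pattern == value


def map_properties(props, rules=RULES):
    """Apply each rule once; properties not covered by any rule are dropped."""
    out = {}
    for sources, targets, pairs in rules:
        values = tuple(props.get(name, ABSENT) for name in sources)
        for left, right in pairs:
            if all(matches(p, v) for p, v in zip(left, values)):
                for i, (name, new) in enumerate(zip(targets, right)):
                    if new == ANY:
                        new = values[i]  # copy the source value in the same position
                    if new is not ABSENT:
                        out[name] = new
                break  # the first matching value pair wins
    return out
```

For example, map_properties({"E.TENSE": "no_tense"}) drops the tense value and only fills in the default for the invented MOOD property. Keeping the rules as plain data is the point of the sketch: the same kind of table could in principle be read in the opposite direction for the reverse mapping.
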
> so much for the original DeEn experiment.  berthold has since agreed to
> become the maintainer of the GG-specific VPM and upload future versions 
> of the grammar into CVS (much like what francis does for JaCY).  we did
> spend half an hour or so repeating the same steps for EnDe, but to date
> we have not even looked at the results in detail.  in principle, the transfer
> rule acquisition approach should work well bi-directionally.
>
> some of the things that remain to be done:
>
>   - it seems the (LKB) unknown name and number generation is sensitive
>     to capitalization, much to my surprise: [ CARG "berthold" ] worked
>     fine, [ CARG "Berthold" ] did not.  also, we should add the generic
>     lexical entries into GG once we start getting serious about EnDe.
>
>   - i need to recompile PET to include the VPM code; currently, we are
>     limited to using the LKB for parsing too.
>
>   - to run things from source, DFKI has to locate their ACL 8.0 license
>     file (probably sent in email late in 2005).
>
>   - for the MRS test suite, GG coverage left room for improvement; also
>     it would seem worthwhile to systematically harmonize analyses, much
>     like we did across NorGram, the ERG, and JaCY.
>
> once we are content with this approach on the MRS test suite, we should
> consider going back to the EuroParl materials, maybe pick five thousand
> or so sentences and work on them systematically.  one problem i noticed
> when looking at EuroParl for the first time is the pre-processing that
> was applied:
>
>   In the meantime , I should like to observe a minute' - s silence ,
>
> presumably, we would have to go back to the original text and redo the
> sentence alignment (they supply the scripts they used, so it should be
> feasible to do this).
>
> micha, berthold, francis, everyone, please correct and comment as you
> see fit!  i plan to use a bit of my time this coming week to wrap up
> LOGON for a first public release at ACL|COLING.  hence, it might make
> sense to try and get updated versions of GG and (at least) DeEn into
> CVS before the end of the week?  francis, what is the status of JaEn
> and EnJa?
>
>   
Dear Stephan,

LOGON support is now part of the standard GG distribution. Since I had 
trouble with cvs write access last Fri, please get the latest version 
from http://gg.dfki.de. It now also generates from sanitised MRSs. Can 
you replace the entire dfki/gg subdirectory with the contents of the 
official distribution? Note that the script file is "lkb/script" and not 
"lkb/_script", as it was in the pre-release.

Micha, since I had to change semi.vpm, can you regenerate the DE-EN 
transfer rules (and EN-DE)?  You can get the current grammar either 
from the web site, from SVN, or from /project/cl/systems/logon/dfki/gg/. 
Please do not forget to reparse the MRS test suite and update the 
treebank. The annotated treebank you can use for updating is in 
/project/cl/systems/lingo/lkb/src/tsdb/home/MRS-DE/.

Stephan, the .mtr files Micha sent to you last Fri probably do not work 
with the corrected semi.vpm.  So for the demo, either use the old 
system, or update both GG and the transfer grammars.

EN-DE used to fail for one of two reasons: first, a difference in case 
in CARG, and second, the index features prontype and grind, neither of 
which was defined for German. I have now incorporated these features 
into the semi.vpm mappings, although I have not yet been able to test 
them. The CARG problem still persists.
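
To see how often the case difference in CARG actually bites, something like the following Python sketch could be run over exported CARG values. It is a diagnostic only, under stated assumptions: the function find_carg_mismatches() and the idea of collecting CARG values into plain Python collections are hypothetical, and nothing here reflects how the LKB or GG look up names internally.

```python
def find_carg_mismatches(source_cargs, known_cargs):
    """Map each CARG that fails an exact lookup to known values differing only in case.

    source_cargs: CARG strings coming out of transfer (hypothetical input).
    known_cargs:  CARG strings the target grammar can generate from (hypothetical).
    """
    known = set(known_cargs)

    # Index the known values by their case-folded form.
    by_folded = {}
    for carg in known:
        by_folded.setdefault(carg.casefold(), []).append(carg)

    mismatches = {}
    for carg in source_cargs:
        if carg in known:
            continue  # exact match, no problem
        candidates = by_folded.get(carg.casefold(), [])
        if candidates:
            mismatches[carg] = sorted(candidates)  # differs from these only in case
    return mismatches
```

Values that have no counterpart at all, even ignoring case, are left out on purpose, since they point to a different problem (a missing entry rather than a case difference).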

Best,

Berthold

>                                                    all best  -  oe
>
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> +++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2285 7989
> +++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
> +++       --- oe at csli.stanford.edu; oe at ifi.uio.no; stephan at oepen.net ---
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>