[osmt] German -- English MT baby system

Sun Jul 9 21:20:34 CEST 2006

hi again,

at last, here is a summary of what happened last week while micha was
visiting.  the overall goal of his thesis research is the acquisition
of transfer rules from sentence-aligned sets of MRSs, for example the
results of parsing aligned EuroParl sentences with suitable grammars.
this technique would be applicable, in principle, to LOGON too.

partly while already working with francis, we made some adaptations to
increase our multi-lingual abilities, including:

  - i moved bits of the NoEn transfer grammar that can be shared across
    language pairs into the `$LOGONROOT/uio/' directory; for now, large
    parts of the LOGON `mrs.tdl', `predicates.tdl', and `mtr.tdl' have
    been promoted to this status: (candidate) language-independent core
    transfer elements.  furthermore, there are `predicates.erg.tdl' and 
    `predicates.gg.tdl' in that directory, which are ERG-specific (and
    GG-specific, respectively) but can still be shared across transfer
    grammars involving english (and german, respectively).

  - i designed a mechanism called variable property mapping (`VPM'), so
    as to replace what used to be semi-procedural within transfer with
    a more powerful and declarative mechanism.  the renaming of things
    like TENSE, NUM, and PERS into the ERG-internal E.TENSE and PNG.PN
    now happens inside of the ERG; see `$LOGONROOT/lingo/erg/semi.vpm'.
    the ERG on-line demo at `http://erg.emmtee.net' now reflects these
    changes.  i expect dan will find things to fine-tune in my mapping
    over time :-).

  - now with VPMs available, i purged the former

      %transfer-input-defaults%
      %transfer-properties-accumulator%
      %transfer-properties-defaults%
      %transfer-properties-filter%
      %transfer-values-filter%

    from the code and (i think) all transfer grammars.  also, i set the
    following to nil in the ERG:

      %mrs-extras-defaults%
      %mrs-extras-filter%

    both un-filling and defaulting can now be achieved in VPMs.

  - in NoEn, i eliminated ERG-specific values (e.g. `no_tense'); there
    are two new VPMs in the transfer grammar, defined as :in and :out
    parameters to read-transfer-rules() calls (see the `script').  for
    the time being, the :in and :out VPMs mostly pass through, though
    there still is a tiny amount of renaming to bridge between LOGON
    best practice and various DELPH-IN conventions: i hope to further
    harmonize over time and will make a few concrete proposals.

  - to get started on german, we imported the DFKI open-source grammar
    (GG) into the LOGON CVS repository as `$LOGONROOT/dfki/gg'.  based
    on some informed guessing about the range of properties and values
    in german, i constructed an initial `semi.vpm' for GG too.  i know
    berthold will find many things to correct here!

  - in order to get a training MRS bank, micha translated the MRS test
    suite into german; we created an [incr tsdb()] skeleton, parsed it
    (using the LKB, such that the VPM could apply), and treebanked the
    results (hastily).  the initial training corpus, thus, had some 80
    sentences.  exporting MRSs from [incr tsdb()] provides the starting
    point for the transfer rule acquisition script.

  - to get a stub DeEn transfer grammar, we copied NoEn and purged most
    of its content.  micha ran his acquisition script, and we included
    the result as `mrs.mtr' in `$LOGONROOT/uio/deen' (1233 rules).  at
    this point, we had a first working system and started fine-tuning.
    the graph attached to this message shows incremental improvements
    over the first 24 or so hours :-).

  - it seems there are a number of strategic choices to be made during
    transfer rule generation, many of them related to how to `carve up'
    aligned MRSs, and how much mis-alignment to tolerate.  i hear micha
    has continued problem analysis and fine-tuning since he left oslo,
    and i look forward to final results in his thesis :-).
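
as a rough illustration of the VPM idea described above (renaming
properties, un-filling, and defaulting in one declarative table), here
is a minimal python sketch.  it is not the LKB implementation and does
not follow the actual `semi.vpm' syntax: the rule format, the function
name map_properties(), and the concrete values (`3sg', `past', etc.)
are assumptions made up for this example.  it maps in one direction
only, from grammar-internal to external properties.

```python
# illustrative only: rule format, value names, and function names are
# assumptions for this sketch, not the LKB code or the `semi.vpm' syntax.

# each rule: (internal properties, external properties, value rows);
# a row pairs a tuple of internal values with a tuple of external values.
# None on the external side means: do not output the property (un-fill).
RULES = [
    (("E.TENSE",), ("TENSE",), [
        (("past",), ("past",)),
        (("pres",), ("pres",)),
        (("no_tense",), (None,)),
    ]),
    (("PNG.PN",), ("PERS", "NUM"), [
        (("3sg",), ("3", "sg")),
        (("3pl",), ("3", "pl")),
    ]),
]


def map_properties(properties, rules=RULES, defaults=None):
    """map one variable's internal properties to external ones."""
    result = {}
    consumed = set()
    for internal, external, rows in rules:
        # a rule only applies when all of its internal properties are set
        if not all(name in properties for name in internal):
            continue
        values = tuple(properties[name] for name in internal)
        for source, target in rows:
            if source == values:
                consumed.update(internal)
                for name, value in zip(external, target):
                    if value is not None:
                        result[name] = value
                break
    # properties not handled by any matching row pass through unchanged
    for name, value in properties.items():
        if name not in consumed:
            result.setdefault(name, value)
    # defaulting: fill in external properties that are still unset
    for name, value in (defaults or {}).items():
        result.setdefault(name, value)
    return result
```

in this sketch, a single PNG.PN value is split into separate PERS and
NUM properties, and `no_tense' simply disappears from the output; the
reverse direction would need the same table read right to left.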

so much for the original DeEn experiment.  berthold has since agreed to
become the maintainer of the GG-specific VPM and upload future versions
of the grammar into CVS (much like what francis does for JaCY).  we did
spend half an hour or so on repeating the same steps for EnDe, but to
date we have not even looked at results in detail.  in principle, the
transfer rule acquisition approach should work well bi-directionally.

some of the things that remain to be done:

  - it seems the (LKB) unknown name and number generation is sensitive
    to capitalization, much to my surprise: [ CARG "berthold" ] worked
    fine, [ CARG "Berthold" ] did not.  also, we should add the generic
    lexical entries into GG once we start getting serious about EnDe.

  - i need to recompile PET to include the VPM code; currently, we are
    limited to using the LKB for parsing too.

  - to run things from source, DFKI has to locate their ACL 8.0 license
    file (probably sent in email late in 2005).

  - for the MRS test suite, GG coverage left room for improvement; also
    it would seem worthwhile to systematically harmonize analyses, much
    like we did across NorGram, the ERG, and JaCY.

once we are content with this approach on the MRS test suite, we should
consider going back to the EuroParl materials, maybe pick five thousand
or so sentences and work on them systematically.  one problem i noticed
when looking at EuroParl for the first time is the pre-processing that
was applied:

  In the meantime , I should like to observe a minute' - s silence ,

presumably, we would have to go back to the original text and redo the
sentence alignment (they supply the scripts they used, so this should
be feasible).
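
to get a feeling for how many sentences are affected before redoing
the alignment, a quick check along the following lines might help.
this is only a sketch: the pattern covers just the one artifact quoted
above (a possessive split into "' - s"), and the function name is made
up for this example.

```python
import re

# matches a possessive that was split as in "minute' - s": a word
# character, an apostrophe, then " - s" as a separate token
SPLIT_POSSESSIVE = re.compile(r"\w' - s\b")


def has_split_possessive(line):
    """return True if the line shows the split-possessive artifact."""
    return SPLIT_POSSESSIVE.search(line) is not None
```

running this over a sample of the corpus would give a first estimate;
other pre-processing artifacts would need their own patterns.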

micha, berthold, francis, everyone, please correct and comment as you
see fit!  i plan to use a bit of my time this coming week to wrap up
LOGON for a first public release at ACL|COLING.  hence, it might make
sense to try and get updated versions of GG and (at least) DeEn into
CVS before the end of the week?  francis, what is the status of JaEn
and EnJa?

                                                     all best  -  oe

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2285 7989
+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++       --- oe at csli.stanford.edu; oe at ifi.uio.no; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-------------- next part --------------
A non-text attachment was scrubbed...
Name: deen.pdf
Type: application/pdf
Size: 12370 bytes
Desc: not available
URL: <http://lists.emmtee.net/archives/logon/attachments/20060709/153a8bfb/attachment.pdf>