MT0: Toy Machine Translation Program, Nobo Komagata, 5/30/03, 7/8/03
General idea
- This toy machine translation program is based on the syntax-directed
translation approach built around a shift-reduce parser. The key idea
is the same as the one compilers use to translate computer languages.
- Unlike computer languages, human languages are ambiguous. To analyze
all the available parses, the program implements a manual backtracking/alternative
selection mechanism, so the user can explore the entire parse space by hand.
The trace of operations is kept in a configuration stack. This
stack is provided in addition to the parse stack, which is used for the shift-reduce
operations. Thus, the program is not a direct implementation of a push-down automaton.
- One obvious question is whether and how we can automate the process so that
the program can evolve into a real machine translation system. Furthermore,
the user will probably feel frustrated using this program. Analyzing why
naturally leads to a variety of key concepts in Natural Language Processing.
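As a rough illustration of the automation question above, the following Python sketch explores every shift/reduce alternative by itself instead of asking the user to choose. It is not the program's own code. The names `Config`, `successors`, and `all_parses`, the tuple-based tree representation, and the sample words are all assumptions made for this sketch. It also assumes the grammar has no cycle of single-symbol rules (such as A -> B together with B -> A), since such a cycle would make the search loop forever.

```python
from typing import NamedTuple


class Config(NamedTuple):
    stack: tuple  # parse stack: tokens and trees built so far
    pos: int      # index of the next unread input token


def label(tree):
    """Return the category at the root of a tree, or the token itself."""
    return tree if isinstance(tree, str) else tree[0]


def successors(cfg, tokens, rules):
    """All configurations reachable from cfg by one reduce or one shift."""
    result = []
    for lhs, rhs in rules:
        n = len(rhs)
        top = cfg.stack[-n:]
        # Reduce: the top n stack items must spell out the rule's RHS.
        if n <= len(cfg.stack) and tuple(label(t) for t in top) == rhs:
            node = (lhs,) + top
            result.append(Config(cfg.stack[:-n] + (node,), cfg.pos))
    # Shift: move the next input token onto the parse stack.
    if cfg.pos < len(tokens):
        result.append(Config(cfg.stack + (tokens[cfg.pos],), cfg.pos + 1))
    return result


def all_parses(tokens, rules, start="S"):
    """Depth-first search over configurations; returns every complete parse."""
    config_stack = [Config((), 0)]  # alternatives that are still untried
    parses = []
    while config_stack:
        cfg = config_stack.pop()    # backtrack to the most recent alternative
        done = cfg.pos == len(tokens) and len(cfg.stack) == 1
        if done and label(cfg.stack[0]) == start:
            parses.append(cfg.stack[0])
        config_stack.extend(successors(cfg, tokens, rules))
    return parses


# Hypothetical ambiguous grammar: "x and y and z" can be grouped two ways.
RULES = [
    ("NP", ("NP", "and", "NP")),
    ("NP", ("furby",)),
    ("NP", ("elmo",)),
    ("NP", ("gizmo",)),
]

if __name__ == "__main__":
    for tree in all_parses("furby and elmo and gizmo".split(), RULES, "NP"):
        print(tree)
```

Here the configuration stack serves as a list of untried alternatives, which is only one possible reading of the trace described above. Exhaustive search like this grows quickly with sentence length, which is one reason practical systems rely on more refined parsing techniques.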
Operation
- Loading example grammar files: The default example file is "http://www.tcnj.edu/~komagat/MT0/Example1.txt". There is another example file called "Example2.txt".
- Loading other grammar files: Text files on the server "www.tcnj.edu/" can be referenced in the same way. Files cannot be loaded from other servers.
- Modifying the grammar: The grammar in the grammarArea can be edited.
In order to use the modified grammar, you must press RESET.
- Entering an input sentence: Enter a sentence whose words match the terminal
symbols (matching is case sensitive). Do not use punctuation symbols unless
they are actually part of the grammar.
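The input rules above can be checked mechanically. The sketch below is a minimal illustration in Python, not part of the program: the function name `unmatched_tokens` and the sample terminal set are hypothetical, and it assumes the input is split on whitespace.

```python
def unmatched_tokens(sentence, terminal_symbols):
    """Return the input tokens that match no terminal symbol.

    Comparison is plain string equality, so it is case sensitive, and
    punctuation attached to a word makes that word a different token.
    """
    return [tok for tok in sentence.split() if tok not in terminal_symbols]


# Hypothetical terminal symbols taken from some grammar.
TERMINALS = {"Furby", "sleeps", ","}

if __name__ == "__main__":
    print(unmatched_tokens("Furby sleeps", TERMINALS))
    print(unmatched_tokens("furby sleeps.", TERMINALS))
```

An empty result means every word can at least be looked up; it does not guarantee that the sentence as a whole can be parsed.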
Grammar format
- The grammar format must follow the pattern "X {Y} -> A B C",
where X, A, B, and C are grammatical categories (called "type" in the program)
and Y is the semantic interpretation associated with X. There must
be exactly one category on the left-hand side (LHS). There must be
at least one category on the right-hand side (RHS). There is no preset limit on the number
of categories on the RHS. The semantic interpretation section can contain
any string. However, any RHS category that appears in it will be
replaced with the semantic representation associated with that category. Below
is an example rule from "Example2.txt".
- S {NP wa VP} -> NP VP
The effect of this rule is that the categories NP and VP will be combined
to form a single category S with the corresponding semantics "NP wa VP".
If NP and VP have semantic representations "Furby" and "neru", the
interpreted semantic representation for S will be "Furby wa neru".
- A dictionary entry typically appears with a single category on the RHS as shown below.
- If the semantic effect of a word is not reflected in the target language, the interpretation section can be left empty. However, a pair of curly brackets is still required.
- If there are multiple instances of the same category on the RHS and they
are associated with distinct semantic representations, the categories must
be distinguished by indices in the form of "_#", where # is a numeral. This
is necessary to ensure that the corresponding semantic representations can
be used correctly in the semantic translation part between "{" and "}".
- NP {NP_1 to NP_2} -> NP_1 and NP_2
- In the following example, commas "," are used to match commas in the input (which
must be space-delimited). Since these commas are not associated with
any particular semantics, they are not distinguished by "_#". Note that
the comma in the interpretation section has nothing to do with the commas
on the RHS. In general, you can insert any string (except for some
special symbols such as "}" and ">") needed to generate the desired semantic
representation.
- S {Disc, AdvP S} -> AdvP , Disc , S
- Much like commas, "and" is not defined in "Example2.txt".
When "and" appears in a pattern like "NP and NP", it is used as a special
symbol to activate this rule as in the example below. Note that the
semantics of "and" is embedded in this rule and the corresponding interpretation
"to" is generated by the rule, not derived from the lexical entry "and".
However, there are other ways to do the same thing.
- NP {NP_1 to NP_2} -> NP_1 and NP_2
- Convention: Use lowercase for terminals (dictionary entries) and uppercase for nonterminals (grammar rules).
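To make the rule format and the substitution step concrete, here is a small Python sketch that parses a line of the form "X {Y} -> A B C" and fills in the interpretation section. It is an illustration under stated assumptions, not the program's implementation: the names `Rule`, `parse_rule`, and `interpret` are hypothetical, the sample semantic values are made up, and whole-symbol matching (so that "NP" never matches inside "NP_1") is a design choice of this sketch.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    lhs: str        # the single category on the LHS
    semantics: str  # text between "{" and "}" (may be empty)
    rhs: list       # one or more space-delimited symbols on the RHS


RULE_RE = re.compile(r"^\s*(\S+)\s*\{([^}]*)\}\s*->\s*(.+?)\s*$")


def parse_rule(line):
    """Split a grammar line into LHS, interpretation section, and RHS."""
    m = RULE_RE.match(line)
    if m is None:
        raise ValueError(f"not in 'X {{Y}} -> A B C' form: {line!r}")
    return Rule(m.group(1), m.group(2).strip(), m.group(3).split())


def interpret(rule, child_semantics):
    """Replace RHS categories in the interpretation with their semantics.

    child_semantics maps an RHS symbol such as "NP" or "NP_1" to the
    semantic representation already built for it. Symbols without an
    entry (for example "," or "and") are left out of the mapping.
    """
    if not child_semantics:
        return rule.semantics
    names = "|".join(re.escape(n) for n in child_semantics)
    # Match whole symbols only; "_" counts as a word character here.
    pattern = re.compile(rf"(?<!\w)(?:{names})(?!\w)")
    return pattern.sub(lambda m: child_semantics[m.group(0)], rule.semantics)


if __name__ == "__main__":
    s_rule = parse_rule("S {NP wa VP} -> NP VP")
    print(interpret(s_rule, {"NP": "Furby", "VP": "neru"}))  # Furby wa neru

    and_rule = parse_rule("NP {NP_1 to NP_2} -> NP_1 and NP_2")
    print(interpret(and_rule, {"NP_1": "inu", "NP_2": "neko"}))  # inu to neko
```

Because the substitution is done in a single pass, text that has just been inserted is not scanned again, so a semantic value that happens to look like a category name is left alone.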