Letter to Phoneme Rule Engine

Creating a text-to-speech voice with accurate pronunciation is challenging, especially for a diverse language such as English.

Using a pronunciation dictionary (such as the one by Carnegie Mellon University [1]) has its limitations. The dictionary can be very large, requiring megabytes of memory to store, which has an impact on cache locality and performance. It is also difficult to deal with made-up words or words that are not present in the dictionary. Care also needs to be taken to ensure the pronunciations have a consistent accent (e.g. keeping hot and cot consistent).

Using grapheme to phoneme rules (such as the ones described in the Naval Research Laboratory (NRL) Report 7948 [2]) drastically reduces the space needed to pronounce all the words in a dictionary and significantly improves the pronunciation of unknown words. The eSpeak text-to-speech engine provides a mechanism for describing grapheme to phoneme rules [3] such as the ones provided in the NRL report.

A key problem with grapheme to phoneme rules is ensuring that they are accurate. This can be verified by using a complete and accurate pronunciation dictionary as the test set for the rules.

NOTE: It is possible to create a pronunciation dictionary by listing all the words in the dictionary as grapheme to phoneme rules.

Expressing Letter To Phoneme Rules

This is a description of the types of expressions supported by the Cainteoir grapheme to phoneme rules engine.

These come in several forms:

| Form | Example | Description |
|------|---------|-------------|
| letter-sequence /phoneme-sequence/ | cat /k'at/ | A match to the letter-sequence causes the phoneme-sequence to be generated. |
| letter-sequence /phoneme-sequence/ language | si /s'i:/ es | A match to the letter-sequence causes the phoneme-sequence to be generated (which is noted to be in a specific language). |
| letter-sequence /phoneme-sequence/ language+accent | cot /k'0t/ en+RP | A match to the letter-sequence causes the phoneme-sequence to be generated (which is noted to be in a specific language and accent). |
| letter-sequence "replacement-text" | 12 "twelve" | A match to the letter-sequence causes the replacement-text to be scanned by the rules (NOTE: a "a" is infinitely recursive). |

Where the letter-sequence may consist of one or more of the following regular expression-like constructs [4]:

| Pattern | Example | Description |
|---------|---------|-------------|
| 8-bit character | t | Matches if the character is at the current string position (NOTE: a UTF-8 character will match as 1-4 8-bit characters). |
| [characterset] | [ck] | Matches if one of the characters in characterset is at the current string position. |
| (option\|option) | (c\|k\|ck) | Matches if one of the option sub-patterns matches at the current string position. |
| character? | mr.? | Optionally matches if the character is at the current string position. |
| ^ | ^nano | Matches if the current string position is at the start of a word. |
| $ | ing$ | Matches if the current string position is at the end of a word. |

NOTE: To enable advanced optimisation techniques, patterns involving repeated items are excluded.
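
For illustration, a small hypothetical rule set combining these forms and pattern constructs might look as follows (the layout and the combinations are illustrative only, not taken from an actual rule file):

	[ck]at     /k'at/
	(c|k)ot$   /k'0t/   en+RP
	12         "twelve"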

Multiple-Pass Interpreter

The letter to phoneme rules are scanned in order: the first rule that matches is chosen and its action is taken. The text is then processed from the next string position. This keeps the scanning algorithm simple:

#include <stdio.h>

void letter2phoneme(rule *rules, const char *str)
{
	rule *r = rules;
	while (r != NULL) {
		const char *next = match(r, str);
		if (next) { // a match was found ...
			doaction(r);
			if (!*next) // end of string ...
				return;
			str = next; // advance
			r = rules;  // reset the scanner to the first rule
		} else {
			r = r->next; // try the next rule
		}
	}
	printf("no matching rule found!\n");
}
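
The interpreter sketches assume a simple linked list of rules along the following lines; this is a hypothetical sketch of the supporting declarations (the struct layout and the match and doaction signatures are assumptions), not the engine's actual data structures:

/* Hypothetical rule representation assumed by the interpreter sketches. */
typedef struct rule rule;
struct rule {
	const char *pattern;  /* the letter-sequence, with any pattern constructs */
	const char *phonemes; /* the phoneme-sequence or replacement text */
	rule       *next;     /* the next rule in the ordered rule list */
};

/* match() returns the string position after the matched letters,
   or NULL if the rule does not match at the current position. */
const char *match(rule *r, const char *str);

/* doaction() emits the rule's phoneme-sequence, or re-scans its
   replacement text through the rules. */
void doaction(rule *r);

The single-pass interpreter below assumes a variant of match() that returns the matched action (carrying its next string position) rather than the string position itself.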

Single-Pass Interpreter

This can be inefficient for large rule sets (e.g. a large pronunciation dictionary). A faster approach at runtime would be to combine all the rules into a single Deterministic Finite-State Automaton (DFA) with multiple end-points, each of which holds the action required. The scanner would then match against this single large automaton, picking the longest matching sequence (behaving differently from the first-match algorithm above):

void letter2phoneme(rule *r, const char *str)
{
	while (*str) {
		action *m = match(r, str);
		if (m) { // a match was found ...
			doaction(m);
			str = m->next; // advance
		} else {
			printf("no matching rule found!\n");
			return; // abort
		}
	}
}

For a pronunciation dictionary, this single-pass model behaves like a trie data structure [12].

In order to build the DFA, the loader could scan each rule as it is being added and avoid inserting duplicate nodes. This should be possible without ambiguities by limiting the supported expressions to forward-scanning expressions only (i.e. no repetition).
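
A minimal sketch of that loading step, assuming a trie-like node with one transition per 8-bit character and a hypothetical action type for the end-points (not the engine's actual loader), might look like this:

#include <stdlib.h>

typedef struct action action; /* the end-point data holding a rule's action */

/* Hypothetical DFA/trie node: one transition per 8-bit character plus an
   optional end-point for rules that terminate at this node. */
typedef struct node node;
struct node {
	node   *edges[256];
	action *end;
};

/* Add a plain letter-sequence to the automaton, reusing existing nodes so
   that rules sharing a prefix (e.g. "c", "ca", "cat") share a single path. */
static node *add_sequence(node *root, const char *letters, action *act)
{
	node *n = root;
	for ( ; *letters; ++letters) {
		unsigned char c = (unsigned char)*letters;
		if (!n->edges[c])
			n->edges[c] = calloc(1, sizeof(node));
		n = n->edges[c];
	}
	n->end = act; /* the longest-match scan picks the deepest end-point reached */
	return n;
}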

Virtual Machine

Each rule is transformed to an internal opcode representation. This provides a simpler sequence-processing model operating on a stack and register virtual machine [8].

It should be possible to serialise the opcode sequence in a text form (for automated regression testing and debugging) as well as a compact binary bytecode form (for optimal storage).
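
As an illustration of what such an opcode representation could look like, the sketch below maps the pattern constructs onto a small instruction set; the opcode names and instruction layout are assumptions for illustration, not the Cainteoir engine's actual bytecode:

/* Hypothetical opcodes for the rule-matching virtual machine. */
typedef enum {
	OP_CHAR,     /* match a single 8-bit character */
	OP_SET,      /* match any character in a character set, e.g. [ck] */
	OP_CHOICE,   /* try each option sub-pattern in turn, e.g. (c|k|ck) */
	OP_OPTIONAL, /* optionally match the next character, e.g. mr.? */
	OP_AT_START, /* assert the start of a word (^) */
	OP_AT_END,   /* assert the end of a word ($) */
	OP_EMIT,     /* accept, emitting the phoneme-sequence */
	OP_REPLACE   /* accept, re-scanning the replacement text */
} opcode;

/* One instruction in a compiled rule program. */
typedef struct {
	opcode      op;
	const char *arg; /* character, character set, phoneme or replacement text */
} instruction;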

Just-in-Time Compilation

A Just-in-Time (JIT) compiler could replace the interpreter to convert the scanner into efficient native code, taking advantage of native registers and SSE instructions.

References

  1. The CMU Pronunciation Dictionary. Carnegie Mellon University (CMU).
  2. Elovitz, H. S., Johnson, R. W., McHugh, A., Shore, J. E., Automatic Translation of English Text to Phonetics by Means of Letter-to-Sound Rules (NRL Report 7948). Naval Research Laboratory (NRL). Washington, D. C., 1976.
  3. eSpeak: Pronunciation Dictionaries.
  4. Regular Expressions Reference - Basic Syntax.
  5. Carl Friedrich Bolz, An Efficient and Elegant Regular Expression Matcher in Python. 2010.
  6. Russ Cox, Implementing Regular Expressions.
  7. Russ Cox, Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...). 2007.
  8. Russ Cox, Regular Expression Matching: the Virtual Machine Approach. 2009.
  9. Russ Cox, Regular Expression Matching in the Wild. 2010.
  10. Brian W. Kernighan, Rob Pike, Regular Expressions. 1999.
  11. Erik Corry, Christian Plesner Hansen, Lasse Reichstein Holst Nielsen, Irregexp, Google Chrome's New Regexp Implementation. 2009.
  12. Trie. Wikipedia.