The grammar behind Utter Command
Human-Machine Grammar is a system of words and rules designed to allow humans
to communicate commands to computers. It takes into consideration that humans
have an extensive natural language capacity that has evolved over millions
of years and that we use seemingly without effort, while computers do not
yet have the ability to understand the meaning of speech.
It also takes into consideration that while language seems easy
for humans, different phrasings encompass a considerable span of cognitive
effort. Human-Machine Grammar is designed to limit cognitive effort in order
to free up as much of the brain as possible to concentrate on the task at hand.
Natural language allows for a wide, textured range of communications,
but controlling a computer only requires a relatively small set of distinct
commands. Human-Machine Grammar is a relatively succinct set of words that
can be combined according to a concise set of grammar rules to communicate
a small set of commands. The system is relatively easy for humans to learn,
and computers can respond to the commands without having to decode natural
language or be loaded down with large sets of synonymous commands. (for
more details see Structured vs.
Human-Machine Grammar, like any language system in use, is an active,
evolving set of words and rules. Redstart Systems founder Kimberly Patch
began using speech recognition software and writing custom speech macros
in 1994. Over time it became obvious that a formal grammar was needed. She
has been developing these ideas since 1998; much of what she developed,
including the logic underpinning the grammar, is covered in a series
We encourage people to use the HMG system when writing custom speech
The grammar rules
1. Match the words used for a command as closely as possible with what the command does.
2. Use words the user sees on the screen.
3. Be consistent.
4. Balance the ease of saying a command with the ease of remembering a command.
5. Use one-word commands very sparingly. Beyond one word, however, keep the number of words used in any given command to a minimum.
6. Eliminate unnecessary words.
7. Eliminate synonyms.
8. Reuse vocabulary words.
9. Use existing word pairs.
10. Follow the way people naturally adjust language to fit a situation.
11. Use combined commands that follow the order of events.
12. Allow the user to follow the action when necessary.
13. Use phrase modes, or words that keep mode changes within single commands, to give the computer more information.
14. When appropriate, allow different ways to invoke the same function.
15. Be thorough and consistent in enabling menu commands across all programs.
16. In general, think of on-screen elements like text, symbols and graphics as logical objects, and enable similar objects to be manipulated in similar ways.
The 16 Human-Machine Grammar rules are aimed at keeping the speech interface
vocabulary small and easy to remember and predict. These guidelines cut
out alternate wordings and establish consistent patterns across the entire
set of commands, making it much easier to remember or guess how a command
should be worded. The examples below are taken from Redstart Systems' Utter
Command speech interface for general computer use.
Rule 1: Match the words used for a command as closely as possible
with what the command does.
This makes commands easier to remember.
"Line" refers to a line of text
refers to clicking an on-screen element with the mouse arrow
"File" refers to a file
"Folder" refers to a folder
Rule 2: Use words the user sees on the screen.
This also makes commands easier to remember.
When enabling menu commands, for example, use existing words - the
menu labels - to indicate menu actions.
Rule 3: Be consistent.
This also makes commands easier to remember and guess. Consistency
means always using the same term to refer to an object or action, and the
same constructions to build commands.
Notice the patterns in these groups of commands:
"3 Lines", "4 Graphs", "2 Layers", "2 Enter", "Word Bold", "Line Duplicate"
"7 Lines Cut", "5 Words Bold", "3 5 Words Delete"
"My Documents Folder", "Google Site", "Alpha File"
Rule 4: Balance the ease of saying a command with the ease of
remembering a command.
The ease of saying a command is always important, but becomes
even more important the more often a command is used. In contrast, the ease
of remembering a command is always important, but becomes even more important
for commands that are not frequently used.
Clicking the mouse arrow is common, making it important that the
command for clicking be easy to say. "Mouse Click" is particularly difficult.
is much easier, and also matches what the command does.
In enabling menu commands, it's important to use the words on the
menu labels because even though they might not be worded well for ease of
saying, most of them are good enough, there are many of them, and it is
much easier to remember commands that you see on screen.
Rule 5: Use one-word commands very sparingly; beyond one word,
however, keep the number of words used in any given command to a minimum.
One-word commands are easy to remember and say, but are more
likely than longer commands to be tripped accidentally when you mean to
say them as text. Only a handful of commands are used extremely
often, including "Enter", "Space", "Backspace", and "Escape". It makes sense
to enable these few, very common commands as one-word commands. In situations
where the system is limited to commands, like when the focus is on a dialog
box, and when the command you want to say is on-screen, one-word commands
also make sense.
Otherwise, however, commands should consist of more than one word,
and just two if possible. Keeping the number of words used in commands to
a minimum makes commands easier to remember, say and combine.
Rule 6: Eliminate unnecessary words.
This rule is closely related to rule 5. One key to keeping commands
succinct is eliminating unnecessary words.
Here are some things to think about when paring a command to only
Articles like "a" and "the", and polite, getting-started and redundant
filler words are never needed.
When identifying an object is enough to imply an action, it isn't
necessary to include the action word. Identifying a folder - for instance,
- is enough to indicate that the folder named "Budget"
be called up by the program in use.
Here's a command that shows it isn't necessary to include the object
(cursor), the action (move) or the type of units (characters). "3 Left"
is enough to indicate that the cursor be moved three characters to the left.
The bottom line is if there's no need to differentiate, there's no
need to have the user spend brain cycles and time on remembering and saying
a specific word.
Commands that contain only necessary information follow the way we
work out jargon in repetitive human-human communication situations. For
instance, a fast food worker putting in two orders of french fries typically
says "two fry".
Rule 7: Eliminate synonyms.
A vocabulary without synonyms is smaller, which makes commands
easier to remember and predict. It also makes combining commands practical,
which, in turn, makes using a computer faster and more efficient.
The word "This", for instance, refers to something that's highlighted
or on the clipboard. It's the only word that carries these meanings. If
you want to say a command that carries out a single action on a selection,
like "This Cut" or "This Bold", you know you'll use "This".
The word "Back", and only "Back", refers to moving something in the direction
to the left of the cursor. "Word 3 Back", for instance, moves the word
nearest the cursor 3 words to the left.
"Forward", and only "Forward", refers to moving something in the direction
to the right of the cursor. "Graph 2 Forward", for instance, moves the
paragraph nearest the cursor down two paragraphs.
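As a rough illustration of how a dedicated direction word can carry the action with it, here is a minimal Python sketch in which "Back" and "Forward" always mean "move the object". The function name `move_object` and the list-of-words model of a document are assumptions made for this example, not part of Utter Command.

```python
def move_object(items, index, count, direction):
    """Move items[index] `count` places: "Back" toward the start, "Forward" toward the end."""
    if direction not in ("Back", "Forward"):
        raise ValueError(f"unknown direction word: {direction!r}")
    offset = -count if direction == "Back" else count
    # Clamp so the object stops at either end of the sequence.
    target = max(0, min(len(items) - 1, index + offset))
    result = list(items)
    result.insert(target, result.pop(index))
    return result


words = ["one", "two", "three", "four", "five"]
# "Word 3 Back" with the cursor on "four": the word moves three words to the left.
print(move_object(words, 3, 3, "Back"))
```

Because the direction word alone selects the move behavior, no separate "Move" word has to be parsed.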
This key rule is in stark contrast to most existing speech interfaces.
The default configuration of Nuance's NaturallySpeaking, for instance, offers
four different ways for the user to voice the punctuation mark "Open Quote"
and four more ways for the user to voice the punctuation mark "Close Quote".
It uses many synonyms, including "Start", "Begin", "Give Me", "Check", "Show",
"Open", "Bring Up", "Edit" and "View" as the first word or words in commands
that bring up a program or dialog box. And it offers 16 synonymous wordings
for checking mail, 16 for creating a new mail message, five for opening
a selected email message, and five for closing an email message. These
42 wordings for four functions are specific to one email program.
Synonymous wordings pose another problem. Similar words came about
because they have subtly different meanings. These differences are key to
keeping the length of commands short and enabling different types of functions.
If "Back" and "Forward" always refer to moving an object there's no need
to include wording that indicates moving an object (like "Move") along with
the directional words Back and Forward.
Rule 8: Reuse vocabulary words.
The world's languages regularly reuse vocabulary words. Context
makes this possible, and it's important to take advantage of vocabulary
reuse in order to keep command vocabulary small and easy to remember.
"Top", for instance, refers to the beginning of a document - the
command "Go Top"
puts the cursor at the beginning of a document.
"Top" also refers to the portion of a word, line, paragraph or document
that lies before the cursor. For example, "Graph Top" selects the
portion of a paragraph that is before the cursor and "Doc Top" selects
from the cursor to the beginning of the document.
Numbers are also used in several different ways. Numbers can refer
to hitting a key a number of times, like "3 Backspace", or to
a number of objects, like "3 Lines".
The numbers 1 to 100 also indicate several types of absolute measures.
, for instance, adjusts the computer's speaker to its
middle volume setting.
Rule 9: Use existing word pairs.
This takes advantage of the instinctive knowledge that pairs
carry related meanings. It also helps make the vocabulary concise and easy
"Back" and "Forward" are a pair. We also use "On" and "Off". For
example, "Speech On" and "Speech Off" turn the microphone on and off.
Another common pair is "Before" and "After" - "5 Before" moves the cursor
5 words to the left, while "5 After" moves the cursor 5 words to the right.
Rule 10: Follow the way people naturally adjust language to fit a situation.
This makes commands easier to learn and remember.
It's unusual to find a command that no existing word matches, but
this does happen occasionally. In these cases where language must be stretched
to fit a situation it is important that it be done in a natural way.
To select the three words before the cursor, for instance, say "3
Befores", and to select three words after the cursor, "3 Afters".
Although these constructions might seem a little strange at first glance,
they're easy to learn and remember because they follow natural patterns.
"Afters" is already in use - it's a British term for dessert, as in what
you have after a meal. There's another precedent that's closer to home:
when people talk about hitting the Page Up key several times they talk about
hitting several "page ups".
Rule 11: Use combined commands that follow the order of events.
Combining commands makes the interface more efficient by cutting
down on the steps necessary to carry out computer functions. This also cuts
down on mistakes simply because there are fewer steps.
When combining several steps into one command it's easier to picture
the action and easier to remember the command if the command wording follows
the way the command will be carried out.
For example, "3 Lines Bold" selects, then bolds the three lines below
the cursor, and "3 Graphs Cut" selects, then cuts the three paragraphs
below the cursor.
In general, commands contain one or more of three types of events:
|placing the cursor
|selecting an object
|carrying out an action
And in general:
|moving the cursor comes first
|followed by selecting an object like text, a program
element, a file, or a program
|followed by actions like moving, formatting, copying,
deleting, or opening
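The ordering above can be sketched as a tiny parser that turns a spoken command into steps in the order they would be carried out. This is only an illustrative sketch: the word tables, the `parse_command` name and the choice to treat every object word as a selection are simplifying assumptions, not the Utter Command implementation.

```python
# Hypothetical vocabulary tables covering a few of the commands quoted in the text.
OBJECTS = {"Line": "line", "Lines": "line", "Graph": "paragraph",
           "Graphs": "paragraph", "Word": "word", "Words": "word"}
ACTIONS = {"Bold", "Cut", "Delete", "Duplicate"}


def parse_command(spoken):
    """Split a command into ordered steps: the selection first, then each action."""
    tokens = spoken.split()
    count = 1
    if tokens and tokens[0].isdigit():
        count = int(tokens.pop(0))
    if not tokens or tokens[0] not in OBJECTS:
        raise ValueError(f"expected an object word in {spoken!r}")
    steps = [("select", count, OBJECTS[tokens.pop(0)])]
    for word in tokens:
        if word not in ACTIONS:
            raise ValueError(f"unknown action word {word!r}")
        steps.append((word.lower(),))
    return steps


print(parse_command("3 Lines Bold"))
print(parse_command("3 Graphs Cut"))
```

Because the words arrive in the same order as the events, the parser never has to look ahead or reorder anything.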
The three types of control keys don't have a natural chronological
order and so instead follow a prescribed order:
||Shift, Control, Alternate (this is also reverse alphabetical order)
||"Shift Control a", but not "Control Shift a"
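A prescribed order like this is easy to enforce mechanically. The following Python sketch normalizes any set of control-key words into the Shift, Control, Alternate order; the function name `canonical_key_command` is an assumption made for illustration.

```python
MODIFIER_ORDER = ["Shift", "Control", "Alternate"]


def canonical_key_command(modifiers, key):
    """Return the one accepted wording for a key combination."""
    unknown = set(modifiers) - set(MODIFIER_ORDER)
    if unknown:
        raise ValueError(f"unknown control keys: {sorted(unknown)}")
    ordered = [m for m in MODIFIER_ORDER if m in modifiers]
    return " ".join(ordered + [key])


print(canonical_key_command(["Control", "Shift"], "a"))  # Shift Control a
```

Having a single accepted order means only one wording per key combination has to be recognized and remembered.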
One consequence of using commands that follow the order of events
is that you're initiating an action ("Window Close") rather than
telling the computer to do something ("Close Window"). This is a
subtle point, but using words that depict closing the window directly rather
than words that direct a third party - the computer - to do so is simpler
and so uses less cognitive effort. This practice also makes commands more
consistent and eliminates alternate wordings.
Combined commands also give you an efficient way to recover from
mistakes - such as your miscounting or the computer mishearing - so you don't
become mired in a succession of miscues. Consider this scenario: you're
attempting to quickly and efficiently change "two" to "to" immediately after
having said "two". The command "Left Backspace" carries this out
in a single command. If you accidentally say "2 Left Backspace",
however, instead of "to" you are left with "wo" with the cursor to the
left of the letters. You can correct this mistake with a single combined
command: "Delete t".
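The scenario can be traced step by step with a small text-buffer simulation. This is a sketch for checking the reasoning only: the `run` function and its step format are assumptions, with each combined command written out by hand as a list of key presses and typed text.

```python
def run(text, cursor, steps):
    """Apply steps to a buffer: a (key, count) pair presses a key, a str is typed at the cursor."""
    for step in steps:
        if isinstance(step, str):
            text = text[:cursor] + step + text[cursor:]
            cursor += len(step)
            continue
        key, count = step
        for _ in range(count):
            if key == "Left":
                cursor = max(0, cursor - 1)
            elif key == "Backspace":
                if cursor > 0:
                    text = text[:cursor - 1] + text[cursor:]
                    cursor -= 1
            elif key == "Delete":
                text = text[:cursor] + text[cursor + 1:]
            else:
                raise ValueError(f"unknown key {key!r}")
    return text, cursor


# "Left Backspace" right after dictating "two" (cursor at the end of the word).
print(run("two", 3, [("Left", 1), ("Backspace", 1)]))
# The slip "2 Left Backspace" removes the wrong letter.
print(run("two", 3, [("Left", 2), ("Backspace", 1)]))
# "Delete t" repairs the result of the slip in one command.
print(run("wo", 0, [("Delete", 1), "t"]))
```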
Rule 12: Allow the user to follow the action when necessary.
When you use the mouse to carry out an action that involves several
separate steps, like selecting a paragraph, cutting the paragraph, moving
the cursor to another location, then pasting the paragraph, you, by default,
follow exactly what's happening because you have to initiate each step.
When you're using speech - and especially when you're using long
speech commands - it's important to make sure that you're able to follow
the action. For instance, when you select, cut, move and paste text using
a single command, you should be able to see the text highlighted in its
original location before it's cut, then highlighted after it's pasted in
the new location. This allows you to easily follow the action so you can
automatically confirm what's happening rather than having to figure out
what occurred after the fact, perform another operation to confirm an action,
or simply take on faith that an action was carried out correctly.
It's important that this kind of feedback not become annoying, however,
so it should happen quickly. Audio feedback is also useful, but should be
used sparingly so that it doesn't become annoying. Here are a couple of
examples from Utter Command:
When you turn off the microphone you're often turning away from your
computer - the audio "Speech Off" and "Microphone Off" confirmations mean you
don't have to wait to see the microphone change color.
When you copy and cut to the UC Clipboard files you hear confirmations
so you know your text has been pasted into the correct clipboard file.
Feedback can be subtle. Here are a couple of subtle examples from Utter Command:
When you move the mouse using speech you can more easily follow the
action because the mouse arrow wiggles slightly at the end of a command.
The wiggle is subtle enough that it usually doesn't enter the user's awareness
unless she is told about it, but it is enough to draw her eye to the new location.
When you combine closing a window and clicking "Yes" or "No" to save
a file the arrow briefly pauses in front of the proper button so you can
see which button the arrow clicks. In addition, the arrow waits twice as
long in front of the Yes button as the No button.
Rule 13: Use phrase modes, or words that keep mode changes within
single commands, to allow the human to give the computer more information.
"Short" and "Long", for example, are used to distinguish between
several different types of ambiguous spoken commands:
|symbolic and written forms, such as "3" versus "three"
and "star" versus "*"
|full forms of words and their abbreviations such as
"January" versus "Jan."
|words that sound exactly the same - homophones like
"pair", "pare", and "pear"
|different formats of the date or time, such as "6-21-05"
versus "June 21, 2005"
|numbers and number values in otherwise ambiguous combined
commands, such as moving the cursor down then typing a number versus
moving the cursor down a number of lines
|command words and text, such as typing a single word
that also appears in the menu bar across the top of many programs
||"3" allows the computer to determine what the user means
based on context, "3 Short" types "3" and "3 Long" types "three"
||"Star" leaves the form up to the computer, "Star Short" types "*",
and "Star Long" types "star"
||"Versus" allows the computer to decide between the long and commonly
abbreviated versions of this word, "Versus Short" types "vs." and
"Versus Long" types "versus"
||"3 Down" moves the cursor down three lines, "3 Short Down" returns "3",
then moves the cursor down 1 line
||"Window" clicks the window menu in programs that have one, and
"Window Long" types "Window"
Short and long can be further modified with a number in the case
of multiple homophones. These are arranged in alphabetical order.
For example, "4" leaves the form up to the computer, "4 Short"
types "4", "4 Long"
types "four", "4 Long 1"
and "4 Long 2"
types "Fore"; "Pair Long 1"
types "Pair", "Pair Long 2"
types "Pare", "Pair Long 3"
types "Pear". In these cases,
"Long 1 to 10" is not functionally different from "Short 1 to 10".
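One way to picture phrase modes is as a lookup that is consulted only when the speaker adds "Short" or "Long", with the bare word left to the computer. The Python sketch below is a hypothetical illustration: the tables, the `resolve` function and the `choose` callback standing in for the computer's contextual guess are all assumptions, and the long form "star" is inferred from the "star" versus "*" pairing above.

```python
# Forms taken from the examples in the text; "star" as the long form is an inference.
FORMS = {
    "4": {"Short": "4", "Long": "four"},
    "Star": {"Short": "*", "Long": "star"},
}
# Homophones are kept in alphabetical order so a number can pick one.
HOMOPHONES = {"pair": ["Pair", "Pare", "Pear"]}


def resolve(spoken, choose=lambda forms: forms["Short"]):
    """Return the text to type for a word with an optional phrase mode."""
    word, *rest = spoken.split()
    if len(rest) == 2 and rest[0] in ("Short", "Long"):
        # Numbered variant: "Short n" and "Long n" behave the same way.
        return HOMOPHONES[word.lower()][int(rest[1]) - 1]
    forms = FORMS[word]
    if not rest:
        # No mode word: leave the decision to the computer (stubbed by `choose`).
        return choose(forms)
    if len(rest) == 1 and rest[0] in forms:
        return forms[rest[0]]
    raise ValueError(f"cannot resolve {spoken!r}")


print(resolve("4 Long"))       # four
print(resolve("Star Short"))   # *
print(resolve("Pair Long 3"))  # Pear
```

The mode word travels inside the command itself, so there is no persistent mode for the user to forget to switch off.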
This method has the advantage of scalability. As computers get better
at distinguishing between forms, users will naturally shift the task of
choosing back to the computer by using the default single words more often.
Phrase modes also avoid the well-known problem of users losing their
bearings with modes that must be turned on and off.
Rule 14: When appropriate, allow different ways to invoke the same function.
This is the speech equivalent of a graphical user interface that
allows you to go through a menu, click a button on the desktop or press
a keyboard combination to carry out a function depending on the situation.
It's important to note that this refers to different ways to carry
out the same function - enabling existing pathways to the same command -
rather than the common use of synonymous wordings for the same function.
Enabling different ways of carrying out the same function allows
you to take advantage of any existing knowledge you might have about a program.
For instance, you should have the choice of using a single speech
command that invokes a deep menu function ("File Save"), or a single speech
command that carries out a series of keystrokes that accomplishes the same
thing ("Control s"). This both taps existing knowledge and reduces the chances
that a user will be unable to figure out a way to do something by speech
even given special circumstances that restrict options.
It's also important that you can invoke functions using only local
knowledge - that is, what you see on the screen.
Dialog boxes present a bit of a special case, because on-screen words
exist for dialog boxes in two places: on the menu and on the top of the
dialog box. Unfortunately, in some programs some of these labels differ.
In these cases you should have the choice of calling up the dialog box using
a command based on the words used to name the dialog box in the menu system
(for instance, the first word of the NatSpeak vocabulary manager menu label
is "Edit"), or a command based on the words on the top of the dialog box
(for instance, the first word of the NatSpeak Vocabulary Editor dialog box
is "Vocabulary"), or any existing keyboard shortcut.
Rule 15: Be thorough and consistent in enabling menu commands across all programs.
This guideline is related to the second and third guidelines
- use words that you see on-screen, and be as consistent as possible. Consistency
is good for both people and computers. It helps people remember and it enables
Here's how to enable all the menu commands in any program (these
commands work from within the target program):
|File menu commands are made up of the first two words
of a command as it appears on the menu, ignoring company names, version
numbers, and the words "and" and "or".
|Menu commands that call up a submenu can also be accessed
using the first word of the menu plus the word "Menu" (see rule 14
for the logic behind this additional wording.)
|Menu commands that call up dialog boxes can also be
accessed using the first word of the dialog box label plus the word
"Box" or "Open". Note that sometimes the dialog box label does not
match the words used to indicate the dialog box on the menu. (See
rule 14 for the logic behind this additional wording.)
|Commands like tabs and text boxes within dialog boxes
can be invoked directly using the first word of the dialog box plus
the first word of the tab or text box. This type of command can also
be combined with standard input to a text box, like a number, or checking
a box. This type of command can be further combined to open the dialog
box, provide the input, then close the dialog box by adding the word
"Close" to the end of the command.
Here's how to handle the difficult cases:
|If a top-level menu has just one word, add "Menu" after
the word. For example, "Edit Menu".
|If a two-word menu command conflicts with another command
in the menu system, add the next word of the menu item label if possible.
|If a non-top-level menu command has just one word or
is a multi-word command whose conflict with another command can't
be resolved by adding subsequent words, add the first word of the
menu or menu branch directly before the menu command to the front
of the speech command. In the event of continued conflict, add a number
to the end of the speech command. Commands are numbered left to right
and top to bottom according to their positions in the menu system.
|If menu commands don't contain words, number them in
the standard order of left to right and top to bottom. For instance
the "Format/Background" submenu in Word contains just blocks of color.
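To make the naming rules concrete, here is a minimal Python sketch that derives speech commands from menu labels. It covers only part of the scheme: taking the first two words, ignoring "and" and "or", adding "Menu" to one-word top-level menus, and lengthening a command by the next label word on conflict. Company names, version numbers, menu-name prefixes and numbering are not modeled, and the function names and sample labels are hypothetical.

```python
IGNORED_WORDS = {"and", "or"}


def significant_words(label):
    """Split a menu label into words, dropping a trailing ellipsis and ignored words."""
    return [w for w in label.replace("...", "").split() if w.lower() not in IGNORED_WORDS]


def base_command(label, top_level=False):
    """First two significant words, or word plus "Menu" for a one-word top-level menu."""
    words = significant_words(label)
    if top_level and len(words) == 1:
        return f"{words[0]} Menu"
    return " ".join(words[:2])


def assign_commands(labels):
    """Map labels to commands, adding the next label word when a command is taken."""
    commands = {}
    for label in labels:
        words = significant_words(label)
        length = min(2, len(words))
        candidate = " ".join(words[:length])
        while candidate in commands.values() and length < len(words):
            length += 1
            candidate = " ".join(words[:length])
        if candidate in commands.values():
            # The further fallbacks described in the text are left out of this sketch.
            raise ValueError(f"unresolved conflict for {label!r}")
        commands[label] = candidate
    return commands


print(base_command("Edit", top_level=True))
print(assign_commands(["Save", "Save As...", "Send To Mail Recipient", "Send To Fax"]))
```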
These rules make it possible for you to figure out commands by going
through existing menus and dialog boxes, gradually saving steps until you
become used to the most efficient commands.
These rules work no matter how menu items are constructed, but they
work best when menu items are generated according to the well-established
good interface guidelines that call for consistent, descriptive, noun-based
These rules work well to fully enable a program's menu system for
speech. There are also a couple of practical matters to consider. You should
be able to quickly enable any portion of the menu and dialog box commands
for any given program at any given time. And you should be able to change
individual wordings in this standard template. We recommend changes, however,
only in cases in which an often-used command is especially awkwardly worded.
In addition, you might want to enable some program menus or program
menu items so they work whether or not that program is active. One good example
is speech engine software menus, which should be accessible whether or not
the system focus is on the speech engine program. UC fully enables the UC
and NatSpeak program menus this way (see UC Lesson 1.6
It's also sometimes useful to enable key functions from certain programs
so they can be accessed globally. UC enables Web site, file and folder access,
and Windows system and sound controls this way (see UC Lesson 10.15
Here's how to enable menu commands that should be accessible globally:
|Start the command with the name of the program (such
as NatSpeak) or, to call up a default program, the name of the type
of program (such as Media or Mail), followed by just the first word
of the menu item.
|If a command conflicts with another command in the menu
system, add the next word of the menu item label if possible.
|If a conflict with another command cannot be resolved
by adding subsequent words, insert the first word of the menu or menu
branch that is directly before the menu command after the name of
the program (so that it is the second word of the command). In the
event of continued conflict, add a number to the end of the speech
command. Commands are numbered left to right and top to bottom according
to their positions in the menu system.
|If menu commands don't contain words, number them in
the standard order of left to right and top to bottom.
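The first two global rules can be sketched in a few lines of Python: start with the program name, add the first word of the menu item, and add further label words only if the result is already taken. The `global_command` function and the sample menu labels are hypothetical, and the menu-branch and numbering fallbacks are not modeled.

```python
def global_command(program, label, taken=()):
    """Build a globally accessible command: program name plus as few label words as needed."""
    words = label.replace("...", "").split()
    for length in range(1, len(words) + 1):
        candidate = " ".join([program] + words[:length])
        if candidate not in taken:
            return candidate
    raise ValueError(f"unresolved conflict for {label!r}")


print(global_command("NatSpeak", "Options..."))
print(global_command("Mail", "Check Names", taken={"Mail Check"}))
```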
Rule 16: In general, think of on-screen elements like text, symbols
and graphics as logical objects, and enable similar objects to be manipulated
in similar ways.
This is less a rule than a guideline. Keeping this in mind will
enable you to follow the rules better, and will facilitate a smaller and
more useful set of commands, which, of course, makes commands easier to
Here's an example. The basic elements, or objects, of text are characters,
words, lines, sentences, paragraphs and documents. Once these are defined
they can be manipulated, and the cursor can be moved around them, using
the same command structures with different object words.
In the case of characters, words, lines, sentences, paragraphs and
documents, each text object must be defined in several different ways. Take
"line", for instance. You need to indicate moving the cursor up or down
by a line and selecting up or down by a line. Here are the needed variations:
Line Up, Line, Line Ups, Lines. Follow the same pattern to set up the variations
for the other objects. Paragraph: Graph Up, Graph, Graph Ups, Graphs. Letter:
Left, Right, Lefts, Rights. Word: Before, After, Befores, Afters. Because
document is the largest object, variations aren't needed. Once these are
defined, and you know how to use one of them, it's trivial to apply the
command structure to the others. For example, once you know you can say
"3 Lines" to select the next 3 lines, "3 Graphs", "3 Lefts"
and even "3 Lines Delete"
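The four variations per object can be written as one table and interpreted by one routine, which is what makes the pattern transferable. In this hedged Python sketch the variation words come from the text, while the "move/select" and "backward/forward" labels, the table layout and the `interpret` function are assumptions made for illustration.

```python
# Each tuple: move backward/up, move forward/down, select backward/up, select forward/down.
VARIATIONS = {
    "line": ("Line Up", "Line", "Line Ups", "Lines"),
    "paragraph": ("Graph Up", "Graph", "Graph Ups", "Graphs"),
    "letter": ("Left", "Right", "Lefts", "Rights"),
    "word": ("Before", "After", "Befores", "Afters"),
}
MEANINGS = (("move", "backward"), ("move", "forward"),
            ("select", "backward"), ("select", "forward"))
LOOKUP = {phrase: (obj, verb, direction)
          for obj, phrases in VARIATIONS.items()
          for phrase, (verb, direction) in zip(phrases, MEANINGS)}


def interpret(spoken):
    """Decode an optional count, an object phrase and any trailing action words."""
    tokens = spoken.split()
    count = int(tokens.pop(0)) if tokens[0].isdigit() else 1
    # Try the two-word phrase first so "Line Up" is not read as "Line".
    for size in (2, 1):
        phrase = " ".join(tokens[:size])
        if len(tokens) >= size and phrase in LOOKUP:
            obj, verb, direction = LOOKUP[phrase]
            return {"count": count, "object": obj, "verb": verb,
                    "direction": direction, "actions": tokens[size:]}
    raise ValueError(f"no object phrase found in {spoken!r}")


print(interpret("3 Lines"))
print(interpret("3 Lefts"))
print(interpret("3 Lines Delete"))
```

Adding a new object type means adding one row to the table; the command structure itself does not change.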
The key to manipulating objects is identifying the delimiters - whatever
defines an object. Double punctuation marks, like parentheses and brackets,
are text objects because they define phrases. Text objects delimited by
double punctuation marks play a relatively minor role in prose, but a much
more important role in mathematics and programming. Double punctuation marks,
along with any other symbolic or label-type delimiters, can be treated in
much the same way as any other text object in order to facilitate easy movement
among and manipulation of the objects they define. Such objects can also
be manipulated as a group using a group name - any object delimited by double
punctuation marks, for instance, could be defined as a "layer". It is also
useful to specify such an object minus the delimiters. This can be done
by adding "Minus" to the end of the command.
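Selecting a delimited object with or without its delimiters comes down to finding the matching pair around the cursor. The Python sketch below shows one simple way to do that for well-nested text; the `select_layer` function, its `minus` flag and the set of delimiters are assumptions for illustration rather than a description of how Utter Command works.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}


def select_layer(text, cursor, minus=False):
    """Return (start, end) of the innermost delimited object around the cursor, or None."""
    closers = set(PAIRS.values())
    # Scan left for the nearest opening delimiter that is still unclosed.
    depth, start = 0, None
    for i in range(cursor - 1, -1, -1):
        if text[i] in closers:
            depth += 1
        elif text[i] in PAIRS:
            if depth == 0:
                start = i
                break
            depth -= 1
    if start is None:
        return None
    # Scan right from there for the matching closing delimiter.
    opener, closer = text[start], PAIRS[text[start]]
    depth = 0
    for j in range(start + 1, len(text)):
        if text[j] == opener:
            depth += 1
        elif text[j] == closer:
            if depth == 0:
                # "Minus" drops the delimiters from the selection.
                return (start + 1, j) if minus else (start, j + 1)
            depth -= 1
    return None


text = "f(a, [b, c])"
start, end = select_layer(text, 7)
print(text[start:end])
start, end = select_layer(text, 7, minus=True)
print(text[start:end])
```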
There are other important objects in specialized text, and their
delimiters can include spacing and formatting. Screenplays, for instance,
have several important recurring objects: names of characters, shot headers,
and description. Because screenplay formatting is standardized, these elements
can be treated as objects.