The grammar behind Utter Command

Human-Machine Grammar

Human-Machine Grammar is a system of words and rules designed to allow humans to communicate commands to computers. It takes into consideration that humans have an extensive natural language capacity that has evolved over millions of years and that we use seemingly without effort, while computers do not yet have the ability to understand the meaning of speech.

It also takes into consideration that while language seems easy for humans, different phrasings encompass a considerable span of cognitive effort. Human-Machine Grammar is designed to limit cognitive effort in order to free up as much of the brain as possible to concentrate on the task at hand.

Natural language allows for a wide, textured range of communications, but controlling a computer only requires a relatively small set of distinct commands. Human-Machine Grammar is a succinct set of words that can be combined according to a concise set of grammar rules to communicate a small set of commands. The system is relatively easy for humans to learn, and computers can respond to the commands without having to decode natural language or be loaded down with large sets of synonymous commands. (For more details, see Structured vs. Natural Language.)

Human-Machine Grammar, like any language system in use, is an active, evolving set of words and rules. Redstart Systems founder Kimberly Patch began using speech recognition software and writing custom speech macros in 1994. Over time it became obvious that a formal grammar was needed. She has been developing these ideas since 1998; much of what she developed, including the logic underpinning the grammar, is covered in a series of talks.

We encourage people to use the HMG system when writing custom speech commands.

The grammar rules

1. Match the words used for a command as closely as possible with what the command does.
2. Use words the user sees on the screen. 
3. Be consistent. 
4. Balance the ease of saying a command with the ease of remembering a command. 
5. Use one-word commands very sparingly. Beyond one word, however, keep the number of words used in any given command to a minimum.
6. Eliminate unnecessary words. 
7. Eliminate synonyms. 
8. Reuse vocabulary words. 
9. Use existing word pairs. 
10. Follow the way people naturally adjust language to fit a situation. 
11. Use combined commands that follow the order of events. 
12. Allow the user to follow the action when necessary. 
13. Use phrase modes, or words that keep mode changes within single commands, to give the computer more information.
14. When appropriate, allow different ways to invoke the same function. 
15. Be thorough and consistent in enabling menu commands across all programs. 
16. In general, think of on-screen elements like text, symbols and graphics as logical objects, and enable similar objects to be manipulated in similar ways.

The 16 Human-Machine Grammar rules are aimed at keeping the speech interface vocabulary small and easy to remember and predict. These guidelines cut out alternate wordings and establish consistent patterns across the entire set of commands, making it much easier to remember or guess how a command should be worded. The examples below are taken from Redstart Systems' Utter Command speech interface for general computer use.

Rule 1: Match the words used for a command as closely as possible with what the command does.

This makes commands easier to remember.

"Line" refers to a line of text
"Touch" refers to clicking an on-screen element with the mouse arrow
"File" refers to a file
"Folder" refers to a folder

Rule 2: Use words the user sees on the screen.

This also makes commands easier to remember.

When enabling menu commands, for example, use existing words - the menu labels - to indicate menu actions.

Rule 3: Be consistent.

This also makes commands easier to remember and guess. Consistency means always using the same term to refer to an object or action, and the same constructions to build commands.

Notice the patterns in these groups of commands:
"2 Words", "3 Lines", "4 Graphs", "2 Layers", "5 Backspace", "2 Enter"
"Line Delete", "Word Bold", "Line Duplicate", "Word Duplicate"
"7 Lines Cut", "5 Words Bold", "3 5 Words Delete"
"My Documents Folder", "Google Site", "Alpha File"

Rule 4: Balance the ease of saying a command with the ease of remembering a command.

The ease of saying a command is always important, but becomes even more important the more often a command is used. In contrast, the ease of remembering a command is always important, but becomes even more important for commands that are not frequently used.

Clicking the mouse arrow is common, making it important that the command for clicking be easy to say. "Mouse Click" is particularly difficult. "Touch" is much easier, and also matches what the command does.

In enabling menu commands, it's important to use the words on the menu labels because even though they might not be worded well for ease of saying, most of them are good enough, there are many of them, and it is much easier to remember commands that you see on screen.

Rule 5: Use one-word commands very sparingly. Beyond one word, however, keep the number of words used in any given command to a minimum.

One-word commands are easy to remember and say, but are more likely than longer commands to be tripped accidentally when you mean to say them as text. There are a very few commands that are used extremely often, including "Enter", "Space", "Backspace", and "Escape". It makes sense to enable these few, very common commands as one-word commands. In situations where the system is limited to commands, like when the focus is on a dialog box, and when the command you want to say is on-screen, one-word commands also make sense.

Otherwise, however, commands should consist of more than one word, and just two if possible. Keeping the number of words used in commands to a minimum makes commands easier to remember, say and combine.

Rule 6: Eliminate unnecessary words.

This rule is closely related to rule 5. One key to keeping commands succinct is eliminating unnecessary words.

Here are some things to think about when paring a command to only necessary words:

Articles like "a" and "the", and polite, getting-started and redundant filler words are never needed.

When identifying an object is enough to imply an action, it isn't necessary to include the action word. Identifying a folder - for instance, "Budget Folder" - is enough to indicate that the folder named "Budget" be called up by the program in use.

Here's a command that shows it isn't necessary to include the object (cursor), the action (move) or the type of units (characters): "3 Left" is enough to indicate that the cursor be moved three characters to the left.

The bottom line is if there's no need to differentiate, there's no need to have the user spend brain cycles and time on remembering and saying a specific word.

Commands that contain only necessary information follow the way we work out jargon in repetitive human-human communication situations. For instance, a fast food worker putting in two orders of french fries typically says "two fry".

Rule 7: Eliminate synonyms.


A vocabulary without synonyms is smaller, which makes commands easier to remember and predict. It also makes combining commands practical, which, in turn, makes using a computer faster and more efficient.

The word "This", for instance, refers to something that's highlighted or on the clipboard. It's the only word that carries these meanings. If you want to say a command that carries out a single action on a selection, like "This Cut" or "This Bold", you know you'll use "This".

The word "Back", and only "Back", refers to moving something in the direction to the left of the cursor. "Word 3 Back", for instance, moves the word nearest the cursor 3 words to the left.

"Forward", and only "Forward", refers to moving something in the direction to the right of the cursor. "Graph 2 Forward", for instance, moves the paragraph nearest the cursor down two paragraphs.

This key rule is in stark contrast to most existing speech interfaces. The default configuration of Nuance's NaturallySpeaking, for instance, offers four different ways for the user to voice the punctuation mark "Open Quote" and four more ways for the user to voice the punctuation mark "Close Quote". It uses many synonyms, including "Start", "Begin", "Give Me", "Check", "Show", "Open", "Bring Up", "Edit" and "View" as the first word or words in commands that bring up a program or dialog box. And it offers 16 synonymous wordings for checking mail, 16 for creating a new mail message, five for opening a selected email message, and five for closing an email message. These 42 wordings for four functions are specific to a single email program.

Synonymous wordings pose another problem. Similar words came about because they have subtly different meanings. These differences are key to keeping the length of commands short and enabling different types of functions. If "Back" and "Forward" always refer to moving an object there's no need to include wording that indicates moving an object (like "Move") along with the directional words Back and Forward.

Rule 8: Reuse vocabulary words.


The world's languages regularly reuse vocabulary words. Context makes this possible, and it's important to take advantage of vocabulary reuse in order to keep command vocabulary small and easy to remember.

"Top", for instance, refers to the beginning of a document - the command "Go Top" puts the cursor at the beginning of a document.

"Top" also refers to the portion of a word, line, paragraph or document that lies before the cursor. For example, "Graph Top" selects the portion of a paragraph that is before the cursor and "Doc Top" selects from the cursor to the beginning of the document.

Numbers are also used in several different ways. Numbers can refer to hitting a key a number of times, like "3 Backspace" or selecting a number of objects, like "3 Lines".

The numbers 1 to 100 also indicate several types of absolute measures. "Volume 50", for instance, adjusts the computer's speaker to its middle volume setting.

Rule 9: Use existing word pairs.


This takes advantage of the instinctive knowledge that pairs carry related meanings. It also helps make the vocabulary concise and easy to remember.

"Back" and "Forward" are a pair. We also use "On" and "Off". For example, "Speech On" and "Speech Off" turn the microphone on and off. Another common pair is "Before" and "After" - "5 Before" moves the cursor 5 words to the left, while "5 After" moves the cursor 5 words to the right.

Rule 10: Follow the way people naturally adjust language to fit a situation.


This makes commands easier to learn and remember.

It's unusual to find a command that no existing word matches, but this does happen occasionally. In these cases where language must be stretched to fit a situation it is important that it be done in a natural way.

To select the three words before the cursor, for instance, say "3 Befores", and to select three words after the cursor, "3 Afters". Although these constructions might seem a little strange at first glance, they're easy to learn and remember because they follow natural patterns. "Afters" is already in use - it's a British term for dessert, as in what you have after a meal. There's another precedent that's closer to home: when people talk about hitting the Page Up key several times they talk about hitting several "page ups".

Rule 11: Use combined commands that follow the order of events.


Combining commands makes the interface more efficient by cutting down on the steps necessary to carry out computer functions. This also cuts down on mistakes simply because there are fewer steps.

When combining several steps into one command it's easier to picture the action and easier to remember the command if the command wording follows the way the command will be carried out.

For example, "3 Lines Bold" selects, then bolds the three lines below the cursor, and "3 Graphs Cut" selects, then cuts the three paragraphs below the cursor.

In general, commands contain one or more of three types of events:

· placing the cursor
· selecting an object
· carrying out an action

And, in general:

· moving the cursor comes first
· followed by selecting an object like text, a program element, a file, or a program
· followed by actions like moving, formatting, copying, deleting, or opening
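
The ordering above can be sketched as a tiny parser. This is only an illustration, not how Utter Command is implemented: the word sets are a hypothetical subset taken from the examples in this article, and the names parse and steps are invented here. The sketch handles just the "[count] object [action]" pattern and rejects words that arrive out of order.

```python
from typing import List, NamedTuple, Optional

# Hypothetical vocabulary, limited to words used in this article's examples.
SELECTORS = {"Word", "Words", "Line", "Lines", "Graph", "Graphs"}
ACTIONS = {"Bold", "Cut", "Delete", "Duplicate"}


class Command(NamedTuple):
    count: int
    selector: str
    action: Optional[str]


def parse(spoken: str) -> Command:
    """Parse '[count] object [action]', accepting only that order."""
    words = spoken.split()
    count = 1
    if words and words[0].isdigit():
        count = int(words.pop(0))
    if not words or words[0] not in SELECTORS:
        raise ValueError(f"expected an object word in {spoken!r}")
    selector = words.pop(0)
    action = None
    if words:
        action = words.pop(0)
        if action not in ACTIONS:
            raise ValueError(f"unknown action {action!r}")
    if words:
        raise ValueError(f"unexpected trailing words in {spoken!r}")
    return Command(count, selector, action)


def steps(command: Command) -> List[str]:
    """List the events in the order they are carried out."""
    plan = [f"select {command.count} x {command.selector}"]
    if command.action:
        plan.append(command.action.lower())
    return plan
```

With this sketch, steps(parse("3 Lines Bold")) gives the selection step followed by the bold step, while parse("Bold 3 Lines") raises a ValueError because the action word comes before the object.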

The three types of control keys don't have a natural chronological order and so instead follow a prescribed order:

Shift, Control, Alternate (this is also reverse alphabetical order)
For instance,
    "Shift Control a", but not "Control Shift a"
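
A prescribed order like this is easy to enforce mechanically. The following is a minimal sketch, assuming a hypothetical helper named canonical_key_command that is not part of any real product; it rearranges whatever modifiers it is given into the Shift, Control, Alternate order.

```python
from typing import Iterable

# The prescribed order from the text.
MODIFIER_ORDER = ["Shift", "Control", "Alternate"]


def canonical_key_command(modifiers: Iterable[str], key: str) -> str:
    """Return the one accepted wording for a modifier-key combination."""
    given = set(modifiers)
    unknown = given - set(MODIFIER_ORDER)
    if unknown:
        raise ValueError(f"unknown modifier(s): {sorted(unknown)}")
    ordered = [name for name in MODIFIER_ORDER if name in given]
    return " ".join(ordered + [key])
```

For example, canonical_key_command(["Control", "Shift"], "a") returns "Shift Control a", so only one wording ever needs to be learned.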

One consequence of using commands that follow the order of events is that you're initiating an action ("Window Close") rather than telling the computer to do something ("Close Window"). This is a subtle point, but using words that depict closing the window directly rather than words that direct a third party - the computer - to do so is simpler and so uses less cognitive effort. This practice also makes commands more consistent and eliminates alternate wordings.

Combined commands also give you an efficient way to recover from mistakes - like you mis-counting or the computer mishearing - so you don't become mired in a succession of miscues. Consider this scenario: you're attempting to quickly and efficiently change "two" to "to" immediately after having said "two". The command "Left Backspace" carries this out in a single command. If you accidentally say "2 Left Backspace", however, instead of "to" you are left with "wo" with the cursor to the left of the letters. You can correct this mistake with a single combined command: "Delete t", which deletes the "w" to the right of the cursor and types "t".
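
The scenario can be traced with a toy text buffer. This is a sketch for following the cursor only; the Buffer class and its method names are invented for this illustration and say nothing about how a speech engine edits text.

```python
class Buffer:
    """A toy text buffer; the cursor is an index between characters."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.cursor = len(text)  # start just after the last character

    def left(self, count: int = 1) -> "Buffer":
        self.cursor = max(0, self.cursor - count)
        return self

    def backspace(self) -> "Buffer":
        # Remove the character before the cursor.
        if self.cursor > 0:
            self.text = self.text[:self.cursor - 1] + self.text[self.cursor:]
            self.cursor -= 1
        return self

    def delete(self) -> "Buffer":
        # Remove the character after the cursor.
        self.text = self.text[:self.cursor] + self.text[self.cursor + 1:]
        return self

    def insert(self, typed: str) -> "Buffer":
        self.text = self.text[:self.cursor] + typed + self.text[self.cursor:]
        self.cursor += len(typed)
        return self


intended = Buffer("two").left().backspace()    # "Left Backspace"
mistake = Buffer("two").left(2).backspace()    # "2 Left Backspace"
state_after_mistake = (mistake.text, mistake.cursor)
mistake.delete().insert("t")                   # "Delete t"
```

Here intended.text is "to", state_after_mistake is ("wo", 0), and after the correction mistake.text is also "to".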

Rule 12: Allow the user to follow the action when necessary.

When you use the mouse to carry out an action that involves several separate steps, like selecting a paragraph, cutting the paragraph, moving the cursor to another location, then pasting the paragraph, you, by default, follow exactly what's happening because you have to initiate each step.

When you're using speech - and especially when you're using long speech commands - it's important to make sure that you're able to follow the action. For instance, when you select, cut, move and paste text using a single command, you should be able to see the text highlighted in its original location before it's cut, then highlighted after it's pasted in the new location. This allows you to easily follow the action so you can automatically confirm what's happening rather than having to figure out what occurred after the fact, perform another operation to confirm an action, or simply take on faith that an action was carried out correctly.

It's important that this kind of feedback not become annoying, however, so it should happen quickly. Audio feedback is also useful, but should be used sparingly so that it doesn't become annoying. Here are a couple of examples from Utter Command:

When you turn off the microphone you're often turning away from your computer - the audio "Speech Off" and "Microphone Off" confirmations mean you don't have to wait to see the microphone change color.

When you copy and cut to the UC Clipboard files you hear confirmations so you know your text has been pasted into the correct clipboard file.

Feedback can be subtle. Here are a couple of subtle examples from Utter Command:

When you move the mouse using speech you can more easily follow the action because the mouse arrow wiggles slightly at the end of a command. The wiggle is subtle enough that it usually doesn't enter the user's awareness unless she is told about it, but it is enough to draw her eye to the new mouse location.

When you combine closing a window and clicking "Yes" or "No" to save a file the arrow briefly pauses in front of the proper button so you can see which button the arrow clicks. In addition, the arrow waits twice as long in front of the Yes button as the No button.

Rule 13: Use phrase modes, or words that keep mode changes within single commands, to allow the human to give the computer more information.

"Short" and "Long", for example, are used to distinguish between several different types of ambiguous spoken commands:

· symbolic and written forms, such as "3" versus "three" and "star" versus "*"
· full forms of words and their abbreviations such as "January" versus "Jan."
· words that sound exactly the same - homophones like "pair", "pare", and "pear"
· different formats of the date or time, such as "6-21-05" versus "June 21, 2005"
· numbers and number values in otherwise ambiguous combined commands, such as moving the cursor down then typing a number versus moving the cursor down a number of lines
· command words and text, such as typing a single word that also appears in the menu bar across the top of many programs

    Examples:

"3" allows the computer to determine what the user means based on context, "3 Short" types "3" and "3 Long" types "three"
"Star" leaves the form up to the computer, "Star Short" types "*", and "Star Long" types "star"
"Versus" allows the computer to decide between the long and commonly abbreviated versions of this word, "Versus Short" types "vs." and "Versus Long" types "versus"
"3 Down" moves the cursor down three lines, and "3 Short Down" types "3", then moves the cursor down one line
"Window" clicks the Window menu in programs that have one, and "Window Long" types "Window"

Short and long can be further modified with a number in the case of multiple homophones. These are arranged in alphabetical order.

For example, "4" leaves the form up to the computer, "4 Short" types "4", "4 Long" types "four", "4 Long 1" types "For", and "4 Long 2" types "Fore"; "Pair Long 1" types "Pair", "Pair Long 2" types "Pare", and "Pair Long 3" types "Pear". In these cases, "Long 1 to 10" is not functionally different from "Short 1 to 10".
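
The lookup behind these examples can be sketched with a small table. Everything here is hypothetical: the FORMS table holds only the words used in the examples above, the function name resolve is invented, output is lowercase for simplicity, and treating "four" as a member of the numbered list for "4" is an assumption. A bare word returns None to stand for "leave the choice to the computer".

```python
from typing import Optional

# Hypothetical table built from the examples in the text.
FORMS = {
    "3": {"short": "3", "long": "three"},
    "4": {"short": "4", "long": "four", "homophones": ["four", "for", "fore"]},
    "Star": {"short": "*", "long": "star"},
    "Versus": {"short": "vs.", "long": "versus"},
    "Pair": {"long": "pair", "homophones": ["pair", "pare", "pear"]},
}


def resolve(spoken: str) -> Optional[str]:
    """Return the text to type, or None when the form is left to context."""
    words = spoken.split()
    entry = FORMS[words[0]]
    if len(words) == 1:
        return None  # no phrase mode given: the computer decides
    mode = words[1].lower()
    if mode not in ("short", "long"):
        raise ValueError(f"unknown phrase mode {words[1]!r}")
    if len(words) == 3:
        # Numbered forms pick a homophone from the alphabetical list;
        # "Short n" and "Long n" behave the same way.
        ranked = sorted(entry.get("homophones", []))
        return ranked[int(words[2]) - 1]
    return entry[mode]
```

In this sketch resolve("4 Long 2") returns "fore" and resolve("Pair Long 3") returns "pear", because the homophones are sorted alphabetically before the number is applied.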

This method has the advantage of scalability. As computers get better at distinguishing between forms, users will naturally shift the task of choosing back to the computer by using the default single words more often.

Phrase modes also avoid the well-known problem of users losing their bearings with modes that must be turned on and off.

Rule 14: When appropriate, allow different ways to invoke the same function.


This is the speech equivalent of a graphical user interface that allows you to go through a menu, click a button on the desktop or press a keyboard combination to carry out a function depending on the situation.

It's important to note that this refers to different ways to carry out the same function - enabling existing pathways to the same command - rather than the common use of synonymous wordings for the same function.

Enabling different ways of carrying out the same function allows you to take advantage of any existing knowledge you might have about a program.

For instance, you should have the choice of using a single speech command that invokes a deep menu function ("File Save"), or a single speech command that carries out a series of keystrokes that accomplishes the same thing ("Control s"). This both taps existing knowledge and reduces the chances that a user will be unable to figure out a way to do something by speech even given special circumstances that restrict options.

It's also important that you can invoke functions using only local knowledge - that is, what you see on the screen.

Dialog boxes present a bit of a special case, because on-screen words exist for dialog boxes in two places: on the menu and on the top of the dialog box. Unfortunately, in some programs some of these labels differ. In these cases you should have the choice of calling up the dialog box using a command based on the words used to name the dialog box in the menu system (for instance, the first word of the NatSpeak vocabulary manager menu label is "Edit"), or a command based on the words on the top of the dialog box (for instance, the first word of the NatSpeak Vocabulary Editor dialog box is "Vocabulary"), or any existing keyboard shortcut.

Rule 15: Be thorough and consistent in enabling menu commands across all programs.


This guideline is related to the second and third guidelines - use words that you see on-screen, and be as consistent as possible. Consistency is good for both people and computers. It helps people remember and it enables automation.

Here's how to enable all the menu commands in any program (these commands work from within the target program):

In general,

· File menu commands are made up of the first two words of a command as it appears on the menu, ignoring company names, version numbers, and the words "and" and "or".
· Menu commands that call up a submenu can also be accessed using the first word of the menu plus the word "Menu". (See rule 14 for the logic behind this additional wording.)
· Menu commands that call up dialog boxes can also be accessed using the first word of the dialog box label plus the word "Box" or "Open". Note that sometimes the dialog box label does not match the words used to indicate the dialog box on the menu. (See rule 14 for the logic behind this additional wording.)
· Commands like tabs and text boxes within dialog boxes can be invoked directly using the first word of the dialog box plus the first word of the tab or text box. This type of command can also be combined with standard input to a text box, like a number, or checking a box. This type of command can be further combined to open the dialog box, provide the input, then close the dialog box by adding the word "Close" to the end of the command.
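
The first of these rules is mechanical enough to sketch in code. The function below is an illustration only: menu_speech_command is an invented name, the caller has to supply the company names to ignore, and treating any word that starts with a digit as a version number is a simplifying assumption.

```python
from typing import Iterable


def menu_speech_command(label: str, company_names: Iterable[str] = ()) -> str:
    """Build a speech command from the first two significant label words."""
    skip = {"and", "or"} | {name.lower() for name in company_names}
    kept = []
    for word in label.split():
        word = word.strip(".,:")  # drop a trailing ellipsis, as in "Save As..."
        if not word or word.lower() in skip:
            continue
        if word[0].isdigit():  # assumed to be a version number
            continue
        kept.append(word)
    return " ".join(kept[:2])
```

With hypothetical labels, menu_speech_command("Save As...") returns "Save As", and menu_speech_command("About Acme Editor 9.5", company_names=["Acme"]) returns "About Editor".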

Here's how to handle the difficult cases:

· If a top-level menu has just one word, add "Menu" after the word. For example, "Edit Menu".
· If a two-word menu command conflicts with another command in the menu system, add the next word of the menu item label if possible.
· If a non-top-level menu command has just one word or is a multi-word command whose conflict with another command can't be resolved by adding subsequent words, add the first word of the menu or menu branch directly before the menu command to the front of the speech command. In the event of continued conflict, add a number to the end of the speech command. Commands are numbered left to right and top to bottom according to their positions in the menu system.
· If menu commands don't contain words, number them in the standard order of left to right and top to bottom. For instance the "Format/Background" submenu in Word contains just blocks of color.
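
The conflict-handling steps for non-top-level menu items can be sketched as three passes over the menu. This is a simplified illustration under several assumptions: assign_commands is an invented name, items arrive as (parent menu, label) pairs already in left-to-right, top-to-bottom order, only one extra label word is tried, and items without words are not handled.

```python
from collections import Counter
from typing import List, Set, Tuple


def _clashing(commands: List[str]) -> Set[str]:
    counts = Counter(commands)
    return {command for command in commands if counts[command] > 1}


def assign_commands(items: List[Tuple[str, str]]) -> List[str]:
    """items: (parent_menu, label) pairs in menu order."""
    commands, prefixed = [], []
    for parent, label in items:
        words = label.split()
        one_word = len(words) == 1
        # A one-word item takes the first word of its parent menu up front.
        commands.append(f"{parent.split()[0]} {label}" if one_word
                        else " ".join(words[:2]))
        prefixed.append(one_word)

    # Pass 1: a conflicting command gains the next word of its label, if any.
    clash = _clashing(commands)
    for i, (_, label) in enumerate(items):
        words = label.split()
        if commands[i] in clash and len(words) > 2:
            commands[i] = " ".join(words[:3])

    # Pass 2: remaining conflicts get the parent menu's first word in front.
    clash = _clashing(commands)
    for i, (parent, _) in enumerate(items):
        if commands[i] in clash and not prefixed[i]:
            commands[i] = f"{parent.split()[0]} {commands[i]}"

    # Pass 3: whatever still collides is numbered in menu order.
    clash = _clashing(commands)
    seen = Counter()
    for i, command in enumerate(commands):
        if command in clash:
            seen[command] += 1
            commands[i] = f"{command} {seen[command]}"
    return commands
```

With hypothetical menus, ("File", "Save") becomes "File Save", and two items labeled "Page Break" under Format and Insert become "Format Page Break" and "Insert Page Break".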

These rules make it possible for you to figure out commands by going through existing menus and dialog boxes, gradually saving steps until you become used to the most efficient commands.

These rules work no matter how menu items are constructed, but they work best when menu items are generated according to the well-established good interface guidelines that call for consistent, descriptive, noun-based menu items.

These rules work well to fully enable a program's menu system for speech. There are also a couple of practical matters to consider. You should be able to quickly enable any portion of the menu and dialog box commands for any given program at any given time. And you should be able to change individual wordings in this standard template. We recommend changes, however, only in cases in which an often-used command is especially awkwardly worded.

In addition, you might want to enable some program menus or program menu items so they work whether or not that program is active. One good example is speech engine software menus, which should be accessible whether or not the system focus is on the speech engine program. UC fully enables the UC and NatSpeak program menus this way (see UC Lesson 1.6 and 1.9). It's also sometimes useful to enable key functions from certain programs so they can be accessed globally. UC enables Web site, file and folder access, and Windows system and sound controls this way (see UC Lesson 10.15).

Here's how to enable menu commands that should be accessible globally.

· Start the command with the name of the program (such as NatSpeak) or, to call up a default program, the name of the type of program (such as Media or Mail), followed by just the first word of the menu item.
· If a command conflicts with another command in the menu system, add the next word of the menu item label if possible.
· If a conflict with another command cannot be resolved by adding subsequent words, insert the first word of the menu or menu branch that is directly before the menu command after the name of the program (so that it is the second word of the command). In the event of continued conflict, add a number to the end of the speech command. Commands are numbered left to right and top to bottom according to their positions in the menu system.
· If menu commands don't contain words, number them in the standard order of left to right and top to bottom.


Rule 16: In general, think of on-screen elements like text, symbols and graphics as logical objects, and enable similar objects to be manipulated in similar ways.


This is less a rule than a guideline. Keeping this in mind will enable you to follow the rules better, and will facilitate a smaller and more useful set of commands, which, of course, makes commands easier to remember.

Here's an example. The basic elements, or objects, of text are characters, words, lines, sentences, paragraphs and documents. Once these are defined they can be manipulated, and the cursor can be moved around them, using the same command structures with different object words.

In the case of characters, words, lines, sentences, paragraphs and documents, each text object must be defined in several different ways. Take "line", for instance. You need to indicate moving the cursor up or down by a line and selecting up or down by a line. Here are the needed variations: Line Up, Line, Line Ups, Lines. Follow the same pattern to set up the variations for the other objects. Paragraph: Graph Up, Graph, Graph Ups, Graphs. Letter: Left, Right, Lefts, Rights. Word: Before, After, Befores, Afters. Because document is the largest object, variations aren't needed. Once these are defined, and you know how to use one of them, it's trivial to apply the command structure to the others. For example, once you know you can say "3 Lines" to select the next 3 lines, "3 Graphs", "3 Lefts", and even "3 Lines Delete" are intuitive.
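
The payoff of defining every object the same way can be sketched with a table-driven interpreter. This is an illustration under stated assumptions: the names OBJECTS, ROLES and interpret are invented, "backward" and "forward" are labels chosen here for the up/left/before and down/right/after directions, and only the counted forms named in this passage are covered (a command like "Line Delete" is outside the sketch).

```python
from typing import NamedTuple, Optional

# Variation words as listed in the text, in the order:
# move back/up, move forward/down, select back/up, select forward/down.
OBJECTS = {
    "line": ("Line Up", "Line", "Line Ups", "Lines"),
    "paragraph": ("Graph Up", "Graph", "Graph Ups", "Graphs"),
    "letter": ("Left", "Right", "Lefts", "Rights"),
    "word": ("Before", "After", "Befores", "Afters"),
}
ROLES = (("move", "backward"), ("move", "forward"),
         ("select", "backward"), ("select", "forward"))

# One shared lookup covers every object, so each object gets the same structure.
VOCABULARY = {
    phrase: (obj, operation, direction)
    for obj, phrases in OBJECTS.items()
    for phrase, (operation, direction) in zip(phrases, ROLES)
}


class Interpretation(NamedTuple):
    operation: str
    direction: str
    count: int
    obj: str
    action: Optional[str]


def interpret(spoken: str) -> Interpretation:
    words = spoken.split()
    count = 1
    if words and words[0].isdigit():
        count = int(words.pop(0))
    for size in (2, 1):  # try the two-word phrase ("Line Up") before "Line"
        phrase = " ".join(words[:size])
        if phrase in VOCABULARY:
            obj, operation, direction = VOCABULARY[phrase]
            action = " ".join(words[size:]) or None
            if action and operation != "select":
                raise ValueError("actions after a move phrase are not modeled here")
            return Interpretation(operation, direction, count, obj, action)
    raise ValueError(f"no object phrase in {spoken!r}")
```

Adding a new text object to this sketch means adding one row to OBJECTS; "3 Lines", "3 Graphs" and "3 Lefts" then all pass through the same code path.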

The key to manipulating objects is identifying the delimiters - whatever defines an object. Double punctuation marks, like parentheses and brackets, are text objects because they define phrases. Text objects delimited by double punctuation marks play a relatively minor role in prose, but a much more important role in mathematics and programming. Double punctuation marks, along with any other symbolic or label-type delimiters, can be treated in much the same way as any other text object in order to facilitate easy movement among and manipulation of the objects they define. Such objects can also be manipulated as a group using a group name - any object delimited by double punctuation marks, for instance, could be defined as a "layer". It is also useful to specify such an object minus the delimiters. This can be done by adding "Minus" to the end of the command.

There are other important objects in specialized text, and their delimiters can include spacing and formatting. Screenplays, for instance, have several important recurring objects: names of characters, shot headers, and description. Because screenplay formatting is standardized, these elements can be treated as objects.