The aural rendering of a document, already commonly used by the
blind and print-impaired communities, combines speech synthesis and
"auditory icons." Often
such aural presentation occurs by converting the document to plain
text and feeding this to a screen reader -- software or hardware that
simply reads all the characters on the screen. This results in less
effective presentation than would be the case if the document
structure were retained. Style sheet properties for aural presentation
may be used together with visual properties (mixed media) or as an
aural alternative to visual presentation.
Besides the obvious accessibility advantages, there are other large
markets for listening to information, including in-car use, industrial
and medical documentation systems (intranets), home entertainment, and
aids for users who are learning to read or who have difficulty reading.
When using aural properties, the canvas consists of a three-dimensional physical
space (sound surrounds) and a temporal space (one may specify sounds
before, during, and after other sounds). The CSS properties also
allow authors to vary the quality of synthesized speech (voice type,
frequency, inflection, etc.).
Example(s):
H1, H2, H3, H4, H5, H6 {
    voice-family: paul;
    stress: 20;
    richness: 90;
    cue-before: url("ping.au")
}
P.heidi { azimuth: center-left }
P.peter { azimuth: right }
P.goat  { volume: x-soft }
This will direct the speech synthesizer to speak headers in a voice
(a kind of "audio font") called "paul", on a flat tone, but in a very
rich voice. Before speaking the headers, a sound sample will be played
from the given URL. Paragraphs with class "heidi" will appear to come
from front left (if the sound system is capable of spatial audio), and
paragraphs of class "peter" from the right. Paragraphs with class
"goat" will be very soft.
'volume'
Value: <number> | <percentage> | silent | x-soft | soft | medium | loud | x-loud | inherit
Initial: medium
Applies to: all elements
Inherited: yes
Percentages: refer to inherited value
Media: aural
Volume refers to the
median volume of the waveform. In other words, a highly inflected
voice at a volume of 50 might peak well above that. The overall values
are likely to be human adjustable for comfort, for example with a
physical volume control (which would increase both the 0 and 100
values proportionately); what this property does is adjust the dynamic
range.
Values have the following meanings:
- <number>
- Any number between '0' and '100'.
'0' represents the minimum audible
volume level and 100 corresponds to the
maximum comfortable level.
- <percentage>
- Percentage values are calculated relative to the inherited value,
and are then clipped to the range '0' to '100'.
- silent
- No sound at all. The value '0' does not mean
the same as 'silent'.
- x-soft
- Same as '0'.
- soft
- Same as '25'.
- medium
- Same as '50'.
- loud
- Same as '75'.
- x-loud
- Same as '100'.
User agents should allow the values corresponding to '0' and '100'
to be set by the listener. No one setting is universally applicable;
suitable values depend on the equipment in use (speakers, headphones),
the environment (in car, home theater, library) and personal
preferences. Some examples:
- A browser for in-car use has a setting for when there is lots of
background noise. '0' would map to a fairly high level and '100' to a
quite high level. The speech is easily audible over the road noise but
the overall dynamic range is compressed. Cars with better
insulation might allow a wider dynamic range.
- Another speech browser is being used in an apartment, late at
night, or in a shared study room. '0' is set to a very quiet level and
'100' to a fairly quiet level, too. As with the first example, there
is a low slope; the dynamic range is reduced. The actual volumes are
low here, whereas they were high in the first example.
- A speech browser is used in a quiet and isolated house with an
expensive hi-fi home theater setup. '0' is set fairly low and '100' to
quite high; there is wide dynamic range.
The same author style sheet could be used in all cases, simply by
mapping the '0' and '100' points suitably at the client side.
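The percentage rule and the listener-side mapping described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the linear interpolation between the listener's two endpoints is one possible mapping rather than a required one, and the case of a 'silent' inherited value is not handled.

```python
# Keyword values as listed above. 'silent' is handled separately
# because it is not equivalent to any number.
VOLUME_KEYWORDS = {"x-soft": 0, "soft": 25, "medium": 50, "loud": 75, "x-loud": 100}


def resolve_volume(value, inherited=50.0):
    """Resolve a 'volume' value to a number in 0..100, or None for 'silent'."""
    if value == "silent":
        return None
    if value in VOLUME_KEYWORDS:
        return float(VOLUME_KEYWORDS[value])
    if value.endswith("%"):
        # Percentages are relative to the inherited value, then clipped.
        scaled = inherited * float(value[:-1]) / 100.0
        return max(0.0, min(100.0, scaled))
    return float(value)  # a plain <number>, assumed to be within 0..100


def to_output_level(volume, level_at_0, level_at_100):
    """Map 0..100 onto listener-chosen endpoints (linear mapping is an assumption)."""
    return level_at_0 + (level_at_100 - level_at_0) * volume / 100.0
```

With hypothetical endpoints of 60 and 70 for an in-car setting and 20 and 90 for a home theater, the same 'medium' volume maps to 65 and 55 respectively, so one author style sheet serves both environments.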
'speak'
Value: normal | none | spell-out | inherit
Initial: normal
Applies to: all elements
Inherited: yes
Percentages: N/A
Media: aural
This property specifies whether text will be rendered aurally and
if so, in what manner (somewhat analogous to the 'display' property). The possible
values are:
- none
- Suppresses aural rendering so that the
element requires no time to render. Note, however, that
descendants may override this value and will be spoken. (To
be sure to suppress rendering of an
element and its descendants, use the
'display' property).
- normal
- Uses language-dependent pronunciation rules for rendering
an element and its children.
- spell-out
- Spells the text one letter at a time (useful for acronyms and
abbreviations).
Note the difference between an element whose 'volume' property has a value of
'silent' and an element whose 'speak' property has the value 'none'.
The former takes up the same time as if it had been spoken, including
any pause before and after the element, but no sound is generated. The
latter requires no time and is not rendered (though its descendants
may be).
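The distinction between 'volume: silent' and 'speak: none' can be illustrated with a small timing sketch. The function names and the fixed 500 ms per token are hypothetical, pauses and cues are left out, and descendants that override 'speak' are not modelled.

```python
def tokens_to_render(text, speak="normal"):
    """Return the units a synthesizer would be asked to render."""
    if speak == "none":
        return []  # nothing is rendered, so no time is taken
    if speak == "spell-out":
        return [ch for ch in text if not ch.isspace()]  # one letter at a time
    return text.split()  # 'normal': whole words


def render(text, speak="normal", volume="medium", ms_per_token=500):
    """Return (duration_ms, audible) for an element's own text."""
    tokens = tokens_to_render(text, speak)
    duration = len(tokens) * ms_per_token
    # 'silent' keeps the duration but produces no sound.
    audible = bool(tokens) and volume != "silent"
    return duration, audible
```

Under these assumptions, a two-word element with 'volume: silent' still occupies 1000 ms, whereas the same element with 'speak: none' occupies none.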
The 'pause-before' and 'pause-after' properties specify a pause to be observed before (or after)
speaking an element's content. Values have the following
meanings:
- <time>
- Expresses the pause in absolute time units (seconds and milliseconds).
- <percentage>
- Refers to the inverse of the value of the
'speech-rate' property.
For example, if the speech-rate is 120 words per minute
(i.e., a word takes half a second, or 500ms) then a 'pause-before' of 100% means a
pause of 500 ms and a 'pause-before' of 20% means
100ms.
The pause is inserted between the element's content and any 'cue-before' or 'cue-after' content.
Authors should use relative units to create more robust style
sheets in the face of large changes in speech-rate.
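The percentage arithmetic above can be checked with a short sketch. The function name is hypothetical, and only the 'ms', 's', and '%' forms are parsed.

```python
def pause_ms(value, speech_rate_wpm):
    """Resolve a 'pause-before' or 'pause-after' value to milliseconds."""
    value = value.strip()
    if value.endswith("%"):
        # 100% equals the time taken by one word at the current speech rate.
        ms_per_word = 60_000.0 / speech_rate_wpm
        return ms_per_word * float(value[:-1]) / 100.0
    if value.endswith("ms"):
        return float(value[:-2])
    if value.endswith("s"):
        return float(value[:-1]) * 1000.0
    raise ValueError(f"unsupported pause value: {value!r}")
```

At 120 words per minute, '20%' resolves to 100 ms; if the speech rate doubles to 240 words per minute, the same rule yields 50 ms, which is why relative units keep their proportion when the rate changes.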
'pause'
Value: [ [<time> | <percentage>]{1,2} ] | inherit
Initial: depends on user agent
Applies to: all elements
Inherited: no
Percentages: see descriptions of 'pause-before' and 'pause-after'
Media: aural
The 'pause' property is a
shorthand for setting 'pause-before' and 'pause-after'. If two values are
given, the first value is 'pause-before' and the second is
'pause-after'. If only one
value is given, it applies to both properties.
Example(s):
H1 { pause: 20ms } /* pause-before: 20ms; pause-after: 20ms */
H2 { pause: 30ms 40ms } /* pause-before: 30ms; pause-after: 40ms */
H3 { pause-after: 10ms } /* pause-before: ?; pause-after: 10ms */
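The shorthand expansion shown in the comments above can be sketched as a small function. The function name is hypothetical, and values are kept as strings rather than resolved to times.

```python
def expand_pause(value):
    """Expand a 'pause' shorthand value into its two longhand properties."""
    parts = value.split()
    if not 1 <= len(parts) <= 2:
        raise ValueError("'pause' takes one or two values")
    # With one value, the first and last part are the same item,
    # so it applies to both properties.
    return {"pause-before": parts[0], "pause-after": parts[-1]}
```

A rule that sets only 'pause-after', such as the H3 rule above, involves no shorthand, so 'pause-before' keeps whatever value it already had.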
Auditory icons are another way to distinguish semantic
elements. With the 'cue-before' and 'cue-after' properties, sounds may be
played before and/or after the element to delimit it. Values have the
following meanings:
- <uri>
- The URI must designate an auditory icon resource. If the URI resolves to something other than an audio file, such as an image, the resource should be ignored and the property treated as if it had the value 'none'.
- none
- No auditory icon is specified.
Example(s):
A {cue-before: url("bell.aiff"); cue-after: url("dong.wav") }
H1 {cue-before: url("pop.au"); cue-after: url("pop.au") }
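The following sketch shows how cues and pauses might be sequenced around an element, and how a non-audio resource falls back to 'none'. The function names are hypothetical, and testing the reported media type for an "audio/" prefix is an assumption about how a user agent could recognize an audio resource.

```python
def usable_cue(uri, media_type):
    """Return the cue URI, or None when the resource is not audio (treated as 'none')."""
    if uri is None or media_type is None:
        return None
    return uri if media_type.lower().startswith("audio/") else None


def element_timeline(content, cue_before=None, cue_after=None,
                     pause_before_ms=0, pause_after_ms=0):
    """List the events for one element in playback order.

    Pauses sit between the spoken content and the cues, so the order is:
    cue-before, pause-before, content, pause-after, cue-after.
    """
    events = []
    if cue_before:
        events.append(("cue", cue_before))
    if pause_before_ms:
        events.append(("pause", pause_before_ms))
    events.append(("speak", content))
    if pause_after_ms:
        events.append(("pause", pause_after_ms))
    if cue_after:
        events.append(("cue", cue_after))
    return events
```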
The 'cue' property is a shorthand
for setting 'cue-before'
and 'cue-after'. If two
values are given, the first value is 'cue-before' and the second is
'cue-after'. If only one
value is given, it applies to both properties.
Example(s):
The following two rules are equivalent:
H1 {cue-before: url("pop.au"); cue-after: url("pop.au") }
H1 {cue: url("pop.au") }
If a user agent cannot render an auditory icon (e.g., the user's
environment does not permit it), we recommend that it produce an
alternative cue (e.g., popping up a warning, emitting a warning sound,
etc.).
Please see the sections on the :before and :after
pseudo-elements for information on other content generation
techniques.
Similar to the 'cue-before' and 'cue-after' properties, the
'play-during' property specifies a sound to be played as a background
while an element's content is spoken.
Values have the following meanings:
- <uri>
- The sound designated by this <uri> is played
as a background while the element's content is spoken.
- mix
- When present, this keyword means that
the sound inherited from the parent element's 'play-during' property continues
to play and the sound designated by the <uri> is mixed with it. If
'mix' is not specified, the element's background sound replaces
the parent's.
- repeat
- When present, this keyword means that the sound will repeat if it
is too short to fill the entire duration of the element. Otherwise,
the sound plays once and then stops. This is similar to the 'background-repeat'
property. If the sound is too long for the element, it is clipped once
the element has been spoken.
- auto
- The sound of the parent element continues to play
(it is not restarted, which would have been the case if this property
had been inherited).
- none
- This keyword means that there is silence. The sound of the
parent element (if any) is silent during the current element and
continues after the current element.
Example(s):
BLOCKQUOTE.sad { play-during: url("violins.aiff") }
BLOCKQUOTE Q { play-during: url("harp.wav") mix }
SPAN.quiet { play-during: none }
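The interaction of 'mix', 'auto', and 'none' with the parent's background sound can be sketched as below. The representation of a value as either a keyword string or a (uri, flags) pair is an assumption made for the example, as are the function names. The sketch tracks which sounds are audible, not the fact that an 'auto' sound continues without restarting.

```python
import math


def background_sounds(parent_sounds, value="auto"):
    """Return the background sounds audible while an element is spoken.

    value is 'auto', 'none', or a (uri, flags) pair where flags is a set
    that may contain 'mix' and 'repeat'.
    """
    if value == "auto":
        return list(parent_sounds)  # the parent's sound simply continues
    if value == "none":
        return []  # silence during this element
    uri, flags = value
    if "mix" in flags:
        return list(parent_sounds) + [uri]  # mixed with the parent's sound
    return [uri]  # replaces the parent's sound


def play_count(sound_ms, element_ms, repeat):
    """Number of times the sound starts; the final play is clipped at the element's end."""
    if not repeat:
        return 1
    return max(1, math.ceil(element_ms / sound_ms))
```

Applied to the rules above, a Q inside BLOCKQUOTE.sad hears both the violins and the harp, while a SPAN.quiet inside either hears neither.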
Spatial audio is an important stylistic property for aural
presentation. It provides a natural way to tell several voices apart,
as in real life (people rarely all stand in the same spot in a
room). Stereo speakers produce a lateral sound stage. Binaural
headphones or the increasingly popular 5-speaker home theater setups
can generate full surround sound, and multi-speaker setups can create
a true three-dimensional sound stage. VRML 2.0 also includes spatial
audio, which implies that in time consumer-priced spatial audio
hardware will become more widely available.
'azimuth'
Value: <angle> | [[ left-side | far-left | left | center-left | center | center-right | right | far-right | right-side ] || behind ] | leftwards | rightwards | inherit
Initial: center
Applies to: all elements
Inherited: yes
Percentages: N/A
Media: aural
Values have the following meanings:
- <angle>
- Position is described in terms of an angle
within the range '-360deg' to '360deg'.
The value '0deg' means directly ahead in the center of the sound
stage. '90deg' is to the right, '180deg' behind, and '270deg' (or,
equivalently and more conveniently, '-90deg') to the left.
- left-side
- Same as '270deg'. With 'behind', '270deg'.
- far-left
- Same as '300deg'. With 'behind', '240deg'.
- left
- Same as '320deg'. With 'behind', '220deg'.
- center-left
- Same as '340deg'. With 'behind', '200deg'.
- center
- Same as '0deg'. With 'behind', '180deg'.
- center-right
- Same as '20deg'. With 'behind', '160deg'.
- right
- Same as '40deg'. With 'behind', '140deg'.
- far-right
- Same as '60deg'. With 'behind', '120deg'.
- right-side
- Same as '90deg'. With 'behind', '90deg'.
- leftwards
- Moves the sound
to the left, relative to the current angle.
More precisely, subtracts 20 degrees.
Arithmetic is carried out modulo 360 degrees. Note that
'leftwards' is more accurately described as "turned
counter-clockwise," since it always subtracts 20 degrees,
even if the inherited azimuth is already behind the listener (in which
case the sound actually appears to move to the right).
- rightwards
- Moves the sound
to the right, relative to the
current angle. More precisely, adds 20 degrees. See 'leftwards'
for arithmetic.
This property is most likely to be implemented by mixing the same
signal into different channels at differing volumes. It might also
use phase shifting, digital delay, and other such techniques to
provide the illusion of a sound stage. The precise means used to
achieve this effect and the number of speakers used to do so are
user agent-dependent; this property merely identifies the desired end
result.
Example(s):
H1 { azimuth: 30deg }
TD.a { azimuth: far-right } /* 60deg */
#12 { azimuth: behind far-right } /* 120deg */
P.comment { azimuth: behind } /* 180deg */
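The keyword table and the relative keywords can be reproduced with a short sketch. The function name is hypothetical. The code relies on the observation that every 'behind' angle in the table equals 180 degrees minus the front angle, modulo 360; that is a convenience of this sketch, not a rule stated by the property definition.

```python
# Front-hemisphere angles for each position keyword, in degrees.
AZIMUTH_FRONT = {
    "left-side": 270, "far-left": 300, "left": 320, "center-left": 340,
    "center": 0, "center-right": 20, "right": 40, "far-right": 60,
    "right-side": 90,
}


def resolve_azimuth(value, inherited=0):
    """Resolve an 'azimuth' value to an angle in the range 0 <= angle < 360."""
    words = value.split()
    if words == ["leftwards"]:
        return (inherited - 20) % 360  # always counter-clockwise
    if words == ["rightwards"]:
        return (inherited + 20) % 360
    if len(words) == 1 and words[0].endswith("deg"):
        return float(words[0][:-3]) % 360
    behind = "behind" in words
    positions = [w for w in words if w != "behind"]
    position = positions[0] if positions else "center"  # bare 'behind' is 'center behind'
    front = AZIMUTH_FRONT[position]
    return (180 - front) % 360 if behind else front
```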
If 'azimuth' is specified and the output device cannot
produce sounds behind the listening position, user agents
should convert values in the rearwards hemisphere to forwards
hemisphere values. One method is as follows:
- if 90deg < x <= 180deg then x := 180deg - x
- if 180deg < x <= 270deg then x := 540deg - x
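The two conversion rules can be written as a single function. The name is hypothetical, and normalizing the input to the range 0 to 360 degrees before applying the rules is an added assumption so that negative angles are handled.

```python
def fold_to_front(x):
    """Mirror a rear-hemisphere azimuth (in degrees) into the front hemisphere."""
    x = x % 360
    if 90 < x <= 180:
        return 180 - x  # rear right maps to front right
    if 180 < x <= 270:
        return 540 - x  # rear left maps to front left
    return x  # already in the front hemisphere
```

For example, 120 degrees (behind far-right) folds to 60 degrees (far-right), and 200 degrees folds to 340 degrees (center-left).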
'elevation'
Value: <angle> | below | level | above | higher | lower | inherit
Initial: level
Applies to: all elements
Inherited: yes
Percentages: N/A
Media: aural
Values of this property have the following meanings:
- <angle>
- Specifies the elevation as an angle, between '-90deg' and '90deg'.
'0deg' means on the forward horizon, which loosely means level with
the listener. '90deg' means directly overhead and '-90deg' means directly
below.
- below
- Same as '-90deg'.
- level
- Same as '0deg'.
- above
- Same as '90deg'.
- higher
- Adds 10 degrees to the current elevation.
- lower
- Subtracts 10 degrees from the current elevation.
The precise means used to achieve this effect and the
number of speakers used to do so are undefined. This property merely
identifies the desired end result.
Example(s):
H1 { elevation: above }
TR.a { elevation: 60deg }
TR.b { elevation: 30deg }
TR.c { elevation: level }
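A sketch of how the elevation keywords might be resolved follows. The function name is hypothetical, and clamping 'higher' and 'lower' to the range -90 to 90 degrees is an assumption; the definition above only says that 10 degrees are added or subtracted.

```python
ELEVATION_KEYWORDS = {"below": -90, "level": 0, "above": 90}


def resolve_elevation(value, inherited=0):
    """Resolve an 'elevation' value to degrees."""
    if value in ELEVATION_KEYWORDS:
        return ELEVATION_KEYWORDS[value]
    if value == "higher":
        return min(90, inherited + 10)  # clamping is an assumption
    if value == "lower":
        return max(-90, inherited - 10)  # clamping is an assumption
    if value.endswith("deg"):
        return float(value[:-3])
    raise ValueError(f"unsupported elevation value: {value!r}")
```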
'speech-rate'
Value: <number> | x-slow | slow | medium | fast | x-fast | faster | slower | inherit
Initial: medium
Applies to: all elements
Inherited: yes
Percentages: N/A
Media: aural
This property specifies the speaking rate. Note that both absolute
and relative keyword values are allowed (compare with 'font-size'). Values have
the following meanings:
- <number>
- Specifies the speaking rate in words per minute, a quantity that varies
somewhat by language but is nevertheless widely supported by speech
synthesizers.
- x-slow
- Same as 80 words per minute.
- slow
- Same as 120 words per minute.
- medium
- Same as 180 - 200 words per minute.
- fast
- Same as 300 words per minute.
- x-fast
- Same as 500 words per minute.
- faster
- Adds 40 words per minute to the current speech rate.
- slower
- Subtracts 40 words per minute from the current speech rate.
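The absolute and relative keywords can be resolved as in the sketch below. The function name is hypothetical, and 'medium' is pinned to 180 words per minute purely for illustration, since the definition gives a range of 180 to 200.

```python
SPEECH_RATE_WPM = {
    "x-slow": 80, "slow": 120,
    "medium": 180,  # assumption: low end of the 180 to 200 range
    "fast": 300, "x-fast": 500,
}


def resolve_speech_rate(value, inherited=180):
    """Resolve a 'speech-rate' value to words per minute."""
    if value == "faster":
        return inherited + 40
    if value == "slower":
        return inherited - 40
    if value in SPEECH_RATE_WPM:
        return SPEECH_RATE_WPM[value]
    return float(value)  # a plain <number> of words per minute
```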
The value of the 'voice-family' property is a comma-separated, prioritized list of voice family
names (compare with 'font-family'). Values have the
following meanings:
- <generic-voice>
- Values are voice families. Possible values
are 'male', 'female', and 'child'.
- <specific-voice>
- Values are specific instances (e.g., comedian, trinoids, carlos, lani).
Example(s):
H1 { voice-family: announcer, male }
P.part.romeo { voice-family: romeo, male }
P.part.juliet { voice-family: juliet, female }
Names of specific voices may be quoted, and indeed must be quoted
if any of the words that make up the name does not conform to the
syntax rules for identifiers. It is also
recommended to quote specific voices with a name consisting of more
than one word. If quoting is omitted, any whitespace characters before and
after the voice name are ignored and any sequence of whitespace
characters inside the voice name is converted to a single space.
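The prioritized lookup and the whitespace rule for unquoted names can be sketched as follows. The function names are hypothetical, the set of installed voices is invented for the example, and splitting on every comma (including one inside a quoted name) is a simplification.

```python
def parse_voice_family(value):
    """Split a 'voice-family' value into an ordered list of voice names."""
    names = []
    for raw in value.split(","):
        name = raw.strip()
        if len(name) >= 2 and name[0] == name[-1] and name[0] in "\"'":
            names.append(name[1:-1])  # quoted: keep the contents as written
        else:
            names.append(" ".join(name.split()))  # unquoted: collapse whitespace
    return names


def pick_voice(value, installed):
    """Return the first listed voice that is available, or None."""
    for name in parse_voice_family(value):
        if name in installed:
            return name
    return None
```

With only the generic voices 'male' and 'female' available, the rule for P.part.romeo above would fall back from 'romeo' to 'male'.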
'pitch'
Value: <frequency> | x-low | low | medium | high | x-high | inherit
Initial: medium
Applies to: all elements
Inherited: yes
Percentages: N/A
Media: aural
Specifies the average pitch (a frequency) of the speaking voice. The
average pitch of a voice depends on the voice family. For example,
the average pitch for a standard male voice is around 120Hz,
but for a female voice, it's around 210Hz.
Values have the following meanings:
- <frequency>
- Specifies the average pitch of the speaking voice in hertz (Hz).
- x-low, low,
medium, high, x-high
- These values do not map to absolute frequencies since
these values depend on the voice family. User agents should map
these values to appropriate frequencies based on the voice family
and user environment. However, user agents must map these values in
order (i.e., 'x-low' is a lower frequency than 'low', etc.).
The 'pitch-range' property specifies variation in average pitch. The perceived pitch of a
human voice is determined by the fundamental frequency and typically
has a value of 120Hz for a male voice and 210Hz for a female voice.
Human languages are spoken with varying inflection and pitch; these
variations convey additional meaning and emphasis. Thus, a highly
animated voice, i.e., one that is heavily inflected, displays a high
pitch range. This property specifies the range over which these
variations occur, i.e., how much the fundamental frequency may deviate
from the average pitch.
Values have the following meanings:
- <number>
- A value between '0' and '100'. A pitch range of '0' produces
a flat, monotonic voice. A pitch range of 50 produces normal
inflection. Pitch ranges greater than 50 produce animated voices.
The 'stress' property specifies the height of "local peaks" in the intonation contour
of a voice. For example, English is a stressed
language, and different parts of a sentence are assigned primary,
secondary, or tertiary stress. The value of 'stress' controls the amount of
inflection that results from these stress markers. This property is a
companion to the 'pitch-range' property and is
provided to allow developers to exploit higher-end auditory displays.
Values have the following meanings:
- <number>
- A value between '0' and '100'. The meaning of values
depends on the language being spoken. For example,
a level of '50' for a
standard, English-speaking male voice (average pitch = 122Hz), speaking
with normal intonation and emphasis would have a different
meaning than '50' for an Italian voice.
The 'richness' property specifies the richness, or brightness, of the speaking voice. A
rich voice will "carry" in a large room; a smooth voice will not.
(The term "smooth" refers to how the wave form looks when drawn.)
Values have the following meanings:
- <number>
- A value between '0' and '100'.
The higher the value, the more the voice will carry.
A lower value will produce a soft, mellifluous voice.
An additional speech property, 'speak-header', is
described in the chapter on tables.
The 'speak-punctuation' property specifies how punctuation is spoken. Values have the
following meanings:
- code
- Punctuation such as semicolons,
braces, and so on are to be spoken literally.
- none
- Punctuation is not to be spoken, but instead rendered
naturally as various pauses.
'speak-numeral'
Value: digits | continuous | inherit
Initial: continuous
Applies to: all elements
Inherited: yes
Percentages: N/A
Media: aural
This property controls how numerals are spoken. Values have the
following meanings:
- digits
- Speak the numeral as individual digits. Thus, "237" is spoken
"Two Three Seven".
- continuous
- Speak the numeral as a full number. Thus, "237" is spoken
"Two hundred thirty seven". Word representations are language-dependent.
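The two 'speak-numeral' values can be illustrated with a small English-only sketch. The function name is hypothetical, only numerals from 0 to 999 are handled in 'continuous' mode, and the wording chosen (no "and", no hyphen) follows the "237" example above; other languages would need their own word tables.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def speak_numeral(numeral, mode="continuous"):
    """Return the words spoken for a string of decimal digits."""
    if mode == "digits":
        return " ".join(ONES[int(d)] for d in numeral)
    n = int(numeral)
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0 to 999")
    words = []
    if n >= 100:
        words += [ONES[n // 100], "hundred"]
        n %= 100
        if n == 0:
            return " ".join(words)
    if n < 20:
        words.append(ONES[n])
    else:
        words.append(TENS[n // 10])
        if n % 10:
            words.append(ONES[n % 10])
    return " ".join(words)
```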