Updating large value data types
The inclusion of speaker demographics brings in many more independent variables, that may help to account for variation in the data, and which facilitate later uses of the corpus for purposes that were not envisaged when the corpus was created, such as sociolinguistics.
A third property is that there is a sharp division between the original linguistic event captured as an audio recording, and the annotations of that event.
Two sentences, read by all speakers, were designed to bring out dialect variation: The remaining sentences were chosen to be phonetically rich, involving all phones (sounds) and a comprehensive range of diphones (phone bigrams).
Additionally, the design strikes a balance between multiple speakers saying the same sentence in order to permit comparison across speakers, and having a large range of sentences covered by the corpus to get maximal coverage of diphones.
At the top level there is a split between training and testing sets, which gives away its intended use for developing and evaluating statistical models.
Moreover, notice that all of the data types included in the TIMIT corpus fall into the two basic categories of lexicon and text, which we will discuss below.
Even the speaker demographics data is just another instance of the lexicon data type.
Finally, notice that even though TIMIT is a speech corpus, its transcriptions and associated data are just text, and can be processed using programs just like any other text corpus.
Therefore, many of the computational methods described in this book are applicable.
The same holds true of text corpora, in the sense that the original text usually has an external source, and is considered to be an immutable artifact.