BEFORE YOU TRAIN
- A set of feature files computed from the audio training data, one for each recording in the training corpus. Each recording is transformed into a sequence of feature vectors using one of the front-end executables provided with the SPHINX-III training package. Each front-end executable performs a different analysis of the speech signal and computes a different type of feature.
- A control file containing the list of feature-set filenames with
full paths to them. An example of the entries in this file:

    dir/subdir1/utt1
    dir/subdir1/utt2
    dir/subdir2/utt3

Note that the extensions are not given; they are provided separately to the trainer. It is a good idea to give unique names to all feature files, even though including the full paths already makes each entry in the control file unique. You will find later that unique names give you a lot of flexibility in the later stages of training.
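Duplicate basenames in the control file can be caught early with a short script. This is an illustrative sketch, not part of the SPHINX package; the input is a list of control-file lines like the ones above:

```python
import os
from collections import Counter

def duplicate_basenames(ctl_lines):
    """Return basenames that occur more than once in a control file.

    Each control-file line is a path without extension, e.g.
    'dir/subdir1/utt1'.  Full paths make entries unique, but
    duplicate basenames reduce flexibility later, so we flag them.
    """
    counts = Counter(os.path.basename(line.strip())
                     for line in ctl_lines if line.strip())
    return sorted(name for name, n in counts.items() if n > 1)

# Example: 'utt1' appears under two different directories.
lines = ["dir/subdir1/utt1", "dir/subdir2/utt1", "dir/subdir2/utt3"]
print(duplicate_basenames(lines))  # ['utt1']
```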
- A transcript file in which the transcripts corresponding to the feature files are listed in exactly the same order as the feature filenames in the control file.
- A main dictionary which maps all acoustic events and words in
the transcripts onto the acoustic units you want to train.
Redundancy in the form of extra words is permitted. The dictionary
must have all alternate pronunciations marked with parenthesized serial
numbers, starting from (2) for the second pronunciation; the marker
(1) is omitted. Here's an example:

    DIRECTING    D AY R EH K T I ng
    DIRECTING(2) D ER EH K T I ng
    DIRECTING(3) D I R EH K T I ng
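When matching transcript words against such a dictionary, the parenthesized serial numbers must be stripped so that DIRECTING(2) is recognized as a pronunciation of DIRECTING. A hypothetical helper (not part of the trainer) might look like this:

```python
import re

# Matches a trailing alternate-pronunciation marker such as "(2)".
_ALT_MARKER = re.compile(r"\(\d+\)$")

def base_word(dict_entry):
    """Strip the parenthesized serial number from a dictionary word."""
    return _ALT_MARKER.sub("", dict_entry)

print(base_word("DIRECTING(2)"))  # DIRECTING
print(base_word("DIRECTING"))     # DIRECTING
```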
- A filler dictionary, which usually lists the
non-speech events as "words" and maps them to user-defined phones.
This dictionary must at least have the entries

    <s>   SIL
    <sil> SIL
    </s>  SIL

The entries stand for

    <s>   : beginning-utterance silence
    <sil> : within-utterance silence
    </s>  : end-utterance silence

Note that the words <s>, </s> and <sil> are treated as special words and are required to be present in the filler dictionary. At least one of these must be mapped onto a phone called "SIL". The phone SIL is treated in a special manner and is required to be present. The SPHINX expects you to name the acoustic events corresponding to your general background condition as SIL. For clean speech these events may actually be silences, but for noisy speech they may be the most general kind of background noise that prevails in the database. Other noises can then be modeled by phones defined by the user.

During training, SIL replaces every phone flanked by "+" as the context for adjacent phones. The phones flanked by "+" are only modeled as CI phones and are not used as contexts for triphones. If you do not want this to happen, you may map your fillers to phones that are not flanked by "+".
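These filler-dictionary requirements can be verified mechanically. The sketch below is illustrative, assuming a whitespace-separated format with one filler word per line followed by its phones:

```python
def check_filler_dict(lines):
    """Check a filler dictionary for the required entries.

    Returns (missing, has_sil): the required words that are absent,
    and whether at least one required word maps to the phone SIL.
    """
    mapping = {}
    for line in lines:
        fields = line.split()
        if fields:
            mapping[fields[0]] = fields[1:]
    required = ["<s>", "<sil>", "</s>"]
    missing = [w for w in required if w not in mapping]
    has_sil = any("SIL" in mapping.get(w, []) for w in required)
    return missing, has_sil

missing, has_sil = check_filler_dict(["<s> SIL", "<sil> SIL", "</s> SIL"])
print(missing, has_sil)  # [] True
```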
- A phonelist, which is a list of all acoustic units that you want to
train models for. The SPHINX does not permit you to have units
other than those in your dictionaries, and all units in your
two dictionaries must be listed here. In other words, your phonelist
must contain exactly the same units as your dictionaries, no more and no
less. Each phone must be listed on a separate line in the file, beginning from
the left, with no extra spaces after the phone. An example:

    AA
    AE
    OW
    B
    CH
Before you train, run these sanity checks:
- Are all the transcript words present in the main dictionary or the filler dictionary?
- Make sure that the number of transcript lines matches the number of entries in the .ctl file.
- Check the boundaries defined in the .ctl file to make sure they exist, i.e., you have all the frames that are listed in the control file.
- Verify the phonelist against the main dictionary and the filler dictionary.
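Most of these checks can be sketched together in a small script. The file formats assumed here (one transcript per line, whitespace-separated dictionary entries) are illustrative simplifications; adapt them to your actual setup:

```python
def load_dict_words_and_phones(lines):
    """Parse a SPHINX-style dictionary: WORD PH1 PH2 ... per line."""
    words, phones = set(), set()
    for line in lines:
        fields = line.split()
        if fields:
            words.add(fields[0].split("(")[0])  # drop (2), (3) markers
            phones.update(fields[1:])
    return words, phones

def validate(ctl, transcripts, main_dict, filler_dict, phonelist):
    """Return a list of human-readable problems found in the inputs."""
    problems = []
    # 1. Transcript count must match the control file.
    if len(ctl) != len(transcripts):
        problems.append("ctl/transcript line counts differ")
    # 2. Every transcript word must be in one of the dictionaries.
    main_words, main_phones = load_dict_words_and_phones(main_dict)
    fill_words, fill_phones = load_dict_words_and_phones(filler_dict)
    known = main_words | fill_words
    for t in transcripts:
        for w in t.split():
            if w not in known:
                problems.append("word not in dictionaries: " + w)
    # 3. Phonelist must match the union of dictionary phones exactly.
    used = main_phones | fill_phones
    listed = set(p.strip() for p in phonelist if p.strip())
    if used != listed:
        problems.append("phonelist does not match dictionary phones")
    return problems

main_d = ["HELLO HH AH L OW", "WORLD W ER L D"]
fill_d = ["<s> SIL", "<sil> SIL", "</s> SIL"]
phones = ["HH", "AH", "L", "OW", "W", "ER", "D", "SIL"]
print(validate(["dir/u1"], ["HELLO WORLD"], main_d, fill_d, phones))  # []
```

Checking frame boundaries (the third check) requires opening the feature files themselves, which depends on the front-end's output format, so it is left out of this sketch.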
If you have only about 50-60 words in your vocabulary, and your entire test-data vocabulary is covered by the training data, then you are probably better off training word models rather than phone models. To do this, simply define the phoneset as the set of words themselves, use a dictionary that maps each word to itself, and train. Also use fewer fillers, and if you do need to train phone models, make sure that each of your tied states has enough counts (at least 5 or 10 instances of each).
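The word-model dictionary described above can be generated trivially from the vocabulary. A minimal sketch (the vocabulary shown is just an example):

```python
def word_model_dict(vocab):
    """Map each word to itself, so whole words become the acoustic units."""
    return ["%s %s" % (w, w) for w in sorted(set(vocab))]

for line in word_model_dict(["YES", "NO", "MAYBE"]):
    print(line)
# MAYBE MAYBE
# NO NO
# YES YES
```

The same sorted word list, one word per line, serves as the phonelist in this setup.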