
Sunday, March 3, 2013

DATA PREPARATION

BEFORE YOU TRAIN

You will need the following files to begin training:
  1. A set of feature files computed from the audio training data, one for each recording in the training corpus. Each recording can be transformed into a sequence of feature vectors using a front-end executable provided with the SPHINX-III training package. Each front-end executable performs a different analysis of the speech signals and computes a different type of feature.
  2. A control file containing the list of feature-set filenames with full paths to them. An example of the entries in this file:
    dir/subdir1/utt1
    dir/subdir1/utt2
    dir/subdir2/utt3
    
    Note that the extensions are not given; they are provided separately to the trainer. It is a good idea to give each feature file a unique name, even though including the full paths already makes every entry in the control file unique. You will find later that unique names give you much more flexibility.
  3. A transcript file in which the transcripts corresponding to the feature files are listed in exactly the same order as the feature filenames in the control file.
  4. A main dictionary which has all acoustic events and words in the transcripts mapped onto the acoustic units you want to train. Redundancy in the form of extra words is permitted. The dictionary must have all alternate pronunciations marked with parenthesized serial numbers starting from (2) for the second pronunciation. The marker (1) is omitted. Here's an example:
                 
    DIRECTING            D AY R EH K T I ng
    DIRECTING(2)         D ER EH K T I ng
    DIRECTING(3)         D I R EH K T I ng
    
  5. A filler dictionary, which usually lists the non-speech events as "words" and maps them to user-defined phones. This dictionary must at least have the entries
    <s>     SIL
    <sil>   SIL
    </s>    SIL  
    
    The entries stand for
    <s>     : beginning-utterance silence
    <sil>   : within-utterance silence
    </s>    : end-utterance silence
    
    Note that the words <s>, </s> and <sil> are treated as special words and are required to be present in the filler dictionary. At least one of these must be mapped onto a phone called "SIL". The phone SIL is treated in a special manner and is required to be present. The SPHINX expects you to name the acoustic events corresponding to your general background condition as SIL. For clean speech these events may actually be silences, but for noisy speech they may be the most general kind of background noise that prevails in the database. Other noises can then be modeled by phones defined by the user. During training, SIL replaces every phone flanked by "+" as the context for adjacent phones. The phones flanked by "+" are only modeled as CI phones and are not used as contexts for triphones. If you do not want this to happen, you may map your fillers to phones that are not flanked by "+".
  6. A phonelist, which is a list of all acoustic units that you want to train models for. The SPHINX does not permit you to have units other than those in your dictionaries. All units in your two dictionaries must be listed here. In other words, your phonelist must have exactly the same units used in your dictionaries, no more and no less. Each phone must be listed on a separate line in the file, beginning from the left, with no extra spaces after the phone. An example:
    AA
    AE
    OW
    B
    CH
    
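The alternate-pronunciation convention in the main dictionary (item 4) is easy to handle programmatically when you need to inspect or validate your data. A minimal Python sketch; the function name is my own and the file layout is as in the example above:

```python
import re

def load_dictionary(path):
    """Parse a SPHINX-style dictionary, grouping alternate pronunciations
    under their base word by stripping the parenthesized serial number,
    e.g. DIRECTING(2) -> DIRECTING."""
    pron = {}  # base word -> list of pronunciations (each a list of phones)
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            word, phones = fields[0], fields[1:]
            base = re.sub(r'\(\d+\)$', '', word)
            pron.setdefault(base, []).append(phones)
    return pron
```

With the DIRECTING example above, `load_dictionary` returns a single entry "DIRECTING" holding all three pronunciations.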
Here's a quick checklist to verify your data preparation before you train:
  1. Are all the transcript words present in the main dictionary or the filler dictionary?
  2. Does the number of transcripts match the number of entries in the .ctl file?
  3. Do the boundaries defined in the .ctl file actually exist, i.e., do the feature files contain all the frames listed in the control file?
  4. Does the phonelist match the units used in the dictionary and the filler dictionary?
When you have a very small closed vocabulary (50-60 words)
If you have only about 50-60 words in your vocabulary, and your entire test-data vocabulary is covered by the training data, then you are probably better off training word models rather than phone models. To do this, simply define the phoneset as the set of words themselves, use a dictionary that maps each word to itself, and train. Also use fewer fillers, and if you do need to train phone models, make sure that each of your tied states has enough counts (at least 5 or 10 instances each).
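The word-model setup described above amounts to a dictionary that maps every word to itself plus a phonelist consisting of the words and SIL. A small sketch; the function name and output file names are placeholders:

```python
def make_word_model_files(words, dict_path, phonelist_path):
    """Write a dictionary mapping each word to itself, and a matching
    phonelist (the words plus SIL), for word-level model training."""
    words = sorted(set(words))
    with open(dict_path, 'w') as f:
        for w in words:
            f.write(f"{w}\t{w}\n")  # each word is its own acoustic unit
    with open(phonelist_path, 'w') as f:
        for unit in words + ["SIL"]:
            f.write(unit + "\n")
```

The filler dictionary and transcripts are prepared exactly as before; only the dictionary and phonelist change.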
