
Sunday, March 3, 2013

CREATING THE CD UNTIED MODEL DEFINITION FILE

TRAINING CONTINUOUS MODELS

The next step is the CD-untied training, in which HMMs are trained for all context-dependent phones (usually triphones) that are seen in the training corpus. For the CD-untied training, we first need to generate a model definition file for all the triphones occurring in the training set. This is done in several steps:
    First, a list of all triphones possible in the vocabulary is generated from the dictionary. To get this complete list of triphones from the dictionary, it is first necessary to write the list of phones in the following format:
    phone1 0 0 0 0
    phone2 0 0 0 0
    phone3 0 0 0 0
    phone4 0 0 0 0
    ...
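
As a minimal sketch of the formatting step above, the following Python snippet writes each phone followed by four zeros, preserving the input order. The function name and the sample phone list are hypothetical and are not part of the Sphinx tools.

```python
def format_phonelist(phones):
    """Return one 'PHONE 0 0 0 0' line per phone, keeping the input order."""
    return [f"{phone} 0 0 0 0" for phone in phones]


# Hypothetical CI phone list; in practice, read it from the phonelist
# used for CI training so that the order is identical.
ci_phones = ["SIL", "AE", "AX", "B", "T"]
formatted = format_phonelist(ci_phones)
print("\n".join(formatted))
```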
    
    The phonelist used for the CI training must be used to generate this, and the order in which the phones are listed must be the same. Next, a temporary dictionary is generated, which has all words except the filler words (words enclosed in ++()++ ). The entry
    SIL    SIL
    
    must be added to this temporary dictionary, and the dictionary must be sorted in alphabetical order. The program "quick_count" provided with the SPHINX-III package can now be used to generate the list of all possible triphones from the temporary dictionary. It takes the following arguments:
    FLAG DESCRIPTION
    -q mandatory flag to tell quick_count to consider all word pairs while constructing triphone list
    -p formatted phonelist
    -b temporary dictionary
    -o output triphone list
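
The temporary dictionary that quick_count reads through -b can be prepared along the following lines. This is only a sketch: it assumes filler words are written as ++NAME++, that each dictionary line starts with the word, and that plain code-point sorting is an acceptable alphabetical order. The function name is hypothetical.

```python
def make_temp_dictionary(dict_lines):
    """Drop filler words, add the SIL entry, and sort the result."""
    entries = set()
    for line in dict_lines:
        line = line.strip()
        if not line:
            continue
        word = line.split()[0]
        # Assumption: filler words are enclosed in "++", e.g. ++LAUGH++.
        if word.startswith("++") and word.endswith("++"):
            continue
        entries.add(line)
    # The SIL entry must be present in the temporary dictionary.
    entries.add("SIL    SIL")
    return sorted(entries)


sample = ["BAT    B AE T", "++LAUGH++  +LAUGH+", "A      AX", "TAB    T AE B"]
temp_dict = make_temp_dictionary(sample)
```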
    Here is a typical output from quick_count
    AA(AA,AA)s              1
    AA(AA,AE)b              1
    AA(AA,AO)1              1
    AA(AA,AW)e              1
    
    The "1" in AA(AA,AO)1 indicates that this is a word-internal triphone. This is a carryover from Sphinx-II. The output from quick_count must now be rewritten into the following format:
    AA AA AA s
    AA AA AE b
    AA AA AO i
    AA AA AW e
    
    This can be done by simply replacing "(", ",", and ")" in the output of quick_count with a space and printing only the first four columns. While doing so, all instances of " 1" must be replaced by " i". The list of CI phones must then be added at the top of the resulting file in the following format:
    AA - - -
    AE - - -
    AO - - -
    AW - - -
    ..
    ..                                                         
    AA AA AA s
    AA AA AE b
    AA AA AO i
    AA AA AW e
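
The same conversion can be sketched in Python. The snippet below assumes each quick_count line has the form BASE(LEFT,RIGHT)POS followed by a count, maps a position of "1" to "i", and treats a missing position marker as "i" as well. The function name is hypothetical.

```python
import re

# BASE(LEFT,RIGHT)POS, where POS may be empty.
TRIPHONE = re.compile(r"([^(]+)\(([^,]+),([^)]+)\)(\S*)")


def convert_quick_count(lines, ci_phones):
    """Return the CI phone lines followed by the reformatted triphones."""
    out = [f"{phone} - - -" for phone in ci_phones]
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        match = TRIPHONE.fullmatch(fields[0])
        if match is None:
            continue  # skip anything that is not a triphone entry
        base, left, right, pos = match.groups()
        if pos in ("1", ""):
            pos = "i"  # "1" is the Sphinx-II marker for word-internal
        out.append(f"{base} {left} {right} {pos}")
    return out


sample = ["AA(AA,AA)s              1", "AA(AA,AO)1              1"]
result = convert_quick_count(sample, ["AA", "AE"])
```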
    

    For example, if the output of quick_count is stored in a file named "quick_count.out", the following perl command will generate the phone list in the desired form:
    perl -nae '$F[0] =~ s/\(|\)|\,/ /g; $F[0] =~ s/1/i/g; print $F[0]; if ($F[0] =~ /\s+$/){print "i"}; print "\n"' quick_count.out

    The above list of triphones (and phones) is converted to the model definition file that lists all possible triphones from the dictionary. The program used for this is "mk_model_def", with the following arguments:
    FLAG DESCRIPTION
    -moddeffn model definition file with all possible triphones (alltriphones_mdef) to be written
    -phonelstfn list of all triphones
    -n_state_pm number of states per HMM
    In the next step we find the number of times each of the triphones listed in alltriphones_mdef occurred in the training corpus. To do this we call the program "param_cnt", which takes the following arguments:
    FLAG DESCRIPTION
    -moddeffn model definition file with all possible triphones(alltriphones_mdef)
    -ts2cbfn takes the value ".cont." if you are building continuous models
    -ctlfn control file corresponding to your training transcripts
    -lsnfn transcript file for training
    -dictfn training dictionary
    -fdictfn filler dictionary
    -paramtype write "phone" here, without the double quotes
    -segdir /dev/null
    param_cnt writes out the counts for each of the triphones onto stdout. All other messages are sent to stderr. The stdout therefore has to be directed into a file. If you are using csh or tcsh it would be done in the following manner:
    (param_cnt [arguments] > triphone_count_file) >&! LOG
    
    Here's an example of the output of this program
    +GARBAGE+ - - - 98
    +LAUGH+ - - - 29
    SIL - - - 31694
    AA - - - 0
    AE - - - 0
    ...
    AA AA AA s 1
    AA AA AE s 0
    AA AA AO s 4
    
    The final number in each row shows the number of times that particular triphone (or filler phone) has occurred in the training corpus. Note that if all possible triphones of a CI phone are listed in alltriphones_mdef, the CI phone itself will have 0 counts, since all instances of it will have been mapped to a triphone. This list of counted triphones is used to shortlist the triphones that have occurred at least a minimum number (threshold) of times. The shortlisted triphone list has the same format as the triphone list used to generate alltriphones_mdef, and the formatted list of CI phones has to be included at the top as before. So, in the earlier example, if a threshold of 4 were used, we would obtain the shortlisted triphone list as
    AA - - -
    AE - - -
    AO - - -
    AW - - -
    ..
    ..                                 
    AA AA AO s
    ..
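
A minimal sketch of the shortlisting step, assuming each param_cnt output line has five whitespace-separated fields (base, left, right, position, count) and that CI and filler phones are the rows whose left context is "-". The function name is hypothetical.

```python
def shortlist(count_lines, threshold):
    """Keep all CI phones, plus triphones seen at least `threshold` times."""
    ci_entries, tri_entries = [], []
    for line in count_lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # ignore blank or malformed lines
        base, left, right, pos, count = fields
        entry = " ".join(fields[:4])
        if left == "-":
            ci_entries.append(entry)       # CI and filler phones always stay
        elif int(count) >= threshold:
            tri_entries.append(entry)      # triphone seen often enough
    return ci_entries + tri_entries


counts = [
    "SIL - - - 31694",
    "AA - - - 0",
    "AA AA AA s 1",
    "AA AA AE s 0",
    "AA AA AO s 4",
]
kept = shortlist(counts, threshold=4)
```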
    
    The threshold is adjusted such that the total number of triphones above the threshold is less than the maximum number of triphones that the system can train (or that you wish to train). It is good to train as many triphones as possible. The maximum number of triphones may, however, depend on the memory available on your machine. The logistics related to this are described in the beginning of this manual. Note that thresholding is usually done to reduce the number of triphones, so that the resulting models will be small enough to fit in the computer's memory. If this is not a problem, then the threshold can be set to a smaller number. If a triphone occurs too few times, however (i.e., if the threshold is too small), there will not be enough data to train the HMM state distributions properly. This would lead to poorly estimated CD untied models, which in turn may affect the decision trees which are to be built using these models in the next major step of the training.
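
One possible way to pick the threshold is to raise it until the number of surviving triphones fits the budget. The sketch below assumes the triphone counts are already available as integers and treats the maximum as inclusive. The function name and the `max_triphones` parameter are hypothetical.

```python
def pick_threshold(counts, max_triphones):
    """Smallest threshold (>= 1) keeping at most `max_triphones` triphones."""
    threshold = 1  # a threshold of 1 already drops unseen triphones
    while sum(1 for c in counts if c >= threshold) > max_triphones:
        threshold += 1
    return threshold


triphone_counts = [1, 0, 4, 7, 2]
chosen = pick_threshold(triphone_counts, max_triphones=2)
```

This search only addresses the memory budget. It does not check whether the chosen threshold leaves enough data per triphone, which still has to be judged separately.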
    A model definition file is now created to include only these shortlisted triphones. This is the final model definition file to be used for the CD untied training. The reduced triphone list is converted to the model definition file using mk_model_def with the following arguments:
    FLAG DESCRIPTION
    -moddeffn model definition file for CD untied training
    -phonelstfn list of shortlisted triphones
    -n_state_pm number of states per HMM
Finally, therefore, a model definition file which lists all CI phones and seen triphones is constructed. This file, like the CI model-definition file, assigns unique ids to each HMM state and serves as a reference file for handling and identifying the CD-untied model parameters. Here is an example of the CD-untied model-definition file. If you have listed five phones in your phones.list file,
SIL AE AX B T
and specify that you want to build three state HMMs for each of these phones, and if you have one utterance listed in your transcript file:
<s> BAT A TAB </s>
for which your dictionary and fillerdict entries are:
Fillerdict:
<s>   SIL
</s>  SIL
Dictionary:
A      AX 
BAT    B AE T
TAB    T AE B
then your CD-untied model-definition file will look like this:
# Generated by /mk_model_def on Thu Aug 10 14:57:15 2000
0.3
5 n_base
7 n_tri
48 n_state_map
36 n_tied_state
15 n_tied_ci_state
5 n_tied_tmat                                                                  
#
# Columns definitions
#base lft  rt p attrib   tmat  ...state id's ...
SIL     -   -  - filler    0    0       1      2     N
AE      -   -  -    n/a    1    3       4      5     N
AX      -   -  -    n/a    2    6       7      8     N
B       -   -  -    n/a    3    9       10     11    N
T       -   -  -    n/a    4    12      13     14    N
AE      B   T  i    n/a    1    15      16     17    N
AE      T   B  i    n/a    1    18      19     20    N
AX      T   T  s    n/a    2    21      22     23    N
B       SIL AE b    n/a    3    24      25     26    N
B       AE  SIL e   n/a    3    27      28     29    N
T       AE  AX e    n/a    4    30      31     32    N
T       AX  AE b    n/a    4    33      34     35    N

The # lines are simply comments. The rest of the variables mean the following:

  n_base      : no. of CI phones (also called "base" phones), 5 here
  n_tri       : no. of triphones , 7 in this case
  n_state_map : Total no. of HMM states (emitting and non-emitting).
                Sphinx appends an extra terminal non-emitting state
                to every HMM, hence for 5+7 phones, each specified by
                the user to be modeled by a 3-state HMM, this number
                will be 12 phones * 4 states = 48
  n_tied_state: no. of states of all phones after state-sharing is done.
                We do not share states at this stage. Hence this number is the
                same as the total number of emitting states, 12*3=36
  n_tied_ci_state:
                no. of states for your CI phones after state-sharing
                is done. The CI states are not shared, now or later.
                This number is thus again the total number of emitting CI
                states, 5*3=15
  n_tied_tmat : The total number of transition matrices is always the same
                as the total number of CI phones being modeled. All triphones
                for a given phone share the same transition matrix. This
                number is thus 5.
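
The arithmetic behind these header values can be sketched as follows. The relations hold only for the untied case described here, where no states are shared. The function name is hypothetical.

```python
def mdef_header_counts(n_base, n_tri, n_state_pm):
    """Header values of a CD-untied model definition file."""
    n_phones = n_base + n_tri
    return {
        "n_base": n_base,
        "n_tri": n_tri,
        # each HMM gets one extra non-emitting terminal state
        "n_state_map": n_phones * (n_state_pm + 1),
        # untied: every emitting state is its own tied state
        "n_tied_state": n_phones * n_state_pm,
        "n_tied_ci_state": n_base * n_state_pm,
        # one transition matrix per CI phone
        "n_tied_tmat": n_base,
    }


header = mdef_header_counts(n_base=5, n_tri=7, n_state_pm=3)
```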

Columns definitions: The following columns are defined:
       base  : name of each phone
       lft   : left-context of the phone (- if none)
       rt    : right-context of the phone (- if none)
       p     : position of a triphone. Four position markers are supported:
               b = word beginning triphone
               e = word ending triphone
               i = word internal triphone
               s = single word triphone
       attrib: attribute of phone. In the phone list, if the phone is "SIL",
               or if the phone is enclosed by "+", as in "+BANG+", these
               phones are interpreted as non-speech events. These are
               also called "filler" phones, and the attribute "filler" is
               assigned to each such phone. The base phones and the
               triphones have no special attributes, and hence are
               labelled as "n/a", standing for "no attribute"
       tmat  : the id of the transition matrix associated with the phone
  state id's : the ids of the HMM states associated with any phone. This list
               is terminated by an "N" which stands for a non-emitting
               state. No id is assigned to it. However, it exists, and is
               listed.
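
The numbering pattern in the example file can be reproduced with a short sketch: state ids are handed out consecutively, CI phones first and then triphones, and each triphone reuses the transition matrix id of its base phone. This mirrors the example above and is not taken from the mk_model_def source. The function name is hypothetical.

```python
def assign_states(ci_phones, triphones, n_state_pm):
    """Build mdef-style rows: (base, lft, rt, p, attrib, tmat, state ids)."""
    tmat_of = {phone: i for i, phone in enumerate(ci_phones)}
    entries = [(p, "-", "-", "-") for p in ci_phones] + list(triphones)
    rows, next_id = [], 0
    for base, left, right, pos in entries:
        ids = list(range(next_id, next_id + n_state_pm))
        next_id += n_state_pm
        is_filler = base == "SIL" or (
            len(base) > 2 and base.startswith("+") and base.endswith("+")
        )
        attrib = "filler" if is_filler else "n/a"
        # "N" marks the non-emitting terminal state, which gets no id
        rows.append((base, left, right, pos, attrib, tmat_of[base], ids + ["N"]))
    return rows


ci = ["SIL", "AE", "AX", "B", "T"]
tri = [
    ("AE", "B", "T", "i"), ("AE", "T", "B", "i"), ("AX", "T", "T", "s"),
    ("B", "SIL", "AE", "b"), ("B", "AE", "SIL", "e"),
    ("T", "AE", "AX", "e"), ("T", "AX", "AE", "b"),
]
rows = assign_states(ci, tri, n_state_pm=3)
```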
