domingo, 3 de março de 2013



Modeling Context-dependent phones (ex. triphones) with untied states requires the largest amount of hardware resources. Take a moment to check if you have enough. The resources required depend on the type of model you are going to train, the dimensionality and configuration of your feature vectors, and the number of states in the HMMs.
Semi-continuous models
To train 5-state/HMM models for 10,000 triphones:
5 states/triphone                    = 50,000 states
For a 4-stream feature-set, each     = 1024 floating point numbers/state
state has a total of 4*256 mixture   
                                     = 205Mb buffer for 50,000 states
Corresponding to each of the four feature streams, there are 256 means and 256 variances in the codebook. ALL these, and ALL the mixture weights and transition matrices are loaded in into the RAM, and during training an additional buffer of equal size is allocated to store intermediate results. These are later written out into the hard disk when the calculations for the current training iteration are complete. Note that there are as many transition matrices as you have phones (40-50 for the English language, depending on your dictionary) All this amounts to allocating well over 400 Mb of RAM. This is a bottleneck for machines with smaller memory. No matter how large your training corpus is, you can actually train only about 10,000 triphones at the cd-untied stage if you have ~400 Mb of RAM (A 100 hour broadcast news corpus typically has 40,000 triphones). You could train more if your machine is capable of handling the memory demands effectively (this could be done, for example, by having a large amount of swap space). If you are training on multiple machines, *each* will require this much memory. In addition, at the end of each iteration, you have to transmit all buffers to a single machine that performs the norm. Networking issues need to be considered here.
The cd-untied models are used to build trees. The number of triphones you train at this stage directly affects the quality of the trees, which would have to be built using fewer triphones than are actually present in the training set if you do not have enough memory.
Continuous models
For 10,000 triphones:
5 states/triphone         = 50,000 states
39 means (assuming a
39-component feature
vector) and 39
variances per state       = 79 floating points per state
                          = 15.8Mb buffer for 50,000 states
Thus we can train 12 times as many triphones as we can when we have semicontinuous models for the same amount of memory. Since we can use more triphones to train (and hence more information) the decision trees are better, and eventually result in better recognition performance. back to index

Nenhum comentário:

Postar um comentário