BEFORE YOU TRAIN
MODELING CONTEXT-DEPENDENT PHONES WITH UNTIED STATES: SOME MEMORY REQUIREMENTS
To train 5-state/HMM models for 10,000 triphones:
5 states/triphone = 50,000 states
For a 4-stream feature set, each state has a total of 4*256 mixture weights = 1024 floating point numbers/state
= 205Mb buffer for 50,000 states

Corresponding to each of the four feature streams, there are 256 means and 256 variances in the codebook. All of these, all the mixture weights, and the transition matrices are loaded into RAM, and during training an additional buffer of equal size is allocated to store intermediate results. These are later written out to the hard disk when the calculations for the current training iteration are complete. Note that there are as many transition matrices as you have phones (40-50 for the English language, depending on your dictionary).

All this amounts to allocating well over 400 Mb of RAM. This is a bottleneck for machines with less memory. No matter how large your training corpus is, you can actually train only about 10,000 triphones at the cd-untied stage if you have ~400 Mb of RAM (a 100-hour broadcast news corpus typically has 40,000 triphones). You could train more if your machine can handle the memory demands effectively, for example by having a large amount of swap space. If you are training on multiple machines, *each* will require this much memory. In addition, at the end of each iteration, you have to transmit all buffers to a single machine that performs the norm, so networking issues need to be considered as well.
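The arithmetic above can be sketched as follows. This is only a back-of-the-envelope estimate using the figures quoted in the text (4-byte floats, 5 states per triphone, 4 streams, 256 mixture weights per stream); the codebook and transition matrices add a comparatively small amount on top.

```python
# Rough RAM estimate for the cd-untied stage with semicontinuous models.
# Assumptions (from the text): 4-byte floats, 5 states/triphone,
# 4 feature streams with 256 mixture weights each.
FLOAT_BYTES = 4
n_triphones = 10_000
states = 5 * n_triphones                         # 50,000 states
mixw_per_state = 4 * 256                         # 1024 floats/state
mixw_buffer = states * mixw_per_state * FLOAT_BYTES
# Training allocates a second, equal-sized buffer for intermediate results:
total = 2 * mixw_buffer
print(f"mixture-weight buffer: {mixw_buffer / 1e6:.0f} MB")  # 205 MB
print(f"plus accumulator:      {total / 1e6:.0f} MB")        # 410 MB
```

This is where the "well over 400 Mb" figure comes from: the mixture-weight buffer alone is ~205 MB, and the accumulator buffer doubles it before codebooks and transition matrices are counted.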
The cd-untied models are used to build the decision trees. The number of triphones you train at this stage directly affects the quality of the trees: if you do not have enough memory, the trees must be built from fewer triphones than are actually present in the training set.
For continuous models, by contrast, training 5-state/HMM models for the same 10,000 triphones requires:
5 states/triphone = 50,000 states
39 means (assuming a 39-component feature vector) and 39 variances per state = 78 floating point numbers/state
= 15.6Mb buffer for 50,000 states

Thus, for the same amount of memory, we can train roughly 13 times as many triphones as we can with semicontinuous models. Since we can use more triphones for training (and hence more information), the decision trees are better, and eventually result in better recognition performance.
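The semicontinuous/continuous comparison can be checked with the same kind of estimate. As before, the numbers are the illustrative figures from this section (4-byte floats, 50,000 untied states), not measurements of an actual trainer run:

```python
# Per-state buffer sizes: semicontinuous vs. continuous models.
# Illustrative figures from the text; 4-byte floats assumed.
FLOAT_BYTES = 4
states = 5 * 10_000                  # 50,000 untied states

semi_floats = 4 * 256                # 1024 mixture weights/state (4 streams x 256)
cont_floats = 39 + 39                # 39 means + 39 variances per state

semi_mb = states * semi_floats * FLOAT_BYTES / 1e6
cont_mb = states * cont_floats * FLOAT_BYTES / 1e6
print(f"semicontinuous buffer: {semi_mb:.1f} MB")   # 204.8 MB
print(f"continuous buffer:     {cont_mb:.1f} MB")   # 15.6 MB
print(f"ratio: {semi_floats / cont_floats:.1f}x")   # ~13.1x
```

The per-state ratio (1024 vs. 78 floats) is what lets the continuous case cover roughly 13 times as many triphones in the same buffer space.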