Conclusion & Suggestions for Improvement
Given the theory behind them, deep belief networks show promise for a range of future applications. Much has been published on the expressive power of DBNs in [5] and [6].
However, the results of my experiments show that DBNs are perhaps more
sensitive to having an appropriate implementation for a given application than other
learning algorithms/classifiers. Even though the same binary training data was presented
to the DBN, the SVM, and the NB classifiers, the SVM and NB classifiers
outperformed the DBN on the document classification task. As mentioned
previously, the most likely reason is that DBNs iteratively learn
“features-of-features” in each level’s RBM. If the network has an implementation
appropriate to the task at hand, this can yield a very accurate
classifier (in [10] Hinton describes how a DBN can be built to outperform other types of
classifiers on the task of digit recognition). However, if the network and the data are
not well matched, this feature-of-feature learning can recursively produce
features that do not appropriately model the training data. I believe this is
what happened in the experiments described in this paper, since the DBN
performed far better than random guessing but was outperformed by the relatively stock SVM and NB classifiers.
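To make the greedy, layer-wise nature of this “features-of-features” learning concrete, the sketch below stacks RBMs so that each one is trained on the hidden activations of the layer beneath it. It is a minimal NumPy illustration with placeholder layer sizes, learning rate, and epoch count, not the implementation used in these experiments.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.05):
    # One RBM trained with CD-1; returns its weights and hidden biases.
    n_visible = data.shape[1]
    W = 0.01 * np.random.randn(n_visible, n_hidden)
    b_v = np.zeros(n_visible)
    b_h = np.zeros(n_hidden)
    for _ in range(epochs):
        # positive phase
        h_prob = sigmoid(data @ W + b_h)
        h_samp = (np.random.rand(*h_prob.shape) < h_prob).astype(float)
        # negative phase: one step of Gibbs sampling
        v_prob = sigmoid(h_samp @ W.T + b_v)
        h_prob_neg = sigmoid(v_prob @ W + b_h)
        # CD-1 parameter updates
        W += lr * (data.T @ h_prob - v_prob.T @ h_prob_neg) / len(data)
        b_v += lr * (data - v_prob).mean(axis=0)
        b_h += lr * (h_prob - h_prob_neg).mean(axis=0)
    return W, b_h

def pretrain_dbn(data, layer_sizes):
    # Greedy layer-wise pre-training: each RBM models the features
    # extracted by the RBM below it ("features of features").
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b_h = train_rbm(x, n_hidden)
        layers.append((W, b_h))
        x = sigmoid(x @ W + b_h)   # input to the next RBM in the stack
    return layers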
5.2 Results and Conclusions
The 2-level CRBM models, trained with mini-batch gradient updates computed
from the Contrastive Divergence rules, clearly do not perform
well on the character sequences dataset, and the mAR models outperform the
2-level CRBM on this dataset. An interesting observation is that
when trained on the simpler character datasets the mAR models generally perform
much better, whereas the CRBM models show slightly better results only in certain
cases and remain incapable of capturing the structure of the input space, even when
trained on samples from single characters.
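For reference, a single CD-1 mini-batch update of the kind used to train such CRBMs could look roughly as follows. The NumPy sketch assumes Gaussian visible units and a history window that enters through dynamic biases on the visible and hidden units; the names, shapes, and learning rate are illustrative and not the exact procedure used in this work.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def crbm_cd1_update(v, hist, W, A, B, b_v, b_h, lr=1e-3):
    # v: (batch, n_vis) current frames; hist: (batch, n_vis * n_lags)
    # concatenated past frames defining the history window.
    bv_dyn = b_v + hist @ A          # dynamic visible biases from history
    bh_dyn = b_h + hist @ B          # dynamic hidden biases from history
    # positive phase
    h_prob = sigmoid(v @ W + bh_dyn)
    h_samp = (np.random.rand(*h_prob.shape) < h_prob).astype(float)
    # negative phase: one Gibbs step with linear (Gaussian) visibles
    v_neg = h_samp @ W.T + bv_dyn
    h_neg = sigmoid(v_neg @ W + bh_dyn)
    # CD-1 gradients, applied in place
    n = len(v)
    W += lr * (v.T @ h_prob - v_neg.T @ h_neg) / n
    A += lr * (hist.T @ (v - v_neg)) / n
    B += lr * (hist.T @ (h_prob - h_neg)) / n
    b_v += lr * (v - v_neg).mean(axis=0)
    b_h += lr * (h_prob - h_neg).mean(axis=0)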
In the mAR models, a longer history typically leads to better generative capability. When
trained on the 20 characters dataset, the mAR models need to consider a history that
roughly corresponds to the length of the observed waveforms in order to perform relatively
well. When trained on smaller datasets, an even shorter history, roughly one fourth of
the observed waveform length, is adequate. The 2-level CRBM models, on the other hand,
produce a contractive system when trained on a very short history and an expanding
system when trained on a very long history. The appropriate history length, the one
that corresponds to the “best” performing model, depends on the difficulty of the learning
task: on the 20 characters dataset the “19-6” averaging scheme corresponds to the best model, whereas
on the “a-d” characters dataset the “19-6” model produces an expanding
system (this is also the case for the experiments with the single character datasets).
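As an illustration of how the history length p enters an mAR model, the following least-squares sketch fits coefficient matrices A_1..A_p to lagged frames and then generates by free-running. It is an assumed, generic formulation for illustration only, not the estimation procedure actually used in this work.

import numpy as np

def fit_mar(X, p):
    # X: (T, d) observed sequence; fits x_t ~ c + sum_k A_k x_{t-k}.
    T, d = X.shape
    rows = [np.concatenate([X[t - k] for k in range(1, p + 1)]) for t in range(p, T)]
    Z = np.column_stack([np.ones(T - p), np.array(rows)])   # lagged design matrix
    Y = X[p:]
    coeffs, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    c = coeffs[0]
    A = [coeffs[1 + k * d:1 + (k + 1) * d].T for k in range(p)]
    return c, A

def generate(c, A, seed, steps):
    # Free-run generation: the model is fed its own predictions.
    hist = list(seed)                  # seed: p most recent frames, newest last
    p = len(A)
    for _ in range(steps):
        x = c + sum(A[k] @ hist[-1 - k] for k in range(p))
        hist.append(x)
    return np.array(hist[p:])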
From our analysis, a possible reason why the mAR models outperform the 2-level
CRBM when trained on the characters dataset is that the CRBM models try
to capture a more complex structure, but at the same time expend the representational
resources of their hidden layers on only a few hidden states, which leads to an inefficient
exploitation of this added complexity. As a result, the much simpler structure captured
by the mAR models leads to generations that more closely resemble handwritten
characters.
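One simple way to probe this explanation would be a diagnostic such as the one below, which counts the distinct binary hidden configurations visited over the training data and reports per-unit activation rates; a handful of distinct states out of the exponentially many available would be consistent with the account above. It is offered as a hypothetical check, not an experiment reported here.

import numpy as np

def hidden_usage(h_prob):
    # h_prob: (n_samples, n_hid) hidden unit probabilities.
    states = (h_prob > 0.5).astype(int)
    n_distinct = len({tuple(s) for s in states})   # distinct binary states visited
    mean_act = h_prob.mean(axis=0)                 # per-unit usage
    return n_distinct, mean_act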
When trained on the motion dataset, on the other hand, the 2-level CRBM model reveals
its representational power. It clearly outperforms the mAR models: it discovers, for each attribute,
a pattern close to the main trend of the corresponding
observed values over time, and it reproduces this pattern by effectively using
its hidden layers to encode the different states of the system. The mAR model, in
contrast, is initially able to generate accurate synthetic motion, but fails
to reproduce the same pattern over time and thus results in a contractive system.
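The difference in long-run behaviour can be traced to the free-running generation loop, in which each new frame is conditioned on previously generated frames, so that any error feeds back into the history. A rough sketch of such a loop, with purely illustrative names and matching the parameter shapes of the CD-1 sketch above, is given below.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def crbm_generate(seed, steps, W, A, B, b_v, b_h, n_gibbs=30):
    # seed: list of the n_lags most recent frames, in the same order used
    # during training; Gaussian visibles are updated with their mean.
    frames = list(seed)
    n_lags = len(seed)
    for _ in range(steps):
        hist = np.concatenate(frames[-n_lags:])    # current history window
        bv_dyn = b_v + hist @ A
        bh_dyn = b_h + hist @ B
        v = frames[-1].copy()                      # initialise at the last frame
        for _ in range(n_gibbs):                   # alternating Gibbs updates
            h = (np.random.rand(b_h.size) <
                 sigmoid(v @ W + bh_dyn)).astype(float)
            v = h @ W.T + bv_dyn
        frames.append(v)                           # generated frame re-enters history
    return np.array(frames[n_lags:])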
The comparative analysis of the 2-level CRBM and the mAR models on the motion dataset
additionally reveals a potential difficulty of the learning task on the characters dataset.
The attributes of the motion dataset consist of small repeated sub-sequences, roughly
30 time frames in length. The observed process they define can therefore be seen as
a 49-dimensional “harmonic” waveform that is continuous in time. The character
sequences, on the other hand, are not continuous in time. Each process has its own start and
end, and in between, since the three attributes that define the process are in derivative
space, the waveform completes approximately two cycles.
This property of the data, combined with the fact that training of the 2-level CRBM
is performed on individual subsequences considered in a random order and, more importantly,
with the fact that the training algorithm does no smoothing, suggests that the
poor performance of the 2-level CRBM on the characters data may be at least partially
attributed to the training procedure currently used.
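For concreteness, the kind of training loop referred to above, with subsequences visited in a freshly shuffled order each epoch and no smoothing or continuity constraint across them, might look like the following; the function names and parameters are illustrative only.

import numpy as np

def make_subsequences(sequences, n_lags):
    # Cut each training sequence into (history window, current frame) pairs.
    pairs = []
    for seq in sequences:                      # seq: (T, d) array
        for t in range(n_lags, len(seq)):
            hist = seq[t - n_lags:t].reshape(-1)
            pairs.append((hist, seq[t]))
    return pairs

def train(pairs, update_fn, epochs=50, batch_size=100):
    for _ in range(epochs):
        np.random.shuffle(pairs)               # random subsequence order, no smoothing
        for i in range(0, len(pairs), batch_size):
            batch = pairs[i:i + batch_size]
            hist = np.array([h for h, _ in batch])
            v = np.array([x for _, x in batch])
            update_fn(v, hist)                 # e.g. the CD-1 update sketched earlier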