I have just a few days (till Thursday) to recommend a purchase vs.
open-source vs develop-in-house decision for a non-linear regression
problem we have.
We're looking at CART and XELOPES on the commercial side and
Orange and Weka on the open-source side.
The problem is unusal in some respects.
I've appended a description of the problem (especially 'what makes it
unusual') written by someone who's much more familar with the theory
than I am.
I'm hoping it'll make sense to you and you'll be able to say "yes, you
can do that using Weka like this ..." etc etc.
Let me know if things aren't clear enough and I'll expand from my
perspective (which is more practical that theoretical).
Our problem domain is defined by feature vectors of 5 to 20 nominal
features. Our goal is to predict a continuous value for each feature
vector presented to the system. Thus our problem is a non-linear
regression problem. The main difference from a standard setting is that
the feature vectors are not labeled with a corresponding continuous
value, but with a binary one. The continuous value is the rate (or
average) value within a group of feature vectors. If thinking about a
decision tree, the predicted continuous value would be calculated from
the feature vectors within a leaf node (or any other intermediate node).
We also wish to compute a secondary continuous value from the instance
labels within each tree node.
Our splitting rule is not based on the least-squared deviation (LSD) as
the vectors themselves are not labeled with the continuous value, but
instead on an ad hoc rule that tries to maximize the variance of the
computed continuous values across all the child nodes. It's possible LSD
would work using the binary values of each feature vector.
We also have ad hoc stopping rules based on statistical measures. One
rule determines the validity of each proposed split. If a split is not
acceptable the split is undone and the splitting process stopped on that
branch. A second stopping rule applies to each node. Nodes that do not
meet this rule a deleted from the tree.
We introduced a special "other" feature value to handle cases when a
test vector has a feature value that its not represented in the tree.
Not being represented can happen because the value was not present in
the training set (a common occurrence) or because the number of training
instances was too low to include it in the tree. When training, the
"other" feature value is added to all node splits after the optimal
split is decided. All the data of the parent node is considered present
in the "other" child node. When on the testing phase the "other" value
is a match when no other values are.
Our tree is not binary, but with sometimes thousands of splits per node.
This is probably not a theoretical problem as both are equivalent. It
might be an issue as we are planning on using the trees also as
representation of the data for human consumption.
We're considering having a pruning stage. It will eliminate all "other"
nodes that are leaves. And it will merge leaves whose corresponding
continuous values are close enough.
When searching the tree (performance phase) the process starts at the
root and proceeds as usual, with the caveat of the "other" value (as
described), but if at some time a match is not possible the current node
is the result of the search and the output continuous value is computed
from the instances within it.
The number of training feature vectors is large, in the millions.
Wekalist mailing list
|Free forum by Nabble||Edit this page|