Hello.
I have just a few days (till Thursday) to recommend a purchase vs. open-source vs develop-in-house decision for a non-linear regression problem we have. We're looking at CART and XELOPES on the commercial side and Orange and Weka on the open-source side. The problem is unusal in some respects. I've appended a description of the problem (especially 'what makes it unusual') written by someone who's much more familar with the theory than I am. I'm hoping it'll make sense to you and you'll be able to say "yes, you can do that using Weka like this ..." etc etc. Let me know if things aren't clear enough and I'll expand from my perspective (which is more practical that theoretical). Thanks! Tim Bunce. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Our problem domain is defined by feature vectors of 5 to 20 nominal features. Our goal is to predict a continuous value for each feature vector presented to the system. Thus our problem is a non-linear regression problem. The main difference from a standard setting is that the feature vectors are not labeled with a corresponding continuous value, but with a binary one. The continuous value is the rate (or average) value within a group of feature vectors. If thinking about a decision tree, the predicted continuous value would be calculated from the feature vectors within a leaf node (or any other intermediate node). We also wish to compute a secondary continuous value from the instance labels within each tree node. Our splitting rule is not based on the least-squared deviation (LSD) as the vectors themselves are not labeled with the continuous value, but instead on an ad hoc rule that tries to maximize the variance of the computed continuous values across all the child nodes. It's possible LSD would work using the binary values of each feature vector. We also have ad hoc stopping rules based on statistical measures. One rule determines the validity of each proposed split. If a split is not acceptable the split is undone and the splitting process stopped on that branch. A second stopping rule applies to each node. Nodes that do not meet this rule a deleted from the tree. We introduced a special "other" feature value to handle cases when a test vector has a feature value that its not represented in the tree. Not being represented can happen because the value was not present in the training set (a common occurrence) or because the number of training instances was too low to include it in the tree. When training, the "other" feature value is added to all node splits after the optimal split is decided. All the data of the parent node is considered present in the "other" child node. When on the testing phase the "other" value is a match when no other values are. Our tree is not binary, but with sometimes thousands of splits per node. This is probably not a theoretical problem as both are equivalent. It might be an issue as we are planning on using the trees also as representation of the data for human consumption. We're considering having a pruning stage. It will eliminate all "other" nodes that are leaves. And it will merge leaves whose corresponding continuous values are close enough. When searching the tree (performance phase) the process starts at the root and proceeds as usual, with the caveat of the "other" value (as described), but if at some time a match is not possible the current node is the result of the search and the output continuous value is computed from the instances within it. The number of training feature vectors is large, in the millions. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ _______________________________________________ Wekalist mailing list [hidden email] https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist |
Free forum by Nabble | Edit this page |