I was looking for some functions in Weka through which I can perform polynomial regression on my dataset. I have already performed linear regression, which resulted in high bias. Now I plan to incrementally increase the degree of the polynomial until I reach an acceptable level of bias. So I am looking for a function (or functions) with which I can first test a polynomial of degree 2, and so on.
You could use a kernel-based method in conjunction with a polynomial kernel, e.g., GaussianProcesses or SMOreg (or LibSVM, as a potentially faster alternative to SMOreg).
Make sure you enable “lower order terms” when setting up the polynomial kernel.
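For intuition, here is what the lower-order-terms option changes, in a small standalone sketch (not Weka's actual implementation, though its PolyKernel behaves the same way): with lower-order terms enabled, the polynomial kernel is (x·y + 1)^p, whose expansion contains all monomials up to degree p; with it disabled, (x·y)^p contains only the degree-p terms, so the lower-degree part of the polynomial is lost.

```java
public class PolyKernelSketch {
    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // p is the exponent; useLowerOrder mirrors the "lower order terms" option
    public static double polyKernel(double[] a, double[] b, int p, boolean useLowerOrder) {
        double d = dot(a, b);
        return Math.pow(useLowerOrder ? d + 1.0 : d, p);
    }

    public static void main(String[] args) {
        double[] x = {1, 2}, y = {3, 4};                // x.y = 11
        System.out.println(polyKernel(x, y, 2, false)); // 121.0
        System.out.println(polyKernel(x, y, 2, true));  // 144.0
    }
}
```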
> On 9/03/2017, at 10:20 PM, Aftab Akram <[hidden email]> wrote:
> AFTAB AKRAM
> Doctoral Student
> South China Normal University
> Guangzhou, P.R. China
Not sure how you would accomplish that with Weka, but with Apache Commons Math 3, maybe something like the below.
It converts Weka instances to double arrays and then uses Commons Math.
This was code I didn’t end up using and didn’t finish.
Instead, I did a data transform, taking the natural log of both x and y, and then ran a linear regression.
This assumes the relationship is a power law like…
Y = aX^b
I was mainly interested in how a constantly increasing X affected the response variable Y. If you have more than one attribute to include in the polynomial, I’m not sure how that would work. This might be too simplistic for you.
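The log-log transform described above can be sketched in plain Java (this is illustrative, not the unfinished Commons Math code): fitting ln(y) = ln(a) + b·ln(x) by ordinary least squares recovers the a and b of a power law y = aX^b.

```java
public class PowerLawFit {
    // Fit y = a * x^b by OLS on (ln x, ln y); returns {a, b}.
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            double lx = Math.log(x[i]), ly = Math.log(y[i]);
            sx += lx; sy += ly; sxx += lx * lx; sxy += lx * ly;
        }
        double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double lnA = (sy - b * sx) / n;
        return new double[]{Math.exp(lnA), b};
    }

    public static void main(String[] args) {
        // exact power law y = 2 * x^3, so the fit should recover a = 2, b = 3
        double[] x = {1, 2, 3, 4, 5};
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) y[i] = 2.0 * Math.pow(x[i], 3);
        double[] ab = fit(x, y);
        System.out.printf("a = %.4f, b = %.4f%n", ab[0], ab[1]);
    }
}
```

The recovered b is the estimated degree, which is the quantity of interest here.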
> On Jun 10, 2020, at 11:22 AM, Bill Bane <[hidden email]> wrote:
> I have run into this need at times, and used a filtered classifier that:
> a. adds higher-order attribute(s) using AddExpression, then
> b. performs LinearRegression (with the EliminateColinearAttributes flag
> turned off).
> For example, using a cubic model approach:
> weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F
> \"weka.filters.unsupervised.attribute.AddExpression -E a1^2 -N X-2\" -F
> \"weka.filters.unsupervised.attribute.AddExpression -E a1^3 -N X-3\" -F
> \"weka.filters.AllFilter \"" -W weka.classifiers.functions.LinearRegression
> -- -S 1 -C -R 1.0E-8 -additional-stats -num-decimal-places 4
> In a couple of synthetic examples, this returns very similar accuracy to
> SMOReg with a Poly Kernel of the same order (in this example, 3 -- with
> UseLowerOrder = True). The advantage of the Linear Regression approach is
> that the outputs are more interpretable, and the coefficients can easily be
> used offline for scenario modeling.
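For anyone curious what the AddExpression-plus-LinearRegression pipeline quoted above computes, here is a hypothetical plain-Java equivalent for the cubic case (a sketch of the math only; the Weka command is the practical way to run it): expand x into its powers up to degree 3, then solve the ordinary least-squares normal equations (X^T X) w = X^T y by Gaussian elimination.

```java
public class CubicRegression {
    // Returns {w0, w1, w2, w3} for y ≈ w0 + w1*x + w2*x^2 + w3*x^3.
    public static double[] fit(double[] x, double[] y) {
        int n = x.length, p = 4;
        double[][] X = new double[n][p];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < p; j++)
                X[i][j] = Math.pow(x[i], j);
        // Augmented normal equations: A = [X^T X | X^T y]
        double[][] A = new double[p][p + 1];
        for (int j = 0; j < p; j++) {
            for (int k = 0; k < p; k++)
                for (int i = 0; i < n; i++) A[j][k] += X[i][j] * X[i][k];
            for (int i = 0; i < n; i++) A[j][p] += X[i][j] * y[i];
        }
        // Gaussian elimination with partial pivoting
        for (int col = 0; col < p; col++) {
            int piv = col;
            for (int r = col + 1; r < p; r++)
                if (Math.abs(A[r][col]) > Math.abs(A[piv][col])) piv = r;
            double[] tmp = A[col]; A[col] = A[piv]; A[piv] = tmp;
            for (int r = col + 1; r < p; r++) {
                double f = A[r][col] / A[col][col];
                for (int k = col; k <= p; k++) A[r][k] -= f * A[col][k];
            }
        }
        // Back substitution
        double[] w = new double[p];
        for (int r = p - 1; r >= 0; r--) {
            w[r] = A[r][p];
            for (int k = r + 1; k < p; k++) w[r] -= A[r][k] * w[k];
            w[r] /= A[r][r];
        }
        return w;
    }

    public static void main(String[] args) {
        // exact cubic y = 1 + 2x - x^2 + 0.5x^3
        double[] x = {-2, -1, 0, 1, 2, 3};
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++)
            y[i] = 1 + 2 * x[i] - x[i] * x[i] + 0.5 * Math.pow(x[i], 3);
        double[] w = fit(x, y);
        System.out.printf("%.3f %.3f %.3f %.3f%n", w[0], w[1], w[2], w[3]);
    }
}
```

EliminateColinearAttributes must stay off in the Weka version precisely because x, x^2, and x^3 are correlated by construction, which is what this direct solve also relies on.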
I’m not quite following how this allows you to do LinearRegression on nonlinear, higher-order data.
What I’m currently doing is copying the Instances and then taking the logs using Weka’s MathExpression. This flattens the higher-order relationship to linear, and then I do LinearRegression.
The R^2 indicates decent results for that, so I am assuming I am getting reasonable results, and that this gives me a somewhat valid estimate of the degree of the nonlinear power law/polynomial, which is all I really want. I see using it on a comparison basis, where higher degrees indicate more complexity and less scalability -- an ordering metric. If it correctly indicates which cases scale better, then accuracy on the exact degree doesn’t really matter.
I did see somewhere that R^2 should not be used with nonlinear models, but with the log transform this is actually linear, and it is the metric used in the Empirical Complexity paper, which takes the same approach.
Right now I’m doing some refactoring on the code to separate out a part that seems like it could be reusable; I have another use in mind. I’m also packaging it, since it is getting beyond the simple command-line tool I originally intended. I’m even considering a first attempt at making it modular, basically just because of that.
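For reference, the R^2 being discussed is the ordinary coefficient of determination, computed on the log-transformed (hence linear) fit. A minimal standalone sketch:

```java
public class RSquared {
    // R^2 = 1 - SS_res / SS_tot, for actual values y and predictions yHat.
    public static double rSquared(double[] y, double[] yHat) {
        double mean = 0;
        for (double v : y) mean += v;
        mean /= y.length;
        double ssRes = 0, ssTot = 0;
        for (int i = 0; i < y.length; i++) {
            ssRes += (y[i] - yHat[i]) * (y[i] - yHat[i]);
            ssTot += (y[i] - mean) * (y[i] - mean);
        }
        return 1.0 - ssRes / ssTot;
    }

    public static void main(String[] args) {
        double[] y    = {1, 2, 3, 4};
        double[] yHat = {1.1, 1.9, 3.2, 3.8};  // a close but imperfect fit
        System.out.println(rSquared(y, yHat)); // close to 1
    }
}
```

A perfect fit gives exactly 1; here SS_res = 0.10 against SS_tot = 5, so R^2 = 0.98.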
Hi -- this little write-up attached may help describe how linear regression can be performed on nonlinear data. Of course, we need to be careful of overfitting or extrapolating models using higher-order terms like this, but for well-contained data sets it can work satisfactorily.
Attachment: Cubic_regression_example.pdf <https://weka.8497.n7.nabble.com/file/t5855/Cubic_regression_example.pdf>
This follows up on something earlier, where I thought GraalVM improved Weka’s memory management, but it turned out to just be different GC settings.
I thought you could come up with a tool to tune GC along the lines of what I had already been doing: just keep increasing RandomForest iterations until you run out of memory. The settings that allow more iterations might offer improved memory management. Although I don’t really have anything to prove that this would generalize to other classifiers and their parameters.
I also had the code record information about memory and garbage collection, to either a CSV or ARFF file.
default.csv is from command-line invocations with no GC parameters. It ran out of memory doing RandomForest at about 6000 iterations.
test.csv is the current run with different GC parameters. It made it to 7000 iterations.
So this code alone somewhat serves the original purpose: it can, to some extent, indicate how well GC is working.
However, I noticed that in either case, increasing iterations seemed to scale in a very linear way as long as there was free memory. When free memory ran out, things got nonlinear as GC tried to manage things on its own. The nonlinear part still looked like it might follow a fairly well-formed exponential-type curve, and I wondered if that could be modeled.
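If that curve really is exponential, one simple way to model it would be the same transform trick used for the power law, but in the other direction: regress ln(y) on x rather than on ln(x). A sketch under that assumption (illustrative only, not the actual tool’s code):

```java
public class ExpFit {
    // Fit y = a * e^(b x) by OLS on (x, ln y); returns {a, b}.
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            double ly = Math.log(y[i]);
            sx += x[i]; sy += ly; sxx += x[i] * x[i]; sxy += x[i] * ly;
        }
        double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double lnA = (sy - b * sx) / n;
        return new double[]{Math.exp(lnA), b};
    }

    public static void main(String[] args) {
        // exact exponential y = 3 * e^(0.5 x)
        double[] x = {1, 2, 3, 4, 5, 6};
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) y[i] = 3.0 * Math.exp(0.5 * x[i]);
        double[] ab = fit(x, y);
        System.out.printf("a = %.3f, b = %.3f%n", ab[0], ab[1]);
    }
}
```

Comparing this fit’s R^2 against the power-law fit’s would give a rough way to decide which shape the GC curve is closer to.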
To that end, if interested, you could look at either
X = iteration, Y = elapsed
X = iteration, Y = old_count
When things go nonlinear, most of the action starts occurring with GC in the old-gen memory pool.
You can see in the current dataset that I did some extra runs to fill in the nonlinear part a little. The code sorts the instances by iteration to allow for this.
The analysis code removes from the instances all attributes except the X and Y.
It also makes sure the attribute of interest (e.g. elapsed or old_count) is strictly increasing. The last run can hit an out-of-memory error early, so it eliminates instances from the back where that isn’t the case.
The code then tries to determine the nonlinear break: it removes an instance from the back and checks whether that improves linearity. If it does, it adds the instance to a separate, nonlinear Instances object, repeating until removing no longer improves linearity.
Then we have our nonlinear instances ready for modeling.
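The trim-from-the-back procedure described above might look something like this in outline (a hypothetical helper, not the actual code; points stand in for Weka instances, and a small tolerance guards against floating-point ties when deciding whether R^2 "improved"):

```java
import java.util.ArrayList;
import java.util.List;

public class NonlinearSplit {
    // R^2 of the best-fit line through (x, y) points.
    static double linearR2(List<double[]> pts) {
        int n = pts.size();
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (double[] p : pts) { sx += p[0]; sy += p[1]; sxx += p[0] * p[0]; sxy += p[0] * p[1]; }
        double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double a = (sy - b * sx) / n;
        double mean = sy / n, ssRes = 0, ssTot = 0;
        for (double[] p : pts) {
            double e = p[1] - (a + b * p[0]);
            ssRes += e * e;
            ssTot += (p[1] - mean) * (p[1] - mean);
        }
        return 1.0 - ssRes / ssTot;
    }

    // Trims points (sorted by x) from the back while removal improves R^2;
    // pts is left holding the linear head, and the nonlinear tail is returned.
    public static List<double[]> splitTail(List<double[]> pts) {
        List<double[]> tail = new ArrayList<>();
        while (pts.size() > 3) {
            double before = linearR2(pts);
            double[] last = pts.get(pts.size() - 1);
            pts.remove(pts.size() - 1);
            if (linearR2(pts) > before + 1e-9) {
                tail.add(0, last);   // removal helped: the point was nonlinear
            } else {
                pts.add(last);       // removal didn't help: stop trimming
                break;
            }
        }
        return tail;
    }

    public static void main(String[] args) {
        // linear up to x = 7, then doubling (nonlinear) afterwards
        List<double[]> pts = new ArrayList<>();
        for (int x = 1; x <= 10; x++)
            pts.add(new double[]{x, x <= 7 ? 2.0 * x : 14.0 * Math.pow(2, x - 7)});
        System.out.println("nonlinear tail size: " + splitTail(pts).size());
    }
}
```

On that synthetic data the three doubling points are peeled off into the tail and the strictly linear head is left behind, which is the split the modeling step needs.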
For a second use, I am considering a version that takes increasingly large splits of a dataset, for any given classifier that can handle that dataset, then sees whether at some point it goes nonlinear, and models the complexity -- to get an idea of how well different classifiers scale with increasing data.
If I finish this, I mean at some point to put something together that explains this more clearly and looks a little better.
The visualizations you had were nice. I wasn’t aware you could do some of those with Weka.