Association Rule Mining runtime behavior during rule generation phase

Lukas Ehrig

Dear Weka-community,

During some association rule mining experiments on gene expression data with Weka, I was surprised by the tremendous difference in the number of association rules Weka could generate on data sets of similar size (but different variance).

For the experiments, I used gene expression data (with genes ("items") in the columns, samples ("transactions") in rows) discretized to three expression levels (underexpressed, unchanged, overexpressed).

For the first experiment, I cut the data after the first 1000 columns; for the second, after the first 10000 columns (just to assess feasibility); and for the third, I selected the genes (columns) with above-average variance (about 7000).

For all experiments, Weka had 12 GB of RAM available. I loaded the (already discretized) CSV data, applied the NumericToNominal filter and ran Apriori:

--> Scheme:       weka.associations.Apriori -N 17 -T 0 -C 0.92 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
--> Relation:     variance_selected_discretized-weka.filters.unsupervised.attribute.NumericToNominal-Rfirst-last
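
For readers unfamiliar with the option string in the scheme line, here is a small sketch that splits it into flag/value pairs and prints what each flag means. The meanings follow the option descriptions of weka.associations.Apriori as I understand them; `APRIORI_OPTIONS` and `parse_scheme` are hypothetical helper names made up for this illustration and are not part of Weka.

```python
import shlex

# Flag meanings as I understand them from the Apriori option descriptions.
# This table and the function below are illustrative helpers, not Weka code.
APRIORI_OPTIONS = {
    "-N": "required number of rules",
    "-T": "metric type used to rank rules (0 = confidence)",
    "-C": "minimum metric score (here: minimum confidence)",
    "-D": "delta by which the minimum support is decreased per cycle",
    "-U": "upper bound for the minimum support",
    "-M": "lower bound for the minimum support",
    "-S": "significance level (-1.0 = no significance testing)",
    "-c": "class index (-1 = last attribute)",
}


def parse_scheme(scheme: str) -> dict:
    """Split a scheme string into the class name and a flag -> value mapping."""
    tokens = shlex.split(scheme)
    class_name, args = tokens[0], tokens[1:]
    # Every flag in this particular scheme is followed by exactly one value.
    options = dict(zip(args[0::2], args[1::2]))
    return {"class": class_name, "options": options}


scheme = ("weka.associations.Apriori -N 17 -T 0 -C 0.92 -D 0.05 "
          "-U 1.0 -M 0.1 -S -1.0 -c -1")
parsed = parse_scheme(scheme)
for flag, value in parsed["options"].items():
    print(f"{flag} {value:>5}  {APRIORI_OPTIONS[flag]}")
```

Read this way, the run asks for 17 rules with confidence of at least 0.92, lowering the minimum support in steps of 0.05 from 1.0 down to at most 0.1.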

Results: In the first two experiments (data cut after a fixed number of columns), Weka could generate thousands to tens of thousands of association rules before running out of memory, whereas in the last experiment (whose data has fewer columns than the data in the second experiment), Weka could only generate 17 (!) association rules before crashing due to insufficient memory (I had to run the experiment several times, decreasing the -N parameter each time).

In all experiments, the frequent itemsets can be generated, and the memory shortage occurs during the rule generation phase if too many rules are requested. What puzzles me is that the number and size of the itemsets generated in the first two experiments far exceed those of the frequent itemsets in the last experiment, yet Weka could process the larger itemsets much better and generate a much higher number of rules.

Please help me to understand why this happens.

With the best wishes for the holidays,

Lukas

---

If you'd like to take a look at the data and weka's full output, please have a look here:

https://drive.google.com/drive/folders/153SOgGpxYMxnVJXBMu_OSIS6hxdnW9SM?usp=sharing


_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Association Rule Mining runtime behavior during rule generation phase

Eibe Frank-3
Your third dataset has missing values instead of zeroes. In the other datasets, you do not have missing values. My guess is that in the data you are considering, the absence of zeroes makes it harder to find many rules that exceed a high confidence threshold (such as the 0.92 threshold you have specified).
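
A toy illustration of that guess, as a minimal sketch only: the six rows below are invented and are not taken from the posted data, and the helper names are hypothetical. The sketch assumes that a missing cell simply contributes no item to a transaction, and it only counts rules between single items. With explicit zeroes, the frequent "unchanged" items support high-confidence rules; once the zeroes are replaced by missing values, those items disappear and no rule is left at the same thresholds.

```python
from fractions import Fraction
from itertools import permutations


def to_transactions(rows):
    """Turn each row into a set of (column, value) items, skipping missing cells."""
    return [
        frozenset((col, val) for col, val in enumerate(row) if val is not None)
        for row in rows
    ]


def count_single_item_rules(transactions, min_support, min_confidence):
    """Count rules {a} => {b} whose support and confidence meet the thresholds."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def count(itemset):
        return sum(itemset <= t for t in transactions)

    rules = 0
    for a, b in permutations(items, 2):
        both = count({a, b})
        if both == 0:
            continue
        support = Fraction(both, n)
        confidence = Fraction(both, count({a}))
        if support >= min_support and confidence >= min_confidence:
            rules += 1
    return rules


# Invented toy data: 6 samples x 3 genes, levels -1 / 0 / 1.
rows_with_zeros = [
    (0, 0, 1),
    (0, 0, 1),
    (0, 0, 0),
    (0, 0, -1),
    (1, 0, 0),
    (0, -1, 0),
]
# Same data, but every zero is replaced by a missing value.
rows_with_missing = [
    tuple(None if v == 0 else v for v in row) for row in rows_with_zeros
]

min_support, min_confidence = Fraction(1, 2), Fraction(3, 4)
rules_zeros = count_single_item_rules(
    to_transactions(rows_with_zeros), min_support, min_confidence)
rules_missing = count_single_item_rules(
    to_transactions(rows_with_missing), min_support, min_confidence)
print(rules_zeros, rules_missing)
```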

I'm not sure how you can conclude that the item sets have been generated successfully in all cases. WEKA's Apriori will only output (statistics on) item sets once it has successfully found the required number of rules (N = 18) or the minimum support threshold (M = 0.1 in your case) has been reached. WEKA keeps reducing the support threshold (in steps of size D = 0.05) in an attempt to generate N = 18 rules with confidence >= 0.92 but does not succeed in doing so before it runs out of memory because the support level becomes so low that the number of frequent item sets becomes too large to hold in memory.
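
To get a feeling for why low support thresholds become a memory problem, here is a rough upper bound on how many k-item sets could exist at all. It uses the figures from the original post (about 7000 genes, three expression levels) and assumes that an item set contains at most one value per attribute. The function name is made up for this sketch, and the real number of frequent item sets depends on the data and the support threshold; this is only the ceiling that the search approaches as the threshold drops.

```python
from math import comb


def max_itemsets(num_attributes: int, values_per_attribute: int, k: int) -> int:
    """Upper bound on k-item sets: pick k distinct attributes, then one value for each."""
    return comb(num_attributes, k) * values_per_attribute ** k


# About 7000 attributes with 3 nominal values each, as described in the post.
for k in (1, 2, 3):
    print(k, max_itemsets(7000, 3, k))
```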

WEKA repeatedly runs the Apriori algorithm with smaller and smaller values for the support threshold until the desired number of rules with the given minimum confidence has been found (or the minimum support threshold has been reached). The number of runs of Apriori is stated as the number of "cycles performed" in the output.
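
The loop described above can be sketched roughly as follows. This is a simplified toy version, not WEKA's actual code: the function names are hypothetical, the five transactions are invented, the candidate generation omits the usual pruning step, and starting exactly at the upper bound is an assumption of the sketch. Fractions are used so that the threshold comparisons are exact.

```python
from fractions import Fraction
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search; returns {itemset: support} for all frequent item sets."""
    n = len(transactions)
    level = {frozenset([item]) for t in transactions for item in t}
    found = {}
    while level:
        kept = {}
        for s in level:
            support = Fraction(sum(s <= t for t in transactions), n)
            if support >= min_support:
                kept[s] = support
        found.update(kept)
        # Join step: unite frequent k-item sets that differ in exactly one item.
        level = {a | b for a, b in combinations(kept, 2)
                 if len(a | b) == len(a) + 1}
    return found


def rules(found, min_confidence):
    """All rules lhs => rhs from frequent item sets meeting the confidence threshold."""
    out = []
    for itemset, support in found.items():
        for size in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, size)):
                confidence = support / found[lhs]
                if confidence >= min_confidence:
                    out.append((lhs, itemset - lhs, confidence))
    return out


def mine_with_decreasing_support(transactions, num_rules, min_confidence,
                                 delta, upper, lower):
    """Lower the support threshold step by step until enough rules are found."""
    min_support = upper  # assumption: the first cycle uses the upper bound itself
    cycles = 0
    while True:
        cycles += 1
        result = rules(frequent_itemsets(transactions, min_support), min_confidence)
        if len(result) >= num_rules or min_support <= lower:
            return result, min_support, cycles
        min_support = max(min_support - delta, lower)


# Invented toy transactions.
transactions = [frozenset(t) for t in ("abc", "ab", "ac", "abc", "bc")]
result, final_support, cycles = mine_with_decreasing_support(
    transactions, num_rules=4, min_confidence=Fraction(3, 4),
    delta=Fraction("0.05"), upper=Fraction(1), lower=Fraction("0.1"))
print(len(result), final_support, cycles)
```

Each cycle redoes the whole frequent item set search at a lower threshold, so the cost of the last cycle dominates. If the requested number of rules is never reached, the threshold keeps falling towards the lower bound and the item set collection keeps growing.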

Cheers,
Eibe

On Mon, Dec 24, 2018 at 4:31 PM Lukas Ehrig <[hidden email]> wrote:

[...]