Building a classifier for string attributes


Building a classifier for string attributes

Mark Kimmerly

I am looking for help in building a classifier model for a very simple set of attributes, but a lot of data.

 

The goal is to build a model that will predict a numeric code based on 2 input strings.  For example, the input string pair “nylon hooded running jacket” and “sports apparel” should result in a code of 2459. From viewing the training videos on WekaMOOC, I am guessing that one of the Naïve Bayes classifiers (maybe the Sparse Generative Model) is what I’m looking for, but I’m having difficulty getting the correct filters configured to eliminate error messages and produce a working model.

 

I have a .arff file with the following attributes, the last one (a nominal attribute whose values are numeric codes) being the class attribute:

 

@attribute UPC_DESC string

@attribute XFLD1 string

@attribute DEFAULT_CODE_ID {1868,1886,1892,1904,1910,1912,1923, <2700 codes omitted>, 5016,5024}

 

There are close to 85,000 instances in the data section of the .arff file, with 2750 unique DEFAULT_CODE_ID values.

 

Can anyone suggest a Knowledge Flow configuration that will build a model that provides the best accuracy with this amount and type of data?

 

Thanks,

-Mark


Mark Kimmerly | Lead Engineer
e: [hidden email]
Avalara | Tax compliance done right




_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
To subscribe, unsubscribe, etc., visit https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Building a classifier for string attributes

Eibe Frank-2
Administrator

To begin with, I would use the FilteredClassifier with StringToWordVector as the filter and NaiveBayesMultinomial as the base classifier.

 

Then, I would start to experiment with different options in StringToWordVector. This should give a reasonable baseline for further experiments.
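A command-line sketch of this baseline setup (the dataset name and the wordsToKeep value are placeholders to adapt to your own data; the class names are standard Weka classes):

```shell
# FilteredClassifier wrapping StringToWordVector, with NaiveBayesMultinomial
# as the base learner. Note the two different -W options: the inner one is
# StringToWordVector's wordsToKeep, the outer one names the base classifier.
java weka.Run weka.classifiers.meta.FilteredClassifier \
  -t products.arff \
  -F "weka.filters.unsupervised.attribute.StringToWordVector -W 1000" \
  -W weka.classifiers.bayes.NaiveBayesMultinomial
```

The same configuration can be set up in the Knowledge Flow by choosing FilteredClassifier and editing its filter and classifier properties.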

 

Cheers,

Eibe

 

From: [hidden email]
Sent: Tuesday, 9 July 2019 7:34 AM
To: [hidden email]
Subject: [Wekalist] Building a classifier for string attributes

 


Re: Building a classifier for string attributes

Mark Kimmerly

Thanks Eibe!

 

I have been successfully building models based on your advice, but I have since found that my data source is different than I realized.  I have over 12 million instances to train/test with and a little over 600 unique class attribute values.  If I configure the # of words to keep for the StringToWordVector filter at 1000, the performance evaluation (PE) is at 80% and the resulting model is over a gigabyte in size.  Keeping 10000 words improves the PE to 82%, but the model size is > 2GB. Keeping 100 words produces a model around 700MB, but the PE drops to 70%.  I need to keep the model size below 500MB if possible.

 

Can you provide suggestions on how to reduce the size of the model without sacrificing performance?

 

Thanks again for sharing your knowledge and time!

-Mark

 

 





From: [hidden email] <[hidden email]> On Behalf Of Eibe Frank
Sent: Tuesday, July 9, 2019 12:20 AM
To: Weka machine learning workbench list. <[hidden email]>
Subject: Re: [Wekalist] Building a classifier for string attributes

 


Re: Building a classifier for string attributes

Eibe Frank-2
Administrator

For K classes and M attributes, NaiveBayesMultinomial stores O(K * M) double-precision floats as part of the model.

 

For 600 classes and 1,000 attributes, this makes for roughly 600 * 1000 * 8 bytes ≈ 4.8 MB, which is much lower than your memory consumption.
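As a quick sanity check of that estimate (back-of-envelope only; it ignores JVM object overhead and anything else serialized with the model):

```python
# Rough size of the count matrix NaiveBayesMultinomial stores:
# K classes x M attributes x 8 bytes per double-precision float.
K, M = 600, 1000
model_bytes = K * M * 8
print(model_bytes)              # 4800000 bytes
print(model_bytes / 1024 ** 2)  # roughly 4.6 MiB -- nowhere near 1 GB
```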

 

What is the actual number of attributes in the NaiveBayesMultinomial model? By default, StringToWordVector does not break ties when pruning the dictionary to the size requested by the user: if you specify wordsToKeep=1000 and the 1000th word has frequency F, then all other words that also have frequency F are kept as well. And, by default, frequencies are established on a per-class basis before the per-class dictionaries are merged into the final dictionary. Perhaps the actual number of attributes/words used in the NaiveBayesMultinomial model is much larger?
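A toy illustration of that tie-breaking behaviour (this is not Weka's actual code, just a sketch of the effect described above):

```python
# Asking for the top-n words keeps *all* words whose frequency ties with
# the n-th word, so the kept dictionary can be larger than n.
def prune_keep_ties(freqs, n):
    ranked = sorted(freqs.values(), reverse=True)
    cutoff = ranked[n - 1]                      # frequency of the n-th word
    return {w for w, f in freqs.items() if f >= cutoff}

freqs = {"jacket": 10, "nylon": 7, "hooded": 5, "soft": 5, "drink": 5, "coke": 2}
kept = prune_keep_ties(freqs, 3)   # ask for the top 3 words...
print(sorted(kept))                # ...but 5 are kept: every word at frequency >= 5
```

With many ties (common for word frequencies over 12 million short product strings), the gap between wordsToKeep and the real dictionary size can be substantial.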

 

Can you print the model to take a look at this? Or maybe apply the StringToWordVector filter separately and look at the number of attributes in the filtered data?

 

To reduce the number of attributes based on their predictive value, you can use information gain-based attribute ranking. A command-line example:

 

java weka.Run .AttributeSelection -S ".Ranker -N 2" -E .InfoGainAttributeEval < iris.arff

 

In your setting, you can apply this filter once the string attributes have been converted and before the naïve Bayes model is built.
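A sketch of that pipeline on the command line, using the same weka.Run abbreviations as above (file names, wordsToKeep, and the number of attributes retained by the Ranker are placeholders to tune):

```shell
# 1. Convert the string attributes to word-count attributes:
java weka.Run .StringToWordVector -W 1000 -c last -i products.arff -o vectors.arff

# 2. Keep only the top-ranked attributes by information gain
#    (AttributeSelection here is the supervised filter, applied batch-style):
java weka.Run .AttributeSelection -E .InfoGainAttributeEval \
    -S ".Ranker -N 500" -c last -i vectors.arff -o reduced.arff

# 3. Train the multinomial naive Bayes model on the reduced data:
java weka.Run .NaiveBayesMultinomial -t reduced.arff
```

In the Knowledge Flow, the equivalent is to place the AttributeSelection filter step between StringToWordVector and the classifier (or nest both filters in a MultiFilter inside FilteredClassifier).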

 

Cheers,

Eibe

             

From: [hidden email]
Sent: Saturday, 20 July 2019 6:32 AM
To: [hidden email]
Subject: Re: [Wekalist] Building a classifier for string attributes

 


Re: Building a classifier for string attributes

Mark Kimmerly

> What is the actual number of attributes in the NaiveBayesMultinomial model?

 

I am not sure how to determine this; apologies for my inexperience.  Each instance of data contains 2 string attributes and an integer class attribute.  The string pairs consist of retail product descriptions and categories, so each instance is similar to {“Nylon Hooded Jacket”, “Outdoor Apparel”, 1883} or {“Diet Cherry Coke”, “Soft Drinks”, 3356}. There are over 12 million instances of this data with an uneven number of instances per class. Of the 600 class values, 20 account for 80% of the instances.

> Can you print the model to take a look at this? Or maybe apply the StringToWordVector filter separately and look at the number of attributes in the filtered data?

 

Can you provide instructions on how to print the model? I’ve tried adding an Attribute Summarizer after the StringToWordVector so I can look at the results of the filter, but attempting to view the image when the model build is complete results in the Java VM consuming 100% of the CPU and memory on the system, and eventually a heap space error occurs.  (JAVA_OPTS=-Xmx28g)

 

Thanks,

-Mark

 





From: [hidden email] <[hidden email]> On Behalf Of Eibe Frank
Sent: Friday, July 19, 2019 10:01 PM
To: Weka machine learning workbench list. <[hidden email]>
Subject: Re: [Wekalist] Building a classifier for string attributes

 


Re: Building a classifier for string attributes

Eibe Frank-2
Administrator

You can use Visualization->TextViewer to look at datasets and classifiers in textual form.

 

The Attribute Summarizer generates a histogram for each attribute!

 

Can you perhaps post your flow (either as a .kf file or just a screenshot)?

 

Cheers,

Eibe

 

From: [hidden email]
Sent: Tuesday, 23 July 2019 6:02 AM
To: [hidden email]
Subject: Re: [Wekalist] Building a classifier for string attributes

 


Re: Building a classifier for string attributes

Mark Kimmerly

Here’s both.  😊  The following flow generates a model 1,060,730 KB in size with the TextViewer summary below the screenshot.  I’ve also attached the full output of the TextViewer results.  It’s 3MB uncompressed, so I compressed it into a Windows .zip file.

 

 

Correctly Classified Instances      194915               81.1554 %

Incorrectly Classified Instances     45260               18.8446 %

Kappa statistic                          0.7853

Mean absolute error                      0.0007

Root mean squared error                  0.0212

Relative absolute error                 25.0728 %

Root relative squared error             56.7784 %

Total Number of Instances           240175    

 

Thanks!

-Mark

 





From: [hidden email] <[hidden email]> On Behalf Of Eibe Frank
Sent: Monday, July 22, 2019 8:16 AM
To: Weka machine learning workbench list. <[hidden email]>
Subject: Re: [Wekalist] Building a classifier for string attributes

 


All.kf (4K) Download Attachment
Re: Building a classifier for string attributes

Eibe Frank-2
Administrator

It turns out that the SerializedModelSaver in the KnowledgeFlow, in all recent versions of WEKA, has been saving the content of string (and relational) attributes in the training data with the model, massively increasing the model file size in your scenario.

 

I have just checked a fix into the main trunk and the WEKA 3.8 branch. It should be in the next nightly snapshots available from our web site.

 

Thanks for bringing this to our attention!

 

Cheers,

Eibe

 

From: [hidden email]
Sent: Wednesday, 24 July 2019 2:49 AM
To: [hidden email]
Subject: Re: [Wekalist] Building a classifier for string attributes

 

Here’s both.  😊  The following flow generates a model 1,060,730 KB in size with the TextViewer summary below the screenshot.  I’ve also attached the full output of the TextViewer results.  It’s 3MB uncompressed, so I compressed it into a Windows .zip file.

 

 

Correctly Classified Instances      194915               81.1554 %

Incorrectly Classified Instances     45260               18.8446 %

Kappa statistic                          0.7853

Mean absolute error                      0.0007

Root mean squared error                  0.0212

Relative absolute error                 25.0728 %

Root relative squared error             56.7784 %

Total Number of Instances           240175    

 

Thanks!

-Mark

 





From: [hidden email] <[hidden email]> On Behalf Of Eibe Frank
Sent: Monday, July 22, 2019 8:16 AM
To: Weka machine learning workbench list. <[hidden email]>
Subject: Re: [Wekalist] Building a classifier for string attributes

 


You can use Visualization->TextViewer to look at datasets and classifiers in textual form.

 

The Attribute Summarizer generates a histogram for each attribute!

 

Can you perhaps post your flow (either as a .kf file or just a screenshot)?

 

Cheers,

Eibe

 

From: [hidden email]
Sent: Tuesday, 23 July 2019 6:02 AM
To: [hidden email]
Subject: Re: [Wekalist] Building a classifier for string attributes

 

> What is the actual number of attributes in the NaiveBayesMultinomial model?

 

I am not sure how to determine this; apologies for my inexperience.  Each instance contains 2 string attributes and an integer class attribute.  The string pairs are retail product descriptions and categories, so each instance is similar to {“Nylon Hooded Jacket”, “Outdoor Apparel”, 1883} or {“Diet Cherry Coke”, “Soft Drinks”, 3356}. There are over 12 million instances, with an uneven number of instances per class. Of the 600 class values, 20 account for 80% of the instances.

> Can you print the model to take a look at this? Or maybe apply the StringToWordVector filter separately and look at the number of attributes in the filtered data?

 

Can you provide instructions on how to print the model? I’ve tried adding an Attribute Summarizer after the StringToWordVector so I can look at the results of the filter but attempting to view the image when the model build is complete results in the Java VM consuming 100% of the CPU and memory on the system and eventually a heap space error occurs.  (JAVA_OPTS=-Xmx28g)

 

Thanks,

-Mark

 

 



From: [hidden email] <[hidden email]> On Behalf Of Eibe Frank
Sent: Friday, July 19, 2019 10:01 PM
To: Weka machine learning workbench list. <[hidden email]>
Subject: Re: [Wekalist] Building a classifier for string attributes

 


For K classes and M attributes, NaiveBayesMultinomial stores O(K * M) double-precision floats as part of the model.

 

For 600 classes and 1,000 attributes, this makes for roughly 600 * 1000 * 8 bytes (about 4.8 MB), which is much lower than your memory consumption.

 

What is the actual number of attributes in the NaiveBayesMultinomial model? By default, StringToWordVector does not break ties when pruning the dictionary to the size set by the user: if you specify wordsToKeep=1000 and the 1000th word has frequency F, then all other words that also have frequency F will be kept as well. And, by default, the frequency is established on a per-class basis when the per-class dictionaries are merged into the final dictionary. Perhaps the actual number of attributes/words used in the NaiveBayesMultinomial model is much larger?
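The tie-keeping behaviour described here can be sketched as follows. This is an illustrative reimplementation in plain Python, not WEKA's actual code, and the word counts and cutoff are made-up example values:

```python
from collections import Counter

def prune_keep_ties(counts, words_to_keep):
    """Keep the top `words_to_keep` words by frequency, plus every word
    tied with the last one kept -- mirroring the pruning behaviour
    described for StringToWordVector (sketch, not WEKA code)."""
    ranked = counts.most_common()
    if len(ranked) <= words_to_keep:
        return dict(ranked)
    # Frequency of the last word that falls inside the requested size.
    cutoff = ranked[words_to_keep - 1][1]
    # Everything at or above the cutoff frequency survives, so the
    # resulting dictionary can be larger than words_to_keep.
    return {w: f for w, f in ranked if f >= cutoff}

counts = Counter({"jacket": 9, "nylon": 5, "hooded": 5,
                  "running": 5, "apparel": 2})
kept = prune_keep_ties(counts, 2)
print(sorted(kept))  # 4 words survive, not 2: the ties are kept
```

With wordsToKeep=2 the three words tied at frequency 5 are all retained, which is why the real attribute count (and hence model size) can exceed the configured dictionary size.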

 

Can you print the model to take a look at this? Or maybe apply the StringToWordVector filter separately and look at the number of attributes in the filtered data?

 

To reduce the number of attributes based on their predictive value, you can use information gain-based attribute ranking. A command-line example:

 

java weka.Run .AttributeSelection -S ".Ranker -N 2" -E .InfoGainAttributeEval < iris.arff

 

In your setting, you can apply this filter once the string attributes have been converted and before the naïve Bayes model is built.

 

Cheers,

Eibe

             

Re: Building a classifier for string attributes

Mark Kimmerly

Thanks for the updated version; it decreased the model size by almost 50%. Unfortunately, at over 600 MB, the model size still presents a problem.

 

We have had success in the past using the Sparse Generative Model classifier, but the amount of data appears to be causing problems.  When the model is built and evaluated using an .arff file with 10000 instances, the process takes only a few minutes, the performance evaluator shows good results, and the model size is under 50 MB.  However, when I increase the number of instances to 8 million, all processes in the KnowledgeFlow sequence complete within 5 minutes except the Classifier Performance Evaluator, which never finishes. It appears to stall: the Java VM stops consuming CPU (except for occasional screen updates). I’ve let the KnowledgeFlow continue the evaluation for several hours, but eventually I have to kill the VM as I assume it won’t complete.

 

Is there a limitation on the amount of data that can be fed to the SGM classifier?

 

Thanks,

-Mark

 

 


Re: Building a classifier for string attributes

Eibe Frank-2
Administrator
Perhaps try the newest nightly snapshot of WEKA. We have made some changes to
speed up the calculation of false positive rate and related statistics.
Previously, runtime for calculating those statistics for all classes, and
their weighted average, was cubic in the number of classes!

However, it may be that the bottleneck is due to the calculation of AUROC
and AUPRC. There is currently no way to turn off that calculation in the
KnowledgeFlow's ClassifierPerformanceEvaluator. In the Explorer, we have
just added an option to turn it off under "More Options..." (and it has
always been possible to turn it off at the command-line interface).

Calculation of AUROC and AUPRC requires collecting the probability
estimates for all classes for all test instances. The space complexity for
this is O(CN), where C is the number of classes and N is the number of test
instances. The time complexity of the calculation for all classes is
O(CN log(N)).
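A back-of-envelope estimate shows why this O(CN) storage alone can exhaust even a 28 GB heap. This sketch assumes, purely for illustration, that all 8 million instances mentioned in the thread end up in the test set; the actual train/test split in the flow may differ:

```python
C = 600           # number of classes (from this thread)
N = 8_000_000     # test instances (assumption: the full 8M fed to the flow)
BYTES_PER_DOUBLE = 8

# One probability estimate per class per test instance: O(C*N) doubles.
prediction_bytes = C * N * BYTES_PER_DOUBLE
print(f"{prediction_bytes / 2**30:.1f} GiB")  # well beyond a 28 GB heap
```

Even before counting the stored test instances themselves, the collected predictions alone would need tens of gigabytes at this scale.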

It also looks like the ClassifierPerformanceEvaluator stores all test
instances along with the predictions, to be able to visualise errors, and
there is no way to turn this off either.



Re: Building a classifier for string attributes

Mark Kimmerly
Thanks Eibe. I downloaded the snapshot, turned off most of the metrics on the performance evaluator, and restarted the flow. The evaluation process took all available RAM and kept the CPU active above 86% for several hours yesterday (as opposed to just a few minutes), but this morning I found that the result is the same: the evaluation timer continues to tick with no CPU usage or disk activity, and memory usage has dropped to 70%.

-Mark

Re: Building a classifier for string attributes

Eibe Frank-3
Yes, sorry, I got that wrong: you can, in fact, turn off the output (and, thus, the calculation) of AUROC and AUPRC in ClassifierPerformanceEvaluator. However, the estimated class probabilities are still collected and stored, which requires O(CN) space. The test data, including all attribute values, are also stored for visualisation. It seems we need to make it possible to turn off (a) the collection of predictions and (b) the storage of the test data for visualisation in ClassifierPerformanceEvaluator.

Have you looked at using the IncrementalClassifierEvaluator in the KnowledgeFlow instead? It can be used in conjunction with incremental learning algorithms (i.e., UpdateableClassifier implementations in WEKA). There are two UpdateableClassifier types in WEKA that can process string attributes: NaiveBayesMultinomialText and SGDText. For example, NaiveBayesMultinomialText combines (most of) the functionality of StringToWordVector and NaiveBayesMultinomial.
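The incremental-update idea behind an updateable classifier can be illustrated with a toy multinomial naive Bayes in plain Python. This is not WEKA's NaiveBayesMultinomialText (and not its API); it only shows the principle that each training instance updates per-class counts in bounded memory, so the full 12M-instance dataset never needs to be held at once. The example strings and codes are taken from earlier in the thread:

```python
from collections import defaultdict
import math

class TinyUpdateableMNB:
    """Toy incremental multinomial naive Bayes over whitespace tokens.
    Illustrates the UpdateableClassifier idea: one instance at a time,
    constant work per update, no stored dataset. Not WEKA code."""

    def __init__(self):
        self.class_counts = defaultdict(int)   # instances seen per class
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.class_totals = defaultdict(int)   # total words per class
        self.vocab = set()

    def update(self, text, label):
        # Analogous to updateClassifier(): fold one instance into the counts.
        self.class_counts[label] += 1
        for w in text.lower().split():
            self.word_counts[label][w] += 1
            self.class_totals[label] += 1
            self.vocab.add(w)

    def predict(self, text):
        words = text.lower().split()
        n = sum(self.class_counts.values())
        best, best_lp = None, -math.inf
        for c in self.class_counts:
            lp = math.log(self.class_counts[c] / n)  # class prior
            denom = self.class_totals[c] + len(self.vocab)  # Laplace smoothing
            for w in words:
                lp += math.log((self.word_counts[c][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = c, lp
        return best

model = TinyUpdateableMNB()
for text, code in [("nylon hooded running jacket", 2459),
                   ("sports apparel", 2459),
                   ("diet cherry coke", 3356),
                   ("soft drinks", 3356)]:
    model.update(text, code)   # incremental: one instance at a time

print(model.predict("hooded jacket"))  # → 2459
```

In WEKA, the same streaming pattern would be driven by the KnowledgeFlow's instance-by-instance connection to an UpdateableClassifier, with the IncrementalClassifierEvaluator consuming predictions as they arrive instead of buffering the whole test set.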

Cheers,
Eibe
