Three Questions

Three Questions

Yaakov HaCohen-Kerner
Question 1:
I want to perform text classification as single-class supervised classification,
using, for example, the relative frequencies of the 1000 most frequent word uni-grams and a few supervised machine learning methods,
in the Experimenter with the Train/Test Percentage Split (Data Randomized).

If I prepare a CSV file of the 1000 most frequent word uni-grams computed over the whole corpus,
it seems that this is NOT valid, because I have also used the test set.

a) Is this true?
b) If so, how do you suggest I create a valid CSV file containing the relative frequencies of the 1000 most frequent word uni-grams for the experiment described above?


Question 2:
How can I perform text classification experiments in WEKA (what are the tools and possibilities)
for multi-class supervised classification corpora
such as Reuters-21578 and enron_mails?

Question 3:
Your default for the Train/Test Percentage Split is 66% for Train and 34% for Test.
May I suggest that you change it to 67% for Train and 33% for Test (because 66.6666% is closer to 67% than to 66%, and 33.3333% is closer to 33% than to 34%)?

Many thanks in advance,
Yaakov 

_______________________________________________
Wekalist mailing list
Send posts to: [hidden email]
List info and subscription status: https://list.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Three Questions

Eibe Frank-2
Administrator

> On 10/05/2017, at 6:55 AM, Yaakov HaCohen-Kerner <[hidden email]> wrote:
>
> Question 1:
> I want to perform text classification as single-class supervised classification,
> using, for example, the relative frequencies of the 1000 most frequent word uni-grams and a few supervised machine learning methods,
> in the Experimenter with the Train/Test Percentage Split (Data Randomized).
>
> If I prepare a CSV file of the 1000 most frequent word uni-grams computed over the whole corpus,
> it seems that this is NOT valid, because I have also used the test set.
>
> a) Is this true?

I don’t think it matters much, as long as you are not using the class labels of the test set. Uni-grams that don’t occur in the training set won’t be used by the classifiers anyway. However, you might lose some uni-grams from the training set that don’t make it into the top 1000 on the full dataset.
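As a rough illustration of the two effects just described, here is a minimal Python sketch (standard library only, not WEKA code). The toy documents and the helper `top_k_words` are invented for this example; it compares the top-k vocabulary built on the full corpus with the one built on the training documents alone.

```python
from collections import Counter

def top_k_words(docs, k):
    """Return the set of the k most frequent words across docs (hypothetical helper)."""
    counts = Counter(word for doc in docs for word in doc.split())
    return {word for word, _ in counts.most_common(k)}

# Toy corpus, invented for illustration only.
train_docs = ["apple banana apple", "banana cherry"]
test_docs = ["durian durian durian", "durian elder"]

full_vocab = top_k_words(train_docs + test_docs, k=3)
train_vocab = top_k_words(train_docs, k=3)
train_words = {word for doc in train_docs for word in doc.split()}

# Training words pushed out of the top k by frequent test-set words.
lost_from_training = train_vocab - full_vocab
# Vocabulary entries that never occur in training, so a classifier cannot learn from them.
unseen_in_training = full_vocab - train_words

print("full-corpus vocabulary:", sorted(full_vocab))
print("training-only vocabulary:", sorted(train_vocab))
print("lost training words:", sorted(lost_from_training))
print("entries unseen in training:", sorted(unseen_in_training))
```

In this toy case the full-corpus vocabulary gains a word that is useless for training and loses one that the training set would have kept.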

> b) If so, how do you suggest I create a valid CSV file containing the relative frequencies of the 1000 most frequent word uni-grams for the experiment described above?

You could use the FilteredClassifier in conjunction with the StringToWordVector filter. This will ensure that only the training set is used to establish the dictionary. Note that the StringToWordVector filter builds a dictionary per class by default. Those dictionaries are then merged to find the final dictionary. Also, the filter does not break ties, so your dictionary may be larger than 1,000 uni-grams.
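The behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the StringToWordVector implementation: the function names, the toy training data, and the choice to divide by the document's total word count are all hypothetical. The sketch builds one top-k dictionary per class from the training data only, keeps all words tied at the cutoff, merges the per-class dictionaries, and then computes relative frequencies for a new document against that fixed dictionary.

```python
from collections import Counter

def top_k_with_ties(counts, k):
    """Keep the k most frequent words plus any words tied with the k-th one."""
    if len(counts) <= k:
        return set(counts)
    cutoff = sorted(counts.values(), reverse=True)[k - 1]
    return {word for word, c in counts.items() if c >= cutoff}

def build_dictionary(train, k):
    """Build a per-class top-k dictionary from (text, label) pairs and merge them."""
    per_class = {}
    for text, label in train:
        per_class.setdefault(label, Counter()).update(text.split())
    merged = set()
    for counts in per_class.values():
        merged |= top_k_with_ties(counts, k)
    return sorted(merged)

def relative_frequencies(text, dictionary):
    """Relative frequency of each dictionary word in one document."""
    words = text.split()
    counts = Counter(words)
    total = len(words)
    return [counts[w] / total if total else 0.0 for w in dictionary]

# Toy training data, invented for illustration only.
train = [
    ("goal match goal team", "sport"),
    ("vote party vote law", "politics"),
]

dictionary = build_dictionary(train, k=2)
# Test documents are mapped onto the training dictionary; unknown words are ignored.
features = relative_frequencies("goal vote goal unknown", dictionary)

print(dictionary)
print(features)
```

With k=2 the merged dictionary here ends up with six entries, because each class has ties at the cutoff and the two class dictionaries do not overlap.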

> Question 2:
> How can I perform text classification experiments in WEKA (what are the tools and possibilities)
> for multi-class supervised classification corpora
> such as Reuters-21578 and enron_mails?

You’d use the FilteredClassifier with StringToWordVector and NaiveBayesMultinomial, an SVM (i.e., SMO/LibSVM/LibLINEAR/SGD), or BayesianLogisticRegression, possibly applied in conjunction with the MultiClassClassifier. It’s also possible to include attribute selection by applying the AttributeSelection filter or the AttributeSelectedClassifier.
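For readers unfamiliar with the multinomial naive Bayes model, here is a minimal multi-class sketch in plain Python. It is not WEKA's NaiveBayesMultinomial; the class name `TinyMultinomialNB`, the add-one smoothing, and the toy documents and labels are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Minimal multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.vocab = {w for doc in docs for w in doc.split()}
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.n_docs = len(labels)
        return self

    def predict(self, doc):
        # Words outside the training vocabulary are ignored.
        words = [w for w in doc.split() if w in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / self.n_docs)  # log prior
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy three-class data, invented for illustration only.
docs = [
    "oil price barrel", "oil crude export",
    "wheat grain harvest", "grain corn crop",
    "bank rate interest", "rate loan bank",
]
labels = ["crude", "crude", "grain", "grain", "interest", "interest"]

model = TinyMultinomialNB().fit(docs, labels)
print(model.predict("crude oil export price"))
print(model.predict("corn harvest"))
print(model.predict("bank loan unknownword"))
```

The model handles any number of classes directly, which is why a wrapper such as MultiClassClassifier is only needed for learners that are inherently binary.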

José María Gómez Hidalgo has put together a lot of information on text classification with WEKA here:

http://www.esp.uem.es/jmgomez/tmweka/

> Question 3:
> Your default for the Train/Test Percentage Split is 66% for Train and 34% for Test.
> May I suggest that you change it to 67% for Train and 33% for Test (because 66.6666% is closer to 67% than to 66%, and 33.3333% is closer to 33% than to 34%)?

Users’ default results would change, so it’s probably best to leave it as it is.
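To make the effect concrete, here is a small Python sketch of the arithmetic. It assumes the training-set size is the instance count times the percentage, rounded to the nearest integer; that rounding rule and the helper `split_sizes` are assumptions for illustration, not taken from the WEKA source.

```python
import math

def split_sizes(n_instances, train_percent):
    """Return (train, test) sizes, assuming round-half-up on the training size."""
    train = math.floor(n_instances * train_percent / 100 + 0.5)
    return train, n_instances - train

for percent in (66, 67):
    train, test = split_sizes(300, percent)
    print(f"{percent}%: {train} training instances, {test} test instances")
```

Under that assumption, a 300-instance dataset moves three instances from the test set to the training set when the default changes from 66% to 67%, so previously reported default results would no longer be reproduced.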

Cheers,
Eibe

Re: Three Questions

Peter Reutemann
>> Question 3:
>> Your default for the Train/Test Percentage Split is 66% for Train and 34% for Test.
>> May I suggest that you change it to 67% for Train and 33% for Test (because 66.6666% is closer to 67% than to 66%, and 33.3333% is closer to 33% than to 34%)?
>
> Users’ default results would change, so it’s probably best to leave it as it is.

Just change your default settings ("ClassifierPercentageSplit"):
http://weka.wikispaces.com/weka_gui_explorer_Explorer.props

Cheers, Peter
--
Peter Reutemann
Dept. of Computer Science
University of Waikato, NZ
+64 (7) 858-5174
http://www.cms.waikato.ac.nz/~fracpete/
http://www.data-mining.co.nz/