I haven't tried running distributedWekaSpark under AWS/EMR, but I have run it on YARN-based Cloudera and Hortonworks clusters. You need to replace all the Spark libraries in ~/wekafiles/packages/distributedWekaSpark/lib with the Spark assembly jar from the cluster you want to run against. For client-side Spark to pick up important Hadoop cluster settings (such as the resource manager host), the Hadoop cluster's config directory must also be on Weka's CLASSPATH.

distributedWekaSpark only supports yarn-client mode for running on YARN clusters. In this mode the driver program executes on the local machine, and the YARN resource manager is simply used to provision worker nodes in the Hadoop cluster for Spark to use. So you would enter "yarn-client" in the master property when configuring the Weka job. The port field can be left blank (from memory), as Spark picks up all pertinent settings from the Hadoop config files. Depending on how much pain is involved in opening the ports/services of AWS hosts to the outside world (and Spark uses quite a few for comms), you would probably be best off installing Weka on an AWS node and running it from the command line.
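A minimal sketch of that setup, assuming a Linux client with a default Weka package install; every path below is illustrative (your cluster's assembly jar location, Hadoop config dir, and weka.jar path will differ):

```shell
# Illustrative paths only -- adjust for your cluster and Weka install.

# 1. Replace the bundled Spark libraries with the cluster's assembly jar
rm ~/wekafiles/packages/distributedWekaSpark/lib/*.jar
cp /path/to/cluster/spark-assembly.jar \
   ~/wekafiles/packages/distributedWekaSpark/lib/

# 2. Put the Hadoop config dir on Weka's CLASSPATH so client-side Spark
#    can find the resource manager host and other cluster settings
export CLASSPATH=/etc/hadoop/conf:$CLASSPATH

# 3. Launch Weka; in the Spark job config, set the master property to
#    "yarn-client" and leave the port field blank
java -cp "$CLASSPATH:/path/to/weka.jar" weka.gui.GUIChooser
```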
distributedWekaSpark does not support Spark 2.x yet. There are breaking API changes between Spark 1 and Spark 2 that will require a fair amount of work (and probably a separate Weka package) to support.