Tuesday, October 1, 2013

Using Mahout to cluster the iris data that ships with Weka in ARFF format

Why this article?

I am writing this to share how I finally managed to cluster the iris data using Mahout in a Hadoop environment and visualize the clusters using Gephi. It took me a long time to figure this out because there is very little documentation on it and I had little time to spare.

The link below explains how to cluster documents with k-means and describes many of the parameters.
https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering

But I was not sure how to cluster data that is available as a CSV or ARFF file. An ARFF file is almost like a CSV file, except that the column names and data types are declared at the top.
Example:
---------------------------------------------------------------
@RELATION iris

@ATTRIBUTE sepallength numeric
@ATTRIBUTE sepalwidth numeric
@ATTRIBUTE petallength numeric
@ATTRIBUTE petalwidth numeric
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
--------------------------------------------------------------------
Everything below @DATA is just CSV.
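
To make the "header plus CSV" layout concrete, here is a minimal sketch that pulls the CSV rows out of an ARFF file with awk. The scratch file and the variable names (arff, csv_rows) are made up for illustration, and the extra "%" line is only there to show that ARFF comment lines get skipped; point the awk command at your own iris.arff instead.

```shell
# Scratch copy of the example above (hypothetical file; use your own iris.arff instead).
arff=$(mktemp)
cat > "$arff" <<'EOF'
@RELATION iris

@ATTRIBUTE sepallength numeric
@ATTRIBUTE sepalwidth numeric
@ATTRIBUTE petallength numeric
@ATTRIBUTE petalwidth numeric
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
% lines starting with a percent sign are ARFF comments
EOF

# Keep only non-empty, non-comment lines that come after the @DATA marker.
csv_rows=$(awk 'found && NF && !/^%/ { print }
                tolower($0) ~ /^@data/ { found = 1 }' "$arff")
echo "$csv_rows"
```

The @DATA line itself is not printed because the flag is only set after the print rule has already been evaluated for that line.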

Special note** I had to change "REAL" to "numeric" to make it work with Mahout. If you get the file from the UCI repository, the attributes might be declared as
@ATTRIBUTE sepallength REAL
but the arff.vector tool didn't like REAL, so I changed it to numeric for all the attributes.
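
If you would rather not edit the file by hand, the REAL-to-numeric change can be sketched with sed. This is only an illustration under my own assumptions: the scratch directory, the file name iris_real.arff, and the variable workdir are hypothetical, and the pattern only touches @ATTRIBUTE lines that end in REAL (in any letter case).

```shell
# Work in a scratch directory so the real iris.arff is left untouched.
workdir=$(mktemp -d)
cat > "$workdir/iris_real.arff" <<'EOF'
@RELATION iris

@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
EOF

# On @ATTRIBUTE lines only, rewrite a trailing REAL (any case) to numeric.
sed -E '/^@[Aa][Tt][Tt][Rr][Ii][Bb][Uu][Tt][Ee]/ s/[[:space:]]+[Rr][Ee][Aa][Ll][[:space:]]*$/ numeric/' \
    "$workdir/iris_real.arff" > "$workdir/iris.arff"

# Show the attribute declarations after the rewrite.
grep -i '^@attribute' "$workdir/iris.arff"
```

The class attribute and the data rows are left alone because neither ends in REAL on an @ATTRIBUTE line.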

The steps I followed to cluster the iris data using Mahout


First I placed the iris.arff file in a folder called irisoct on my Hadoop Linux server.
(The file ships with Weka and is also available from the UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/ )
Then I ran the following commands to generate a GraphML file containing the clusters, which I could visualize using Gephi on my Windows machine.

**Assumption: I am running these commands from my home folder, which contains the irisoct folder with the iris.arff file

**Place the file on HDFS
hadoop fs -put irisoct/iris.arff irisoct

**Convert the ARFF file into vectors
mahout arff.vector -d irisoct -o irisoct/data -t irisoct/dict

**Run canopy clustering on the file
mahout canopy -i irisoct/data -o irisoct/output -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -t1 3 -t2 2 -ow --clustering
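
To get a feel for what -t1 3 and -t2 2 mean with the Euclidean measure, here is a small worked example in awk. It is a sketch under assumptions: the helper name euclid is made up, the second point (7.0,3.2,4.7,1.4) is an Iris-versicolor row I picked from the dataset for illustration, and I only use the four numeric measurements, whereas the vectors Mahout actually sees may also carry the encoded class column. As I understand canopy clustering, a point closer than T1 to a canopy center joins that canopy, and a point closer than T2 is bound tightly enough that it will not start a canopy of its own.

```shell
# Euclidean distance between two comma-separated points (hypothetical helper).
euclid() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, ","); split(b, y, ",")
        for (i = 1; i <= n; i++) sum += (x[i] - y[i]) ^ 2
        printf "%.4f\n", sqrt(sum)
    }'
}

t1=3   # same value as -t1 in the canopy command
t2=2   # same value as -t2 in the canopy command

d=$(euclid "5.1,3.5,1.4,0.2" "7.0,3.2,4.7,1.4")

# Compare the distance with the two thresholds.
verdict=$(LC_ALL=C awk -v d="$d" -v t1="$t1" -v t2="$t2" 'BEGIN {
    if (d < t2)      print "within T2"
    else if (d < t1) print "within T1 only"
    else             print "outside T1"
}')
echo "distance=$d ($verdict)"
```

With these two rows the distance comes out at about 4.0, which is larger than T1, so under this reading the second flower would not join a canopy centered on the first one.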

**See the clusters as text in irisoctanalysis.txt
mahout clusterdump -i irisoct/output/clusters-0-final -o irisoctanalysis.txt -p irisoct/output/clusteredPoints

**See the clusters as a GraphML file
mahout clusterdump -i irisoct/output/clusters-0-final -of GRAPH_ML -o irisoctanalysis.graphml -p irisoct/output/clusteredPoints

**Transfer the file irisoctanalysis.graphml to a Windows machine and load it in Gephi as a GraphML file

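**Run kmeans clustering using the clusters that were created by canopy
mahout kmeans   --input irisoct/data   --output irisoct/kmeans-output   --numClusters 3   --clusters irisoct/output/clusters-0-final --maxIter 20  --method mapreduce   --distanceMeasure org.apache.mahout.common.distance.TanimotoDistanceMeasure --clustering

Note** According to the Mahout help text for --clusters, when --numClusters is also given, Mahout first selects that many random vectors and writes them to the --clusters path. So, as written, this run likely starts from 3 random seeds rather than from the canopy centroids, and it replaces whatever was at that path. Leave out --numClusters if you want kmeans to start from the canopy output.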
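
Note that the kmeans step uses the Tanimoto measure, not the Euclidean one used for canopy, so distances are on a different scale. Here is a minimal sketch of the Tanimoto distance as I understand it, 1 - a.b / (|a|^2 + |b|^2 - a.b), computed in awk. The helper name tanimoto is made up, the second point is an Iris-versicolor row I picked for illustration, and only the four numeric measurements are used, which may differ from the vectors Mahout actually builds.

```shell
# Tanimoto distance between two comma-separated points (hypothetical helper).
tanimoto() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, ","); split(b, y, ",")
        for (i = 1; i <= n; i++) {
            dot += x[i] * y[i]   # a . b
            aa  += x[i] * x[i]   # |a|^2
            bb  += y[i] * y[i]   # |b|^2
        }
        denom = aa + bb - dot
        dist = (denom > 0) ? 1 - dot / denom : 0
        printf "%.4f\n", dist
    }'
}

tani=$(tanimoto "5.1,3.5,1.4,0.2" "7.0,3.2,4.7,1.4")
echo "tanimoto distance=$tani"
```

For these two rows the Tanimoto distance is roughly 0.23, while the Euclidean distance between the same rows is about 4.0, so thresholds chosen for one measure do not carry over to the other.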

**See the clusters as text in irisoctanalysiskm.txt
mahout clusterdump -i irisoct/kmeans-output/clusters-1-final -o irisoctanalysiskm.txt -p irisoct/kmeans-output/clusteredPoints

**See the clusters as a GraphML file
mahout clusterdump -i irisoct/kmeans-output/clusters-1-final -of GRAPH_ML -o irisoctanalysiskm.graphml -p irisoct/kmeans-output/clusteredPoints

**Transfer the file irisoctanalysiskm.graphml to a Windows machine and load it in Gephi as a GraphML file

What software was used

Version of mahout used: mahout-core-0.7
Hadoop: Cloudera. Hadoop 2.0.0-cdh4.3.0
Gephi 0.8

4 comments:

  1. Hi,

    I tried the same approach and ran into a problem at the "Convert the arff file into a vector" step.

    Following are the logs:

    mahout arff.vector -d /root/Desktop/irisoct/iris.arff -o /root/Desktop/irisoct/data -t /root/Desktop/irisoct/dict
    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
    Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
    MAHOUT-JOB: /opt/cloudera/parcels/CDH-4.5.0-1.cdh4.5.0.p0.30/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
    13/12/06 11:50:47 WARN driver.MahoutDriver: No arff.vector.props found on classpath, will use command-line arguments only
    13/12/06 11:50:48 INFO arff.Driver: Output Dir: /root/Desktop/irisoct/data
    13/12/06 11:50:48 INFO arff.Driver: Converting File: /root/Desktop/irisoct/iris.arff
    13/12/06 11:50:49 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
    13/12/06 11:50:49 INFO compress.CodecPool: Got brand-new compressor [.deflate]
    13/12/06 11:50:49 INFO arff.Driver: Wrote: 150 vectors
    13/12/06 11:50:49 INFO driver.MahoutDriver: Program took 1499 ms (Minutes: 0.024983333333333333)

    Only the dict file was created, and the following is its content:

    Label bindings for Relation iris
    class 4
    sepallength 0
    sepalwidth 1
    petalwidth 3
    petallength 2

    No data was created for the next operation. Can you please suggest how to resolve the problem?

    Thanks in advance.

  2. The dict file is created in the local folder.
    The actual vector file is created on the Hadoop file system.

  3. I am not able to view the graphml file!

    1. Download Gephi

      https://gephi.org/users/download/
