Tuesday, October 1, 2013

Using Mahout to cluster the iris data set that ships with Weka in ARFF format

Why this article?

I am writing this to share how I was finally able to cluster the iris data using Mahout in a Hadoop environment and visualize the clusters using Gephi. It took me a very long time to figure all this out since there is very little documentation and I don't have much time on my hands.

The link below explains how to cluster documents and describes many of the parameters you can use.
https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering

But I was not sure how to cluster data that is available as a csv or arff file. An arff file is almost like a csv file, except that the column names and data types are declared at the top.
Example:
---------------------------------------------------------------
@RELATION iris

@ATTRIBUTE sepallength numeric
@ATTRIBUTE sepalwidth numeric
@ATTRIBUTE petallength numeric
@ATTRIBUTE petalwidth numeric
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
--------------------------------------------------------------------
Everything below @DATA is just csv.
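
To illustrate that point, here is a minimal sketch that drops the header and keeps only the csv rows. The file names iris_sample.arff and iris_sample.csv are made up for this example, and the sample has only two columns and two rows. It is not part of the Mahout workflow.

```shell
# Build a tiny sample ARFF file (hypothetical name, two data rows only)
cat > iris_sample.arff <<'EOF'
@RELATION iris

@ATTRIBUTE sepallength numeric
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,Iris-setosa
7.0,Iris-versicolor
EOF

# Print only the non-empty lines that come after the @DATA marker
awk 'found && NF { print } toupper($0) ~ /^@DATA/ { found = 1 }' iris_sample.arff > iris_sample.csv

cat iris_sample.csv
```

The awk program switches on printing once it has seen the @DATA line, so the header never reaches the csv file.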

Special note** I had to change "REAL" to "numeric" to make the file work with Mahout. If you get the file from the UCI repository, the attribute lines might look like
@ATTRIBUTE sepallength REAL
but the arff.vector tool didn't like REAL, so I changed it to numeric for all the attributes.
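
I made this change by hand, but it can also be done with sed. The lines below are only a sketch: the file names iris_real_sample.arff and iris_numeric_sample.arff are made up, the sample is a cut-down header rather than the real file, and the pattern assumes the attribute lines start with an upper-case @ATTRIBUTE.

```shell
# Build a tiny sample with REAL attributes (hypothetical name, not the full data set)
cat > iris_real_sample.arff <<'EOF'
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@DATA
5.1,3.5
EOF

# On @ATTRIBUTE lines only, replace a trailing REAL (any letter case) with numeric
sed -E '/^@ATTRIBUTE/ s/[[:space:]]+[Rr][Ee][Aa][Ll][[:space:]]*$/ numeric/' \
    iris_real_sample.arff > iris_numeric_sample.arff

cat iris_numeric_sample.arff
```

Writing to a new file keeps the original untouched, so the two can be compared before running arff.vector.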

The steps I followed to cluster the iris data using Mahout

First I placed the iris.arff file in a folder called irisoct on my Hadoop Linux server.
(The file is available with Weka and from the UCI machine learning repository http://archive.ics.uci.edu/ml/ )
Then I executed the following commands to generate a graphml file containing the clusters, which I could visualize using Gephi on my Windows machine.

**I assume that these commands are executed from my home folder, which contains the folder irisoct with the iris.arff file

**Place the file on HDFS
hadoop fs -put irisoct/iris.arff irisoct

**Convert the arff file into Mahout vectors
mahout arff.vector -d irisoct -o irisoct/data -t irisoct/dict

**Run canopy clustering on the file
mahout canopy -i irisoct/data -o irisoct/output -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -t1 3 -t2 2 -ow --clustering
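
A note on the two thresholds, for readers who have not used canopy clustering before: T1 is the loose distance and T2 is the tight distance (T1 is larger than T2). A point closer than T1 to a canopy center joins that canopy, and a point closer than T2 is also removed from the pool of candidate centers. The sketch below does not call Mahout. It only works through the Euclidean distance between the setosa row shown earlier and a second row that I assume to be a versicolor row, and compares the result with the -t1 3 and -t2 2 values used above.

```shell
# Euclidean distance between two iris rows, compared with -t1 3 and -t2 2
result=$(awk -v t1=3 -v t2=2 'BEGIN {
    split("5.1,3.5,1.4,0.2", a, ",")   # the setosa row from the arff example
    split("7.0,3.2,4.7,1.4", b, ",")   # assumed versicolor row, for illustration only
    sum = 0
    for (i = 1; i <= 4; i++) {
        diff = a[i] - b[i]
        sum += diff * diff
    }
    d = sqrt(sum)
    printf "distance=%.2f\n", d
    if (d < t2)      print "within T2: same canopy, no longer a candidate center"
    else if (d < t1) print "within T1: same canopy, still a candidate center"
    else             print "beyond T1: not in this canopy"
}')

echo "$result"
```

For these two rows the distance comes out at about 4.0, which is beyond T1, so they would not share a canopy with these settings.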

**See the clusters as text in irisoctanalysis.txt
mahout clusterdump -i irisoct/output/clusters-0-final -o irisoctanalysis.txt -p irisoct/output/clusteredPoints

**See the clusters as a graphml file
mahout clusterdump -i irisoct/output/clusters-0-final -of GRAPH_ML -o irisoctanalysis.graphml -p irisoct/output/clusteredPoints

**Transfer the file irisoctanalysis.graphml to a Windows machine and load it in Gephi as a graphml file

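**Run kmeans clustering using the clusters that were created by canopy
mahout kmeans --input irisoct/data --output irisoct/kmeans-output --numClusters 3 --clusters irisoct/output/clusters-0-final --maxIter 20 --method mapreduce --distanceMeasure org.apache.mahout.common.distance.TanimotoDistanceMeasure --clustering
Special note** As far as I understand the Mahout documentation, when --numClusters is given, kmeans picks that many random points from the input as the starting centroids and writes them to the --clusters path. To start from the canopy centroids instead, leave out --numClusters 3.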

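**See the clusters as text in irisoctanalysiskm.txt (the number in clusters-1-final is the iteration at which kmeans stopped, so check the kmeans-output folder if your run ended on a different iteration)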
mahout clusterdump -i irisoct/kmeans-output/clusters-1-final -o irisoctanalysiskm.txt -p irisoct/kmeans-output/clusteredPoints

**See the clusters as a graphml file
mahout clusterdump -i irisoct/kmeans-output/clusters-1-final -of GRAPH_ML -o irisoctanalysiskm.graphml -p irisoct/kmeans-output/clusteredPoints

**Transfer the file irisoctanalysiskm.graphml to a Windows machine and load it in Gephi as a graphml file

What software was used

Mahout: mahout-core-0.7
Hadoop: Cloudera, Hadoop 2.0.0-cdh4.3.0
Gephi 0.8