Why this article?
I am writing this to share how I was finally able to cluster the iris data using Mahout in a Hadoop environment and visualize the clusters using Gephi. It took me a very long time to figure all this out, since there is very little documentation and I don't have much time on my hands. The link below explains how to cluster documents and covers many of the parameters to use.
https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
But I was not sure how to cluster data that is available as a csv or arff file. An arff file is almost like a csv file, except that the column names and data types are declared at the top.
Example
---------------------------------------------------------------
@RELATION iris
@ATTRIBUTE sepallength numeric
@ATTRIBUTE sepalwidth numeric
@ATTRIBUTE petallength numeric
@ATTRIBUTE petalwidth numeric
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
--------------------------------------------------------------------
Everything below @DATA is just csv
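Since the rows below @DATA are plain csv, a csv file can be turned into an arff file by putting a header in front of it. The following is a minimal sketch, assuming a headerless csv file with the same five columns as the example above. The file names iris.csv and iris.arff and the temporary working folder are only for illustration, and the single data row is the one shown above.

```shell
#!/bin/sh
# Sketch: wrap a headerless csv file in an arff header.
# iris.csv is a stand-in created here for illustration only.
set -eu

workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in csv: the one row from the example above.
printf '%s\n' '5.1,3.5,1.4,0.2,Iris-setosa' > iris.csv

# Write the header first, then append the csv rows unchanged after @DATA.
{
  cat <<'EOF'
@RELATION iris
@ATTRIBUTE sepallength numeric
@ATTRIBUTE sepalwidth numeric
@ATTRIBUTE petallength numeric
@ATTRIBUTE petalwidth numeric
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
EOF
  cat iris.csv
} > iris.arff

cat iris.arff
```

For other data sets the @ATTRIBUTE lines have to be written by hand so that they match the csv columns in order.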
Special note** I had to change "REAL" to "numeric" to make it work with Mahout. If you check the UCI repository, the file might contain
@ATTRIBUTE sepallength REAL
but the arff.vector tool didn't like REAL, so I had to change it to numeric for all the attributes.
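The change from REAL to numeric can be made in an editor, or with sed. This is a minimal sketch, assuming the attribute lines look exactly like the one above (uppercase @ATTRIBUTE and uppercase REAL at the end of the line). The file names iris_real.arff and iris.arff are hypothetical, and the sample file is created here only so the sketch can be run on its own.

```shell
#!/bin/sh
# Sketch: rewrite REAL attribute types to numeric for arff.vector.
# File names are hypothetical; the sample stands in for the downloaded file.
set -eu

workdir="$(mktemp -d)"
cd "$workdir"

cat > iris_real.arff <<'EOF'
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
EOF

# Only lines starting with @ATTRIBUTE are touched, so data rows stay as they are.
# Only the uppercase spelling REAL at the end of the line is matched.
sed -E '/^@ATTRIBUTE/ s/[[:space:]]REAL[[:space:]]*$/ numeric/' iris_real.arff > iris.arff

cat iris.arff
```

The original file is left untouched, so the two versions can be compared with diff before uploading anything.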
The steps I followed to cluster the iris data using Mahout
First I placed the iris.arff file in a folder called irisoct on my Hadoop Linux server
(The file is available along with Weka and from the UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/ )
Then I executed the following commands to generate a graphml file containing the clusters, which I can visualize using Gephi on my Windows machine.
**Assumption: I am executing these commands from my home folder, which contains the folder irisoct with the iris.arff file
**Place the file on HDFS
hadoop fs -put irisoct/iris.arff irisoct
**Convert the arff file into a vector
mahout arff.vector -d irisoct -o irisoct/data -t irisoct/dict
**Run canopy clustering on the file
mahout canopy -i irisoct/data -o irisoct/output -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -t1 3 -t2 2 -ow --clustering
**See the clusters as text in irisoctanalysis.txt
mahout clusterdump -i irisoct/output/clusters-0-final -o irisoctanalysis.txt -p irisoct/output/clusteredPoints
**See the clusters as a graphml file
mahout clusterdump -i irisoct/output/clusters-0-final -of GRAPH_ML -o irisoctanalysis.graphml -p irisoct/output/clusteredPoints
**Transfer the file irisoctanalysis.graphml to a Windows machine and load it in Gephi as a graphml file
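**Run kmeans clustering using the clusters that were created by canopy
**Note: according to the Mahout K-Means documentation, when --numClusters is given, Mahout first writes randomly selected input vectors to the --clusters path, so leave out --numClusters if the canopy centroids themselves should be the starting clusters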
mahout kmeans --input irisoct/data --output irisoct/kmeans-output --numClusters 3 --clusters irisoct/output/clusters-0-final --maxIter 20 --method mapreduce --distanceMeasure org.apache.mahout.common.distance.TanimotoDistanceMeasure --clustering
**See the clusters as text in irisoctanalysiskm.txt (the number in clusters-1-final is the last kmeans iteration, so it may differ on your run)
mahout clusterdump -i irisoct/kmeans-output/clusters-1-final -o irisoctanalysiskm.txt -p irisoct/kmeans-output/clusteredPoints
**See the clusters as a graphml file
mahout clusterdump -i irisoct/kmeans-output/clusters-1-final -of GRAPH_ML -o irisoctanalysiskm.graphml -p irisoct/kmeans-output/clusteredPoints
**Transfer the file irisoctanalysiskm.graphml to a Windows machine and load it in Gephi as a graphml file
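The commands above can also be collected into one script. This is only a sketch: the DRY_RUN variable and the run helper are my own additions and are not part of Mahout or Hadoop. With DRY_RUN=1, which is the default here, each command is printed instead of executed, so the script can be checked on a machine without Hadoop. The commands themselves are the same as in the steps above.

```shell
#!/bin/sh
# Sketch: the steps from this article in one script.
# DRY_RUN and run are hypothetical helpers, not part of Mahout or Hadoop.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them
DIR="irisoct"             # folder name used throughout the article
DIST="org.apache.mahout.common.distance"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

iris_pipeline() {
  run hadoop fs -put "$DIR/iris.arff" "$DIR"
  run mahout arff.vector -d "$DIR" -o "$DIR/data" -t "$DIR/dict"
  run mahout canopy -i "$DIR/data" -o "$DIR/output" \
    -dm "$DIST.EuclideanDistanceMeasure" -t1 3 -t2 2 -ow --clustering
  run mahout clusterdump -i "$DIR/output/clusters-0-final" \
    -o irisoctanalysis.txt -p "$DIR/output/clusteredPoints"
  run mahout clusterdump -i "$DIR/output/clusters-0-final" -of GRAPH_ML \
    -o irisoctanalysis.graphml -p "$DIR/output/clusteredPoints"
  run mahout kmeans --input "$DIR/data" --output "$DIR/kmeans-output" \
    --numClusters 3 --clusters "$DIR/output/clusters-0-final" --maxIter 20 \
    --method mapreduce --distanceMeasure "$DIST.TanimotoDistanceMeasure" --clustering
  run mahout clusterdump -i "$DIR/kmeans-output/clusters-1-final" \
    -o irisoctanalysiskm.txt -p "$DIR/kmeans-output/clusteredPoints"
  run mahout clusterdump -i "$DIR/kmeans-output/clusters-1-final" -of GRAPH_ML \
    -o irisoctanalysiskm.graphml -p "$DIR/kmeans-output/clusteredPoints"
}

iris_pipeline
```

Saved under a name of your choice and started with DRY_RUN=0, the script runs the commands for real and stops at the first one that fails because of set -e.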
What software was used
Mahout: mahout-core-0.7
Hadoop: Cloudera, Hadoop 2.0.0-cdh4.3.0
Gephi: 0.8