Why this article?
I am writing this to share how I was finally able to cluster the iris data using Mahout in a Hadoop environment and visualize the clusters using Gephi. It took me a very long time to figure all this out, since there is very little documentation and I don't have much time on my hands. The link below explains how to cluster documents and describes many of the parameters to use.
But I was not sure how to cluster data that is available as a csv or arff file. An arff file is almost like a csv, except that the column names and data types are declared at the top.
@ATTRIBUTE sepallength numeric
@ATTRIBUTE sepalwidth numeric
@ATTRIBUTE petallength numeric
@ATTRIBUTE petalwidth numeric
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
Everything below @DATA is just csv
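To make the header/data split concrete, here is a minimal sketch in Python. It is not part of the Mahout workflow; the function name split_arff and the shortened two-column sample are my own, made up for illustration. It assumes a simple file like iris.arff and ignores ARFF features such as quoted names and sparse rows.

```python
def split_arff(text):
    """Split ARFF text into (attributes, rows).

    attributes: list of (name, type) pairs taken from @ATTRIBUTE lines
    rows: list of lists of strings, one per csv line below @DATA
    """
    attributes, rows = [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            # Everything below @DATA is treated as plain csv
            rows.append([field.strip() for field in line.split(",")])
        elif line.lower().startswith("@attribute"):
            _, name, attr_type = line.split(None, 2)
            attributes.append((name, attr_type))
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, rows


# Shortened sample for illustration only (the real file has five columns)
sample = """@RELATION iris
@ATTRIBUTE sepallength numeric
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,Iris-setosa
7.0,Iris-versicolor
"""
attributes, rows = split_arff(sample)
print(attributes)
print(rows)
```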
Special note** I had to change "real" to "numeric" to make it work with Mahout. If you check the UCI repository, the line might read
@ATTRIBUTE sepallength REAL
but the arff.vector tool did not accept REAL, so I changed it to numeric for all the attributes.
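If you would rather not edit the header by hand, the change can be scripted. The following is a rough sketch in Python, not something I ran as part of the steps below; the function name real_to_numeric is hypothetical. It only touches @ATTRIBUTE lines whose type is REAL and leaves the data rows alone.

```python
import re

# Matches "@ATTRIBUTE <name> REAL" in any letter case, keeping the prefix
ATTRIBUTE_REAL = re.compile(r"^(\s*@attribute\s+\S+\s+)real\s*$", re.IGNORECASE)


def real_to_numeric(arff_text):
    """Return the ARFF text with REAL replaced by numeric on @ATTRIBUTE lines."""
    fixed = [ATTRIBUTE_REAL.sub(r"\1numeric", line)
             for line in arff_text.splitlines()]
    return "\n".join(fixed) + "\n"


header = "@ATTRIBUTE sepallength REAL\n@DATA\n5.1\n"
print(real_to_numeric(header))
```

To apply it to a file, read the file into a string, pass it through real_to_numeric, and write the result back out.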
The steps I followed to cluster the iris data using mahout
First, I placed the iris.arff file in a folder called irisoct on my Hadoop Linux server.
(The file is available along with Weka and from the machine learning repository http://archive.ics.uci.edu/ml/ )
Then I executed the following commands to generate a graphml file containing clusters that I can visualize using Gephi on my Windows machine.
**Assumption: I am executing these commands from my home folder, which contains the folder irisoct with the iris.arff file
**Place the file on HDFS
hadoop fs -put irisoct/iris.arff irisoct
**Convert the arff file into a vector
mahout arff.vector -d irisoct -o irisoct/data -t irisoct/dict
**Run canopy clustering on the file
mahout canopy -i irisoct/data -o irisoct/output -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -t1 3 -t2 2 -ow --clustering
**See the clusters as text in irisoctanalysis.txt
mahout clusterdump -i irisoct/output/clusters-0-final -o irisoctanalysis.txt -p irisoct/output/clusteredPoints
**See the clusters as graphml file.
mahout clusterdump -i irisoct/output/clusters-0-final -of GRAPH_ML -o irisoctanalysis.graphml -p irisoct/output/clusteredPoints
**Transfer file irisoctanalysis.graphml to a windows machine and load it in Gephi as a graphml file
**Run kmeans clustering using the clusters that were created by canopy
mahout kmeans --input irisoct/data --output irisoct/kmeans-output --numClusters 3 --clusters irisoct/output/clusters-0-final --maxIter 20 --method mapreduce --distanceMeasure org.apache.mahout.common.distance.TanimotoDistanceMeasure --clustering
**See the clusters as text in irisoctanalysiskm.txt
mahout clusterdump -i irisoct/kmeans-output/clusters-1-final -o irisoctanalysiskm.txt -p irisoct/kmeans-output/clusteredPoints
**See the clusters as graphml file.
mahout clusterdump -i irisoct/kmeans-output/clusters-1-final -of GRAPH_ML -o irisoctanalysiskm.graphml -p irisoct/kmeans-output/clusteredPoints
**Transfer file irisoctanalysiskm.graphml to a windows machine and load it in Gephi as a graphml file
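For readers wondering what the -t1 and -t2 options in the canopy step control, here is a simplified single-machine sketch of canopy clustering in Python. It is my own illustration, not Mahout's MapReduce implementation: the function names and the four sample points are made up, and each canopy keeps its first point as the center, whereas a real implementation may report a computed centroid. Points closer than T1 to a center join that canopy; points closer than T2 are also removed from further consideration, so T1 must be larger than T2.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2, distance=euclidean):
    """Group points into canopies using a loose (t1) and tight (t2) threshold."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # next unassigned point starts a canopy
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)  # loosely belongs to this canopy
            if d >= t2:
                still_remaining.append(p)  # may still start or join another canopy
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


# Made-up points, using the same thresholds as the canopy command above
points = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (10.0, 0.0)]
result = canopy(points, t1=3, t2=2)
for center, members in result:
    print(center, members)
```

Note that a point can belong to more than one canopy, which is why canopy output is often used only as the starting clusters for kmeans, as in the steps above.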
What software was used
Version of mahout used: mahout-core-0.7
Hadoop: Cloudera. Hadoop 2.0.0-cdh4.3.0
Gephi 0.8
I tried the same approach and ran into a problem at the "Convert the arff file into a vector" step.
Following are the logs:
mahout arff.vector -d /root/Desktop/irisoct/iris.arff -o /root/Desktop/irisoct/data -t /root/Desktop/irisoct/dict
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /opt/cloudera/parcels/CDH-4.5.0-1.cdh4.5.0.p0.30/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
13/12/06 11:50:47 WARN driver.MahoutDriver: No arff.vector.props found on classpath, will use command-line arguments only
13/12/06 11:50:48 INFO arff.Driver: Output Dir: /root/Desktop/irisoct/data
13/12/06 11:50:48 INFO arff.Driver: Converting File: /root/Desktop/irisoct/iris.arff
13/12/06 11:50:49 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/12/06 11:50:49 INFO compress.CodecPool: Got brand-new compressor [.deflate]
13/12/06 11:50:49 INFO arff.Driver: Wrote: 150 vectors
13/12/06 11:50:49 INFO driver.MahoutDriver: Program took 1499 ms (Minutes: 0.024983333333333333)
Only the dict file was created, and the following is its content:
Label bindings for Relation iris
class 4
sepallength 0
sepalwidth 1
petalwidth 3
petallength 2
No data was created for the next operation. Can you please suggest how to resolve the problem?
Thanks in advance.
The dict file is created in the local folder.
The actual vector file is created on the Hadoop file system.
I am not able to view the graphml file!
Download Gephi