July 23, 2012
This is a follow-up to Finding the most important words in documents. This time I set up a Hadoop cluster with only one node (hey, I'm still on my MacBook in a hotel room).
I didn't do much customization; it's a straightforward Hadoop setup, right out of the box. I took my mapper and reducer code and ran it to find the most important words in some documents I had on my computer, since I don't have access to a corpus right now.
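The map and reduce code themselves aren't shown in this post, so here is only a rough sketch of the Streaming-style pair involved. The function names are mine, and plain word counting stands in for the importance scoring from the previous post:

```python
from itertools import groupby

def mapper(lines):
    """Streaming-style mapper: emit a (word, 1) pair for every token.
    On the cluster each pair would be printed as a tab-separated line."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    """Streaming-style reducer: pairs arrive sorted by key (Hadoop's
    shuffle guarantees this), so one groupby pass gives per-word totals."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Local check: `sorted` stands in for Hadoop's sort/shuffle phase.
docs = ["one fish two fish", "red fish blue fish"]
counts = dict(reducer(sorted(mapper(docs))))
print(counts["fish"])  # "fish" appears four times across the two lines
```

Chaining the two through `sorted` like this is a handy way to smoke-test the pair locally before submitting the real job.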
Things to Note
Hadoop sorts using Quicksort by default. If you want to change it to something else (Heapsort or Mergesort), you need to edit mapred-site.xml. For example:
<property>
  <name>map.sort.class</name>
  <value>org.apache.hadoop.util.HeapSort</value>
  <description>Use heap sort class for sorting keys.</description>
</property>
- If you want to use Mergesort, be careful: if the data length is less than 7, it will be sorted with insertion sort instead. See line 40 in Hadoop's MergeSort implementation.
- If the recursion goes too deep, Quicksort will switch to Heapsort. See lines 35 and 76 in Hadoop's QuickSort implementation.
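That Quicksort-to-Heapsort fallback is the idea behind introsort, and it's easy to sketch. The depth limit and pivot choice below are illustrative, not Hadoop's exact values:

```python
import heapq

def introsort(a, depth_limit=None):
    """Quicksort that falls back to heapsort once the recursion gets too
    deep, which caps the worst case at O(n log n)."""
    if depth_limit is None:
        depth_limit = 2 * max(len(a), 1).bit_length()
    if len(a) <= 1:
        return a
    if depth_limit == 0:
        # Recursion budget exhausted: finish with heapsort via heapq.
        heap = list(a)
        heapq.heapify(heap)
        return [heapq.heappop(heap) for _ in range(len(heap))]
    # Otherwise, a plain quicksort step with a middle-element pivot.
    pivot = a[len(a) // 2]
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return (introsort(less, depth_limit - 1)
            + equal
            + introsort(greater, depth_limit - 1))
```

The fallback matters because Quicksort's worst case is O(n^2); bounding the recursion depth means an adversarial or already-sorted input can't blow up the sort phase of a job.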
Let's start with a clean system.
hadoop namenode -format
Once the filesystem has been formatted, we can start the nodes.
Let's create a new folder in HDFS.
hadoop dfs -mkdir input
Then run your map code and reduce code.
When everything goes well, I can go to http://localhost:50030 and see how the jobs are doing.
Everything went better than expected
Now let's have an error in the reduce code (yes, I made the reducer fail on purpose to take the screenshot). In that case, the graph looks like this:
Houston, we have a problem
You can do anything with Hadoop. Anything at all. The only limit is yourself.