Faruk Akgul

Hadoop: Finding The Most Important Words In Documents

July 23, 2012

This is a follow-up post to Finding the most important words in documents. This time I set up a Hadoop cluster with only one node (hey, I'm still with my MacBook in a hotel room).

I didn't do much customization; it's a stock Hadoop installation, straight out of the box. I took my mapper and reducer code and ran it to find the most important words in some documents I had on my computer, since I don't have access to a corpus right now.
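I'm not reproducing my actual job here, but for context, here's a minimal Hadoop Streaming-style sketch in Python of the word-counting step (the raw term frequencies you'd feed into a tf-idf score). The function names and the whitespace tokenization are my own simplifications, not the code from the original post:

```python
from itertools import groupby

# In a real Streaming job, the mapper and reducer would be two separate
# scripts reading sys.stdin line by line; Hadoop sorts the mapper output
# by key before it reaches the reducer.

def mapper(lines):
    """Emit tab-separated (word, 1) pairs, as Hadoop Streaming expects."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"

def reducer(pairs):
    """Sum the counts per word; assumes the input is sorted by key."""
    keyed = (p.split("\t") for p in pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"
```

Locally you can simulate the whole pipeline by piping the mapper's output through `sort` before feeding it to the reducer, which is exactly what Hadoop's shuffle phase does for you at scale.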

Things to Note

Hadoop sorts with Quicksort by default. If you want to change it to something else (Heapsort or Mergesort), you need to edit mapred-site.xml. For example:

<property>
    <name>map.sort.class</name>
    <value>org.apache.hadoop.util.HeapSort</value>
    <description>Use heap sort class for sorting keys.</description>
</property>

Note:

  • If you want to use Mergesort, be careful: if the length of the data is less than 7, it will be sorted with Insertion sort instead. See line 40 in org.apache.hadoop.util.MergeSort.java.
  • If the recursion goes too deep, Quicksort will switch to Heapsort. See line 35 and line 76 in org.apache.hadoop.util.QuickSort.java.
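That fallback behavior (insertion sort for tiny ranges, heapsort when quicksort recurses too deep) is essentially introsort. Here's a rough Python sketch of the idea, with illustrative thresholds rather than Hadoop's exact ones:

```python
import heapq

SMALL = 7  # below this length, fall back to insertion sort

def insertion_sort(a, lo, hi):
    for i in range(lo + 1, hi):
        j = i
        while j > lo and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]
            j -= 1

def heap_sort(a, lo, hi):
    heap = a[lo:hi]
    heapq.heapify(heap)
    a[lo:hi] = [heapq.heappop(heap) for _ in range(len(heap))]

def introsort(a, lo=0, hi=None, depth=None):
    if hi is None:
        hi = len(a)
    if depth is None:
        # illustrative depth limit, roughly 2 * log2(n)
        depth = 2 * max(1, (hi - lo).bit_length())
    if hi - lo < SMALL:
        insertion_sort(a, lo, hi)
    elif depth == 0:
        heap_sort(a, lo, hi)  # recursion went too deep: switch to heapsort
    else:
        # Lomuto partition around the last element
        pivot, i = a[hi - 1], lo
        for j in range(lo, hi - 1):
            if a[j] <= pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi - 1] = a[hi - 1], a[i]
        introsort(a, lo, i, depth - 1)
        introsort(a, i + 1, hi, depth - 1)
```

The depth limit guarantees O(n log n) worst-case behavior even on adversarial input, while keeping quicksort's speed on typical data.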

Steps

  1. Let's start with a clean system.

    hadoop namenode -format

  2. Once the filesystem has been formatted, we can start the nodes.

    ./start-all.sh

  3. Let's create a new folder in HDFS.

    hadoop dfs -mkdir input

  4. Then run your map code and reduce code.

When everything goes well, I can go to http://localhost:50030 and see how the jobs are performing.


Everything went better than expected

Now let's have an error in the reduce code (yes, I made the reducer fail on purpose to take the screenshot). In that case, the graph looks like this:


Houston, we have a problem

You can do anything with Hadoop. Anything at all. The only limit is yourself.

