Faruk Akgul

Finding The Most Important Words In Documents

July 22, 2012

It's 04:23 AM and I'm sitting in a hotel room without a real Internet connection, so I've decided to write something on the semantic web while listening to Françoise Hardy. Anyway, let's say you want to detect which words of a given article are important.

Why would you want to detect the important words?

The answer is quite easy. Let's say you want to build a set of keywords that describes a document. There are more complex ways of finding the most important words, but the simplest approach is calculating the Tf.Idf scores of a document.

What is Tf.Idf?

  • Tf: Term Frequency
  • Idf: Inverse Document Frequency

Tf (Term Frequency)

Let's start with finding the Tf of a given document. The formula is simple:

Tf(d, t)
  • d: Document.
  • t: Term; Tf(d, t) is the number of times term t appears in document d.

Term Frequency

Here's an example of how to construct a Tf table. The sentences are from Friedrich Nietzsche's Ecce Homo.

  • document1: how one becomes what one is.
  • document2: the remotest idea of what one is.
  • document3: one becomes what one is.

| Term `t` | Document `d` | Term Frequency `Tf(d, t)` |
|----------|--------------|---------------------------|
| one      | d1           | 2                         |
| one      | d2           | 1                         |
| one      | d3           | 2                         |
| becomes  | d1           | 1                         |
| becomes  | d2           | 0                         |
| becomes  | d3           | 1                         |

The problem with term frequency is that the most frequent words don't tell much about the document; in a real-world example the most common words you're going to get will be words like and, the, of, a, to...

The most common words are not the most important words of a document.

In a real-world situation you'd most likely want to split the words with a smarter regex and filter them against a stop-words list. Anyway, if we want to calculate Tf scores, the simplest approach would be:

import java.util.*;
import java.util.regex.*;

class Tf {
    static final List<String> STOP_WORDS = new ArrayList<String>();

    static {
        STOP_WORDS.add("a");
        STOP_WORDS.add("an");
        STOP_WORDS.add("the");
    }

    // Lower-cases the document and keeps words of two or more letters,
    // skipping anything in the stop-words list.
    static List<String> buildWords(String document) {
        Matcher matcher = Pattern.compile("[a-z]{2,}").matcher(document.toLowerCase());
        List<String> list = new ArrayList<String>();
        while (matcher.find()) {
            String word = matcher.group();
            if (!STOP_WORDS.contains(word)) list.add(word);
        }
        return list;
    }

    // Tf(d, t): how many times each term appears in the document.
    static Map<String, Integer> tf(String document) {
        Map<String, Integer> map = new HashMap<String, Integer>();
        for (String word : buildWords(document)) {
            Integer counter = map.get(word);
            map.put(word, counter == null ? 1 : counter + 1);
        }
        return map;
    }
}
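
As a quick sanity check, here's a tiny main method (a sketch; drop it into the Tf class above) showing what tf returns for document1:

public static void main(String[] args) {
    // Prints something like {becomes=1, what=1, one=2, how=1, is=1};
    // HashMap iteration order is unspecified.
    System.out.println(tf("how one becomes what one is."));
}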

Df (Document Frequency)

The formula is:

Df(c, t)
  • c: Corpus, the collection of all documents in the dataset.
  • t: Term; Df(c, t) is the number of documents in c that contain term t.

Inverse Document Frequency Formula:

1 / Df(c, t)

Document Frequency

Document frequency is the total number of documents that term t appears in.

  • document1: how one becomes what one is.
  • document2: the remotest idea of what one is.
  • document3: one becomes what one is.

| Term `t` | Document Frequency `Df(c, t)` | Inverse Document Frequency `1/Df(c, t)` |
|----------|-------------------------------|-----------------------------------------|
| one      | 3                             | 0.33                                    |
| becomes  | 2                             | 0.5                                     |
| what     | 3                             | 0.33                                    |
| remotest | 1                             | 1                                       |

When you calculate the Idf values of words you start getting somewhere, but it's still not good enough.

// Df(c, t): the number of documents each term appears in.
static Map<String, Integer> df(List<String> documents) {
    Map<String, Integer> map = new HashMap<String, Integer>();
    for (String document : documents) {
        // A set, so each term is counted at most once per document.
        Set<String> set = new HashSet<String>(buildWords(document));
        for (String term : set) {
            Integer counter = map.get(term);
            map.put(term, counter == null ? 1 : counter + 1);
        }
    }
    return map;
}

// Idf: 1 / Df(c, t) for every term in the corpus.
static Map<String, Double> idf(Map<String, Integer> map) {
    Map<String, Double> map2 = new HashMap<String, Double>();
    for (String term : map.keySet()) {
        map2.put(term, 1.0 / map.get(term));
    }
    return map2;
}
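
For the three Nietzsche sentences above, a quick usage sketch (the values match the Idf table):

List<String> documents = Arrays.asList(
    "how one becomes what one is.",
    "the remotest idea of what one is.",
    "one becomes what one is.");
Map<String, Integer> dfs = df(documents);  // e.g. dfs.get("one") == 3
Map<String, Double> idfs = idf(dfs);       // e.g. idfs.get("remotest") == 1.0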

Please note that this code shouldn't be used in the real world; I'm just trying to keep things easy and simple.

Tf.Idf

Now we know how to calculate Tf and Idf; as you can probably guess, Tf.Idf is simply the two combined.

The formula is:

Tf.Idf(c, d, t) = Tf(d, t) / Df(c, t)

| Term `t` | Document `d` | Term Frequency `Tf(d, t)` | Document Frequency `Df(c, t)` | `Tf(d, t) / Df(c, t)` |
|----------|--------------|---------------------------|-------------------------------|-----------------------|
| one      | d1           | 2                         | 3                             | 0.66                  |
| one      | d2           | 1                         | 3                             | 0.33                  |
| one      | d3           | 2                         | 3                             | 0.66                  |
| becomes  | d1           | 1                         | 2                             | 0.5                   |
| becomes  | d2           | 0                         | 2                             | 0.0                   |
| becomes  | d3           | 1                         | 2                             | 0.5                   |
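
In code, a minimal tfidf sketch might look like this (hypothetical; it builds on the tf and df methods above, and assumes document is one of documents):

// Tf.Idf(c, d, t) = Tf(d, t) / Df(c, t) for every term in one document.
static Map<String, Double> tfidf(List<String> documents, String document) {
    Map<String, Integer> tfs = tf(document);
    Map<String, Integer> dfs = df(documents);
    Map<String, Double> scores = new HashMap<String, Double>();
    for (String term : tfs.keySet()) {
        scores.put(term, (double) tfs.get(term) / dfs.get(term));
    }
    return scores;
}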

Le Problem

The problem with this simplest Tf.Idf approach is that it gives very high scores to very rare words, so we need to "tweak" the formula. The simplest way is to add a log.

The updated formula:

Tf.Idf(c, d, t) = Tf(d, t) × log(N / Df(c, t))
  • N: The total number of documents.
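
To see the effect on our N = 3 documents: one appears in all three, so its factor is log(3/3) = 0 and its score drops to zero, while remotest (Df = 1) keeps a factor of log(3/1) ≈ 1.1 (natural log). Words that appear everywhere are pushed all the way to zero instead of merely scoring low.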

Le Problem

We still have a problem with the tweaked formula: longer documents will get higher Tf.Idf scores than shorter ones. The solution is to divide the Tf.Idf score by the document length.

The updated formula:

Tf.Idf(c, d, t) = (Tf(d, t) / |d|) × log(N / Df(c, t))
  • |d|: Document length, the number of words in document d.
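
Folding both tweaks into the earlier sketch (still hypothetical; buildWords, tf, and df are the methods defined above):

// Final score: (Tf(d, t) / |d|) * log(N / Df(c, t)).
static Map<String, Double> tfidf(List<String> documents, String document) {
    Map<String, Integer> tfs = tf(document);
    Map<String, Integer> dfs = df(documents);
    double length = buildWords(document).size();  // |d|: words in this document
    int n = documents.size();                     // N: total number of documents
    Map<String, Double> scores = new HashMap<String, Double>();
    for (String term : tfs.keySet()) {
        scores.put(term, tfs.get(term) / length * Math.log((double) n / dfs.get(term)));
    }
    return scores;
}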

So, finally the most important words:

import collections

def most_important_words(documents, length=5):
    # Sort terms by Tf.Idf score, highest first, and keep the top `length`.
    mapp = tfidf(documents)
    mapp = collections.OrderedDict(
        sorted(mapp.items(), key=lambda x: x[1], reverse=True))
    return list(mapp.items())[:length]

I've used stop words to get more accurate results; the ones I've used are from the google.appengine.ext.search package.

All words are important, but some words are more important than others.
