oice.langdet uses a bigram approach, which is a form of n-gram analysis. In a bigram, the first word can be understood as the context that describes the second word: it qualifies the second word and provides additional insight into its meaning. Humans examine the context of words quite frequently to determine the meaning of written text. Words can have multiple meanings, called senses, and the context provided by the bigram offers a way to distinguish which sense is in use (most of the time, at least; literary puns and the like are the obvious exception). A stemming algorithm can use this context as additional information when determining its outcome (whether or not the current word is stemmed, which of several possible stems is appropriate, whether to use substitution or stripping, whether the word is a verb, a noun, or some other lexical category, etc.). Because the algorithm uses probabilities to determine the outcome, it is a stochastic algorithm. More information about n-grams can be found here.
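To make the bigram idea a bit more concrete, here is a minimal sketch of counting word bigrams and turning the counts into rough probabilities of the second word given the first. This is not how oice.langdet works internally (I haven't shown its internals here); the function names `word_bigrams` and `bigram_probabilities` and the sample sentence are made up for illustration.

```python
from collections import Counter, defaultdict

def word_bigrams(text):
    """Return the list of adjacent (first, second) word pairs in text."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def bigram_probabilities(text):
    """Estimate P(second | first) from raw bigram counts (no smoothing)."""
    pairs = word_bigrams(text)
    pair_counts = Counter(pairs)
    first_counts = Counter(first for first, _ in pairs)
    probs = defaultdict(dict)
    for (first, second), n in pair_counts.items():
        probs[first][second] = n / float(first_counts[first])
    return probs

# Hypothetical sample text: "bank" appears after two different context words.
sample = "the river bank flooded while the bank teller counted the money"
probs = bigram_probabilities(sample)
print(probs["bank"])
```

In the sample, "river bank" and "the bank" show how the preceding word acts as context for "bank", and the probabilities are what makes an approach like this stochastic rather than rule-based.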
Anyway, here's my little Python code to detect whether a piece of text is in English or not. You can easily modify it for Spanish or French, but the project I'm working on is about the English language.
import unicodedata

try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO

from oice.langdet import langdet
from oice.langdet import streams
from oice.langdet import languages

def isEnglish(s):
    # Python 2 code: coerce to unicode, then strip accents down to plain ASCII.
    s = unicode(s)
    s = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
    # Wrap the text in a stream and let the detector guess the language.
    text = streams.Stream(StringIO(s))
    lang = langdet.LanguageDetector.detect(text)
    return lang == languages.english