Project DELILAH Explanations Page
Introduction

The DELILAH Project is a program that traces semantic relations between words in all languages. To do that, it crawls the internet looking for statistical data pairs of words, and all the data is then displayed in the form of a graph. It was initially written in pure Java but, due to poor process management, we migrated it to an interesting cocktail: PHP and SQL on the server part and C# on the feeder part.

In the image you can see that Schwarzenegger is strongly linked to Arnold, but is only weakly linked to California. That means that "a lot" of the pages that deal with Schwarzenegger also deal with Arnold and California, but only "a few" of the pages about California are actually about Schwarzenegger.

It was initially written by Alban Galland and me under the supervision of Professor Miller in the undergraduate course taught by Professor Jean Jacques Levy at the Ecole Polytechnique.

How does DELILAH work?

DELILAH is composed of two parts:

What is this so-called statistical analysis?

Let's say Google gives you ten URLs referring to the topic shark. DELILAH will retrieve each page and strip all the HTML from those pages. Afterwards, it will list all the words that are physically near each occurrence of shark. The more often a given word appears, the heavier its weight will be.
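To make the weighting step concrete, here is a minimal Python sketch of it. It is not the actual DELILAH code (which is PHP, SQL and C#), and several things in it are assumptions of mine: the function names strip_html and neighbour_weights, the idea that "physically near" means "within a fixed number of words" (the window parameter, 5 by default), and the choice to simply count each nearby occurrence as one unit of weight.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def strip_html(html):
    """Return the visible text of an HTML page (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def neighbour_weights(pages, subject, window=5):
    """Count the words found within `window` words of each occurrence of `subject`.

    `pages` is a list of HTML strings; the window size is an assumption,
    since the original text only says "physically near".
    """
    subject = subject.lower()
    weights = Counter()
    for html in pages:
        # Runs of letters only (Unicode-aware), lowercased.
        words = re.findall(r"[^\W\d_]+", strip_html(html).lower())
        for i, word in enumerate(words):
            if word != subject:
                continue
            before = words[max(0, i - window):i]
            after = words[i + 1:i + window + 1]
            for neighbour in before + after:
                if neighbour != subject:
                    weights[neighbour] += 1
    return weights
```

For example, calling neighbour_weights(pages, "shark", window=2) on a handful of downloaded pages returns a Counter whose most_common() entries are the heaviest candidate links for shark. At this stage the result still contains words like "the" and "a".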

Afterwards, you must get rid of the words that always appear (e.g. the, a, one, is, they...). Those words are easy to blacklist precisely because they always appear: if a word is linked to more than 10 percent of the subjects, it is treated as a junction word of the language.
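The 10 percent rule can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions: the links are held in a plain dictionary mapping each subject to the words linked to it, and the names find_junction_words and remove_junction_words are hypothetical, not taken from DELILAH.

```python
from collections import Counter


def find_junction_words(links, threshold=0.10):
    """Return the words linked to more than `threshold` of all subjects.

    `links` maps each subject to an iterable of the words linked to it
    (an assumed data layout, chosen for the example).
    """
    if not links:
        return set()
    subjects_per_word = Counter()
    for linked_words in links.values():
        # set() so that a word counts once per subject, whatever its weight.
        subjects_per_word.update(set(linked_words))
    limit = threshold * len(links)
    return {word for word, n in subjects_per_word.items() if n > limit}


def remove_junction_words(links, junction_words):
    """Return a copy of `links` without the blacklisted words."""
    return {
        subject: [w for w in words if w not in junction_words]
        for subject, words in links.items()
    }
```

Note that the rule only makes sense once enough subjects have been analysed: with fewer than ten subjects, every linked word exceeds 10 percent and would be blacklisted.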

More information

As I already said, it was initially a course project, so you can take a look at the report if you want. Unfortunately it is in French and (worse!) it deals with the Java version of the program. That's why the old Java version is still available here, to help understand the basic concepts.

Can I get it?

Of course, we'll be happy to give it away for free to anybody. It is not totally finalized yet, so I prefer not to upload an ever-changing program: please send me an email.