between thematically
unrelated pages, which have been set for the sole purpose of boosting
the PageRank of one page. Indeed, it is questionable whether such
weighting can be realized on the basis of content analyses.
The fundamentals of content analyses
are based on Gerard Salton's work in the 1960s and 1970s. In
his vector space model of information retrieval, documents are
modeled as vectors which are built upon terms and their weighting
within the document. These
term vectors allow comparisons between the content of documents by,
for instance, calculating the cosine measure (the normalized inner
product) of the vectors. In its basic form, the vector space model has
some weaknesses. For instance, the assumption that whether and to what
extent the same words appear in two documents indicates their
similarity is often criticized. However, numerous enhancements have
been developed that address most of the problems of the vector space model.
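The cosine measure mentioned above can be illustrated with a minimal sketch. It assumes raw term counts as weights and whitespace tokenization; the helper names `term_vector` and `cosine` are hypothetical and are not taken from Salton's work.

```python
import math
from collections import Counter

def term_vector(text):
    """Build a raw term-frequency vector (assumption: whitespace tokens)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    # Inner product over the terms both vectors share.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

doc_a = term_vector("google ranks pages by links")
doc_b = term_vector("google ranks pages by content")
doc_c = term_vector("salmon swim upstream in spring")

print(cosine(doc_a, doc_b))  # four of five terms shared: high similarity
print(cosine(doc_a, doc_c))  # no shared terms: similarity is zero
```

Documents that share no terms always score zero here, which is exactly the weakness with synonyms and different languages that the basic model is criticized for.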
One person who has published extensively on work based on Salton's
vector space model is Krishna Bharat. This is interesting because
Bharat has since joined Google's staff and, particularly,
because he is deemed to be the developer of "Google News"
(news.google.com). Google News is a service that crawls news websites,
evaluates articles and then presents them categorized and grouped
by subject on the Google News website. According to Google,
all these procedures are completely automated. Other
criteria, such as the time when an article was published,
are certainly taken into account as well, but if there is no manual
intervention, the clustering of articles is most certainly only possible if the
contents of the articles are actually compared to each other. The
question is: How can this be realized?
In their publication on a term vector database, Raymie Stata, Krishna
Bharat and Farzin Maghoul describe how the contents of web pages
can be compared based on term vectors and, particularly, they describe
how some of the problems with the vector space model can be solved.
Firstly, not all terms in documents are suitable for content analysis.
Very frequent terms provide only little discrimination across vectors
and so the most frequent third of all terms is eliminated from
the database. Infrequent terms, on the other hand, do not provide
a good basis for measuring similarity. Such terms are, for example,
misspellings. They appear on only a few pages which are likely unrelated
in terms of their theme, but because they are so infrequent, the
term vectors of the pages appear to be closely related. Hence,
the least frequent third of all terms is also eliminated from the database.
Even if only one third of all terms is included in the term vectors,
this selection is still not very efficient. Stata, Bharat and Maghoul
perform another filtering step, so that each term vector is based on
a maximum of 50 terms. But these are not the 50 most frequent terms
on a page. They weight a term by dividing the number of times it
appears on a page by the number of times it appears on all pages,
and the 50 terms with the highest weight are included in the term
vector of a page. This selection actually allows a real differentiation
between the content of pages.
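The two filtering steps described above can be sketched as follows. This is only an illustration under stated assumptions: "third of all terms" is read as a third of the distinct terms ranked by total frequency, ties are broken alphabetically, and the function name `select_terms` is hypothetical. The original publication may differ in these details.

```python
from collections import Counter

def select_terms(pages, max_terms=50):
    """pages: dict of page id -> list of terms.
    Returns a dict of page id -> {term: weight}."""
    page_counts = {pid: Counter(terms) for pid, terms in pages.items()}
    total = Counter()
    for counts in page_counts.values():
        total.update(counts)

    # Rank distinct terms from most to least frequent
    # (assumption: ties are broken alphabetically).
    ranked = sorted(total, key=lambda t: (-total[t], t))
    third = len(ranked) // 3
    # Drop the most frequent and the least frequent third.
    kept = set(ranked[third:len(ranked) - third])

    vectors = {}
    for pid, counts in page_counts.items():
        # Weight = occurrences on this page / occurrences on all pages.
        weights = {t: c / total[t] for t, c in counts.items() if t in kept}
        top = sorted(weights, key=lambda t: (-weights[t], t))[:max_terms]
        vectors[pid] = {t: weights[t] for t in top}
    return vectors

pages = {
    "p1": ["the"] * 5 + ["web"] * 4 + ["pagerank"] * 3 + ["link", "pagernak"],
    "p2": ["the"] * 5 + ["web"] * 4 + ["link", "pagerank", "teh"],
}
vectors = select_terms(pages, max_terms=1)
print(vectors)
```

In this toy data set, "the" and "web" fall into the most frequent third and the misspellings "pagernak" and "teh" into the least frequent third, so only "pagerank" and "link" remain as candidates for the term vectors.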
The methods described above are standard for the vector space
model. If, for example, the inner product of two term vectors is
rather high, the contents of the corresponding pages tend to be similar.
This may allow content comparisons in many areas, but it is doubtful
whether it is a good basis for weighting links within the PageRank technique.
Most of all, synonyms and terms that describe similar things cannot
be identified. Indeed, there are algorithms for word stemming
which work well for the English language, but in other languages
word stemming is much more complicated. Different languages are
a general problem. Unless, for instance, brand names or loan words
are used, texts in different languages normally do not contain the
same terms. And if they do, these terms normally have a completely
different meaning, so that comparing content in different languages
is not possible. However, Stata, Bharat and Maghoul provide a
solution for these problems.
Stata, Bharat and Maghoul present
a concrete application for their Term Vector Database by classifying
pages thematically. Bharat has also published on this issue
together with Monika Henzinger, presently Google's Research
Director, and they called it "topic distillation".
Topic distillation is based on calculating so-called topic vectors.
Topic vectors are term vectors, but they do not only include
terms of one page but rather the terms of many pages which are
on the same topic. So,
in order to create topic vectors, they have to know a certain number
of web pages which are on several pre-defined topics. To achieve this,
they resort to web directories.
For their application, Stata, Bharat and Maghoul crawled about
30,000 links within each of the then 12 main categories of Yahoo
to create topic vectors which include about 10,000 terms each. Then,
in order to identify the topic of any other web page, they matched
the corresponding term vector with all the topic vectors which were
created from the Yahoo crawl. The topic of a web page was derived from
the topic vector which matched the term vector of the web page best.
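The classification step can be sketched roughly as follows, assuming that a topic vector is simply the sum of the term counts of all pages listed under a directory category and that "matching" means the cosine measure. Both are assumptions made for illustration; the function names and the tiny directory are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_topic_vectors(labeled_pages):
    """labeled_pages: iterable of (topic, list of terms) pairs,
    e.g. taken from a web directory. Sums term counts per topic."""
    topics = {}
    for topic, terms in labeled_pages:
        topics.setdefault(topic, Counter()).update(terms)
    return topics

def classify(page_terms, topic_vectors):
    """Return the topic whose vector matches the page's term vector best."""
    page_vector = Counter(page_terms)
    return max(topic_vectors,
               key=lambda t: cosine(page_vector, topic_vectors[t]))

# Hypothetical miniature directory with two categories.
directory = [
    ("Sports", "football match goal team".split()),
    ("Sports", "tennis match player".split()),
    ("Business", "stock market shares profit".split()),
    ("Business", "company profit market".split()),
]
topic_vectors = build_topic_vectors(directory)
print(classify("team wins match goal".split(), topic_vectors))
print(classify("shares market".split(), topic_vectors))
```

A real application would build the topic vectors from filtered and weighted term vectors rather than raw counts, as described in the preceding paragraphs.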
That such a classification of web pages works can again be observed
by means of Google News. Google News not only merges articles
into one news topic, but also arranges them into the categories World,
U.S., Business, Sci/Tech, Sports, Entertainment and Health. As long
as this categorization is not based on the structure of the website
where the articles come from (which is unlikely), the actual topic
of an article in fact has to be computed.
At the time he published on term vectors, Krishna Bharat did not
work on PageRank but rather on Kleinberg's algorithm, so that he
was more interested in filtering off-topic links than in weighting
links. But from classifying pages to weighting links based on content
comparisons, there is only a small step. Instead of matching the
term vectors of two pages, it is much more efficient to match the
topics of two pages. We can, for instance, create a "topic
affinity vector" for each page based on the degree of affinity
between the page's term vector and each of the topic vectors. The better
the topic affinity vectors of two pages match, the more likely
they are on the same topic and the higher a link between them
should be weighted.
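One possible reading of this idea is sketched below. It assumes that the affinity between a page and a topic is the cosine measure, and that two topic affinity vectors are again compared by the cosine measure to obtain a link weight between 0 and 1. Neither choice is specified in the text, and the names `affinity_vector` and `link_weight` as well as the sample topic vectors are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def affinity_vector(page_terms, topic_vectors):
    """Map each topic to the page's degree of affinity with that topic."""
    page_vector = Counter(page_terms)
    return {topic: cosine(page_vector, vector)
            for topic, vector in topic_vectors.items()}

def link_weight(source_terms, target_terms, topic_vectors):
    """Weight a link by how well the two pages' topic affinities match."""
    return cosine(affinity_vector(source_terms, topic_vectors),
                  affinity_vector(target_terms, topic_vectors))

# Hypothetical topic vectors, e.g. derived from a web directory.
topic_vectors = {
    "Sports": Counter({"match": 2, "goal": 1, "team": 1}),
    "Business": Counter({"market": 2, "profit": 1, "shares": 1}),
}
page_a = ["team", "match"]
page_b = ["goal", "match"]
page_c = ["shares", "profit"]

print(link_weight(page_a, page_b, topic_vectors))  # both pages lean towards Sports
print(link_weight(page_a, page_c, topic_vectors))  # the pages are on different topics
```

Note that `page_a` and `page_b` share only one term, yet their topic affinities coincide, so the link between them receives a high weight. Such a weight could then scale the PageRank passed along a link, although how exactly it would enter the PageRank calculation is left open here.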
Using topic vectors has one big advantage over comparing term vectors
directly: A topic vector can include terms in different languages
by being based on, for instance, the links on different national
Yahoo versions. Differing site structures of the national versions
can most certainly be adapted manually. Using the ODP may be even better
because the structure of the sub-categories of the "World"
category is based on the main ODP structure. In this way, measuring
topic similarities between pages in different languages can be realized,
so that a really useful weighting of links based on text analyses
appears to be possible.
10.
Theme-Based PageRank (continued)
This article reproduced with permission of eFactory.
© 2002 eFactory Internet-Agentur KG Online-Marketing - written
by Markus Sobek
PageRank and Google are trademarks of Google Inc., Mountain View, CA,
USA.
PageRank is protected by US Patent 6,285,999.