web pages. For example,
the content of all pages of one website may influence the
ranking of a single page of that website. On the other hand, it
is also conceivable that a page's ranking is based on the content
of the pages which link to it or to which it links.
Whether a theme-based ranking is implemented in the Google
search engine is a matter of controversy. Search engine optimization
forums and websites on this topic repeatedly
advise that inbound links from sites with a theme similar to
our own have a larger influence on PageRank than links from unrelated
sites. This hypothesis is discussed here. To that end, we first
take a look at two relatively new approaches for the integration
of themes into the PageRank technique: the "intelligent
surfer" by Matthew Richardson and Pedro Domingos, and
the Topic-Sensitive PageRank by Taher Haveliwala. Subsequently,
we take a look at the possibility of using content analyses
to compare the text of web pages, which can serve as a basis for weighting
links within the PageRank technique.
The "Intelligent Surfer" by Richardson and Domingos
Matthew Richardson and Pedro Domingos draw on the Random Surfer
Model to explain their approach to implementing
themes in the PageRank technique. Instead of a surfer who follows
links completely at random, they suggest a more intelligent surfer
who, on the one hand, only follows links which are related to an
original search query and, on the other hand, after "getting
bored" only jumps to a page which relates to the original query.
So, to Richardson and Domingos' "intelligent surfer",
only pages that contain the search term of the initial
query are relevant. But since the Random Surfer Model does nothing but illustrate
the PageRank technique, the question is how an "intelligent"
behaviour of the Random Surfer influences PageRank. The answer is
that for every term occurring on the web a separate PageRank calculation
has to be conducted, and each calculation is based solely on links
between pages which contain that term.
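The per-term calculation described above can be sketched in a few lines of Python. This is only a minimal illustration of the simplified description given here, not Richardson and Domingos' actual algorithm. The function name term_pagerank, the toy link graph, the damping factor of 0.85 and the even redistribution of rank from pages without outgoing links are all assumptions made for the example.

```python
def term_pagerank(links, page_terms, term, d=0.85, iterations=50):
    """PageRank computed only on the pages that contain `term`.

    links:      dict mapping a page to the list of pages it links to
    page_terms: dict mapping a page to the set of terms occurring on it
    d:          damping factor (0.85 is an assumed, commonly quoted value)
    """
    pages = sorted(p for p, terms in page_terms.items() if term in terms)
    if not pages:
        return {}
    members = set(pages)
    # Keep a link only if both its source and its target contain the term.
    out = {p: [q for q in links.get(p, []) if q in members and q != p]
           for p in pages}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption: rank of pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not out[p])
        # The "bored" surfer jumps only to pages that contain the term.
        new = {p: (1.0 - d) / n + d * dangling / n for p in pages}
        for p in pages:
            for q in out[p]:
                new[q] += d * rank[p] / len(out[p])
        rank = new
    return rank


# Hypothetical toy web: page "d" does not contain the term "pagerank".
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
page_terms = {
    "a": {"pagerank", "google"},
    "b": {"pagerank"},
    "c": {"pagerank"},
    "d": {"cooking"},
}

ranks = term_pagerank(links, page_terms, "pagerank")
for page, value in sorted(ranks.items()):
    print(page, round(value, 3))
```

In this toy example, page "d" links to page "c" but does not contain the term, so neither the page nor its link takes part in the calculation for "pagerank". This is exactly the effect that makes rarely used terms problematic.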
Computing PageRank this way causes some problems. They appear especially
for search terms that occur only rarely on the web. To
be included in the PageRank calculation for a specific search term,
that term has to appear not only on a page itself, but also on
the pages that link to it. So, the search results would often be
based on small subsets of the web and may omit relevant sites. In
addition, with such small subsets of the web, the algorithm is
more vulnerable to spam in the form of numerous automatically generated pages.
Additionally, there are serious problems regarding scalability.
Richardson and Domingos estimate that the memory and computing time requirements
for several hundred thousand terms are 100-200 times higher than those of the original
PageRank calculation. Given the large number of small subsets
of the web, these numbers appear to be realistic.
The higher memory requirements should not be much of a problem
because, as Richardson and Domingos correctly state, the term-specific
PageRank values constitute only a fraction of the data volume of
Google's inverted index. However, the computing time requirements
are indeed a large problem. If we assume just five hours for a conventional
PageRank calculation, then the calculation would take about three weeks under
Richardson and Domingos' model even at a factor of 100, which makes it unsuitable for practical
use.
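The computing time estimate can be checked with a short calculation. The five hours and the factor of 100-200 are taken from the text; the function name weeks_needed is introduced only for this illustration.

```python
HOURS_PER_WEEK = 24 * 7


def weeks_needed(base_hours, factor):
    """Duration in weeks if one conventional run of `base_hours` is
    slowed down by `factor`."""
    return base_hours * factor / HOURS_PER_WEEK


# Five hours per conventional run (assumed in the text), factor 100 to 200.
for factor in (100, 200):
    print(f"factor {factor}: {weeks_needed(5, factor):.1f} weeks")
```

At a factor of 100, the 500 hours correspond to roughly three weeks. At a factor of 200, the duration doubles to roughly six weeks.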
This article reproduced with permission of eFactory.
© 2002 eFactory Internet-Agentur KG Online-Marketing - written
by Markus Sobek
PageRank and Google are trademarks of Google Inc., Mountain View, CA,
USA.
PageRank is protected by US Patent 6,285,999.