PageRank to rank pages
the higher, the more high-ranking pages link to them. But at the
time of their scientific work on PageRank, Page and Brin had already
recognized that their algorithm was vulnerable to artificial inflation
of PageRank.
PageRank can be influenced artificially by webmasters who generate
a multitude of web pages whose links distribute PageRank in such
a way that single pages within that system receive special
importance. Those pages can have a high PageRank without being linked
to from other pages with high PageRank. This not only undermines
the concept of PageRank but also spams the search engine's index
with countless web pages which were created solely to influence
PageRank.
In his patent specification for PageRank, Lawrence Page presents
the evaluation of links by the distance between pages as a means
to avoid the artificial inflation of PageRank, because the bigger
the distance between two pages, the less likely it is that one
webmaster controls both. A criterion for the distance between two
pages may be whether they are on the same domain or not. In this
way, internal links would be weighted less than external links.
Ultimately, any general measure of the distance between pages can
be used to determine such a weighting. This includes whether pages
are on the same server and also the geographical distance between
servers.
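The same-domain criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name link_distance_weight and the weights 0.5 and 1.0 are hypothetical and do not come from the patent, and the comparison uses the full host name, so www.example.com and example.com count as different hosts.

```python
from urllib.parse import urlsplit


def link_distance_weight(source_url, target_url,
                         internal_weight=0.5, external_weight=1.0):
    """Return a weight for a link depending on whether both pages share a host.

    The weights are illustrative assumptions, not values from the patent.
    """
    source_host = (urlsplit(source_url).hostname or "").lower()
    target_host = (urlsplit(target_url).hostname or "").lower()
    # Same host: treat the link as internal and weight it less.
    if source_host == target_host:
        return internal_weight
    return external_weight


internal = link_distance_weight("http://example.com/a", "http://example.com/b")
external = link_distance_weight("http://example.com/a", "http://example.org/")
```

A measure based on servers or geography would replace the host comparison but leave the rest of the sketch unchanged.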
As another indicator of the importance of a document, Lawrence
Page mentions how up to date the documents are which link to it.
The argument is that the information on a page is less likely to
be outdated the more recently modified pages link to it. In contrast,
the original PageRank concept, just like any method of measuring
link popularity, favours older documents, which gained their inbound
links in the course of their existence and are more likely to have
been modified less recently than new documents. Basically, recently
modified documents may be given a higher evaluation by weighting
the factor (1-d). In this way, both those recently modified documents
and the pages they link to receive a higher PageRank. But whether
a page has been modified recently is not necessarily an indicator
of the importance of the information presented on it. So, as suggested
by Lawrence Page, it is advisable not to favour recently modified
pages themselves but only their outbound links.
Finally, Page mentions the web location of a page as an indicator
of the importance of its outbound links. As an example of an important
web location he names the root page of a domain, but ultimately
Google could influence PageRank by entirely arbitrary criteria.
To incorporate the evaluation of the linking page into PageRank,
the evaluation factor of the modified algorithm must be composed
of several individual factors. For a link that points from page Ti
to page A, it can be given as follows:
L(Ti,A) = K(Ti,A) × K1(Ti) × ... × Km(Ti)
where K(Ti,A) is the weighting, presented above, of a single link
within a page by its visibility or position. In addition, page Ti
is evaluated by m criteria, which are represented by the factors
Kj(Ti).
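The product above can be written as a small Python function. This is a minimal sketch: the name link_evaluation and the sample factor values are hypothetical and chosen only for illustration.

```python
from math import prod


def link_evaluation(link_weight, page_factors):
    """Compute L(Ti,A) = K(Ti,A) x K1(Ti) x ... x Km(Ti).

    link_weight  -- K(Ti,A), the weighting of the single link on page Ti
    page_factors -- iterable of the m page-level factors Kj(Ti)
    """
    # prod() of an empty sequence is 1, so a page without extra
    # criteria keeps the plain link weighting.
    return link_weight * prod(page_factors)


combined = link_evaluation(0.5, [2.0, 3.0])
```

Because the factors multiply, a single factor of 0 removes a link's contribution entirely, which is worth keeping in mind when choosing criteria.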
To implement the evaluation of the linking pages, not only the
algorithm but also the procedure of the PageRank calculation has
to be modified. This is illustrated by the following example.
We take a look at a web consisting of three pages A, B and C,
where page A links to the pages B and C, page B links to page C
and page C links to page A. The outbound links of one page are
evaluated equally, so there is no weighting by visibility or position.
But now, the pages are evaluated by one criterion: an inbound link
from page C shall be considered four times as important as an
inbound link from one of the other pages. After weighting by the
number of pages, we get the following evaluation factors:
K(A) = 0.5
K(B) = 0.5
K(C) = 2
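One reading of "weighting by the number of pages" is that the raw importance values 1, 1 and 4 are rescaled so that they add up to the number of pages. The sketch below follows that assumption; the function name normalize_page_factors is hypothetical.

```python
from fractions import Fraction


def normalize_page_factors(raw_weights):
    """Rescale raw page weights so that they sum to the number of pages."""
    total = sum(raw_weights.values())
    page_count = len(raw_weights)
    return {page: Fraction(weight) * page_count / total
            for page, weight in raw_weights.items()}


# A link from C counts four times as much as a link from A or B.
factors = normalize_page_factors({"A": 1, "B": 1, "C": 4})
```

Under this assumption the result matches the factors 0.5, 0.5 and 2 listed above.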
At a damping factor d of 0.5, the equations for the computation
of the PageRank values are given by
PR(A) = 0.5 + 0.5 × 2 PR(C)
PR(B) = 0.5 + 0.5 × 0.5 × 0.5 PR(A)
PR(C) = 0.5 + 0.5 (0.5 PR(B) + 0.5 × 0.5 PR(A))
Solving the equations gives us the following PageRank values:
PR(A) = 4/3
PR(B) = 2/3
PR(C) = 5/6
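These values can be checked with a short iterative calculation. The following Python sketch assumes that every outbound link of a page receives an equal share and that the page factor K multiplies the PageRank passed on by that page, as in the equations above; the function name weighted_pagerank is hypothetical.

```python
def weighted_pagerank(links, factors, d=0.5, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(K(T) * PR(T) / C(T)) over linking pages T."""
    pages = list(links)
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {page: 1 - d for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # a page without outbound links passes nothing on
            share = d * factors[source] * pr[source] / len(targets)
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
factors = {"A": 0.5, "B": 0.5, "C": 2.0}
ranks = weighted_pagerank(links, factors)
```

After 100 iterations the values agree with 4/3, 2/3 and 5/6 to well within floating-point precision, and their sum is 17/6 rather than 3.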
With these modifications of the PageRank algorithm, the accumulated
PageRank of all pages no longer equals the number of pages: in the
example, the values sum to 17/6 rather than 3. The reason is that
weighting the page evaluation by the number of pages was not
appropriate. To determine the proper weighting, the web's linking
structure would have to be known in advance, which is not possible
for the actual WWW. Therefore, the PageRank calculated with an
evaluation of linking pages has to be normalized to avoid unfounded
effects on Google's general ranking of pages. Within the iterative
calculation, normalization would have to take place after each
iteration to minimize unintended distortions.
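Normalization after each iteration could look like the following Python sketch. It rescales the values at the end of every pass so that their sum equals the number of pages. The function name and the choice of a simple proportional rescaling are assumptions made for illustration; the article does not specify how the normalization would be carried out.

```python
def normalized_weighted_pagerank(links, factors, d=0.5, iterations=100):
    """Weighted PageRank with a proportional rescaling after each iteration."""
    pages = list(links)
    page_count = len(pages)
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {page: 1 - d for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # a page without outbound links passes nothing on
            share = d * factors[source] * pr[source] / len(targets)
            for target in targets:
                new_pr[target] += share
        # Rescale so that the accumulated PageRank equals the number of pages.
        scale = page_count / sum(new_pr.values())
        pr = {page: value * scale for page, value in new_pr.items()}
    return pr


links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
factors = {"A": 0.5, "B": 0.5, "C": 2.0}
ranks = normalized_weighted_pagerank(links, factors)
```

For the three-page example the order of the pages is unchanged, but the individual values differ slightly from a single rescaling of the final result, because the correction feeds back into every later iteration.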
In a small web, the evaluation of pages often causes severe
distortions. In the actual WWW, these distortions should normally
even out because of the large number of pages. Nevertheless, it is
to be expected that evaluating the distance between pages will
distort PageRank, since pages with many inbound links surely tend
to be linked to from different geographical regions. But such effects
can be anticipated from the experience of previous calculation
periods, so that only a marginal normalisation would be necessary.
In either case, implementing additional factors in PageRank is
possible. However, the computation of PageRank values would take
more time.
This article reproduced with permission of eFactory.
© 2002 eFactory Internet-Agentur KG Online-Marketing - written
by Markus Sobek
PageRank and Google are trademarks of Google Inc., Mountain View, CA,
USA.
PageRank is protected by US Patent 6,285,999.