Internet Marketing & Website Promotion Strategy

10. Theme-based PageRank (continued)

The Weighting of Links Based on Content Analyses

The previous page showed that single links can be weighted within the PageRank technique. The idea behind weighting links based on content analyses is to avoid the corruption of PageRank. By weighting links this way, it is theoretically possible to diminish the influence of links between thematically unrelated pages which have been set for the sole purpose of boosting the PageRank of one page. It is questionable, however, whether such weighting based on content analyses can actually be realized.

The fundamentals of content analyses are based on Gerard Salton's work in the 1960s and 1970s. In his vector space model of information retrieval, documents are modeled as vectors which are built upon terms and their weighting within the document. These term vectors allow comparisons between the content of documents by, for instance, calculating the cosine measure (the normalized inner product) of the vectors. In its basic form, the vector space model has some weaknesses. For instance, the assumption that the extent to which the same words appear in two documents indicates their similarity is often criticized. However, numerous enhancements have been developed that solve most of the problems of the vector space model.
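The cosine measure mentioned above can be illustrated with a minimal sketch. It assumes raw term counts as weights and whitespace tokenization; the function names `term_vector` and `cosine_similarity` are illustrative and not taken from Salton's work.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a sparse term vector (term -> raw count) from whitespace-split text."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Inner product of two sparse vectors divided by the product of their lengths."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = term_vector("google ranks pages by links")
doc2 = term_vector("google ranks pages by content")
doc3 = term_vector("fresh bread and butter")

print(round(cosine_similarity(doc1, doc2), 2))  # four of five terms shared
print(cosine_similarity(doc1, doc3))            # no shared terms
```

Two documents that share most of their terms score close to 1, while documents without any common term score 0, which is exactly the weakness discussed above when synonyms or different languages are involved.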

One person who has published extensively on work based on Salton's vector space model is Krishna Bharat. This is interesting because Bharat has since joined Google's staff and, particularly, because he is deemed to be the developer of "Google News" (news.google.com). Google News is a service that crawls news websites, evaluates articles and then presents them categorized and grouped by subject on the Google News website. According to Google, all these procedures are completely automated. Other criteria, such as the time when an article is published, are certainly taken into account, but if there is no manual intervention, clustering articles is most certainly only possible if the contents of the articles are actually compared to each other. The question is: How can this be realized?

In their publication on a term vector database, Raymie Stata, Krishna Bharat and Farzin Maghoul describe how the contents of web pages can be compared based on term vectors and, particularly, they describe how some of the problems with the vector space model can be solved. Firstly, not all terms in documents are suitable for content analysis. Very frequent terms provide only little discrimination across vectors, so the most frequent third of all terms is eliminated from the database. Infrequent terms, on the other hand, do not provide a good basis for measuring similarity. Such terms are, for example, misspellings. They appear on only a few pages which are likely unrelated in terms of their theme, but because they are so infrequent, the term vectors of the pages appear to be closely related. Hence, the least frequent third of all terms is also eliminated from the database.
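The frequency filter described above might look like the following sketch. It assumes that "frequency" means the number of pages a term appears on and that ties are broken alphabetically; both choices, as well as the function name `middle_third_vocabulary`, are illustrative and not confirmed by the publication.

```python
from collections import Counter


def middle_third_vocabulary(pages):
    """Drop the most frequent and the least frequent third of all terms.

    pages: list of token lists, one per page.
    Returns the set of terms that survive the filter.
    """
    page_frequency = Counter()
    for tokens in pages:
        page_frequency.update(set(tokens))  # count each term once per page
    # Rank from most to least frequent; alphabetical order breaks ties.
    ranked = sorted(page_frequency, key=lambda t: (-page_frequency[t], t))
    third = len(ranked) // 3
    return set(ranked[third:len(ranked) - third])


# Toy collection: "the" and "of" are on every page,
# "teh" and "vectro" are misspellings that occur once.
pages = [
    ["the", "of", "pagerank", "link"],
    ["the", "of", "pagerank", "vectro"],
    ["the", "of", "link", "teh"],
]
print(sorted(middle_third_vocabulary(pages)))  # ['link', 'pagerank']
```

In this toy collection, the very common words and the misspellings are both removed, leaving only the terms that actually discriminate between pages.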

Even if only one third of all terms is included in the term vectors, this selection is still not very efficient. Stata, Bharat and Maghoul perform another filtering step, so that each term vector is based on a maximum of 50 terms. But these are not the 50 most frequent terms on a page. They weight a term by dividing the number of times it appears on a page by the number of times it appears on all pages, and the 50 terms with the highest weight are included in the term vector of a page. This selection actually allows a real differentiation between the content of pages.
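The weighting and selection step can be sketched as follows, using the ratio given in the text (occurrences on the page divided by occurrences on all pages). The function name `top_weighted_terms`, the alphabetical tie-breaking and the toy data are assumptions made for illustration.

```python
from collections import Counter


def top_weighted_terms(page_tokens, collection_counts, max_terms=50):
    """Return the max_terms terms of a page with the highest weight.

    weight(term) = occurrences on this page / occurrences on all pages
    """
    page_counts = Counter(page_tokens)
    weights = {
        term: count / collection_counts[term]
        for term, count in page_counts.items()
        if collection_counts.get(term, 0) > 0  # skip terms unknown to the collection
    }
    # Highest weight first; alphabetical order breaks ties.
    best = sorted(weights, key=lambda t: (-weights[t], t))[:max_terms]
    return {term: weights[term] for term in best}


pages = [
    ["pagerank", "pagerank", "link", "web"],
    ["web", "link", "web", "recipe"],
    ["web", "recipe", "recipe", "bread"],
]
collection_counts = Counter(token for page in pages for token in page)

# With max_terms=2, the first page keeps "pagerank" (2/2) and "link" (1/2),
# while the ubiquitous "web" (1/4) is dropped.
print(top_weighted_terms(pages[0], collection_counts, max_terms=2))
```

A term that occurs almost exclusively on one page receives a weight close to 1, whereas a term spread across the whole collection receives a low weight even if it is frequent on the page.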

The methods described above are standard for the vector space model. If, for example, the inner product of two term vectors is rather high, the contents of the corresponding pages tend to be similar. This may allow content comparisons in many areas, but it is doubtful whether it is a good basis for weighting links within the PageRank technique. Most of all, synonyms and terms that describe similar things cannot be identified. There are algorithms for word stemming which work well for the English language, but in other languages word stemming is much more complicated. Different languages are a general problem. Unless, for instance, brand names or loan words are used, texts in different languages normally do not contain the same terms. And if they do, these terms normally have a completely different meaning, so that comparing content in different languages is not possible. However, Stata, Bharat and Maghoul provide a way to resolve these problems.

Stata, Bharat and Maghoul present a concrete application for their term vector database by classifying pages thematically. Bharat has also published on this issue together with Monika Henzinger, presently Google's Research Director, and they called it "topic distillation". Topic distillation is based on calculating so-called topic vectors. Topic vectors are term vectors, but they do not only include the terms of one page but rather the terms of many pages which are on the same topic. So, in order to create topic vectors, they have to know a certain number of web pages which are on several pre-defined topics. To achieve this, they resort to web directories.

For their application, Stata, Bharat and Maghoul crawled about 30,000 links within each of the then 12 main categories of Yahoo to create topic vectors which include about 10,000 terms each. Then, in order to identify the topic of any other web page, they matched the corresponding term vector against all the topic vectors which were created from the Yahoo crawl. The topic of a web page was derived from the topic vector which best matched the term vector of the web page. That such a classification of web pages works can again be observed by means of Google News. Google News not only merges articles into one news topic, but also arranges them into the categories World, U.S., Business, Sci/Tech, Sports, Entertainment and Health. As long as this categorization is not based on the structure of the website where the articles come from (which is unlikely), the actual topic of an article has in fact to be computed.
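The classification step can be sketched on a toy scale. The sketch assumes that a topic vector is simply the sum of the raw term counts of all pages listed under a category and that the best match is the one with the highest cosine measure; the tiny `directory` sample and the function names are hypothetical stand-ins for a real directory crawl.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine measure between two sparse vectors (term -> weight)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def build_topic_vectors(directory):
    """directory: category name -> list of page texts listed in that category."""
    topic_vectors = {}
    for topic, texts in directory.items():
        vector = Counter()
        for text in texts:
            vector.update(text.lower().split())  # pool the terms of all pages on the topic
        topic_vectors[topic] = vector
    return topic_vectors


def classify(page_text, topic_vectors):
    """Return the topic whose topic vector best matches the page's term vector."""
    page_vector = Counter(page_text.lower().split())
    return max(topic_vectors, key=lambda topic: cosine(page_vector, topic_vectors[topic]))


# Hypothetical miniature directory with two categories.
directory = {
    "Sports": ["football match score", "tennis match final"],
    "Business": ["stock market shares", "market profit shares"],
}
topic_vectors = build_topic_vectors(directory)

print(classify("shares fall as market closes", topic_vectors))  # Business
print(classify("football final score", topic_vectors))          # Sports
```

With only a handful of terms per topic, this sketch would misclassify most real pages; the point of crawling tens of thousands of directory links is to make the topic vectors broad enough to match unseen pages.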

At the time he published on term vectors, Krishna Bharat did not work on PageRank but rather on Kleinberg's algorithm, so he was more interested in filtering off-topic links than in weighting links. But it is only a small step from classifying pages to weighting links based on content comparisons. Instead of matching the term vectors of two pages, it is much more efficient to match the topics of two pages. We can, for instance, create a "topic affinity vector" for each page based on the degree of affinity between the page's term vector and all the topic vectors. The better the topic affinity vectors of two pages match, the more likely they are to be on the same topic and the higher a link between them should be weighted.
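A minimal sketch of this idea follows. It assumes that the degree of affinity is the cosine measure between a page's term vector and each topic vector, and that the link weight is the cosine measure between the two resulting topic affinity vectors. These choices, the function names and the toy topic vectors are illustrative assumptions, not a documented method.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine measure between two sparse vectors (key -> weight)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def topic_affinity_vector(page_vector, topic_vectors):
    """Map each topic to the affinity between the page and that topic."""
    return {topic: cosine(page_vector, vector) for topic, vector in topic_vectors.items()}


def link_weight(source_vector, target_vector, topic_vectors):
    """Weight a link by how well the topic affinity vectors of both pages match."""
    return cosine(
        topic_affinity_vector(source_vector, topic_vectors),
        topic_affinity_vector(target_vector, topic_vectors),
    )


# Hypothetical topic vectors, e.g. derived from a directory crawl.
topic_vectors = {
    "sports": Counter({"match": 2, "score": 1, "team": 1}),
    "business": Counter({"market": 2, "shares": 1, "profit": 1}),
}
page_a = Counter("team wins match".split())
page_b = Counter("final score of the match".split())
page_c = Counter("shares and profit rise".split())

on_topic = link_weight(page_a, page_b, topic_vectors)   # both pages are about sports
off_topic = link_weight(page_a, page_c, topic_vectors)  # sports page linking to a business page
print(round(on_topic, 2), off_topic)
```

Note that `page_a` and `page_b` share only one term, yet their link receives nearly the full weight because both pages lean towards the same topic. Each page needs only one small affinity vector, which is why comparing topics is cheaper than comparing full term vectors for every link.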

Using topic vectors has one big advantage over comparing term vectors directly: A topic vector can include terms in different languages by being based on, for instance, the links on different national Yahoo versions. Differing site structures of the national versions can most certainly be mapped manually. Using the ODP may be even better because the structure of the sub-categories of the "World" category is based on the main ODP structure. In this way, measuring topic similarities between pages in different languages can be realized, so that a really useful weighting of links based on text analyses appears to be possible.
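The advantage of multilingual topic vectors can be sketched as follows. The sketch assumes that the category structures of two national directory versions have already been aligned; the `listings` sample, the language codes and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine measure between two sparse vectors (term -> weight)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def multilingual_topic_vectors(listings):
    """Pool the terms of all language versions of a category into one topic vector."""
    vectors = {}
    for topic, texts_by_language in listings.items():
        vector = Counter()
        for texts in texts_by_language.values():
            for text in texts:
                vector.update(text.lower().split())
        vectors[topic] = vector
    return vectors


def best_topic(text, topic_vectors):
    """Return the topic whose vector best matches the text's term vector."""
    page_vector = Counter(text.lower().split())
    return max(topic_vectors, key=lambda topic: cosine(page_vector, topic_vectors[topic]))


# Hypothetical listings of the same categories in an English and a German directory.
listings = {
    "sports": {"en": ["football match score"], "de": ["fussball spiel ergebnis"]},
    "business": {"en": ["stock market shares"], "de": ["aktien markt kurse"]},
}
topic_vectors = multilingual_topic_vectors(listings)

english_page = "football score"
german_page = "fussball ergebnis"

# Compared directly, the two pages share no term at all ...
print(cosine(Counter(english_page.split()), Counter(german_page.split())))
# ... but both are assigned to the same topic.
print(best_topic(english_page, topic_vectors), best_topic(german_page, topic_vectors))
```

The two pages cannot be related through their term vectors, but they can be related through the shared topic, which is what makes weighting links across languages conceivable.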


This article reproduced with permission of eFactory.
© 2002 eFactory Internet-Agentur KG Online-Marketing - written by Markus Sobek
PageRank and Google are trademarks of Google Inc., Mountain View, CA, USA.
PageRank is protected by US Patent 6,285,999.

 
 


© 2004 SubiaSoft. Internet Marketing Strategy.