Utilizing RSS feeds for crawling the Web


We present “advaRSS”, a crawling mechanism built to support peRSSonal, a system for creating personalized RSS feeds. In contrast to common crawling mechanisms, our system focuses on fetching the latest news from major and minor portals worldwide by utilizing their communication channels. What distinguishes “advaRSS” from a conventional crawler is that news items are published at arbitrary times throughout the day, so the freshness of the offline collection must be measured in minutes. The system therefore has to be updated with news items as soon as they appear. To achieve this we utilize the communication channels that exist in the modern architecture of the WWW, and more specifically in almost every modern news portal. Since RSS feeds are offered by virtually every major and minor portal, we can keep our crawler up to date and retain a high freshness of the “offline content” maintained in our system’s database by applying algorithms that observe the temporal behaviour of each RSS feed.
Keywords: RSS crawling, web crawler, RSS analysis, offline content.
I. INTRODUCTION
The World Wide Web has grown from a few thousand
pages in 1993 to more than three billion pages at present.
A consequence of the Web's popularity as a global information system is that it is flooded with data and information, and hence finding useful information on the Web is often a tedious and frustrating experience. New tools and techniques are crucial for intelligently searching the Web for useful information. However, the mechanisms invented to make the Web seem less chaotic need that information first and spend a great amount of time collecting it. Web crawlers are an essential component of every search engine and are becoming increasingly important in data mining and other indexing applications. Web crawlers are programs that browse the Web in a methodical, automated manner. They are mainly used to create a copy of every visited page for future use by mechanisms that index the downloaded pages to provide fast search and further processing.
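To make this concrete, the following is a minimal sketch of such a generic crawler loop, written in Python for illustration: it pops URLs from a frontier, keeps an offline copy of each downloaded page for later indexing, and enqueues newly discovered links. This is not the advaRSS implementation; the breadth-first strategy, the page limit and the in-memory storage are simplifying assumptions.

import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags found in a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: download each page, keep an offline copy,
    and enqueue the links it contains."""
    frontier = deque(seed_urls)
    visited = set()
    offline_copies = {}                       # url -> raw HTML
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue                          # skip unreachable pages
        visited.add(url)
        offline_copies[url] = html            # copy kept for later indexing
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            frontier.append(urljoin(url, link))
    return offline_copies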
Much research has been devoted to creating crawlers that maintain a “fresh” collection of web pages. Web pages change at different rates, which means that the crawler must decide, using an efficient method, which page should be revisited and when [1]. This leads to crawlers with at least two basic modules: one for periodic (scheduled) crawling and another for incremental crawling, which updates the most frequently changing pages. It is noted in [2] and [3] that most web pages in the US are modified during US working hours, which is an entirely reasonable observation. In [4], Cho and Garcia-Molina show that different domains have very different “page change” rates. Arasu et al. [5] report a half-life of 10 days for web pages and use it to build an algorithm for maintaining the freshness of their “offline collection”.
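As an illustration of such a two-module design, the following sketch, written under our own assumptions rather than taken from any of the cited works, combines a periodic pass with an incremental one: every source is revisited at a baseline interval, while sources that are repeatedly observed to have changed are promoted to shorter intervals. The constants (a six-hour baseline and a five-minute floor) and the hash-based change test are hypothetical choices made for the example.

import hashlib
import heapq
import time

BASE_INTERVAL = 6 * 3600    # periodic module: revisit every source every 6 h
MIN_INTERVAL = 300          # incremental module: at most one visit per 5 min

class SourceState:
    def __init__(self, url):
        self.url = url
        self.checks = 0             # revisits performed so far
        self.changes = 0            # revisits that found new content
        self.last_digest = None     # hash of the previously fetched content

    def next_interval(self):
        """Shrink the revisit interval in proportion to the observed
        change ratio; rarely changing sources stay on the periodic pass."""
        if self.checks == 0:
            return BASE_INTERVAL
        ratio = self.changes / self.checks
        return max(MIN_INTERVAL,
                   BASE_INTERVAL * (1.0 - ratio) + MIN_INTERVAL * ratio)

def run_scheduler(sources, fetch):
    """fetch(url) -> bytes. Revisits each source whenever its timer
    expires; runs indefinitely, as a crawler daemon would."""
    queue = [(time.time(), source.url, source) for source in sources]
    heapq.heapify(queue)
    while queue:
        due, _, state = heapq.heappop(queue)
        time.sleep(max(0.0, due - time.time()))
        digest = hashlib.sha1(fetch(state.url)).hexdigest()
        state.checks += 1
        if digest != state.last_digest:       # content changed since last visit
            state.changes += 1
            state.last_digest = digest
        heapq.heappush(queue,
                       (time.time() + state.next_interval(), state.url, state))

In practice a feed crawler would compare entry timestamps inside the RSS document rather than a raw page hash, but the scheduling logic stays the same.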