Full-Text RSS is a free software, open source tool for extracting content from web pages, developed for the Five Filters project. It tries to separate the actual content from other web page elements, such as presentational elements and navigation bars, and usually succeeds.
It is designed to work with any web page carrying an article. It can accept as input a single web page (identified by its URL) or a list of web pages (identified by a feed URL). It then retrieves each page and uses a set of rules to identify the title and the content block most likely to hold only the article's content (usually paragraphs of text). These blocks are then extracted and returned in the RSS feed format.
This process of content detection and extraction is outlined here. For the majority of extractions we use a piece of free and open source software called Readability, which was developed by Arc90 and ported to PHP for use in this tool. It is also used by Apple in the Safari browser.
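To make the idea of picking the "most likely" content block more concrete, here is a deliberately simplified sketch in Python. It is not Readability's algorithm and not the code used by Full-Text RSS; it only illustrates one plausible rule, namely crediting each `<div>` with the amount of paragraph text it directly contains and choosing the highest scorer. The class name `BlockScorer`, the function `likely_content_block` and the sample page are all invented for this illustration.

```python
from html.parser import HTMLParser


class BlockScorer(HTMLParser):
    """Toy scorer: count characters of <p> text under each <div>."""

    def __init__(self):
        super().__init__()
        self.div_stack = []        # ids of the <div> elements currently open
        self.in_paragraph = False  # True while inside a <p> element
        self.scores = {}           # div id -> characters of paragraph text

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.div_stack.append(dict(attrs).get("id", "(no id)"))
        elif tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Credit paragraph text to the innermost open <div> only.
        if self.in_paragraph and self.div_stack:
            key = self.div_stack[-1]
            self.scores[key] = self.scores.get(key, 0) + len(data.strip())


def likely_content_block(html):
    """Return the id of the <div> holding the most paragraph text, or None."""
    scorer = BlockScorer()
    scorer.feed(html)
    if not scorer.scores:
        return None
    return max(scorer.scores, key=scorer.scores.get)


# Invented sample page: navigation, an article body and a footer.
SAMPLE_PAGE = """
<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<div id="article">
  <p>This is the first paragraph of the article, with a full sentence of text.</p>
  <p>This is the second paragraph, which adds more detail.</p>
</div>
<div id="footer"><p>Copyright notice</p></div>
"""

if __name__ == "__main__":
    print(likely_content_block(SAMPLE_PAGE))
```

Real extractors weigh many more signals (link density, class names, nesting and so on), so treat this purely as an illustration of the general approach.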
In the simplest and most popular case, Full-Text RSS is used with news reader software (e.g. Google Reader, RSS Owl). If you use software, on any platform, that can read RSS feeds, you can use Full-Text RSS. Simply copy and paste the URL generated by Full-Text RSS (you'll see it in your browser's address bar when you click 'Create Feed') into your application.
Full-Text RSS is also used by developers and researchers to extract articles from web pages. The extracted content can then be repackaged (e.g. reformatted for easy mobile reading, or converted to epub or mobi format to be read on an e-reader such as the Kindle), or used for statistical analysis.
As long as you have access to a server with PHP, you can use this (and if you don't, most cheap shared web hosts meet the requirements). Full-Text RSS can output the extracted content as an RSS file or as JSON. It gives you a URL which you can use in your application, whatever the language (if you prefer RSS over JSON, there are RSS libraries for PHP, Python, Java, Ruby and more — and as RSS is XML, you can also use your favourite XML parser to get at the content).
Once you load the library, simply pass it the URL created by Full-Text RSS (or follow the URL construction guidelines to construct your own URL) and you'll be able to access the extracted content in your favourite language. There's more information in our user guide.
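As a minimal sketch of reading the RSS output with an ordinary XML parser, the Python example below uses only the standard library. The feed text in `SAMPLE_RSS` is invented and stands in for whatever a Full-Text RSS URL would return; the example assumes a plain RSS 2.0 layout with the extracted article in each item's `description` element, and `read_items` is a hypothetical helper name, not part of Full-Text RSS.

```python
import xml.etree.ElementTree as ET

# Stand-in for the RSS document returned by a Full-Text RSS URL.
# The feed contents are invented for illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First article</title>
      <link>http://example.org/first</link>
      <description>&lt;p&gt;Extracted text of the first article.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Second article</title>
      <link>http://example.org/second</link>
      <description>&lt;p&gt;Extracted text of the second article.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
"""


def read_items(rss_text):
    """Return one dict (title, link, content) per <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # Assumption: the extracted article HTML is in <description>.
            "content": item.findtext("description", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in read_items(SAMPLE_RSS):
        print(entry["title"], "->", entry["link"])
```

In a real application you would fetch the feed text from the URL generated by Full-Text RSS (for example with `urllib.request.urlopen`) and pass it to `read_items` in place of `SAMPLE_RSS`.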
Google Reader does not treat all feeds equally. The update frequency depends on the number of subscribers a feed has and possibly other factors. We will include a solution for this in a future version. For the time being we have given users paying for premium access the option of creating monitored feeds which have changes automatically pushed to Google Reader.
There's often confusion around the word 'free' in free software. The Free Software Foundation explains it as follows: 'When we speak of "free software", we're talking about freedom, not price' (read more). We are a free software project because we believe in the benefits of free software and we want people who use our code to be able to examine it, run it for themselves, and even change it if they want to. But we're not a big business with a big budget. We simply want to do the work we enjoy, and to do it we really need to make enough to sustain the project. We've tried to keep the price of the zip download reasonable and, in addition to sending you the code itself, we give you additional benefits such as free updates for 1 year and support if you have trouble using it.
If you pay and it doesn't work, and we can't help you get it to work, then we'll happily refund your money (email us at fivefilters (at) fivefilters.org and we'll do our best to help).
This is related to the question above about paying. We rely on paying users to sustain the project. When we make a new release we'll send free updates to everyone who purchased an earlier version, then we make it available for purchase. After we've raised some money through sales we'll push our changes to the repository so everyone can download for free.
The best way to ensure we pick out the correct content is to mark up your content using hNews. Readability's publishing guidelines are a good place to start.
We occasionally receive emails from publishers who find their content republished on other websites completely unrelated to fivefilters.org - usually spam blogs (or splogs). The reason some of these complaints reach us, rather than the site owners who republish the content, is that some site owners have started using our Full-Text RSS service as part of their publication process. In doing so, they are also inadvertently republishing the reference to fivefilters.org that appears at the bottom of each feed item. This reference often gives the mistaken impression that we endorse or are responsible for the republication of such material.
As of 13 August 2010, we have altered the message that appears beneath each feed item to include a link to this FAQ page, which we hope offers some clarification.
fivefilters.org has no control over other websites. We urge publishers to contact the site owner carrying the content and request that it be removed.
If you suspect that your content is being republished with the aid of our service, please contact us with your feed URL and we will be happy to block access to your feed from fivefilters.org.
But please note that republishing content is fairly trivial for those determined to do it. Having your feed blocked on fivefilters.org is unlikely to stop those determined to republish it.
RSS feeds are typically used by individuals (1) to receive updates to a website more conveniently and (2) to read that content in an environment of their choosing (usually a dedicated news reading application such as Google Reader or RSS Owl). Full-Text RSS helps users with the second scenario.
Aside from helping individuals read the content they like in their chosen environment, the ability to extract web content is also very useful for application developers. We use the code ourselves as part of our PDF Newspaper project and Kindle service to make web articles, especially those published by smaller, independent sites, more accessible. This can also benefit the authors and publishers of these sites (many of whom don't have the resources of the big media companies) by allowing them to offer their content in multiple convenient formats without much effort.
To give another example, researchers in the field of natural language processing use our tools to do linguistic and statistical analysis on web content.
Like many technologies, however, it is a double-edged sword. We often use the copy and paste commands as an analogy: they can be used to help people "steal" content, but they were not designed for that purpose.