Full-Text RSS is a free software and open source web page content extraction tool developed for the Five Filters project. It tries to separate the actual content from other web page elements, such as presentational elements and navigation bars, and it usually succeeds.
It is designed to work with any web page carrying an article. It can accept as input a single web page (identified by its URL) or a list of web pages (identified by a feed URL). It then retrieves each page and uses a set of rules to identify the title and content block that's most likely to hold only the article's content (usually paragraphs of text). These blocks are then extracted and returned in the RSS feed format.
This process of content detection and extraction is outlined here. For the majority of extractions we use a piece of free and open source software called Readability, which was developed by Arc90 and ported to PHP for use in this tool. It is also used by Apple in the Safari browser.
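To make the idea of "the content block most likely to hold the article" concrete, here is a toy sketch in Python. It is not the Readability algorithm and not the code Full-Text RSS runs: it simply assumes that the container with the most paragraph text is the article, and that anything inside navigation-style tags can be ignored. The names `BlockScorer`, `extract_article`, `SKIP_TAGS` and `BLOCK_TAGS` are made up for this illustration.

```python
from html.parser import HTMLParser

# Illustrative assumptions: tags unlikely to hold article text, and
# container tags that are candidates for the article body.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
BLOCK_TAGS = {"div", "article", "section", "td"}


class BlockScorer(HTMLParser):
    """Collect <p> text per container so containers can be compared by text length."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.blocks = []    # one list of paragraph fragments per container
        self._open = []     # indices of the containers that are currently open
        self._skip = 0      # nesting depth inside SKIP_TAGS
        self._in_p = False
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.blocks.append([])
            self._open.append(len(self.blocks) - 1)
        elif tag == "p":
            self._in_p = True
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag in BLOCK_TAGS and self._open:
            self._open.pop()
        elif tag == "p":
            self._in_p = False
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif self._in_p and not self._skip and self._open:
            # Credit paragraph text to the innermost open container.
            self.blocks[self._open[-1]].append(data.strip())


def extract_article(html):
    """Return (title, text of the container with the most paragraph text)."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    texts = [" ".join(part for part in parts if part) for parts in parser.blocks]
    return parser.title, max(texts, key=len, default="")


SAMPLE = """<html><head><title>Example headline</title></head><body>
<nav><div><p>Home | About | Contact</p></div></nav>
<div id="story"><p>First paragraph of the article.</p>
<p>Second paragraph with more detail.</p></div>
<div id="sidebar"><p>Subscribe now</p></div>
</body></html>"""

title, content = extract_article(SAMPLE)
print(title)
print(content)
```

A real extractor weighs many more signals (class names, link density, punctuation and so on), so treat this only as a way to picture the general approach.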
In the simplest and most popular case, Full-Text RSS is used with news reader software (e.g. NewsBlur, Feedly). If you use software, on any platform, that can read RSS feeds, you can use Full-Text RSS. Simply copy and paste the URL generated by Full-Text RSS (you'll see it in your browser's address bar when you click 'Create Feed') into your application.
Full-Text RSS is also used by developers and researchers to extract articles from web pages. The extracted content can then be repackaged (e.g. for easy mobile reading, converted to epub or mobi format to be read on an e-reader such as the Kindle), or used for statistical analysis.
Full-Text RSS is designed to be used as a web service. We have a hosted API, but you can also host the software yourself (see our hosting suggestions). Full-Text RSS outputs the extracted content as an RSS file or as JSON.
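For developers, calling the web service might look roughly like the Python sketch below. Everything specific in it is an assumption to check against the user guide and your own installation: the base URL is a placeholder, and the endpoint name and the `url` and `format` parameter names are illustrative. The helper names `build_request_url` and `parse_rss_items` are hypothetical. The sketch parses a small sample RSS document rather than making a network request.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Placeholder for a self-hosted copy; the endpoint name is an assumption.
BASE_URL = "https://example.org/full-text-rss/makefulltextfeed.php"


def build_request_url(source_url, output="rss", base_url=BASE_URL):
    """Build a request URL for a single web page URL or a feed URL."""
    params = {"url": source_url}
    if output == "json":
        params["format"] = "json"  # assumed parameter name for JSON output
    return f"{base_url}?{urlencode(params)}"


def parse_rss_items(rss_text):
    """Return the title, link and extracted content of each item in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "content": item.findtext("description", default=""),
        }
        for item in root.iter("item")
    ]


# A hand-written sample standing in for the service's RSS output.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>Example headline</title>
<link>https://example.com/story</link>
<description>&lt;p&gt;Full article text.&lt;/p&gt;</description></item>
</channel></rss>"""

request_url = build_request_url("https://example.com/feed.xml")
items = parse_rss_items(SAMPLE_RSS)
print(request_url)
print(items[0]["title"])
```

In practice you would fetch `request_url` with an HTTP client of your choice and pass the response body to `parse_rss_items`.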
There's more information in our user guide.
There's often confusion around the word 'free' in free software. The Free Software Foundation explains it as follows: 'When we speak of "free software", we're talking about freedom, not price' (read more). We are a free software project because we believe in the benefits of free software and we want people who use our code to be able to examine it, run it for themselves, and even change it if they want to. But we're not a big business with a big budget. We simply want to do the work we enjoy, and to do it we really need to make enough to sustain the project. We've tried to keep the price of the zip download reasonable and, in addition to sending you the code itself, we give you additional benefits such as free updates for a limited time and support if you have trouble using it.
If you pay and it doesn't work, and we can't help you get it to work, then we'll happily refund your money (email us at help (at) fivefilters.org and we'll do our best to help).
The best way to ensure we pick out the correct content is to mark up your content using Schema.org Article markup.
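As a rough illustration for publishers, the Python sketch below generates an article wrapper carrying Schema.org Article markup in microdata form. The choice of microdata (rather than another Schema.org syntax) and of the `headline` and `articleBody` properties is our assumption for this example, and `article_microdata` is a hypothetical helper, not part of any tool.

```python
from html import escape


def article_microdata(headline, paragraphs):
    """Wrap a headline and paragraphs in Schema.org Article microdata."""
    body = "\n".join(f"    <p>{escape(p)}</p>" for p in paragraphs)
    return (
        '<article itemscope itemtype="https://schema.org/Article">\n'
        f'  <h1 itemprop="headline">{escape(headline)}</h1>\n'
        '  <div itemprop="articleBody">\n'
        f"{body}\n"
        "  </div>\n"
        "</article>"
    )


markup = article_microdata("Tea & toast", ["First paragraph.", "Second paragraph."])
print(markup)
```

The point is that the article body sits in one clearly labelled element, so an extractor does not have to guess where the content starts and ends.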
We occasionally receive emails from publishers who find their content republished on other websites completely unrelated to fivefilters.org - usually spam blogs (or splogs). The reason some of these complaints reach us, rather than the site owners who republish the content, is that some site owners have started using our Full-Text RSS service as part of their publication process. In doing so, they are also inadvertently republishing the reference to fivefilters.org that appears at the bottom of each feed item. This reference often gives the mistaken impression that we endorse or are responsible for the republication of such material.
fivefilters.org has no control over other websites. We urge publishers to contact the site owner carrying the content and request that it be removed.
If you suspect that your content is being republished with the aid of our service, please contact us with your feed URL and we will be happy to block access to your feed from fivefilters.org.
But please note that republishing content is fairly trivial for those determined to do it. Having your feed blocked on fivefilters.org is unlikely to stop those determined to republish it.
RSS feeds are typically used by individuals to (1) receive updates to a website more conveniently and (2) read that content in an environment of their choosing (usually a dedicated news reading application such as Google Reader or RSS Owl). Full-Text RSS helps users with the second scenario.
Aside from helping individuals read the content they like in their chosen environment, the ability to extract web content is also very useful for application developers. We use the code ourselves as part of our PDF Newspaper project and Kindle service to make web articles, especially those published by smaller, independent sites, more accessible. This can also benefit the authors and publishers of these sites (many of whom don't have the resources of the big media companies) by allowing them to offer their content in multiple convenient formats without much effort.
To give another example, researchers in the field of natural language processing use our tools to do linguistic and statistical analysis on web content.
Like many technologies, however, it is a double-edged sword. We often use the copy and paste commands as an analogy: they can be used to help people "steal" content but were not designed for that purpose.
We're happy to answer any questions you may have. Feel free to ask in our support centre or, alternatively, email help (at) fivefilters.org.
© 2020 FiveFilters.org