Term Extraction from FiveFilters.org is a free software project to help you extract terms (e.g. for use as tags) through a web service. Given some text, it returns a list of terms with (hopefully) the most relevant first.
Get terms in HTML, JSON, XML, or plain text for easy parsing in any programming language.
Give it the URL to a web article and we'll try to extract the article's contents first and then carry out term extraction.
Host on your own servers or deploy to the cloud. Pre-configured. No database required.
Term Extraction is free software — no restrictive corporate APIs here.
Released 17 January 2013
Our term extraction application is intended to be run as a web service which you host yourself.
Additionally, it contains Full-Text RSS 2.9.5 (not the latest release) to enable term extraction from web articles (URL input).
PHP 5.2 or above is required.
If APC is available, the dictionary of English words will be stored in memory to increase performance.
The previous version of the term extraction web service provided here was a Python application. You can download that free of charge. You should be able to test it on your own machine using web.py. The version that was running here was hosted on Google's App Engine.
Download: ZIP file
Inside the term-extraction directory you'll find two sub-directories: web.py and google_app_engine. If you'd like to test on your own machine try installing web.py and running python code.py. If you'd like to host the code on Google App Engine, use the code inside google_app_engine.
Note: Unfortunately we cannot offer any support for the Python version.
When making HTTP requests, you can pass the following parameters (either in a GET request or POST request).
|Parameter||Value||Description|
|text||string||The text to extract terms from (UTF-8 encoded). English is the only supported language.|
|output||json, xml, txt, php, html||The format in which to return the terms.|
|terms_only||1 or 0 (default)||Set this to 1 if you're only interested in the terms themselves (not each term's occurrence count and word count). Only applies to JSON output.|
|max||number (default 50)||The maximum number of terms to return.|
|lowercase||1 or 0 (default)||Set this to 1 to have all extracted terms converted to lowercase.|
|url||string||This can be used instead of 'text' or 'text_or_url' to point to a web article.|
|text_or_url||string||For convenience, this parameter can be used instead of the 'text' or 'url' parameters to accept either a URL (on its own) or some text.|
|key||string||Access key. Required only if you've set one up in custom_config.php.|
Required parameters: either text, url, or text_or_url must be supplied.
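As a rough illustration of the parameters above, the following Python sketch builds a GET URL and an equivalent POST request without sending anything. The base URL and the helper names (`build_params`, `build_get_url`, `build_post_request`) are assumptions made for this example, and treating the three input parameters as mutually exclusive is one reading of "instead of" in the table, not confirmed behaviour.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical location of a self-hosted copy; adjust to your own install.
BASE_URL = "http://localhost/term-extraction/extract.php"


def build_params(text=None, url=None, text_or_url=None, **options):
    """Collect request parameters, checking that one input is supplied."""
    inputs = {"text": text, "url": url, "text_or_url": text_or_url}
    supplied = {k: v for k, v in inputs.items() if v is not None}
    # Assumption: the three input parameters are alternatives to each other.
    if len(supplied) != 1:
        raise ValueError("supply exactly one of text, url, or text_or_url")
    params = dict(supplied)
    params.update(options)  # e.g. output="json", max=10, lowercase=1
    return params


def build_get_url(params, base_url=BASE_URL):
    """Return a GET URL with the parameters percent-encoded as UTF-8."""
    return base_url + "?" + urlencode(params)


def build_post_request(params, base_url=BASE_URL):
    """Return an unsent POST request carrying the parameters in its body."""
    body = urlencode(params).encode("utf-8")
    return Request(base_url, data=body, method="POST")
```

For example, `build_get_url(build_params(text="Virgin Mary", output="json", max=10))` yields a URL ending in `?text=Virgin+Mary&output=json&max=10`. The request object returned by `build_post_request` could then be passed to `urllib.request.urlopen` once the service is running.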
The parameters below can be used to filter out certain terms:
|Parameter||Value||Description|
|min_occurrence||number (default 1)||The minimum number of times a single-word (unigram) term must appear for it to be included in the output.|
|max_strength||number (default 3)||Strength is the number of words in the term, so to reduce results to terms with a maximum of 2 words, set this to 2.|
|keep_if_strength||number (default 2)||Keep a term if the term's word count is equal to or greater than this, regardless of occurrence.|
|exc||string||Check terms for this string, and exclude term if there's a match or partial match. This can appear multiple times.|
|filter||1 (default) or 0||Set this to 0 to disable filtering (overriding the four parameters above).|
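To make the interaction between these filtering parameters concrete, here is a small Python sketch of one plausible reading of the descriptions above. It is not the application's actual code: the function name `keep_term`, the case-insensitive matching for `exc`, and the handling of terms shorter than `keep_if_strength` are all assumptions.

```python
def keep_term(term, occurrence, min_occurrence=1, max_strength=3,
              keep_if_strength=2, exc=(), filter_enabled=True):
    """Decide whether a term survives filtering (illustrative reading only)."""
    if not filter_enabled:
        # filter=0: skip every check below.
        return True
    # Strength is the number of words in the term.
    strength = len(term.split())
    if strength > max_strength:
        return False
    # Assumption: 'exc' is a case-insensitive substring test, so partial
    # matches exclude the term; empty strings are ignored.
    if any(s and s.lower() in term.lower() for s in exc):
        return False
    if strength >= keep_if_strength:
        # Long enough to be kept regardless of occurrence.
        return True
    # Assumption: any shorter term must meet the occurrence threshold.
    return occurrence >= min_occurrence
```

With the defaults, `keep_term("virgin mary", 1, min_occurrence=2)` keeps the two-word term even though it appears only once, while `keep_term("renaissance", 1, min_occurrence=2)` drops the unigram.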
One aim of this web application is to allow users to switch from Yahoo's Term Extraction service to one under their own control. To make this as easy as possible, Term Extraction from FiveFilters.org can produce output in the format generated by the Yahoo service and accept the same parameters.
If you are switching over from Yahoo's service, make sure you enable Yahoo mode, either by using the 'yahoo' parameter below or by simply calling yahoo.php instead of extract.php.
|Parameter||Value||Description|
|yahoo||1 or 0 (default)||Set this to 1 to enable Yahoo mode (output format matching that used by Yahoo's Term Extraction service). Alternatively, you can simply call yahoo.php instead of extract.php to enable Yahoo mode.|
|appid||string||Same as 'key'|
|context||string||Same as 'text'|
For example, let's say we want to extract terms from the following piece of text (the example used by Yahoo):
"Italian sculptors and painters of the renaissance favored the Virgin Mary for inspiration."
A request to Yahoo's service would pass this text in the 'context' parameter.
To switch to Term Extraction from FiveFilters.org, you would simply change the base URL to point to yahoo.php on your own copy, leaving the parameters unchanged.
Note: in this case exactly the same terms are returned by both services, but Yahoo compatibility mode does not mean you'll get the same results as Yahoo's service, only that the way the results are formatted should match Yahoo's.
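The base-URL switch described above can be sketched in Python as follows. Both URLs are placeholders rather than real endpoints, `my-app-id` is a made-up value, and `switch_base_url` is a hypothetical helper written for this example; only the 'appid' and 'context' parameter names come from the table above.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

# Placeholders: OLD_BASE stands in for the old service's endpoint,
# NEW_BASE for wherever your own copy of yahoo.php is installed.
OLD_BASE = "http://old-service.example/termExtraction"
NEW_BASE = "http://localhost/term-extraction/yahoo.php"


def switch_base_url(request_url, new_base):
    """Keep the query string of request_url but point it at new_base."""
    old = urlsplit(request_url)
    new = urlsplit(new_base)
    return urlunsplit((new.scheme, new.netloc, new.path, old.query, ""))


# Yahoo-style parameters: 'appid' plays the role of 'key',
# and 'context' the role of 'text'.
query = urlencode({
    "appid": "my-app-id",  # hypothetical access key
    "context": "Italian sculptors and painters of the renaissance "
               "favored the Virgin Mary for inspiration.",
})
old_request = OLD_BASE + "?" + query
new_request = switch_base_url(old_request, NEW_BASE)
```

Only the scheme, host, and path change; the query string is carried over untouched, which is the point of the compatibility mode.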
Our hosted service (the one accessible via the form at the top of this page) is intended for light use and to demo what the self-hosted option can produce. We do not currently offer a premium plan, so for developers and others who need to make a lot of requests, please purchase our self-hosted package.
Free software for term extraction:
Non-free web services for term extraction:
Olena Medelyan has more information and resources on her topic indexing blog. She is involved with the Maui and Kea projects. Joseph Turian gives a background to term extraction and links to related tools and research.
© 2013 FiveFilters.org