What is Term Extraction?

Term Extraction from FiveFilters.org is a free software project to help you extract terms (e.g. for use as tags) through a web service. Given some text it will return a list of terms with (hopefully) the most relevant first.

Terms can be returned in a variety of formats. The application is intended to be a simple, free alternative to Yahoo's Term Extraction service. Our web service relies on a PHP port of Topia's Term Extraction.


Features

Multiple formats

Get terms in HTML, JSON, XML, or plain text for easy parsing in any programming language.

Article extraction

Give it the URL to a web article and we'll try to extract the article's contents first and then carry out term extraction.

Easy hosting

Host on your own servers or deploy to the cloud. Pre-configured. No database required.

Freedom

Term Extraction is free software — no restrictive corporate APIs here.


Download

Term Extraction 1.0

Released 17 January 2013

Our term extraction application is intended to be run as a web service which you host yourself.

It contains, and relies on, a PHP port of Topia's Term Extraction. (This can be used as a regular PHP library, rather than a web service. See examples of use.)

Additionally, it contains Full-Text RSS 2.9.5 (not the latest release) to enable term extraction from web articles (URL input).

Term Extraction v1.0

zip package — 20 €

System requirements

PHP 5.2 or above is required.

If APC is available, the dictionary of English words will be stored in memory to increase performance.

Older versions

The previous version of the term extraction web service provided here was a Python application. You can download that free of charge. You should be able to test it on your own machine using web.py. The version that was running here was hosted on Google's App Engine.

Download: ZIP file

Inside the term-extraction directory you'll find two sub-directories: web.py and google_app_engine. If you'd like to test on your own machine try installing web.py and running python code.py. If you'd like to host the code on Google App Engine, use the code inside google_app_engine.

Note: Unfortunately we cannot offer any support for the Python version.


Request parameters

General parameters

When making HTTP requests, you can pass the following parameters (either in a GET request or POST request).

| Parameter | Value | Description |
| --- | --- | --- |
| text | string | The text to extract terms from (UTF-8 encoded). English is the only supported language. |
| output | json, xml, txt, php, html | The format in which to return the terms. |
| terms_only | 1 or 0 (default) | Set this to 1 if you're only interested in the terms (not the occurrence count and word count of each term). Only applies to JSON output. |
| max | number (default 50) | The maximum number of terms to return. |
| lowercase | 1 or 0 (default) | Set this to 1 to have all extracted terms converted to lowercase. |
| callback | string | For JSONP: name of your JavaScript function to receive the JSON response. If JSON has not been requested, this has no effect. The following characters are allowed: A-Z a-z 0-9 . [] and _. |
| url | string | This can be used instead of 'text' or 'text_or_url' to point to a web article. |
| text_or_url | string | For convenience, this parameter can be used instead of the 'text' or 'url' parameters to accept either a URL (on its own) or some text. |
| key | string | Access key. Required only if you've set one up in custom_config.php. |

Required parameters: either text, url, or text_or_url must be supplied.
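To illustrate how these parameters fit together, here is a minimal Python sketch that builds a GET request URL. The base URL and the build_request_url helper are our own placeholders for illustration (they are not part of the package); only the parameter names come from the table above.

```python
from urllib.parse import urlencode

# Hypothetical location of a self-hosted copy; replace with your own.
BASE_URL = "http://example.com/term-extraction/extract.php"


def build_request_url(base_url, **params):
    """Return a GET URL for the web service from the given parameters."""
    # At least one input parameter is needed: text, url, or text_or_url.
    if not any(params.get(name) for name in ("text", "url", "text_or_url")):
        raise ValueError("supply one of: text, url, text_or_url")
    # urlencode percent-encodes values as UTF-8 by default.
    return base_url + "?" + urlencode(params)


request_url = build_request_url(
    BASE_URL,
    text="Italian sculptors and painters of the renaissance",
    output="json",
    max=10,
    lowercase=1,
)
print(request_url)
```

The same parameters could be sent in a POST body instead, which is probably the better choice for long texts.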

Filtering

The parameters below can be used to filter out certain terms.

| Parameter | Value | Description |
| --- | --- | --- |
| min_occurrence | number (default 1) | The minimum number of times a single-word (unigram) term must appear for it to be included in the output. |
| max_strength | number (default 3) | Strength is the number of words in the term, so to reduce results to terms with a maximum of 2 words, set this to 2. |
| keep_if_strength | number (default 2) | Keep a term if the term's word count is equal to or greater than this, regardless of occurrence. |
| exc[] | string | Check terms for this string, and exclude the term if there's a match or partial match. This can appear multiple times. |
| filter | 1 (default) or 0 | Set this to 0 to disable filtering (overriding the four parameters above). |
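The way these filters interact can be sketched in a few lines of Python. This is our reading of the descriptions above, not the code the service actually runs: the keep_term function is hypothetical, the exc[] check is assumed to be a case-sensitive substring match, and we assume a multi-word term shorter than keep_if_strength is dropped.

```python
def keep_term(term, occurrence, min_occurrence=1, max_strength=3,
              keep_if_strength=2, exclude=()):
    """Return True if a term would survive the filters as described above."""
    strength = len(term.split())  # strength = number of words in the term
    # exc[]: drop the term on a full or partial (substring) match.
    if any(s in term for s in exclude):
        return False
    # max_strength: drop terms with more words than allowed.
    if strength > max_strength:
        return False
    # keep_if_strength: long enough terms are kept regardless of occurrence.
    if strength >= keep_if_strength:
        return True
    # min_occurrence: single-word terms must appear often enough.
    return strength == 1 and occurrence >= min_occurrence


# (term, occurrence) pairs made up for this illustration.
candidates = [("renaissance", 1), ("Virgin Mary", 1), ("painters", 3),
              ("Italian renaissance art history", 1)]
kept = [t for t, n in candidates if keep_term(t, n, min_occurrence=2)]
print(kept)  # ['Virgin Mary', 'painters']
```

With min_occurrence raised to 2, the single-occurrence unigram is dropped, the two-word term is kept regardless of occurrence, and the four-word term exceeds the default max_strength of 3.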

Yahoo compatibility

One aim of this web application is to allow users to switch from Yahoo's Term Extraction service to one under their own control. To make this as easy as possible, Term Extraction from FiveFilters.org can produce output in the format generated by the Yahoo service and accept the same parameters.

If you are switching over from Yahoo's service, make sure you enable Yahoo mode either by using the 'yahoo' parameter below, or simply calling yahoo.php instead of extract.php.

| Parameter | Value | Description |
| --- | --- | --- |
| yahoo | 1 or 0 (default) | Set this to 1 to enable Yahoo mode (output format matching that used by Yahoo's Term Extraction service). Alternatively, you can simply call yahoo.php instead of extract.php to enable Yahoo mode. |
| appid | string | Same as 'key' |
| context | string | Same as 'text' |

For example, let's say we want to extract terms from the following piece of text (the example used by Yahoo):

"Italian sculptors and painters of the renaissance favored the Virgin Mary for inspiration."

Here's what the request might look like for Yahoo:

http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction?appid=YahooDemo&output=xml&context=Italian%20sculptors%20and%20painters%20of%20the%20renaissance%20favored%20the%20Virgin%20Mary%20for%20inspiration.

To switch to Term Extraction from FiveFilters.org, you would simply change the base URL to point it to your own copy:

http://term-extraction.aws.af.cm/yahoo.php?appid=YahooDemo&output=xml&context=Italian%20sculptors%20and%20painters%20of%20the%20renaissance%20favored%20the%20Virgin%20Mary%20for%20inspiration.

Note: in this case exactly the same terms are returned by both services, but Yahoo compatibility mode does not mean you'll get the same results as Yahoo's service, only that the way the results are formatted should match Yahoo's.
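If you have existing code that builds Yahoo request URLs, the switch can be sketched as a small Python helper that keeps the query string and replaces everything before it. The switch_from_yahoo function and the example.com address are placeholders we made up for illustration; substitute the address of your own copy's yahoo.php.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical address of your own copy's yahoo.php.
SELF_HOSTED = "http://example.com/term-extraction/yahoo.php"


def switch_from_yahoo(yahoo_url, self_hosted=SELF_HOSTED):
    """Keep the query string of a Yahoo request, replace everything before it."""
    query = urlsplit(yahoo_url).query  # appid, output, context, ... unchanged
    target = urlsplit(self_hosted)
    return urlunsplit((target.scheme, target.netloc, target.path, query, ""))


yahoo_url = ("http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction"
             "?appid=YahooDemo&output=xml&context=Italian%20sculptors")
new_url = switch_from_yahoo(yahoo_url)
print(new_url)
```

Because appid and context are accepted as aliases for key and text, the query string needs no changes.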


More information

Our hosted service

Our hosted service (the one accessible via the form at the top of this page) is intended for light use and to demo what the self-hosted option can produce. We do not currently offer a premium plan, so for developers and others who need to make a lot of requests, please purchase our self-hosted package.

Similar software

Free software for term extraction:

Non-free web services for term extraction:

Olena Medelyan has more information and resources on her topic indexing blog. She is involved with the Maui and Kea projects. Joseph Turian gives a background to term extraction and links to related tools and research.

License


This web application is licensed under the AGPL version 3. The bulk of the work, however, is carried out by libraries which are licensed as follows...
