Metadata Details

The following name:value pairs are returned by getmetadata and getmetadatatext. Elements and their values depend on the file type, document structure and parameters used to call the MetaGlance service.

For mapping to other metadata formats see crosswalks.


Content Analysis Elements

subject

A comma-separated list of keywords, including single words and multi-word phrases. Keywords can also be treated as candidate tags for a document, blog post, etc. MetaGlance generates keywords using a combination of proprietary methods.

Returned for all file types.

"subject": [ "apples", "oranges", "fruit trees"]

Keyword generation can be customized for particular subject domains, or to (a) look for, emphasize, de-emphasize or filter out certain words and (b) consider the meaning of words in the context of a particular subject. The generic API does not do this.

title

Title of a Web page or document, if it can be detected. MetaGlance considers the title defined by the author using a <title> tag or a document property (e.g. for a Word document) and also looks inside the Web page or document for other possible titles. MetaGlance returns the title it believes is “best.”

Returned for all file types.

"title": "Health and Nutrition: Latest News"

language

The two-letter ISO 639-1 code for the language of the content, determined using n-gram language detection.

Returned for all file types.

"language": "en"

readinglevel

The approximate reading level of the text as determined using the Flesch-Kincaid measure. For example: "12" is around a U.S. 12th grade reading level. 

Returned for all file types.
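
For reference, the standard Flesch-Kincaid grade-level formula can be sketched as follows. This is the published formula, not necessarily the exact computation MetaGlance performs; the function name is illustrative, and the inputs reuse the sample wordcount, sentencecount and syllablecount values from this page, which may not describe the same document.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid grade level from three raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    # Longer sentences and more syllables per word both raise the grade.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Sample counts borrowed from the example values on this page (assumption:
# they are used here only as plausible inputs).
grade = flesch_kincaid_grade(words=2394, sentences=179, syllables=4390)
print(round(grade, 2))
```

A result near 11 would correspond roughly to a U.S. 11th grade reading level.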

"readinglevel": "8.335244755244759"

readingtime

Estimated reading time in seconds, based on the length and complexity of the text. It is intended as a general measure of "size." It does not include the playing time for videos or audio tracks.

Returned for all file types.
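
Numeric values are returned as strings, so a client will usually convert them before use. The sketch below assumes the response is a JSON object of name:value pairs and uses the sample value for this element; the variable names are illustrative.

```python
import json

# Assumed response shape: a JSON object of name:value pairs.
response_text = '{"readingtime": "97.24656000000002"}'
metadata = json.loads(response_text)

# The value is a string holding a number of seconds.
reading_seconds = float(metadata["readingtime"])

# Split into whole minutes and seconds for display.
minutes, seconds = divmod(round(reading_seconds), 60)
print(f"about {minutes} min {seconds} s")
```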

"readingtime": "97.24656000000002"


File Properties

identifier

The identifier of the resource for which metadata was generated. 

  • If you call the MetaGlance service with a URL using getmetadata, the identifier will be the URL unless you specify a different one.
  • If you call the MetaGlance service using getmetadatatext, the identifier will be blank unless you specify one.
  • To specify an identifier foo, add the parameter &identifier=foo.

Returned for all file types.
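
The identifier rules above can be illustrated with a small request-building sketch. The endpoint address and the "url" parameter name are placeholders assumed for illustration (this page does not give them); only the identifier parameter comes from the text above.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the real service address is not given on this page.
BASE_URL = "https://metaglance.example/getmetadata"


def build_request_url(resource_url, identifier=None):
    """Build a getmetadata request URL, optionally overriding the identifier."""
    params = {"url": resource_url}  # "url" is an assumed parameter name
    if identifier is not None:
        # Without this parameter the identifier defaults to the URL itself.
        params["identifier"] = identifier
    return BASE_URL + "?" + urlencode(params)


print(build_request_url("http://www.eduworks.com"))
print(build_request_url("http://www.eduworks.com", identifier="foo"))
```

urlencode percent-encodes the resource URL, so characters such as ":" and "/" do not interfere with the query string.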

"identifier": "http://www.eduworks.com"
"identifier": "myidentifier"

format

File type, for example "Web page" or "Flash animation." The format is derived from the MIME type. See MIME mappings for more detail.

Returned for all file types.

"format": "Web page"

mimetype

The MIME type is a two-part identifier for internet file formats. For more information see a list of common media types. See MIME mappings for more detail.

Returned for all file types.

"mimetype": "text/html"
"mimetype": "application/pdf"
"mimetype": "image/jpeg"
"mimetype": "application/vnd.ms-word"

mediatype

The media type is a broader generalization of the format, usually "Text" or "Image." See MIME mappings for more detail.

"mediatype": "Text"

pages

Number of pages in a document. This is taken from the properties of the file in the case of Word, PowerPoint and PDF documents. It is not computed for other types of content because page count is not well-defined for them. The page count may also be missing from a Word or PDF document that was generated by a third-party program.

"pages": "8"

size

Size of the file in bytes.

"size": "94266"

 


Statistical Data

wordcount

Word count, or total number of words. The word count returned by MetaGlance closely matches that returned by standard word processors. The user should be aware that different methods might give different results for non-standard cases (e.g. hyphenated words that break across a line). The differences are generally very small.

Returned for all file types.

"wordcount": "2394"

sentencecount

Sentence count, determined by looking at punctuation.

Returned for all file types.
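
A minimal sketch of punctuation-based sentence counting is shown below. The rule used here (a run of ".", "!" or "?" followed by whitespace or the end of the text) is an assumption for illustration; MetaGlance's actual rules are not documented on this page.

```python
import re


def count_sentences(text):
    """Count runs of sentence-ending punctuation followed by space or end of text."""
    return len(re.findall(r"[.!?]+(?=\s|$)", text))


print(count_sentences("Hello there. How are you? Fine!"))
```

A rule this simple miscounts abbreviations such as "e.g. this", which is one reason counts from different tools can differ slightly.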

"sentencecount": "179"

averagewordlength

Average characters per word, excluding common words (such as "is," "the," "a," "and," etc.). Note: These are called "stop words" in Natural Language Processing.

Returned for all file types.
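
The idea of excluding stop words before averaging can be sketched as follows. The stop-word set and the tokenizing pattern are illustrative assumptions; the list MetaGlance actually uses is not given here.

```python
import re

# Illustrative stop-word set only; the real list is not documented here.
STOP_WORDS = {"is", "the", "a", "and"}


def average_word_length(text):
    """Mean length of the words that are not stop words."""
    words = re.findall(r"[A-Za-z']+", text)
    content_words = [w for w in words if w.lower() not in STOP_WORDS]
    if not content_words:
        return 0.0
    return sum(len(w) for w in content_words) / len(content_words)


print(average_word_length("The cat and the dog"))
```

Dropping short, frequent words is what pushes the average above what a plain characters-per-word figure would give.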

"averagewordlength": "6.433750152587891"


BETA Elements

The following can be accessed with a beta key, but are still in testing and may change at any time. Please do not rely on these until they're folded into the core elements, above.

abstract

The section labelled "Abstract" in a scholarly paper. This is extracted only from PDFs and is returned only if MetaGlance is confident that it has found a genuine abstract.

classification

Classification into one or more of the following broad subject areas:

  • "arts" - Arts, Entertainment and Food
  • "business" - Business, Finance and Consumers
  • "health" - Health and Medical
  • "sports" - Sports
  • "education" - Education and Training
  • "technology" - Technology and Science
  • "politics" - Politics and Government

Returned for all file types. For single values a string is returned. For multiple classifications an array is returned.

The classifier is proprietary and is based on weighted directed graphs and other techniques.

"classification": "business"
"classification": [ "sports", "arts"]

Custom classifiers can be built for specific applications using this and many other classification methods.
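
Because classification may be either a string or an array, client code is simpler if it normalizes the value to a list. The sketch below assumes the response is a JSON object; the helper name is illustrative.

```python
import json


def classifications(metadata):
    """Return the classification value as a list, whatever shape it arrived in."""
    value = metadata.get("classification")
    if value is None:
        return []          # element absent
    if isinstance(value, str):
        return [value]     # single classification
    return list(value)     # multiple classifications


single = json.loads('{"classification": "business"}')
multiple = json.loads('{"classification": ["sports", "arts"]}')
print(classifications(single))
print(classifications(multiple))
```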

creator

Creator, or document author. This is obtained from the properties of Word, PowerPoint, PDF and HTML documents. The value is also checked for plausibility. In some versions of the API authors can be taken from the body of the document itself. Warning: MetaGlance cannot check whether an author associated with a document is carried over from an older version, as often happens when slide decks or templates are reused.

Returned for all file types if found in the properties.

"creator": "John Smith"

medianwordlength

Median length of all words excluding common words (such as "is," "the," "a," "and," etc.). Note: These are called "stop words" in Natural Language Processing. 

Returned for all file types.

"medianwordlength": "7"

shortdescription

An experimental short description that provides a brief, human-readable summary or abstract of the file. Expect this to change from time to time.

"shortdescription": "A web page about fish and bicycles."

syllablecount

Number of syllables in the text. This is computed by counting vowels and modifying the count using rules that account for diphthongs and silent letters.

Returned for all file types.
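
One way to picture this kind of rule-based count is the sketch below. The specific rules (each run of vowels counted once, a trailing silent "e" dropped except in "-le" endings) are illustrative assumptions, not MetaGlance's actual rules, and the function names are hypothetical.

```python
import re


def count_syllables(word):
    """Estimate syllables in one word from its vowel groups."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    # A run of adjacent vowels (e.g. "oa" in "boat") counts once.
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent ("make"), but keep "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def count_text_syllables(text):
    """Sum the per-word estimates over all alphabetic words in the text."""
    return sum(count_syllables(w) for w in re.findall(r"[A-Za-z]+", text))


print(count_syllables("orange"), count_text_syllables("make a boat"))
```

Heuristics like these are approximate; irregular words will be miscounted, so the result is best read as an estimate.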

"syllablecount": "4390"

uniquewordcount

Number of unique words in the text. Words are considered different unless they are identical (e.g. "dog" is different from "dogs" and "theater" is different from "theatre").
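
A minimal sketch of exact-match unique counting follows. Folding case before comparing is an assumption made here for illustration (this page does not say whether "Dog" and "dog" count as identical), and the tokenizing pattern is likewise hypothetical.

```python
import re


def unique_word_count(text):
    """Count distinct words, with no stemming or spelling normalization."""
    # Assumption: comparison ignores case; inflections and variant spellings
    # remain distinct words.
    words = re.findall(r"[A-Za-z']+", text.lower())
    return len(set(words))


print(unique_word_count("dog dogs Dog theater theatre"))
```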

"uniquewordcount": "944"