Metadata Details
The following name:value pairs are returned by getmetadata and getmetadatatext. Elements and their values depend on the file type, document structure and parameters used to call the MetaGlance service.
For mapping to other metadata formats see crosswalks.
Content Analysis Elements
subject
A comma-separated list of keywords, including single words and multi-word phrases. Keywords can also serve as candidate tags for a document, blog post, etc. MetaGlance generates keywords using a combination of proprietary methods.
Returned for all file types.
"subject": [ "apples", "oranges", "fruit trees"]
Keywords can be customized to particular subject domains or to (a) look for, emphasize, deemphasize or filter out certain words and (b) consider the meaning of words in the context of a particular subject. The generic API does not do this.
title
Title of a Web page or document, if it can be detected. MetaGlance considers the title defined by the author using a <title> tag or property (e.g. for a Word document) and also looks inside of a web page or document at other possible titles. MetaGlance returns the title it believes is “best.”
Returned for all file types.
"title": "Health and Nutrition: Latest News"
language
The two-letter ISO 639-1 code for the language of the content. The language is determined using N-gram language detection.
Returned for all file types.
"language": "en"
readinglevel
The approximate reading level of the text as determined using the Flesch-Kincaid measure. For example: "12" is around a U.S. 12th grade reading level.
Returned for all file types.
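As a rough illustration of the measure named above, the sketch below applies the published Flesch-Kincaid grade level formula to word, sentence and syllable counts. It is not MetaGlance's own implementation, which is not documented here; the function name and the input counts are hypothetical.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Published Flesch-Kincaid grade level formula (not MetaGlance's own code)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Illustrative counts, not taken from a real MetaGlance response.
grade = flesch_kincaid_grade(words=100, sentences=10, syllables=150)

# The service returns the value as a string, so convert it before comparing.
reading_level = float("8.335244755244759")
is_below_grade_9 = reading_level < 9  # the grade 9 cutoff is an arbitrary example
```

Because the element is returned as a string, callers that want to sort or filter by reading level need the float conversion shown at the end.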
"readinglevel": "8.335244755244759"
readingtime
Estimated reading time in seconds, based on the length and complexity of the text. It is intended as a general measure of "size." It does not include the playing time for videos or audio tracks.
Returned for all file types.
"readingtime": "97.24656000000002"
File Properties
identifier
The identifier of the resource for which metadata was generated.
- If you call the MetaGlance service with a URL using getmetadata, the identifier will be the URL unless you specify a different one.
- If you call the MetaGlance service using getmetadatatext, the identifier will be blank unless you specify one.
- To specify an identifier foo, add the parameter &identifier=foo.
Returned for all file types.
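The identifier rules above can be sketched as a small helper that builds a request URL. Only the identifier parameter comes from this page; the base URL and the "url" parameter name are placeholders assumed for illustration, since the real endpoint is not given in this section.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical endpoint: substitute the real MetaGlance service address.
BASE_URL = "https://metaglance.example/getmetadata"


def build_request_url(resource_url: str, identifier: Optional[str] = None) -> str:
    """Build a getmetadata request, adding &identifier=... only when one is given."""
    params = {"url": resource_url}  # the "url" parameter name is an assumption
    if identifier is not None:
        params["identifier"] = identifier
    return BASE_URL + "?" + urlencode(params)


default_request = build_request_url("http://www.eduworks.com")
custom_request = build_request_url("http://www.eduworks.com", identifier="myidentifier")
```

In this sketch, default_request omits the identifier parameter, in which case the text above says the identifier falls back to the URL itself.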
"identifier": "http://www.eduworks.com"
"identifier": "myidentifier"
format
File type, for example: "Web page" or "Flash animation." This is done by interpreting the MIME type. See MIME mappings for more detail.
Returned for all file types.
"format": "Web page"
mimetype
MIME type is a two-part identifier for internet file formats. For more information see a list of common media types. See MIME mappings for more detail.
Returned for all file types.
"mimetype": "text/html"
"mimetype": "application/pdf"
"mimetype": "image/jpeg"
"mimetype": "application/vnd.ms-word"
mediatype
The media type is a broader generalization of the format, usually "Text" or "Image." See MIME mappings for more detail.
"mediatype": "Text"
pages
Number of pages in a document. This is taken from the properties of the file in the case of Word, PowerPoint and PDF documents. This is not computed for other types of content because it is not well-defined. It is also possible that page count is not present in a Word or PDF document that is generated by a third party program.
"pages": "8"
size
Size of the file in bytes.
"size": "94266"
Statistical Data
wordcount
Word count, or total number of words. The word count returned by MetaGlance closely matches that returned by standard word processors. The user should be aware that different methods might give different results for non-standard cases (e.g. hyphenated words that break across a line). The differences are generally very small.
Returned for all file types.
"wordcount": "2394"
sentencecount
Sentence count, counted by looking at punctuation.
Returned for all file types.
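A minimal sketch of punctuation-based sentence counting follows. The exact rules MetaGlance uses are not documented here, so the regular expression and the function name are assumptions chosen for illustration.

```python
import re


def count_sentences(text: str) -> int:
    """Count runs of '.', '!' or '?' that are followed by whitespace or end of text."""
    # The lookahead keeps decimal points such as "3.14" from ending a sentence.
    return len(re.findall(r"[.!?]+(?=\s|$)", text))


sample_count = count_sentences("Pi is 3.14. Really? Yes!")
```

A rule this simple still miscounts abbreviations such as "e.g." and text with no closing punctuation, which is one reason counts from different tools can disagree.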
"sentencecount": "179"
averagewordlength
Average characters per word, excluding common words (such as "is," "the," "a," "and," etc.). Note: These are called "stop words" in Natural Language Processing.
Returned for all file types.
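The idea of averaging word length after dropping stop words can be sketched as below. The stop word list and the tokenizing rule are illustrative assumptions; the list MetaGlance actually uses is not published on this page.

```python
import re

# Illustrative stop word list; the service's real list is not documented here.
STOP_WORDS = {"is", "the", "a", "and", "of", "to", "in"}


def content_words(text: str) -> list:
    """Lowercase the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def average_word_length(text: str) -> float:
    """Mean number of characters per remaining word, or 0.0 if none remain."""
    words = content_words(text)
    return sum(len(w) for w in words) / len(words) if words else 0.0


sample_average = average_word_length("The cat and the dog")
```

Here only "cat" and "dog" survive the filter, so the average is taken over those two words.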
"averagewordlength": "6.433750152587891"
BETA Elements
The following can be accessed with a beta key, but are still in testing and may change at any time. Please do not rely on these until they're folded into the core elements, above.
abstract
Abstract section from a scholarly paper labelled "Abstract." This is only done for PDFs and only returned if MetaGlance is confident that it has found a genuine abstract.
classification
Classification into one or more of the following broad subject areas:
- "arts" - Arts, Entertainment and Food
- "business" - Business, Finance and Consumers
- "health" - Health and Medical
- "sports" - Sports
- "education" - Education and Training
- "technology" - Technology and Science
- "politics" - Politics and Government
Returned for all file types. For single values a string is returned. For multiple classifications an array is returned.
The classifier used is a proprietary classifier based on weighted directed graphs and other techniques.
"classification": "business"
"classification": [ "sports", "arts"]
Custom classifiers can be built for specific applications using this and many other classification methods.
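Because classification is returned as a string for a single value and as an array for multiple values, client code usually has to normalize it. The helper below is a sketch of one way to do that; its name is hypothetical and the two inputs mirror the sample values shown above.

```python
import json


def classifications(metadata: dict) -> list:
    """Return the classification element as a list, whatever shape it arrived in."""
    value = metadata.get("classification")
    if value is None:
        return []  # element absent from the response
    if isinstance(value, str):
        return [value]  # single classification arrives as a bare string
    return list(value)  # multiple classifications arrive as an array


single = classifications(json.loads('{"classification": "business"}'))
multiple = classifications(json.loads('{"classification": ["sports", "arts"]}'))
```

Normalizing once at the edge lets the rest of the client iterate over the result without checking its type each time.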
creator
Creator, or document author. This is obtained from the properties of Word, PowerPoint, PDF and HTML documents and is checked for plausibility. In some versions of the API, authors can also be taken from the body of the document itself. Warning: MetaGlance cannot detect whether an author was carried over from an older version of a document, as often happens when slide decks or templates are reused.
Returned for all file types if found in the properties.
"creator": "John Smith"
medianwordlength
Median length of all words excluding common words (such as "is," "the," "a," "and," etc.). Note: These are called "stop words" in Natural Language Processing.
Returned for all file types.
"medianwordlength": "7"
shortdescription
An experimental short description that provides a brief, human-readable summary or abstract of the file. Expect this to change from time to time.
"shortdescription": "A web page about fish and bicycles."
syllablecount
Number of syllables in the text. This is computed by counting vowels and modifying the count using rules that account for diphthongs and silent letters.
Returned for all file types.
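The vowel-counting approach described above can be sketched as follows. The specific rules here (vowel runs for diphthongs, a trailing silent "e") are assumptions chosen for illustration and are much cruder than a production rule set; the function name is hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate for a single English word."""
    word = word.lower()
    # A run of consecutive vowels counts once, so "oa" in "boat" is one syllable.
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a final "e" as silent, except in consonant + "le" endings ("table").
    if word.endswith("e") and count > 1 and not re.search(r"[^aeiouy]le$", word):
        count -= 1
    return max(count, 1)  # every word is given at least one syllable


total_syllables = sum(count_syllables(w) for w in "the whale ate cake".split())
```

Heuristics like this are wrong for a fair number of English words, so totals should be read as estimates.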
"syllablecount": "4390"
uniquewordcount
Number of unique words in the text. Words are considered different unless they are identical (i.e. "dog" is different from "dogs" and "theater" is different from "theatre").
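A minimal sketch of this exact-match counting is shown below. Lowercasing before comparison and the tokenizing rule are assumptions; the page does not say whether MetaGlance treats "Dog" and "dog" as the same word.

```python
import re


def unique_word_count(text: str) -> int:
    """Count distinct words, with no stemming or spelling normalization."""
    # Lowercasing is an assumption; drop .lower() for case-sensitive counting.
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words))


sample_unique = unique_word_count("dog dogs dog theater theatre")
```

Because nothing is stemmed, "dog" and "dogs" stay separate entries, as do "theater" and "theatre".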
"uniquewordcount": "944"
