#QUESTIONS:
#What do we mean by loss of fidelity? For example, DPLA sometimes moves values from subject to, say, location
# C. Enrichment
# “Enrichment” refers to the process DPLA uses to enhance original records from partners with
# additional data, typically in the form of standardized versions of names and places as well as
# URIs from LOD vocabularies such as GeoNames or the Virtual International Authority File (VIAF).
# DPLA has developed services that check these authorities for the values in specific
# partner-supplied metadata fields and record any matches. DPLA currently performs such
# enrichments only on place names; however, partner-supplied URIs for subject headings or name
# authorities are stored within MAP properties. DPLA also performs enrichments that remove
# extraneous punctuation and whitespace, normalize date formats to a standard (yyyy-mm-dd),
# and perform other cleanup tasks.
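A minimal sketch of this kind of cleanup enrichment, assuming illustrative rules (the regexes and date formats below are not DPLA's actual ingestion code):

```python
import re
from datetime import datetime

def clean_value(value):
    """Collapse extra whitespace and strip trailing punctuation,
    in the spirit of DPLA-style cleanup enrichments (illustrative rules)."""
    value = re.sub(r'\s+', ' ', value).strip()
    return value.rstrip(' .,;')

def normalize_date(value):
    """Try a few common source date formats and emit yyyy-mm-dd."""
    for fmt in ('%B %d, %Y', '%m/%d/%Y', '%Y-%m-%d'):
        try:
            return datetime.strptime(value, fmt).strftime('%Y-%m-%d')
        except ValueError:
            continue
    return value  # leave unparseable dates untouched

print(clean_value('  Annapolis (Md.) ; '))  # Annapolis (Md.)
print(normalize_date('June 1, 1908'))       # 1908-06-01
```

Real enrichment pipelines also have to decide what to do with values they cannot parse; here they simply pass through unchanged.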
from dpla.api import DPLA
import pprint
pp = pprint.PrettyPrinter(width=25)
product_key = '80c77ddf7e3c36e8d9daf820ea5d3a93'
dpla = DPLA(product_key)
print('Product Key %s initialized, we are a go!' % product_key)
MAIN
@id: Used to uniquely identify things that are being described in the document.
dataProvider: Provider of the SourceResource and WebResource edm
id: DPLA ID of a SourceResource within a given context DPLA
ingestDate: Date on which the original record was imported into the DPLA database DPLA
ingestType: Type of record created by ingestion (either item or collection). DPLA
intermediateProvider: An intermediate organization that selects, collates, or curates data from a data provider that is then aggregated by a provider from which DPLA harvests. dpla
score: The relevance score assigned to the item by Elasticsearch elasticsearch
DPLA ITEM RECORD
sourceResource (literal value): dc
sourceResource.contributor: Entity responsible for making contributions to the resource dc
sourceResource.creator: Entity primarily responsible for making sourceResource dc
sourceResource.date: Array containing point or period of time associated with an event in the lifecycle of a SourceResource dc
sourceResource.date.begin: Date/time of the start of a time span (inclusive). edm
sourceResource.date.displayDate: Date to be displayed by an application seeking to provide a date to accompany a SourceResource edm
sourceResource.date.end: Date/time of the end of a time span (inclusive) edm
sourceResource.physicalMedium: A physical material or carrier in which a SourceResource exists dc
sourceResource.subject: Array containing topic(s) of a SourceResource dc
sourceResource.subject.name: Topic or subject of a SourceResource dc
provider: Service or content hub providing access to the Data Provider's content. edm
provider.name: Human-readable version of provider name edm
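The dotted names above describe paths into each item's JSON. A hypothetical, abridged record (all values below are made up for illustration) shows how they map onto the nested structure:

```python
# Abridged, hypothetical DPLA item record; every value is illustrative.
item = {
    "id": "b8daeaff18b57a399d8156313bbd944d",
    "ingestType": "item",
    "dataProvider": "Example Data Provider",
    "provider": {"name": "Example Content Hub"},
    "sourceResource": {
        "creator": "Unknown photographer",
        "date": [{"begin": "1908-01-01", "end": "1908-12-31",
                  "displayDate": "1908"}],
        "subject": [{"name": "Annapolis (Md.)"},
                    {"name": "Harbors"}],
    },
}

# sourceResource.subject.name: one entry per object in the subject array
subject_names = [s["name"] for s in item["sourceResource"]["subject"]]
print(subject_names)  # ['Annapolis (Md.)', 'Harbors']
```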
ORIGINAL RECORD (HIGHLY DEPENDENT ON PROVIDER)
originalRecord: Complete original record as provided by the provider dpla
subject: array of subjects
format: A physical material or carrier in which the source resource exists
dates: ?
MISC
isShownAt: An unambiguous URL reference to the digital object on the provider’s web site in its full information context. edm
facets: Groups of items collected by shared field values elasticsearch
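Facets are requested by naming the field(s) in the `facets` query parameter of the v2 API. A sketch of building such a request URL (the query term and faceted field here are just examples, and the key is a placeholder):

```python
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"  # placeholder, not a real key
params = {
    "q": "Annapolis",
    "facets": "sourceResource.subject.name",
    "api_key": API_KEY,
}
# The DPLA items endpoint returns a "facets" object alongside "docs"
url = "https://api.dp.la/v2/items?" + urlencode(params)
print(url)
```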
#DPLA URL: https://dp.la/item/b8daeaff18b57a399d8156313bbd944d
#ORIGINAL URL: http://cdm16795.contentdm.oclc.org/cdm/ref/collection/msaphotos/id/2051
pp = pprint.PrettyPrinter(depth=4)
item_id = "b8daeaff18b57a399d8156313bbd944d"
record = dpla.fetch_by_id([item_id])
pp.pprint(record.items[0]['sourceResource'])
pp.pprint(record.items[0]['originalRecord'])
print('DPLA SUBJECTS')
pp.pprint(record.items[0]['sourceResource']['subject'])
pp = pprint.PrettyPrinter()
pp.pprint(record.items[0]['originalRecord']['metadata']['mods']['subject'])
from nltk.util import ngrams
from nltk import word_tokenize

# `text` should hold the string to analyze; a placeholder value here
text = "Annapolis Maryland State Archives photograph collection"

tokens = word_tokenize(text)
# ngrams() returns a generator, so materialize the results before printing
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))
qgrams = list(ngrams(tokens, 4))

print(bigrams)
print("")
print(trigrams)
print("")
print(qgrams)