In [ ]:
#QUESTIONS:

# What do we mean by loss of fidelity? For example, DPLA sometimes moves a value out of subject and into another field, such as location.
In [ ]:
# C. Enrichment
# “Enrichment” refers to the process DPLA uses to enhance original records from partners with
# additional data, typically in the form of standardized versions of names and places as well as
# URIs to LOD vocabularies such as GeoNames or the Virtual International Authority File (VIAF). DPLA
# has developed services to check these authorities for the values in specific partner-supplied metadata fields
# and record any matches. DPLA currently performs such enrichments only on place names; however, partner-
# supplied URIs for subject headings or name authorities are stored within MAP properties. DPLA also performs
# enrichments that remove extraneous punctuation and whitespace, normalize date formats to a standard (yyyy-mm-dd),
# and perform other cleanup tasks.
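
The cleanup enrichments described above (trimming punctuation and whitespace, normalizing dates to yyyy-mm-dd) can be sketched roughly as follows. This is a minimal illustration only, not DPLA's enrichment code: the helper names `clean_value` and `normalize_date`, the list of accepted input date formats, and the set of punctuation characters stripped are all assumptions made for this example.

```python
import re
from datetime import datetime

# Hypothetical helpers for illustration; not DPLA's actual enrichment code.
# Assumed input formats that a partner might supply.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %B %Y")


def clean_value(value):
    """Collapse runs of whitespace, then strip stray leading/trailing punctuation."""
    value = re.sub(r"\s+", " ", value).strip()
    return value.strip(" ;,.:/")


def normalize_date(value):
    """Return the date as yyyy-mm-dd, or None if no assumed format matches."""
    value = re.sub(r"\s+", " ", value).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


print(clean_value("  Libraries ;  "))
print(normalize_date("12/18/2016"))
print(normalize_date("circa 1900"))
```

Returning `None` for an unparseable date, rather than guessing, keeps the original value available for the display date. A real pipeline would need many more formats and would have to handle date ranges.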


In [20]:
from dpla.api import DPLA

import pprint
pp = pprint.PrettyPrinter(width=25)

# API key used to authenticate requests to the DPLA API
product_key = '80c77ddf7e3c36e8d9daf820ea5d3a93'
# Note: `dpla` now names the API client object, not the package
dpla = DPLA(product_key)

print 'Product Key %s initialized, we are a go!' % product_key
Product Key 80c77ddf7e3c36e8d9daf820ea5d3a93 initialized, we are a go!

MAIN

@id: Used to uniquely identify things that are being described in the document.

dataProvider: Provider of the SourceResource and WebResource edm

id: DPLA ID of a SourceResource within a given context DPLA

ingestDate: Date on which the original record was imported into the DPLA database DPLA

ingestType: Type of record created by ingestion (either item or collection). DPLA

intermediateProvider: An intermediate organization that selects, collates, or curates data from a data provider that is then aggregated by a provider from which DPLA harvests. dpla

score: The relevance score assigned to the item by Elasticsearch elasticsearch


DPLA ITEM RECORD

sourceResource (literal value): dc

sourceResource.contributor: Entity responsible for making contributions to the resource dc

sourceResource.creator: Entity primarily responsible for making sourceResource dc

sourceResource.date: Array containing point or period of time associated with an event in lifecycle of a

sourceResource.date.begin: Date/time of the start of a time span (inclusive). edm

sourceResource.date.displayDate: date to be displayed by an application seeking to provide a date to accompany

sourceResource.date.end: Date/time of the end of a time span (inclusive) edm

sourceResource.physicalMedium: A physical material or carrier in which source resource exists dc

sourceResource.subject: Array containing topic(s) of a SourceResource dc

sourceResource.subject.name: Topic or subject of a SourceResource dc

provider: Service or content hub providing access to the Data Provider's content. edm

provider.name: Human-readable version of provider name edm
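
The dotted field names above (for example `sourceResource.subject.name`) describe paths into nested dictionaries and arrays in the JSON record. A minimal sketch of resolving such a path is shown below. The helper `get_path` is hypothetical and written for this notebook; it is not part of the `dpla` client. The sample item is abbreviated, and its date values are made-up placeholders, not data from a real record.

```python
def get_path(record, dotted_path):
    """Resolve a dotted field name against nested dicts, fanning out over lists."""
    current = [record]
    for key in dotted_path.split("."):
        next_level = []
        for node in current:
            # An array-valued field contributes each of its elements
            candidates = node if isinstance(node, list) else [node]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_level.append(candidate[key])
        current = next_level
    # Flatten one final level so array-valued fields come back as a flat list
    flat = []
    for value in current:
        if isinstance(value, list):
            flat.extend(value)
        else:
            flat.append(value)
    return flat


# Abbreviated sample shaped like the fields listed above.
# The date values are placeholders invented for this example.
item = {
    "sourceResource": {
        "subject": [{"name": "Libraries"}, {"name": "Books"}],
        "date": {"begin": "1950", "end": "1959", "displayDate": "1950s"},
    },
    "provider": {"name": "Missouri Hub"},
}

print(get_path(item, "sourceResource.subject.name"))
print(get_path(item, "sourceResource.date.begin"))
print(get_path(item, "provider.name"))
print(get_path(item, "sourceResource.creator"))
```

A missing field yields an empty list instead of raising `KeyError`, which is convenient because many fields are optional and vary by provider.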


ORIGINAL RECORD (HIGHLY DEPENDENT ON PROVIDER)

originalRecord: Complete original record as provided by the provider dpla

subject: array of subjects

format: A physical material or carrier in which source resource exists

dates: ?


MISC

isShownAt: An unambiguous URL reference to the digital object on the provider’s web site in its full information context. edm

facets: Groups of items collected by shared field values elasticsearch

In [70]:
#DPLA URL: https://dp.la/item/b8daeaff18b57a399d8156313bbd944d
#ORIGINAL URL: http://cdm16795.contentdm.oclc.org/cdm/ref/collection/msaphotos/id/2051

pp = pprint.PrettyPrinter(depth=4)

item_id = "b8daeaff18b57a399d8156313bbd944d"

record = dpla.fetch_by_id([item_id])
pp.pprint(record.items[0]['sourceResource'])
{u'@id': u'http://dp.la/api/items/b8daeaff18b57a399d8156313bbd944d#sourceResource',
 u'collection': {u'@id': u'http://dp.la/api/collections/e01a0f320bb2f2a90a8b856d99f6cb37',
                 u'description': u'',
                 u'id': u'e01a0f320bb2f2a90a8b856d99f6cb37',
                 u'title': u'Mdh_msaphotos'},
 u'creator': [u'Massie, Gerald R. (1911-1989)'],
 u'description': [u'Image showing the library within the Supreme Court building in Jefferson City. A man pulls a book from the bookshelf.'],
 u'format': u'Photograph/Pictorial Works',
 u'identifier': [u'RG005_78_24_0943.tif',
                 u'http://cdm16795.contentdm.oclc.org/cdm/ref/collection/msaphotos/id/2051'],
 u'language': [{u'iso639_3': u'eng', u'name': u'English'}],
 u'relation': [u'Publication Photos Non-Portrait',
               u'Is part of Secretary of State Publications Division, Publication Photos Non-Portrait'],
 u'rights': [u'Copyright is in the public domain. Items reproduced for publication should carry the credit line: Courtesy of the Missouri State Archives.'],
 u'spatial': [{u'city': u'Jefferson City',
               u'coordinates': u'38.5996017456, -92.1636962891',
               u'country': u'US',
               u'county': u'Callaway County',
               u'name': u'Jefferson City',
               u'state': u'MO'},
              {u'coordinates': u'38.5775, -92.1778',
               u'country': u'United States',
               u'county': u'Cole County',
               u'name': u'Cole County',
               u'state': u'Missouri'}],
 u'specType': [u'Photograph/Pictorial Works'],
 u'stateLocatedIn': [{u'name': u'Missouri'}],
 u'subject': [{u'name': u'Libraries'},
              {u'name': u'Books'},
              {u'name': u'Bookstacks'},
              {u'name': u'Men'},
              {u'name': u'Jefferson City (MO)'},
              {u'name': u'Supreme Court Building'}],
 u'title': [u'Supreme Court Library']}
In [71]:
pp.pprint(record.items[0]['originalRecord'])
{u'collection': {u'@id': u'http://dp.la/api/collections/e01a0f320bb2f2a90a8b856d99f6cb37',
                 u'description': u'',
                 u'id': u'e01a0f320bb2f2a90a8b856d99f6cb37',
                 u'title': u'mdh_msaphotos'},
 u'header': {u'datestamp': u'2016-12-18T15:17:10Z',
             u'expirationdatetime': u'2017-01-07T13:39:21Z',
             u'identifier': u'urn:data.mohistory.org:oai:cdm16795.contentdm.oclc.org:msaphotos\\/2051',
             u'setSpec': u'mdh_msaphotos'},
 u'id': u'urn:data.mohistory.org:oai:cdm16795.contentdm.oclc.org:msaphotos\\/2051',
 u'metadata': {u'mods': {u'accessCondition': u'Copyright is in the public domain. Items reproduced for publication should carry the credit line: Courtesy of the Missouri State Archives.',
                         u'genre': u'Photograph/Pictorial Works',
                         u'identifier': [u'RG005_78_24_0943.tif',
                                         u'http://cdm16795.contentdm.oclc.org/cdm/ref/collection/msaphotos/id/2051'],
                         u'language': {u'languageTerm': u'English'},
                         u'location': {u'url': [...]},
                         u'name': {u'namePart': u'Massie, Gerald R. (1911-1989)',
                                   u'role': {...}},
                         u'note': [u'Image showing the library within the Supreme Court building in Jefferson City. A man pulls a book from the bookshelf.',
                                   {...}],
                         u'physicalDescription': {u'note': u'TIFF'},
                         u'relatedItem': [{...}, {...}],
                         u'subject': [{...},
                                      {...},
                                      {...},
                                      {...},
                                      {...},
                                      {...},
                                      {...}],
                         u'titleInfo': {u'title': u'Supreme Court Library'},
                         u'xmlns': u'http://www.loc.gov/mods/v3'}},
 u'provider': {u'@id': u'http://dp.la/api/contributor/missouri-hub',
               u'name': u'Missouri Hub'}}
In [72]:
print 'DPLA SUBJECTS'
pp.pprint(record.items[0]['sourceResource']['subject'])
DPLA SUBJECTS
[{u'name': u'Libraries'},
 {u'name': u'Books'},
 {u'name': u'Bookstacks'},
 {u'name': u'Men'},
 {u'name': u'Jefferson City (MO)'},
 {u'name': u'Supreme Court Building'}]
In [64]:
pp = pprint.PrettyPrinter()
pp.pprint(record.items[0]['originalRecord']['metadata']['mods']['subject'])
[{u'topic': u'Libraries'},
 {u'topic': u'Books'},
 {u'topic': u'Bookstacks'},
 {u'topic': u'Men'},
 {u'cartographics': {u'coordinates': u'38.5775,-92.1778'},
  u'hierarchicalGeographic': {u'city': u'Jefferson City',
                              u'continent': u'North America',
                              u'country': u'US',
                              u'state': u'MO'}},
 {u'topic': u'Jefferson City (MO)'},
 {u'topic': u'Supreme Court Building'}]
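
One way to approach the "loss of fidelity" question is to compare the two subject lists directly. The sketch below copies the values from the two outputs above into plain literals so it runs without an API call. `compare_subjects` is a hypothetical helper written for this notebook, and treating `topic` as the counterpart of `name` is an assumption based on this one MODS record.

```python
# Values copied from the two outputs above, written as plain strings
dpla_subjects = [
    {"name": "Libraries"}, {"name": "Books"}, {"name": "Bookstacks"},
    {"name": "Men"}, {"name": "Jefferson City (MO)"},
    {"name": "Supreme Court Building"},
]
original_subjects = [
    {"topic": "Libraries"}, {"topic": "Books"}, {"topic": "Bookstacks"},
    {"topic": "Men"},
    {"cartographics": {"coordinates": "38.5775,-92.1778"},
     "hierarchicalGeographic": {"city": "Jefferson City",
                                "continent": "North America",
                                "country": "US", "state": "MO"}},
    {"topic": "Jefferson City (MO)"}, {"topic": "Supreme Court Building"},
]


def compare_subjects(dpla_list, original_list):
    """Report topics that differ, plus original entries that are not topics."""
    dpla_names = set(s["name"] for s in dpla_list)
    topics = set(s["topic"] for s in original_list if "topic" in s)
    non_topic = [s for s in original_list if "topic" not in s]
    return {
        "only_in_original": sorted(topics - dpla_names),
        "only_in_dpla": sorted(dpla_names - topics),
        "non_topic_entries": non_topic,
    }


report = compare_subjects(dpla_subjects, original_subjects)
print(report["only_in_original"])
print(report["only_in_dpla"])
print(report["non_topic_entries"])
```

For this record every topic survives unchanged. The one original subject entry without a `topic` is the geographic one, which does not appear in `sourceResource.subject`. Its coordinates match the second `spatial` entry in the `sourceResource` output above, so it appears to have been moved there and not dropped.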
In [25]:
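from nltk.util import ngrams
from nltk import word_tokenize

# `text` was never defined in this cell; as an assumption, use the
# description of the item fetched above.
text = record.items[0]['sourceResource']['description'][0]
tokens = word_tokenize(text)  # needs the NLTK 'punkt' tokenizer data

# ngrams() returns a generator, so build lists before printing
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))
qgrams = list(ngrams(tokens, 4))

print bigrams
print ""
print trigrams
print ""
print qgrams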