JSON and Library Catalog Data

How can we extract data from APIs (machine-readable online data sources)?

In this lesson, we look at how to use data from the web. We will work with real-world data from the National Library of Norway, fetched through the library's search API.

JSON

JSON is a machine-readable data format, which means a computer can read and process the information easily. JSON data is usually tree-structured: values can be nested inside each other on multiple levels, much like files inside a directory tree.
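As a small illustration of the tree structure, the sketch below parses a made-up JSON document (not real catalog data) with Python's built-in json module. The field names are assumptions chosen only for this example.

```python
import json

# A small, made-up JSON document (not real catalog data).
text = '{"title": "Example book", "creators": ["Doe, Jane"], "origin": {"issued": "1981"}}'

record = json.loads(text)           # JSON object -> Python dictionary
print(type(record).__name__)        # the top level is a dict
print(record["origin"]["issued"])   # follow two levels down the tree
print(record["creators"][0])        # a JSON array becomes a Python list
```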

Fetching Data

To fetch data from the web, we can use a library called requests, which makes this task quite easy. Since we are fetching data in the JSON format, we will also import a library to decode JSON data. Libraries are collections of code written by others that we can use instead of writing everything from scratch ourselves.

import requests
import json

We need to specify the URL to the data we want to fetch.

URL = "https://api.nb.no/catalog/v1/items?digitalAccessibleOnly=true&size=3&filter=mediatype:bøker&q=Bing,Jon"

We include some parameters that specify which items we want to load:

  • digitalAccessibleOnly specifies that we only want documents that are available in full text

  • size is the number of items to fetch

  • filter narrows the search, in this case to books

  • q is the search query

More parameters are listed in the API documentation.
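Writing the parameters directly into the URL works, but they can be easier to read and change when they are kept in a dictionary. The sketch below builds an equivalent URL with urlencode from Python's standard library. The names BASE_URL and params are our own choices, not part of the API. Special characters such as ø are percent-encoded in the result.

```python
from urllib.parse import urlencode

# The part of the URL before the question mark.
BASE_URL = "https://api.nb.no/catalog/v1/items"

# The same four parameters as in the URL above, one per line.
params = {
    "digitalAccessibleOnly": "true",
    "size": 3,
    "filter": "mediatype:bøker",
    "q": "Bing,Jon",
}

# urlencode joins the pairs with '&' and percent-encodes special characters.
url = BASE_URL + "?" + urlencode(params)
print(url)
```

The requests library can do the same encoding for us if we pass the dictionary along, as in requests.get(BASE_URL, params=params).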

Now, let’s fetch the data.

data = requests.get(URL).json()

This step both fetches and decodes the JSON data in one line. We can also do this step by step to see how the process works. If you don’t want to get into the details at this point, you can skip ahead to the section “Using the Data”. The server response also contains metadata, but we want the content:

response = requests.get(URL)
content = response.content

We can look at the first 100 bytes of the raw data. The content is a bytes object, which is why the output starts with b'. We can see the same data if we open the URL in a web browser: https://api.nb.no/catalog/v1/items?digitalAccessibleOnly=true&size=3&filter=mediatype:bøker&q=Bing,Jon

print(content[:100])
b'{\n  "_links" : {\n    "self" : {\n      "href" : "https://api.nb.no/catalog/v1/items?q=Bing,Jon&digita'

To use the data, we must decode it. First we convert the raw bytes to text, which requires specifying the character set (often UTF-8). Then we decode the JSON text into a Python dictionary.

text = content.decode("utf-8")
data = json.loads(text)
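The difference between raw bytes and decoded text is easiest to see with a non-ASCII character. The following sketch uses a short, hand-made byte string as a stand-in for response.content, so it runs without a network connection.

```python
import json

# Hand-made stand-in for response.content (an assumption, not real API output).
raw = '{"mediaTypes": ["bøker"]}'.encode("utf-8")

print(raw)                       # bytes: the ø is shown as the two bytes \xc3\xb8
text = raw.decode("utf-8")       # bytes -> str
print(text)                      # str: the ø is readable again
sample = json.loads(text)        # str -> dictionary
print(sample["mediaTypes"][0])
```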

Using the Data

We can print the data, but this produces a lot of text:

print(data)
{'_links': {'self': {'href': 'https://api.nb.no/catalog/v1/items?q=Bing,Jon&digitalAccessibleOnly=true&filter=mediatype:b%C3%B8ker&page=0&size=3'}, 'next': {'href': 'https://api.nb.no/catalog/v1/items?q=Bing,Jon&digitalAccessibleOnly=true&filter=mediatype:b%C3%B8ker&page=1&size=3'}, 'last': {'href': 'https://api.nb.no/catalog/v1/items?q=Bing,Jon&digitalAccessibleOnly=true&filter=mediatype:b%C3%B8ker&page=50&size=3'}}, 'page': {'size': 3, 'totalElements': 1435, 'totalPages': 479, 'number': 0}, '_embedded': {'items': [{'id': 'e3195d7716f6e6dbd02827e43b05fdc1', '_links': {'self': {'href': 'https://api.nb.no:443/catalog/v1/items/e3195d7716f6e6dbd02827e43b05fdc1'}, 'mods': {'href': 'https://api.nb.no:443/catalog/v1/metadata/e3195d7716f6e6dbd02827e43b05fdc1/mods'}, 'dublincore': {'href': 'https://api.nb.no:443/catalog/v1/metadata/e3195d7716f6e6dbd02827e43b05fdc1/dublincore'}, 'presentation': {'href': 'https://api.nb.no:443/catalog/v1/iiif/e3195d7716f6e6dbd02827e43b05fdc1/manifest'}, 'enw': {'href': 'https://api.nb.no:443/catalog/v1/reference/e3195d7716f6e6dbd02827e43b05fdc1/enw'}, 'ris': {'href': 'https://api.nb.no:443/catalog/v1/reference/e3195d7716f6e6dbd02827e43b05fdc1/ris'}, 'wiki': {'href': 'https://api.nb.no:443/catalog/v1/reference/e3195d7716f6e6dbd02827e43b05fdc1/wiki'}, 'thumbnail_custom': {'href': 'https://www.nb.no/services/image/resolver/URN:NBN:no-nb_digibok_2020101258046_C1/full/{width},{height}/0/native.jpg', 'templated': True}, 'thumbnail_large': {'href': 'https://www.nb.no/services/image/resolver/URN:NBN:no-nb_digibok_2020101258046_C1/full/0,200/0/native.jpg'}, 'thumbnail_medium': {'href': 'https://www.nb.no/services/image/resolver/URN:NBN:no-nb_digibok_2020101258046_C1/full/0,128/0/native.jpg'}, 'thumbnail_small': {'href': 'https://www.nb.no/services/image/resolver/URN:NBN:no-nb_digibok_2020101258046_C1/full/0,64/0/native.jpg'}}, 'accessInfo': {'accessAllowedFrom': 'EVERYWHERE', 'viewability': 'ALL', 'license': 'copyrighted', 'accessInfoLastModified': 
'2022-04-28T20:30:07.504Z', 'isDigital': True, 'isPublicDomain': False}, 'metadata': {'title': 'Dobbelt-eksponering : eit balanse-spel av Bing & Bringsværd ; omsett til nynorsk av Halldis Hoaas', 'titleInfos': [{'title': 'Dobbelt-eksponering : eit balanse-spel av Bing & Bringsværd ; omsett til nynorsk av Halldis Hoaas'}], 'geographic': {'city': '[Oslo]'}, 'originInfo': {'publisher': 'Det Norske teatret', 'issued': '1981', 'firstIndexTime': '2024-02-17T23:24:08.705Z', 'firstDigitalContentTime': '2020-10-17T03:03:07.905Z', 'indexLastModified': '2023-06-09T06:05:50.490Z'}, 'identifiers': {'sesamId': 'e3195d7716f6e6dbd02827e43b05fdc1', 'oaiId': 'oai:nb.bibsys.no:999920073084102202', 'urn': 'URN:NBN:no-nb_digibok_2020101258046'}, 'languages': [{'code': 'nno'}], 'pageCount': 6, 'mediaTypes': ['bøker'], 'creators': ['Hoaas, Halldis', 'Bing, Jon', 'Det Norske teatret', 'Bringsværd, Tor Åge'], 'contentClasses': ['public', 'legaldeposit', 'jp2'], 'metadataClasses': ['public']}}, {'id': 'f7c2926938b3b414ac959368ebcdddc0', '_links': {'self': {'href': 'https://api.nb.no:443/catalog/v1/items/f7c2926938b3b414ac959368ebcdddc0'}, 'mods': {'href': 'https://api.nb.no:443/catalog/v1/metadata/f7c2926938b3b414ac959368ebcdddc0/mods'}, 'dublincore': {'href': 'https://api.nb.no:443/catalog/v1/metadata/f7c2926938b3b414ac959368ebcdddc0/dublincore'}, 'presentation': {'href': 'https://api.nb.no:443/catalog/v1/iiif/f7c2926938b3b414ac959368ebcdddc0/manifest'}, 'enw': {'href': 'https://api.nb.no:443/catalog/v1/reference/f7c2926938b3b414ac959368ebcdddc0/enw'}, 'ris': {'href': 'https://api.nb.no:443/catalog/v1/reference/f7c2926938b3b414ac959368ebcdddc0/ris'}, 'wiki': {'href': 'https://api.nb.no:443/catalog/v1/reference/f7c2926938b3b414ac959368ebcdddc0/wiki'}, 'thumbnail_custom': {'href': 'https://www.nb.no/services/image/resolver/URN:NBN:no-nb_digibok_2021020448501_C1/full/{width},{height}/0/native.jpg', 'templated': True}, 'thumbnail_large': {'href': 
'https://www.nb.no/services/image/resolver/URN:NBN:no-nb_digibok_2021020448501_C1/full/0,200/0/native.jpg'}, 'thumbnail_medium': {'href': 'https://www.nb.no/services/image/resolver/URN:NBN:no-nb_digibok_2021020448501_C1/full/0,128/0/native.jpg'}, 'thumbnail_small': {'href': 'https://www.nb.no/services/image/resolver/URN:NBN:no-nb_digibok_2021020448501_C1/full/0,64/0/native.jpg'}}, 'accessInfo': {'accessAllowedFrom': 'EVERYWHERE', 'viewability': 'ALL', 'license': 'ccbyncnd', 'accessInfoLastModified': '2024-02-02T14:36:12.727Z', 'isDigital': True, 'isPublicDomain': False}, 'metadata': {'title': "Er'e ikke det ene, så er'e det andre..., eller Badekaret som ble vekk : av Jon Bing, Svein Erik Brodal, Erik Bye, Tor Edvin Dahl, Knut Faldbakken, Gunnar Bull Gundersen, Helge Hagerup, Jon Michelet, Sidsel Mørck, Arild Nyquist, Vivi Schøyen, Karin Sveen, Jan Erik Vold : med musikk av Egil Kapstad", 'titleInfos': [{'title': "Er'e ikke det ene, så er'e det andre..., eller Badekaret som ble vekk : av Jon Bing, Svein Erik Brodal, Erik Bye, Tor Edvin Dahl, Knut Faldbakken, Gunnar Bull Gundersen, Helge Hagerup, Jon Michelet, Sidsel Mørck, Arild Nyquist, Vivi Schøyen, Karin Sveen, Jan Erik Vold : med musikk av Egil Kapstad"}], 'geographic': {'city': '[Oslo]'}, 'originInfo': {'publisher': 'Riksteatret', 'issued': '1900', 'firstIndexTime': '2024-02-16T18:38:48.227Z', 'firstDigitalContentTime': '2021-02-18T08:01:51.514Z', 'indexLastModified': '2024-02-02T14:36:12.727Z'}, 'identifiers': {'sesamId': 'f7c2926938b3b414ac959368ebcdddc0', 'oaiId': 'oai:nb.bibsys.no:999920112502602202', 'urn': 'URN:NBN:no-nb_digibok_2021020448501'}, 'languages': [{'code': 'nob'}], 'pageCount': 16, 'mediaTypes': ['bøker'], 'creators': ['Riksteatret'], 'contentClasses': ['public', 'CCBYNCND', 'legaldeposit', 'jp2'], 'metadataClasses': ['public']}}, {'id': 'ee333b7284cc60d6ea7a05f95c77add1', '_links': {'self': {'href': 'https://api.nb.no:443/catalog/v1/items/ee333b7284cc60d6ea7a05f95c77add1'}, 'mods': {'href': 
'https://api.nb.no:443/catalog/v1/metadata/ee333b7284cc60d6ea7a05f95c77add1/mods'}, 'dublincore': {'href': 'https://api.nb.no:443/catalog/v1/metadata/ee333b7284cc60d6ea7a05f95c77add1/dublincore'}, 'presentation': {'href': 'https://api.nb.no:443/catalog/v1/iiif/ee333b7284cc60d6ea7a05f95c77add1/manifest'}, 'enw': {'href': 'https://api.nb.no:443/catalog/v1/reference/ee333b7284cc60d6ea7a05f95c77add1/enw'}, 'ris': {'href': 'https://api.nb.no:443/catalog/v1/reference/ee333b7284cc60d6ea7a05f95c77add1/ris'}, 'wiki': {'href': 'https://api.nb.no:443/catalog/v1/reference/ee333b7284cc60d6ea7a05f95c77add1/wiki'}, 'thumbnail_custom': {'href': 'https://www.nb.no/services/image/resolver/URN:NBN:no-nb_digibok_2006111300028_C1/full/{width},{height}/0/native.jpg', 'templated': True}, 'thumbnail_large': {'href': 'https://www.nb.no/services/image/resolver/URN:NBN:no-nb_digibok_2006111300028_C1/full/0,200/0/native.jpg'}, 'thumbnail_medium': {'href': 'https://www.nb.no/services/image/resolver/URN:NBN:no-nb_digibok_2006111300028_C1/full/0,128/0/native.jpg'}, 'thumbnail_small': {'href': 'https://www.nb.no/services/image/resolver/URN:NBN:no-nb_digibok_2006111300028_C1/full/0,64/0/native.jpg'}}, 'accessInfo': {'accessAllowedFrom': 'EVERYWHERE', 'viewability': 'ALL', 'license': 'publicdomain', 'accessInfoLastModified': '2018-10-09T11:32:10.673Z', 'isDigital': True, 'isPublicDomain': True}, 'metadata': {'title': 'Fortegnelse over magistratpersoner i Trondhjem 1377-1922 : med biografiske oplysninger og henvisninger', 'titleInfos': [{'title': 'Fortegnelse over magistratpersoner i Trondhjem 1377-1922 : med biografiske oplysninger og henvisninger'}, {'title': 'Magistratpersoner i Trondhjem 1377-1922', 'type': 'alternative'}], 'geographic': {'city': 'Trondheim'}, 'originInfo': {'publisher': '[s.n.]', 'issued': '1945', 'firstIndexTime': '2024-04-20T10:10:25.555Z', 'firstDigitalContentTime': '2016-01-22T07:56:35.765Z', 'indexLastModified': '2024-04-20T10:10:25.532Z'}, 'identifiers': {'sesamId': 
'ee333b7284cc60d6ea7a05f95c77add1', 'oaiId': 'oai:nb.bibsys.no:999306842654702202', 'urn': 'URN:NBN:no-nb_digibok_2006111300028'}, 'subject': {'geographics': ['Trondheim']}, 'languages': [{'code': 'nob'}], 'pageCount': 67, 'mediaTypes': ['bøker'], 'creators': ['Mathiesen, Henrik'], 'contentClasses': ['public', 'legaldeposit', 'jp2', 'publicdomain'], 'metadataClasses': ['public']}}]}}

Instead, we can print only the keys using list():

keys = list(data)
print(keys)
['_links', 'page', '_embedded']
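Another way to get an overview is to print the data with indentation, so that the nesting becomes visible. The sketch below uses json.dumps on a trimmed-down stand-in for data, with only a few fields copied from the output above. With the real data, the same call would be json.dumps(data, indent=2, ensure_ascii=False).

```python
import json

# A trimmed-down stand-in for the `data` dictionary (most fields left out).
sample = {
    "page": {"size": 3, "totalElements": 1435, "totalPages": 479, "number": 0},
    "_embedded": {"items": [{"id": "e3195d7716f6e6dbd02827e43b05fdc1"}]},
}

# indent=2 puts each level on its own indented lines;
# ensure_ascii=False keeps characters such as ø readable.
pretty = json.dumps(sample, indent=2, ensure_ascii=False)
print(pretty)
```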

Our search results only contain the first few items. We say that we are viewing a page of the results. The field page contains information about the current page. This is a dictionary, and we can print the information:

page = data['page']
print(page)
{'size': 3, 'totalElements': 1435, 'totalPages': 479, 'number': 0}
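The numbers in page appear to fit together: dividing the total number of elements by the page size and rounding up gives the number of pages. A minimal check, using the values printed above (the variable names are our own):

```python
import math

# The page dictionary as printed above.
page = {'size': 3, 'totalElements': 1435, 'totalPages': 479, 'number': 0}

# Number of pages needed to hold all elements, rounded up.
expected_pages = math.ceil(page['totalElements'] / page['size'])
print(expected_pages == page['totalPages'])

# Page numbers start at 0, so the last page is totalPages - 1.
has_more_pages = page['number'] < page['totalPages'] - 1
print(has_more_pages)
```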

The field size contains the number of items per page, which here matches the size we requested in the URL. The total number of hits in the database is in totalElements and is usually much larger. If totalElements is zero, we don’t have any results and need to check the query in the URL.

size = page['size']
print(size)
3

That looks good. Let’s fetch the list of items:

embedded = data['_embedded']
items = embedded['items']

Now we can inspect each item. Let’s loop over the items and get some of the information. The data contains various metadata about each item, such as the item ID.

It’s often useful to look at the data in a web browser to get an overview.

for item in items:
    print("item ID:", item['id'])
item ID: e3195d7716f6e6dbd02827e43b05fdc1
item ID: f7c2926938b3b414ac959368ebcdddc0
item ID: ee333b7284cc60d6ea7a05f95c77add1

Each item has a metadata dictionary that contains the title, among other fields.

for item in items:
    metadata = item['metadata']
    print("Item title:", metadata['title'])
Item title: Dobbelt-eksponering : eit balanse-spel av Bing & Bringsværd ; omsett til nynorsk av Halldis Hoaas
Item title: Er'e ikke det ene, så er'e det andre..., eller Badekaret som ble vekk : av Jon Bing, Svein Erik Brodal, Erik Bye, Tor Edvin Dahl, Knut Faldbakken, Gunnar Bull Gundersen, Helge Hagerup, Jon Michelet, Sidsel Mørck, Arild Nyquist, Vivi Schøyen, Karin Sveen, Jan Erik Vold : med musikk av Egil Kapstad
Item title: Fortegnelse over magistratpersoner i Trondhjem 1377-1922 : med biografiske oplysninger og henvisninger
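Not every item has every field. In the output above, for example, only the third item has a subject entry in its metadata, and looking up a missing key with square brackets raises a KeyError. One option is the dictionary method get, which returns a default value instead. The sketch below uses two hand-made items with made-up titles in place of the real search results.

```python
# Hand-made stand-ins for two search results (made-up titles).
sample_items = [
    {"metadata": {"title": "Book without subject"}},
    {"metadata": {"title": "Book with subject",
                  "subject": {"geographics": ["Trondheim"]}}},
]

all_places = []
for item in sample_items:
    metadata = item["metadata"]
    # get() returns its second argument when the key is missing.
    subject = metadata.get("subject", {})
    places = subject.get("geographics", [])
    all_places.append(places)
    print("Item title:", metadata["title"], "| places:", places)
```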

Exercise: Creator

Complete the code below to print the creator of each item. You will need to browse the data.

import requests
import json

URL = "https://api.nb.no/catalog/v1/items?digitalAccessibleOnly=true&size=3&filter=mediatype:bøker&q=Bing,Jon"
data = requests.get(URL).json()
embedded = data['_embedded']
items = embedded['items']

for item in items:
    metadata = item['metadata']
    print("Item title:", metadata['title'])
    #your code here
Item title: Dobbelt-eksponering : eit balanse-spel av Bing & Bringsværd ; omsett til nynorsk av Halldis Hoaas
Item title: Er'e ikke det ene, så er'e det andre..., eller Badekaret som ble vekk : av Jon Bing, Svein Erik Brodal, Erik Bye, Tor Edvin Dahl, Knut Faldbakken, Gunnar Bull Gundersen, Helge Hagerup, Jon Michelet, Sidsel Mørck, Arild Nyquist, Vivi Schøyen, Karin Sveen, Jan Erik Vold : med musikk av Egil Kapstad
Item title: Fortegnelse over magistratpersoner i Trondhjem 1377-1922 : med biografiske oplysninger og henvisninger

Following the Path

As mentioned, JSON data is a tree structure. It can contain many nested levels. In that case, we need to follow the path to find the entry we’re looking for. It’s usually advisable to follow the path one step at a time. This makes it easier to find errors in our programs.

URL = "https://api.nb.no/catalog/v1/items?digitalAccessibleOnly=true&size=3&filter=mediatype:bøker&q=Bing,Jon"
data = requests.get(URL).json()
embedded = data['_embedded']
items = embedded['items']

for item in items:
    metadata = item['metadata']
    print("Item title:", metadata['title'])
    
    # Step-by-step:
    links = item['_links']
    self_link = links['self']  # avoid the name 'self', which is conventionally used inside classes
    href = self_link['href']
    print("Item URL:", href)

    # We start a new path from links:
    thumbnail_large = links['thumbnail_large']
    thumbnail_URL = thumbnail_large['href'] # have already used the name 'href'
    print('Thumbnail:', thumbnail_URL)

    # Extra linebreak:
    print()
Item title: Dobbelt-eksponering : eit balanse-spel av Bing & Bringsværd ; omsett til nynorsk av Halldis Hoaas
Item URL: https://api.nb.no:443/catalog/v1/items/e3195d7716f6e6dbd02827e43b05fdc1
Thumbnail: https://www.nb.no/services/image/resolver/URN:NBN:no-nb_digibok_2020101258046_C1/full/0,200/0/native.jpg

Item title: Er'e ikke det ene, så er'e det andre..., eller Badekaret som ble vekk : av Jon Bing, Svein Erik Brodal, Erik Bye, Tor Edvin Dahl, Knut Faldbakken, Gunnar Bull Gundersen, Helge Hagerup, Jon Michelet, Sidsel Mørck, Arild Nyquist, Vivi Schøyen, Karin Sveen, Jan Erik Vold : med musikk av Egil Kapstad
Item URL: https://api.nb.no:443/catalog/v1/items/f7c2926938b3b414ac959368ebcdddc0
Thumbnail: https://www.nb.no/services/image/resolver/URN:NBN:no-nb_digibok_2021020448501_C1/full/0,200/0/native.jpg

Item title: Fortegnelse over magistratpersoner i Trondhjem 1377-1922 : med biografiske oplysninger og henvisninger
Item URL: https://api.nb.no:443/catalog/v1/items/ee333b7284cc60d6ea7a05f95c77add1
Thumbnail: https://www.nb.no/services/image/resolver/URN:NBN:no-nb_digibok_2006111300028_C1/full/0,200/0/native.jpg
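Once we are confident about a path, the individual steps can be wrapped in a small helper. The function follow_path below is hypothetical (it is not part of requests or json). It walks down a list of keys and reports which step failed. The sample item contains only a few fields copied from the first search result.

```python
def follow_path(node, path):
    """Follow a sequence of keys (or list indexes) down into nested data."""
    for step in path:
        try:
            node = node[step]
        except (KeyError, IndexError, TypeError) as error:
            # Report the step that failed, so the error is easy to locate.
            raise KeyError(f"Could not follow step {step!r} in path {path!r}") from error
    return node


# A few fields copied from the first item in the output above.
sample_item = {
    "id": "e3195d7716f6e6dbd02827e43b05fdc1",
    "_links": {
        "self": {"href": "https://api.nb.no:443/catalog/v1/items/e3195d7716f6e6dbd02827e43b05fdc1"},
    },
    "metadata": {"languages": [{"code": "nno"}]},
}

print(follow_path(sample_item, ["_links", "self", "href"]))
# A path can also pass through a list, using an index as one of the steps.
print(follow_path(sample_item, ["metadata", "languages", 0, "code"]))
```

The step-by-step version is still the better choice while exploring unfamiliar data, since each intermediate value can be printed and inspected.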

Exercise: Presentation and URN

The field presentation contains a link to the full text. Complete the code below to print the presentation URL and the ‘URN’ of each item.

You will need to browse the data.

import requests
import json

URL = "https://api.nb.no/catalog/v1/items?digitalAccessibleOnly=true&size=3&filter=mediatype:bøker&q=Bing,Jon"
data = requests.get(URL).json()
embedded = data['_embedded']
items = embedded['items']

for item in items:
    metadata = item['metadata']
    print("Item title:", metadata['title'])
    #your code here
Item title: Dobbelt-eksponering : eit balanse-spel av Bing & Bringsværd ; omsett til nynorsk av Halldis Hoaas
Item title: Er'e ikke det ene, så er'e det andre..., eller Badekaret som ble vekk : av Jon Bing, Svein Erik Brodal, Erik Bye, Tor Edvin Dahl, Knut Faldbakken, Gunnar Bull Gundersen, Helge Hagerup, Jon Michelet, Sidsel Mørck, Arild Nyquist, Vivi Schøyen, Karin Sveen, Jan Erik Vold : med musikk av Egil Kapstad
Item title: Fortegnelse over magistratpersoner i Trondhjem 1377-1922 : med biografiske oplysninger og henvisninger

Working with Lists

Each item in the data set contains a list of one or more creators. These lists are located in each item’s metadata field. We can use a for-loop to process the list items:

for item in items:
    metadata = item['metadata']
    print("Item title:", metadata['title'])
    creators = metadata['creators']
    for creator in creators:
        print('Creator:', creator)
    print()  # insert empty line
Item title: Dobbelt-eksponering : eit balanse-spel av Bing & Bringsværd ; omsett til nynorsk av Halldis Hoaas
Creator: Hoaas, Halldis
Creator: Bing, Jon
Creator: Det Norske teatret
Creator: Bringsværd, Tor Åge

Item title: Er'e ikke det ene, så er'e det andre..., eller Badekaret som ble vekk : av Jon Bing, Svein Erik Brodal, Erik Bye, Tor Edvin Dahl, Knut Faldbakken, Gunnar Bull Gundersen, Helge Hagerup, Jon Michelet, Sidsel Mørck, Arild Nyquist, Vivi Schøyen, Karin Sveen, Jan Erik Vold : med musikk av Egil Kapstad
Creator: Riksteatret

Item title: Fortegnelse over magistratpersoner i Trondhjem 1377-1922 : med biografiske oplysninger og henvisninger
Creator: Mathiesen, Henrik
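When we only want one line per item, the list can be combined into a single string with the string method join. A minimal sketch, using the creators list of the first item from the output above:

```python
# The creators of the first item, copied from the output above.
creators = ['Hoaas, Halldis', 'Bing, Jon', 'Det Norske teatret', 'Bringsværd, Tor Åge']

# The names themselves contain commas, so we separate them with semicolons.
joined = "; ".join(creators)
print("Creators:", joined)
print("Number of creators:", len(creators))
```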

Key Points

  • The requests library can be used to fetch data from the web.

  • Many data providers offer an API from which we can fetch data programmatically.

  • Parameters can be used to control what data we get from an API.

  • Many APIs provide data in the JSON format, and JSON is well supported in Python.

  • Additional filtering and processing of the retrieved data can be done using loops and conditions (if-statements, covered in the next chapter).