SQR-086

Deep linking to data documentation with IVOA DataLink#

Abstract

When presenting LSST data through VO services, we want to annotate that data with links to documentation, such as the table and column descriptions in data release documentation sites hosted on lsst.io. This technote describes a system where we implement a linking service called Hoverdrive that uses the IVOA DataLink standard to provide links to table and column documentation. The Hoverdrive link service, which resides in the Rubin Science Platform (RSP), is called by TAP schema queries. In turn, Hoverdrive queries Ook, Rubin Observatory’s documentation metadata service that indexes the link inventories of documentation sites. These link inventories are prepared and included with Sphinx documentation builds using Sphinx extensions provided by the Documenteer package. With this standards-based approach, clients like the Portal can show descriptions and links to data documentation from their user interfaces.

High-level overview#

This overview traces the system’s component architecture from the end-user’s perspective through to the intention of the documentation author.

Consider a Rubin Science Platform user who is working in the Portal to query and view LSST data.[1] The Firefly-based Portal shows information about tables and columns by making queries against the TAP schema. The LSST TAP schemas are annotated with DataLink service descriptors pointing to link endpoints, also hosted in the RSP, that provide URLs to related entities like table and column descriptions in LSST data release documentation. These service descriptors are added to the TAP schema through definition files in sdm_schemas, which is managed through Felis. The service descriptors specify access URLs, including query parameters, to link endpoints provided by another RSP service, hoverdrive. The link endpoints, through the hoverdrive service, operate in the RSP so that they can be aware of what data is available in that RSP and can make any last-mile transformations to the links, including adding RSP-specific links. See VO documentation linking.

Most documentation links will point to documentation sites hosted outside the Rubin Science Platform on lsst.io, which is managed by the LSST the Docs (SQR-006) static documentation site platform. So that the link service in the RSP doesn’t have to be aware in detail of these documentation sites, and also so that other data facilities can mix in other documentation sources for their RSP deployments, we propose that the link service also acts as a proxy to a link provider that lives closer to the documentation domain.

The Ook documentation librarian service, which is hosted in the Roundtable internal services Kubernetes cluster, already indexes Rubin documentation sites. For this application, Ook provides a new links API that provides deep links into the documentation sites for specific entities like data release table and column documentation. See The Ook links service.

To associate semantic meaning with the deep link anchors in documentation, Ook will interface specifically with the Intersphinx object inventory files that Sphinx projects publish to their /objects.inv paths. These object inventories associate hyperlink targets with all domain entities in a Sphinx project. To make meaningful domain entities, we will introduce a custom Sphinx domain for Rubin Observatory documentation to mark up data release reference documentation and related concepts. See Link inventories in Sphinx documentation.

Overall this system provides a standards-based approach both for linking to documentation with VO clients and for authoring documentation with Sphinx.

Link inventories in Sphinx documentation#

In this architecture, TAP tables and columns are annotated with links to Rubin documentation, such as a data release documentation site. This documentation is built with the Sphinx/Documenteer toolchain. To accomplish this, we will build upon the core Sphinx technologies of domains and Intersphinx to structure and annotate Rubin data release documentation so that it is cross-referenceable with machine-readable link inventories.

Marking up documentation with link anchors#

In the documentation source, we will use custom extensions (reStructuredText roles and directives) provided through Documenteer to annotate specific pages and sections as documenting a table or column in a data release. These Sphinx extensions will be part of a Rubin Observatory Sphinx domain. Sphinx domains are collections of directives that allow writers to document specific types of entities and cross reference those. Sphinx includes built-in domains for Python, C++, and other programming languages, which is how Sphinx API references are built.

An example of how a table and column might be documented in a Sphinx project:

.. rubin:table:: Visit
   :release: dp02_dc2_catalogs

Content about the ``Visit`` table goes here
for the DP0.2 DC2 catalogs.

.. rubin:column:: physical_filter
   :table: Visit
   :release: dp02_dc2_catalogs

Content about the ``physical_filter`` column
goes here for the DP0.2 DC2 catalogs
and ``Visit`` table.

These custom directives (e.g., :rubin:table:, :rubin:column:) leave structured hyperlink anchors in the generated HTML output. Complementary Sphinx roles let writers cross-reference these entities in other parts of the documentation project (or even in other Sphinx projects with Intersphinx):

See the :rubin:table:`Visit` table for more information.

The filter for the observation is given
in the :rubin:column:`physical_filter` column.

Publishing link inventories from Sphinx documentation#

By integrating with the Sphinx domains API, the inventory of all Rubin documentation entities, like data release tables and columns, is automatically part of the Intersphinx object inventory. Intersphinx publishes this inventory as a file (objects.inv) that is hosted alongside the HTML documentation site. Although the objects.inv format is somewhat opaque, Sphinx provides a Python API for reading it. We will use that API in the Ook link service.
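To illustrate what such an inventory contains, the following is a minimal sketch that decodes a version 2 objects.inv payload using only the Python standard library. It is not the Sphinx API mentioned above, which remains the intended route for Ook. The ``rubin`` domain, its ``table`` and ``column`` object types, and the sample entries are assumptions made for illustration.

```python
import re
import zlib

# Each decompressed line holds: name, domain:role, priority, uri, display name.
_ENTRY = re.compile(r"(.+?)\s+(\S+)\s+(-?\d+)\s+?(\S*)\s+(.*)")


def parse_inventory(raw: bytes, base_url: str) -> list[dict[str, str]]:
    """Decode a version 2 objects.inv payload into a list of entries."""
    parts = raw.split(b"\n", 4)  # four plain-text header lines, then zlib data
    if len(parts) < 5 or not parts[0].startswith(b"# Sphinx inventory version 2"):
        raise ValueError("not a version 2 Sphinx inventory")
    body = zlib.decompress(parts[4]).decode("utf-8")
    entries = []
    for line in body.splitlines():
        match = _ENTRY.match(line.rstrip())
        if match is None:
            continue
        name, objtype, _priority, uri, dispname = match.groups()
        domain, _, role = objtype.partition(":")
        if uri.endswith("$"):
            # A trailing "$" is shorthand for the entry name.
            uri = uri[:-1] + name
        entries.append(
            {
                "name": name,
                "domain": domain,
                "role": role,
                "url": f"{base_url.rstrip('/')}/{uri}",
                "title": name if dispname == "-" else dispname,
            }
        )
    return entries


def build_sample_inventory() -> bytes:
    """Build a tiny inventory with hypothetical rubin-domain entries."""
    header = (
        b"# Sphinx inventory version 2\n"
        b"# Project: Example\n"
        b"# Version: 1.0\n"
        b"# The remainder of this file is compressed using zlib.\n"
    )
    body = (
        "Visit rubin:table 1 reference/tables/Visit.html#$ -\n"
        "Visit.physical_filter rubin:column 1 "
        "reference/tables/Visit.html#physical_filter -\n"
    )
    return header + zlib.compress(body.encode("utf-8"))
```

Calling ``parse_inventory(build_sample_inventory(), "https://dr1.lsst.io")`` yields one entry per documented entity, each carrying the domain, the object type, and an absolute URL that a link service could store.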

The Ook links service#

Ook is an existing SQuaRE application that serves as a documentation librarian. Ook’s established role is to index documents and populate the Algolia full-text search database that powers the Rubin Observatory documentation search at www.lsst.io. We propose to extend Ook to also index link inventories (for example the objects.inv Intersphinx inventory files of Sphinx projects, but generally any relevant and linkable documentation or information source). The Ook link service would sync these inventories into a Postgres database and then provide a REST API for querying the inventories.

See The Ook links API for a discussion of the web API and Ook’s database model for the database modeling.

flowchart LR
  objadapter[Object Inventory Adapter]
  objects["dr1.lsst.io/objects.inv"]
  documenteer[Documenteer Sphinx Domain]
  service[Ook Link Service]
  db[Postgres Database]
  api[Ook Link API]
  vo[VO data linking service]
  vo --> api
  api --> service
  service --> objadapter
  objadapter --> objects
  documenteer --> objects
  service --> db

Internally, the Ook link service would follow a process like this:

1. Based on a manual trigger, or a Kafka message from the LTD documentation publishing system, Ook would begin an ingest of the project’s link inventory. This trigger is similar to how Ook’s Algolia indexing for a documentation project is triggered.
2. Ook’s adapter for the Sphinx objects.inv file format downloads and reads the inventory file.
3. The Ook link service upserts the entities from the inventory into a Postgres database. Ook maintains the schemas for these object inventory tables, given that the Ook API also needs to be aware of the Sphinx domains it publishes.
4. The Ook link service provides a REST API for querying the link inventory.
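The upsert step in the process above can be sketched as follows. This is an illustration under stated assumptions rather than Ook’s implementation: an in-memory SQLite database stands in for Postgres (both accept this ``ON CONFLICT`` form), the ``links`` table is reduced to a few columns, and the uniqueness constraint on ``html_url`` is hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

# Reduced stand-in for the links table; the UNIQUE constraint is an assumption.
SCHEMA = """
CREATE TABLE links (
    id INTEGER PRIMARY KEY,
    type TEXT NOT NULL,
    html_url TEXT NOT NULL UNIQUE,
    source_title TEXT NOT NULL,
    date_updated TEXT NOT NULL
)
"""

UPSERT = """
INSERT INTO links (type, html_url, source_title, date_updated)
VALUES (:type, :html_url, :source_title, :date_updated)
ON CONFLICT (html_url) DO UPDATE SET
    type = excluded.type,
    source_title = excluded.source_title,
    date_updated = excluded.date_updated
"""


def upsert_links(conn: sqlite3.Connection, entries: list[dict[str, str]]) -> None:
    """Insert new links and refresh existing ones, keyed on html_url."""
    now = datetime.now(timezone.utc).isoformat()
    rows = [{**entry, "date_updated": now} for entry in entries]
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, rows)
```

Because the statement is idempotent for a given URL, re-running an ingest after a documentation rebuild refreshes titles and timestamps without duplicating rows.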

The Ook links API#

Ook’s link API would be structured around the different information domains. Some of these domains would map directly to the Sphinx/Intersphinx domains such as the Rubin domain for linking to Rubin data products and other entities. For example, to get the links to a Science Data Model (SDM) column’s documentation:

GET /ook/links/domains/sdm/schemas/dp02_dc2_catalogs/tables/Visit/columns/physical_filter

With the same technology, we can provide a generic API for other Sphinx domains:

GET /ook/links/domains/python/modules/lsst.afw.table

Discovery and URL templating#

The root endpoints for each link domain would provide templated URLs for the different link endpoints, categorized around links to specific entities, or a collection of entities:

GET /ook/links/domains/sdm

{
  "entities": {
    "schema": "/ook/links/domains/sdm/schemas/{schema}",
    "table": "/ook/links/domains/sdm/schemas/{schema}/tables/{table}",
    "column": "/ook/links/domains/sdm/schemas/{schema}/tables/{table}/columns/{column}"
  },
  "collections": {
    "schemas": "/ook/links/domains/sdm/schemas",
    "tables": "/ook/links/domains/sdm/schemas/{schema}/tables",
    "columns": "/ook/links/domains/sdm/schemas/{schema}/tables/{table}/columns"
  }
}

So long as the names for the entities and URL template variables are well known, this root endpoint can provide a discovery and auto-configuration layer for clients.
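As a sketch of how a client might use this discovery document, the snippet below expands one of the templated URLs. The ``expand`` helper is hypothetical, and it assumes the templates use only simple ``{name}`` variables, as in the example response above.

```python
from urllib.parse import quote

# Subset of the discovery response shown above.
SDM_ROOT = {
    "entities": {
        "schema": "/ook/links/domains/sdm/schemas/{schema}",
        "table": "/ook/links/domains/sdm/schemas/{schema}/tables/{table}",
        "column": (
            "/ook/links/domains/sdm/schemas/{schema}"
            "/tables/{table}/columns/{column}"
        ),
    }
}


def expand(template: str, **variables: str) -> str:
    """Fill a URL template, percent-encoding each value as one path segment."""
    encoded = {name: quote(value, safe="") for name, value in variables.items()}
    # str.format raises KeyError if a template variable is not supplied.
    return template.format(**encoded)


column_url = expand(
    SDM_ROOT["entities"]["column"],
    schema="dr1",
    table="Visit",
    column="physical_filter",
)
```

Percent-encoding each value keeps an unexpected character in a table or column name from changing the path structure.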

Structure of an entity link API#

The entity linking APIs let a client get the links for a specific entity based on the URL structure:

GET /ook/links/domains/sdm/schemas/dr1/tables/Visit/columns/physical_filter

The JSON response for a specific entity is an array of links:

[
  {
    "url": "https://dr1.lsst.io/reference/tables/Visit#physical_filter",
    "type": "schema_browser",
    "source_title": "physical_filter column",
    "source_collection_title": "Data Release 1 Documentation"
  }
]

The link responses anticipate that multiple links might be associated with a single entity. For one, the “pull” nature of the Ook link service means that multiple documentation sites might claim to document the same entity. To help clients distinguish between multiple links, Ook can provide some context for the links (whether it is a documentation site, or a document/technote, or a tutorial notebook, etc.). As well, Ook can provide the name of the site that hosts the link.

Structure of the entity collections API#

A client may need bulk access to links for a collection of entities without needing to make a large number of HTTP requests. For example, a client may need all columns in a table, or all tables in a data release. For these cases, the collections APIs can provide an array of entities and their links:

GET /ook/links/domains/sdm/schemas/dr1/tables/Object/columns

With a query string syntax, we could let the client get a subset of the collection. For example, all columns that start with a prefix:

GET /ook/links/domains/sdm/schemas/dr1/tables/Object/columns?prefix=shape_

The response for collections is an array of entities, and each entity has an array of links like in the entity link API:

[
  {
    "schema_name": "dr1",
    "table_name": "Object",
    "column_name": "shape_flag",
    "links": [
      {
        "url": "https://dr1.lsst.io/reference/tables/Object#shape_flag",
        "type": "schema_browser",
        "source_title": "shape_flag column",
        "source_collection_title": "Data Release 1 Documentation"
      }
    ]
  },
  {
    "schema_name": "dr1",
    "table_name": "Object",
    "column_name": "shape_xx",
    "links": [
      {
        "url": "https://dr1.lsst.io/reference/tables/Object#shape_xx",
        "type": "schema_browser",
        "source_title": "shape_xx column",
        "source_collection_title": "Data Release 1 Documentation"
      }
    ]
  },
  {
    "schema_name": "dr1",
    "table_name": "Object",
    "column_name": "shape_xy",
    "links": [
      {
        "url": "https://dr1.lsst.io/reference/tables/Object#shape_xy",
        "type": "schema_browser",
        "source_title": "shape_xy column",
        "source_collection_title": "Data Release 1 Documentation"
      }
    ]
  }
]
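A client could model this collection response with a pair of small data classes. The sketch below is only an illustration: the class and function names are hypothetical, and the fields mirror the example response above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    url: str
    type: str
    source_title: str
    source_collection_title: str


@dataclass(frozen=True)
class ColumnLinks:
    schema_name: str
    table_name: str
    column_name: str
    links: tuple[Link, ...]


def parse_column_collection(payload: str) -> list[ColumnLinks]:
    """Parse the JSON array returned by a columns collection endpoint."""
    return [
        ColumnLinks(
            schema_name=item["schema_name"],
            table_name=item["table_name"],
            column_name=item["column_name"],
            links=tuple(Link(**link) for link in item["links"]),
        )
        for item in json.loads(payload)
    ]
```

Keeping the links as a tuple per entity preserves the possibility, discussed earlier, that one column is documented by several sources.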

Note

Many entities in the Rubin domain described here are naturally hierarchical. A data release’s schema contains tables, and those tables contain columns. It could be useful to include child entities in the response for a parent entity (essentially embedding the collections API for the child entities in the response for the parent entity). If we do this, we should study how other APIs handle pagination in these types of responses.

Ook’s database model#

Ook’s link service is backed by a Postgres datastore. See Fig. 1 for a visualization of the database schema.

erDiagram
    sdm_schemas ||--o{ sdm_tables : contains
    sdm_tables ||--o{ sdm_columns : contains
    sdm_schemas ||--o{ links_sdm_schemas : has
    sdm_tables ||--o{ links_sdm_tables : has
    sdm_columns ||--o{ links_sdm_columns : has
    links ||--|| links_sdm_schemas : inherits
    links ||--|| links_sdm_tables : inherits
    links ||--|| links_sdm_columns : inherits

    sdm_schemas {
        BigInteger id PK
        string name
        string felis_id
        string description
        string github_owner
        string github_repo
        string github_ref
        string github_path
        datetime date_updated
    }

    sdm_tables {
        BigInteger id PK
        BigInteger schema_id FK
        string name
        string felis_id
        string description
        BigInteger tap_table_index
        datetime date_updated
    }

    sdm_columns {
        BigInteger id PK
        BigInteger table_id FK
        string name
        string felis_id
        string description
        string datatype
        string ivoa_ucd
        string ivoa_unit
        BigInteger tap_column_index
        datetime date_updated
    }

    links {
        BigInteger id PK
        string type
        string html_url
        string source_type
        string source_title
        string source_collection_title
        datetime date_updated
    }

    links_sdm_schemas {
        BigInteger id PK,FK
        BigInteger schema_id FK
    }

    links_sdm_tables {
        BigInteger id PK,FK
        BigInteger table_id FK
    }

    links_sdm_columns {
        BigInteger id PK,FK
        BigInteger column_id FK
    }

Fig. 1 Ook database schema that backs the Link Service, specifically related to links to Science Data Model (SDM) entities.#

The links table#

All links are stored in a common table, links. These links have a website URL, a title, a type, and information about the documentation collection that they’re part of.

The type field is a controlled vocabulary of resource content types, which may include guide, tutorial, schema_browser, document, and so on. This field helps clients understand what kind of resource they’re linking to.

The source_collection_title field is the title of the website the link is part of. For example, links to a schema in the DP1 data release documentation would have a source_collection_title of “LSST Data Release 1 Documentation.” With this generality, any type of link can be stored in this links table, whether it is a link to a section in a document, a link to a method in a Python API reference, or a link to a column in a schema browser.

Link subtypes#

The Ook Links API demands that links have a structured context. For example, consider the SDM columns links endpoint:

GET /ook/links/domains/sdm/schemas/:schema/tables/:table/columns/:column

This endpoint requires that links are contextually associated with a specific SDM schema, table, and column. To provide links with this context, the database schema includes an additional table for each kind of domain entity, annotating each link with the entity it is associated with. For example, the SDM entity links are stored in tables links_sdm_schemas, links_sdm_tables, and links_sdm_columns. These tables are related to the parent links table through joined-table inheritance.

The link-subtype tables provide entity-specific context. For SDM links, the link subtypes have relationships to a separate set of tables that describe the SDM schema, tables, and columns. By joining across the links table to the subtypes and through to the SDM schema tables, Ook is able to provide links associated with specific SDM schemas, tables, and columns.
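The join described above can be sketched with plain SQL. In this illustration, SQLite stands in for Postgres, the tables are reduced to the columns needed for the join, and the function names and sample rows are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE sdm_schemas (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE sdm_tables (
    id INTEGER PRIMARY KEY,
    schema_id INTEGER NOT NULL REFERENCES sdm_schemas (id),
    name TEXT NOT NULL
);
CREATE TABLE sdm_columns (
    id INTEGER PRIMARY KEY,
    table_id INTEGER NOT NULL REFERENCES sdm_tables (id),
    name TEXT NOT NULL
);
CREATE TABLE links (id INTEGER PRIMARY KEY, type TEXT, html_url TEXT);
-- Joined-table inheritance: the subtype row shares the parent row's key.
CREATE TABLE links_sdm_columns (
    id INTEGER PRIMARY KEY REFERENCES links (id),
    column_id INTEGER NOT NULL REFERENCES sdm_columns (id)
);
"""

COLUMN_LINKS = """
SELECT links.html_url, links.type
FROM links
JOIN links_sdm_columns ON links_sdm_columns.id = links.id
JOIN sdm_columns ON sdm_columns.id = links_sdm_columns.column_id
JOIN sdm_tables ON sdm_tables.id = sdm_columns.table_id
JOIN sdm_schemas ON sdm_schemas.id = sdm_tables.schema_id
WHERE sdm_schemas.name = ? AND sdm_tables.name = ? AND sdm_columns.name = ?
ORDER BY links.id
"""


def column_links(
    conn: sqlite3.Connection, schema: str, table: str, column: str
) -> list[dict[str, str]]:
    """Return the links attached to one SDM column."""
    rows = conn.execute(COLUMN_LINKS, (schema, table, column)).fetchall()
    return [{"url": url, "type": link_type} for url, link_type in rows]


def demo_connection() -> sqlite3.Connection:
    """Create an in-memory database seeded with one hypothetical link."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executescript(
        """
        INSERT INTO sdm_schemas (id, name) VALUES (1, 'dr1');
        INSERT INTO sdm_tables (id, schema_id, name) VALUES (1, 1, 'Visit');
        INSERT INTO sdm_columns (id, table_id, name)
            VALUES (1, 1, 'physical_filter');
        INSERT INTO links (id, type, html_url) VALUES
            (1, 'schema_browser',
             'https://dr1.lsst.io/reference/tables/Visit#physical_filter');
        INSERT INTO links_sdm_columns (id, column_id) VALUES (1, 1);
        """
    )
    return conn
```

The same pattern would apply to the schema and table link subtypes, with one fewer join at each level.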

Modeling domain knowledge in Ook#

A by-product of this work is that Ook now has a structured model of the domain entities that it indexes. For example, Ook’s databases contain the Science Data Model Schemas as ingested from the source GitHub repository (see sdm_schemas, sdm_tables, and sdm_columns in Fig. 1). This information can have interesting applications beyond the links API by providing a structured and accessible source of truth for a broad set of domains across Rubin Observatory. For example, documentation discussing the SDM could have dynamic references to the SDM data in Ook to ensure that their documentation is always up-to-date with the latest schema. This concept is discussed in SQR-087 Structured information service: preliminary notes.

VO documentation linking#

From the Rubin Science Platform, clients won’t directly query the Ook link service. Instead, they will query a VO data linking service, Hoverdrive, that uses the Ook link service as a backend. Hoverdrive uses the IVOA DataLink protocol to provide a standardized interface. There are two parts to the DataLink specification: link endpoints and service descriptors.

Hoverdrive link endpoints#

Hoverdrive provides two sets of endpoints for linking to documentation.

Redirect endpoints#

The first endpoints provide the user with a redirect to the most relevant link for a specific table or column. Per typical VO conventions, the table[2] and column names are passed as query parameters:

GET /api/hoverdrive/table-docs-redirect{?table}
GET /api/hoverdrive/column-docs-redirect{?table,column}

Note

These endpoints are available as of Hoverdrive 0.2.0.

With this redirect functionality, Hoverdrive chooses the most relevant link. Typically this would be configured as a schema browser link or a link to the canonical documentation reference for the table or column.
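One way such a selection could work is a configured ordering of link types. The sketch below is an assumption about how this might be done, not Hoverdrive’s actual logic; the priority tuple and function name are hypothetical, and the link dictionaries follow the shape of the Ook entity link response.

```python
from typing import Optional

# Hypothetical preference order; earlier types win.
TYPE_PRIORITY = ("schema_browser", "guide", "document", "tutorial")


def choose_redirect(
    links: list[dict[str, str]], priority: tuple[str, ...] = TYPE_PRIORITY
) -> Optional[str]:
    """Return the URL of the most relevant link, or None if there are none."""
    if not links:
        return None

    def rank(link: dict[str, str]) -> int:
        link_type = link.get("type", "")
        # Types missing from the priority list sort after all known types.
        return priority.index(link_type) if link_type in priority else len(priority)

    # min() keeps the first link among equally ranked candidates.
    return min(links, key=rank)["url"]
```

Returning ``None`` for an entity without links leaves the endpoint free to decide how to respond, for example with a not-found status.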

Bulk link endpoints#

The second category provides multiple links for one or more tables or columns:

GET /api/hoverdrive/table-docs-links{?table}
GET /api/hoverdrive/column-docs-links{?table,column}

With this approach, the client can get multiple links for a single table or column, and each link is annotated with its type and source. For example, one link could be to the schema browser, and another could be to a tutorial notebook.

To get links in bulk, the VO standard allows query parameters to be repeated to indicate a list of values. For example, to get links for the Visit and Object tables:

GET /api/hoverdrive/table-docs-links?table=dr1.Visit&table=dr1.Object
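A client could build and read such repeated-parameter requests with the Python standard library. In this sketch the base URL ``https://data.example.org`` and the helper name are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def bulk_table_links_url(base_url: str, tables: list[str]) -> str:
    """Build a table-docs-links URL with one 'table' parameter per table."""
    # A list of pairs lets the same key appear more than once.
    query = urlencode([("table", table) for table in tables])
    return f"{base_url.rstrip('/')}/api/hoverdrive/table-docs-links?{query}"


url = bulk_table_links_url("https://data.example.org", ["dr1.Visit", "dr1.Object"])

# On the receiving side, parse_qs collects repeated keys into a list.
requested_tables = parse_qs(urlsplit(url).query)["table"]
```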

For these endpoints, Hoverdrive responds with a VOTable that contains documentation links relevant to the table or column.

Note

The schema for the VOTable response is not yet defined.

Performance#

The hoverdrive endpoints derive their data from the Ook links API. These data may be cached in the hoverdrive service to improve performance.
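As one possible shape for such a cache, the sketch below keeps results in memory for a fixed time-to-live. The class name, the default TTL, and the ``fetch`` callable that would wrap a request to the Ook links API are all assumptions for illustration.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Cache fetch results in memory for a fixed number of seconds."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh
        value = self._fetch(key)  # missing or expired: refresh from the backend
        self._entries[key] = (now + self._ttl, value)
        return value
```

A short TTL bounds how stale a link can be after a documentation site is republished, at the cost of more requests to the backend.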

Service descriptors#

DataLink service descriptors annotate a TAP query result with metadata about endpoints that can be called by the client to get information related to the result. Service descriptors make the Hoverdrive link endpoints available to the client.

For the RSP, DataLink service descriptors are built from templates hosted in the sdm_schemas repository. These are the service descriptors for the redirect endpoints:

<?xml version="1.0" encoding="UTF-8"?>
<VOTABLE xmlns="http://www.ivoa.net/xml/VOTable/v1.2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.2">

  <RESOURCE type="meta" utype="adhoc:service" name="ColumnDocumentationRedirect">
    <DESCRIPTION>
      Redirect to the most relevant documentation link for a column.
    </DESCRIPTION>
    <PARAM name="accessURL" datatype="char" arraysize="*"
          value="$baseUrl$/api/hoverdrive/column-docs-redirect"/>
    <GROUP name="inputParams">
      <PARAM name="table" datatype="char" arraysize="*" use="required">
        <DESCRIPTION>The name of the table.</DESCRIPTION>
      </PARAM>
      <PARAM name="column" datatype="char" arraysize="*" use="required">
        <DESCRIPTION>The name of the column.</DESCRIPTION>
      </PARAM>
    </GROUP>
    <PARAM name="exampleURL" datatype="char" arraysize="*"
          value="$baseUrl$/api/hoverdrive/column-docs-redirect?table=dp02_dc2_catalogs.Object&amp;column=detect_isPrimary">
      <DESCRIPTION>Example request to redirect to the documentation for the 'detect_isPrimary' column in the 'dp02_dc2_catalogs.Object' table.</DESCRIPTION>
    </PARAM>
  </RESOURCE>

  <RESOURCE type="meta" utype="adhoc:service" name="TableDocumentationRedirect">
    <DESCRIPTION>
      Redirect to the most relevant documentation link for a table.
    </DESCRIPTION>
    <PARAM name="accessURL" datatype="char" arraysize="*"
          value="$baseUrl$/api/hoverdrive/table-docs-redirect"/>
    <GROUP name="inputParams">
      <PARAM name="table" datatype="char" arraysize="*" use="required">
        <DESCRIPTION>The name of the table.</DESCRIPTION>
      </PARAM>
    </GROUP>
    <PARAM name="exampleURL" datatype="char" arraysize="*"
          value="$baseUrl$/api/hoverdrive/table-docs-redirect?table=dp02_dc2_catalogs.Object">
      <DESCRIPTION>Example request to redirect to the documentation for the 'dp02_dc2_catalogs.Object' table.</DESCRIPTION>
    </PARAM>
  </RESOURCE>
</VOTABLE>
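To show how a client might consume these descriptors, the sketch below extracts each service’s access URL and input parameter names with the standard library XML parser. The function name, the trimmed ``SAMPLE`` document, and the substitution of ``$baseUrl$`` on the client side are assumptions for illustration; in practice the template is presumably rendered before it reaches the client.

```python
import xml.etree.ElementTree as ET

NS = {"vot": "http://www.ivoa.net/xml/VOTable/v1.2"}

# Trimmed version of the table redirect descriptor shown above.
SAMPLE = """<VOTABLE xmlns="http://www.ivoa.net/xml/VOTable/v1.2" version="1.2">
  <RESOURCE type="meta" utype="adhoc:service" name="TableDocumentationRedirect">
    <PARAM name="accessURL" datatype="char" arraysize="*"
           value="$baseUrl$/api/hoverdrive/table-docs-redirect"/>
    <GROUP name="inputParams">
      <PARAM name="table" datatype="char" arraysize="*" use="required"/>
    </GROUP>
  </RESOURCE>
</VOTABLE>"""


def read_service_descriptors(xml_text: str, base_url: str) -> dict[str, dict]:
    """Map each service descriptor name to its access URL and input params."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    services: dict[str, dict] = {}
    for resource in root.findall("vot:RESOURCE[@utype='adhoc:service']", NS):
        access = resource.find("vot:PARAM[@name='accessURL']", NS)
        if access is None:
            continue  # not a usable service descriptor
        params = resource.findall("vot:GROUP[@name='inputParams']/vot:PARAM", NS)
        services[resource.get("name", "")] = {
            "access_url": access.get("value", "").replace(
                "$baseUrl$", base_url.rstrip("/")
            ),
            "input_params": [param.get("name") for param in params],
        }
    return services
```

With the access URL and parameter names in hand, a client can assemble the redirect request for any table or column it is displaying.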

Note

The details of the service descriptor are not yet finalized.
