Property Graph Standards Update - July 2020

What I did on my summer “vacation”

Keith W. Hare
Convenor, ISO/IEC JTC1 SC32 WG3 Database Languages

Introduction

The international standards committee (ISO/IEC JTC1 SC32 WG3 Database Languages) that is developing the SQL and GQL standards met during June 2020. We had been scheduled to hold a face-to-face meeting in Malmö, Sweden, during the first week of June. Because face-to-face meetings have gone out of fashion this year, WG3 converted the face-to-face meeting into a series of ten 2-hour web conferences spread out across June.

WG3 has participants from Asia, Europe, and North America. No single time works for everyone across these time zones, so we cycled meeting start times. Presenting and discussing technical topics at 3:00 AM body time is not optimal, but WG3 participants made it work.

This meeting included 33 experts from 10 national bodies, liaisons from 3 liaison organizations (SC42 WG2, OGC, and LDBC), and 3 guests from LDBC for one meeting segment.

So, what did we do?

Property Graphs: GQL and SQL/PGQ

GQL (ISO/IEC 39075 Information Technology — Database Languages — GQL) is a full database language to create and manage property graphs and create, read, update, and delete nodes and edges (or vertices and relationships). (Because of widespread existing use, the GQL draft supports the synonyms NODE & VERTEX and RELATIONSHIP & EDGE.)

SQL/PGQ (ISO/IEC 9075-16 Information technology — Database languages SQL — Part 16: SQL Property Graph Queries) is a new add-on part of the Database Language SQL family of standards. SQL/PGQ specifies two major capabilities:

  • Creating property graph views on top of existing tables in an SQL database
  • Querying property graphs using a GRAPH_TABLE function in an SQL FROM clause

The input to the SQL/PGQ GRAPH_TABLE function is a property graph query, sometimes referred to as Graph Pattern Matching or GPM.

Graph Pattern Matching is common between SQL/PGQ and GQL. That is, the syntax accepted in a GRAPH_TABLE function in an SQL FROM clause is identical to the syntax in a GQL graph query.

Because GPM is the same in both draft standards, changes to GPM for SQL/PGQ also apply to the GPM portions of the GQL specification.

GPM includes a number of complex capabilities, including how to match shortest paths and how to prevent queries that match cycles from returning infinite results. In most cases, the required rules and defaults are fairly obvious. However, we also need to handle the obscure corner cases correctly and consistently. Getting this correct has taken a surprising amount of effort, discussion, and cooperation between the participants.

The following examples illustrate a small portion of the capabilities supported by Graph Pattern Matching:

MATCH () -[r:KNOWS WHERE r.since < date("2001-09-11")]-* ()

MATCH (start) [ (p1:Person)-[r:KNOWS]-(p2:Person)
    WHERE p1.age < p2.age AND r.since < date("2001-09-11") ]* (end)
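
To see why cycles need care, here is a minimal Python sketch (not GQL, and not the rule the drafts actually specify) that enumerates paths over a tiny cyclic KNOWS graph. The graph data and the function name trails are made up for illustration. The sketch assumes one possible restriction, never reusing an edge within a path, which is enough to keep the result set finite:

```python
# Hypothetical directed KNOWS edges forming a cycle: a -> b -> c -> a
EDGES = [("a", "b"), ("b", "c"), ("c", "a")]


def trails(start, edges):
    """Yield every path from `start` that never reuses an edge."""
    def extend(node, used, path):
        yield path
        for i, (src, dst) in enumerate(edges):
            # Follow an outgoing edge only if this path has not used it yet.
            if src == node and i not in used:
                yield from extend(dst, used | {i}, path + [dst])

    yield from extend(start, frozenset(), [start])


paths = list(trails("a", EDGES))
for p in paths:
    print(" -> ".join(p))
```

Without the "i not in used" check, the walk around the cycle would never stop. With it, the sketch returns four paths, the longest being a -> b -> c -> a.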

GPM integrates ideas and capabilities from vendors and academic research. The full power of GPM would be worth a blog on its own. This is not that blog.

The GQL and SQL/PGQ specification for Graph Pattern Matching is almost done, at least until we discover the next corner case or specification bug.

We approved several additions to the GQL editor’s working draft, including:

  • Create Graph Type
  • Catalog
  • Numeric Literals

Create Graph and Create Graph Type

CREATE GRAPH TYPE allows a GQL user to create a type and then create one or more graph instances based on that type. The GQL implementation can use the graph type to constrain nodes and edges as data is being inserted into the graph.

The following example uses the pattern-oriented syntax to create a graph type called socialNetworkSchema and then creates a graph based on that type. (The CREATE GRAPH TYPE examples are extracted from the paper “Graph and graph type DDL”.)

CREATE GRAPH TYPE socialNetworkSchema AS {
    (Person :Person&TaxPayer
    {name STRING, dob DATE, taxNo STRING}),
    (City :City {name STRING, state STRING, country STRING}),
    (Company :Company {name STRING, description STRING}),

    (Person)-[LivesIn :LIVES_IN {since DATE}]->(City),
    (Company)-[HeadquartersIn :HEADQUARTERS_IN
        {since DATE}]->(City),
    (Person)-[WorksFor :WORKS_FOR {since DATE}]->(Company),
    (Person)~[MarriedTo :MARRIED_TO
        {since DATE, until DATE}]~(Person)
}

CREATE GRAPH socialNetwork OF socialNetworkSchema

The GQL draft also supports a keyword syntax for creating graphs and graph types. The following example creates the same graph type as the previous example but with a slightly different syntax:
 

CREATE GRAPH TYPE socialNetworkSchema AS {
    NODE Person LABELS Person & TaxPayer
    {name STRING, dob DATE, taxNo STRING},
    
    NODE City LABEL City
    {name STRING, state STRING, country STRING},
    
    NODE Company LABEL Company {name STRING, description STRING},
    
    DIRECTED EDGE LivesIn LABELS LIVES_IN {since DATE}
        CONNECTING (Person TO City),
    DIRECTED EDGE HeadquartersIn
        LABELS HEADQUARTERS_IN {since DATE}
        CONNECTING (Company TO City),
    DIRECTED EDGE WorksFor LABELS WORKS_FOR {since DATE}
        CONNECTING (Person TO Company),
    UNDIRECTED EDGE MarriedTo
        LABELS MARRIED_TO {since DATE, until DATE}
        CONNECTING (Person TO Person)
    }

CREATE GRAPH socialNetwork OF socialNetworkSchema

In the GQL draft, the keyword version of the syntax is syntactically transformed to the pattern-oriented syntax. Supporting both syntax styles is a compromise between participating vendors.

The current GQL draft also allows for schema-less graphs, where a node’s attributes and labels are whatever is inserted.
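
One way to picture the difference between a typed graph and a schema-less graph is the rough Python sketch below. It is not how any GQL implementation works. The names NODE_TYPES and conforms are invented for illustration, property value types are ignored, and the sketch assumes that a node may omit properties but may not add ones the type does not list:

```python
# Hypothetical, simplified stand-in for the node part of socialNetworkSchema:
# each label set maps to the property names that node type allows.
NODE_TYPES = {
    frozenset({"Person", "TaxPayer"}): {"name", "dob", "taxNo"},
    frozenset({"City"}): {"name", "state", "country"},
    frozenset({"Company"}): {"name", "description"},
}


def conforms(labels, properties, node_types=NODE_TYPES):
    """Return True if a node's labels and property names fit a node type.

    Passing node_types=None models a schema-less graph, which accepts anything.
    """
    if node_types is None:
        return True
    allowed = node_types.get(frozenset(labels))
    # Assumption: properties are optional, but unknown properties are rejected.
    return allowed is not None and set(properties) <= allowed


print(conforms({"City"}, {"name": "Malmö", "country": "Sweden"}))  # True
print(conforms({"City"}, {"mayor": "n/a"}))                        # False
print(conforms({"Robot"}, {"model": "X"}, node_types=None))        # True
```

The last call shows the schema-less case: with no graph type to check against, whatever is inserted is accepted.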

Catalog

From the paper “The GQL Catalog”:

The GQL Catalog is an arbitrary structure of Directories that contain named entries of either nested Directories or Schemas.

That is, a catalog root contains directories and/or schemas. A directory contains directories and/or schemas. A schema contains objects such as Graph Types and Graphs. The following diagram, also from “The GQL Catalog”, illustrates the catalog concepts:

[Figure: GQL Catalog Diagram]

This diagram suggests that a schema could contain objects in addition to Graph Types and Graphs. The requirements and details of the additional objects are still under discussion.

The catalog’s directory structure can be navigated and Graphs can be referenced either using the full catalog path or using a path relative to the current default location.
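
As a rough illustration of full versus relative references, the following Python sketch models a catalog as nested dictionaries. Everything in it is an assumption made for the example: the directory name europe, the schema name social, the slash-separated path notation, and the function name resolve. None of these come from the GQL draft.

```python
import posixpath

# Hypothetical catalog: nested dicts are directories or schemas;
# leaf strings stand for the objects a schema contains.
CATALOG = {
    "europe": {                                  # directory
        "social": {                              # schema
            "socialNetworkSchema": "GRAPH TYPE",
            "socialNetwork": "GRAPH",
        },
    },
}


def resolve(path, current="/"):
    """Look up an entry by full path, or by a path relative to `current`."""
    # join() keeps an absolute `path` as-is; normpath() collapses ".." steps.
    absolute = posixpath.normpath(posixpath.join(current, path))
    entry = CATALOG
    for name in absolute.strip("/").split("/"):
        if name:
            entry = entry[name]
    return entry


full = resolve("/europe/social/socialNetwork")
relative = resolve("socialNetwork", current="/europe/social")
print(full, relative)
```

Both calls reach the same graph: the first spells out the full catalog path, while the second relies on the current default location.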

Numeric Literals

The paper “Adding additional numeric literal forms to GQL” fulfilled the promise of the title and added numeric literal forms. The GQL draft already supported decimal integer literals such as 42 and -12345. This paper added support for hexadecimal, octal, and binary literals. For example:

  • Hex: 0x2a
  • Octal: 0o52
  • Binary: 0b101010
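
The three literals above all denote the same value, 42. Python happens to use the same 0x, 0o, and 0b prefixes, so a short Python sketch (not GQL, and no claim about how GQL evaluates them) can confirm the arithmetic:

```python
# Python integer literals use the same 0x / 0o / 0b prefixes as the examples above.
values = {"decimal": 42, "hex": 0x2a, "octal": 0o52, "binary": 0b101010}
print(values)

# Parsing the textual forms, as they might arrive from a device or log file.
# Base 0 tells int() to infer the base from the prefix.
parsed = [int(text, 0) for text in ("42", "0x2a", "0o52", "0b101010")]
print(parsed)
```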

Why support these numeric literal forms? Data is being generated from a wide variety of sources and devices in many formats. These numeric literal forms allow data in these formats to be inserted into a GQL graph without any additional conversion.

You might note that the current version of the SQL standard does not support these numeric literal forms. A fix for that omission is in the works.

What are SC42 WG2, OGC, and LDBC?

SC32 WG3 has liaison relationships with a number of other organizations, including:

  • SC42 WG2 Big Data
  • OGC - Open Geospatial Consortium
  • LDBC - Linked Data Benchmark Council

SC42 is responsible for developing standards related to Artificial Intelligence. WG2 is responsible for standards in the area of Big Data. SC42 WG2 is particularly interested in the SC32 WG3 property graph work.

The Open Geospatial Consortium is a Geospatial industry consortium. Among other things, OGC is working on a whitepaper titled “OGC Benefits of Representing Spatial Data Using Semantic and Graph Technologies”. Within SC32 WG3, we are interested in understanding what is needed in the GQL standard to support geospatial data.

The Linked Data Benchmark Council is an academic and industry consortium that initially focused on defining benchmarks for graph databases. LDBC has since expanded and now includes property graph related working groups:

  • Existing Languages Working Group - (i) identifying and composing a comprehensive list of graph querying features; and (ii) indicating for each feature the level of support afforded by a wide range of query languages in use in industry
  • Formal Semantics Working Group - validating the GQL specification using formal semantics techniques
  • Property Graph Schema Working Group
    • GS Basic - syntax and semantics of the basic graph schemas for property graphs
    • GS Cardinality and Keys - design property graph key constraints and evaluate computability
    • GS Data Model - requirements for properties on properties

The LDBC working groups provided a large amount of input that will be very useful as the GQL standard develops.

Pandemics and Standards Development

You might have noticed that the current global pandemic is affecting travel and face-to-face meetings. For decades, SC32 WG3 has had two or three week-long meetings a year. We usually meet in interesting places and spend the week sitting in a conference room staring at our computers. Because we can’t travel, we have laid out a schedule of monthly web conferences instead of face-to-face meetings. This allows us to stare at our screens from the comfort of our own homes. We’ve never worked this way before, but we have a long history of figuring out how to get the database language standards work done whatever the restrictions, so I am confident that we will continue to make progress.

What’s Next?

SC32 WG3 is planning a Committee Draft (CD) ballot on six parts of the 9075 SQL standard, starting in October 2020:

  • 9075-1 SQL/Framework
  • 9075-2 SQL/Foundation
  • 9075-4 SQL/PSM
  • 9075-11 SQL/Schemata
  • 9075-14 SQL/XML
  • 9075-16 SQL/PGQ

We will use the CD ballot to carefully review new capabilities, particularly SQL/PGQ. Because Graph Pattern Matching is common to SQL/PGQ and GQL, the SQL/PGQ CD ballot will help us get the GQL draft into better shape before we initiate a CD ballot on 39075 GQL in Q1 2021.