Web Data Management in RDF Age
This was the keynote on Day 2 of ICDCS'17, given by Tamer Ozsu. Below are my notes from his talk. The slides for his presentation are available here.
Querying web data presents challenges due to the lack of a schema, its volatility, and its sheer scale. There have been several recent approaches to querying web data, including XML, JSON, and Fusion Tables. This talk is about another approach to maintaining and querying web data: RDF and SPARQL. This last approach is the one recommended by the W3C (World Wide Web Consortium) and is a building block for the semantic web and Linked Open Data (LOD). Here is a diagram denoting the LOD datasets and the links between them as of 2014.
Resource Description Framework (RDF)
In RDF, everything is a uniquely named resource (URI). The ID for Jack Nicholson's resource is JN29704. Resources have defined attributes: y:JN29704 hasName = "Jack Nicholson", y:JN29704 BornOnDate = "1937-04-22". Relationships with other resources can be defined via predicates. (This is "predicate" in the grammar sense, not the logic sense.) Predicates form triples of the form "Subject Predicate Object". The subjects are always URIs, and the objects are either literals or URIs. It turns out biologists are heavy users of RDF, with the UniProt dataset.
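To make the triple model concrete, here is a minimal sketch using Python's rdflib library (my choice for illustration, not something from the talk); the y: namespace URI is a placeholder standing in for the talk's prefix.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

# Placeholder namespace standing in for the talk's "y:" prefix.
Y = Namespace("http://example.org/y/")

g = Graph()
g.bind("y", Y)

# Each fact is a (subject, predicate, object) triple.
# Subjects and predicates are URIs; objects are URIs or literals.
jn = Y["JN29704"]  # Jack Nicholson's resource URI
g.add((jn, Y.hasName, Literal("Jack Nicholson")))
g.add((jn, Y.BornOnDate, Literal("1937-04-22", datatype=XSD.date)))

# Serialize the graph as Turtle to see the triples.
print(g.serialize(format="turtle"))
```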
RDF query model: SPARQL
SPARQL provides an SQL-like flavor of querying. It operates on RDF triples, and _variables_ can now be used in place of the subject, predicate, or object. Thus it is also possible to represent a SPARQL query as a graph. Once you represent the SPARQL query as a graph, answering it reduces to a subgraph matching problem between the RDF data-graph and the query-graph. As in querying with SQL, too many joins are bad for performance, and a lot of work has been expended to optimize this.
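To show what such a query looks like, here is a small sketch continuing the rdflib example above (again my own illustration; the variable and property names are the same placeholders). The triple patterns in the WHERE clause use variables, and evaluating the query amounts to matching this small query graph against the data graph.

```python
# A two-pattern SPARQL query: both patterns share the variable ?person,
# so answering it is a join (subgraph match) on that shared node.
query = """
PREFIX y: <http://example.org/y/>
SELECT ?name ?birth
WHERE {
    ?person y:hasName    ?name .
    ?person y:BornOnDate ?birth .
}
"""

for name, birth in g.query(query):
    print(name, birth)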
Distributed SPARQL processing
If it is possible to gather all the RDF data needed in one place, then querying can be done easily using common cloud computing tools. You can partition and store the RDF on HDFS, and then run SPARQL queries on it as MapReduce jobs (say, using Spark). If it is not possible to gather all the RDF data, you need distributed querying, and for this, methods similar to those used in distributed RDBMSs can be used.
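As a rough sketch of the "SPARQL on Spark" idea (not the speaker's system; the HDFS path, file format, and property names below are assumptions), a triple-pattern query over a triples table stored on HDFS becomes a filter-and-self-join job:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("sparql-on-spark-sketch").getOrCreate()

# Assume the RDF data has been flattened into (subject, predicate, object)
# rows and stored on HDFS; the path and format are placeholders.
triples = spark.read.parquet("hdfs:///data/rdf/triples.parquet").toDF("s", "p", "o")

# The SPARQL query
#   SELECT ?person ?name WHERE { ?person y:hasName ?name .
#                                ?person y:BornOnDate "1937-04-22" . }
# becomes a self-join of the triples table on the shared subject variable.
names = triples.filter(col("p") == "y:hasName") \
               .select(col("s").alias("person"), col("o").alias("name"))
births = triples.filter((col("p") == "y:BornOnDate") &
                        (col("o") == "1937-04-22")) \
                .select(col("s").alias("person"))

result = names.join(births, on="person")
result.show()
```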
To complicate things further, not all of the RDF data sites can process SPARQL queries. Several approaches have been formulated to deal with the problem this poses for distributed SPARQL processing, such as writing wrappers that execute SPARQL queries against these sites.
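To illustrate the wrapper idea, here is a purely hypothetical sketch: suppose a site exposes only a JSON search API (the endpoint and field names below are invented), and the wrapper translates a single triple-pattern lookup into that API call and re-exposes the results as triples. A real mediator/wrapper system would also handle joins and query pushdown, which this omits.

```python
import requests  # assumption: the remote site offers a plain JSON API


class JsonApiWrapper:
    """Hypothetical wrapper exposing a non-SPARQL site as triple patterns."""

    def __init__(self, base_url):
        self.base_url = base_url  # e.g. "https://example.org/api" (invented)

    def match(self, subject=None, predicate=None, obj=None):
        """Answer a single triple pattern by calling the site's JSON API."""
        # Translate the pattern into API parameters (purely illustrative).
        params = {"id": subject, "field": predicate, "value": obj}
        resp = requests.get(f"{self.base_url}/search", params=params)
        resp.raise_for_status()
        for rec in resp.json():
            # Re-expose each record as an RDF-style (s, p, o) triple.
            yield (rec["id"], predicate, rec["value"])


# A federated SPARQL engine would route triple patterns to wrapper.match()
# for sites like this one, and perform the joins itself.
```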
Finally, live querying of the web of RDF linked data is also possible, by sending bots to traverse/browse this graph at runtime.
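As a rough sketch of this link-traversal style of execution (assuming rdflib again; the seed URI, traversal limit, and query are placeholders), a bot can dereference URIs at query time, add the retrieved triples to a local graph, follow the URIs it discovers, and then evaluate the SPARQL query over whatever it has accumulated:

```python
from collections import deque

from rdflib import Graph, URIRef


def live_query(seed_uri, sparql, max_fetches=10):
    """Traverse linked data at runtime and query the accumulated graph."""
    g = Graph()
    frontier, seen = deque([URIRef(seed_uri)]), set()
    while frontier and len(seen) < max_fetches:
        uri = frontier.popleft()
        if uri in seen:
            continue
        seen.add(uri)
        try:
            g.parse(uri)  # dereference the URI (HTTP content negotiation)
        except Exception:
            continue      # skip URIs that don't resolve to parseable RDF
        # Follow object URIs discovered in the fetched triples.
        for _, _, o in g:
            if isinstance(o, URIRef) and o not in seen:
                frontier.append(o)
    return g.query(sparql)


# Example usage (placeholder seed URI and query):
# for row in live_query("http://dbpedia.org/resource/Jack_Nicholson",
#                       "SELECT ?p ?o WHERE { ?s ?p ?o } LIMIT 10"):
#     print(row)
```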