SIGMOD/PODS Day 2

- June 21, 2023

Yesterday I wrote about Day 1, which was mostly PODS. Today the SIGMOD conference started, and I had a fun-packed day attending awesome talks.

Opening remarks

The conference chairs, Sudipto Das and Ippokratis Pandis (both Amazonians), welcomed people. They said they were happy to bring SIGMOD back to Seattle, the home of the clouds, after 25 years. They also noted that this is the first fully in-person SIGMOD conference since the pandemic.

The conference attracted 970 attendees! US-based participants constituted 47% of the crowd, China 10%, and Germany 8%. There were 660 submissions this year, up 20% from last year. There was a graph showing acceptance rates per topic, and I was able to catch that transactions papers had an acceptance rate above 30%, and hardware papers above 40%, both of which are above the average acceptance rate for the conference. For fun, they fed the keywords of the accepted papers to ChatGPT and got an abstract back. The title ChatGPT offered was "Graph-based learning: A scalable and efficient approach to data analysis"!

Keynote: 49 years of queries (Don Chamberlin)

Don Chamberlin (retired IBM Fellow) is the co-creator of the SQL language. Why does the title say 49 years and not 50 years of querying? Because the SQL paper was published 49 years ago this month, at the SIGMOD conference in Ann Arbor, Michigan. Things were different then: SIGMOD was just a small gathering of academics and some industry people. In fact it wasn't even called SIGMOD, it was called SIGFIDET; "management of data" was not in the name of the conference.

The paper was titled "SEQUEL: A structured English query language". But believe it or not, this was not the main show at that conference, and it may even have gone largely unnoticed. The main show was two influential people debating.

The first one was Charles Bachman from Honeywell. Charles was representing the establishment, and pushing network/graph-based querying of integrated datastores. He had received the Turing Award in 1973, and his talk was titled "The Programmer as Navigator", as the programmer followed links/pointers, navigating toward an answer in integrated datastores.

The second one was E.F. (Ted) Codd. Ted was the challenger. He was pushing a higher-level model, where the queries are declarative over data values (not pointers). He was advocating that an optimizing compiler can translate high-level "what" queries into an efficient "how" plan. The users didn't need to know about access paths, as those could change.

The timing of the debate was critical. In the coming years, a lot of businesses would start putting their data on computers, and the decisions made in that decade would shape the path of the technology.

It is no spoiler to say that the relational model Ted advocated won and took over the world in the coming decade. Ted received the Turing Award in 1981 for this idea.

Let's get back to Don's personal story and SQL. Don Chamberlin and Ray Boyce were assigned at IBM to work on System R. Don and Ray appreciated the simplicity of Codd's ideas, but thought they would get more attention if they were easier to use. So they decided to use ordinary English words for querying and to target non-programmers. They even ran experiments with San Jose State students who were non-programmers. After a few hours of instruction, the students gained proficiency. The only thing some students didn't understand was the difference between a variable and a string. So instead of WHERE LASTNAME='JOB' they would write WHERE LASTNAME=JOB, and get the rows where the LASTNAME column is equal to the JOB column rather than the people whose last name is "JOB".
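The quoting mistake is easy to reproduce. Below is a minimal sketch using Python's built-in sqlite3 module; the EMP table and its rows are made up for illustration (they are not from the talk), and SQLite stands in for System R.

```python
import sqlite3


def run_queries():
    """Compare a quoted string literal with an unquoted column reference."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table: both LASTNAME and JOB are real columns.
    conn.execute("CREATE TABLE EMP (LASTNAME TEXT, JOB TEXT)")
    conn.executemany(
        "INSERT INTO EMP VALUES (?, ?)",
        [("JOB", "CLERK"), ("SMITH", "SMITH"), ("JONES", "ANALYST")],
    )
    # 'JOB' in quotes is a string literal: match people named JOB.
    quoted = conn.execute(
        "SELECT LASTNAME FROM EMP WHERE LASTNAME = 'JOB'"
    ).fetchall()
    # JOB without quotes is the JOB column: match rows where both columns agree.
    unquoted = conn.execute(
        "SELECT LASTNAME FROM EMP WHERE LASTNAME = JOB"
    ).fetchall()
    conn.close()
    return quoted, unquoted


quoted, unquoted = run_queries()
print(quoted)    # [('JOB',)]
print(unquoted)  # [('SMITH',)]
```

Both queries are valid SQL, which is why the mistake is silent: the second one simply answers a different question.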

It turns out Don had a rocky relationship with Ted Codd over SQL as the query language for relational databases, and with IBM management over getting SQL-based relational databases to market. These came out during the Q&A session after the talk. Don said he was 25 years old when he joined the System R work. He wanted to work with Ted, but that didn't happen: after Ted became an IBM Fellow, he took on a visionary/guru role and didn't see himself as a nuts-and-bolts guy. As a result, Don didn't see Ted much. After a while Ted left IBM and joined a consulting group, where he became critical of SQL and challenged it in several ways. Don was unhappy about that, because he wished Ted had been around and gotten involved earlier, so they could have had those debates earlier. Don mentioned that he still had great respect for Ted.

As for the struggles Don had within IBM, this seems like a classic example of the innovator's dilemma. IBM treated their work as research, and was very reluctant to get it into production. They did not want to cannibalize their existing IMS product. Larry Ellison, meanwhile, was worried that IBM saw SQL as a strategic investment, and moved quickly. With a head start of a couple of years, Oracle became the market leader and dominated the market until recently.

Don credited the ANSI and NIST standardization of SQL as the reason SQL succeeded and thrived in the face of many challenges. He mentioned the IEEE Spectrum survey on top languages, where SQL was 5th in the overall category and first in the "by job advertisements" category. In both lists, there are only two languages that are 40+ years old: SQL and C, both of which went through ANSI standardization. The standardization made it possible for vendors to compete and improved adoption.

I can't help but notice that C and SQL share another property that might have led to their success and resilience. This has been explained in the classic "Worse is Better" essay and also by the Principle of Least Power: "choosing the least powerful [computer] language suitable for a given purpose".

Don mentioned that SQL has this "walk up and read" property, meaning it is not too hard to understand and can be read like English, with no special punctuation. Don believes that the golden age of the SQL standard is the 1992 version. It is short and sweet, widely implemented, and successful. The SQL 1992 book by Jim Melton was 536 pages, in contrast to the SQL 1999 book by Jim, which is 1458 pages. Jim stopped writing more books, even though new standards came after that.

Don talked about some controversies about SQL, one being the lack of orthogonality. Since SQL is not a functional language, it is not free of side effects. You can't put GROUP BY just anywhere: GROUP BY doesn't return a value, and it has side effects. He also talked about the null value problem. Inevitably, information is sometimes missing in the real world, and there is no good way to deal with it. Using null values leads to problems, but SQL provides a way to manage this: you can specify, on a column-by-column basis, that NULL is not accepted. He emphasized that he is a pragmatist, and that the SQL philosophy is to provide flexible tools and to trust the user. The same principle applies to whether or not to eliminate duplicates using SELECT DISTINCT.
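To make the "flexible tools, trust the user" point concrete, here is a small sketch with Python's built-in sqlite3 module. The PERSON table and its rows are invented for illustration, and SQLite is just a convenient stand-in; other engines may differ in details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Per-column choice: NAME refuses NULL, CITY allows missing values.
conn.execute("CREATE TABLE PERSON (NAME TEXT NOT NULL, CITY TEXT)")
conn.executemany(
    "INSERT INTO PERSON VALUES (?, ?)",
    [("Ann", "Seattle"), ("Bob", None), ("Cat", "Seattle")],
)

# Inserting NULL into a NOT NULL column is rejected by the database.
try:
    conn.execute("INSERT INTO PERSON VALUES (NULL, 'Austin')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# A comparison with NULL is never true, so '= NULL' matches nothing;
# IS NULL is the explicit way to ask for missing values.
eq_null = conn.execute("SELECT NAME FROM PERSON WHERE CITY = NULL").fetchall()
is_null = conn.execute("SELECT NAME FROM PERSON WHERE CITY IS NULL").fetchall()

# Duplicates are kept unless the user asks for DISTINCT.
all_cities = conn.execute(
    "SELECT CITY FROM PERSON WHERE CITY IS NOT NULL"
).fetchall()
distinct_cities = conn.execute(
    "SELECT DISTINCT CITY FROM PERSON WHERE CITY IS NOT NULL"
).fetchall()

print(rejected)         # True
print(eq_null)          # []
print(is_null)          # [('Bob',)]
print(all_cities)       # [('Seattle',), ('Seattle',)]
print(distinct_cities)  # [('Seattle',)]
```

In both cases the language leaves the decision to the user: whether a column may be missing, and whether duplicates matter for a given query.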

Don said that there is a danger in trying to be everything to everybody. There can always be specialized languages for specialized things. In accordance with SQL philosophy, you can write a function for handling special cases/data, and can call functions from SQL. This is a tradeoff they made to make SQL more broadly applicable.
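As one illustration of calling a function from SQL, the sketch below registers a Python function with SQLite through sqlite3's create_function and uses it in a WHERE clause. The NORMALIZE_PHONE helper and the CONTACT table are hypothetical examples of mine, not something from the talk, and other databases expose user-defined functions through their own mechanisms.

```python
import sqlite3


def normalize_phone(raw):
    """Hypothetical special-case helper: keep only the digits of a phone number."""
    if raw is None:
        return None
    return "".join(ch for ch in raw if ch.isdigit())


conn = sqlite3.connect(":memory:")
# Register the Python function under a SQL-visible name taking 1 argument.
conn.create_function("NORMALIZE_PHONE", 1, normalize_phone)

conn.execute("CREATE TABLE CONTACT (NAME TEXT, PHONE TEXT)")
conn.executemany(
    "INSERT INTO CONTACT VALUES (?, ?)",
    [("Ann", "(206) 555-0100"), ("Bob", "206.555.0100"), ("Cat", "425-555-0199")],
)

# The special-case logic lives in the function; the query stays plain SQL.
rows = conn.execute(
    "SELECT NAME FROM CONTACT "
    "WHERE NORMALIZE_PHONE(PHONE) = '2065550100' ORDER BY NAME"
).fetchall()
print(rows)  # [('Ann',), ('Bob',)]
```

The language itself does not need to know anything about phone formats; the specialized handling is pushed into a function that SQL merely calls.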

Don mentioned the 1995 manifesto on object databases as a challenge to SQL. After the rise of object-oriented languages like Java and C++, people asked why not make objects persistent in their native form, rather than translating them to relational form. Other people, including Jim Gray and Mike Stonebraker, wrote a response showing how SQL can handle the object model. Don said that object-oriented DBMSs, in his opinion, proved to be inflexible, as it was not easy to share objects between apps written in other languages.

While addressing why SQL thrived, Don also credited the open and timely publication (no patents or trademarks) by the IBM System R group and by the UC Berkeley Ingres group of Michael Stonebraker and Eugene Wong.

He also said another reason was that "data is sticky!", in the sense that being first counts for a lot, since it is expensive to migrate to another platform.

As for newer developments, he said that XML and XQL had limited success, and that JSON is a relief after XML and a much more promising direction for future data applications.

Don mentioned the SQL++ work that extends SQL with nested tables, and pointed to the AsterixDB project as a successful effort.

This was such a great keynote! It ran close to 90 minutes in total with Q&A, but it was so engaging. Don delivered a very fluent talk and avoided distractions by using only a handful of slides. He had a stack of papers with notes to check when needed. At the end of his talk, Don got a standing ovation from the audience. (image credit Ippokratis Pandis)

Other sessions in Day 2

I also attended and took notes in two other interesting sessions. I had a very long day, and it is too late into the night for my body, which is still adjusting to Pacific time. I may write about these later.

One of these sessions was the Future of Database System Architectures panel with Gustavo Alonso (ETH), Natassa Ailamaki (EPFL), Sailesh Krishnamurthy (Google), Sam Madden (MIT), Swami Sivasubramanian (AWS), and Raghu Ramakrishnan (Microsoft). The panel was lively and playful. The main takeaways for me were that disaggregation has arrived and is here to stay, and that LLMs are already changing querying.

The other session combined talks from the sponsoring companies:

Amazon: Innovation in AWS Database Services — Marc Brooker
Microsoft: Microsoft Fabric - Analytics in the AI Era — Raghu Ramakrishnan
Google: Data and AI at Google BigQuery Scale — Tomas Talius
Alibaba: Enhancing Database Systems with AI — Bolin Ding and Jingren Zhou
Confluent: Consensus in Apache Kafka: from Theory to Production — Jason Gustafson and Guozhang Wang
Salesforce: Enterprise and the Cloud: Why Is It Challenging? — Pat Helland
Databricks: The best warehouse is a Lakehouse — Ryan Johnson