North East Database Day

NEDD is a one day conference in Cambridge. It’s an inexpensive way to learn about the cutting edge of database research. These are my notes on each speaker.

Large notes

I wasn’t as out of my depth as I thought I was going to be. I owe it all to Distributed systems theory for the distributed systems engineer by Henry Robinson

The storm blew open a locked door, but it wasn’t at a dramatic moment in a talk. Disappointed in god for the missed opportunity.

I sat next to Michael Stonebraker.

Frans Kaashoek (MIT) - Verification and correctness of kernels

Linux has many subtle bugs based on system interactions that aren’t caught in testing.
Verified file systems avoid this problem, like FSCQ, which we’ll be learning about today.
(I just read about seL4, a microkernel devoted to security that recently performed well in live tests.)
FSCQ uses CHL, which is Crash Hoare Logic. Normal Hoare Logic has PRE and POST conditions. CHL adds a CRASH condition. It states what must be true if the function crashes at any point.

Notes about FSCQ:

80k lines of code in Haskell (and that includes the CHL code).
1.5 years to develop, with 4 programmers who were getting up to speed with the CHL tools.
Testing passed with almost no effort.
Performed well
Project page

Interesting tidbit I now believe: higher level languages perform comparable to lower level languages when the program is not CPU-bound.

During Q&A, Michael Stonebraker pointed out that verified systems were tried a couple decades ago, but fell out of favor when it was shown that proofs-of-correctness were just as likely to contain errors as the code itself.

Todo:

Learn Hoare Logic
Let Kaashoek know that his versioned trees graph looks a lot like git.

John Hogg (Facebook) - Adding Read-Your-Writes anywhere

Small history of databases: RDBMS hit its limit in 2008, and companies with huge data sets started migrating to key-value stores.

Q. How do you add Read-Your-Writes to a system that doesn’t have it?

A. Cache it in the client! It kinda works!

My question: Doesn’t Facebook solve its consistency problems by having users hit very local datacenters for a few minutes after they post something, until Facebook is reasonably sure the change has propagated to other datacenters?

Note: Easily the prettiest slides.

Todo:

Read Facebook trapped in MySQL ‘fate worse than death’

John Cielewicz (Google) - Declarative Query at Scale

F1, like the cars, is fast. It’s Google’s upgrade to sharded SQL.

Basically there’s an F1 server, and a UDF server. The F1 server sends over queries in batches to the UDF server, which processes them and sends the results back.

F1 workers are always in process, so there’s no startup time (like AthenaWorkers).

Stratos Idreos (Harvard) - Periodic table of Data Structures

(I just read this paper!)

Normally data structures have a performance trade off between Reading, Updating, and Memory.

This paper has a few things going for it.

It presents a system to help you choose the correct data structure to fit your needs.
It decomposes data structures into their component design principles.
Given those components, it created a Periodic Table that predicts data structures that haven’t been invented yet. (Much like how the original periodic table predicted the existence of elements.)

Since the dawn of computer science, ~5,000 data structures have been written about. But given the combination of design principles, we can predict that 10^32 exist.

Todo:

Get better at B-Trees, Bloom filters, and LSM v B-Trees
Send these slides to paper-a-day guy

James Finnerty (Amazon) - Aurora & PostgreSQL

All about keeping Aurora compatible with PostgreSQL.

Aurora has built in distributed computing guarantees of consistency. It’s able to make a few optimizations over a traditional Postgres install since it is built straight on top of S3.

Jeremy Kepner (MIT) - 1 billion updates per second

This guy looks homeless, but then I used to look like a creepy Jesus, so who am I to talk.

He runs the MIT Lincoln Lab Supercloud. This talk is about what they have accomplished. In sum, they’re the best. The goal is to return results to researchers within 15 minutes, before they get started on other things and lose track of their thoughts.

Nate Weir (Brown) - Natural Language for Db

Train on NL -> SQL pairs
Implies that this is plug-n-play

Todo:

Send this to David from Optum
Read George Miller’s Ambiguous Words

Raul Fernandez (MIT) - Combining unstructured tables

What if we have unrelated databases or tables, with data in them that we might care about? How do we find the data we want?
Turn Each row, and each column, into a vector. Then searching becomes a matter of finding nearby vectors.
Downsides: huge space requirements for all vectors. Some minimization required.

Ryan Marcus (Brandeis) - Better DB with ML

Here’s a question that database platforms need to answer: is a join query going to “spill”? We train a neural network on thousands of queries with some spill, then map the queries to vectors, and (like the last presentation) compare the vectors clusters to new queries to predict “spillage”. (“Spillage” might not be a database term.)
Bonus: you can see vectors outliers, and guess where the database is “broken” and needs a bugfix.

Alexandra Meliou (UMass Amherst) - Fairness and Diversity in ML

Here’s a fun example. Go to an online translator, like Google, and translate “he is a doctor” to Turkish and back. Then translate “she is a doctor” to Turkish and back. The latter will come back as “he is a doctor.” Do the same with “he is a nurse” and “she is a nurse.” Both come back as “she is a nurse.”
Machines amplify biases.
This talk focuses on data management, about data going into an ML system.

Todo

Look up the DIVERSE keyword in SQL.

Carl Nuessle (SUNY Buffalo) - Big Data on Small Phones

Smartphone databases are different from other big data databases:

They care about latency, not throughput.
Has bursty data.
Are single client.

The nature of these differences means that smartphone databases Need ACD, not ACID. They don’t need locking and threading.

Also, maybe obviously, smartphone’s care way more about energy consumption that cloud databases.

See more on this project at PocketData.info.

Oscar Moll (MIT) - Video Data Lakes

When it’s expensive to analyse and calculate data, you have to be picky about getting a good sample from your data lake.
The two most common methods to pick samples are randomly (which performs poorly) and set-distance (like modulo, which performs weirdly).
This talk is on a better way to get good samples.

Mirek Riedewald - Top-k to Any-k

(Had to step outside during this talk, didn’t get notes.)

David Cohen (Intel) - OpenSource Cloud-Native Database

Speaker said that banks can’t use AWS because they’re heavily regulated. Why doesn’t healthcare have the same problems? We use AWS all the time, and our regulatory burdens are huge.

Uses Vitess, and RocksDB (!).

Xiaowei Zhu (Tsinhua) - LiveGraph

We want to process live data on a constantly changing graph. Think retweets, fraudulent payments through circular processors, etc.

This is hard because the normal process is:

update --> [ writable DB ] --> slow snapshot --> [ read-only DB ] --> analytics

Where the writable DB is a normal DB like b-tree or LSMT, and the read-only DB is a CSR (compressed sparse row).

The new process is:

update --> [ LiveGraph DB ] --> analytics

The LiveGraph DB is basically a bunch of logs. Each cell is a log of its values+timestamps. To get the state of the database at a certain time, pick any value on any cell, and find the values for all the other cells <= that timestamp. (Apparently this is quicker than a snapshot?)

The result is slower updates, but faster graph analysis.

Large notes#

Frans Kaashoek (MIT) - Verification and correctness of kernels#

Notes about FSCQ:#

Todo:#

John Hogg (Facebook) - Adding Read-Your-Writes anywhere#

Todo:#

John Cielewicz (Google) - Declarative Query at Scale#

Stratos Idreos (Harvard) - Periodic table of Data Structures#

Todo:#

James Finnerty (Amazon) - Aurora & PostgreSQL#

Jeremy Kepner (MIT) - 1 billion updates per second#

Nate Weir (Brown) - Natural Language for Db#

Todo:#

Raul Fernandez (MIT) - Combining unstructured tables#

Ryan Marcus (Brandeis) - Better DB with ML#

Alexandra Meliou (UMass Amherst) - Fairness and Diversity in ML#

Todo#

Carl Nuessle (SUNY Buffalo) - Big Data on Small Phones#

Oscar Moll (MIT) - Video Data Lakes#

Mirek Riedewald - Top-k to Any-k#

David Cohen (Intel) - OpenSource Cloud-Native Database#

Xiaowei Zhu (Tsinhua) - LiveGraph#

Large notes

Frans Kaashoek (MIT) - Verification and correctness of kernels

Notes about FSCQ:

Todo:

John Hogg (Facebook) - Adding Read-Your-Writes anywhere

Todo:

John Cielewicz (Google) - Declarative Query at Scale

Stratos Idreos (Harvard) - Periodic table of Data Structures

Todo:

James Finnerty (Amazon) - Aurora & PostgreSQL

Jeremy Kepner (MIT) - 1 billion updates per second

Nate Weir (Brown) - Natural Language for Db

Todo:

Raul Fernandez (MIT) - Combining unstructured tables

Ryan Marcus (Brandeis) - Better DB with ML

Alexandra Meliou (UMass Amherst) - Fairness and Diversity in ML

Todo

Carl Nuessle (SUNY Buffalo) - Big Data on Small Phones

Oscar Moll (MIT) - Video Data Lakes

Mirek Riedewald - Top-k to Any-k

David Cohen (Intel) - OpenSource Cloud-Native Database

Xiaowei Zhu (Tsinhua) - LiveGraph