North East Database Day
NEDD is a one day conference in Cambridge. It’s an inexpensive way to learn about the cutting edge of database research. These are my notes on each speaker.
Large notes⌗
I wasn’t as out of my depth as I thought I was going to be. I owe it all to Distributed systems theory for the distributed systems engineer by Henry Robinson
The storm blew open a locked door, but it wasn’t at a dramatic moment in a talk. Dissappointed in god for the missed opportunity.
I sat next to Michael Stonebraker.
Frans Kaashoek (MIT) - Verification and correctness of kernels⌗
- Linux has many subtle bugs based on system interactions that aren’t caught in testing.
- Verified file systems avoid this problem, like FSCQ, which we’ll be learning about today.
- (I just read about seL4, a microkernel devoted to security that recently performed well in live tests.)
- FSCQ uses CHL, which is Crash Hoare Logic. Normal Hoare Logic has PRE and POST conditions. CHL adds a CRASH condition. It states what must be true if the function crashes at any point.
Notes about FSCQ:⌗
- 80k lines of code in Haskell (and that includes the CHL code).
- 1.5 years to develop, with 4 programmers who were getting up to speed with the CHL tools.
- Testing passed with almost no effort.
- Performed well
- Project page
Interesting tidbit I now believe: higher level languages perform comparable to lower level languages when the program is not CPU-bound.
During Q&A, Michael Stonebraker pointed out that verified systems were tried a couple decades ago, but fell out of favor when it was shown that proofs-of-correctness were just as likely to contain errors as the code itself.
Todo:⌗
- Learn Hoare Logic
- Let Kaashoek know that his versioned trees graph looks a lot like git.
John Hogg (Facebook) - Adding Read-Your-Writes anywhere⌗
Small history of databases: RMDBS hit its limit in 2008, and companies with huge data sets started migrating to key-value stores.
Q. How do you add Read-Your-Writes to a system that doesn’t have it?
A. Cache it in the client! It kinda works!
My question: Doesn’t Facebook solve its consistency problems by having users hit very local datacenters for a few minutes after they post something, until Facebook is reasonably sure the change has propogated to other datacenters?
Note: Easily the prettiest slides.
Todo:⌗
John Cielewicz (Google) - Declarative Query at Scale⌗
F1, like the cars, is fast. It’s Google’s upgrade to sharded SQL.
Basically there’s an F1 server, and a UDF server. The F1 server sends over queries in batches to the UDF server, which processes them and send the results back.
F1 workers are always in process, so there’s no startup time (like AthenaWorkers).
Stratos Idreos (Harvard) - Periodic table of Data Structures⌗
(I just read this paper!)
Normally data structures have a performance trade off between Reading, Updating, and Memory.
This paper has a few things going for it.
- It presents a system to help you choose the correct data structure to fit your needs.
- It decomposes data structures into their component design principles.
- Given those components, it created a Period Table that predicts datastructures that haven’t been invented yet. (Much like how the original periodic table predicted the existence of elements.)
Since the dawn of computer science, ~5,000 data structures have been written about. But given the combination of design principles, we can predict that 10^32 exist.
Todo:⌗
- Get better at B-Trees, blooms filters, and LSM v B-Trees
- Send these slides to paper-a-day guy
James Finnerty (Amazon) - Aurora & PostgreSQL⌗
All about keeping Aurora compatible with PostgreSQL.
Aurora has built in distrubuted computing guarantees of consistency. It’s able to make a few optimizations over a traditional Postgres install since it is built straight on top of S3.
Jeremy Kepner (MIT) - 1 billions updates per second⌗
This guy looks homeless, but then I used to look like a creepy Jesus, so who am I to talk.
He runs the MIT Lincoln Lab Supercloud. This talk is about what they have accomplished. In sum, they’re the best. The goal is to return results to researchers within 15 minutes, before they get started on other things and lose track of their thoughts.
Nate Weir (Brown) - Natural Language for Db⌗
- Train on NL -> SQL pairs
- Implies that this is plug-n-play
Todo:⌗
- Send this to David from Optum
- Read George Miller’s Ambigous Aords
Raul Fernandez (MIT) - Combining unstructured tables⌗
- What if we have unreleated databases or tables, with data in them that we might care about? How do we find the data we want?
- Turn Each row, and each column, into a vector. Then searching becomes a matter of finding nearby vectors.
- Downsides: huge space requirements for all vectors. Some minimization required.
Ryan Marcus (Brandeis) - Better DB with ML⌗
- Here’s a question that database platforms need to answer: is a join query going to “spill”? We train a neural network on thousands of queries with some spill, then map the queries to vectors, and (like the last presentation) compare the vectors clusters to new queries to predict “spillage”. (“Spillage” might not be a database term.)
- Bonus: you can see vectors outliers, and guess where the database is “broken” and needs a bugfix.
Alexandra Meliou (UMass Amherst) - Fairness and Diversity in ML⌗
- Here’s a fun example. Go to an online translator, like Google, and translate “he is a doctor” to Turkish and back. Then translate “she is a doctor” to Turkish and back. The latter will come back as “he is a doctor.” Do the same with “he is a nurse” and “she is a nurse.” Both come back as “she is a nurse.”
- Machines amplify biases.
- This talk focuses on data management, about data going into an ML system.
Todo⌗
- Look up the DIVERSE keyword in SQL.
Carl Nuessle (SUNY Buffalo) - Big Data on Small Phones⌗
Smartphone databases are different from other big data databases:
- They care about latency, not throughput.
- Has bursty data.
- Are single client.
The nature these differences means that smartphone databases Need ACD, not ACID. They don’t need locking and threading.
Also, maybe obviously, smartphone’s care way more about energy consumption that cloud databases.
See more on this project at PocketData.info.
Oscar Moll (MIT) - Video Data Lakes⌗
- When it’s expensive to analyse and calculate data, you have to be picky about getting a good sample from your data lake.
- The two most common methods to pick samples are randomly (which performs pooly) and set-distance (like modulo, which performs weirdly).
- This talk is on a better way to get good samples.
Mirek Riedewald - Top-k to Any-k⌗
(Had to step outside during this talk, didn’t get notes.)
David Cohen (Intel) - OpenSource Cloud-Native Database⌗
Speaker said that banks can’t use AWS because they’re heavily regulated. Why doesn’t healthcare have the same problems? We use AWS all the time, and our regulatory burdens are huge.
Uses Vitess, and RocksDB (!).
Xiaowei Zhu (Tsinhua) - LiveGraph⌗
We want to process live data on a constantly changing graphs. Think retweets, fraudulent payments through circular processors, etc.
This is hard because the normal process is:
update --> [ writable DB ] --> slow snapshot --> [ read-only DB ] --> analytics
Where the writable DB is a normal DB like b-tree or LSMT, and the read-only DB is a CSR (compressed sparse row).
The new process is:
update --> [ LiveGraph DB ] --> analytics
The LiveGraph DB is basically a bunch of logs. Each cell is a log of it’s values+timestamps. To get the state of the database at a certain time, pick any value on any cell, and find the values for tall the other cells <= that timestamp. (Apparently this is quicker than a snapshot?)
The result is slower updates, but faster graph analysis.