Characterizing Database Workloads & Storage Models

Thank you Carnegie Mellon Database Group for putting this online! These are my notes from watching on YouTube.

Carnegie Mellon Databases Storage II, Lecture 4

Prof. Andy Pavlo [Watch on YouTube]

The Problem and Solution

How should the DBMS represent the database in storage files on disk? Solve it by choosing the right storage model for your target workload. The right strategy varies if you are reading data, writing data and with how many joins you are performing.

Workload Characterization

OLTP (Online Transaction Processing): “Simple queries with lots of writes.”

OLAP (Online Analytical Processing): “Read only queries. Lots of joins. Doing a lot of reads, but they’re more complex.”

HTAP (Hybrid Transactional Analytical Processing): “is trying to do both of them. You still want to ingest new data, but analyze it as it comes in. It’s used for companies making decisions on the fly as people are browsing websites, like internet advertising companies.”

Screenshot 2020-06-27 at 11.56.10 AM

Storage Models

screenshots from the lecture

n-ary model

N-ary used to be the dominant model until the ’80s.

DSM model

Additional Reading: All Things Distributed

Column Store Vs. Row Store RDBMS

Row-oriented DBMS (Row Store)

  • PostgreSQL, MySQL
  • Row Store = use OLTP

Column-oriented DBMS (Column Store)

  • Red Shift, BigQuery
  • Column Store = use OLAP

If types are consistent, you can compress data into single column store.