class: center, middle, section-heading
# Understanding the power of data types
## PostgreSQL's Secret Weapon

---

# Peter van Hardenberg
### Heroku Postgres

http://postgres-data-types.pvh.ca/

(If you want a reference or see a bug, let me know!)

#### @pvh / pvh@pvh.ca

---

# Programme

* Basic data types
 * Type internals (for users)
 * Advanced data types
 * Special-purpose data types
 * Customizing types
 * Community extensions
 * Missing types

---

# Why use data types?

## Make queries simpler and more readable.

## Solve problems elegantly and in the database.

## Performance and space efficiency.

---

# Caveat Auditor

## "The root of all evil is premature optimization."
.right[— Knuth]

---

---

# Basic types

## surrogate keys
## text
## numbers
## dates and times

---
# Keys

* Include a surrogate key.
 * Model your natural key via constraints.
 * Use `bigserial` or `uuid` for your primary key.*

---

# Keys: BIGSERIAL

If your table is small, the extra size doesn't matter.

If your table is big, you're going to need it anyway.

---

# Keys: UUID

````sql
CREATE EXTENSION "uuid-ossp";
SELECT uuid_generate_v4();
````

````sql
CREATE EXTENSION pgcrypto;
SELECT gen_random_uuid();
````

(Or create the UUIDs in the client.)

---
# Text

* Use `text`!
 * Avoid `varchar`, `char`, and anything else.

---
# Text: Use indexes for pattern-matching

An index supports prefixes (`LIKE 'Peter%'`)
````sql
CREATE INDEX ON users (name);
SELECT * FROM accounts WHERE email LIKE 'Peter%';
````

A functional index can support suffix lookups.
````sql
CREATE INDEX backsearch ON users (reverse(email));
SELECT * FROM accounts WHERE reverse(email) LIKE reverse('%doe.com');`
````

---

# Text: Other Searching

A tsvectors with a GIN index will find individual words efficiently
````sql
CREATE INDEX ON products USING gin(tsv);
````

Regular expressions: ~
````sql
SELECT * FROM users WHERE name ~ '(John|Jane)\s+Doe';
````

(You can get some index capability via pg_trgm/GIN index)

---
# Times & Dates

## `timestamptz`
### Always use `timestamptz`. (Not `time`.)

--
## `date`
### (For when you just need the date.)

---
# Time functions: date_trunc()

A query that always starts at the beginning of the current week
````sql
SELECT date_trunc('week', now())::date;
````

--
A query that counts users created by day
````sql
SELECT date_trunc('day', created_at), count(*)
FROM users GROUP BY date_trunc('day', created_at)
````

---
# Time travel: interval

When was a year ago?
````sql
SELECT now() - '1 year'::interval;
````
--
All the days last month (good for joining):
````sql
SELECT generate_series(date_trunc('month', now() - '1 month'::interval),
                       date_trunc('month', now()), '1 day'::interval)
````

---
# Boolean values

## Use `bool`, not `bit`

Tip: Don't index on booleans. (Consider a partial index.)

---
# Binary data

## `bytea`

(Do consider if Postgres is the right solution...)

Mostly you will leave these alone and treat them as opaque.

Handy Exception: 
````sql
SELECT md5(binary_data) FROM table;
````

---
# Don't use

* `money`: not up to modern standards
 * `timestamp`: use `timestamptz`
 * `time`: you probably meant `timestamptz`
 * `serial`: use `bigserial`

# Are you sure?

* `float` / `integer`: use `numeric`
 * `varchar`, `char`: use `text`, it's faster
 * `bitstring`: premature optimization
 * `xml`: libxml2 is awful, but...
 * `json`: you probably want jsonb

---

class: center, middle, section-heading
# Data Type Internals
## A simplified guide for users

---
# Internals: What's a type?*

* a row in pg_type
 * a nice name for users to remember
 * `typename_{in, out}()`
 * `typename_{send, recv}()` optionally
 * functions (`host()`, `unnest()`, `average()`)
 * operators (`<`, `=`, `~`)
 * ... and some other things

---
# Internals: How does it work?

* typename_in() gets user input, e.g. `'rain', '{1, 3, 5}'`
 * returns a block of memory (palloc() fixed size or varlena)
 * this is added as an attribute to a tuple
 * the tuple is packed into a page
 * the page is written to disk
 * C language functions operate on this format

---
# Internals

---

# Internals: TOAST
## (The Oversized-Attribute Storage Technique)

Large values are compressed and moved to a special storage location called a TOAST table automatically.

This assumes large values shouldn't be kept with their row anyway.

---

# Internals: TOAST is great

* Keeps data rows small, saving I/O on search.
 * Shrinks large values, saving I/O on retrieval.
 * Can (maybe?) add cost if searching contents of all rows of large data...
--

* Try not to do that.

---
# Internals: Make TOAST faster

## `tsvector` on `text`
## `GIN` on `tsvector` / `jsonb`

---

class: center, middle, section-heading
# Advanced data types
## A closer look at some of the more powerful types

---

# Advanced types greatest hits

## `array`
## `jsonb`
## `enum`
## `range`

---

---
# Array: What?

It's an array.

--
In a value.

--
It's not really rocket science.

Useful for:
 * actual array data
 * tags / collections
 * on-the-fly aggregation in query results

---

# Array: How? (Constructing)

Make an array.
````sql
SELECT ARRAY[1, 2, 3];
````

Make it another way.
````sql
SELECT '{1, 2, 3}'::numeric[];
````

Extend your array.
````sql
SELECT ARRAY[1,2] || 3;
SELECT ARRAY[1,2] || ARRAY[3, 4];
````

Aggregate from records.
````sql
SELECT array_agg(distinct users) from events where name = 'PGDay.RU';
````

--
You can make N-dimensional arrays too.

---

# Array: How? (Accessing)

(Remember SQL arrays are 1-indexed!)
````sql
SELECT ('{one, two, three}'::text[])[1]; -- one
````

Use containment operator to find things.
````sql
CREATE INDEX ON table USING gin(tags); -- because fast!
SELECT * FROM table WHERE tags @> array['sometag'];
````
(GIN indexes are great here)

---

## Not exactly overlooked!

---

# JSON(b): What?

## "We made a database inside a type." 
.right[— Oleg Bartunov]

---

# JSON(b): When? (1/3)

Convenient short-lived or occasional attribute storage.
````sql
 {"weird_user?": true, "abuse_tags": ["new", "possible_abuser"]}
````
(Honour demands that I insist you make these columns if they get used often.)

---

# JSON(b): When? (2/3)

Hierarchical data.
````json
{"name": "car",
 "id": 91371
 "parts": [
   {
     "name": "engine"
     "id": 3241,
     "parts": [
       {
         "name": "crankshaft",
         "id": 23991,
         ...
````

---

# JSON(b): When? (3/3)

The data is already JSON.

Why pull it apart?

---

# JSON(b): `jsonb` vs `json`

Prefer `jsonb`, not `json`.
````sql
ALTER TABLE users ADD COLUMN attrs jsonb;
````

Usually add a GIN index.
````sql
CREATE INDEX ON users USING gin(attrs);
````

---

# JSON(b): Anti-patterns

Probably don't do this...
````sql
CREATE TABLE bad_table(id serial, json json);
````

---

# JSON(b): Anti-patterns

Particularly if..
````sql
=# select distinct jsonb_object_keys(json) from bad_table;
 jsonb_object_keys
-------------------
 id
 created_at
 updated_at
 name
 email
 owner_id
 ...
````
Performance will suffer and your queries will be more complex.

---

# JSON(b): Construction

Input is easy.
````sql
select '{"a": "b"}'::jsonb
````

Adding members too.
```sql
UPDATE documents 
  SET contents = content || '{"new_field": "goes here"}' 
  WHERE document_id = 32421;
```

---

# JSON(b): Accessing

Get to elements with -> and ->>
````sql
select attrs->>'presenter_irc_nick' from talks
 attrs
-------
 pvh
````

Using a single -> returns the internal JSON.
````sql
select attrs->'previous_presentation'->>'location' from talks
   attrs
-----------
 "Ottawa"
````

Or send a whole path
````sql
SELECT jsonb_extract_path_text(attrs, 
  'previous_presentation', 'location')
````

---

# JSON(b): Decomposing

Turning JSON into records (requires column definition).
````sql
select * from json_to_recordset(
  '[{"a":1,"b":"foo"},{"a":"2","c":"bar"}]') as x(a int, b text);
````

---

# JSON(b): Searching

Find documents with a certain key,
````sql
select * from talks where attrs ? 'previous_presentation'
````

or documents with a whole sub-document in common.
````sql
SELECT * FROM talks WHERE attrs @> '{"previous_presentation": "PGCon"}'
````

---

---

# Enums: What?

````sql
CREATE TYPE weekdays AS ('Mon', 'Tue', 'Wed', 'Thu', 'Fri');
````

Enums are conceptually similar to domains.

Fast, transparent mapping of words to integers.

(Data lives in pg_enum.)

---

# Enums: Why?

Enums provide a limited set of labels.

Storage is quite efficient - enums are stored as numbers.

---

# Enums: How?

Enums work like enums in other languages.
````sql
CREATE TYPE server_states AS 
  ENUM ('running', 'uncertain', 'offline', 'restarting');
````

--
You can treat them like their text values.
````sql
INSERT INTO servers(state) VALUES ('offline');
````

Enums block unexpected values from insertion.
````sql
INSERT INTO servers(state) VALUES ('mystery');
ERROR:  invalid input value for enum server_states: "mystery"
LINE 1: insert into servers(state) values ('mystery');
````

--
But you can add new values.
````sql
ALTER TYPE server_states ADD VALUE 'abandoned' AFTER 'offline';
````
--
(NB: You can treat enums as numbers and compare with > and < but... don't.)

---

---

# Ranges: What?

Ranges are a simple but powerful variation on composite types.
````sql
INSERT INTO reservation VALUES 
  ('[2016-01-01 11:30, 2016-01-01 15:00)');
````

Always has upper / lower, inclusive / exclusive bounds.

---

# Ranges: Why?

## Nothing lasts forever

Benefits include: 
 * operators for querying
 * exclusion constraints

---

# Ranges: Why?

Some useful predefined range types:

* numrange
 * tstzrange
 * daterange

Or define your own:
````sql
CREATE TYPE inetrange AS RANGE (
    SUBTYPE = inet
); 
````
---

# Ranges: How? (Basics)

Ranges can be created mid-query.
````sql
SELECT tstzrange(created_at, deleted_at) FROM users;
````

--
(NULLs are helpfully treated as -/+ infinity.)

Find users with active accounts on Christmas Day
````sql
SELECT * FROM users
WHERE tstzrange(created_at, deleted_at) @> '2015-12-25'::timestamptz
````

Indexing makes this very fast.
````sql
CREATE INDEX ON users USING GIST( tstzrange(created_at, deleted_at) );
````

---
# Ranges: How? (Window functions)

Understand how events unfold...
````sql
WITH events_duration AS (
  SELECT 
    tstzrange(
      at_time,
      lead(at_time, 1) OVER (partition by event_source order by at_time)
    ) as range,
    *
  FROM events
)
SELECT * FROM events WHERE range @> now() - '4 hours';
````

---

# Ranges: How? (Constraints)

Ranges are also great for exclusion constraints.

Prevent double-booking!
````sql
ALTER TABLE reservation ADD CONSTRAINT
  EXCLUDE USING gist(room with =, during with &&)
````

---

---

# Email Addresses

Everyone has email addresses, here's how you deal with them today.

````sql
SELECT * FROM users WHERE email LIKE '%@bigcorp.com'
````
--
This is always a full table scan. Sad trombone.

We can try our reverse b-tree trick.
````sql
SELECT * FROM users WHERE reverse(email) LIKE reverse('%@bigcorp.com')
````
--
This is... clever, but ugly and hard to dream up.

---
# `emailaddr`
## github.com/petere/pgemailaddr

````sql
SELECT user(email) FROM users WHERE host(email) = 'bigcorp.com'
````
Bonus: By default, email addresses are sorted domain/user, so this is fast with no tricks!

---
class: middle, center
# What about URLs?

---
# `uri`
## github.com/petere/pguri
--

````sql
SELECT path(referrer) FROM signups WHERE host = 'interestingblog.com'
````

---

# Skip the migration!

````sql
SELECT 'http://google.com/'::uri
````

````sql
SELECT host('http://google.com/'::uri)
````

---

# Geometry: What?

Postgres has a wide selection of geometry types.

* point
 * line
 * lseg
 * box
 * path
 * polygon
 * circle

---

# Geometric: Functions and Operators

* translation: `+`, `-`, `*`, `/`
 * properties: `@-@` (length), `#` (intersects)
 * comparison: `@>` (contains), `<->` (distance)
 * .. and SO many others (`&<|`)
---

# Geometry: How?

````sql
SELECT diameter(circle) FROM circles
  WHERE center(circle) = '(2, 3)'
````

---

# Geometry: Why?

* good for light-weight geometry
 * not as fast or as feature rich as PostGIS
 * built in!

---

# PostGIS

http://postgis.net

PostGIS is the best way to store and query spatial data.

---

# PostGIS: What?

* `geom`
 * `geog`
 * index support
 * wraps various spatial libraries

---

# PostGIS: Why?

* geom datatype
 * better performance
 * huge amounts of functionality (vector / raster)
 * not built-in
 
---

# PostGIS: How?

````sql
SELECT
  sum(ST_Length(r.the_geom))/1000 AS kilometers
FROM
  bc_roads r,
  bc_municipality m
WHERE  r.name = 'Roseberry St' AND m.name = 'VICTORIA'
  AND ST_Contains(m.the_geom, r.the_geom) ;
````

---
# Network addresses

## built-in types

* inet (`192.168.0.100`)
 * cidr (`::ffff:1.2.3.0/120`)
 * macaddr (`08:00:2b:01:02:03`)

## ip4r

https://github.com/RhodiumToad/ip4r

Adds index support for ranges, lighter-weight single IP type.
(And lots more.)

---
# Special mention

## PostBIS
Bio-informatic types for DNA querying

https://colab.mpi-bremen.de/wiki/display/pbis/PostBIS

## pg_currency
Very stale (6 years) but a solid foundation!

https://github.com/samv/pg-currency

---
# Special mention

## FixedDecimal
For when you need a fixed-precision decimal (performance)

https://github.com/2ndQuadrant/fixeddecimal

## HyperLogLog
Not really a data type, but a data structure for estimating distinct values.

https://github.com/aggregateknowledge/postgresql-hll

---

---

# Domains: What?

Create your own types that constrain existing types.

* roll your own email type
 * a more specific URI

--
NB: SQL wants all types to be null-friendly. Your domain should allow nulls.

---
# Domains: When?

* common, stable specification
 * postal code
 * email address
 * NOT name
 * NOT postal address

---
# Domains: Examples

````sql
CREATE DOMAIN email AS TEXT
CHECK(
  VALUE ~ '.+\@.+'
);
````

````sql
CREATE DOMAIN web_url AS uri CHECK ( (uri_scheme(value) = 'http' OR uri_scheme(value) = 'https')  AND uri_host(value) IS NOT null);
````

--
(A better email address would check other things like length.)

---

# Composite Types: What?

Create your own types that combine existing types.
--

* all tables are also types
 * members always go together
 * all members always have the same fields

---

# Composite Types: How? (1/2)

Every row is also a composite type.
````sql
SELECT (users) FROM users;
````

But you can create your own:
````sql
CREATE TYPE point3d AS (x float, y float, z float);
CREATE TABLE threespace (location point3d);
````

---

# Composite Types: How? (2/2)

Referring to a row type (composite type named for table) requires parens
````sql
SELECT (users).email FROM users;
````

Row types are handy in window expressions
````sql
SELECT (agent_statuses) AS current, 
  lead((agent_statuses), 1) OVER (partition by agent_uuid order by time) AS next 
FROM agent_statuses;
````

---

# DIY functions

Defining a new function is straightforward
````sql
CREATE OR REPLACE FUNCTION host(email) RETURNS text 
LANGUAGE sql AS $$ 
  select split_part($1, '@', 2) as return; 
$$;
````

PL/PGSQL is very slow.

PL/V8 is relatively fast.

PL/SQL is quite fast.

---
# DIY Aggregates

Define an initial condition, a per-value function, and a final function (optional).
````sql
CREATE AGGREGATE avg (float8)
(
    sfunc = float8_accum,
    stype = float8[],
    finalfunc = float8_avg,
    initcond = '{0,0,0}'
);
````

---

# DIY Operators

An operator is any number of non-ASCII characters.
````sql
CREATE OPERATOR @-> ( 
  LEFTARG = 'email', 
  PROCEDURE = 'host');

SELECT email@-> FROM emails LIMIT 1;
````

---

class: center, middle, section-heading
# Missing Types
## Opportunities to make cool things

---

# Currency

`money` is not a good type, and we can do better.

````sql

SELECT justify('$34.03 USD, $12.01 CAD', 
 'USD',
 '{"CAD": 0.77}');

````

---

# Physical Units

Postgres users don't crash rovers into Mars.

````sql
SELECT '77 km/h'::physical
SELECT to_unit('ml', '542 fluid ounces'::physical);
SELECT '4 m/s'::physical + '1 ounce'::physical; -- error
````

---

# Images

````sql
UPDATE graphs 
  SET image = render_graph(SELECT * FROM data WHERE run = 117) ;
````

````sql
SELECT gif_frame(image, 3) FROM gifs WHERE name = 'space-kaboom';
````

Image classification, sorting, organizing, resampling, and so on in-DB could be useful.

---

# Music / Sound / Tags

````sql
SELECT genre FROM mp3s;
SELECT max(sound_length(sound)) FROM samples;
````

Could be great for people with large music collections, DJs, researchers.

Consider building on taglib?

---

---

## Basic data types to use

`text`, `timestamptz`, `bigserial`/`uuid`

## Advanced types to use

`jsonb`, `array`, `tstzrange`/`numrange`

## Create your own

Domains, composite types, functions, operators, aggregates.

## Community extensions

PostGIS, ip4r, email, uri, pgxn

---

# fin

## questions? comments? complaints?

## @pvh / pvh@pvh.ca

Feedback is welcome and invited!

thanks to: peter eisentraut, jim mlodgenski, jonathan katz, bruce momjian, josh berkus, will leinweber, rachid belaid

---