+ - 0:00:00
Notes for current slide
Notes for next slide

Understanding the power of data types

PostgreSQL's Secret Weapon

1 / 119

Peter van Hardenberg

Heroku Postgres

http://postgres-data-types.pvh.ca/

(If you want a reference or see a bug, let me know!)

@pvh / pvh@pvh.ca

2 / 119

Programme

  • Basic data types
  • Type internals (for users)
  • Advanced data types
  • Special-purpose data types
  • Customizing types
  • Community extensions
  • Missing types
3 / 119

Why use data types?

Make queries simpler and more readable.

Solve problems elegantly and in the database.

Performance and space efficiency.

4 / 119

Caveat Auditor

"The root of all evil is premature optimization."

— Knuth

5 / 119

Basic data types

A very brief review

6 / 119

Basic types

surrogate keys

text

numbers

dates and times

7 / 119

Keys

  • Include a surrogate key.
  • Model your natural key via constraints.
  • Use bigserial or uuid for your primary key.*
8 / 119

Keys: BIGSERIAL

If your table is small, the extra size doesn't matter.

If your table is big, you're going to need it anyway.

9 / 119

Keys: UUID

CREATE EXTENSION "uuid-ossp";
SELECT uuid_generate_v4();
CREATE EXTENSION pgcrypto;
SELECT gen_random_uuid();

(Or create the UUIDs in the client.)

10 / 119

Text

  • Use text!
  • Avoid varchar, char, and anything else.
11 / 119

Text: Use indexes for pattern-matching

An index supports prefixes (LIKE 'Peter%')

CREATE INDEX ON users (name);
SELECT * FROM accounts WHERE email LIKE 'Peter%';
12 / 119

Text: Use indexes for pattern-matching

An index supports prefixes (LIKE 'Peter%')

CREATE INDEX ON users (name);
SELECT * FROM accounts WHERE email LIKE 'Peter%';

A functional index can support suffix lookups.

CREATE INDEX backsearch ON users (reverse(email));
SELECT * FROM accounts WHERE reverse(email) LIKE reverse('%doe.com');`
13 / 119

Text: Other Searching

A tsvectors with a GIN index will find individual words efficiently

CREATE INDEX ON products USING gin(tsv);

Regular expressions: ~

SELECT * FROM users WHERE name ~ '(John|Jane)\s+Doe';

(You can get some index capability via pg_trgm/GIN index)

14 / 119

Times & Dates

timestamptz

Always use timestamptz. (Not time.)

15 / 119

Times & Dates

timestamptz

Always use timestamptz. (Not time.)

date

(For when you just need the date.)

16 / 119

Time functions: date_trunc()

A query that always starts at the beginning of the current week

SELECT date_trunc('week', now())::date;
17 / 119

Time functions: date_trunc()

A query that always starts at the beginning of the current week

SELECT date_trunc('week', now())::date;

A query that counts users created by day

SELECT date_trunc('day', created_at), count(*)
FROM users GROUP BY date_trunc('day', created_at)
18 / 119

Time travel: interval

When was a year ago?

SELECT now() - '1 year'::interval;
19 / 119

Time travel: interval

When was a year ago?

SELECT now() - '1 year'::interval;

All the days last month (good for joining):

SELECT generate_series(date_trunc('month', now() - '1 month'::interval),
date_trunc('month', now()), '1 day'::interval)
20 / 119

Boolean values

Use bool, not bit

21 / 119

Boolean values

Use bool, not bit

22 / 119

Binary data

bytea

(Do consider if Postgres is the right solution...)

Mostly you will leave these alone and treat them as opaque.

Handy Exception:

SELECT md5(binary_data) FROM table;
23 / 119

Don't use

  • money: not up to modern standards
  • timestamp: use timestamptz
  • time: you probably meant timestamptz
  • serial: use bigserial

Are you sure?

  • float / integer: use numeric
  • varchar, char: use text, it's faster
  • bitstring: premature optimization
  • xml: libxml2 is awful, but...
  • json: you probably want jsonb
24 / 119

Data Type Internals

A simplified guide for users

25 / 119

Internals: What's a type?*

  • a row in pg_type
  • a nice name for users to remember
  • typename_{in, out}()
  • typename_{send, recv}() optionally
  • functions (host(), unnest(), average())
  • operators (<, =, ~)
  • ... and some other things
26 / 119

Internals: How does it work?

  • typename_in() gets user input, e.g. 'rain', '{1, 3, 5}'
  • returns a block of memory (palloc() fixed size or varlena)
  • this is added as an attribute to a tuple
  • the tuple is packed into a page
  • the page is written to disk
  • C language functions operate on this format
27 / 119

Internals

28 / 119

Internals: TOAST

(The Oversized-Attribute Storage Technique)

Large values are compressed and moved to a special storage location called a TOAST table automatically.

This assumes large values shouldn't be kept with their row anyway.

29 / 119

Internals: TOAST is great

  • Keeps data rows small, saving I/O on search.
  • Shrinks large values, saving I/O on retrieval.
  • Can (maybe?) add cost if searching contents of all rows of large data...
30 / 119

Internals: TOAST is great

  • Keeps data rows small, saving I/O on search.
  • Shrinks large values, saving I/O on retrieval.
  • Can (maybe?) add cost if searching contents of all rows of large data...
  • Try not to do that.
31 / 119

Internals: Make TOAST faster

tsvector on text

GIN on tsvector / jsonb

32 / 119

Advanced data types

A closer look at some of the more powerful types

33 / 119

Advanced types greatest hits

array

jsonb

enum

range

34 / 119

Arrays

35 / 119

Array: What?

It's an array.

36 / 119

Array: What?

It's an array. In a value.

37 / 119

Array: What?

It's an array. In a value. It's not really rocket science.

38 / 119

Array: What?

It's an array. In a value. It's not really rocket science.

Useful for:

  • actual array data
  • tags / collections
  • on-the-fly aggregation in query results
39 / 119

Array: How? (Constructing)

Make an array.

SELECT ARRAY[1, 2, 3];

Make it another way.

SELECT '{1, 2, 3}'::numeric[];

Extend your array.

SELECT ARRAY[1,2] || 3;
SELECT ARRAY[1,2] || ARRAY[3, 4];

Aggregate from records.

SELECT array_agg(distinct users) from events where name = 'PGDay.RU';
40 / 119

Array: How? (Constructing)

Make an array.

SELECT ARRAY[1, 2, 3];

Make it another way.

SELECT '{1, 2, 3}'::numeric[];

Extend your array.

SELECT ARRAY[1,2] || 3;
SELECT ARRAY[1,2] || ARRAY[3, 4];

Aggregate from records.

SELECT array_agg(distinct users) from events where name = 'PGDay.RU';

You can make N-dimensional arrays too.

41 / 119

Array: How? (Accessing)

(Remember SQL arrays are 1-indexed!)

SELECT ('{one, two, three}'::text[])[1]; -- one

Use containment operator to find things.

CREATE INDEX ON table USING gin(tags); -- because fast!
SELECT * FROM table WHERE tags @> array['sometag'];

(GIN indexes are great here)

42 / 119

JSON

Not exactly overlooked!

43 / 119

JSON(b): What?

"We made a database inside a type."

— Oleg Bartunov

44 / 119

JSON(b): When? (1/3)

Convenient short-lived or occasional attribute storage.

{"weird_user?": true, "abuse_tags": ["new", "possible_abuser"]}

(Honour demands that I insist you make these columns if they get used often.)

45 / 119

JSON(b): When? (2/3)

Hierarchical data.

{"name": "car",
"id": 91371
"parts": [
{
"name": "engine"
"id": 3241,
"parts": [
{
"name": "crankshaft",
"id": 23991,
...
46 / 119

JSON(b): When? (3/3)

The data is already JSON.

Why pull it apart?

47 / 119

JSON(b): jsonb vs json

Prefer jsonb, not json.

ALTER TABLE users ADD COLUMN attrs jsonb;

Usually add a GIN index.

CREATE INDEX ON users USING gin(attrs);
48 / 119

JSON(b): Anti-patterns

Probably don't do this...

CREATE TABLE bad_table(id serial, json json);
49 / 119

JSON(b): Anti-patterns

Particularly if..

=# select distinct jsonb_object_keys(json) from bad_table;
jsonb_object_keys
-------------------
id
created_at
updated_at
name
email
owner_id
...

Performance will suffer and your queries will be more complex.

50 / 119

JSON(b): Construction

Input is easy.

select '{"a": "b"}'::jsonb

Adding members too.

UPDATE documents
SET contents = content || '{"new_field": "goes here"}'
WHERE document_id = 32421;
51 / 119

JSON(b): Accessing

Get to elements with -> and ->>

select attrs->>'presenter_irc_nick' from talks
attrs
-------
pvh

Using a single -> returns the internal JSON.

select attrs->'previous_presentation'->>'location' from talks
attrs
-----------
"Ottawa"

Or send a whole path

SELECT jsonb_extract_path_text(attrs,
'previous_presentation', 'location')
52 / 119

JSON(b): Decomposing

Turning JSON into records (requires column definition).

select * from json_to_recordset(
'[{"a":1,"b":"foo"},{"a":"2","c":"bar"}]') as x(a int, b text);
53 / 119

JSON(b): Searching

Find documents with a certain key,

select * from talks where attrs ? 'previous_presentation'

or documents with a whole sub-document in common.

SELECT * FROM talks WHERE attrs @> '{"previous_presentation": "PGCon"}'
54 / 119

Enum

55 / 119

Enums: What?

CREATE TYPE weekdays AS ('Mon', 'Tue', 'Wed', 'Thu', 'Fri');

Enums are conceptually similar to domains.

Fast, transparent mapping of words to integers.

(Data lives in pg_enum.)

56 / 119

Enums: Why?

Enums provide a limited set of labels.

Storage is quite efficient - enums are stored as numbers.

57 / 119

Enums: How?

Enums work like enums in other languages.

CREATE TYPE server_states AS
ENUM ('running', 'uncertain', 'offline', 'restarting');
58 / 119

Enums: How?

Enums work like enums in other languages.

CREATE TYPE server_states AS
ENUM ('running', 'uncertain', 'offline', 'restarting');

You can treat them like their text values.

INSERT INTO servers(state) VALUES ('offline');
59 / 119

Enums: How?

Enums work like enums in other languages.

CREATE TYPE server_states AS
ENUM ('running', 'uncertain', 'offline', 'restarting');

You can treat them like their text values.

INSERT INTO servers(state) VALUES ('offline');

Enums block unexpected values from insertion.

INSERT INTO servers(state) VALUES ('mystery');
ERROR: invalid input value for enum server_states: "mystery"
LINE 1: insert into servers(state) values ('mystery');
60 / 119

Enums: How?

Enums work like enums in other languages.

CREATE TYPE server_states AS
ENUM ('running', 'uncertain', 'offline', 'restarting');

You can treat them like their text values.

INSERT INTO servers(state) VALUES ('offline');

Enums block unexpected values from insertion.

INSERT INTO servers(state) VALUES ('mystery');
ERROR: invalid input value for enum server_states: "mystery"
LINE 1: insert into servers(state) values ('mystery');

But you can add new values.

ALTER TYPE server_states ADD VALUE 'abandoned' AFTER 'offline';
61 / 119

Enums: How?

Enums work like enums in other languages.

CREATE TYPE server_states AS
ENUM ('running', 'uncertain', 'offline', 'restarting');

You can treat them like their text values.

INSERT INTO servers(state) VALUES ('offline');

Enums block unexpected values from insertion.

INSERT INTO servers(state) VALUES ('mystery');
ERROR: invalid input value for enum server_states: "mystery"
LINE 1: insert into servers(state) values ('mystery');

But you can add new values.

ALTER TYPE server_states ADD VALUE 'abandoned' AFTER 'offline';

(NB: You can treat enums as numbers and compare with > and < but... don't.)

62 / 119

Ranges

63 / 119

Ranges: What?

Ranges are a simple but powerful variation on composite types.

INSERT INTO reservation VALUES
('[2016-01-01 11:30, 2016-01-01 15:00)');

Always has upper / lower, inclusive / exclusive bounds.

64 / 119

Ranges: Why?

Nothing lasts forever

65 / 119

Ranges: Why?

Nothing lasts forever

Benefits include:

  • operators for querying
  • exclusion constraints
66 / 119

Ranges: Why?

Some useful predefined range types:

  • numrange
  • tstzrange
  • daterange

Or define your own:

CREATE TYPE inetrange AS RANGE (
SUBTYPE = inet
);
67 / 119

Ranges: How? (Basics)

Ranges can be created mid-query.

SELECT tstzrange(created_at, deleted_at) FROM users;
68 / 119

Ranges: How? (Basics)

Ranges can be created mid-query.

SELECT tstzrange(created_at, deleted_at) FROM users;

(NULLs are helpfully treated as -/+ infinity.)

69 / 119

Ranges: How? (Basics)

Ranges can be created mid-query.

SELECT tstzrange(created_at, deleted_at) FROM users;

(NULLs are helpfully treated as -/+ infinity.)

Find users with active accounts on Christmas Day

SELECT * FROM users
WHERE tstzrange(created_at, deleted_at) @> '2015-12-25'::timestamptz
70 / 119

Ranges: How? (Basics)

Ranges can be created mid-query.

SELECT tstzrange(created_at, deleted_at) FROM users;

(NULLs are helpfully treated as -/+ infinity.)

Find users with active accounts on Christmas Day

SELECT * FROM users
WHERE tstzrange(created_at, deleted_at) @> '2015-12-25'::timestamptz

Indexing makes this very fast.

CREATE INDEX ON users USING GIST( tstzrange(created_at, deleted_at) );
71 / 119

Ranges: How? (Window functions)

Understand how events unfold...

WITH events_duration AS (
SELECT
tstzrange(
at_time,
lead(at_time, 1) OVER (partition by event_source order by at_time)
) as range,
*
FROM events
)
SELECT * FROM events WHERE range @> now() - '4 hours';
72 / 119

Ranges: How? (Constraints)

Ranges are also great for exclusion constraints.

Prevent double-booking!

ALTER TABLE reservation ADD CONSTRAINT
EXCLUDE USING gist(room with =, during with &&)
73 / 119

Extension Types

74 / 119

Email Addresses

Everyone has email addresses, here's how you deal with them today.

SELECT * FROM users WHERE email LIKE '%@bigcorp.com'
75 / 119

Email Addresses

Everyone has email addresses, here's how you deal with them today.

SELECT * FROM users WHERE email LIKE '%@bigcorp.com'

This is always a full table scan. Sad trombone.

76 / 119

Email Addresses

Everyone has email addresses, here's how you deal with them today.

SELECT * FROM users WHERE email LIKE '%@bigcorp.com'

This is always a full table scan. Sad trombone.

We can try our reverse b-tree trick.

SELECT * FROM users WHERE reverse(email) LIKE reverse('%@bigcorp.com')
77 / 119

Email Addresses

Everyone has email addresses, here's how you deal with them today.

SELECT * FROM users WHERE email LIKE '%@bigcorp.com'

This is always a full table scan. Sad trombone.

We can try our reverse b-tree trick.

SELECT * FROM users WHERE reverse(email) LIKE reverse('%@bigcorp.com')

This is... clever, but ugly and hard to dream up.

78 / 119

emailaddr

github.com/petere/pgemailaddr

SELECT user(email) FROM users WHERE host(email) = 'bigcorp.com'

Bonus: By default, email addresses are sorted domain/user, so this is fast with no tricks!

79 / 119

What about URLs?

80 / 119

uri

github.com/petere/pguri

81 / 119

uri

github.com/petere/pguri

SELECT path(referrer) FROM signups WHERE host = 'interestingblog.com'
82 / 119

Skip the migration!

SELECT 'http://google.com/'::uri
SELECT host('http://google.com/'::uri)
83 / 119

Special-purpose Types

84 / 119

Geometry: What?

Postgres has a wide selection of geometry types.

  • point
  • line
  • lseg
  • box
  • path
  • polygon
  • circle
85 / 119

Geometric: Functions and Operators

  • translation: +, -, *, /
  • properties: @-@ (length), # (intersects)
  • comparison: @> (contains), <-> (distance)
  • .. and SO many others (&<|)
86 / 119

Geometry: How?

SELECT diameter(circle) FROM circles
WHERE center(circle) = '(2, 3)'
87 / 119

Geometry: Why?

  • good for light-weight geometry
  • not as fast or as feature rich as PostGIS
  • built in!
88 / 119

PostGIS

http://postgis.net

PostGIS is the best way to store and query spatial data.

89 / 119

PostGIS: What?

  • geom
  • geog
  • index support
  • wraps various spatial libraries
90 / 119

PostGIS: Why?

  • geom datatype
  • better performance
  • huge amounts of functionality (vector / raster)
  • not built-in
91 / 119

PostGIS: How?

SELECT
sum(ST_Length(r.the_geom))/1000 AS kilometers
FROM
bc_roads r,
bc_municipality m
WHERE r.name = 'Roseberry St' AND m.name = 'VICTORIA'
AND ST_Contains(m.the_geom, r.the_geom) ;
92 / 119

Network addresses

built-in types

  • inet (192.168.0.100)
  • cidr (::ffff:1.2.3.0/120)
  • macaddr (08:00:2b:01:02:03)

ip4r

https://github.com/RhodiumToad/ip4r

Adds index support for ranges, lighter-weight single IP type. (And lots more.)

93 / 119

Special mention

PostBIS

Bio-informatic types for DNA querying

https://colab.mpi-bremen.de/wiki/display/pbis/PostBIS

pg_currency

Very stale (6 years) but a solid foundation!

https://github.com/samv/pg-currency

94 / 119

Special mention

FixedDecimal

For when you need a fixed-precision decimal (performance)

https://github.com/2ndQuadrant/fixeddecimal

HyperLogLog

Not really a data type, but a data structure for estimating distinct values.

https://github.com/aggregateknowledge/postgresql-hll

95 / 119

Do-it-yourself data-types

96 / 119

Domains: What?

Create your own types that constrain existing types.

97 / 119

Domains: What?

Create your own types that constrain existing types.

  • roll your own email type
  • a more specific URI
98 / 119

Domains: What?

Create your own types that constrain existing types.

  • roll your own email type
  • a more specific URI
99 / 119

Domains: When?

  • common, stable specification
  • postal code
  • email address
  • NOT name
  • NOT postal address
100 / 119

Domains: Examples

CREATE DOMAIN email AS TEXT
CHECK(
VALUE ~ '.+\@.+'
);
CREATE DOMAIN web_url AS uri CHECK ( (uri_scheme(value) = 'http' OR uri_scheme(value) = 'https') AND uri_host(value) IS NOT null);
101 / 119

Domains: Examples

CREATE DOMAIN email AS TEXT
CHECK(
VALUE ~ '.+\@.+'
);
CREATE DOMAIN web_url AS uri CHECK ( (uri_scheme(value) = 'http' OR uri_scheme(value) = 'https') AND uri_host(value) IS NOT null);

(A better email address would check other things like length.)

102 / 119

Composite Types: What?

Create your own types that combine existing types.

103 / 119

Composite Types: What?

Create your own types that combine existing types.

  • all tables are also types
  • members always go together
  • all members always have the same fields
104 / 119

Composite Types: How? (1/2)

Every row is also a composite type.

SELECT (users) FROM users;

But you can create your own:

CREATE TYPE point3d AS (x float, y float, z float);
CREATE TABLE threespace (location point3d);
105 / 119

Composite Types: How? (2/2)

Referring to a row type (composite type named for table) requires parens

SELECT (users).email FROM users;

Row types are handy in window expressions

SELECT (agent_statuses) AS current,
lead((agent_statuses), 1) OVER (partition by agent_uuid order by time) AS next
FROM agent_statuses;
106 / 119

DIY functions

Defining a new function is straightforward

CREATE OR REPLACE FUNCTION host(email) RETURNS text
LANGUAGE sql AS $$
select split_part($1, '@', 2) as return;
$$;
107 / 119

DIY functions

Defining a new function is straightforward

CREATE OR REPLACE FUNCTION host(email) RETURNS text
LANGUAGE sql AS $$
select split_part($1, '@', 2) as return;
$$;

PL/PGSQL is very slow.

PL/V8 is relatively fast.

PL/SQL is quite fast.

108 / 119

DIY Aggregates

Define an initial condition, a per-value function, and a final function (optional).

CREATE AGGREGATE avg (float8)
(
sfunc = float8_accum,
stype = float8[],
finalfunc = float8_avg,
initcond = '{0,0,0}'
);
109 / 119

DIY Operators

An operator is any number of non-ASCII characters.

CREATE OPERATOR @-> (
LEFTARG = 'email',
PROCEDURE = 'host');
SELECT email@-> FROM emails LIMIT 1;
110 / 119

Missing Types

Opportunities to make cool things

111 / 119

Currency

money is not a good type, and we can do better.

SELECT justify('$34.03 USD, $12.01 CAD',
'USD',
'{"CAD": 0.77}');
112 / 119

Physical Units

Postgres users don't crash rovers into Mars.

SELECT '77 km/h'::physical
SELECT to_unit('ml', '542 fluid ounces'::physical);
SELECT '4 m/s'::physical + '1 ounce'::physical; -- error
113 / 119

Images

UPDATE graphs
SET image = render_graph(SELECT * FROM data WHERE run = 117) ;
SELECT gif_frame(image, 3) FROM gifs WHERE name = 'space-kaboom';

Image classification, sorting, organizing, resampling, and so on in-DB could be useful.

114 / 119

Music / Sound / Tags

SELECT genre FROM mp3s;
SELECT max(sound_length(sound)) FROM samples;

Could be great for people with large music collections, DJs, researchers.

Consider building on taglib?

115 / 119

phew!

we made it

116 / 119

Basic data types to use

text, timestamptz, bigserial/uuid

Advanced types to use

jsonb, array, tstzrange/numrange

Create your own

Domains, composite types, functions, operators, aggregates.

Community extensions

PostGIS, ip4r, email, uri, pgxn

117 / 119

fin

questions? comments? complaints?

@pvh / pvh@pvh.ca

Feedback is welcome and invited!

thanks to: peter eisentraut, jim mlodgenski, jonathan katz, bruce momjian, josh berkus, will leinweber, rachid belaid

118 / 119
119 / 119

Peter van Hardenberg

Heroku Postgres

http://postgres-data-types.pvh.ca/

(If you want a reference or see a bug, let me know!)

@pvh / pvh@pvh.ca

2 / 119
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow