class: center, middle, section-heading # Understanding the power of data types ## PostgreSQL's Secret Weapon --- class: center, middle # Peter van Hardenberg ### Heroku Postgres http://postgres-data-types.pvh.ca/ (If you want a reference or see a bug, let me know!) #### @pvh / pvh@pvh.ca --- # Programme * Basic data types * Type internals (for users) * Advanced data types * Special-purpose data types * Customizing types * Community extensions * Missing types --- class: center, middle # Why use data types? ## Make queries simpler and more readable. ## Solve problems elegantly and in the database. ## Performance and space efficiency. --- # Caveat Auditor ## "The root of all evil is premature optimization." .right[— Knuth] --- class: center, middle, section-heading # Basic data types ## A very brief review --- # Basic types ## surrogate keys ## text ## numbers ## dates and times --- # Keys * Include a surrogate key. * Model your natural key via constraints. * Use `bigserial` or `uuid` for your primary key.* --- # Keys: BIGSERIAL If your table is small, the extra size doesn't matter. If your table is big, you're going to need it anyway. --- # Keys: UUID ````sql CREATE EXTENSION "uuid-ossp"; SELECT uuid_generate_v4(); ```` ````sql CREATE EXTENSION pgcrypto; SELECT gen_random_uuid(); ```` (Or create the UUIDs in the client.) --- # Text * Use `text`! * Avoid `varchar`, `char`, and anything else. --- # Text: Use indexes for pattern-matching An index supports prefixes (`LIKE 'Peter%'`) ````sql CREATE INDEX ON users (name); SELECT * FROM accounts WHERE email LIKE 'Peter%'; ```` -- A functional index can support suffix lookups. ````sql CREATE INDEX backsearch ON users (reverse(email)); SELECT * FROM accounts WHERE reverse(email) LIKE reverse('%doe.com');` ```` --- # Text: Other Searching A tsvectors with a GIN index will find individual words efficiently ````sql CREATE INDEX ON products USING gin(tsv); ```` Regular expressions: ~ ````sql SELECT * FROM users WHERE name ~ '(John|Jane)\s+Doe'; ```` (You can get some index capability via pg_trgm/GIN index) --- # Times & Dates ## `timestamptz` ### Always use `timestamptz`. (Not `time`.) -- ## `date` ### (For when you just need the date.) --- # Time functions: date_trunc() A query that always starts at the beginning of the current week ````sql SELECT date_trunc('week', now())::date; ```` -- A query that counts users created by day ````sql SELECT date_trunc('day', created_at), count(*) FROM users GROUP BY date_trunc('day', created_at) ```` --- # Time travel: interval When was a year ago? ````sql SELECT now() - '1 year'::interval; ```` -- All the days last month (good for joining): ````sql SELECT generate_series(date_trunc('month', now() - '1 month'::interval), date_trunc('month', now()), '1 day'::interval) ```` --- # Boolean values ## Use `bool`, not `bit` -- Tip: Don't index on booleans. (Consider a partial index.) --- # Binary data ## `bytea` (Do consider if Postgres is the right solution...) Mostly you will leave these alone and treat them as opaque. Handy Exception: ````sql SELECT md5(binary_data) FROM table; ```` --- # Don't use * `money`: not up to modern standards * `timestamp`: use `timestamptz` * `time`: you probably meant `timestamptz` * `serial`: use `bigserial` # Are you sure? * `float` / `integer`: use `numeric` * `varchar`, `char`: use `text`, it's faster * `bitstring`: premature optimization * `xml`: libxml2 is awful, but... * `json`: you probably want jsonb --- class: center, middle, section-heading # Data Type Internals ## A simplified guide for users --- # Internals: What's a type?* * a row in pg_type * a nice name for users to remember * `typename_{in, out}()` * `typename_{send, recv}()` optionally * functions (`host()`, `unnest()`, `average()`) * operators (`<`, `=`, `~`) * ... and some other things --- # Internals: How does it work? * typename_in() gets user input, e.g. `'rain', '{1, 3, 5}'` * returns a block of memory (palloc() fixed size or varlena) * this is added as an attribute to a tuple * the tuple is packed into a page * the page is written to disk * C language functions operate on this format --- # Internals <img width=220 src="tuple.png"/> <img width=500 src="page.png"/> --- # Internals: TOAST ## (The Oversized-Attribute Storage Technique) Large values are compressed and moved to a special storage location called a TOAST table automatically. This assumes large values shouldn't be kept with their row anyway. --- # Internals: TOAST is great * Keeps data rows small, saving I/O on search. * Shrinks large values, saving I/O on retrieval. * Can (maybe?) add cost if searching contents of all rows of large data... -- * Try not to do that. --- # Internals: Make TOAST faster ## `tsvector` on `text` ## `GIN` on `tsvector` / `jsonb` --- class: center, middle, section-heading # Advanced data types ## A closer look at some of the more powerful types --- class: center # Advanced types greatest hits ## `array` ## `jsonb` ## `enum` ## `range` --- class: center, middle, section-heading # Arrays --- # Array: What? It's an array. -- In a value. -- It's not really rocket science. -- Useful for: * actual array data * tags / collections * on-the-fly aggregation in query results --- # Array: How? (Constructing) Make an array. ````sql SELECT ARRAY[1, 2, 3]; ```` Make it another way. ````sql SELECT '{1, 2, 3}'::numeric[]; ```` Extend your array. ````sql SELECT ARRAY[1,2] || 3; SELECT ARRAY[1,2] || ARRAY[3, 4]; ```` Aggregate from records. ````sql SELECT array_agg(distinct users) from events where name = 'PGDay.RU'; ```` -- You can make N-dimensional arrays too. --- # Array: How? (Accessing) (Remember SQL arrays are 1-indexed!) ````sql SELECT ('{one, two, three}'::text[])[1]; -- one ```` Use containment operator to find things. ````sql CREATE INDEX ON table USING gin(tags); -- because fast! SELECT * FROM table WHERE tags @> array['sometag']; ```` (GIN indexes are great here) --- class: center, middle, section-heading # JSON ## Not exactly overlooked! --- # JSON(b): What? ## "We made a database inside a type." .right[— Oleg Bartunov] --- # JSON(b): When? (1/3) Convenient short-lived or occasional attribute storage. ````sql {"weird_user?": true, "abuse_tags": ["new", "possible_abuser"]} ```` (Honour demands that I insist you make these columns if they get used often.) --- # JSON(b): When? (2/3) Hierarchical data. ````json {"name": "car", "id": 91371 "parts": [ { "name": "engine" "id": 3241, "parts": [ { "name": "crankshaft", "id": 23991, ... ```` --- # JSON(b): When? (3/3) The data is already JSON. Why pull it apart? --- # JSON(b): `jsonb` vs `json` Prefer `jsonb`, not `json`. ````sql ALTER TABLE users ADD COLUMN attrs jsonb; ```` Usually add a GIN index. ````sql CREATE INDEX ON users USING gin(attrs); ```` --- # JSON(b): Anti-patterns Probably don't do this... ````sql CREATE TABLE bad_table(id serial, json json); ```` --- # JSON(b): Anti-patterns Particularly if.. ````sql =# select distinct jsonb_object_keys(json) from bad_table; jsonb_object_keys ------------------- id created_at updated_at name email owner_id ... ```` Performance will suffer and your queries will be more complex. --- # JSON(b): Construction Input is easy. ````sql select '{"a": "b"}'::jsonb ```` Adding members too. ```sql UPDATE documents SET contents = content || '{"new_field": "goes here"}' WHERE document_id = 32421; ``` --- # JSON(b): Accessing Get to elements with -> and ->> ````sql select attrs->>'presenter_irc_nick' from talks attrs ------- pvh ```` Using a single -> returns the internal JSON. ````sql select attrs->'previous_presentation'->>'location' from talks attrs ----------- "Ottawa" ```` Or send a whole path ````sql SELECT jsonb_extract_path_text(attrs, 'previous_presentation', 'location') ```` --- # JSON(b): Decomposing Turning JSON into records (requires column definition). ````sql select * from json_to_recordset( '[{"a":1,"b":"foo"},{"a":"2","c":"bar"}]') as x(a int, b text); ```` --- # JSON(b): Searching Find documents with a certain key, ````sql select * from talks where attrs ? 'previous_presentation' ```` or documents with a whole sub-document in common. ````sql SELECT * FROM talks WHERE attrs @> '{"previous_presentation": "PGCon"}' ```` --- class: center, middle, section-heading # Enum --- # Enums: What? ````sql CREATE TYPE weekdays AS ('Mon', 'Tue', 'Wed', 'Thu', 'Fri'); ```` Enums are conceptually similar to domains. Fast, transparent mapping of words to integers. (Data lives in pg_enum.) --- # Enums: Why? Enums provide a limited set of labels. Storage is quite efficient - enums are stored as numbers. --- # Enums: How? Enums work like enums in other languages. ````sql CREATE TYPE server_states AS ENUM ('running', 'uncertain', 'offline', 'restarting'); ```` -- You can treat them like their text values. ````sql INSERT INTO servers(state) VALUES ('offline'); ```` -- Enums block unexpected values from insertion. ````sql INSERT INTO servers(state) VALUES ('mystery'); ERROR: invalid input value for enum server_states: "mystery" LINE 1: insert into servers(state) values ('mystery'); ```` -- But you can add new values. ````sql ALTER TYPE server_states ADD VALUE 'abandoned' AFTER 'offline'; ```` -- (NB: You can treat enums as numbers and compare with > and < but... don't.) --- class: center, middle, section-heading # Ranges --- # Ranges: What? Ranges are a simple but powerful variation on composite types. ````sql INSERT INTO reservation VALUES ('[2016-01-01 11:30, 2016-01-01 15:00)'); ```` Always has upper / lower, inclusive / exclusive bounds. --- # Ranges: Why? ## Nothing lasts forever -- Benefits include: * operators for querying * exclusion constraints --- # Ranges: Why? Some useful predefined range types: * numrange * tstzrange * daterange Or define your own: ````sql CREATE TYPE inetrange AS RANGE ( SUBTYPE = inet ); ```` --- # Ranges: How? (Basics) Ranges can be created mid-query. ````sql SELECT tstzrange(created_at, deleted_at) FROM users; ```` -- (NULLs are helpfully treated as -/+ infinity.) -- Find users with active accounts on Christmas Day ````sql SELECT * FROM users WHERE tstzrange(created_at, deleted_at) @> '2015-12-25'::timestamptz ```` -- Indexing makes this very fast. ````sql CREATE INDEX ON users USING GIST( tstzrange(created_at, deleted_at) ); ```` --- # Ranges: How? (Window functions) Understand how events unfold... ````sql WITH events_duration AS ( SELECT tstzrange( at_time, lead(at_time, 1) OVER (partition by event_source order by at_time) ) as range, * FROM events ) SELECT * FROM events WHERE range @> now() - '4 hours'; ```` --- # Ranges: How? (Constraints) Ranges are also great for exclusion constraints. Prevent double-booking! ````sql ALTER TABLE reservation ADD CONSTRAINT EXCLUDE USING gist(room with =, during with &&) ```` --- class: center, middle, section-heading # Extension Types --- # Email Addresses Everyone has email addresses, here's how you deal with them today. ````sql SELECT * FROM users WHERE email LIKE '%@bigcorp.com' ```` -- This is always a full table scan. Sad trombone. -- We can try our reverse b-tree trick. ````sql SELECT * FROM users WHERE reverse(email) LIKE reverse('%@bigcorp.com') ```` -- This is... clever, but ugly and hard to dream up. --- # `emailaddr` ## github.com/petere/pgemailaddr ````sql SELECT user(email) FROM users WHERE host(email) = 'bigcorp.com' ```` Bonus: By default, email addresses are sorted domain/user, so this is fast with no tricks! --- class: middle, center # What about URLs? --- # `uri` ## github.com/petere/pguri -- ````sql SELECT path(referrer) FROM signups WHERE host = 'interestingblog.com' ```` --- # Skip the migration! ````sql SELECT 'http://google.com/'::uri ```` ````sql SELECT host('http://google.com/'::uri) ```` --- class: center, middle, section-heading # Special-purpose Types --- # Geometry: What? Postgres has a wide selection of geometry types. * point * line * lseg * box * path * polygon * circle --- # Geometric: Functions and Operators * translation: `+`, `-`, `*`, `/` * properties: `@-@` (length), `#` (intersects) * comparison: `@>` (contains), `<->` (distance) * .. and SO many others (`&<|`) --- # Geometry: How? ````sql SELECT diameter(circle) FROM circles WHERE center(circle) = '(2, 3)' ```` --- # Geometry: Why? * good for light-weight geometry * not as fast or as feature rich as PostGIS * built in! --- # PostGIS http://postgis.net PostGIS is the best way to store and query spatial data. --- # PostGIS: What? * `geom` * `geog` * index support * wraps various spatial libraries --- # PostGIS: Why? * geom datatype * better performance * huge amounts of functionality (vector / raster) * not built-in --- # PostGIS: How? ````sql SELECT sum(ST_Length(r.the_geom))/1000 AS kilometers FROM bc_roads r, bc_municipality m WHERE r.name = 'Roseberry St' AND m.name = 'VICTORIA' AND ST_Contains(m.the_geom, r.the_geom) ; ```` --- # Network addresses ## built-in types * inet (`192.168.0.100`) * cidr (`::ffff:1.2.3.0/120`) * macaddr (`08:00:2b:01:02:03`) ## ip4r https://github.com/RhodiumToad/ip4r Adds index support for ranges, lighter-weight single IP type. (And lots more.) --- # Special mention ## PostBIS Bio-informatic types for DNA querying https://colab.mpi-bremen.de/wiki/display/pbis/PostBIS ## pg_currency Very stale (6 years) but a solid foundation! https://github.com/samv/pg-currency --- # Special mention ## FixedDecimal For when you need a fixed-precision decimal (performance) https://github.com/2ndQuadrant/fixeddecimal ## HyperLogLog Not really a data type, but a data structure for estimating distinct values. https://github.com/aggregateknowledge/postgresql-hll --- class: center, middle, section-heading # Do-it-yourself data-types --- # Domains: What? Create your own types that constrain existing types. -- * roll your own email type * a more specific URI -- NB: SQL wants all types to be null-friendly. Your domain should allow nulls. --- # Domains: When? * common, stable specification * postal code * email address * NOT name * NOT postal address --- # Domains: Examples ````sql CREATE DOMAIN email AS TEXT CHECK( VALUE ~ '.+\@.+' ); ```` ````sql CREATE DOMAIN web_url AS uri CHECK ( (uri_scheme(value) = 'http' OR uri_scheme(value) = 'https') AND uri_host(value) IS NOT null); ```` -- (A better email address would check other things like length.) --- # Composite Types: What? Create your own types that combine existing types. -- * all tables are also types * members always go together * all members always have the same fields --- # Composite Types: How? (1/2) Every row is also a composite type. ````sql SELECT (users) FROM users; ```` But you can create your own: ````sql CREATE TYPE point3d AS (x float, y float, z float); CREATE TABLE threespace (location point3d); ```` --- # Composite Types: How? (2/2) Referring to a row type (composite type named for table) requires parens ````sql SELECT (users).email FROM users; ```` Row types are handy in window expressions ````sql SELECT (agent_statuses) AS current, lead((agent_statuses), 1) OVER (partition by agent_uuid order by time) AS next FROM agent_statuses; ```` --- # DIY functions Defining a new function is straightforward ````sql CREATE OR REPLACE FUNCTION host(email) RETURNS text LANGUAGE sql AS $$ select split_part($1, '@', 2) as return; $$; ```` -- PL/PGSQL is very slow. PL/V8 is relatively fast. PL/SQL is quite fast. --- # DIY Aggregates Define an initial condition, a per-value function, and a final function (optional). ````sql CREATE AGGREGATE avg (float8) ( sfunc = float8_accum, stype = float8[], finalfunc = float8_avg, initcond = '{0,0,0}' ); ```` --- # DIY Operators An operator is any number of non-ASCII characters. ````sql CREATE OPERATOR @-> ( LEFTARG = 'email', PROCEDURE = 'host'); SELECT email@-> FROM emails LIMIT 1; ```` --- class: center, middle, section-heading # Missing Types ## Opportunities to make cool things --- # Currency `money` is not a good type, and we can do better. ````sql SELECT justify('$34.03 USD, $12.01 CAD', 'USD', '{"CAD": 0.77}'); ```` --- # Physical Units Postgres users don't crash rovers into Mars. ````sql SELECT '77 km/h'::physical SELECT to_unit('ml', '542 fluid ounces'::physical); SELECT '4 m/s'::physical + '1 ounce'::physical; -- error ```` --- # Images ````sql UPDATE graphs SET image = render_graph(SELECT * FROM data WHERE run = 117) ; ```` ````sql SELECT gif_frame(image, 3) FROM gifs WHERE name = 'space-kaboom'; ```` Image classification, sorting, organizing, resampling, and so on in-DB could be useful. --- # Music / Sound / Tags ````sql SELECT genre FROM mp3s; SELECT max(sound_length(sound)) FROM samples; ```` Could be great for people with large music collections, DJs, researchers. Consider building on taglib? --- class: center, middle, section-heading # phew! # we made it --- ## Basic data types to use `text`, `timestamptz`, `bigserial`/`uuid` ## Advanced types to use `jsonb`, `array`, `tstzrange`/`numrange` ## Create your own Domains, composite types, functions, operators, aggregates. ## Community extensions PostGIS, ip4r, email, uri, pgxn --- class: section-heading, center, middle # fin ## questions? comments? complaints? ## @pvh / pvh@pvh.ca Feedback is welcome and invited! thanks to: peter eisentraut, jim mlodgenski, jonathan katz, bruce momjian, josh berkus, will leinweber, rachid belaid ---