Postgres provides two types of Character Types, one is character varying(n) and the other is character(n), Both of these types can store strings up to n characters (not bytes) in length

Name Description
character varying(n), varchar(n) variable-length with limit
character(n), char(n) fixed-length, blank padded
text variable unlimited length

To store them, the storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead of 1.

Long strings that larger than 2k are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values. In any case, the longest possible character string that can be stored is about 1 GB.

The compression for strings > 2k are being handled by TOAST.

TOAST(The Oversized-Attribute Storage Technique)

The TOAST management code recognizes four different strategies for storing TOAST-able columns on disk:

  • PLAIN prevents either compression or out-of-line storage; furthermore it disables use of single-byte headers for varlena types. This is the only possible strategy for columns of non-TOAST-able data types.

  • EXTENDED allows both compression and out-of-line storage. This is the default for most TOAST-able data types. Compression will be attempted first, then out-of-line storage if the row is still too big.

  • EXTERNAL allows out-of-line storage but not compression. Use of EXTERNAL will make substring operations on wide text and bytea columns faster (at the penalty of increased storage space) because these operations are optimized to fetch only the required parts of the out-of-line value when it is not compressed.

  • MAIN allows compression but not out-of-line storage. (Actually, out-of-line storage will still be performed for such columns, but only as a last resort when there is no other way to make the row small enough to fit on a page.)

The TOAST management code is triggered only when a row value to be stored in a table is wider than TOAST_TUPLE_THRESHOLD bytes (normally 2 kB). The TOAST code will compress and/or move field values out-of-line until the row value is shorter than TOAST_TUPLE_TARGET bytes (also normally 2 kB) or no more gains can be had. During an UPDATE operation, values of unchanged fields are normally preserved as-is; so an UPDATE of a row with out-of-line values incurs no TOAST costs if none of the out-of-line values change.

However, the TOAST_TUPLE_THRESHOLD parameter change need recompile.

Postgres Foreign data wrapper

As discussed above, the column that larger than 2k will be compressed automatically, what about the column that less than 2k?

Technically speaking, there is no solution within Postgres does that, however, since Postgres 9.3, using new feature Foreign data wrapper APIs, external extentions can be defined to archive it. cstore-fdw is the one of Postgres foreign data wrappers

cstore-fdw is an open sourced Postgres extension that uses the Optimized Row Columnar (ORC) format for its data layout. Table metadata  handled by protobuf-c for serializing and deserializing table metadata. It's dedicated for data analysis purpose.
ORC improves upon the RCFile format developed at Facebook, open source
protocal buffer is developed at Google, open source

It brings huge benefit on space cost, as well as performance.

Postgres works with it through Postgres foreign data wrapper APIs, so it's transparent to user end. Regular SQLs, stored procedures, triggers are all good with it. No  table level vacuum any more, but it does need vacuum analyze sometime for a while to let PG planner to work better with it.

Comments powered by CComment