How Postgres Stores Rows

Posted at — Feb 7, 2022 · 8 min read

Out of curiosity, I was trying to understand how PostgreSQL stores the data onto the disk and there are a few interesting things that I have noticed that might be useful for application developers. In this post, I will try to go into the implementation level details and map out how PostgreSQL row storage really works.

Just to be clear, PostgreSQL stores a lot of files on disk such as transaction commit data, subtransaction status data, write ahead logs (WAL), etc. I will only be exploring heap files. Now what the heck is a heap file? A heap file is just a file of records. Note that the heap file has got nothing to do with heap memory. Although their use case is very similar, which is storing dynamic data.

How rows are organized in files?

PostgreSQL stores the actual data into segment files (more generally called heap files). Typically its fixed to 1GB size but you can configure that at compile time using --with-segsize. When a table or index exceeds 1 GB, it is divided into gigabyte-sized segments. This arrangement avoids problems on platforms that have file size limitations but 1GB is very conservative choice for any modern platform.

These segment files contain data in fixed size pages which is usually 8Kb, although a different page size can be selected when compiling the server with --with-blocksize option but this size usually falls in ideal size when considering performance and reliability tradeoffs. If the page size is too small, rows won’t fit inside the page and if it’s too large there is risk of write failure because hardware generally can only guarantee atomicity for a fixed size blocks which can vary disk to disk (usually ranges from 512 bytes to 4096 bytes).

Pages themselves are organized in a layout something like following:

lp(1..N) is line pointer array. It points to a logical offset within that page. Since these are arrays, elements are fixed sized but the number of elements can vary.

row(1..N) represents actual SQL rows. They are added in reverse order within a page. They are generally variable sized and in order to reach to a specific tuple we use line pointer. Since there can be variable number of rows inside a page, items are added backward while line pointer is added to front.

special space is typically used when storing indexes in these page, usually sibling nodes in a B-Tree for example. For table data this is not used.

lower and upper pointer mark the locations that have already been used in a page. Thus, (upper - lower) determines if a page has already been used.

page header itself is not that interesting, it just contains book-keeping stuff related a page including lower and upper pointer. It’s structure looks like this:

field	size	description
pd_lsn	8 bytes	LSN: next byte after last byte of WAL record for last change to this page
pd_checksum	2 bytes	Page checksum
pd_flags	2 bytes	Flag bits
pd_lower	2 bytes	Offset to start of free space
pd_upper	2 bytes	Offset to end of free space
pd_special	2 bytes	Offset to start of special space
pd_pagesize_version	2 bytes	Page size and layout version number information
pd_prune_xid	4 bytes	Oldest unpruned XMAX on page, or zero if none

Now, this method of storage has few implications.

Firstly, tuples cannot span multiple pages. Therefore, it is not possible to store very large field values directly. Large field values are compressed and/or broken up into multiple physical rows.
Secondly, the maximum number of columns is limited between 250 and 1600 depending on column types.
Thirdly, if you have a table with multiple columns and naively assume that querying for selected column will be faster because of lower disk I/O, that is not going to be the case. Although it can help with network saturation if there’s too much traffic between the application and the database.
This kind of storage is not optimized for analytical workloads because you have to bring in data from disk which would be irrelevant for your queries. This is why most analytical based databases (OLAP) such as vertica use columnar storage.

Peeking into pages

Let’s try to peek into pages and see what tools PostgreSQL provides to inspect pages. We will start by creating a database, a table inside it and populating the table with some dummy data.

 1
 2postgres=# \c hail_mary ;
 3
 4You are now connected to database "hail_mary" as user "postgres".
 5
 6hail_mary=# 
 7
 8hail_mary=# create table users(id int, name text);
 9
10CREATE TABLE
11
12hail_mary=# insert into users(id, name)
13
14select i,md5(i::text)::text from generate_series(1, 50000, 1) as i;
15
16INSERT 0 50000
17
18hail_mary=# select * from users limit 5;
19
20 id |               name               
21
22----+----------------------------------
23
24  1 | c4ca4238a0b923820dcc509a6f75849b
25
26  2 | c81e728d9d4c2f636f067f89cc14862c
27
28  3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
29
30  4 | a87ff679a2f3e71d9181a67b7542122c
31
32  5 | e4da3b7fbbce2345d7772b0674a318d5
33
34(5 rows)
35

Internally PostgreSQL maintains a unique row id for our data which is usually opaque to users. We can query it explicitly to see its value.

 1
 2hail_mary=# select ctid, * from users limit 5;
 3
 4 ctid  | id |               name               
 5
 6-------+----+----------------------------------
 7
 8 (0,1) |  1 | c4ca4238a0b923820dcc509a6f75849b
 9
10 (0,2) |  2 | c81e728d9d4c2f636f067f89cc14862c
11
12 (0,3) |  3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
13
14 (0,4) |  4 | a87ff679a2f3e71d9181a67b7542122c
15
16 (0,5) |  5 | e4da3b7fbbce2345d7772b0674a318d5
17
18(5 rows)
19
20
21
22

In ctid First digit stand for the page number and the second digit stands for the tuple number. PostgreSQL moves around these tuples when VACUUM is run to defragment the page. We can actually see it in action.

 1
 2hail_mary=# select ctid, * from users OFFSET 49995 limit 5;
 3
 4   ctid   |  id   |               name               
 5
 6----------+-------+----------------------------------
 7
 8 (416,76) | 49996 | 2acfa04df8cc1e5b051866c32f9eb072
 9
10 (416,77) | 49997 | 87aa98d07ec242cc4d8f685f0299257b
11
12 (416,78) | 49998 | 12775d2a4498f0ec748a4beed90e5ad2
13
14 (416,79) | 49999 | c703af5c89b1d0bc2e99f540f553f182
15
16 (416,80) | 50000 | 1017bfd4673955ffee4641ad3d481b1c
17
18(5 rows)
19
20hail_mary=# 
21
22hail_mary=# delete from users where name = 'c703af5c89b1d0bc2e99f540f553f182';
23
24DELETE 1
25
26hail_mary=# insert into users VALUES(9999999, 'ryland grace');
27
28INSERT 0 1
29
30hail_mary=# select ctid, * from users OFFSET 49995 limit 5;
31
32   ctid   |   id    |               name               
33
34----------+---------+----------------------------------
35
36 (416,76) |   49996 | 2acfa04df8cc1e5b051866c32f9eb072
37
38 (416,77) |   49997 | 87aa98d07ec242cc4d8f685f0299257b
39
40 (416,78) |   49998 | 12775d2a4498f0ec748a4beed90e5ad2
41
42 (416,80) |   50000 | 1017bfd4673955ffee4641ad3d481b1c
43
44 (416,81) | 9999999 | ryland grace
45
46(5 rows)
47

What we can see here is that PostgreSQL left tuple with name c703af5c89b1d0bc2e99f540f553f182 (which is now deleted) as is and decided to add new data into a new physical tuple instead of reusing it. What if we run VACCUM ?

 1
 2hail_mary=# VACUUM FULL;
 3
 4hail_mary=# select ctid, * from users OFFSET 49995 limit 5;
 5
 6   ctid   |   id    |               name               
 7
 8----------+---------+----------------------------------
 9
10 (416,76) |   49996 | 2acfa04df8cc1e5b051866c32f9eb072
11
12 (416,77) |   49997 | 87aa98d07ec242cc4d8f685f0299257b
13
14 (416,78) |   49998 | 12775d2a4498f0ec748a4beed90e5ad2
15
16 (416,79) |   50000 | 1017bfd4673955ffee4641ad3d481b1c
17
18 (416,80) | 9999999 | ryland grace
19
20(5 rows)
21

VACCUM as expected defragmented page by moving the tuples around.

Visually it can be understood as following

before

Before deleting

after-delete

After deleting and inserting

after-vaccum

After Vacuum

We can also look inside actual pages with the help of an extention called pageinspect which provides X-ray for heap files

1
2hail_mary=# create extension pageinspect;
3
4CREATE EXTENSION
5

Let’s take a look at last page (416)

1
2hail_mary=# select * from page_header(get_raw_page('users', 416));
3
4    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
5
6-----------+----------+-------+-------+-------+---------+----------+---------+-----------
7
8 0/428E4F0 |        0 |     0 |   344 |  3088 |    8192 |     8192 |       4 |         0
9

lower is 344 and upper is 3088 which means free space in this page is 2744 bytes

special points to 8192 which means it points to end of page and is storing nothing.

We look at previous page, we expect that page not to contain enough space for our row data

 1
 2hail_mary=# select * from page_header(get_raw_page('users', 415));
 3
 4    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
 5
 6-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 7
 8 0/428CF58 |        0 |     0 |   504 |   512 |    8192 |     8192 |       4 |         0
 9
10(1 row)
11

As expected, 8 bytes are not enough to store id + name

We can also peek inside page items using the following query:

 1
 2hail_mary=# select * from heap_page_items(get_raw_page('users', 416)) offset 78 limit 2;
 3
 4 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 |  t_ctid  | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |                                    t_data                                    
 5
 6----+--------+----------+--------+--------+--------+----------+----------+-------------+------------+--------+--------+-------+------------------------------------------------------------------------------
 7
 8 79 |   3136 |        1 |     61 |  11780 |      0 |        0 | (416,79) |           2 |       2818 |     24 |        |       | \x50c30000433130313762666434363733393535666665653436343161643364343831623163
 9
10 80 |   3088 |        1 |     41 |  11783 |      0 |        0 | (416,80) |           2 |       2818 |     24 |        |       | \x7f9698001b72796c616e64206772616365
11

References

Database Physical Storage - https://www.postgresql.org/docs/9.5/storage.html
PageInspect - https://www.postgresql.org/docs/12/pageinspect.html
Row storage in PostgreSQL, heap file page layout - https://www.youtube.com/watch?v=L-dw1yRFYVg
Database Storage I (CMU Databases Systems / Fall 2019) - https://www.youtube.com/watch?v=1D81vXw2T_w

Ketan Singh

Collection of musing and ramblings

How Postgres Stores Rows

How rows are organized in files?

Peeking into pages

References