Ketan Singh

Collection of musing and ramblings

How Postgres Stores Rows

Posted at — Feb 7, 2022 · 8 min read

Out of curiosity, I was trying to understand how PostgreSQL stores the data onto the disk and there are a few interesting things that I have noticed that might be useful for application developers. In this post, I will try to go into the implementation level details and map out how PostgreSQL row storage really works.

Just to be clear, PostgreSQL stores a lot of files on disk such as transaction commit data, subtransaction status data, write ahead logs (WAL), etc. I will only be exploring heap files. Now what the heck is a heap file? A heap file is just a file of records. Note that the heap file has got nothing to do with heap memory. Although their use case is very similar, which is storing dynamic data.

How rows are organized in files?

PostgreSQL stores the actual data into segment files (more generally called heap files). Typically its fixed to 1GB size but you can configure that at compile time using --with-segsize. When a table or index exceeds 1 GB, it is divided into gigabyte-sized segments. This arrangement avoids problems on platforms that have file size limitations but 1GB is very conservative choice for any modern platform.

These segment files contain data in fixed size pages which is usually 8Kb, although a different page size can be selected when compiling the server with --with-blocksize option but this size usually falls in ideal size when considering performance and reliability tradeoffs. If the page size is too small, rows won’t fit inside the page and if it’s too large there is risk of write failure because hardware generally can only guarantee atomicity for a fixed size blocks which can vary disk to disk (usually ranges from 512 bytes to 4096 bytes).

Pages themselves are organized in a layout something like following:

https://i.imgur.com/RCnRkeV.png

lp(1..N) is line pointer array. It points to a logical offset within that page. Since these are arrays, elements are fixed sized but the number of elements can vary.

row(1..N) represents actual SQL rows. They are added in reverse order within a page. They are generally variable sized and in order to reach to a specific tuple we use line pointer. Since there can be variable number of rows inside a page, items are added backward while line pointer is added to front. 

special space is typically used when storing indexes in these page, usually sibling nodes in a B-Tree for example. For table data this is not used.

lower and upper pointer mark the locations that have already been used in a page. Thus, (upper - lower) determines if a page has already been used.

page header itself is not that interesting, it just contains book-keeping stuff related a page including lower and upper pointer. It’s structure looks like this:

field              size    description                                                              
pd_lsn              8 bytesLSN: next byte after last byte of WAL record for last change to this page
pd_checksum        2 bytesPage checksum                                                            
pd_flags            2 bytesFlag bits                                                                
pd_lower            2 bytesOffset to start of free space                                            
pd_upper            2 bytesOffset to end of free space                                              
pd_special          2 bytesOffset to start of special space                                          
pd_pagesize_version2 bytesPage size and layout version number information                          
pd_prune_xid        4 bytesOldest unpruned XMAX on page, or zero if none                            

Now, this method of storage has few implications. 

Peeking into pages

Let’s try to peek into pages and see what tools PostgreSQL provides to inspect pages. We will start by creating a database, a table inside it and populating the table with some dummy data.

 1
 2postgres=# \c hail_mary ;
 3
 4You are now connected to database "hail_mary" as user "postgres".
 5
 6hail_mary=# 
 7
 8hail_mary=# create table users(id int, name text);
 9
10CREATE TABLE
11
12hail_mary=# insert into users(id, name)
13
14select i,md5(i::text)::text from generate_series(1, 50000, 1) as i;
15
16INSERT 0 50000
17
18hail_mary=# select * from users limit 5;
19
20 id |               name               
21
22----+----------------------------------
23
24  1 | c4ca4238a0b923820dcc509a6f75849b
25
26  2 | c81e728d9d4c2f636f067f89cc14862c
27
28  3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
29
30  4 | a87ff679a2f3e71d9181a67b7542122c
31
32  5 | e4da3b7fbbce2345d7772b0674a318d5
33
34(5 rows)
35

Internally PostgreSQL maintains a unique row id for our data which is usually opaque to users. We can query it explicitly to see its value.

 1
 2hail_mary=# select ctid, * from users limit 5;
 3
 4 ctid  | id |               name               
 5
 6-------+----+----------------------------------
 7
 8 (0,1) |  1 | c4ca4238a0b923820dcc509a6f75849b
 9
10 (0,2) |  2 | c81e728d9d4c2f636f067f89cc14862c
11
12 (0,3) |  3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
13
14 (0,4) |  4 | a87ff679a2f3e71d9181a67b7542122c
15
16 (0,5) |  5 | e4da3b7fbbce2345d7772b0674a318d5
17
18(5 rows)
19
20
21
22

In ctid First digit stand for the page number and the second digit stands for the tuple number. PostgreSQL moves around these tuples when VACUUM is run to defragment the page. We can actually see it in action. 

 1
 2hail_mary=# select ctid, * from users OFFSET 49995 limit 5;
 3
 4   ctid   |  id   |               name               
 5
 6----------+-------+----------------------------------
 7
 8 (416,76) | 49996 | 2acfa04df8cc1e5b051866c32f9eb072
 9
10 (416,77) | 49997 | 87aa98d07ec242cc4d8f685f0299257b
11
12 (416,78) | 49998 | 12775d2a4498f0ec748a4beed90e5ad2
13
14 (416,79) | 49999 | c703af5c89b1d0bc2e99f540f553f182
15
16 (416,80) | 50000 | 1017bfd4673955ffee4641ad3d481b1c
17
18(5 rows)
19
20hail_mary=# 
21
22hail_mary=# delete from users where name = 'c703af5c89b1d0bc2e99f540f553f182';
23
24DELETE 1
25
26hail_mary=# insert into users VALUES(9999999, 'ryland grace');
27
28INSERT 0 1
29
30hail_mary=# select ctid, * from users OFFSET 49995 limit 5;
31
32   ctid   |   id    |               name               
33
34----------+---------+----------------------------------
35
36 (416,76) |   49996 | 2acfa04df8cc1e5b051866c32f9eb072
37
38 (416,77) |   49997 | 87aa98d07ec242cc4d8f685f0299257b
39
40 (416,78) |   49998 | 12775d2a4498f0ec748a4beed90e5ad2
41
42 (416,80) |   50000 | 1017bfd4673955ffee4641ad3d481b1c
43
44 (416,81) | 9999999 | ryland grace
45
46(5 rows)
47

What we can see here is that PostgreSQL left tuple with name c703af5c89b1d0bc2e99f540f553f182 (which is now deleted) as is and decided to add new data into a new physical tuple instead of reusing it. What if we run VACCUM ?

 1
 2hail_mary=# VACUUM FULL;
 3
 4hail_mary=# select ctid, * from users OFFSET 49995 limit 5;
 5
 6   ctid   |   id    |               name               
 7
 8----------+---------+----------------------------------
 9
10 (416,76) |   49996 | 2acfa04df8cc1e5b051866c32f9eb072
11
12 (416,77) |   49997 | 87aa98d07ec242cc4d8f685f0299257b
13
14 (416,78) |   49998 | 12775d2a4498f0ec748a4beed90e5ad2
15
16 (416,79) |   50000 | 1017bfd4673955ffee4641ad3d481b1c
17
18 (416,80) | 9999999 | ryland grace
19
20(5 rows)
21

VACCUM as expected defragmented page by moving the tuples around.

Visually it can be understood as following

before

Before deleting

after-delete

After deleting and inserting

after-vaccum

After Vacuum

We can also look inside actual pages with the help of an extention called pageinspect which provides X-ray for heap files

1
2hail_mary=# create extension pageinspect;
3
4CREATE EXTENSION
5

Let’s take a look at last page (416)

1
2hail_mary=# select * from page_header(get_raw_page('users', 416));
3
4    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
5
6-----------+----------+-------+-------+-------+---------+----------+---------+-----------
7
8 0/428E4F0 |        0 |     0 |   344 |  3088 |    8192 |     8192 |       4 |         0
9

lower is 344  and upper is 3088 which means free space in this page is 2744 bytes

special points to 8192 which means it points to end of page and is storing nothing.

We look at previous page, we expect that page not to contain enough space for our row data

 1
 2hail_mary=# select * from page_header(get_raw_page('users', 415));
 3
 4    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
 5
 6-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 7
 8 0/428CF58 |        0 |     0 |   504 |   512 |    8192 |     8192 |       4 |         0
 9
10(1 row)
11

As expected, 8 bytes are not enough to store id + name

We can also peek inside page items using the following query:

 1
 2hail_mary=# select * from heap_page_items(get_raw_page('users', 416)) offset 78 limit 2;
 3
 4 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 |  t_ctid  | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |                                    t_data                                    
 5
 6----+--------+----------+--------+--------+--------+----------+----------+-------------+------------+--------+--------+-------+------------------------------------------------------------------------------
 7
 8 79 |   3136 |        1 |     61 |  11780 |      0 |        0 | (416,79) |           2 |       2818 |     24 |        |       | \x50c30000433130313762666434363733393535666665653436343161643364343831623163
 9
10 80 |   3088 |        1 |     41 |  11783 |      0 |        0 | (416,80) |           2 |       2818 |     24 |        |       | \x7f9698001b72796c616e64206772616365
11

References