Exam DP203 file formats Parquet

From MillerSql.com
Revision as of 19:42, 13 January 2025 by NeilM (talk | contribs)

The third file format is Parquet.

Parquet is a columnar data format.

It is also a binary format, so it cannot be read in a text editor such as Notepad++. It needs to be opened in an app like ParquetViewer.

It is a hybrid format: the data is split into row groups of (say) 1000 rows, and within each row group the values are stored column by column. For each column in a row group, the minimum and maximum values are also stored. Query engines can use these statistics to work out which row groups are in scope for a particular query predicate, and skip the rest without reading them.
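The row group idea can be illustrated with a small pure-Python sketch. It is not the real Parquet reader or file layout; the function names `build_row_groups` and `scan_equal`, the dictionary-based structure, and the group size of 4 are assumptions made up for this illustration. It only shows how per-column min/max values allow a row group to be skipped.

```python
def build_row_groups(table, group_size):
    """Split a table (dict of column name -> list of values) into row groups.

    Inside each group the data stays column by column, and the min/max
    of every column in that group is recorded alongside it.
    """
    row_count = len(next(iter(table.values())))
    groups = []
    for start in range(0, row_count, group_size):
        columns = {name: vals[start:start + group_size]
                   for name, vals in table.items()}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def scan_equal(groups, column, target):
    """Return rows where column == target, plus how many groups were read."""
    hits = []
    groups_read = 0
    for group in groups:
        low, high = group["stats"][column]
        if not (low <= target <= high):
            continue  # target cannot be in this group, so skip it entirely
        groups_read += 1
        for i, value in enumerate(group["columns"][column]):
            if value == target:
                hits.append({name: vals[i]
                             for name, vals in group["columns"].items()})
    return hits, groups_read


# Illustrative data: 10 rows split into groups of 4, 4 and 2 rows.
table = {"id": list(range(10)), "amount": [i * 10 for i in range(10)]}
groups = build_row_groups(table, group_size=4)
hits, groups_read = scan_equal(groups, "id", 5)
print(hits, groups_read)
```

Here the predicate `id = 5` only falls inside the min/max range of the second row group, so the other two groups are never scanned.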

Compression

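Dictionary encoding (each distinct value in a column is replaced by a small number that refers to an entry in a dictionary of values).

Run-length encoding (a repeated value is stored once, together with the number of consecutive times it occurs down the column).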
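The two encodings can be sketched in a few lines of plain Python. This is a simplified illustration only, not Parquet's actual on-disk encoding; the helper names `dictionary_encode`, `run_length_encode` and `run_length_decode` and the sample column are assumptions for the example.

```python
def dictionary_encode(values):
    """Replace each value by the index of its entry in a dictionary."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> position in the dictionary
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


def run_length_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    decoded = []
    for value, count in runs:
        decoded.extend([value] * count)
    return decoded


# Illustrative column with few distinct values and long runs.
column = ["UK", "UK", "UK", "FR", "FR", "UK"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)
print(dictionary)  # ['UK', 'FR']
print(codes)       # [0, 0, 0, 1, 1, 0]
print(runs)        # [(0, 3), (1, 2), (0, 1)]
```

Both techniques work best on columns with few distinct values or long runs of the same value, which is one reason storing data column by column helps compression.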

The file also stores its own schema: the column definitions with their data types.