Exam DP203 file formats Parquet

From MillerSql.com
Revision as of 22:54, 1 December 2024 by NeilM

The third file format is Parquet.

Parquet is a columnar data format.

It is also a binary format, so it cannot be read as text in an editor such as Notepad++. It needs to be opened in an app like ParquetViewer.
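One visible sign of the binary layout is that a Parquet file starts and ends with the 4-byte marker PAR1. The sketch below checks for that marker. The function name looks_like_parquet is made up for this illustration, and the check only looks at the markers, so it does not prove the file is valid Parquet.

```python
import os

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at the start and end of a Parquet file


def looks_like_parquet(path):
    """Return True if the file starts and ends with the PAR1 marker.

    Hypothetical helper for illustration: it checks the markers only,
    and does not verify that the rest of the file is valid Parquet.
    """
    # Smallest candidate: start marker + 4-byte footer length + end marker
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)  # jump to the last 4 bytes of the file
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A CSV or JSON file would fail this check, because it is plain text from the first byte.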

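It is a hybrid format: the data is split into row groups of (say) 1000 rows, and within each row group it is stored column by column. Minimum and maximum values are typically also stored for each column within each row group, which lets a reader skip row groups that cannot contain the values a query is filtering on.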
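The row group idea can be sketched in plain Python. This is a toy model of the layout only, not the real Parquet binary format. The function names to_row_groups and groups_matching, and the sample data, are made up for the illustration.

```python
def to_row_groups(rows, column_names, row_group_size):
    """Split rows into row groups and store each group column by column."""
    groups = []
    for start in range(0, len(rows), row_group_size):
        chunk = rows[start:start + row_group_size]
        columns = {}
        stats = {}
        for i, name in enumerate(column_names):
            values = [row[i] for row in chunk]
            columns[name] = values                   # columnar within the group
            stats[name] = (min(values), max(values))  # min/max per column
        groups.append({"num_rows": len(chunk), "columns": columns, "stats": stats})
    return groups


def groups_matching(groups, column, value):
    """Return indexes of row groups whose min/max range could contain value."""
    return [i for i, g in enumerate(groups)
            if g["stats"][column][0] <= value <= g["stats"][column][1]]


rows = [(1, "UK"), (2, "UK"), (3, "FR"), (4, "FR"), (5, "DE")]
groups = to_row_groups(rows, ["id", "country"], row_group_size=2)
print(groups[0]["columns"])              # {'id': [1, 2], 'country': ['UK', 'UK']}
print(groups_matching(groups, "id", 4))  # [1]
```

In the last line only the second row group needs to be read to find id 4. The other two are ruled out by their min/max values alone.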

Compression

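Dictionary encoding (each distinct value in a column is stored once in a dictionary, and the column data holds a small number pointing to it)

Run-length encoding (a value that repeats consecutively down the column is stored once, together with the number of times it repeats).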
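Both encodings can be sketched in a few lines of Python. This is a simplified illustration of the two ideas, not the byte-level encoding Parquet actually writes. The function names and the sample column are made up for the example.

```python
def dictionary_encode(values):
    """Replace each value with an integer index into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> its position in the dictionary
    indexes = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indexes.append(index_of[v])
    return dictionary, indexes


def run_length_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1      # same value as the previous one: extend the run
        else:
            runs.append([v, 1])   # new value: start a new run
    return runs


column = ["UK", "UK", "UK", "FR", "FR", "UK"]
dictionary, indexes = dictionary_encode(column)
print(dictionary)                  # ['UK', 'FR']
print(indexes)                     # [0, 0, 0, 1, 1, 0]
print(run_length_encode(indexes))  # [[0, 3], [1, 2], [0, 1]]
```

The two work well together: dictionary encoding turns long strings into small numbers, and run-length encoding then shrinks the repeated numbers. This is why sorted or low-cardinality columns compress so well.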

The file also stores its column definitions, with their data types, so the schema travels with the data.