Exam DP203 file formats Parquet

From MillerSql.com
NeilM (talk | contribs)

Revision as of 22:51, 1 December 2024

The third file format is Parquet.

Parquet is a columnar data format.

It is also a binary format, so it cannot be read in a text editor such as Notepad++. It needs to be opened in an app like ParquetViewer.

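It is a hybrid format: the data is split into row groups of (say) 1000 rows, and within each row group the values are stored column by column. Minimum and maximum values can also be stored for each column in a row group, which lets a reader skip row groups that cannot match a filter.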
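The row group idea can be sketched in plain Python. This is only an illustration of the layout, not the real Parquet on-disk format: the function names `to_row_groups` and `groups_matching`, the sample rows, and the group size are all made up for the example.

```python
from typing import Any


def to_row_groups(rows: list[dict[str, Any]], group_size: int = 1000) -> list[dict[str, Any]]:
    """Split rows into row groups and store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Pivot the chunk from a list of rows into one list per column.
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        # Keep (min, max) for every column in this row group.
        stats = {name: (min(values), max(values)) for name, values in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def groups_matching(groups: list[dict[str, Any]], column: str, value: Any) -> list[int]:
    """Return the indexes of row groups whose min/max range could contain value."""
    return [
        i for i, group in enumerate(groups)
        if group["stats"][column][0] <= value <= group["stats"][column][1]
    ]


# Hypothetical sample data: 2500 rows with two columns.
rows = [{"id": i, "city": "Leeds" if i % 2 else "York"} for i in range(2500)]
groups = to_row_groups(rows, group_size=1000)

print(len(groups))                          # 3
print(groups[0]["stats"]["id"])             # (0, 999)
print(groups_matching(groups, "id", 1500))  # [1]
```

A query for id 1500 only needs to read the second row group, because the min and max values rule out the other two.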

Compression

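Dictionary encoding (each distinct value in a column is replaced by a small number that points to an entry in a dictionary of values).

Run-length encoding (a repeated value is stored once, together with a count of how many times it repeats consecutively down the column).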
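Both encodings can be sketched in a few lines of Python. This is a simplified illustration, not how a Parquet writer packs the bytes: the function names and the sample column are made up for the example.

```python
def dictionary_encode(values: list[str]) -> tuple[list[str], list[int]]:
    """Replace each value with the index of its entry in a dictionary."""
    dictionary: list[str] = []
    index: dict[str, int] = {}
    codes: list[int] = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(values: list[int]) -> list[tuple[int, int]]:
    """Collapse consecutive repeats into (value, count) pairs."""
    runs: list[list[int]] = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


def decode(dictionary: list[str], runs: list[tuple[int, int]]) -> list[str]:
    """Expand the runs and look each code up in the dictionary."""
    return [dictionary[code] for code, count in runs for _ in range(count)]


# Hypothetical sample column.
column = ["York", "York", "York", "Leeds", "Leeds", "York"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)

print(dictionary)  # ['York', 'Leeds']
print(codes)       # [0, 0, 0, 1, 1, 0]
print(runs)        # [(0, 3), (1, 2), (0, 1)]
print(decode(dictionary, runs) == column)  # True
```

The two encodings work well together: dictionary encoding turns long strings into small numbers, and run-length encoding then shrinks the repeated numbers.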