Exam DP203 file formats Parquet
Revision as of 22:51, 1 December 2024
The third file format is Parquet.
Parquet is a columnar data format.
It is also a binary format, so it cannot be read in a text editor such as Notepad++. It needs to be opened in an app like ParquetViewer.
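It is a hybrid format: the data is split into row groups of (say) 1000 rows, and within each row group the values are stored column by column. Minimum and maximum values are also typically stored for each column in each row group, which lets a reader skip row groups that cannot match a filter.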
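The hybrid layout can be sketched in plain Python. This is only an illustration of the idea, not how Parquet is implemented: the function names `to_row_groups` and `read_column_where`, the tiny group size of 4 and the sample data are all assumptions made up for the example.

```python
from typing import Any


def to_row_groups(rows: list[dict[str, Any]], group_size: int) -> list[dict[str, Any]]:
    """Split rows into row groups and store each group column by column.

    Each group also keeps (min, max) per column, loosely mimicking
    the statistics a Parquet writer can store per row group.
    """
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Pivot the chunk from row-oriented to column-oriented.
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def read_column_where(groups: list[dict[str, Any]], column: str, wanted_min: Any):
    """Return values of `column` that are >= wanted_min.

    Row groups whose stored max is below wanted_min are skipped
    without looking at their values. Also returns how many groups
    were actually scanned.
    """
    result = []
    scanned = 0
    for group in groups:
        _, highest = group["stats"][column]
        if highest < wanted_min:
            continue  # nothing in this row group can match
        scanned += 1
        result.extend(v for v in group["columns"][column] if v >= wanted_min)
    return result, scanned


# Hypothetical sample data: 10 rows, grouped 4 rows at a time.
rows = [{"id": i, "city": "Leeds" if i % 2 else "York"} for i in range(10)]
groups = to_row_groups(rows, group_size=4)
values, scanned = read_column_where(groups, "id", 8)
print(len(groups), values, scanned)
```

Only the `id` column of the last row group is read here, which is the benefit of combining row groups, columnar storage and min/max values.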
Compression
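Dictionary encoding (each distinct value in a column is replaced by a small number that points to an entry in a lookup table).
Run-length encoding (consecutive repeats of a value down the column are stored once, together with the number of repeats).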
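The two techniques can be illustrated with a small Python sketch. It is a simplified model for study purposes only: the function names and the sample column are assumptions, and the real on-disk layout in Parquet is more involved (for example, it mixes run-length encoding with bit-packing).

```python
def dictionary_encode(values: list[str]) -> tuple[list[str], list[int]]:
    """Replace each value by the index of its entry in a dictionary."""
    dictionary: list[str] = []
    index: dict[str, int] = {}
    codes: list[int] = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(codes: list[int]) -> list[tuple[int, int]]:
    """Collapse consecutive repeats into (value, count) pairs."""
    runs: list[tuple[int, int]] = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1] = (code, runs[-1][1] + 1)
        else:
            runs.append((code, 1))
    return runs


def decode(dictionary: list[str], runs: list[tuple[int, int]]) -> list[str]:
    """Expand the runs and look each code up in the dictionary."""
    return [dictionary[code] for code, count in runs for _ in range(count)]


# Hypothetical column with few distinct values and long repeats.
column = ["UK", "UK", "UK", "France", "France", "UK"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)
print(dictionary, codes, runs)
```

Both techniques work best on columns with few distinct values and long runs of repeats, which is common once the data is stored column by column.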