Exam DP203 file formats Parquet
Revision as of 22:54, 1 December 2024
The third file format is Parquet.
Parquet is a columnar data format.
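It is also a binary format, so it cannot be read in a text editor such as Notepad++. It needs to be opened in an app like ParquetViewer.
It is a hybrid format: the data is split into row groups of (say) 1000 rows, and within each row group the values are stored column by column. Minimum and maximum values can also be stored per column within each row group, which lets a reader skip row groups that cannot match a filter.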
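The hybrid layout can be pictured with a small pure-Python sketch. This is an illustration only, not Parquet's real on-disk structure: the function names `to_row_groups` and `groups_matching`, the sample data, and the group size of 2 are all made up for the example.

```python
# Illustrative sketch only: this is NOT how Parquet lays out bytes on disk.
def to_row_groups(rows, columns, group_size):
    """Split rows into row groups and store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Within a row group, the values are held per column, not per row.
        data = {name: [row[i] for row in chunk] for i, name in enumerate(columns)}
        # Min/max per column lets a reader rule out a whole group quickly.
        stats = {name: (min(vals), max(vals)) for name, vals in data.items()}
        groups.append({"num_rows": len(chunk), "columns": data, "stats": stats})
    return groups


def groups_matching(groups, column, value):
    """Return indexes of row groups whose min/max range could contain value."""
    return [
        i for i, g in enumerate(groups)
        if g["stats"][column][0] <= value <= g["stats"][column][1]
    ]


rows = [(1, "UK"), (2, "UK"), (3, "FR"), (4, "FR"), (5, "DE"), (6, "DE")]
groups = to_row_groups(rows, ["id", "country"], group_size=2)

print(groups[0]["columns"])              # {'id': [1, 2], 'country': ['UK', 'UK']}
print(groups_matching(groups, "id", 5))  # [2]
```

Only the third row group needs to be read to find `id` 5; the other two are skipped using their min/max values alone.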
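Compression
Dictionary encoding (each distinct value is replaced by a small number that points into a dictionary of the values).
Run-length encoding (a repeated value is stored once, with the number of consecutive times it occurs down the column).
The file also stores the column definitions with their datatypes (the schema).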
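The two encodings listed above can be sketched in a few lines of plain Python. This is a simplified illustration of the ideas, not Parquet's actual encoder (which packs the results into bytes); the helper names `dictionary_encode`, `run_length_encode` and `run_length_decode` are hypothetical.

```python
def dictionary_encode(values):
    """Replace each value with its position in a dictionary of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    index = {}        # value -> position in the dictionary
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]


def run_length_decode(runs):
    """Expand (value, count) pairs back into the full list."""
    return [v for v, count in runs for _ in range(count)]


column = ["UK", "UK", "UK", "FR", "FR", "UK"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)

print(dictionary)  # ['UK', 'FR']
print(codes)       # [0, 0, 0, 1, 1, 0]
print(runs)        # [(0, 3), (1, 2), (0, 1)]

# Decoding reverses both steps and gives back the original column.
restored = [dictionary[c] for c in run_length_decode(runs)]
```

The two steps work well together: dictionary encoding turns long strings into small numbers, and run-length encoding then shrinks the repeated numbers, which is most effective when equal values sit next to each other in the column.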