Exam DP-203 file formats: Parquet
The third file format is Parquet.
Parquet is a columnar data format.
It is also binary, so it cannot be read in a text editor such as Notepad++. It needs to be opened in an app like ParquetViewer.
It is a hybrid format: the data is split into row groups of (say) 1000 rows, and within each row group it is stored column by column. For each row group, the minimum and maximum value of each column is also stored. Query engines can use these values to work out which row groups are in scope for a particular query predicate, and skip the rest without reading them.
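To make the min/max idea concrete, here is a minimal sketch in plain Python. It is not how a real Parquet reader is implemented. The function names, the dictionary layout and the order_ids column are all made up for illustration, and the row group size of 1000 just follows the "(say) 1000 rows" above.

```python
# Minimal sketch of row-group pruning using min/max statistics.
# All names here are hypothetical; a real Parquet reader works on binary metadata.
ROW_GROUP_SIZE = 1000  # assumed size, for illustration only


def build_row_groups(values, size=ROW_GROUP_SIZE):
    """Split one column into row groups and record the min and max of each."""
    groups = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        groups.append({"start": start, "values": chunk,
                       "min": min(chunk), "max": max(chunk)})
    return groups


def groups_in_scope(groups, low, high):
    """Keep only row groups whose [min, max] range overlaps low <= x <= high."""
    return [g for g in groups if g["max"] >= low and g["min"] <= high]


# A sorted column of 5000 order IDs gives 5 row groups: 0-999, 1000-1999, ...
order_ids = list(range(5000))
row_groups = build_row_groups(order_ids)

# Predicate: order_id BETWEEN 2500 AND 2600.
in_scope = groups_in_scope(row_groups, 2500, 2600)
print(len(row_groups), [g["start"] for g in in_scope])  # 5 [2000]
```

Only one of the five row groups has a min/max range that overlaps the predicate, so the other four never need to be read. Note that this works best when the column is roughly sorted; if every row group contained both very low and very high values, none could be skipped.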
Compression
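Dictionary encoding (each distinct value is stored once, and the column holds a small number in place of each value).
Run-length encoding (a repeated value is stored once, together with the number of consecutive instances of it down the column).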
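The two encodings can be sketched in a few lines of plain Python. This is only an illustration of the ideas, assuming a made-up country column; the function names are hypothetical, and real Parquet files store the result in a more compact binary form.

```python
def dictionary_encode(column):
    """Replace each value with a small integer index into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> its position in the dictionary
    codes = []        # the column, rewritten as numbers
    for value in column:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return runs


country = ["UK", "UK", "UK", "France", "France", "UK"]
dictionary, codes = dictionary_encode(country)
runs = run_length_encode(codes)
print(dictionary)  # ['UK', 'France']
print(codes)       # [0, 0, 0, 1, 1, 0]
print(runs)        # [[0, 3], [1, 2], [0, 1]]
```

The six strings become a two-entry dictionary plus three short runs. Both encodings pay off most on columns with few distinct values and long stretches of repeats, which is common when data is stored column by column.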
The file also stores its own column definitions (the schema), with datatypes.