• Databricks Hive metastore vs Unity Catalog
  • Delta Lake and Iceberg table formats; inspection commands: DESCRIBE DETAIL, DESCRIBE HISTORY

Transaction logs

OPTIMIZE command, Z-ORDER indexing, VACUUM: default 7-day file retention

optimize file layout:

  • partitioning: splits data into one folder per partition value
  • Z-ORDER index: co-locates related records within files; incorporating new incremental records means re-running OPTIMIZE, which requires a full rewrite of the affected files
  • liquid clustering: re-optimization only touches new incremental data

Automatic liquid clustering predicts the clustering columns based on the table's historical query workload; requires predictive optimization enabled on Unity Catalog
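The layout options above can be sketched in Databricks SQL; table and column names here are illustrative, not from the source:

```sql
-- Z-ORDER: compact files and co-locate rows by the given columns
OPTIMIZE events ZORDER BY (event_date, user_id);

-- Liquid clustering: declared on the table instead of partitioning/Z-ORDER
CREATE TABLE events_lc (event_date DATE, user_id BIGINT, payload STRING)
CLUSTER BY (event_date);

-- Automatic liquid clustering: let Databricks choose the clustering columns
ALTER TABLE events_lc CLUSTER BY AUTO;

-- Remove files outside the retention window (default 7 days = 168 hours)
VACUUM events RETAIN 168 HOURS;
```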

Table

  • managed table: owns both the metadata and the actual data files; DROP TABLE deletes both
  • external table: a LOCATION path is specified at creation; the metastore owns only the metadata, so DROP TABLE does not delete the actual data at the location path
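For example (the storage path and table names are illustrative):

```sql
-- Managed: Databricks controls the storage location; DROP removes the data
CREATE TABLE sales_managed (id BIGINT, amount DOUBLE);

-- External: data lives at the given path and survives DROP TABLE
CREATE TABLE sales_external (id BIGINT, amount DOUBLE)
LOCATION 'abfss://container@account.dfs.core.windows.net/sales';
```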

  • deep clone: fully copies data + metadata; re-running the clone syncs changes from the source
  • shallow clone: copies only the Delta log (metadata); data files are still referenced from the source table
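In SQL (source and target names are illustrative):

```sql
CREATE OR REPLACE TABLE sales_backup DEEP CLONE sales;  -- copies data files
CREATE OR REPLACE TABLE sales_dev SHALLOW CLONE sales;  -- metadata only
```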

View

  • stored view: persisted object in the metastore
  • temp view: tied to the Spark session; lost when the session ends
  • global temp view: other users connected to the same Spark cluster can still access it (it lives in the global_temp database until the cluster restarts)
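The three flavors side by side (view and table names are illustrative):

```sql
CREATE VIEW v_sales AS SELECT * FROM sales WHERE amount > 0;  -- persisted
CREATE TEMP VIEW tv_sales AS SELECT * FROM sales;             -- session-scoped
CREATE GLOBAL TEMP VIEW gv_sales AS SELECT * FROM sales;      -- cluster-scoped
SELECT * FROM global_temp.gv_sales;                           -- note the schema
```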

Data files (Parquet) are immutable. An UPDATE or DELETE operation does not change an existing file. Instead, it adds new files and atomically marks the old files as no longer valid in the transaction log.

As a table undergoes thousands of transactions, reading all the individual JSON files to compute the current state would become slow. To solve this, Delta Lake periodically creates checkpoint files.

  • A checkpoint is a snapshot of the table’s state up to a specific commit, stored in an efficient Parquet format. By default, a checkpoint is created every 10 commits. When reading a table, Spark can jump to the latest checkpoint and then apply only the few JSON commits made since, dramatically improving read performance.
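After a handful of commits, the log directory looks roughly like this (version numbers illustrative):

```
my_table/_delta_log/00000000000000000000.json
my_table/_delta_log/00000000000000000001.json
...
my_table/_delta_log/00000000000000000010.checkpoint.parquet
my_table/_delta_log/00000000000000000011.json
my_table/_delta_log/_last_checkpoint    <- pointer to the latest checkpoint
```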

  • It is crucial to align your data retention policy with your time travel requirements. For long-term historical analysis (e.g., tracking changes over years), time travel is not the right tool due to storage costs. In those cases, you would use data modeling patterns like Slowly Changing Dimensions (SCDs).
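For shorter horizons, time travel itself looks like this (table name, version, and timestamp are illustrative):

```sql
SELECT * FROM sales VERSION AS OF 42;
SELECT * FROM sales TIMESTAMP AS OF '2024-01-01';
RESTORE TABLE sales TO VERSION AS OF 42;  -- roll the table back
```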

MERGE: The Ultimate DML Command

  • WHEN MATCHED: The source key exists in the target. You can then UPDATE or DELETE the target row.
  • WHEN NOT MATCHED: The source key does not exist in the target. You can then INSERT the new row.
  • WHEN NOT MATCHED BY SOURCE: A row exists in the target but not in the source. This is useful for tasks like soft-deleting records that have been removed from the source system.
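A sketch combining all three clauses, assuming hypothetical `target`/`updates` tables joined on an `id` key:

```sql
MERGE INTO target t
USING updates s
ON t.id = s.id
WHEN MATCHED AND s.op = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.value = s.value
WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)
WHEN NOT MATCHED BY SOURCE THEN UPDATE SET t.is_deleted = true;  -- soft delete
```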

  • Before VACUUM deletes anything, always perform a DRY RUN first: it lists the files that would be deleted without actually deleting them.
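For example (table name illustrative):

```sql
VACUUM sales RETAIN 168 HOURS DRY RUN;  -- lists candidate files, deletes nothing
```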