Questions and Answers
Question LCl3DYzTKXokVOSBYiX1
Question
A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:
SELECT COUNT(*) FROM table
Which of the following describes how results are generated each time the dashboard is updated?
Choices
- A: The total count of rows is calculated by scanning all data files
- B: The total count of rows will be returned from cached results unless REFRESH is run
- C: The total count of records is calculated from the Delta transaction logs
- D: The total count of records is calculated from the parquet file metadata
- E: The total count of records is calculated from the Hive metastore
Answer: C | Answer_ET: C | Community answer: C (91%), other (9%)
Discussion
Comment 1076494 by aragorn_brego
- Upvotes: 6
Selected Answer: C Delta Lake maintains a transaction log that records details about every change made to a table. When you execute a count operation on a Delta table, Delta Lake can use the information in the transaction log to calculate the total number of records without having to scan all the data files. This is because the transaction log includes information about the number of records in each file, allowing for an efficient aggregation of these counts to get the total number of records in the table.
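To make the point above concrete, here is a minimal sketch (not the author's code) that reads the per-file statistics straight out of the _delta_log JSON commit files; the table path is hypothetical, and the naive sum is illustrative only, since a real query engine replays the full log and checkpoints rather than summing raw add actions.

```python
import json

# Hypothetical Delta table location; `spark` is the notebook-provided SparkSession.
log_glob = "dbfs:/tmp/delta/events/_delta_log/*.json"

# Each commit file in _delta_log is newline-delimited JSON of actions.
log_df = spark.read.json(log_glob)

# "add" actions carry a JSON-encoded "stats" string that includes numRecords.
adds = (log_df
        .where("add IS NOT NULL")
        .select("add.path", "add.stats")
        .collect())

# Naive aggregation for illustration only: it ignores "remove" actions, so it
# matches COUNT(*) only when no files have been removed or rewritten.
total = sum(json.loads(r["stats"])["numRecords"] for r in adds if r["stats"])
print(f"numRecords summed from the transaction log: {total}")
```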
Comment 1062381 by Syd
- Upvotes: 5
Comment 1328254 by AlejandroU
- Upvotes: 2
Selected Answer: D Answer D. Parquet Metadata Usage: Delta Lake does utilize Parquet file metadata for COUNT(*) operations. Parquet files store metadata, including row counts. Delta efficiently reads this metadata to get the total count without scanning the actual data within the files. This is a key optimization for performance.
Why not always scan: Scanning all data files for every COUNT(*) would be extremely inefficient, especially for large tables. This defeats the purpose of using a columnar storage format like Parquet and the optimizations built into Delta Lake and Spark.
The transaction log tracks changes to the table (adds, deletes, updates) but doesn’t store pre-computed row counts. It’s used for time travel, ACID properties, and other Delta features.
Comment 1326024 by Sriramiyer92
- Upvotes: 1
Selected Answer: C In the Delta transaction log, each add action carries per-file statistics, e.g. "stats": "{\"numRecords\": 3, \"minValues\": {\"x\": 1}, \"maxValues\": {\"x\": 3}, \"nullCount\": {\"x\": 0}}"; the numRecords values in the transaction log give you the count.
Comment 1236964 by Ati1362
- Upvotes: 2
Selected Answer: C Delta transaction log
Comment 1100010 by sodere
- Upvotes: 4
Selected Answer: C Transaction log provides statistics about the delta table.
Comment 1099533 by alexvno
- Upvotes: 3
Selected Answer: C C: the transaction log contains info about each file's row count.
Comment 1066041 by Dileepvikram
- Upvotes: 2
The answer is C
Comment 1063238 by PearApple
- Upvotes: 2
Selected Answer: C The answer should be C
Comment 1052890 by sturcu
- Upvotes: 3
Selected Answer: C The total row count will be calculated from the Delta logs.
Question Y1oYlKc8Xz4jNeFTpJl6
Question
A Delta Lake table was created with the below query:
//IMG//
Consider the following query:
DROP TABLE prod.sales_by_store
If this statement is executed by a workspace admin, which result will occur?
Choices
- A: Nothing will occur until a COMMIT command is executed.
- B: The table will be removed from the catalog but the data will remain in storage.
- C: The table will be removed from the catalog and the data will be deleted.
- D: An error will occur because Delta Lake prevents the deletion of production data.
- E: Data will be marked as deleted but still recoverable with Time Travel.
Answer: C | Answer_ET: C | Community answer: C (83%), other
Discussion
Comment 1167613 by hal2401me
- Upvotes: 10
Selected Answer: C According to the exam course the answer is C, since a managed table is dropped. However, as of Nov '23 UNDROP has been introduced, and I have tested it working with UC managed tables: https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-undrop-table.html That said, I don't see any official doc that ties UNDROP to 'time travel'. So be aware of the above; in the exam, read the question carefully in case it has been updated.
Comment 1558619 by kishanu
- Upvotes: 1
Selected Answer: C UNDROP can be used within a 7-day retention period.
Comment 1191364 by Er5
- Upvotes: 4
C is the only correct statement, though the table can be UNDROPped within 7 days: https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-syntax-ddl-undrop-table E is wrong: Time Travel can retrieve versioned records, but not dropped tables. https://www.databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html
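For context, a hedged sketch of the behaviour being discussed, assuming the managed table named in the question and a Unity Catalog metastore where UNDROP is available:

```python
# Dropping a managed table removes it from the catalog and deletes its data.
spark.sql("DROP TABLE prod.sales_by_store")

# On Unity Catalog, the drop can still be reversed with UNDROP within the
# 7-day retention window; note this is a separate feature from time travel.
spark.sql("UNDROP TABLE prod.sales_by_store")
```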
Comment 1160126 by Curious76
- Upvotes: 2
Selected Answer: E I think E is the better answer.
Comment 1107227 by Luv4data
- Upvotes: 1
E. Since the table is still recoverable from transaction logs.
Comment 1099536 by alexvno
- Upvotes: 2
Selected Answer: C C: the table was created with AS SELECT as a managed table, so dropping it will remove both the table and the data.
Comment 1076498 by aragorn_brego
- Upvotes: 3
Selected Answer: C In Delta Lake, when a DROP TABLE command is executed, it removes both the metadata entry for the table from the catalog and the data in storage associated with that table. Workspace administrators typically have the necessary permissions to drop tables, and unless there are additional protections or retention policies in place, the data is not recoverable through normal operations after the table is dropped.
Comment 1070856 by 60ties
- Upvotes: 3
I meant C is correct, not D
Comment 1070854 by 60ties
- Upvotes: 1
Selected Answer: D D is most correct
Comment 1066042 by Dileepvikram
- Upvotes: 1
Answer is C as it is a managed table
Comment 1058627 by lokvamsi
- Upvotes: 1
Selected Answer: C it is a managed table
Comment 1058623 by lokvamsi
- Upvotes: 1
Selected Answer: A It's A, as it is a managed table.
Comment 1052892 by sturcu
- Upvotes: 1
Selected Answer: C it is a managed table. So both table def and data will be deleted
Comment 1048746 by jyothsna12496
- Upvotes: 1
Selected Answer: C DROP will usually delete the table structure and data if it's managed, hence C.
Question ZEcbn80CEC5ZhO8pfEw7
Question
Two of the most common data locations on Databricks are the DBFS root storage and external object storage mounted with dbutils.fs.mount().
Which of the following statements is correct?
Choices
- A: DBFS is a file system protocol that allows users to interact with files stored in object storage using syntax and guarantees similar to Unix file systems.
- B: By default, both the DBFS root and mounted data sources are only accessible to workspace administrators.
- C: The DBFS root is the most secure location to store data, because mounted storage volumes must have full public read and write permissions.
- D: Neither the DBFS root nor mounted storage can be accessed when using %sh in a Databricks notebook.
- E: The DBFS root stores files in ephemeral block volumes attached to the driver, while mounted directories will always persist saved data to external storage between sessions.
Answer: A | Answer_ET: A | Community answer: A (100%)
Discussion
Comment 1160128 by Curious76
- Upvotes: 4
Selected Answer: A A is correct. As for E, that statement is partially incorrect. The DBFS root does use ephemeral storage, but not block volumes. Data saved there is lost when the cluster terminates unless explicitly persisted elsewhere. Mounted storage, however, can persist data between sessions depending on the underlying storage service and configuration.
Comment 1100016 by sodere
- Upvotes: 2
Selected Answer: A DBFS is a layer on top of cloud storage providers.
Comment 1076503 by aragorn_brego
- Upvotes: 3
Selected Answer: A Databricks File System (DBFS) is a layer over a cloud object storage (like AWS S3, Azure Blob Storage, or GCP Cloud Storage) that allows users to interact with data as if they were using a traditional file system. It provides familiar file system semantics and is designed to be consistent with POSIX-like file system behavior, which includes commands and actions similar to those used in Unix and Linux file systems.
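To make the "Unix-like layer over object storage" point concrete, here is a small illustrative sketch; the paths and the commented-out mount are hypothetical, and mount credentials depend on the cloud provider.

```python
# List the DBFS root with familiar file-system semantics.
display(dbutils.fs.ls("dbfs:/"))

# Mounting external object storage (configuration/credentials omitted):
# dbutils.fs.mount(source="s3a://my-bucket", mount_point="/mnt/my-bucket")

# Reads and writes use Unix-style paths, but the bytes land in object storage.
dbutils.fs.put("dbfs:/tmp/example.txt", "hello from DBFS", True)
print(dbutils.fs.head("dbfs:/tmp/example.txt"))
```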
Comment 1066043 by Dileepvikram
- Upvotes: 1
Answer is A
Comment 1052933 by sturcu
- Upvotes: 2
Selected Answer: A It is not E. The only one that would be plausible is A.
Question BUMr1XndsVXERaQfY65e
Question
The following code has been migrated to a Databricks notebook from a legacy workload:
//IMG//
The code executes successfully and provides the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.
Which statement is a possible explanation for this behavior?
Choices
- A: %sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.
- B: Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.
- C: %sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.
- D: Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.
- E: %sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.
Answer: E | Answer_ET: E | Community answer: E (100%)
Discussion
Comment 1076504 by aragorn_brego
- Upvotes: 9
Selected Answer: E When using %sh in a Databricks notebook, the commands are executed in a shell environment on the driver node. This means that only the resources of the driver node are used, and the execution does not leverage the distributed computing capabilities of the worker nodes in the Spark cluster. This can result in slower performance, especially for data-intensive tasks, compared to an approach that distributes the workload across all nodes in the cluster using Spark.
Comment 1269668 by robodog
- Upvotes: 1
Selected Answer: E Option E correct
Comment 1220305 by Freyr
- Upvotes: 2
Selected Answer: E Option E: Correct. The %sh magic command in Databricks runs shell commands on the driver node only. This means the operations within %sh do not leverage the distributed nature of the Databricks cluster. Consequently, the Git clone, Python script execution, and file move operations are all performed on a single node (the driver), which explains why it takes a long time to process and move 1 GB of data. This approach does not utilize the parallel processing capabilities of the worker nodes or the optimization features of Databricks Spark.
Option C: Incorrect. %sh does not inherently distribute any operations, but the issue here is broader than just file moving operations. Using %fs for file operations is a best practice, but it does not resolve the inefficiency of running all commands on the driver node.
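As a rough sketch of the distinction (the notebook in the question is only shown as an image, so the paths and file format below are hypothetical): the %sh cell does all of its cloning, script execution, and file moving on the driver, whereas a Spark-native load step distributes the read and write across the workers.

```python
# Driver-only (what the %sh cell effectively does): git clone, python run.py,
# and mv all execute on the driver's local disk and CPU.

# Cluster-wide alternative for the load step: let Spark read and write the data.
df = (spark.read
      .format("json")                    # assumed format of the extracted files
      .load("dbfs:/tmp/raw_extract/"))   # hypothetical output of the extract step

(df.write
   .format("delta")
   .mode("overwrite")
   .save("dbfs:/mnt/lake/target/"))      # distributed write to the target location
```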
Comment 1066046 by Dileepvikram
- Upvotes: 2
E is the answer, as the command is run on the driver node and the other nodes in the cluster are not used.
Comment 1057408 by sturcu
- Upvotes: 3
Selected Answer: E %sh run Bash commands on the driver node of the cluster. https://www.databricks.com/blog/2020/08/31/introducing-the-databricks-web-terminal.html
Comment 1053003 by sturcu
- Upvotes: 1
You can use mv with %sh, but the syntax is not correct; it is missing the destination operand.
Question NgcoLknyqUSwcSBBOJKU
Question
The data science team has requested assistance in accelerating queries on free form text from user reviews. The data is currently stored in Parquet with the below schema:
item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING
The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.
A junior data engineer suggests that converting this data to Delta Lake will improve query performance.
Which response to the junior data engineer's suggestion is correct?
Choices
- A: Delta Lake statistics are not optimized for free text fields with high cardinality.
- B: Text data cannot be stored with Delta Lake.
- C: ZORDER ON review will need to be run to see performance gains.
- D: The Delta log creates a term matrix for free text fields to support selective filtering.
- E: Delta Lake statistics are only collected on the first 4 columns in a table.
Answer: A | Answer_ET: A | Community answer: A (100%)
Discussion
Comment 1076505 by aragorn_brego
- Upvotes: 7
Selected Answer: A Delta Lake uses statistics and data skipping to improve query performance, but these optimizations are most effective for columns with low to medium cardinality (i.e., columns with a limited set of distinct values). Free-form text fields like the review column typically have high cardinality, meaning each value in the column (each review text) is unique or nearly unique. Consequently, statistics on such columns do not significantly improve the performance of queries searching for specific keywords within the text.
Comment 1538254 by bp_a_user
- Upvotes: 1
Selected Answer: A Collecting statistics on a column containing long values such as string or binary is an expensive operation https://docs.delta.io/latest/optimizations-oss.html
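Building on the linked doc, a hedged sketch of how statistics collection interacts with a long free-text column; the table name is hypothetical and the property value is illustrative (Delta collects file-level stats on the first 32 columns by default, configurable via delta.dataSkippingNumIndexedCols):

```python
spark.sql("""
  CREATE TABLE IF NOT EXISTS reviews_delta (
    item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING
  )
  USING DELTA
  TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '4')  -- skip stats on `review`
""")
```

Even when stats are collected, file-level min/max values on a near-unique text column rarely allow files to be skipped for a keyword search, which is the gist of answer A.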
Comment 1066047 by Dileepvikram
- Upvotes: 2
answer is A
Comment 1054631 by mouad_attaqi
- Upvotes: 2
Selected Answer: A A is correct
Comment 1053010 by sturcu
- Upvotes: 2
Selected Answer: A Collecting statistics on long strings is an expensive operation