Questions and Answers
Question LCl3DYzTKXokVOSBYiX1
Question
A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:
SELECT COUNT(*) FROM table
Which of the following describes how results are generated each time the dashboard is updated?
Choices
- A: The total count of rows is calculated by scanning all data files
- B: The total count of rows will be returned from cached results unless REFRESH is run
- C: The total count of records is calculated from the Delta transaction logs
- D: The total count of records is calculated from the parquet file metadata
- E: The total count of records is calculated from the Hive metastore
Answer: C | Answer_ET: C | Community answer: C (91%), other (9%)
Discussion
Comment 1076494 by aragorn_brego
- Upvotes: 6
Selected Answer: C Delta Lake maintains a transaction log that records details about every change made to a table. When you execute a count operation on a Delta table, Delta Lake can use the information in the transaction log to calculate the total number of records without having to scan all the data files. This is because the transaction log includes information about the number of records in each file, allowing for an efficient aggregation of these counts to get the total number of records in the table.
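To make the point above concrete, here is a minimal sketch (not the author's code) that reads the per-file statistics straight out of the _delta_log JSON commit files; the table path is hypothetical, and the naive sum is illustrative only, since a real query engine replays the full log and checkpoints rather than summing raw add actions.

```python
import json

# Hypothetical Delta table location; `spark` is the notebook-provided SparkSession.
log_glob = "dbfs:/tmp/delta/events/_delta_log/*.json"

# Each commit file in _delta_log is newline-delimited JSON of actions.
log_df = spark.read.json(log_glob)

# "add" actions carry a JSON-encoded "stats" string that includes numRecords.
adds = (log_df
        .where("add IS NOT NULL")
        .select("add.path", "add.stats")
        .collect())

# Naive aggregation for illustration only: it ignores "remove" actions, so it
# matches COUNT(*) only when no files have been removed or rewritten.
total = sum(json.loads(r["stats"])["numRecords"] for r in adds if r["stats"])
print(f"numRecords summed from the transaction log: {total}")
```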
Comment 1062381 by Syd
- Upvotes: 5
Comment 1328254 by AlejandroU
- Upvotes: 2
Selected Answer: D Answer D. Parquet Metadata Usage: Delta Lake does utilize Parquet file metadata for COUNT(*) operations. Parquet files store metadata, including row counts. Delta efficiently reads this metadata to get the total count without scanning the actual data within the files. This is a key optimization for performance.
Why not always scan: Scanning all data files for every COUNT(*) would be extremely inefficient, especially for large tables. This defeats the purpose of using a columnar storage format like Parquet and the optimizations built into Delta Lake and Spark.
The transaction log tracks changes to the table (adds, deletes, updates) but doesn’t store pre-computed row counts. It’s used for time travel, ACID properties, and other Delta features.
Comment 1326024 by Sriramiyer92
- Upvotes: 1
Selected Answer: C In the Delta transaction log, each add action carries per-file statistics, e.g. "stats": "{\"numRecords\": 3, \"minValues\": {\"x\": 1}, \"maxValues\": {\"x\": 3}, \"nullCount\": {\"x\": 0}}"; the numRecords values in the transaction log give you the count.
Comment 1236964 by Ati1362
- Upvotes: 2
Selected Answer: C Delta transaction log
Comment 1100010 by sodere
- Upvotes: 4
Selected Answer: C Transaction log provides statistics about the delta table.
Comment 1099533 by alexvno
- Upvotes: 3
Selected Answer: C C: the transaction log contains info about each file's row count.
Comment 1066041 by Dileepvikram
- Upvotes: 2
The answer is C
Comment 1063238 by PearApple
- Upvotes: 2
Selected Answer: C The answer should be C
Comment 1052890 by sturcu
- Upvotes: 3
Selected Answer: C The total row count will be calculated from the Delta logs.
Question Y1oYlKc8Xz4jNeFTpJl6
Question
A Delta Lake table was created with the below query:
//IMG//
Consider the following query:
DROP TABLE prod.sales_by_store
If this statement is executed by a workspace admin, which result will occur?
Choices
- A: Nothing will occur until a COMMIT command is executed.
- B: The table will be removed from the catalog but the data will remain in storage.
- C: The table will be removed from the catalog and the data will be deleted.
- D: An error will occur because Delta Lake prevents the deletion of production data.
- E: Data will be marked as deleted but still recoverable with Time Travel.
Answer: C | Answer_ET: C | Community answer: C (83%), other
Discussion
Comment 1167613 by hal2401me
- Upvotes: 10
Selected Answer: C According to the exam course the answer is C, since a managed table is dropped. However, as of Nov '23 UNDROP has been introduced, and I have tested it working with UC managed tables: https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-undrop-table.html That said, I don't see any official doc that ties UNDROP to 'time travel'. So be aware of the above; in the exam, read the question carefully in case it has been updated.
Comment 1558619 by kishanu
- Upvotes: 1
Selected Answer: C UNDROP can be used within a 7-day retention period.
Comment 1191364 by Er5
- Upvotes: 4
C is the only correct statement, though the table can be UNDROPped within 7 days: https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-syntax-ddl-undrop-table E is wrong: Time Travel can retrieve versioned records, but not dropped tables. https://www.databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html
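For context, a hedged sketch of the behaviour being discussed, assuming the managed table named in the question and a Unity Catalog metastore where UNDROP is available:

```python
# Dropping a managed table removes it from the catalog and deletes its data.
spark.sql("DROP TABLE prod.sales_by_store")

# On Unity Catalog, the drop can still be reversed with UNDROP within the
# 7-day retention window; note this is a separate feature from time travel.
spark.sql("UNDROP TABLE prod.sales_by_store")
```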
Comment 1160126 by Curious76
- Upvotes: 2
Selected Answer: E I think E is the better answer.
Comment 1107227 by Luv4data
- Upvotes: 1
E. Since the table is still recoverable from transaction logs.
Comment 1099536 by alexvno
- Upvotes: 2
Selected Answer: C C: the table was created with AS SELECT as a managed table, so dropping it will remove both the table and the data.
Comment 1076498 by aragorn_brego
- Upvotes: 3
Selected Answer: C In Delta Lake, when a DROP TABLE command is executed, it removes both the metadata entry for the table from the catalog and the data in storage associated with that table. Workspace administrators typically have the necessary permissions to drop tables, and unless there are additional protections or retention policies in place, the data is not recoverable through normal operations after the table is dropped.
Comment 1070856 by 60ties
- Upvotes: 3
I meant C is correct, not D
Comment 1070854 by 60ties
- Upvotes: 1
Selected Answer: D D is most correct
Comment 1066042 by Dileepvikram
- Upvotes: 1
Answer is C as it is a managed table
Comment 1058627 by lokvamsi
- Upvotes: 1
Selected Answer: C it is a managed table
Comment 1058623 by lokvamsi
- Upvotes: 1
Selected Answer: A It's A, as it is a managed table.
Comment 1052892 by sturcu
- Upvotes: 1
Selected Answer: C it is a managed table. So both table def and data will be deleted
Comment 1048746 by jyothsna12496
- Upvotes: 1
Selected Answer: C DROP will usually delete the table structure and data if it's managed, hence C.
Question ZEcbn80CEC5ZhO8pfEw7
Question
Two of the most common data locations on Databricks are the DBFS root storage and external object storage mounted with dbutils.fs.mount().
Which of the following statements is correct?
Choices
- A: DBFS is a file system protocol that allows users to interact with files stored in object storage using syntax and guarantees similar to Unix file systems.
- B: By default, both the DBFS root and mounted data sources are only accessible to workspace administrators.
- C: The DBFS root is the most secure location to store data, because mounted storage volumes must have full public read and write permissions.
- D: Neither the DBFS root nor mounted storage can be accessed when using %sh in a Databricks notebook.
- E: The DBFS root stores files in ephemeral block volumes attached to the driver, while mounted directories will always persist saved data to external storage between sessions.
Answer: A | Answer_ET: A | Community answer: A (100%)
Discussion
Comment 1160128 by Curious76
- Upvotes: 4
Selected Answer: A A is correct. As for E, that statement is partially incorrect. The DBFS root does use ephemeral storage, but not block volumes. Data saved there is lost when the cluster terminates unless explicitly persisted elsewhere. Mounted storage, however, can persist data between sessions depending on the underlying storage service and configuration.
Comment 1100016 by sodere
- Upvotes: 2
Selected Answer: A DBFS is a layer on top of cloud storage providers.
Comment 1076503 by aragorn_brego
- Upvotes: 3
Selected Answer: A Databricks File System (DBFS) is a layer over a cloud object storage (like AWS S3, Azure Blob Storage, or GCP Cloud Storage) that allows users to interact with data as if they were using a traditional file system. It provides familiar file system semantics and is designed to be consistent with POSIX-like file system behavior, which includes commands and actions similar to those used in Unix and Linux file systems.
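To make the "Unix-like layer over object storage" point concrete, here is a small illustrative sketch; the paths and the commented-out mount are hypothetical, and mount credentials depend on the cloud provider.

```python
# List the DBFS root with familiar file-system semantics.
display(dbutils.fs.ls("dbfs:/"))

# Mounting external object storage (configuration/credentials omitted):
# dbutils.fs.mount(source="s3a://my-bucket", mount_point="/mnt/my-bucket")

# Reads and writes use Unix-style paths, but the bytes land in object storage.
dbutils.fs.put("dbfs:/tmp/example.txt", "hello from DBFS", True)
print(dbutils.fs.head("dbfs:/tmp/example.txt"))
```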
Comment 1066043 by Dileepvikram
- Upvotes: 1
Answer is A
Comment 1052933 by sturcu
- Upvotes: 2
Selected Answer: A It is not E. The only one that would be plausible is A.
Question BUMr1XndsVXERaQfY65e
Question
The following code has been migrated to a Databricks notebook from a legacy workload:
//IMG//
The code executes successfully and provides the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.
Which statement is a possible explanation for this behavior?
Choices
- A: %sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.
- B: Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.
- C: %sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.
- D: Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.
- E: %sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.
Answer: E | Answer_ET: E | Community answer: E (100%)
Discussion
Comment 1076504 by aragorn_brego
- Upvotes: 9
Selected Answer: E When using %sh in a Databricks notebook, the commands are executed in a shell environment on the driver node. This means that only the resources of the driver node are used, and the execution does not leverage the distributed computing capabilities of the worker nodes in the Spark cluster. This can result in slower performance, especially for data-intensive tasks, compared to an approach that distributes the workload across all nodes in the cluster using Spark.
Comment 1269668 by robodog
- Upvotes: 1
Selected Answer: E Option E correct
Comment 1220305 by Freyr
- Upvotes: 2
Selected Answer: E Option E: Correct. The %sh magic command in Databricks runs shell commands on the driver node only. This means the operations within %sh do not leverage the distributed nature of the Databricks cluster. Consequently, the Git clone, Python script execution, and file move operations are all performed on a single node (the driver), which explains why it takes a long time to process and move 1 GB of data. This approach does not utilize the parallel processing capabilities of the worker nodes or the optimization features of Databricks Spark.
Option C: Incorrect. %sh does not inherently distribute any operations, but the issue here is broader than just file moving operations. Using %fs for file operations is a best practice, but it does not resolve the inefficiency of running all commands on the driver node.
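As a rough sketch of the distinction (the notebook in the question is only shown as an image, so the paths and file format below are hypothetical): the %sh cell does all of its cloning, script execution, and file moving on the driver, whereas a Spark-native load step distributes the read and write across the workers.

```python
# Driver-only (what the %sh cell effectively does): git clone, python run.py,
# and mv all execute on the driver's local disk and CPU.

# Cluster-wide alternative for the load step: let Spark read and write the data.
df = (spark.read
      .format("json")                    # assumed format of the extracted files
      .load("dbfs:/tmp/raw_extract/"))   # hypothetical output of the extract step

(df.write
   .format("delta")
   .mode("overwrite")
   .save("dbfs:/mnt/lake/target/"))      # distributed write to the target location
```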
Comment 1066046 by Dileepvikram
- Upvotes: 2
E is the answer, as the command is run on the driver node and the other nodes in the cluster are not used.
Comment 1057408 by sturcu
- Upvotes: 3
Selected Answer: E %sh run Bash commands on the driver node of the cluster. https://www.databricks.com/blog/2020/08/31/introducing-the-databricks-web-terminal.html
Comment 1053003 by sturcu
- Upvotes: 1
You can use mv with %sh, but the syntax is not correct; it is missing the destination operand.
Question NgcoLknyqUSwcSBBOJKU
Question
The data science team has requested assistance in accelerating queries on free form text from user reviews. The data is currently stored in Parquet with the below schema:
item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING
The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.
A junior data engineer suggests that converting this data to Delta Lake will improve query performance.
Which response to the junior data engineer's suggestion is correct?
Choices
- A: Delta Lake statistics are not optimized for free text fields with high cardinality.
- B: Text data cannot be stored with Delta Lake.
- C: ZORDER ON review will need to be run to see performance gains.
- D: The Delta log creates a term matrix for free text fields to support selective filtering.
- E: Delta Lake statistics are only collected on the first 4 columns in a table.
Answer: A | Answer_ET: A | Community answer: A (100%)
Discussion
Comment 1076505 by aragorn_brego
- Upvotes: 7
Selected Answer: A Delta Lake uses statistics and data skipping to improve query performance, but these optimizations are most effective for columns with low to medium cardinality (i.e., columns with a limited set of distinct values). Free-form text fields like the review column typically have high cardinality, meaning each value in the column (each review text) is unique or nearly unique. Consequently, statistics on such columns do not significantly improve the performance of queries searching for specific keywords within the text.
Comment 1538254 by bp_a_user
- Upvotes: 1
Selected Answer: A Collecting statistics on a column containing long values such as string or binary is an expensive operation https://docs.delta.io/latest/optimizations-oss.html
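Building on the linked doc, a hedged sketch of how statistics collection interacts with a long free-text column; the table name is hypothetical and the property value is illustrative (Delta collects file-level stats on the first 32 columns by default, configurable via delta.dataSkippingNumIndexedCols):

```python
spark.sql("""
  CREATE TABLE IF NOT EXISTS reviews_delta (
    item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING
  )
  USING DELTA
  TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '4')  -- skip stats on `review`
""")
```

Even when stats are collected, file-level min/max values on a near-unique text column rarely allow files to be skipped for a keyword search, which is the gist of answer A.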
Comment 1066047 by Dileepvikram
- Upvotes: 2
answer is A
Comment 1054631 by mouad_attaqi
- Upvotes: 2
Selected Answer: A A is correct
Comment 1053010 by sturcu
- Upvotes: 2
Selected Answer: A Collecting statistics on long strings is an expensive operation