Questions and Answers
Question jThfRKaCLSBlQbdYPGtA
Question
A Delta Lake table representing metadata about content posts from users has the following schema: user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE This table is partitioned by the date column. A query is run with the following filter: longitude < 20 & longitude > -20 Which statement describes how data will be filtered?
Choices
- A: Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.
- B: No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.
- C: The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria.
- D: Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.
- E: The Delta Engine will scan the parquet file footers to identify each row that meets the filter criteria.
answer?
Answer: D Answer_ET: D Community answer D (90%) 5% Discussion
Comment 1080129 by Enduresoul
- Upvotes: 12
Selected Answer: D D is correct. A partition can include multiple files, and the statistics are collected for each file.
Comment 1326395 by AlejandroU
- Upvotes: 1
Selected Answer: B Answer B. Single comparison filter (e.g., latitude > 66.3): file skipping is highly efficient because Delta can use min/max statistics to directly eliminate files that don't meet the condition. Range filters (e.g., longitude < 20 AND longitude > -20): file skipping is still possible but less efficient, because Delta has to evaluate whether any records in the file might meet the condition, even when the file's min and max values overlap with the filter range. In summary, file skipping works best with single comparisons like latitude > 66.3 and is less effective with range filters like longitude < 20 AND longitude > -20.
Comment 1325633 by Sriramiyer92
- Upvotes: 1
Selected Answer: D Do not get confused between option c and d. Given answer is correct.
Comment 1320092 by hebied
- Upvotes: 1
Selected Answer: D D is more suitable
Comment 1270078 by AndreFR
- Upvotes: 2
Selected Answer: D Min and max values of each Parquet file are stored in the Delta log. Delta data skipping automatically collects stats (min, max, etc.) for the first 32 columns of each underlying Parquet file when you write data into a Delta table. Databricks takes advantage of this information (minimum and maximum values) at query time to skip unnecessary files and speed up queries. https://www.databricks.com/discover/pages/optimize-data-workloads-guide#delta-data
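The file-skipping behavior described above can be sketched in plain Python. This is a simplified model, not the actual Delta engine; the file names and statistics below are hypothetical:

```python
# Simplified model of Delta data skipping: each data file carries
# per-column min/max statistics recorded in the transaction log.
# A file is read only when its stats show it MIGHT contain matches.

files = [
    {"path": "part-0001.parquet", "min_longitude": -75.0, "max_longitude": -30.0},
    {"path": "part-0002.parquet", "min_longitude": -10.0, "max_longitude": 15.0},
    {"path": "part-0003.parquet", "min_longitude": 25.0, "max_longitude": 80.0},
]

def might_match(stats, low, high):
    """True if the file's [min, max] range overlaps the open interval (low, high)."""
    return stats["max_longitude"] > low and stats["min_longitude"] < high

# Filter: longitude < 20 AND longitude > -20
candidates = [f["path"] for f in files if might_match(f, -20.0, 20.0)]
print(candidates)  # ['part-0002.parquet'] — the other two files are skipped
```

Note that this matches option D: statistics identify data files (not partitions, and not individual rows) that might include records in the filtered range.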
Comment 1134149 by AziLa
- Upvotes: 2
Correct Ans is D
Comment 1057435 by Quadronoid
- Upvotes: 1
Selected Answer: C I guess option C is right: since the transaction log contains min/max values for the first 32 columns, it can be used to filter files.
Comment 1044837 by sturcu
- Upvotes: 3
Selected Answer: D D is Correct
Question 3agNAMhRPP8lTPWrJ4YV
Question
A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence applications. All the company’s data is stored in regional cloud storage in the United States. The workspace administrator at the company is uncertain about where the Databricks workspace used by the contractors should be deployed. Assuming that all data governance considerations are accounted for, which statement accurately informs this decision?
Choices
- A: Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines must be deployed in the region where the data is stored.
- B: Databricks workspaces do not rely on any regional infrastructure; as such, the decision should be made based upon what is most convenient for the workspace administrator.
- C: Cross-region reads and writes can incur significant costs and latency; whenever possible, compute should be deployed in the same region the data is stored.
- D: Databricks leverages user workstations as the driver during interactive development; as such, users should always use a workspace deployed in a region they are physically near.
- E: Databricks notebooks send all executable code from the user’s browser to virtual machines over the open internet; whenever possible, choosing a workspace region near the end users is the most secure.
answer?
Answer: C Answer_ET: C Community answer C (86%) 14% Discussion
Comment 1339914 by RandomForest
- Upvotes: 1
Selected Answer: C C is the correct answer.
Comment 1223854 by imatheushenrique
- Upvotes: 2
(C) The decision is about where the Databricks workspace used by the contractors should be deployed. The contractors are based in India, while all the company's data is stored in regional cloud storage in the United States. When choosing a region for deploying a Databricks workspace, one of the important factors to consider is proximity to the data sources and sinks. Cross-region reads and writes can incur significant costs and latency due to network bandwidth and data transfer fees. Therefore, whenever possible, compute should be deployed in the same region the data is stored in, to optimize performance and reduce costs.
Comment 1131058 by spaceexplorer
- Upvotes: 3
Selected Answer: C C is the answer.
Comment 1114431 by RafaelCFC
- Upvotes: 2
Selected Answer: C An important part of data governance is usage cost, and, as a general data engineering practice, egress costs from moving data between regions are always an important consideration. Locating the workspace in a different region from the contractors causes them very little inconvenience, while saving significantly on these costs.
Comment 1108682 by Patito
- Upvotes: 1
Selected Answer: B Where the data engineering team develops pipelines is independent of where the data objects reside in cloud storage.
Comment 1050001 by chokthewa
- Upvotes: 2
C is correct.
Question hD3f5kmERd4wSPoJFg5r
Question
The downstream consumers of a Delta Lake table have been complaining about data quality issues impacting performance in their applications. Specifically, they have complained that invalid latitude and longitude values in the activity_details table have been breaking their ability to use other geolocation processes. A junior engineer has written the following code to add CHECK constraints to the Delta Lake table: //IMG//
A senior engineer has confirmed the above logic is correct and the valid ranges for latitude and longitude are provided, but the code fails when executed. Which statement explains the cause of this failure?
Choices
- A: Because another team uses this table to support a frequently running application, two-phase locking is preventing the operation from committing.
- B: The activity_details table already exists; CHECK constraints can only be added during initial table creation.
- C: The activity_details table already contains records that violate the constraints; all existing data must pass CHECK constraints in order to add them to an existing table.
- D: The activity_details table already contains records; CHECK constraints can only be added prior to inserting values into a table.
- E: The current table schema does not contain the field valid_coordinates; schema evolution will need to be enabled before altering the table to add a constraint.
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 973896 by 8605246
- Upvotes: 13
Incorrect: the correct option is C. When a constraint is added to an existing table, the existing data in the table must be consistent with the constraint; otherwise the operation fails. https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-alter-table.html#add-constraint
Comment 1270114 by AndreFR
- Upvotes: 1
Selected Answer: C

```sql
-- CREATE TABLE
CREATE TABLE test_constraint (t1 VARCHAR(2), n1 INT);

-- ADD VALUE
INSERT INTO test_constraint VALUES ('v3', 3);

-- ADD CONSTRAINT VIOLATED BY CURRENT DATA
-- should throw error: 1 row in spark_catalog.default.test_constraint
-- violates the new CHECK constraint (n1 < 3)
ALTER TABLE test_constraint ADD CONSTRAINT valid_n1 CHECK (n1 < 3);

-- ADD CONSTRAINT NOT VIOLATED BY CURRENT DATA (no error)
ALTER TABLE test_constraint ADD CONSTRAINT valid_n1 CHECK (n1 < 100);
```
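The semantics behind option C can be modeled in a few lines of Python. This is a toy sketch, not Delta Lake itself; the rows, predicate, and constraint name are illustrative:

```python
# Toy model of ALTER TABLE ... ADD CONSTRAINT ... CHECK semantics:
# every existing row must already satisfy the predicate, otherwise
# the operation fails and the constraint is not added.

rows = [
    {"latitude": 45.2, "longitude": -71.1},
    {"latitude": 91.5, "longitude": 10.0},   # invalid latitude (> 90)
]

def add_check_constraint(table, predicate, name):
    violations = [r for r in table if not predicate(r)]
    if violations:
        raise ValueError(
            f"{len(violations)} row(s) violate the new CHECK constraint {name}"
        )
    return name  # constraint accepted

try:
    add_check_constraint(rows, lambda r: -90 <= r["latitude"] <= 90, "valid_latitude")
except ValueError as e:
    print(e)  # 1 row(s) violate the new CHECK constraint valid_latitude
```

The fix in practice is to correct or delete the violating records first, then re-run the ALTER TABLE statement.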
Comment 1260290 by faraaz132
- Upvotes: 1
Selected Answer: C C is correct.
Comment 1145313 by PrashantTiwari
- Upvotes: 1
C is correct
Comment 1138097 by DAN_H
- Upvotes: 1
correct ans is C
Comment 1134174 by AziLa
- Upvotes: 1
correct ans is C
Comment 1121973 by Jay_98_11
- Upvotes: 1
Selected Answer: C correct
Comment 1118784 by kz_data
- Upvotes: 1
Selected Answer: C C is the correct answer
Comment 1108685 by Patito
- Upvotes: 1
Selected Answer: C C is correct
Comment 1086758 by hamzaKhribi
- Upvotes: 1
Selected Answer: C C is correct
Comment 1080135 by Enduresoul
- Upvotes: 1
Selected Answer: C C is correct
Comment 1076384 by aragorn_brego
- Upvotes: 2
Selected Answer: C When adding a CHECK constraint to an existing table, the operation will fail if there are any rows in the table that do not meet the constraint. Before a CHECK constraint can be added, the data already in the table must be validated to ensure that it complies with the constraint conditions. If any existing records violate the new constraints, they must be corrected or removed before the ALTER TABLE command can be successfully executed.
Comment 1060831 by BIKRAM063
- Upvotes: 1
Selected Answer: C Correct option C : existing data violated check constraint condition
Comment 1057411 by Quadronoid
- Upvotes: 1
Selected Answer: C Right answer is C
Comment 1044838 by sturcu
- Upvotes: 1
Selected Answer: C C - table already has data
Comment 1015359 by MarceloManhaes
- Upvotes: 1
Yes the correct is option C
Question YISFJT5AhX4JVUoFdbJR
Question
Which of the following is true of Delta Lake and the Lakehouse?
Choices
- A: Because Parquet compresses data row by row, strings will only be compressed when a character is repeated multiple times.
- B: Delta Lake automatically collects statistics on the first 32 columns of each table which are leveraged in data skipping based on query filters.
- C: Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.
- D: Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.
- E: Z-order can only be applied to numeric values stored in Delta Lake tables.
answer?
Answer: B Answer_ET: B Community answer B (89%) 11% Discussion
Comment 1341787 by SRV_33
- Upvotes: 1
Selected Answer: B The complete statement is correct only in this option.
Comment 1145316 by PrashantTiwari
- Upvotes: 1
B is correct
Comment 1141084 by guillesd
- Upvotes: 2
Selected Answer: B B is correct
Comment 1131068 by spaceexplorer
- Upvotes: 1
Selected Answer: B B is correct
Comment 1113779 by Crocjun
- Upvotes: 1
Can anyone explain why D is not correct?
Comment 1108687 by Patito
- Upvotes: 3
Selected Answer: B B is correct since statistics are collected for the first 32 columns and stored in the transaction log.
Comment 1106522 by ervinshang
- Upvotes: 1
Selected Answer: B B is correct. C is wrong: views don't maintain a cache of their source tables.
Comment 1101333 by f728f7f
- Upvotes: 1
Selected Answer: C C is correct
Comment 1051446 by chokthewa
- Upvotes: 1
B is correct. https://docs.delta.io/2.0.0/table-properties.html
Question 2O2ggTfzmxaKnrTXl0Nm
Question
The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings. The below query is used to create the alert: //IMG//
The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean(temperature) > 120. Notifications are triggered to be sent at most every 1 minute. If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?
Choices
- A: The total average temperature across all sensors exceeded 120 on three consecutive executions of the query
- B: The recent_sensor_recordings table was unresponsive for three consecutive runs of the query
- C: The source query failed to update properly for three consecutive minutes and then restarted
- D: The maximum temperature recording for at least one sensor exceeded 120 on three consecutive executions of the query
- E: The average temperature recordings for at least one sensor exceeded 120 on three consecutive executions of the query
answer?
Answer: E Answer_ET: E Community answer E (100%) Discussion
Comment 1351579 by EelkeV
- Upvotes: 2
Selected Answer: E Because the mean is calculated for each sensor, and the alert is raised on that value. It happened three times; it is unknown for which sensor. It could be any of them.
Comment 1268444 by AndreFR
- Upvotes: 2
A is excluded because there is a GROUP BY clause. B and C are excluded because the table needs to be updated for the mean value to change. D is excluded because the alert is set on the average, not the max temperature. The correct answer is E, by elimination.
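The distinction between option A (overall average) and option E (per-sensor average) can be illustrated with a small Python sketch. The readings are hypothetical; the real alert would be driven by a Databricks SQL query with a GROUP BY on sensor_id:

```python
# Per-sensor mean vs. overall mean: the alert fires when ANY single
# sensor's average temperature exceeds the threshold, even if the
# overall average across all sensors stays below it.
from collections import defaultdict
from statistics import mean

readings = [
    ("sensor_a", 100), ("sensor_a", 105),
    ("sensor_b", 125), ("sensor_b", 130),   # hot sensor
]

by_sensor = defaultdict(list)
for sensor_id, temp in readings:
    by_sensor[sensor_id].append(temp)

per_sensor_mean = {s: mean(t) for s, t in by_sensor.items()}
alert = any(m > 120 for m in per_sensor_mean.values())
overall = mean(t for _, t in readings)

print(per_sensor_mean)  # {'sensor_a': 102.5, 'sensor_b': 127.5}
print(alert)            # True: sensor_b's mean exceeds 120
print(overall)          # 115.0: the overall mean alone would not trigger
```

Three consecutive notifications therefore only guarantee that at least one sensor's average exceeded 120 on three consecutive query executions.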
Comment 1236071 by panya
- Upvotes: 1
Correct
Comment 1224435 by imatheushenrique
- Upvotes: 1
E. The average temperature recordings for at least one sensor exceeded 120 on three consecutive executions of the query
Comment 1121589 by Jay_98_11
- Upvotes: 1
Selected Answer: E E is correct
Comment 1040233 by sturcu
- Upvotes: 2
Selected Answer: E correct
Comment 1021421 by saikot
- Upvotes: 1
The correct answer is E https://www.myexamcollection.com/databricks-certified-professional-data-engineer-databricks-certified-professional-data-engineer-exam-question-answers.htm