Questions and Answers
Question jThfRKaCLSBlQbdYPGtA
Question
A Delta Lake table representing metadata about content posts from users has the following schema: user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE This table is partitioned by the date column. A query is run with the following filter: longitude < 20 & longitude > -20 Which statement describes how data will be filtered?
Choices
- A: Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.
- B: No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.
- C: The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria.
- D: Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.
- E: The Delta Engine will scan the parquet file footers to identify each row that meets the filter criteria.
answer?
Answer: D Answer_ET: D Community answer D (90%) 5% Discussion
Comment 1080129 by Enduresoul
- Upvotes: 12
Selected Answer: D D is correct. A partition can include multiple files, and the statistics are collected for each file.
Comment 1326395 by AlejandroU
- Upvotes: 1
Selected Answer: B Answer B. Single comparison filter (e.g., latitude > 66.3): file skipping is highly efficient because Delta can use min/max statistics to directly eliminate files that don't meet the condition. Range filters (e.g., longitude < 20 AND longitude > -20): file skipping is still possible but less efficient, because Delta has to evaluate whether any records in the file might meet the condition, even when the file's min and max values overlap with the filter range. In summary, file skipping works best with single comparisons like latitude > 66.3 and is less effective with range filters like longitude < 20 AND longitude > -20.
Comment 1325633 by Sriramiyer92
- Upvotes: 1
Selected Answer: D Do not get confused between option c and d. Given answer is correct.
Comment 1320092 by hebied
- Upvotes: 1
Selected Answer: D D is more suitable
Comment 1270078 by AndreFR
- Upvotes: 2
Selected Answer: D Min and max values of each Parquet file are stored in the Delta log. Delta data skipping automatically collects stats (min, max, etc.) for the first 32 columns of each underlying Parquet file when you write data into a Delta table. Databricks takes advantage of this information (minimum and maximum values) at query time to skip unnecessary files and speed up queries. https://www.databricks.com/discover/pages/optimize-data-workloads-guide#delta-data
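The file-skipping behavior described above can be sketched in plain Python. This is a simplified model, not the actual Delta engine; the file names and statistics below are hypothetical:

```python
# Simplified model of Delta data skipping: each data file carries
# per-column min/max statistics recorded in the transaction log.
# A file is read only when its stats show it MIGHT contain matches.

files = [
    {"path": "part-0001.parquet", "min_longitude": -75.0, "max_longitude": -30.0},
    {"path": "part-0002.parquet", "min_longitude": -10.0, "max_longitude": 15.0},
    {"path": "part-0003.parquet", "min_longitude": 25.0, "max_longitude": 80.0},
]

def might_match(stats, low, high):
    """True if the file's [min, max] range overlaps the open interval (low, high)."""
    return stats["max_longitude"] > low and stats["min_longitude"] < high

# Filter: longitude < 20 AND longitude > -20
candidates = [f["path"] for f in files if might_match(f, -20.0, 20.0)]
print(candidates)  # ['part-0002.parquet'] — the other two files are skipped
```

Note that this matches option D: statistics identify data files (not partitions, and not individual rows) that might include records in the filtered range.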
Comment 1134149 by AziLa
- Upvotes: 2
Correct Ans is D
Comment 1057435 by Quadronoid
- Upvotes: 1
Selected Answer: C I guess option C is right: since the transaction log contains min/max values for the first 32 columns, it can be used to filter files.
Comment 1044837 by sturcu
- Upvotes: 3
Selected Answer: D D is Correct
Question 3agNAMhRPP8lTPWrJ4YV
Question
A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence applications. All the company’s data is stored in regional cloud storage in the United States. The workspace administrator at the company is uncertain about where the Databricks workspace used by the contractors should be deployed. Assuming that all data governance considerations are accounted for, which statement accurately informs this decision?
Choices
- A: Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines must be deployed in the region where the data is stored.
- B: Databricks workspaces do not rely on any regional infrastructure; as such, the decision should be made based upon what is most convenient for the workspace administrator.
- C: Cross-region reads and writes can incur significant costs and latency; whenever possible, compute should be deployed in the same region the data is stored.
- D: Databricks leverages user workstations as the driver during interactive development; as such, users should always use a workspace deployed in a region they are physically near.
- E: Databricks notebooks send all executable code from the user’s browser to virtual machines over the open internet; whenever possible, choosing a workspace region near the end users is the most secure.
answer?
Answer: C Answer_ET: C Community answer C (86%) 14% Discussion
Comment 1339914 by RandomForest
- Upvotes: 1
Selected Answer: C C is the correct answer.
Comment 1223854 by imatheushenrique
- Upvotes: 2
(C) The decision is about where the Databricks workspace used by the contractors should be deployed. The contractors are based in India, while all the company's data is stored in regional cloud storage in the United States. When choosing a region for deploying a Databricks workspace, one of the important factors to consider is proximity to the data sources and sinks. Cross-region reads and writes can incur significant costs and latency due to network bandwidth and data transfer fees. Therefore, whenever possible, compute should be deployed in the same region the data is stored in, to optimize performance and reduce costs.
Comment 1131058 by spaceexplorer
- Upvotes: 3
Selected Answer: C C is the answer.
Comment 1114431 by RafaelCFC
- Upvotes: 2
Selected Answer: C An important part of data governance is usage cost, and, as a general data engineering practice, egress costs from moving data between regions are always an important consideration. Locating the workspace in a different region from the contractors causes them very little inconvenience, while saving significantly on these costs.
Comment 1108682 by Patito
- Upvotes: 1
Selected Answer: B Where the data engineering team develops pipelines is independent of where the data objects reside in cloud storage.
Comment 1050001 by chokthewa
- Upvotes: 2
C is correct.
Question hD3f5kmERd4wSPoJFg5r
Question
The downstream consumers of a Delta Lake table have been complaining about data quality issues impacting performance in their applications. Specifically, they have complained that invalid latitude and longitude values in the activity_details table have been breaking their ability to use other geolocation processes. A junior engineer has written the following code to add CHECK constraints to the Delta Lake table: //IMG//
A senior engineer has confirmed the above logic is correct and the valid ranges for latitude and longitude are provided, but the code fails when executed. Which statement explains the cause of this failure?
Choices
- A: Because another team uses this table to support a frequently running application, two-phase locking is preventing the operation from committing.
- B: The activity_details table already exists; CHECK constraints can only be added during initial table creation.
- C: The activity_details table already contains records that violate the constraints; all existing data must pass CHECK constraints in order to add them to an existing table.
- D: The activity_details table already contains records; CHECK constraints can only be added prior to inserting values into a table.
- E: The current table schema does not contain the field valid_coordinates; schema evolution will need to be enabled before altering the table to add a constraint.
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 973896 by 8605246
- Upvotes: 13
Incorrect: the correct option is C. When a constraint is added to an existing table, the existing data in the table must be consistent with the constraint; otherwise the operation fails. https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-alter-table.html#add-constraint
Comment 1270114 by AndreFR
- Upvotes: 1
Selected Answer: C

```sql
-- CREATE TABLE
CREATE TABLE test_constraint (t1 VARCHAR(2), n1 INT);

-- ADD VALUE
INSERT INTO test_constraint VALUES ('v3', 3);

-- ADD CONSTRAINT VIOLATED BY CURRENT DATA
-- should throw error: 1 row in spark_catalog.default.test_constraint
-- violates the new CHECK constraint (n1 < 3)
ALTER TABLE test_constraint ADD CONSTRAINT valid_n1 CHECK (n1 < 3);

-- ADD CONSTRAINT NOT VIOLATED BY CURRENT DATA (no error)
ALTER TABLE test_constraint ADD CONSTRAINT valid_n1 CHECK (n1 < 100);
```
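The semantics behind option C can be modeled in a few lines of Python. This is a toy sketch, not Delta Lake itself; the rows, predicate, and constraint name are illustrative:

```python
# Toy model of ALTER TABLE ... ADD CONSTRAINT ... CHECK semantics:
# every existing row must already satisfy the predicate, otherwise
# the operation fails and the constraint is not added.

rows = [
    {"latitude": 45.2, "longitude": -71.1},
    {"latitude": 91.5, "longitude": 10.0},   # invalid latitude (> 90)
]

def add_check_constraint(table, predicate, name):
    violations = [r for r in table if not predicate(r)]
    if violations:
        raise ValueError(
            f"{len(violations)} row(s) violate the new CHECK constraint {name}"
        )
    return name  # constraint accepted

try:
    add_check_constraint(rows, lambda r: -90 <= r["latitude"] <= 90, "valid_latitude")
except ValueError as e:
    print(e)  # 1 row(s) violate the new CHECK constraint valid_latitude
```

The fix in practice is to correct or delete the violating records first, then re-run the ALTER TABLE statement.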
Comment 1260290 by faraaz132
- Upvotes: 1
Selected Answer: C C is correct.
Comment 1145313 by PrashantTiwari
- Upvotes: 1
C is correct
Comment 1138097 by DAN_H
- Upvotes: 1
correct ans is C
Comment 1134174 by AziLa
- Upvotes: 1
correct ans is C
Comment 1121973 by Jay_98_11
- Upvotes: 1
Selected Answer: C correct
Comment 1118784 by kz_data
- Upvotes: 1
Selected Answer: C C is the correct answer
Comment 1108685 by Patito
- Upvotes: 1
Selected Answer: C C is correct
Comment 1086758 by hamzaKhribi
- Upvotes: 1
Selected Answer: C C is correct
Comment 1080135 by Enduresoul
- Upvotes: 1
Selected Answer: C C is correct
Comment 1076384 by aragorn_brego
- Upvotes: 2
Selected Answer: C When adding a CHECK constraint to an existing table, the operation will fail if there are any rows in the table that do not meet the constraint. Before a CHECK constraint can be added, the data already in the table must be validated to ensure that it complies with the constraint conditions. If any existing records violate the new constraints, they must be corrected or removed before the ALTER TABLE command can be successfully executed.
Comment 1060831 by BIKRAM063
- Upvotes: 1
Selected Answer: C Correct option C : existing data violated check constraint condition
Comment 1057411 by Quadronoid
- Upvotes: 1
Selected Answer: C Right answer is C
Comment 1044838 by sturcu
- Upvotes: 1
Selected Answer: C C - table already has data
Comment 1015359 by MarceloManhaes
- Upvotes: 1
Yes the correct is option C
Question YISFJT5AhX4JVUoFdbJR
Question
Which of the following is true of Delta Lake and the Lakehouse?
Choices
- A: Because Parquet compresses data row by row, strings will only be compressed when a character is repeated multiple times.
- B: Delta Lake automatically collects statistics on the first 32 columns of each table which are leveraged in data skipping based on query filters.
- C: Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.
- D: Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.
- E: Z-order can only be applied to numeric values stored in Delta Lake tables.
answer?
Answer: B Answer_ET: B Community answer B (89%) 11% Discussion
Comment 1341787 by SRV_33
- Upvotes: 1
Selected Answer: B The complete statement is correct only in this option.
Comment 1145316 by PrashantTiwari
- Upvotes: 1
B is correct
Comment 1141084 by guillesd
- Upvotes: 2
Selected Answer: B B is correct
Comment 1131068 by spaceexplorer
- Upvotes: 1
Selected Answer: B B is correct
Comment 1113779 by Crocjun
- Upvotes: 1
Can anyone explain why D is not correct?
Comment 1108687 by Patito
- Upvotes: 3
Selected Answer: B B is correct since statistics are collected for the first 32 columns and stored in the transaction log.
Comment 1106522 by ervinshang
- Upvotes: 1
Selected Answer: B B is correct. C is wrong: views don't maintain a cache of their source tables.
Comment 1101333 by f728f7f
- Upvotes: 1
Selected Answer: C C is correct
Comment 1051446 by chokthewa
- Upvotes: 1
B is correct. https://docs.delta.io/2.0.0/table-properties.html
Question 2O2ggTfzmxaKnrTXl0Nm
Question
The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings. The below query is used to create the alert: //IMG//
The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean(temperature) > 120. Notifications are triggered to be sent at most every 1 minute. If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?
Choices
- A: The total average temperature across all sensors exceeded 120 on three consecutive executions of the query
- B: The recent_sensor_recordings table was unresponsive for three consecutive runs of the query
- C: The source query failed to update properly for three consecutive minutes and then restarted
- D: The maximum temperature recording for at least one sensor exceeded 120 on three consecutive executions of the query
- E: The average temperature recordings for at least one sensor exceeded 120 on three consecutive executions of the query
answer?
Answer: E Answer_ET: E Community answer E (100%) Discussion
Comment 1351579 by EelkeV
- Upvotes: 2
Selected Answer: E Because the mean is calculated for each sensor, and the alert is raised on that value. It happened three times; it is unknown for which sensor. It could be any of them.
Comment 1268444 by AndreFR
- Upvotes: 2
A is excluded because there is a GROUP BY clause. B and C are excluded because the table needs to be updated for the mean value to change. D is excluded because the alert is set on the average, not the max temperature. The correct answer is E, by elimination.
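The distinction between option A (overall average) and option E (per-sensor average) can be illustrated with a small Python sketch. The readings are hypothetical; the real alert would be driven by a Databricks SQL query with a GROUP BY on sensor_id:

```python
# Per-sensor mean vs. overall mean: the alert fires when ANY single
# sensor's average temperature exceeds the threshold, even if the
# overall average across all sensors stays below it.
from collections import defaultdict
from statistics import mean

readings = [
    ("sensor_a", 100), ("sensor_a", 105),
    ("sensor_b", 125), ("sensor_b", 130),   # hot sensor
]

by_sensor = defaultdict(list)
for sensor_id, temp in readings:
    by_sensor[sensor_id].append(temp)

per_sensor_mean = {s: mean(t) for s, t in by_sensor.items()}
alert = any(m > 120 for m in per_sensor_mean.values())
overall = mean(t for _, t in readings)

print(per_sensor_mean)  # {'sensor_a': 102.5, 'sensor_b': 127.5}
print(alert)            # True: sensor_b's mean exceeds 120
print(overall)          # 115.0: the overall mean alone would not trigger
```

Three consecutive notifications therefore only guarantee that at least one sensor's average exceeded 120 on three consecutive query executions.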
Comment 1236071 by panya
- Upvotes: 1
Correct
Comment 1224435 by imatheushenrique
- Upvotes: 1
E. The average temperature recordings for at least one sensor exceeded 120 on three consecutive executions of the query
Comment 1121589 by Jay_98_11
- Upvotes: 1
Selected Answer: E E is correct
Comment 1040233 by sturcu
- Upvotes: 2
Selected Answer: E correct
Comment 1021421 by saikot
- Upvotes: 1
The correct answer is E https://www.myexamcollection.com/databricks-certified-professional-data-engineer-databricks-certified-professional-data-engineer-exam-question-answers.htm