Questions and Answers
Question 9pPsQ1dW9Xiuc8b3xZgi
Question
A junior data engineer on your team has implemented the following code block.
//IMG//
The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table.
When this query is executed, what will happen with new records that have the same event_id as an existing record?
Choices
- A: They are merged.
- B: They are ignored.
- C: They are updated.
- D: They are inserted.
Answer: B | Answer_ET: B | Community answer: B (100%)
Discussion
Comment 1400957 by Kyries
- Upvotes: 1
Selected Answer: B. B is correct.
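For context: the code block is an image not reproduced here, but the behavior in answer B matches an insert-only merge. A minimal sketch of that pattern follows; the exact statement in the image is an assumption, with only the table, view, and key names taken from the question:

```python
# Hypothetical reconstruction of the statement in the image: an insert-only
# MERGE. Records in new_events whose event_id already matches a row in
# events satisfy the ON condition, and because there is no WHEN MATCHED
# clause they are simply ignored; only unmatched records are inserted.
spark.sql("""
    MERGE INTO events
    USING new_events
    ON events.event_id = new_events.event_id
    WHEN NOT MATCHED THEN
      INSERT *
""")
```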
Question 4ROJiPnXNj0Ib6RtBiA3
Question
A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds. Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?
Choices
- A: Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.
- B: Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.
- C: The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.
- D: Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.
- E: Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.
Answer: E | Answer_ET: E | Community answer: E (92%), other 4%
Discussion
Comment 1114342 by RafaelCFC
- Upvotes: 9
Selected Answer: E. I believe this is a case of picking the least bad option, not the best option possible.
- A is wrong because in streaming you very rarely have any executors idle, as all cores are engaged in processing the current window of data;
- B is wrong because triggering every 30s will not meet the 10s processing requirement;
- C is wrong in two ways: increasing shuffle partitions beyond the number of available cores in the cluster will worsen streaming performance, and the checkpoint directory has no connection to the trigger interval;
- D is wrong because, with all other variables held constant as the problem describes, keeping an effective 10s trigger cadence will not change the underlying cause of the delay (i.e., too much data to process in a timely manner).
E is the only option that might improve processing time.
Comment 1334781 by arekm
- Upvotes: 1
Selected Answer: E. Answer E; see the explanation by RafaelCFC.
Comment 1334272 by ASRCA
- Upvotes: 1
Selected Answer: A. Option A emphasizes utilizing idle executors to begin processing the next batch while longer-running tasks from previous batches finish. This approach can help maintain a steady flow of data processing and reduce the likelihood of bottlenecks.
Comment 1321828 by Thameur01
- Upvotes: 1
Selected Answer: B. If microbatch execution occasionally exceeds 30 seconds, a 5-second trigger interval would cause multiple batches to queue up while the previous batch is still running, exacerbating the delays and potentially leading to backpressure and failures, so B is the best option here. If we take it as given that execution must finish in under 10 seconds, then a 5-second interval makes more sense and E is the best answer.
Comment 1261091 by wdeleersnyder
- Upvotes: 2
In Databricks Runtime 11.3 LTS and above, the Trigger.Once setting is deprecated. Databricks recommends you use Trigger.AvailableNow for all incremental batch processing workloads.
https://docs.databricks.com/en/structured-streaming/triggers.html
So D (the trigger-once approach) doesn’t seem like a valid and recommended option, given that it is deprecated.
Comment 1222470 by imatheushenrique
- Upvotes: 2
The best option for a performance gain is E: decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.
Comment 1149836 by ojudz08
- Upvotes: 2
Selected Answer: E. E is the answer. Enabling that setting uses 128 MB as the target file size: https://learn.microsoft.com/en-us/azure/databricks/delta/tune-file-size
Comment 1136828 by DAN_H
- Upvotes: 2
Selected Answer: E. E is correct; A is wrong because in streaming you very rarely have any idle executors.
Comment 1118618 by kz_data
- Upvotes: 1
Selected Answer: E. I think E is correct.
Comment 1105127 by ervinshang
- Upvotes: 1
Selected Answer: E. The correct answer is E.
Comment 1065030 by ofed
- Upvotes: 1
Only C. Even if you trigger more frequently, you decrease neither the load nor the time needed to process it; E doesn’t change anything.
Comment 1040333 by sturcu
- Upvotes: 4
Selected Answer: E. Changing the trigger to “once” would turn this into a batch run that no longer executes in microbatches, which would not help at all.
Comment 1013349 by Eertyy
- Upvotes: 1
The correct answer is E.
Comment 1003002 by azurearch
- Upvotes: 1
Sorry, the caveat is “holding all other variables constant”; does that mean we are not allowed to change the trigger interval? Is C the answer then?
Comment 1002969 by azurearch
- Upvotes: 1
What if more records arrive within the 5-second trigger interval? That would still increase the time it takes to process, so I doubt E is correct. I will go with answer D: the goal is not to execute everything within 10 seconds, but to execute a trigger-once batch every 10 seconds.
Comment 1002913 by azurearch
- Upvotes: 1
Option A is also about setting the trigger interval to 5 seconds; just to understand, why isn’t it the answer?
Comment 999175 by cotardo2077
- Upvotes: 2
Selected Answer: E. For sure E.
Comment 994000 by Eertyy
- Upvotes: 2
The correct answer is E.
Comment 970010 by asmayassineg
- Upvotes: 4
The correct answer is E. D means a job would need to acquire resources every 10 seconds, which is impossible without serverless compute.
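For reference, the change proposed in option E is a one-line adjustment to the stream’s trigger. A minimal sketch, assuming a streaming DataFrame named df; the checkpoint and output paths are illustrative placeholders:

```python
# Option E sketch: lower the trigger interval from 10s to 5s so each
# microbatch pulls a smaller slice of the backlog during peak hours.
(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/events")  # unchanged
    .trigger(processingTime="5 seconds")  # previously "10 seconds"
    .start("/tmp/tables/events"))
```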
Question 9lFLULRyWdwqJwqp7rDR
Question
A team of data engineers are adding tables to a DLT pipeline that contain repetitive expectations for many of the same data quality checks. One member of the team suggests reusing these data quality rules across all tables defined for this pipeline.
What approach would allow them to do this?
Choices
- A: Add data quality constraints to tables in this pipeline using an external job with access to pipeline configuration files.
- B: Use global Python variables to make expectations visible across DLT notebooks included in the same pipeline.
- C: Maintain data quality rules in a separate Databricks notebook that each DLT notebook or file can import as a library.
- D: Maintain data quality rules in a Delta table outside of this pipeline’s target schema, providing the schema name as a pipeline parameter.
Answer: D | Answer_ET: D | Community answer: D (75%), C (25%)
Discussion
Comment 1323650 by benni_ale
- Upvotes: 5
Selected Answer: D. Per https://docs.databricks.com/en/delta-live-tables/expectations.html: “You can maintain data quality rules separately from your pipeline implementations. Databricks recommends storing the rules in a Delta table with each rule categorized by a tag.”
Comment 1410019 by lakime
- Upvotes: 1
Selected Answer: C. Initially I picked C; currently I’d go with D.
Comment 1335576 by arekm
- Upvotes: 1
Selected Answer: D. D is what Databricks suggests as of now.
Comment 1324005 by Thameur01
- Upvotes: 1
Selected Answer: C. To reuse repetitive data quality rules across multiple tables in a Delta Live Tables (DLT) pipeline, the most efficient approach is to maintain these rules in a separate notebook or Python module and import them where needed. This promotes code reusability, maintainability, and consistency.
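The pattern behind answer D is the one Databricks documents: keep each rule as a row in a Delta table, load the matching rules when the pipeline starts, and apply them with expect_all. A minimal sketch, assuming a rules table with name, constraint, and tag columns and a pipeline parameter holding the schema name (all table, column, and parameter names are illustrative):

```python
import dlt

def get_rules(tag):
    # Load the quality rules tagged for this dataset from a Delta table
    # kept outside the pipeline's target schema; the schema name arrives
    # as a pipeline parameter. Names here are assumptions.
    schema = spark.conf.get("rules_schema")
    df = spark.table(f"{schema}.rules").filter(f"tag = '{tag}'")
    return {row["name"]: row["constraint"] for row in df.collect()}

@dlt.table
@dlt.expect_all(get_rules("orders"))  # the same rules are reusable on any table
def orders_clean():
    return spark.readStream.table("orders_raw")
```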
Question kq9QC6lcVBqdorDO5UCR
Question
A user wants to use DLT expectations to validate that a derived table, report, contains all records from the source, which are included in the table validation_copy.
The user attempts and fails to accomplish this by adding an expectation to the report table definition.
//IMG//
Which approach would allow using DLT expectations to validate all expected records are present in this table?
Choices
- A: Define a temporary table that performs a left outer join on validation_copy and report, and define an expectation that no report key values are null
- B: Define a SQL UDF that performs a left outer join on two tables, and check if this returns null values for report key values in a DLT expectation for the report table
- C: Define a view that performs a left outer join on validation_copy and report, and reference this view in DLT expectations for the report table
- D: Define a function that performs a left outer join on validation_copy and report, and check against the result in a DLT expectation for the report table
Answer: A | Answer_ET: A | Community answer: A (50%), C (50%)
Discussion
Comment 1401039 by mohadjhamad
- Upvotes: 1
Selected Answer: C. Why a view works: by creating a view that performs a left outer join between validation_copy (the expected source of records) and report, missing records can be identified wherever report.key is NULL. Referencing the view in DLT: the view can then be used in an expectation rule on the report table to ensure all expected records are present.
Comment 1323665 by benni_ale
- Upvotes: 1
Selected Answer: A. Arguably a better solution would be to define a materialized view that performs a left outer join on validation_copy and report, and define an expectation that no report key values are null: https://docs.databricks.com/en/delta-live-tables/expectations.html?utm_source=chatgpt.com#perform-advanced-validation-with-delta-live-tables-expectations
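For context, answer A maps onto the advanced-validation pattern in the Databricks docs: materialize the join as a temporary table and attach the expectation there. A minimal sketch, assuming both tables share a key column named key (the column and function names are illustrative):

```python
import dlt

@dlt.table(temporary=True)  # validation-only table, not published downstream
@dlt.expect_or_fail("all_records_present", "r_key IS NOT NULL")
def validate_report_completeness():
    # A left outer join from validation_copy to report leaves r_key NULL
    # for any source record that is missing from the derived report table,
    # so the expectation fails the update if any record is absent.
    return spark.sql("""
        SELECT v.key AS v_key, r.key AS r_key
        FROM LIVE.validation_copy v
        LEFT OUTER JOIN LIVE.report r
        ON v.key = r.key
    """)
```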
Question HAFezWFahsKdX3S28ocQ
Question
A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.
Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?
Choices
- A: The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.
- B: The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution.
- C: Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.
- D: Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.
Answer: B | Answer_ET: B | Community answer: B (75%), D (25%)
Discussion
Comment 1402027 by suranga4
- Upvotes: 1
Selected Answer: B. B is the right answer.
Comment 1401041 by mohadjhamad
- Upvotes: 1
Selected Answer: D. Calling display() forces a job to trigger.
Comment 1337372 by mouthwash
- Upvotes: 1
Selected Answer: B. B is the right answer.
Comment 1335580 by arekm
- Upvotes: 1
Selected Answer: B. This is a repeated question. You need production-like data and a production-like environment to troubleshoot performance issues. display() does introduce overhead, but here it is used to investigate where the problems start to surface; we still need the production-like environment and data first.
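The mechanism option D describes is easy to demonstrate: transformations only extend the logical query plan, an action such as display() or count() actually triggers a job, and caching makes repeated interactive runs of the same cell unrepresentative. A minimal sketch (table and column names are illustrative):

```python
# Transformations are lazy: these lines only build a logical query plan,
# so timing this cell alone measures almost nothing.
df = (spark.read.table("events")
        .filter("event_date >= '2024-01-01'")
        .groupBy("event_type")
        .count())

# An action triggers execution: display(df) in a notebook, or count()/collect().
df.count()  # first run: full execution against the source data
df.count()  # repeated runs may reuse cached data and shuffle files, so they
            # say little about cold performance on production-sized data
```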