Questions and Answers

Question 9pPsQ1dW9Xiuc8b3xZgi

Question

A junior data engineer on your team has implemented the following code block.

//IMG//

The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table.

When this query is executed, what will happen to new records that have the same event_id as an existing record?

Choices

  • A: They are merged.
  • B: They are ignored.
  • C: They are updated.
  • D: They are inserted.
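The code block itself is not reproduced here, so the following is only an illustration: assuming the block performs an insert-only merge (a `MERGE INTO` with just a `WHEN NOT MATCHED THEN INSERT` clause), its semantics can be sketched with a hypothetical in-memory stand-in for the Delta table:

```python
# Hypothetical in-memory stand-in for the events Delta table, keyed by event_id.
events = {
    "e1": {"event_id": "e1", "value": 10},
    "e2": {"event_id": "e2", "value": 20},
}

# Batch from the new_events view; "e2" collides with an existing key.
new_events = [
    {"event_id": "e2", "value": 99},
    {"event_id": "e3", "value": 30},
]

def merge_insert_only(target, batch):
    """WHEN NOT MATCHED THEN INSERT: new keys are added,
    while records whose key already exists are left untouched."""
    for record in batch:
        if record["event_id"] not in target:
            target[record["event_id"]] = record
    return target

merge_insert_only(events, new_events)
print(events["e2"]["value"])  # → 20: the colliding record was ignored
print("e3" in events)         # → True: the genuinely new record was inserted
```

With this insert-only shape, a colliding `event_id` leaves the existing row unchanged; an update would only occur if the merge also had a `WHEN MATCHED THEN UPDATE` clause.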

Question 4ROJiPnXNj0Ib6RtBiA3

Question

A Structured Streaming job deployed to production has been experiencing delays during peak hours. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds. Holding all other variables constant, and assuming records need to be processed in less than 10 seconds, which adjustment will meet this requirement?

Choices

  • A: Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.
  • B: Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.
  • C: The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.
  • D: Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.
  • E: Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.

Question 9lFLULRyWdwqJwqp7rDR

Question

A team of data engineers is adding tables to a DLT pipeline that repeat expectations for many of the same data quality checks. One member of the team suggests reusing these data quality rules across all tables defined in this pipeline.

What approach would allow them to do this?

Choices

  • A: Add data quality constraints to tables in this pipeline using an external job with access to pipeline configuration files.
  • B: Use global Python variables to make expectations visible across DLT notebooks included in the same pipeline.
  • C: Maintain data quality rules in a separate Databricks notebook that each DLT notebook or file can import as a library.
  • D: Maintain data quality rules in a Delta table outside of this pipeline’s target schema, providing the schema name as a pipeline parameter.
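The reuse pattern being described can be sketched in plain Python. DLT's `dlt` module is only importable inside a running pipeline, so this sketch fakes an `expect_all`-style decorator; the module name `shared_rules`, the rule names, and the table function are all hypothetical:

```python
# Hypothetical shared-rules module a team might import from each DLT
# notebook; the real dlt.expect_all is faked here with plain Python so
# the pattern can be shown outside a pipeline.

COMMON_RULES = {
    "valid_id": "id IS NOT NULL",
    "valid_timestamp": "event_ts > '2020-01-01'",
}

def expect_all(rules):
    """Stand-in for a dlt.expect_all-style decorator: attaches the
    shared rule set to a table-defining function."""
    def decorator(fn):
        fn.expectations = dict(rules)
        return fn
    return decorator

@expect_all(COMMON_RULES)
def bronze_orders():
    return "SELECT * FROM raw_orders"  # placeholder for the real query

print(sorted(bronze_orders.expectations))  # → ['valid_id', 'valid_timestamp']
```

Because the rules live in one importable module, every table definition in the pipeline can apply the same dictionary instead of restating each constraint.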

Question kq9QC6lcVBqdorDO5UCR

Question

A user wants to use DLT expectations to validate that a derived table, report, contains all records from the source table, validation_copy.

The user attempts and fails to accomplish this by adding an expectation to the report table definition.

//IMG//

Which approach would allow using DLT expectations to validate that all expected records are present in this table?

Choices

  • A: Define a temporary table that performs a left outer join on validation_copy and report, and define an expectation that no report key values are null
  • B: Define a SQL UDF that performs a left outer join on two tables, and check if this returns null values for report key values in a DLT expectation for the report table
  • C: Define a view that performs a left outer join on validation_copy and report, and reference this view in DLT expectations for the report table
  • D: Define a function that performs a left outer join on validation_copy and report, and check against the result in a DLT expectation for the report table
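The completeness check that all of these options revolve around can be sketched with plain Python sets (the key values are hypothetical): a left outer join of validation_copy against report surfaces unmatched source keys as NULL report keys, which is exactly what set difference computes.

```python
# Hypothetical key sets for the two tables.
validation_copy_keys = {"a", "b", "c", "d"}
report_keys = {"a", "b", "c"}

def missing_from_report(source_keys, report_keys):
    """Simulates the left outer join of validation_copy against report:
    source keys with no match would show up as NULL report keys."""
    return source_keys - report_keys

missing = missing_from_report(validation_copy_keys, report_keys)
print(missing)  # → {'d'}: a record present in the source but absent from report
```

An expectation defined over a join like this fails whenever the result is non-empty, i.e. whenever report is missing a source record.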

Question HAFezWFahsKdX3S28ocQ

Question

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell by cell, using display() calls to confirm that the code produces logically correct results as new transformations are added to an operation. To get a measure of average execution time, the user runs each cell multiple times interactively.

Which of the following adjustments will give a more accurate measure of how the code is likely to perform in production?

Choices

  • A: The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.
  • B: The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution.
  • C: Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.
  • D: Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.
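The two effects described in choice D, lazy transformations that do no work until an action runs, and caching that makes repeated timings misleading, have a rough pure-Python analogy (this is only an analogy, not Spark itself):

```python
from functools import lru_cache

# Lazy "transformations": building the pipeline does no work yet...
pipeline = (x * 2 for x in range(1_000_000))
# ...only the "action" (consuming it) actually executes the computation,
# much like display() triggering a Spark job.
first_results = [next(pipeline) for _ in range(3)]
print(first_results)  # → [0, 2, 4]

# Caching: the second identical call returns a memoized result, so timing
# repeated runs of the same logic measures the cache, not the real work.
calls = {"count": 0}

@lru_cache(maxsize=None)
def expensive(n):
    calls["count"] += 1
    return sum(range(n))

expensive(10_000)
expensive(10_000)      # served from cache
print(calls["count"])  # → 1: the computation only ran once
```

This is why rerunning the same cell interactively tends to understate production cost: the first run pays for the work, and later runs largely measure cached results.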