Questions and Answers

Question ntXzzV0x7LobKSuTWfDf

Question

A view is registered with the following code:

//IMG//

Both users and orders are Delta Lake tables.

Which statement describes the results of querying recent_orders?

Choices

  • A: All logic will execute when the view is defined and store the result of joining tables to the DBFS; this stored data will be returned when the view is queried.
  • B: Results will be computed and cached when the view is defined; these cached results will incrementally update as new records are inserted into source tables.
  • C: All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.
  • D: All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.
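For context, here is a minimal PySpark sketch of how such a view could be registered and queried; the join columns below are hypothetical stand-ins for the logic shown in the question's image.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # Registering a view stores only the query text; no join is executed
  # and no data is written at definition time.
  spark.sql("""
      CREATE OR REPLACE VIEW recent_orders AS
      SELECT o.order_id, o.order_time, u.user_name   -- hypothetical columns
      FROM orders o
      JOIN users u ON o.user_id = u.user_id
  """)

  # The join logic runs here, at query time, against the table versions
  # that are valid when the query starts.
  spark.sql("SELECT * FROM recent_orders").show()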

Question sdTAJVROJYfRIR3gv1nL

Question

A data ingestion task requires a 1 TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto Optimize and Auto Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

Choices

  • A: Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.
  • B: Set spark.sql.shuffle.partitions to 2,048 partitions (1 TB × 1024 × 1024 / 512 MB), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.
  • C: Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1 TB × 1024 × 1024 / 512 MB), and then write to parquet.
  • D: Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1 TB × 1024 × 1024 / 512 MB), and then write to parquet.
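A minimal sketch of the read-side sizing approach described in choice A, assuming a hypothetical input path and column names; spark.sql.files.maxPartitionBytes governs how much data each input partition holds, and narrow transformations preserve that partitioning through to the write.

  # 512 MB per input partition, expressed in bytes.
  spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

  df = spark.read.json("/data/raw/events/")  # hypothetical source path

  # Narrow transformations (select, filter, withColumn) do not shuffle,
  # so the ~512 MB input partitions carry through to the output files.
  cleaned = df.select("id", "payload").filter("payload IS NOT NULL")

  cleaned.write.mode("overwrite").parquet("/data/out/events/")  # hypothetical target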

Question Wrj7BXqh47XQHk2pI9Lh

Question

Which statement regarding stream-static joins and static Delta tables is correct?

Choices

  • A: The checkpoint directory will be used to track updates to the static Delta table.
  • B: Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job’s initialization.
  • C: The checkpoint directory will be used to track state information for the unique keys present in the join.
  • D: Stream-static joins cannot use static Delta tables because of consistency issues.
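For reference, a minimal stream-static join sketch with hypothetical table names and paths; the streaming side's progress is tracked in the checkpoint, while the static Delta table side is simply read by name and joined in each microbatch.

  streaming_df = spark.readStream.table("events")   # hypothetical streaming source
  static_df = spark.read.table("device_dim")        # hypothetical static Delta table

  joined = streaming_df.join(static_df, on="device_id", how="left")

  (joined.writeStream
      .option("checkpointLocation", "/chk/events_enriched")  # hypothetical path
      .toTable("events_enriched"))                           # hypothetical sink table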

Question a6wMp9vjIQJJsNXzy9WN

Question

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

Streaming DataFrame df has the following schema:

device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT

Code block:

//IMG//

Which line of code correctly fills in the blank within the code block to complete this task?

Choices

  • A: to_interval("event_time", "5 minutes").alias("time")
  • B: window("event_time", "5 minutes").alias("time")
  • C: "event_time"
  • D: lag("event_time", "10 minutes").alias("time")
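For reference, a sketch of how the completed aggregation might look with choice B filled in; everything around the blank is an assumption about the code shown in the question's image.

  from pyspark.sql.functions import avg, window

  result = (df
      # Non-overlapping (tumbling) five-minute windows keyed on event_time.
      .groupBy(window("event_time", "5 minutes").alias("time"))
      .agg(avg("humidity").alias("avg_humidity"),
           avg("temp").alias("avg_temp")))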

Question l7nbfkY1EoyT51MvKbJj

Question

A Structured Streaming job deployed to production has been resulting in higher-than-expected cloud storage costs. During normal execution, each microbatch of data is processed in less than 3 seconds, and at least 12 times per minute a microbatch containing 0 records is processed. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution.

Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?

Choices

  • A: Set the trigger interval to 3 seconds; the default trigger interval is consuming too many records per batch, resulting in spill to disk that can increase volume costs.
  • B: Use the trigger once option and configure a Databricks job to execute the query every 10 minutes; this approach minimizes costs for both compute and storage.
  • C: Set the trigger interval to 10 minutes; each batch calls APIs in the source storage account, so decreasing trigger frequency to the maximum allowable threshold should minimize this cost.
  • D: Set the trigger interval to 500 milliseconds; setting a small but non-zero trigger interval ensures that the source is not queried too frequently.
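For reference, a sketch showing how the trigger adjustments named in the choices would be expressed, with hypothetical source, sink, and checkpoint names; the trigger setting controls how often each microbatch queries the source storage account.

  df = spark.readStream.table("ingest_raw")          # hypothetical streaming source

  query = (df.writeStream
      .option("checkpointLocation", "/chk/ingest")   # hypothetical path
      .trigger(processingTime="10 minutes")          # fixed interval, as in choice C
      # .trigger(availableNow=True)                  # batch-style alternative to trigger once
      .toTable("ingest_bronze"))                     # hypothetical sink table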