Questions and Answers

Question aSgHpnxemCqtjyiUnj1X

Question

Each of the configurations below is identical in that each cluster has 400 GB of total RAM, 160 total cores, and only one executor per VM.

Given an extremely long-running job whose completion must be guaranteed, which cluster configuration can guarantee that the job completes in the event of one or more VM failures?

Choices

  • A: Total VMs: 8; 50 GB per executor; 20 cores per executor
  • B: Total VMs: 16; 25 GB per executor; 10 cores per executor
  • C: Total VMs: 1; 400 GB per executor; 160 cores per executor
  • D: Total VMs: 4; 100 GB per executor; 40 cores per executor
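A quick way to compare these choices is to compute what each cluster has left after losing a VM. The sketch below is illustrative only (the tuples come straight from the answer choices; one executor per VM, per the question):

```python
# Remaining cluster capacity after VM failures for each configuration
# in the question (one executor per VM, per the question's setup).

configs = {
    "A": {"vms": 8,  "gb_per_executor": 50,  "cores_per_executor": 20},
    "B": {"vms": 16, "gb_per_executor": 25,  "cores_per_executor": 10},
    "C": {"vms": 1,  "gb_per_executor": 400, "cores_per_executor": 160},
    "D": {"vms": 4,  "gb_per_executor": 100, "cores_per_executor": 40},
}

def capacity_after_failures(cfg, failed_vms=1):
    """Resources still available after `failed_vms` VMs are lost."""
    remaining = max(cfg["vms"] - failed_vms, 0)
    return {
        "executors": remaining,
        "ram_gb": remaining * cfg["gb_per_executor"],
        "cores": remaining * cfg["cores_per_executor"],
    }

for name, cfg in configs.items():
    print(name, capacity_after_failures(cfg))
```

Note that a single-VM cluster (choice C) is left with zero executors after one failure, while the more, smaller VMs a configuration has, the smaller the fraction of capacity lost per failure.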

Question s9bsbaoJT28uFRn5yZaC

Question

A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:

//IMG//

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system.

If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?

Choices

  • A: Duplicate records enqueued more than 2 hours apart may be retained and the orders table may contain duplicate records with the same customer_id and order_id.
  • B: All records will be held in the state store for 2 hours before being deduplicated and committed to the orders table.
  • C: The orders table will contain only the most recent 2 hours of records and no duplicates will be present.
  • D: The orders table will not contain duplicates, but records arriving more than 2 hours late will be ignored and missing from the table.
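The code referenced by the question is not reproduced here, so the following is only an assumption: a minimal pure-Python simulation of how a streaming deduplication with a 2-hour watermark behaves. Keys are remembered in state only while they fall within the watermark window, so a duplicate enqueued more than 2 hours after the original is no longer matched against state:

```python
# Simulated streaming dedup with a 2-hour watermark (an assumption about
# the hidden code, not a reproduction of it). State entries older than the
# watermark are expired, exactly the mechanism behind the answer choices.

WATERMARK_HOURS = 2

def process_stream(records):
    """records: list of (customer_id, order_id, event_hour), in event order."""
    state = {}   # (customer_id, order_id) -> last seen event_hour
    output = []
    for cust, order, hour in records:
        # Expire state entries older than the watermark, as the engine would.
        state = {k: h for k, h in state.items() if hour - h <= WATERMARK_HOURS}
        key = (cust, order)
        if key not in state:        # not a known duplicate -> emitted
            output.append((cust, order, hour))
        state[key] = hour
    return output

out = process_stream([
    ("c1", "o1", 0),   # original order
    ("c1", "o1", 1),   # duplicate 1 hour later -> matched in state, dropped
    ("c1", "o1", 5),   # duplicate 5 hours later -> state expired, retained
])
print(out)   # -> [('c1', 'o1', 0), ('c1', 'o1', 5)]
```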

Question utBTqWfkLzDH8lmppTWE

Question

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.

In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?

Choices

  • A: Rely on Delta Lake schema enforcement to prevent duplicate records.
  • B: VACUUM the Delta table after each batch completes.
  • C: Perform an insert-only merge with a matching condition on a unique key.
  • D: Perform a full outer join on a unique key and overwrite existing data.
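The insert-only merge pattern named in choice C corresponds to a Delta Lake `MERGE ... WHEN NOT MATCHED THEN INSERT` keyed on the unique key. A pure-Python sketch of its semantics (names are illustrative, not from the question):

```python
# Simulated insert-only merge: rows from a new batch are inserted only if
# their key is not already present in the target, which deduplicates the
# batch against records written by earlier batches.

def insert_only_merge(target, batch, key_fn):
    """Insert only those batch rows whose key is absent from target."""
    existing = {key_fn(row) for row in target}
    for row in batch:
        k = key_fn(row)
        if k not in existing:       # WHEN NOT MATCHED THEN INSERT
            target.append(row)
            existing.add(k)
    return target

table = []
insert_only_merge(table, [{"order_id": 1, "amt": 10}],
                  key_fn=lambda r: r["order_id"])
# A later batch re-delivers order 1; the merge condition filters it out.
insert_only_merge(table, [{"order_id": 1, "amt": 10}, {"order_id": 2, "amt": 7}],
                  key_fn=lambda r: r["order_id"])
print(table)   # order 1 appears once
```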

Question uBxn0eXd0Fzo1PEELyp0

Question

A junior data engineer seeks to leverage Delta Lake’s Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job:

//IMG//

Which statement describes the execution and results of running the above query multiple times?

Choices

  • A: Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.
  • B: Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.
  • C: Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table, giving the desired result.
  • D: Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.
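Without the code in the image, the distinction the choices draw can still be sketched: a change data feed is a log of insert/update events, and what happens on repeated runs depends on whether the job re-reads the full history and appends, or merges by primary key. A hypothetical simulation (event data and names are invented for illustration):

```python
# A change feed as a list of insert/update events. Re-reading from version 0
# and appending re-writes earlier events on every run, accumulating
# duplicates; merging on the primary key instead keeps one current row per
# key, which is what a Type 1 target table requires.

change_feed = [
    {"pk": 1, "value": "a", "_change_type": "insert"},
    {"pk": 1, "value": "b", "_change_type": "update_postimage"},
    {"pk": 2, "value": "x", "_change_type": "insert"},
]

def run_append_job(target, feed):
    target.extend(feed)             # appends the whole available history
    return target

def run_merge_job(target, feed):
    for event in feed:              # upsert: last event per key wins
        target[event["pk"]] = event["value"]
    return target

appended = []
run_append_job(appended, change_feed)
run_append_job(appended, change_feed)   # second daily run duplicates history
print(len(appended))                    # -> 6

merged = {}
run_merge_job(merged, change_feed)
run_merge_job(merged, change_feed)      # idempotent: one current row per key
print(merged)                           # -> {1: 'b', 2: 'x'}
```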

Question ir9F7dlyrvVitDkdTPFw

Question

A DLT pipeline includes the following streaming tables:

• raw_iot ingests raw device measurement data from a heart rate tracking device.
• bpm_stats incrementally computes user statistics based on BPM measurements from raw_iot.

How can the data engineer configure this pipeline to retain manually deleted or updated records in the raw_iot table, while still recomputing the downstream bpm_stats table when a pipeline update is run?

Choices

  • A: Set the pipelines.reset.allowed property to false on raw_iot
  • B: Set the skipChangeCommits flag to true on raw_iot
  • C: Set the pipelines.reset.allowed property to false on bpm_stats
  • D: Set the skipChangeCommits flag to true on bpm_stats
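For reference, the two properties named in the choices are set in different places: pipelines.reset.allowed is a Delta Live Tables table property, while skipChangeCommits is an option on a streaming read. A hedged sketch of where each would appear (this only runs inside a Databricks DLT pipeline; paths and column names are illustrative, not from the question):

```python
import dlt
from pyspark.sql import functions as F

# Table property: pipelines.reset.allowed = false prevents a full refresh
# from truncating and recomputing this table.
@dlt.table(table_properties={"pipelines.reset.allowed": "false"})
def raw_iot():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/iot_raw/"))          # path is illustrative

# Read option: skipChangeCommits = true makes a streaming read ignore
# transactions that update or delete rows in the source table.
@dlt.table
def bpm_stats():
    return (spark.readStream
            .option("skipChangeCommits", "true")
            .table("raw_iot")
            .groupBy("user_id")              # column names are illustrative
            .agg(F.avg("bpm").alias("avg_bpm")))
```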