Questions and Answers

Question rOsq6LS2MwxoYcQmIw38

Question

A data team’s Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new promotion, and they would like to add a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows. Note that proposed changes are in bold.

Original query:

//IMG//

Proposed query:

//IMG//

.start("/item_agg")

Which step must also be completed to put the proposed query into production?

Choices

  • A: Specify a new checkpointLocation
  • B: Increase the shuffle partitions to account for additional aggregates
  • C: Run REFRESH TABLE delta.`/item_agg`
  • D: Register the data in the "/item_agg" directory to the Hive metastore
  • E: Remove .option("mergeSchema", "true") from the streaming write
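The original and proposed queries are only shown as images above, but a minimal sketch of the kind of streaming aggregate write being described might look like the following; the column, table, checkpoint, and promotion names here are illustrative assumptions, not the actual query.

from pyspark.sql import functions as F

# Illustrative running aggregate per item; "sales", "item_id", "quantity",
# and "promo_code" are assumed names, not taken from the original query.
item_agg = (
    spark.readStream.table("sales")
    .groupBy("item_id")
    .agg(
        F.sum("quantity").alias("total_sold"),
        # newly proposed aggregate tracking promotion-code usage
        F.count(F.when(F.col("promo_code").isNotNull(), 1)).alias("promo_uses"),
    )
)

(item_agg.writeStream
    .outputMode("complete")
    .option("checkpointLocation", "/checkpoints/item_agg")  # state here was built for the old aggregation logic
    .start("/item_agg"))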

Question UA1iAPk9u9ejipxfyymG

Question

A Structured Streaming job deployed to production has been resulting in higher-than-expected cloud storage costs. At present, during normal execution, each microbatch of data is processed in less than 3 seconds; at least 12 times per minute, a microbatch is processed that contains 0 records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution.

Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?

Choices

  • A: Set the trigger interval to 3 seconds; the default trigger interval is consuming too many records per batch, resulting in spill to disk that can increase volume costs.
  • B: Increase the number of shuffle partitions to maximize parallelism, since the trigger interval cannot be modified without modifying the checkpoint directory.
  • C: Set the trigger interval to 10 minutes; each batch calls APIs in the source storage account, so decreasing trigger frequency to maximum allowable threshold should minimize this cost.
  • D: Set the trigger interval to 500 milliseconds; setting a small but non-zero trigger interval ensures that the source is not queried too frequently.
  • E: Use the trigger once option and configure a Databricks job to execute the query every 10 minutes; this approach minimizes costs for both compute and storage.
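For reference, the two trigger configurations the choices contrast can be sketched as follows; the DataFrame, checkpoint, and target paths are illustrative assumptions.

# Fixed micro-batch interval: the source is polled at most every 10 minutes.
(df.writeStream
    .option("checkpointLocation", "/checkpoints/example")
    .trigger(processingTime="10 minutes")
    .start("/target"))

# "Trigger once" execution, scheduled externally (e.g., a Databricks job that runs every 10 minutes).
(df.writeStream
    .option("checkpointLocation", "/checkpoints/example")
    .trigger(once=True)
    .start("/target"))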

Question oqBpL4l1E6rNZzJqaNKl

Question

Which statement describes the correct use of pyspark.sql.functions.broadcast?

Choices

  • A: It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.
  • B: It marks a column as small enough to store in memory on all executors, allowing a broadcast join.
  • C: It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.
  • D: It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.
  • E: It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.
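A minimal usage sketch of the function in question, with illustrative DataFrame and column names:

from pyspark.sql.functions import broadcast

# broadcast() is applied to a DataFrame (here, an assumed small lookup table),
# hinting to the optimizer that it can be shipped to every executor so the
# join is performed as a broadcast join.
result = orders.join(broadcast(items), on="item_id", how="left")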

Question VmfletzXAl3bWxsHTY1D

Question

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.

In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?

Choices

  • A: Set the configuration delta.deduplicate = true.
  • B: VACUUM the Delta table after each batch completes.
  • C: Perform an insert-only merge with a matching condition on a unique key.
  • D: Perform a full outer join on a unique key and overwrite existing data.
  • E: Rely on Delta Lake schema enforcement to prevent duplicate records.
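For context, an insert-only merge against a Delta table can be sketched as follows; the table path, key column, and batch DataFrame name are illustrative assumptions.

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/delta/sales")

(target.alias("t")
    .merge(batch_df.alias("s"), "t.sale_id = s.sale_id")  # match on a unique key
    .whenNotMatchedInsertAll()                            # insert-only: matched (already-seen) records are skipped
    .execute())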

Question dPEvuEwrLsNOIa2csn8V

Question

A data pipeline uses Structured Streaming to ingest data from Apache Kafka to Delta Lake. Data is being stored in a bronze table, and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline is deployed, the data engineering team has noticed some latency issues during certain times of the day.

A senior data engineer updates the Delta Table’s schema and ingestion logic to include the current timestamp (as recorded by Apache Spark) as well as the Kafka topic and partition. The team plans to use these additional metadata fields to diagnose the transient processing delays.

Which limitation will the team face while diagnosing this problem?

Choices

  • A: New fields will not be computed for historic records.
  • B: Spark cannot capture the topic and partition fields from a Kafka source.
  • C: New fields cannot be added to a production Delta table.
  • D: Updating the table schema will invalidate the Delta transaction log metadata.
  • E: Updating the table schema requires a default value provided for each field added.
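A rough sketch of the updated ingestion logic the question describes, reading from a Kafka source and selecting the added metadata fields; the broker, topic, and column names are illustrative assumptions.

from pyspark.sql import functions as F

bronze = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "item_sales")
    .load()
    .select(
        "key", "value",
        "timestamp",                                    # Kafka-generated timestamp
        "topic", "partition",                           # metadata columns exposed by the Kafka source
        F.current_timestamp().alias("processing_time"), # current time as recorded by Spark
    )
)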