Questions and Answers

Question DQGxvIYeVozLgJQu8WDe

Question

A data engineer wants to refactor the following DLT code, which includes multiple table definitions with very similar code.

//IMG//

In an attempt to programmatically create these tables using a parameterized table definition, the data engineer writes the following code.

//IMG//

The pipeline runs an update with this refactored code, but generates a different DAG showing incorrect configuration values for these tables.

How can the data engineer fix this?

Choices

  • A: Wrap the for loop inside another table definition, using generalized names and properties that are replaced with those from the inner table definition.
  • B: Convert the list of configuration values to a dictionary of table settings, using table names as keys.
  • C: Move the table definition into a separate function, and make calls to this function using different input parameters inside the for loop.
  • D: Load the configuration values for these tables from a separate file, located at a path provided by a pipeline parameter.
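The refactoring pattern described in choice C can be sketched in plain Python (no DLT runtime; the table names are hypothetical). Defining functions directly inside a for loop makes every closure share the same loop variable, which is one way a refactored pipeline ends up with the same (wrong) configuration values on every table; moving the definition into a function binds each call's parameters in its own scope.

```python
# Minimal sketch of why defining tables directly in a loop can yield wrong
# configuration values: functions defined in a loop capture the loop variable
# by reference, so every function sees the value from the last iteration.
table_names = ["bronze_a", "bronze_b", "bronze_c"]  # hypothetical names

# Broken pattern: all three closures share the same `name` variable.
broken = []
for name in table_names:
    def define_table():
        return name  # late binding: resolved when called, not when defined
    broken.append(define_table)

# Every function now returns the final loop value.
assert [f() for f in broken] == ["bronze_c", "bronze_c", "bronze_c"]

# Fixed pattern (choice C): move the definition into a separate function and
# call it inside the loop, so each call binds its own parameter value.
def make_table(table_name):
    def define_table():
        return table_name  # bound to this call's argument
    return define_table

fixed = [make_table(n) for n in table_names]
assert [f() for f in fixed] == ["bronze_a", "bronze_b", "bronze_c"]
```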

Question ncAplwMCDiQwucWNcTcd

Question

Which statement characterizes the general programming model used by Spark Structured Streaming?

Choices

  • A: Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.
  • B: Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.
  • C: Structured Streaming uses specialized hardware and I/O streams to achieve sub-second latency for data transfer.
  • D: Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.
  • E: Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.
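The model described in choice D can be simulated in plain Python (this is a conceptual sketch, not Spark code, with hypothetical data): a stream is treated as an unbounded input table, each micro-batch appends new rows, and the query result is maintained incrementally over the growing table.

```python
# Conceptual sketch of Structured Streaming's programming model:
# new stream data = new rows appended to an unbounded table.
unbounded_table = []   # the conceptual, ever-growing input table
running_counts = {}    # incrementally maintained query result (a word count)

def process_micro_batch(rows):
    # New data arriving in the stream is appended to the unbounded table...
    unbounded_table.extend(rows)
    # ...and the result is updated incrementally rather than recomputed.
    for word in rows:
        running_counts[word] = running_counts.get(word, 0) + 1

process_micro_batch(["spark", "stream"])
process_micro_batch(["spark"])
assert running_counts == {"spark": 2, "stream": 1}
```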

Question aaeirAs5Cm4derdxGkLE

Question

Which configuration parameter directly affects the size of a Spark partition upon ingestion of data into Spark?

Choices

  • A: spark.sql.files.maxPartitionBytes
  • B: spark.sql.autoBroadcastJoinThreshold
  • C: spark.sql.files.openCostInBytes
  • D: spark.sql.adaptive.coalescePartitions.minPartitionNum
  • E: spark.sql.adaptive.advisoryPartitionSizeInBytes
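Among the choices, spark.sql.files.maxPartitionBytes is the parameter that caps how many bytes go into a single partition when Spark reads files. Back-of-the-envelope arithmetic (plain Python, hypothetical input size) shows its direct effect on partition count at ingestion; this ignores spark.sql.files.openCostInBytes and file boundaries, which also influence the final split:

```python
import math

# Approximate partition count at read time, ignoring file-open cost and
# file boundaries: Spark splits input so no partition exceeds the cap.
max_partition_bytes = 128 * 1024 * 1024   # default cap is 128 MB
total_input_bytes = 1 * 1024 ** 3         # 1 GB of input files (hypothetical)

num_partitions = math.ceil(total_input_bytes / max_partition_bytes)
assert num_partitions == 8

# Halving the cap roughly doubles the partition count (smaller partitions).
assert math.ceil(total_input_bytes / (max_partition_bytes // 2)) == 16
```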

Question fAKTHx7trU7vYFzbUQXB

Question

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that, for tasks in a particular stage, the Min and Median task durations are roughly the same, but the Max duration is roughly 100 times the Min. Which situation is causing the increased duration of the overall job?

Choices

  • A: Task queueing resulting from improper thread pool assignment.
  • B: Spill resulting from attached volume storage being too small.
  • C: Network latency due to some cluster nodes being in different regions from the source data.
  • D: Skew caused by more data being assigned to a subset of Spark partitions.
  • E: Credential validation errors while pulling data from an external system.
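The symptom in the question is the classic signature of skew (choice D). A toy calculation (plain Python, hypothetical numbers) reproduces it: when almost all partitions receive similar amounts of data but a few receive far more, Min and Median task durations stay roughly equal while Max dwarfs them, and the stage cannot finish until the longest task does.

```python
import statistics

# 15 evenly sized partitions plus one skewed partition with 100x the data;
# task duration is modeled as roughly linear in rows processed.
rows_per_partition = [10_000] * 15 + [1_000_000]
task_durations = [rows / 10_000 for rows in rows_per_partition]

# Min and Median are roughly the same...
assert min(task_durations) == statistics.median(task_durations) == 1.0
# ...but Max is ~100x the Min, so the stage waits on one straggler task.
assert max(task_durations) == 100 * min(task_durations)
```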

Question bUUaAnJWxDHlnnXRLqNh

Question

Each cluster configuration below is identical in that each cluster has 400 GB of RAM in total, 160 cores in total, and only one Executor per VM. Given a job with at least one wide transformation, which of the following cluster configurations will result in maximum performance?

Choices

  • A: • Total VMs: 1 • 400 GB per Executor • 160 Cores per Executor
  • B: • Total VMs: 8 • 50 GB per Executor • 20 Cores per Executor
  • C: • Total VMs: 16 • 25 GB per Executor • 10 Cores per Executor
  • D: • Total VMs: 4 • 100 GB per Executor • 40 Cores per Executor
  • E: • Total VMs: 2 • 200 GB per Executor • 80 Cores per Executor
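A quick sanity check (plain Python, numbers taken from the choices above) confirms every option allocates the same totals; the options differ only in how the resources are split across VMs, which determines how much of a wide transformation's shuffle can stay local to a single executor rather than crossing the network between VMs.

```python
# (total VMs, GB per Executor, cores per Executor) for each choice.
configs = {
    "A": (1, 400, 160),
    "B": (8, 50, 20),
    "C": (16, 25, 10),
    "D": (4, 100, 40),
    "E": (2, 200, 80),
}

# Every configuration sums to the same 400 GB of RAM and 160 cores.
for vms, gb, cores in configs.values():
    assert vms * gb == 400
    assert vms * cores == 160
```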