Questions and Answers

Question YCCp8V78FGdJhhI4wNc2

Question

An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represents all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema: user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT. New records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id. Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?

Choices

  • A: Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger-once job to batch update newly detected files into the account_current table.
  • B: Overwrite the account_current table with each batch using the results of a query against the account_history table grouping by user_id and filtering for the max value of last_updated.
  • C: Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id; write a merge statement to update or insert the most recent value for each user_id.
  • D: Use Delta Lake version history to get the difference between the latest version of account_history and one version prior, then write these records to account_current.
  • E: Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.
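
For reference, a minimal sketch of the filter-and-merge pattern described in choice C might look like the following. The variable batch_start_ts (the epoch timestamp marking the start of the hour being processed) is a hypothetical placeholder; the table and column names come from the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
batch_start_ts = 1_700_000_000  # hypothetical hour boundary computed by the job

spark.sql(f"""
    MERGE INTO account_current AS tgt
    USING (
      SELECT user_id, username, user_utc, user_region,
             last_login, auto_pay, last_updated
      FROM (
        SELECT h.*,
               ROW_NUMBER() OVER (PARTITION BY user_id
                                  ORDER BY last_updated DESC) AS rn
        FROM account_history h
        WHERE last_updated >= {batch_start_ts}
      ) ranked
      WHERE ranked.rn = 1   -- keep only the most recent record per user_id
    ) AS src
    ON tgt.user_id = src.user_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```

Filtering account_history down to the latest hour keeps the merge source at tens of thousands of rows rather than rescanning millions of accounts each batch.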

Question tunrJHVCdlciidWUf8Tl

Question

Which statement describes Delta Lake optimized writes?

Choices

  • A: Before a Jobs cluster terminates, OPTIMIZE is executed on all tables modified during the most recent job.
  • B: An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB.
  • C: A shuffle occurs prior to writing to try to group similar data together, resulting in fewer files instead of each executor writing multiple files based on directory partitions.
  • D: Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.
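
For reference, a minimal sketch of how optimized writes are typically enabled on Databricks, either session-wide or as a table property; the table name sales_bronze is illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Session-wide: apply the pre-write shuffle to all Delta writes in this session.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

# Per-table: persist the behavior as a table property.
spark.sql("""
    ALTER TABLE sales_bronze
    SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
""")
```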

Question 7kWoGfFgk7PjaHW86rLe

Question

Which statement characterizes the general programming model used by Spark Structured Streaming?

Choices

  • A: Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.
  • B: Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.
  • C: Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.
  • D: Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.
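
For reference, a minimal sketch of the unbounded-table model: rows arriving from the built-in rate test source are treated as appends to a conceptually infinite input table, and the aggregation is incrementally recomputed against it on each trigger.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream_df = (spark.readStream
                  .format("rate")             # generates (timestamp, value) rows
                  .option("rowsPerSecond", 10)
                  .load())

counts = (stream_df
          .selectExpr("value % 5 AS bucket")
          .groupBy("bucket")
          .count())

query = (counts.writeStream
               .outputMode("complete")        # re-emit the full result table each trigger
               .format("console")
               .trigger(processingTime="10 seconds")
               .start())
```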

Question kTA7ztVOtDkA131KYQAO

Question

Which configuration parameter directly affects the size of a Spark partition upon ingestion of data into Spark?

Choices

  • A: spark.sql.files.maxPartitionBytes
  • B: spark.sql.autoBroadcastJoinThreshold
  • C: spark.sql.adaptive.advisoryPartitionSizeInBytes
  • D: spark.sql.adaptive.coalescePartitions.minPartitionNum
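
For reference, a minimal sketch of setting spark.sql.files.maxPartitionBytes, which caps how many bytes of input files are packed into each Spark partition at read time; the source path is illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read with ~64 MB partitions instead of the 128 MB default.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))

df = spark.read.format("parquet").load("/mnt/raw/events/")
print(df.rdd.getNumPartitions())  # roughly total input bytes / 64 MB
```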

Question eetgilNb1NoyG6cEfqPC

Question

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that, for tasks in a particular stage, the minimum and median durations are roughly the same, while the maximum task duration is roughly 100 times as long as the minimum.

Which situation is causing increased duration of the overall job?

Choices

  • A: Task queueing resulting from improper thread pool assignment.
  • B: Spill resulting from attached volume storage being too small.
  • C: Network latency due to some cluster nodes being in different regions from the source data.
  • D: Skew caused by more data being assigned to a subset of Spark partitions.
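
For reference, a minimal sketch of confirming and mitigating skew of this kind; the table name orders and key column customer_id are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("orders")  # hypothetical table feeding the slow stage

# A few keys holding most of the rows is the classic signature of skew.
df.groupBy("customer_id").count().orderBy(F.desc("count")).show(10)

# On Spark 3.x, adaptive query execution can split oversized shuffle
# partitions during joins.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```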