Questions and Answers
Question aSgHpnxemCqtjyiUnj1X
Question
Each configuration below is identical in that each cluster has 400 GB of RAM in total, 160 cores in total, and only one Executor per VM.
Given an extremely long-running job for which completion must be guaranteed, which cluster configuration will be able to guarantee completion of the job in light of one or more VM failures?
Choices
- A: • Total VMs: 8 • 50 GB per Executor • 20 Cores / Executor
- B: • Total VMs: 16 • 25 GB per Executor • 10 Cores / Executor
- C: • Total VMs: 1 • 400 GB per Executor • 160 Cores/Executor
- D: • Total VMs: 4 • 100 GB per Executor • 40 Cores / Executor
Answer: B Answer_ET: B Community answer: B (100%)
Discussion
Comment 1300514 by m79590530
- Upvotes: 1
Selected Answer: B Distributing work across more Workers/Executors gives a better guarantee of completion in case one or more of them fails.
Question s9bsbaoJT28uFRn5yZaC
Question
A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:
//IMG//
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system.
If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?
Choices
- A: Duplicate records enqueued more than 2 hours apart may be retained and the orders table may contain duplicate records with the same customer_id and order_id.
- B: All records will be held in the state store for 2 hours before being deduplicated and committed to the orders table.
- C: The orders table will contain only the most recent 2 hours of records and no duplicates will be present.
- D: The orders table will not contain duplicates, but records arriving more than 2 hours late will be ignored and missing from the table.
Answer: A Answer_ET: A Community answer: A (63%), D (38%)
Discussion
Comment 1335541 by arekm
- Upvotes: 1
Selected Answer: A A - two records with the same [customer_id, order_id] might be emitted with time values that are 2 or more hours apart. So the watermark will not drop the newer record, since it has a fresh time value. From the composite-key perspective, it is a duplicate.
Comment 1329104 by UrcoIbz
- Upvotes: 2
Selected Answer: A dropDuplicates only removes duplicates within the processed batch. If records with the same key arrive in different batches, we will have duplicates in the final table.
In addition, withWatermark, when it is not used with a window, takes the MAX(eventTime) seen so far and uses the threshold to define the time range of records that are still accepted. As the time field represents when the data was queued in the source system, we can receive records whose time is more than 2 hours older than that maximum.
References: pyspark.sql.DataFrame.dropDuplicates — PySpark 3.5.3 documentation; pyspark.sql.DataFrame.withWatermark — PySpark 3.5.3 documentation
Comment 1309552 by vish9
- Upvotes: 2
Selected Answer: D Orders arriving 2 hours or more late will be dropped. There is a chance that they can still be processed, but deduplication will still happen.
Comment 1308874 by smashit
- Upvotes: 1
There is a chance that the same record, for example (A1, O1), arrives in batch B1 and also in batch B2. We need to implement merge logic on our target table or perform an insert-only merge.
Comment 1301444 by Jugiboss
- Upvotes: 1
Selected Answer: A Watermark thresholds guarantee that records arriving within the specified threshold are processed according to the semantics of the defined query. Late-arriving records arriving outside the specified threshold might still be processed using query metrics, but this is not guaranteed.
Comment 1300518 by m79590530
- Upvotes: 1
Selected Answer: D The default write mode is ‘append’. Duplicates will be resolved within each 2-hour window, and .withWatermark() will drop/ignore records that are delayed by more than 2 hours.
Comment 1278582 by csrazdan
- Upvotes: 1
Selected Answer: A The default write mode is append. Duplicates will be resolved only within the 2-hour window but may still exist because of previous executions.
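The exact code behind the //IMG// placeholder is not reproduced here, so the following is only a sketch of the kind of job the discussion describes (a 2-hour watermark on time, deduplication on the composite key, and an append to the orders table); the source schema, checkpoint path, and trigger are assumptions.

```python
# Hedged reconstruction for illustration only; not the actual pictured code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = (
    spark.readStream
        .format("parquet")
        .schema("customer_id STRING, order_id STRING, time TIMESTAMP")  # assumed schema
        .load("/mnt/raw_orders/")
        .withWatermark("time", "2 hours")
        .dropDuplicates(["customer_id", "order_id"])
        .writeStream
        .trigger(availableNow=True)                               # assumed hourly batch-style trigger
        .option("checkpointLocation", "/mnt/checkpoints/orders")  # hypothetical path
        .toTable("orders")                                        # streaming writes default to append
)
```

Per the discussion above, the deduplication state is bounded by the watermark, so two records with the same customer_id and order_id whose time values are more than 2 hours apart can both end up in the orders table, which is why A is the accepted answer.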
Question utBTqWfkLzDH8lmppTWE
Question
A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.
In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?
Choices
- A: Rely on Delta Lake schema enforcement to prevent duplicate records.
- B: VACUUM the Delta table after each batch completes.
- C: Perform an insert-only merge with a matching condition on a unique key.
- D: Perform a full outer join on a unique key and overwrite existing data.
Answer: C Answer_ET: C Community answer: C (100%)
Discussion
Comment 1307726 by benni_ale
- Upvotes: 1
Selected Answer: C C seems logical
Comment 1300526 by m79590530
- Upvotes: 2
Selected Answer: C From all the provided options, answer C is the only meaningful and possible one. Also, MERGE INTO … WHEN NOT MATCHED INSERT *; is a standard solution for adding/appending non-existing records (by key) to the target table without duplicating.
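As a concrete but hedged illustration of the insert-only merge named in the answer, here is a sketch using the Delta Lake Python API; the orders table and its customer_id/order_id key are borrowed from the previous question and are assumptions here, as is the use of a foreachBatch sink.

```python
# Sketch: insert-only merge so incoming rows are checked against rows already
# in the target table, not just against the current batch.
from delta.tables import DeltaTable

def insert_only_merge(batch_df, batch_id):
    # `spark` is assumed to be the active SparkSession (e.g. in a Databricks notebook).
    target = DeltaTable.forName(spark, "orders")
    (target.alias("t")
        .merge(
            batch_df.dropDuplicates(["customer_id", "order_id"]).alias("s"),  # dedupe within the batch first
            "t.customer_id = s.customer_id AND t.order_id = s.order_id")
        .whenNotMatchedInsertAll()   # no WHEN MATCHED clause: existing keys are left untouched
        .execute())
```

Registered as a foreachBatch function on a streaming write, this pattern deduplicates each micro-batch against everything previously written to the table.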
Question uBxn0eXd0Fzo1PEELyp0
Question
A junior data engineer seeks to leverage Delta Lake’s Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job:
//IMG//
Which statement describes the execution and results of running the above query multiple times?
Choices
- A: Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.
- B: Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.
- C: Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table, giving the desired result.
- D: Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.
Answer: B Answer_ET: B Community answer: B (100%)
Discussion
Comment 1307729 by benni_ale
- Upvotes: 1
Selected Answer: B B seems ok
Comment 1300535 by m79590530
- Upvotes: 1
Selected Answer: B Since the code uses version 0 as the starting version for the CDF-enabled table, every time it is executed all of the historical changes (inserts and updates) for the table will be appended to the target table, because ‘append’ is the write mode provided.
Comment 1287307 by Adrifersilva
- Upvotes: 2
B. This bad effect (many duplicates) happens because the code reads from the starting version 0, appending all changes since the beginning.
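The pictured job (//IMG//) is not reproduced here, but based on the question and the comments above it likely resembles the following sketch: a Change Data Feed read pinned to startingVersion 0, filtered to inserts and updates, and appended to the target on every run. The target table name and the exact filter are assumptions.

```python
# Hedged sketch: because startingVersion is always 0, every run re-reads the
# full change history and appends it again, producing duplicates (answer B).
from pyspark.sql.functions import col

(spark.read
    .format("delta")
    .option("readChangeFeed", "true")        # requires delta.enableChangeDataFeed = true on the source
    .option("startingVersion", 0)
    .table("bronze")
    .filter(col("_change_type").isin("insert", "update_postimage"))
    .write
    .mode("append")
    .saveAsTable("bronze_type1_history"))    # hypothetical target table name
```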
Question ir9F7dlyrvVitDkdTPFw
Question
A DLT pipeline includes the following streaming tables:
• raw_iot ingests raw device measurement data from a heart rate tracking device.
• bpm_stats incrementally computes user statistics based on BPM measurements from raw_iot.
How can the data engineer configure this pipeline to retain manually deleted or updated records in the raw_iot table, while recomputing the downstream bpm_stats table when a pipeline update is run?
Choices
- A: Set the pipelines.reset.allowed property to false on raw_iot
- B: Set the skipChangeCommits flag to true on raw_iot
- C: Set the pipelines.reset.allowed property to false on bpm_stats
- D: Set the skipChangeCommits flag to true on bpm_stats
Answer: B Answer_ET: B Community answer: B (54%), A (46%)
Discussion
Comment 1559185 by teowis
- Upvotes: 1
Selected Answer: A The purpose of it is to take into account changes made manually in the raw_iot table and propagate them to the downstream tables.
Comment 1335438 by Sriramiyer92
- Upvotes: 2
Selected Answer: B Answer is B.
Ignore updates and deletes - https://docs.databricks.com/en/structured-streaming/delta-lake.html#ignore-updates-and-deletes
Structured Streaming does not handle input that is not an append and throws an exception if any modifications occur on the table being used as a source. There are two main strategies for dealing with changes that cannot be automatically propagated downstream:
- You can delete the output and checkpoint and restart the stream from the beginning.
- You can set either of these two options:
  - ignoreDeletes: ignore transactions that delete data at partition boundaries.
  - skipChangeCommits: ignore transactions that delete or modify existing records. skipChangeCommits subsumes ignoreDeletes.
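To show where the skipChangeCommits option from the quoted documentation would sit in a DLT pipeline (purely as an illustration of the mechanism, not a ruling on A versus B), here is a minimal sketch; the column names and the placeholder projection are assumptions, and the actual statistics logic in bpm_stats is elided.

```python
# Hedged DLT sketch: the option is set on the streaming read of the raw_iot
# source, so commits that update or delete existing rows there are ignored
# instead of failing the downstream stream.
import dlt

@dlt.table
def bpm_stats():
    # `spark` is provided by the DLT runtime; table reference syntax may vary by pipeline setup.
    return (
        spark.readStream
            .option("skipChangeCommits", "true")
            .table("raw_iot")
            .select("user_id", "time", "bpm")   # hypothetical columns; real stats computation elided
    )
```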
Comment 1300549 by m79590530
- Upvotes: 2
Selected Answer: A Setting the pipelines.reset.allowed property to false on raw_iot prevents a full refresh/reset of it from its source, meaning it preserves all other modifications made to it outside of its source streaming process. If we set skipChangeCommits on it or on the target table, transactions that delete or modify records on the table are ignored.
Comment 1289441 by pk07
- Upvotes: 4
Selected Answer: B B. Set the skipChangeCommits flag to true on raw_iot
Let’s break down the requirements and explain why this is the best solution:
- Retain manually deleted or updated records in raw_iot: The skipChangeCommits flag, when set to true, tells Delta Live Tables (DLT) to ignore any manual changes (updates or deletes) made to the table outside of the pipeline. This means that even if records are manually deleted or updated in the raw_iot table, these changes won’t be reflected in the table when the pipeline runs again.
- Recompute downstream bpm_stats table: By default, DLT will recompute downstream tables when their upstream dependencies change. Since bpm_stats is based on raw_iot, it will naturally be recomputed when the pipeline updates, without any special configuration.
Why the other options are not correct:
A. Setting pipelines.reset.allowed to false on raw_iot would prevent the table from being reset, but it wouldn’t address the requirement to retain manually deleted or updated records.
Comment 1245096 by nikoliko
- Upvotes: 3
Answer: A. This property, when set to false, ensures that the table will not be reset during pipeline updates, thus preserving manually deleted or updated records. This is crucial for the raw_iot table to retain the manual modifications.
Comment 1244965 by c00ccb7
- Upvotes: 3
Selected Answer: A Set the pipelines.reset.allowed property to false on raw_iot
Comment 1225051 by Deb9753
- Upvotes: 1
Answer: B