Questions and Answers
Question XgOHpzpY9N90kKH3o5rv
Question
A data pipeline uses Structured Streaming to ingest data from Apache Kafka to Delta Lake. Data is being stored in a bronze table, and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline is deployed, the data engineering team has noticed some latency issues during certain times of the day.
A senior data engineer updates the Delta Table’s schema and ingestion logic to include the current timestamp (as recorded by Apache Spark) as well as the Kafka topic and partition. The team plans to use these additional metadata fields to diagnose the transient processing delays.
Which limitation will the team face while diagnosing this problem?
Choices
- A: New fields will not be computed for historic records.
- B: Spark cannot capture the topic and partition fields from a Kafka source.
- C: Updating the table schema requires a default value provided for each field added.
- D: Updating the table schema will invalidate the Delta transaction log metadata.
answer?
Answer: A Answer_ET: A Community answer A (100%) Discussion
Comment 1341639 by RandomForest
- Upvotes: 1
Selected Answer: A A is correct: the old records lack these values, as no such metadata was saved for them when they were ingested.
Comment 1300552 by m79590530
- Upvotes: 2
Selected Answer: A There is no way to reprocess history/old records to populate these values after 3 months, as Kafka does not necessarily retain data that long. That is the function of the raw or bronze table. Also, the other answers simply don’t make sense.
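For context, the updated ingestion logic described in the question might look roughly like the sketch below. The broker address, topic, table name, and checkpoint path are all illustrative, and `spark` is the Databricks notebook global; a live Kafka broker is required, so this is not runnable standalone.

```python
from pyspark.sql import functions as F

# Kafka sources expose key, value, topic, partition, and timestamp
# columns out of the box; the change described in the question simply
# selects the extra metadata and adds a Spark-side ingest time.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # illustrative
       .option("subscribe", "events")                      # illustrative
       .load())

bronze = raw.select(
    "key", "value", "timestamp", "topic", "partition",
    F.current_timestamp().alias("ingest_time"))

# Only records ingested after this change carry the new fields;
# historic rows cannot be backfilled (hence answer A).
(bronze.writeStream
 .format("delta")
 .option("checkpointLocation", "/checkpoints/bronze")      # illustrative
 .toTable("bronze"))
```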
Question HtlNTKELevtfjte3LlSs
Question
A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.
The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.
Which approach would simplify the identification of these changed records?
Choices
- A: Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
- B: Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
- C: Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.
- D: Modify the overwrite logic to include a field populated by calling spark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.
- E: Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.
answer?
Answer: E Answer_ET: E Community answer E (80%) D (20%) Discussion
Comment 991565 by Eertyy
- Upvotes: 6
E is the right answer.
Comment 1410830 by AlHerd
- Upvotes: 1
Selected Answer: E While both D and E look right, D only adds a timestamp and doesn’t track whether the record content actually changed, leading to false positives.
Comment 1361629 by Tedet
- Upvotes: 1
Selected Answer: D Evaluation:
Adding a current_timestamp() field to each record during the overwrite lets you track when each record was written. This makes it easy to identify records that have been updated or inserted recently by filtering on the timestamp field (e.g., records written in the past 24 hours), and then to run churn predictions only on those records.
Conclusion: This is a simple and efficient solution. It tracks changes with a timestamp, making it easy to filter and predict only on changed records without complex logic.
Comment 1334760 by arekm
- Upvotes: 2
Selected Answer: E A, B, and C don’t make sense. Adding a timestamp while the overwrite logic rewrites everything also does not make sense: every record would carry last night’s timestamp, which does not help identify what changed.
E is correct: only write the changes, use CDF to identify them, and apply the model.
Comment 1325421 by Sriramiyer92
- Upvotes: 1
Selected Answer: E While both E and D could work, E is more accurate given the scenario.
Comment 1322129 by janeZ
- Upvotes: 1
Selected Answer: D D is the right answer
Comment 1261373 by Melik3
- Upvotes: 1
I don’t understand why E is correct. With E we update only the data needed, but we then run predictions on the whole table, which means we make predictions again on unchanged records; that is not efficient.
Comment 1153470 by leopedroso1
- Upvotes: 2
E is the correct one. Replacing the overwrite with a merge results in an upsert that updates only the data needed (WHEN MATCHED UPDATE + WHEN NOT MATCHED INSERT clauses). Then, with the Change Data Feed, the requirement to identify changed records is also satisfied.
Comment 1128119 by AziLa
- Upvotes: 1
correct ans is E
Comment 1118585 by kz_data
- Upvotes: 1
Selected Answer: E E is correct
Comment 1044819 by sturcu
- Upvotes: 3
Selected Answer: E E is Correct
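As a rough sketch of option E, assuming the table has Change Data Feed enabled: `updates_df`, the join key, and the timestamp below are all hypothetical, and `spark` is the Databricks notebook global, so this needs a Delta-enabled cluster to run.

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "customer_churn_params")

# MERGE touches only the rows that actually changed, instead of
# overwriting the whole table nightly. updates_df is a hypothetical
# DataFrame holding the current valid values from upstream.
(target.alias("t")
 .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# With delta.enableChangeDataFeed = true set on the table, the changes
# from the last day can then be read back and fed to the churn model.
changes = (spark.read
           .option("readChangeFeed", "true")
           .option("startingTimestamp", "2024-01-01 00:00:00")  # illustrative
           .table("customer_churn_params"))
```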
Question KbuffBvQypr9k487pUNa
Question
A nightly job ingests data into a Delta Lake table using the following code:
//IMG//
The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.
Which code snippet completes this function definition?
def new_records():
Choices
- A: return spark.readStream.table("bronze")
- B: return spark.read.option("readChangeFeed", "true").table("bronze")
- C:
- D:
answer?
Answer: A Answer_ET: A Community answer A (42%) B (38%) D (21%) Discussion
Comment 1222900 by Freyr
- Upvotes: 7
Selected Answer: B Correct Answer: B The Change Data Feed (CDF) feature in Delta Lake enables reading only the changes (inserts and updates) to a Delta table. This would allow the function to focus on new or modified data since the last trigger, making it ideal for processing only the new records that have not been processed yet. This directly meets the requirement for identifying and manipulating new records efficiently.
Comment 1410192 by thierryb
- Upvotes: 1
Selected Answer: A Any Delta table can be read as a stream in Databricks, therefore answer A. It cannot be answer B, as the table named bronze does not have CDF enabled.
Comment 1399933 by mohadjhamad
- Upvotes: 1
Selected Answer: B The function new_records() is meant to retrieve new records that have not yet been processed in the pipeline.
Delta Lake provides a feature called the Change Data Feed (CDF), which allows you to track changes (inserts, updates, and deletes) in a Delta table.
By using .option("readChangeFeed", "true"), the function can read only the new changes from the Delta table named "bronze".
This ensures that only unprocessed records are returned, aligning with the pipeline’s requirement.
Comment 1335554 by arekm
- Upvotes: 1
Selected Answer: A A: an append-only table works for streaming. B: missing the starting version. C: the timestamp at insertion will differ from the one at the next step. D: that is a “like”-style match (or a substring comparison of the file name against the directory) written with ==; the actual file names only start with the pattern shown but are longer.
Comment 1329146 by UrcoIbz
- Upvotes: 2
Selected Answer: A Option A seems to be the correct answer.
Option B does not seem right, because it is missing the starting version. Per the documentation: ‘Change data feed also supports batch execution, which requires specifying a starting version’.
https://learn.microsoft.com/en-us/azure/databricks/delta/delta-change-data-feed#batch
Option D, in my opinion, is not correct, as the function definition above does not take any input parameter.
Comment 1300554 by m79590530
- Upvotes: 3
Selected Answer: A Correct answer is A, as we have append-only writes, which are ideal for a simple Structured Streaming read as the next step ;)
Comment 1297617 by shaojunni
- Upvotes: 3
Selected Answer: A A Delta table returns new records in a streaming read.
Comment 1289438 by pk07
- Upvotes: 1
Selected Answer: B B. Set the skipChangeCommits flag to true on raw_iot
Let’s break down the requirements and explain why this is the best solution:
Retain manually deleted or updated records in raw_iot: The skipChangeCommits flag, when set to true, tells Delta Live Tables (DLT) to ignore any manual changes (updates or deletes) made to the table outside of the pipeline. This means that even if records are manually deleted or updated in the raw_iot table, these changes won’t be reflected in the table when the pipeline runs again.
Recompute downstream bpm_stats table: By default, DLT will recompute downstream tables when their upstream dependencies change. Since bpm_stats is based on raw_iot, it will naturally be recomputed when the pipeline updates, without any special configuration.
Why the other options are not correct:
A. Setting pipelines.reset.allowed to false on raw_iot would prevent the table from being reset, but it wouldn’t address the requirement to retain manually deleted or updated records.
Comment 1288304 by shaojunni
- Upvotes: 1
Selected Answer: D You have to know the CDF’s current version and the last processed version in order to get unprocessed records. B does not provide those versions; it would just return the content of the bronze table with CDF turned on. D is the only possible solution.
Comment 1256313 by HelixAbdu
- Upvotes: 2
I did not test it, but I think D is wrong as it filters against a directory path using ==.
Comment 1221167 by MDWPartners
- Upvotes: 4
Selected Answer: D Seems D
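To make the trade-off in this discussion concrete, here is a hedged sketch of the two leading answers. The `spark` session is the Databricks notebook global, and option B as printed is incomplete for batch CDF reads, which require a starting version (or timestamp); the version below is illustrative.

```python
def new_records():
    # Option A: a streaming read of an append-only Delta table yields
    # exactly the records not yet processed, with progress tracked by
    # the downstream stream's checkpoint.
    return spark.readStream.table("bronze")


def new_records_cdf():
    # Option B, repaired: a batch Change Data Feed read needs a
    # starting version or timestamp, which the printed choice omits.
    return (spark.read
            .option("readChangeFeed", "true")
            .option("startingVersion", 1)  # illustrative
            .table("bronze"))
```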
Question sBVcM5aRGgJymYoG10jt
Question
A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.
The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.
The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields.
Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?
Choices
- A: The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.
- B: Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.
- C: Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.
- D: Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.
answer?
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1230599 by Isio05
- Upvotes: 3
Selected Answer: D Agree with proposed answer, D.
Comment 1229304 by hpkr
- Upvotes: 2
Selected Answer: D D is correct
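A minimal sketch of the manual declaration that option D favors (field names and path are illustrative, and `spark` is the Databricks notebook global): declaring types up front enforces them at read time, whereas inference widens types to fit whatever data is observed.

```python
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

# Explicit types for the fields the applications actually rely on;
# non-conforming records surface as errors or nulls instead of the
# whole column being silently widened (e.g. to string).
silver_schema = StructType([
    StructField("device_id", StringType(), nullable=False),
    StructField("recorded_at", TimestampType(), nullable=False),
    StructField("reading", DoubleType(), nullable=True),
])

df = (spark.read
      .schema(silver_schema)
      .json("/raw/device_recordings"))   # illustrative path
```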
Question 1mK7PdKlAa3uk5QVShAn
Question
The data engineering team maintains the following code:
//IMG//
Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and validated, which statement describes what will occur when this code is executed?
Choices
- A: A batch job will update the enriched_itemized_orders_by_account table, replacing only those rows that have different values than the current version of the table, using accountID as the primary key.
- B: The enriched_itemized_orders_by_account table will be overwritten using the current valid version of data in each of the three tables referenced in the join logic.
- C: No computation will occur until enriched_itemized_orders_by_account is queried; upon query materialization, results will be calculated using the current valid version of data in each of the three tables referenced in the join logic.
- D: An incremental job will detect if new rows have been written to any of the source tables; if new rows are detected, all results will be recalculated and used to overwrite the enriched_itemized_orders_by_account table.
answer?
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1229305 by hpkr
- Upvotes: 3
Selected Answer: B B is correct
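The pictured code is not available here, but answer B describes an overwrite built from a join of three source tables, i.e. a CREATE OR REPLACE TABLE ... AS SELECT pattern. A hypothetical reconstruction follows; every identifier is an assumption, and `spark` is the Databricks notebook global.

```python
# Hypothetical sketch of the overwrite-on-run pattern answer B
# describes: each batch execution replaces the target table with the
# current valid version of the joined sources.
spark.sql("""
    CREATE OR REPLACE TABLE enriched_itemized_orders_by_account AS
    SELECT a.account_id, o.order_id, i.item_id, i.quantity
    FROM accounts a
    JOIN orders o ON o.account_id = a.account_id
    JOIN order_items i ON i.order_id = o.order_id
""")
```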