Questions and Answers
Question g8lN536cndw1XpBJjrxb
Question
A junior data engineer on your team has implemented the following code block. //IMG//
The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table. When this query is executed, what will happen with new records that have the same event_id as an existing record?
Choices
- A: They are merged.
- B: They are ignored.
- C: They are updated.
- D: They are inserted.
- E: They are deleted.
Answer: B
Answer_ET: B
Community answer: B (100%)
Discussion
Comment 1351480 by Ashish7singh2020
- Upvotes: 1
Selected Answer: B The MERGE only acts when there is no match.
Comment 1334806 by arekm
- Upvotes: 2
Selected Answer: B No WHEN MATCHED section in MERGE, hence no action on those records, hence ignore - answer B.
Comment 1320336 by Shakmak
- Upvotes: 1
Selected Answer: B B is the correct answer.
Comment 1222480 by imatheushenrique
- Upvotes: 2
B. They are ignored, because there is no WHEN MATCHED clause covering this condition.
Comment 1145270 by PrashantTiwari
- Upvotes: 2
B is correct
Comment 1118705 by kz_data
- Upvotes: 1
Selected Answer: B B is correct
Comment 1043947 by alexvno
- Upvotes: 1
Selected Answer: B Ignored
Comment 1016232 by rairaix
- Upvotes: 3
Selected Answer: B The answer is correct. “If none of the WHEN MATCHED conditions evaluate to true for a source and target row pair that matches the merge_condition, then the target row is left unchanged.” https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html#:~:text=If%20none%20of%20the%20WHEN%20MATCHED%20conditions%20evaluate%20to%20true%20for%20a%20source%20and%20target%20row%20pair%20that%20matches%20the%20merge_condition%2C%20then%20the%20target%20row%20is%20left%20unchanged.
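Editor's note: a minimal sketch of the MERGE pattern the question describes, reconstructed from the discussion (a WHEN NOT MATCHED clause only); the exact hidden code block may differ.

```python
# Sketch of a MERGE with only a WHEN NOT MATCHED clause, assumed to mirror
# the question's hidden code block. Source rows whose event_id already
# exists in events satisfy the ON condition but hit no WHEN MATCHED clause,
# so those target rows are left unchanged (the new records are ignored);
# only unseen event_ids are inserted.
spark.sql("""
    MERGE INTO events AS t
    USING new_events AS s
    ON t.event_id = s.event_id
    WHEN NOT MATCHED THEN INSERT *
""")
```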
Question 48efBnqEFsEgo5q6TEhT
Question
A junior data engineer seeks to leverage Delta Lake’s Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job: //IMG//
Which statement describes the execution and results of running the above query multiple times?
Choices
- A: Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.
- B: Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.
- C: Each time the job is executed, the target table will be overwritten using the entire history of inserted or updated records, giving the desired result.
- D: Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.
- E: Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table, giving the desired result.
Answer: B
Answer_ET: B
Community answer: B (92%), other (8%)
Discussion
Comment 1363233 by Chugs
- Upvotes: 1
Selected Answer: B Since the change types kept are insert and update, B is the correct option.
Comment 1356324 by asdsadasdas
- Upvotes: 3
Selected Answer: B spark.read reads the entire table every time the job runs. If it were readStream, then E would be the answer.
Comment 1351481 by Ashish7singh2020
- Upvotes: 1
Selected Answer: B since start version is 0
Comment 1306485 by akashdesarda
- Upvotes: 1
Selected Answer: B The starting version is 0, which means the entire change history is fetched on every run and then appended.
Comment 1260280 by faraaz132
- Upvotes: 2
Correct Answer: B (not E) Although it was pretty obvious to me, I still wrote the code to check, and yes, it appends the entire change history on every write since the starting version is 0. If in doubt, code it yourself.
Comment 1223829 by imatheushenrique
- Upvotes: 3
("startingVersion", 0) means the entire history of the table will be read, so B.
Comment 1145271 by PrashantTiwari
- Upvotes: 2
B is correct
Comment 1118714 by kz_data
- Upvotes: 2
Selected Answer: B B is correct
Comment 1111099 by 5ffcd04
- Upvotes: 1
Selected Answer: B Correct B
Comment 1091521 by azurelearn2020
- Upvotes: 1
Selected Answer: B correct answer is B.
Comment 1087907 by [Removed]
- Upvotes: 1
Selected Answer: E Considering that we are talking about Change Data Feed and the code filters the _change_type column by ["update_postimage", "insert"], I would go with option E.
Comment 1047064 by jyothsna12496
- Upvotes: 1
Why is it not E? It gets newly inserted or updated records.
Comment 1040450 by sturcu
- Upvotes: 1
Selected Answer: B correct
Comment 1003658 by azurearch
- Upvotes: 2
B is the right answer, sorry.
Comment 1002467 by azurearch
- Upvotes: 1
answer is A, because there is a filter, as asmayassineg said. The filter keeps only existing records from the change feed.
Comment 970047 by asmayassineg
- Upvotes: 2
sorry, answer is correct B.
Comment 970046 by asmayassineg
- Upvotes: 2
Answer is A, since the DataFrame filters on updated records using the update_postimage filter.
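Editor's note: a minimal sketch of the daily job as the discussion reconstructs it (batch read of the change feed from version 0, filtered to inserts and update postimages, appended); the target table name is a placeholder.

```python
from pyspark.sql.functions import col

# Because this is a batch read (spark.read, not readStream) pinned to
# startingVersion 0, every execution re-reads the table's entire change
# history and appends it again, so the target accumulates duplicates.
(spark.read
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("bronze")
    .filter(col("_change_type").isin("insert", "update_postimage"))
    .write
    .mode("append")
    .saveAsTable("target_type1"))  # hypothetical target table name
```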
Question 9IZ0tsUD7ka1ifgNnTIp
Question
A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. This happened even though the critical field was in the Kafka source. That field was further missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days. The pipeline has been in production for three months. Which describes how Delta Lake can help to avoid data loss of this nature in the future?
Choices
- A: The Delta log and Structured Streaming checkpoints record the full history of the Kafka producer.
- B: Delta Lake schema evolution can retroactively calculate the correct value for newly added fields, as long as the data was in the original source.
- C: Delta Lake automatically checks that all fields present in the source data are included in the ingestion layer.
- D: Data can never be permanently dropped or deleted from Delta Lake, so data loss is not possible under any circumstance.
- E: Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.
Answer: E
Answer_ET: E
Community answer: E (92%), other (8%)
Discussion
Comment 1558223 by kishanu
- Upvotes: 1
Selected Answer: E E is the right answer, as the table in bronze can be replayed again when required.
Comment 1363433 by Tedet
- Upvotes: 1
Selected Answer: A Considering the Databricks documentation on change feed and your need to process new records that have not been processed yet, Option A might actually be a better fit since you’re looking for a streaming solution that can continuously monitor new records. The change feed (Option D) works for batch processing changes from a specific version, which isn’t ideal for real-time streaming.
Comment 1335204 by HairyTorso
- Upvotes: 1
Selected Answer: E E lgtm
Comment 1324397 by Anithec0der
- Upvotes: 3
Selected Answer: E When we design a pipeline, we have to make sure the source data lands intact in the raw/bronze layer, with transformations done in the refined and enterprise layers. That way, when a necessary column was not propagated in previous pipeline runs, we can rebuild it from the raw data we already have.
Comment 1223839 by imatheushenrique
- Upvotes: 3
Medallion Architecture is named in E. (Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.)
Comment 1149866 by ojudz08
- Upvotes: 2
Selected Answer: E E is correct
Comment 1137643 by DAN_H
- Upvotes: 1
Selected Answer: E I think E is correct
Comment 1118730 by kz_data
- Upvotes: 1
Selected Answer: E I think E is correct
Comment 1062710 by alexvno
- Upvotes: 2
Selected Answer: E Looks good - E
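Editor's note: a minimal sketch of the bronze pattern answer E describes, assuming a Kafka source; the bootstrap server, topic name, and checkpoint path are placeholders.

```python
# Land the raw Kafka payload and all metadata, unparsed, in a bronze Delta
# table. Since nothing is projected away, a field dropped by downstream
# parsing can be recovered by replaying bronze, long after Kafka's 7-day
# retention window has expired.
(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "events")                     # placeholder topic
    .load()  # columns: key, value, topic, partition, offset, timestamp, ...
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/bronze")  # placeholder
    .toTable("bronze"))
```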
Question 4umAXcyONTdzTWREgmr2
Question
When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?
Choices
- A: Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: Unlimited
- B: Cluster: New Job Cluster; Retries: None; Maximum Concurrent Runs: 1
- C: Cluster: Existing All-Purpose Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1
- D: Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1
- E: Cluster: Existing All-Purpose Cluster; Retries: None; Maximum Concurrent Runs: 1
Answer: D
Answer_ET: D
Community answer: D (100%)
Discussion
Comment 972714 by 8605246
- Upvotes: 11
the answer given is correct: Maximum concurrent runs: Set to 1. There must be only one instance of each query concurrently active. Retries: Set to Unlimited. https://docs.databricks.com/en/structured-streaming/query-recovery.html
Comment 1473795 by codebender
- Upvotes: 1
Selected Answer: D Can't be an existing all-purpose cluster.
Comment 1351570 by EelkeV
- Upvotes: 1
Selected Answer: D A job cluster auto-terminates, and you want retries for recovery.
Comment 1290638 by akashdesarda
- Upvotes: 2
Selected Answer: D Use Databricks Jobs, as it has native integration with the streaming use case. See the example job here: https://docs.databricks.com/en/structured-streaming/query-recovery.html#configure-structured-streaming-jobs-to-restart-streaming-queries-on-failure
Comment 1224434 by imatheushenrique
- Upvotes: 1
D. Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1
Comment 1191628 by juliom6
- Upvotes: 1
D is correct https://docs.databricks.com/en/structured-streaming/query-recovery.html
Comment 1128056 by AziLa
- Upvotes: 1
Correct Ans is D
Comment 1121587 by Jay_98_11
- Upvotes: 1
Selected Answer: D D is correct
Comment 1102665 by kz_data
- Upvotes: 1
Selected Answer: D D is correct
Comment 1044664 by sturcu
- Upvotes: 1
Selected Answer: D D is correct
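Editor's note: a minimal sketch of what answer D looks like as a Jobs API 2.1 payload; in that API, max_retries = -1 means retry indefinitely. The job name and cluster values are placeholders.

```python
# Answer D as job settings: new job cluster, unlimited retries, one
# concurrent run, so exactly one instance of the query is ever active.
job_settings = {
    "name": "nightly-stream",          # placeholder job name
    "max_concurrent_runs": 1,          # only one active instance
    "tasks": [{
        "task_key": "stream",
        "max_retries": -1,             # -1 = retry indefinitely
        "new_cluster": {               # job cluster, not all-purpose
            "spark_version": "13.3.x-scala2.12",  # placeholder
            "node_type_id": "i3.xlarge",          # placeholder
            "num_workers": 2,
        },
        "notebook_task": {"notebook_path": "/Repos/jobs/stream"},  # placeholder
    }],
}
```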
Question orSGipt9wuE4e50JXHlo
Question
A nightly job ingests data into a Delta Lake table using the following code: //IMG//
The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline. Which code snippet completes this function definition?
def new_records():
Choices
- A: return spark.readStream.table(“bronze”)
- B: return spark.readStream.load(“bronze”)
- C: //IMG//
- D: return spark.read.option(“readChangeFeed”, “true”).table (“bronze”)
- E: //IMG//
Answer: A
Answer_ET: A
Community answer: A (46%), E (28%), D (24%), other (1%)
Discussion
Comment 1068885 by AzureDE2522
- Upvotes: 12
Selected Answer: D
not providing a starting version/timestamp will result in the latest snapshot being fetched first
spark.readStream.format("delta")
  .option("readChangeFeed", "true")
  .table("myDeltaTable")
Please refer: https://docs.databricks.com/en/delta/delta-change-data-feed.html
Comment 1072418 by Laraujo2022
- Upvotes: 8
In my opinion E is not correct because no parameters (year, month, and day) are passed into the function; the signature is def new_records():
Comment 1410910 by AlHerd
- Upvotes: 2
Selected Answer: A Option A is best because it creates a streaming source that reads only new appended data from the “bronze” table incrementally. Even if ingestion is done in batch, using spark.readStream.table(“bronze”) lets downstream processing treat the table as a live data stream.
Comment 1363432 by Tedet
- Upvotes: 1
Selected Answer: D Explanation: This is the best option for Delta Lake, as it uses the readChangeFeed option. This option is specifically designed to read only the new changes (insertions, updates, or deletions) since the last read, which is exactly what is needed when you want to handle new records that have not yet been processed. This ensures that only records that are new or changed since the last read are returned. Conclusion: This is the correct choice, as it ensures that only new records are read.
Comment 1356327 by asdsadasdas
- Upvotes: 1
Selected Answer: A "Manipulate new records that have not yet been processed to the next table": readStream can incrementally pick up data yet to be processed. The issue with D is spark.read; it will read the entire table.
Comment 1353735 by shaswat1404
- Upvotes: 2
Selected Answer: E Options A and B assume streaming ingestion, but ingestion is in batch mode. In option C, current_timestamp is used, which is dynamic and changes every time the query is executed, so it won't correctly filter records ingested in the last batch. Option D only works if delta.enableChangeDataFeed = true was set on the table before ingestion (it is disabled by default, and the given query does not set it), so that option is invalid. Option E is correct: it filters on the file path to retrieve only data from the latest ingestion; the source_file column was created specifically for this purpose, ensuring the function returns only new records.
Comment 1334816 by arekm
- Upvotes: 2
Selected Answer: A You can read data from the delta table using structured streaming. You have 2 options:
- without CDF - only process new rows (without updates and deletes)
- with CDF - all changes to the data, i.e. insert, update, delete.
Answer A uses the first option. However, in the question they talk about “new records”. So using streaming for new records is OK. Answer A is correct.
Comment 1327625 by sgerin
- Upvotes: 1
Selected Answer: E New records will be filtered for D.
Comment 1327051 by temple1305
- Upvotes: 1
Selected Answer: D New records will be filtered for D - example https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/
Comment 1326380 by AlejandroU
- Upvotes: 1
Selected Answer: A Answer A. A better approach would involve streaming directly from the Delta table (Option A), possibly along with using metadata like ingest_time to track new records more accurately. It might be better to rely on the streaming process itself rather than trying to filter based on the file path (option E).
Comment 1321841 by Thameur01
- Upvotes: 1
Selected Answer: E Using the source_file metadata field allows you to filter new records ingested from specific files. E is the most robust and reliable option for tracking and working with new records in this batch ingestion pipeline.
Comment 1319134 by benni_ale
- Upvotes: 1
Selected Answer: E I tried it myself, but none really works.
Comment 1300941 by cbj
- Upvotes: 2
Selected Answer: A The others can't ensure unprocessed data is captured; e.g., if the code doesn't run for one day and runs the next day, C or E will mis-process one day's data.
Comment 1295700 by shaojunni
- Upvotes: 4
Selected Answer: A since “bronze” table is a delta table, readStream() only returns new data.
Comment 1295163 by pk07
- Upvotes: 2
Selected Answer: E If the job runs only once per day, then option E could indeed be a valid and effective solution. Here’s why:
- Daily execution: since the job runs once per day, all records ingested on that day would be new and unprocessed.
- Source file filtering: the filter condition col("source_file").like(f"/mnt/daily_batch/{year}/{month}/{day}") would select only the records ingested from the current day's batch file.
- Simplicity: this approach is straightforward and doesn't require maintaining additional state (like the last processed version or timestamp).
- Reliability: as long as the daily batch files are consistently named and placed in the correct directory structure, this method will reliably capture all new records for that day.
Comment 1289057 by AndreFR
- Upvotes: 7
Selected Answer: A A is correct by Elimination. As stated by Alaverdi in another comment. Reads delta table as a stream and processes only newly arrived records.
B excluded because of incorrect syntax
C excluded, will be an empty result, as ingestion time (which comes as a param in the other method) is compared with current timestamp
D excluded because of syntax error, should be : spark.read.option(“readChangeFeed”, “true”).option(“startingVersion”, 1).table(“bronze”)
E excluded, will be an empty result, because "source_file" gives a filename, while f"/mnt/daily_batch/{year}/{month}/{day}" gives a folder name
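Editor's note: option E as quoted in the discussion, for reference. year, month, and day are not parameters of new_records(), which is the scope problem several commenters raise, and the pattern is a folder path while source_file holds a file path.

```python
from pyspark.sql.functions import col

# Placeholder values so the sketch runs; in the exam snippet these are not
# parameters of new_records(), which is the scope problem noted above.
year, month, day = 2024, 1, 15

def new_records():
    # Filter as quoted in the discussion. Note: source_file holds a full
    # file path, while the pattern (no wildcard) is only a folder prefix,
    # so several commenters expect this to match nothing.
    return (spark.read.table("bronze")
            .filter(col("source_file").like(f"/mnt/daily_batch/{year}/{month}/{day}")))
```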
Comment 1268210 by t_d_v
- Upvotes: 1
Selected Answer: C Actually it's hard to choose between C and E, as both are a bit incorrect. Option E seems like it will produce an empty result, as a file name is compared with a folder name. Option C seems like it will produce an empty result, as ingestion time (which comes as a param in the other method) is compared with the current timestamp.
On the other hand, if the new_records method had an ingestion-time param, the task would be obvious. Also, considering the very first line, which imports current_timestamp, let me say it's C :))
Comment 1267042 by partha1022
- Upvotes: 1
Selected Answer: D D is correct
Comment 1260286 by faraaz132
- Upvotes: 3
Selected Answer: E Correct Answer : E Since, it selects only those records which have been loaded on the specified date and these records are not processed yet. This is what we want
Not A : It reads all records even the ones previously processed since bronze table keeps historic data.
Not D : It is no where mentioned that change data feed is enabled, nor is it present in the code snippet. This is where we have to be careful with self- assumption
Comment 1258389 by aiwithqasim
- Upvotes: 3
Option D. return spark.read.option(“readChangeFeed”, “true”).table (“bronze”)
The following code snippet is from https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/, where the writer explains what happens when "readChangeFeed" is set to "true": it includes all changes from the specified starting version.
In option D the starting version is not specified, so it will pick up the latest snapshot. Please refer to the doc https://docs.databricks.com/en/delta/delta-change-data-feed.html and find "By default, the stream returns the latest snapshot of the table when the stream first starts as an INSERT and future changes as change data." (spark.read.format("delta").option("readChangeFeed", "true").option("startingVersion", 0).table("people").show(truncate=False))
Comment 1237907 by zhiva
- Upvotes: 2
Selected Answer: A Both E and A could be correct, but the function definition has no input parameters, so we can't use them in the return statement with only the information given in the question. This is why I vote for A.
Comment 1223840 by imatheushenrique
- Upvotes: 1
The E option makes more sense because the whole partition would be filtered. It can't be the options that use CDF because there's no readChangeFeed option in the DataFrame read.
Comment 1184233 by arik90
- Upvotes: 2
Selected Answer: E Since the ingest_daily_batch function writes to the “bronze” table in batch mode using spark.read and write operations, we should not use readStream to read from it in the subsequent function.
Comment 1172494 by alexvno
- Upvotes: 1
Selected Answer: E Probably E, but still, the filename isn't specified, only the folder path.
Comment 1168704 by vikram12apr
- Upvotes: 2
Selected Answer: E Please read the question again. It asks to get the data from the bronze table into some downstream table. Since it's an append-only nightly job, the filter on the file name gives the new data in the bronze table that has not yet flowed down the pipeline.
Comment 1155635 by agreddy
- Upvotes: 1
D is correct. https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/ CDF can be enabled on a non-streaming Delta table; "delta" is the default table format.
Comment 1149873 by ojudz08
- Upvotes: 1
Selected Answer: D The question here is how to manipulate new records that have not yet been processed to the next table. Since the data has been ingested into the bronze table, you need to check whether the data ingested daily is already in the silver table, so I think the answer is D. Enabling change data feed allows tracking row-level changes between Delta table versions.
https://docs.databricks.com/en/delta/delta-change-data-feed.html
Comment 1141033 by guillesd
- Upvotes: 1
the problem here is that both A and E are correct. E just follows the previous filtering logic while A uses the readStream method which will have to maintain a checkpoint. But both can work
Comment 1137653 by DAN_H
- Upvotes: 4
Selected Answer: A A as Structured Streaming incrementally reads Delta tables. While a streaming query is active against a Delta table, new records are processed idempotently as new table versions commit to the source table.
Comment 1136640 by adenis
- Upvotes: 1
Selected Answer: A A is Correct
Comment 1121964 by Jay_98_11
- Upvotes: 1
Selected Answer: E Can't be D, since there's no read option in CDF. https://docs.databricks.com/en/delta/delta-change-data-feed.html
Comment 1114361 by RafaelCFC
- Upvotes: 1
Selected Answer: E E addresses the desired filtering, while keeping with the logic of the first step being a batch job, and has no code errors.
Comment 1095606 by alaverdi
- Upvotes: 5
Selected Answer: A In my opinion A is the correct answer. You read the Delta table as a stream and process only newly arrived records. This is maintained while writing the stream, with the state stored in the checkpoint location:
spark.readStream.table("bronze").writeStream.format("delta").outputMode("append").option("checkpointLocation", "/path/to/checkpoints/").toTable("silver")
Comment 1049249 by chokthewa
- Upvotes: 1
E is correct. D uses an invalid option; see the sample in https://docs.databricks.com/en/delta/delta-change-data-feed.html. A and B don't filter, so they will gather the whole table's data. E uses the known values to filter.
Comment 1044826 by sturcu
- Upvotes: 2
Selected Answer: E we filter on the file_path
Comment 1014938 by MarceloManhaes
- Upvotes: 1
Yes, it is E, because for D there is no mention that delta.enableChangeDataFeed is true. Also there is no read option in CDF; it is table_changes (batch) or readStream. URL: https://docs.databricks.com/en/delta/delta-change-data-feed.html
Comment 1009485 by hammer_1234_h
- Upvotes: 2
it could be E, but not D
Comment 1006999 by hammer_1234_h
- Upvotes: 1
the answer should be A https://docs.databricks.com/en/structured-streaming/delta-lake.html#delta-table-as-a-source
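Editor's note: a minimal sketch of how answer A behaves end to end, per the alaverdi and hammer_1234_h comments above; the checkpoint path and silver table name are placeholders.

```python
# Answer A: a streaming DataFrame over the bronze Delta table.
def new_records():
    return spark.readStream.table("bronze")

# Downstream, the checkpoint records which table versions have been
# consumed, so each run processes only newly appended records.
(new_records()
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/silver")  # placeholder
    .trigger(availableNow=True)  # incremental batch: process new data, stop
    .toTable("silver"))
```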