Questions and Answers
Question g8lN536cndw1XpBJjrxb
Question
A junior data engineer on your team has implemented the following code block. //IMG//
The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table. When this query is executed, what will happen with new records that have the same event_id as an existing record?
Choices
- A: They are merged.
- B: They are ignored.
- C: They are updated.
- D: They are inserted.
- E: They are deleted.
Answer: B
Answer_ET: B
Community answer: B (100%)
Discussion
Comment 1351480 by Ashish7singh2020
- Upvotes: 1
Selected Answer: B The MERGE only acts when there is no match.
Comment 1334806 by arekm
- Upvotes: 2
Selected Answer: B No WHEN MATCHED section in MERGE, hence no action on those records, hence ignore - answer B.
Comment 1320336 by Shakmak
- Upvotes: 1
Selected Answer: B B is the correct answer.
Comment 1222480 by imatheushenrique
- Upvotes: 2
B. They are ignored, because there is no WHEN MATCHED clause covering this condition.
Comment 1145270 by PrashantTiwari
- Upvotes: 2
B is correct
Comment 1118705 by kz_data
- Upvotes: 1
Selected Answer: B B is correct
Comment 1043947 by alexvno
- Upvotes: 1
Selected Answer: B Ignored
Comment 1016232 by rairaix
- Upvotes: 3
Selected Answer: B The answer is correct. “If none of the WHEN MATCHED conditions evaluate to true for a source and target row pair that matches the merge_condition, then the target row is left unchanged.” https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html#:~:text=If%20none%20of%20the%20WHEN%20MATCHED%20conditions%20evaluate%20to%20true%20for%20a%20source%20and%20target%20row%20pair%20that%20matches%20the%20merge_condition%2C%20then%20the%20target%20row%20is%20left%20unchanged.
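Editor's note: a minimal sketch of the MERGE pattern the question describes, reconstructed from the discussion (a WHEN NOT MATCHED clause only); the exact hidden code block may differ.

```python
# Sketch of a MERGE with only a WHEN NOT MATCHED clause, assumed to mirror
# the question's hidden code block. Source rows whose event_id already
# exists in events satisfy the ON condition but hit no WHEN MATCHED clause,
# so those target rows are left unchanged (the new records are ignored);
# only unseen event_ids are inserted.
spark.sql("""
    MERGE INTO events AS t
    USING new_events AS s
    ON t.event_id = s.event_id
    WHEN NOT MATCHED THEN INSERT *
""")
```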
Question 48efBnqEFsEgo5q6TEhT
Question
A junior data engineer seeks to leverage Delta Lake’s Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job: //IMG//
Which statement describes the execution and results of running the above query multiple times?
Choices
- A: Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.
- B: Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.
- C: Each time the job is executed, the target table will be overwritten using the entire history of inserted or updated records, giving the desired result.
- D: Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.
- E: Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table, giving the desired result.
Answer: B
Answer_ET: B
Community answer: B (92%), other (8%)
Discussion
Comment 1363233 by Chugs
- Upvotes: 1
Selected Answer: B Since the change types kept are insert and update, B is the correct option.
Comment 1356324 by asdsadasdas
- Upvotes: 3
Selected Answer: B spark.read reads the entire table every time the job runs. If it were readStream, then E would be the answer.
Comment 1351481 by Ashish7singh2020
- Upvotes: 1
Selected Answer: B since start version is 0
Comment 1306485 by akashdesarda
- Upvotes: 1
Selected Answer: B The starting version is 0, which means the entire change history is fetched on every run and then appended.
Comment 1260280 by faraaz132
- Upvotes: 2
Correct Answer: B (not E) Although it was pretty obvious to me, I still wrote the code to check, and yes, it appends the entire change history on every write since the starting version is 0. If in doubt, code it yourself.
Comment 1223829 by imatheushenrique
- Upvotes: 3
("startingVersion", 0) means the entire history of the table will be read, so B.
Comment 1145271 by PrashantTiwari
- Upvotes: 2
B is correct
Comment 1118714 by kz_data
- Upvotes: 2
Selected Answer: B B is correct
Comment 1111099 by 5ffcd04
- Upvotes: 1
Selected Answer: B Correct B
Comment 1091521 by azurelearn2020
- Upvotes: 1
Selected Answer: B correct answer is B.
Comment 1087907 by [Removed]
- Upvotes: 1
Selected Answer: E Considering that we are talking about Change Data Feed and the code filters the _change_type column by ["update_postimage", "insert"], I would go with option E.
Comment 1047064 by jyothsna12496
- Upvotes: 1
Why is it not E? It gets newly inserted or updated records.
Comment 1040450 by sturcu
- Upvotes: 1
Selected Answer: B correct
Comment 1003658 by azurearch
- Upvotes: 2
B is the right answer, sorry.
Comment 1002467 by azurearch
- Upvotes: 1
answer is A, because there is a filter, as asmayassineg said. The filter keeps only existing records from the change feed.
Comment 970047 by asmayassineg
- Upvotes: 2
sorry, answer is correct B.
Comment 970046 by asmayassineg
- Upvotes: 2
Answer is A, since the DataFrame filters on updated records using the update_postimage filter.
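Editor's note: a minimal sketch of the daily job as the discussion reconstructs it (batch read of the change feed from version 0, filtered to inserts and update postimages, appended); the target table name is a placeholder.

```python
from pyspark.sql.functions import col

# Because this is a batch read (spark.read, not readStream) pinned to
# startingVersion 0, every execution re-reads the table's entire change
# history and appends it again, so the target accumulates duplicates.
(spark.read
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("bronze")
    .filter(col("_change_type").isin("insert", "update_postimage"))
    .write
    .mode("append")
    .saveAsTable("target_type1"))  # hypothetical target table name
```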
Question 9IZ0tsUD7ka1ifgNnTIp
Question
A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. This happened even though the critical field was in the Kafka source. That field was further missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days. The pipeline has been in production for three months. Which describes how Delta Lake can help to avoid data loss of this nature in the future?
Choices
- A: The Delta log and Structured Streaming checkpoints record the full history of the Kafka producer.
- B: Delta Lake schema evolution can retroactively calculate the correct value for newly added fields, as long as the data was in the original source.
- C: Delta Lake automatically checks that all fields present in the source data are included in the ingestion layer.
- D: Data can never be permanently dropped or deleted from Delta Lake, so data loss is not possible under any circumstance.
- E: Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.
Answer: E
Answer_ET: E
Community answer: E (92%), other (8%)
Discussion
Comment 1558223 by kishanu
- Upvotes: 1
Selected Answer: E E is the right answer, as the table in bronze can be replayed again when required.
Comment 1363433 by Tedet
- Upvotes: 1
Selected Answer: A Considering the Databricks documentation on change feed and your need to process new records that have not been processed yet, Option A might actually be a better fit since you’re looking for a streaming solution that can continuously monitor new records. The change feed (Option D) works for batch processing changes from a specific version, which isn’t ideal for real-time streaming.
Comment 1335204 by HairyTorso
- Upvotes: 1
Selected Answer: E E lgtm
Comment 1324397 by Anithec0der
- Upvotes: 3
Selected Answer: E When we design a pipeline, we have to make sure the source data lands intact in the raw/bronze layer, with transformations done in the refined and enterprise layers. That way, when a necessary column was not propagated in previous pipeline runs, we can rebuild it from the raw data we already have.
Comment 1223839 by imatheushenrique
- Upvotes: 3
Medallion Architecture is named in E. (Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.)
Comment 1149866 by ojudz08
- Upvotes: 2
Selected Answer: E E is correct
Comment 1137643 by DAN_H
- Upvotes: 1
Selected Answer: E I think E is correct
Comment 1118730 by kz_data
- Upvotes: 1
Selected Answer: E I think E is correct
Comment 1062710 by alexvno
- Upvotes: 2
Selected Answer: E Looks good - E
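Editor's note: a minimal sketch of the bronze pattern answer E describes, assuming a Kafka source; the bootstrap server, topic name, and checkpoint path are placeholders.

```python
# Land the raw Kafka payload and all metadata, unparsed, in a bronze Delta
# table. Since nothing is projected away, a field dropped by downstream
# parsing can be recovered by replaying bronze, long after Kafka's 7-day
# retention window has expired.
(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "events")                     # placeholder topic
    .load()  # columns: key, value, topic, partition, offset, timestamp, ...
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/bronze")  # placeholder
    .toTable("bronze"))
```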
Question 4umAXcyONTdzTWREgmr2
Question
When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?
Choices
- A: Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: Unlimited
- B: Cluster: New Job Cluster; Retries: None; Maximum Concurrent Runs: 1
- C: Cluster: Existing All-Purpose Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1
- D: Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1
- E: Cluster: Existing All-Purpose Cluster; Retries: None; Maximum Concurrent Runs: 1
Answer: D
Answer_ET: D
Community answer: D (100%)
Discussion
Comment 972714 by 8605246
- Upvotes: 11
the answer given is correct: Maximum concurrent runs: Set to 1. There must be only one instance of each query concurrently active. Retries: Set to Unlimited. https://docs.databricks.com/en/structured-streaming/query-recovery.html
Comment 1473795 by codebender
- Upvotes: 1
Selected Answer: D Can't be an existing all-purpose cluster.
Comment 1351570 by EelkeV
- Upvotes: 1
Selected Answer: D A job cluster auto-terminates, and you want retries for recovery.
Comment 1290638 by akashdesarda
- Upvotes: 2
Selected Answer: D Use Databricks Jobs, as it has native integration with the streaming use case. See the example job here: https://docs.databricks.com/en/structured-streaming/query-recovery.html#configure-structured-streaming-jobs-to-restart-streaming-queries-on-failure
Comment 1224434 by imatheushenrique
- Upvotes: 1
D. Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1
Comment 1191628 by juliom6
- Upvotes: 1
D is correct https://docs.databricks.com/en/structured-streaming/query-recovery.html
Comment 1128056 by AziLa
- Upvotes: 1
Correct Ans is D
Comment 1121587 by Jay_98_11
- Upvotes: 1
Selected Answer: D D is correct
Comment 1102665 by kz_data
- Upvotes: 1
Selected Answer: D D is correct
Comment 1044664 by sturcu
- Upvotes: 1
Selected Answer: D D is correct
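Editor's note: a minimal sketch of what answer D looks like as a Jobs API 2.1 payload; in that API, max_retries = -1 means retry indefinitely. The job name and cluster values are placeholders.

```python
# Answer D as job settings: new job cluster, unlimited retries, one
# concurrent run, so exactly one instance of the query is ever active.
job_settings = {
    "name": "nightly-stream",          # placeholder job name
    "max_concurrent_runs": 1,          # only one active instance
    "tasks": [{
        "task_key": "stream",
        "max_retries": -1,             # -1 = retry indefinitely
        "new_cluster": {               # job cluster, not all-purpose
            "spark_version": "13.3.x-scala2.12",  # placeholder
            "node_type_id": "i3.xlarge",          # placeholder
            "num_workers": 2,
        },
        "notebook_task": {"notebook_path": "/Repos/jobs/stream"},  # placeholder
    }],
}
```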
Question orSGipt9wuE4e50JXHlo
Question
A nightly job ingests data into a Delta Lake table using the following code: //IMG//
The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline. Which code snippet completes this function definition?
def new_records():
Choices
- A: return spark.readStream.table(“bronze”)
- B: return spark.readStream.load(“bronze”)
- C: //IMG//
- D: return spark.read.option(“readChangeFeed”, “true”).table (“bronze”)
- E: //IMG//
Answer: A
Answer_ET: A
Community answer: A (46%), E (28%), D (24%), other (1%)
Discussion
Comment 1068885 by AzureDE2522
- Upvotes: 12
Selected Answer: D
not providing a starting version/timestamp will result in the latest snapshot being fetched first
spark.readStream.format("delta")
  .option("readChangeFeed", "true")
  .table("myDeltaTable")
Please refer: https://docs.databricks.com/en/delta/delta-change-data-feed.html
Comment 1072418 by Laraujo2022
- Upvotes: 8
In my opinion E is not correct because no parameters (year, month, and day) are passed into the function; the signature is def new_records():
Comment 1410910 by AlHerd
- Upvotes: 2
Selected Answer: A Option A is best because it creates a streaming source that reads only new appended data from the “bronze” table incrementally. Even if ingestion is done in batch, using spark.readStream.table(“bronze”) lets downstream processing treat the table as a live data stream.
Comment 1363432 by Tedet
- Upvotes: 1
Selected Answer: D Explanation: This is the best option for Delta Lake, as it uses the readChangeFeed option. This option is specifically designed to read only the new changes (insertions, updates, or deletions) since the last read, which is exactly what is needed when you want to handle new records that have not yet been processed. This ensures that only records that are new or changed since the last read are returned. Conclusion: This is the correct choice, as it ensures that only new records are read.
Comment 1356327 by asdsadasdas
- Upvotes: 1
Selected Answer: A "Manipulate new records that have not yet been processed to the next table": readStream can incrementally pick up data yet to be processed. The issue with D is spark.read; it will read the entire table.
Comment 1353735 by shaswat1404
- Upvotes: 2
Selected Answer: E Options A and B assume streaming ingestion, but ingestion is in batch mode. In option C, current_timestamp is used, which is dynamic and changes every time the query is executed, so it won't correctly filter records ingested in the last batch. Option D only works if delta.enableChangeDataFeed = true was set on the table before ingestion (it is disabled by default, and the given query does not set it), so that option is invalid. Option E is correct: it filters on the file path to retrieve only data from the latest ingestion; the source_file column was created specifically for this purpose, ensuring the function returns only new records.
Comment 1334816 by arekm
- Upvotes: 2
Selected Answer: A You can read data from the delta table using structured streaming. You have 2 options:
- without CDF - only process new rows (without updates and deletes)
- with CDF - all changes to the data, i.e. insert, update, delete.
Answer A uses the first option. However, in the question they talk about “new records”. So using streaming for new records is OK. Answer A is correct.
Comment 1327625 by sgerin
- Upvotes: 1
Selected Answer: E New records will be filtered for D.
Comment 1327051 by temple1305
- Upvotes: 1
Selected Answer: D New records will be filtered for D - example https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/
Comment 1326380 by AlejandroU
- Upvotes: 1
Selected Answer: A Answer A. A better approach would involve streaming directly from the Delta table (Option A), possibly along with using metadata like ingest_time to track new records more accurately. It might be better to rely on the streaming process itself rather than trying to filter based on the file path (option E).
Comment 1321841 by Thameur01
- Upvotes: 1
Selected Answer: E Using the source_file metadata field allows you to filter new records ingested from specific files. E is the most robust and reliable option for tracking and working with new records in this batch ingestion pipeline.
Comment 1319134 by benni_ale
- Upvotes: 1
Selected Answer: E I tried it myself, but none really works.
Comment 1300941 by cbj
- Upvotes: 2
Selected Answer: A The others can't ensure unprocessed data is captured; e.g., if the code doesn't run for one day and runs the next day, C or E will mis-process one day's data.
Comment 1295700 by shaojunni
- Upvotes: 4
Selected Answer: A since “bronze” table is a delta table, readStream() only returns new data.
Comment 1295163 by pk07
- Upvotes: 2
Selected Answer: E If the job runs only once per day, then option E could indeed be a valid and effective solution. Here’s why:
- Daily execution: since the job runs once per day, all records ingested on that day would be new and unprocessed.
- Source file filtering: the filter condition col("source_file").like(f"/mnt/daily_batch/{year}/{month}/{day}") would select only the records ingested from the current day's batch file.
- Simplicity: this approach is straightforward and doesn't require maintaining additional state (like the last processed version or timestamp).
- Reliability: as long as the daily batch files are consistently named and placed in the correct directory structure, this method will reliably capture all new records for that day.
Comment 1289057 by AndreFR
- Upvotes: 7
Selected Answer: A A is correct by Elimination. As stated by Alaverdi in another comment. Reads delta table as a stream and processes only newly arrived records.
B excluded because of incorrect syntax
C excluded, will be an empty result, as ingestion time (which comes as a param in the other method) is compared with current timestamp
D excluded because of syntax error, should be : spark.read.option(“readChangeFeed”, “true”).option(“startingVersion”, 1).table(“bronze”)
E excluded, will be an empty result, because "source_file" gives a filename, while f"/mnt/daily_batch/{year}/{month}/{day}" gives a folder name
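Editor's note: option E as quoted in the discussion, for reference. year, month, and day are not parameters of new_records(), which is the scope problem several commenters raise, and the pattern is a folder path while source_file holds a file path.

```python
from pyspark.sql.functions import col

# Placeholder values so the sketch runs; in the exam snippet these are not
# parameters of new_records(), which is the scope problem noted above.
year, month, day = 2024, 1, 15

def new_records():
    # Filter as quoted in the discussion. Note: source_file holds a full
    # file path, while the pattern (no wildcard) is only a folder prefix,
    # so several commenters expect this to match nothing.
    return (spark.read.table("bronze")
            .filter(col("source_file").like(f"/mnt/daily_batch/{year}/{month}/{day}")))
```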
Comment 1268210 by t_d_v
- Upvotes: 1
Selected Answer: C Actually it's hard to choose between C and E, as both are a bit incorrect. Option E seems like it will produce an empty result, as a file name is compared with a folder name. Option C seems like it will produce an empty result, as ingestion time (which comes as a param in the other method) is compared with the current timestamp.
On the other hand, if the new_records method had an ingestion-time param, the task would be obvious. Also, considering the very first line, which imports current_timestamp, let me say it's C :))
Comment 1267042 by partha1022
- Upvotes: 1
Selected Answer: D D is correct
Comment 1260286 by faraaz132
- Upvotes: 3
Selected Answer: E Correct Answer : E Since, it selects only those records which have been loaded on the specified date and these records are not processed yet. This is what we want
Not A : It reads all records even the ones previously processed since bronze table keeps historic data.
Not D : It is no where mentioned that change data feed is enabled, nor is it present in the code snippet. This is where we have to be careful with self- assumption
Comment 1258389 by aiwithqasim
- Upvotes: 3
Option D. return spark.read.option(“readChangeFeed”, “true”).table (“bronze”)
The following code snippet is from https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/, where the writer explains what happens when "readChangeFeed" is set to "true": it includes all changes from the specified starting version.
In option D the starting version is not specified, so it will pick up the latest snapshot. Please refer to the doc https://docs.databricks.com/en/delta/delta-change-data-feed.html and find "By default, the stream returns the latest snapshot of the table when the stream first starts as an INSERT and future changes as change data." (spark.read.format("delta").option("readChangeFeed", "true").option("startingVersion", 0).table("people").show(truncate=False))
Comment 1237907 by zhiva
- Upvotes: 2
Selected Answer: A Both E and A could be correct, but the function definition has no input parameters, so we can't use them in the return statement with only the information given in the question. This is why I vote for A.
Comment 1223840 by imatheushenrique
- Upvotes: 1
The E option makes more sense because the whole partition would be filtered. It can't be the options that use CDF because there's no readChangeFeed option in the DataFrame read.
Comment 1184233 by arik90
- Upvotes: 2
Selected Answer: E Since the ingest_daily_batch function writes to the “bronze” table in batch mode using spark.read and write operations, we should not use readStream to read from it in the subsequent function.
Comment 1172494 by alexvno
- Upvotes: 1
Selected Answer: E Probably E, but still, the filename isn't specified, only the folder path.
Comment 1168704 by vikram12apr
- Upvotes: 2
Selected Answer: E Please read the question again. It asks to get the data from the bronze table into some downstream table. Since it's an append-only nightly job, the filter on the file name gives the new data in the bronze table that has not yet flowed down the pipeline.
Comment 1155635 by agreddy
- Upvotes: 1
D is correct. https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/ CDF can be enabled on a non-streaming Delta table; "delta" is the default table format.
Comment 1149873 by ojudz08
- Upvotes: 1
Selected Answer: D The question here is how to manipulate new records that have not yet been processed to the next table. Since the data has been ingested into the bronze table, you need to check whether the data ingested daily is already in the silver table, so I think the answer is D. Enabling change data feed allows tracking row-level changes between Delta table versions.
https://docs.databricks.com/en/delta/delta-change-data-feed.html
Comment 1141033 by guillesd
- Upvotes: 1
the problem here is that both A and E are correct. E just follows the previous filtering logic while A uses the readStream method which will have to maintain a checkpoint. But both can work
Comment 1137653 by DAN_H
- Upvotes: 4
Selected Answer: A A as Structured Streaming incrementally reads Delta tables. While a streaming query is active against a Delta table, new records are processed idempotently as new table versions commit to the source table.
Comment 1136640 by adenis
- Upvotes: 1
Selected Answer: A A is Correct
Comment 1121964 by Jay_98_11
- Upvotes: 1
Selected Answer: E Can't be D, since there's no read option in CDF. https://docs.databricks.com/en/delta/delta-change-data-feed.html
Comment 1114361 by RafaelCFC
- Upvotes: 1
Selected Answer: E E addresses the desired filtering, while keeping with the logic of the first step being a batch job, and has no code errors.
Comment 1095606 by alaverdi
- Upvotes: 5
Selected Answer: A In my opinion A is the correct answer. You read the Delta table as a stream and process only newly arrived records. This is maintained while writing the stream, with the state stored in the checkpoint location:
spark.readStream.table("bronze").writeStream.format("delta").outputMode("append").option("checkpointLocation", "/path/to/checkpoints/").toTable("silver")
Comment 1049249 by chokthewa
- Upvotes: 1
E is correct. D uses an invalid option; see the sample in https://docs.databricks.com/en/delta/delta-change-data-feed.html. A and B don't filter, so they will gather the whole table's data. E uses the known values to filter.
Comment 1044826 by sturcu
- Upvotes: 2
Selected Answer: E we filter on the file_path
Comment 1014938 by MarceloManhaes
- Upvotes: 1
Yes, it is E, because for D there is no mention that delta.enableChangeDataFeed is true. Also there is no read option in CDF; it is table_changes (batch) or readStream. URL: https://docs.databricks.com/en/delta/delta-change-data-feed.html
Comment 1009485 by hammer_1234_h
- Upvotes: 2
it could be E, but not D
Comment 1006999 by hammer_1234_h
- Upvotes: 1
the answer should be A https://docs.databricks.com/en/structured-streaming/delta-lake.html#delta-table-as-a-source
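Editor's note: a minimal sketch of how answer A behaves end to end, per the alaverdi and hammer_1234_h comments above; the checkpoint path and silver table name are placeholders.

```python
# Answer A: a streaming DataFrame over the bronze Delta table.
def new_records():
    return spark.readStream.table("bronze")

# Downstream, the checkpoint records which table versions have been
# consumed, so each run processes only newly appended records.
(new_records()
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/silver")  # placeholder
    .trigger(availableNow=True)  # incremental batch: process new data, stop
    .toTable("silver"))
```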