Questions and Answers
Question EdGJq4N8UhPRHxSUeFbX
Question
A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:
//IMG//
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system.
If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?
Choices
- A: Duplicate records enqueued more than 2 hours apart may be retained and the orders table may contain duplicate records with the same customer_id and order_id.
- B: All records will be held in the state store for 2 hours before being deduplicated and committed to the orders table.
- C: The orders table will contain only the most recent 2 hours of records and no duplicates will be present.
- D: Duplicate records arriving more than 2 hours apart will be dropped, but duplicates that arrive in the same batch may both be written to the orders table.
- E: The orders table will not contain duplicates, but records arriving more than 2 hours late will be ignored and missing from the table.
answer?
Answer: A Answer_ET: A Community answer A (54%) E (46%) Discussion
Comment 1100408 by alexvno
- Upvotes: 8
Selected Answer: A Only A seems logical
Comment 1226718 by Isio05
- Upvotes: 5
Selected Answer: A It’s A, rows are deduplicated only in 2hrs window, therefore final table may eventually contain duplicates
Comment 1441690 by a85becd
- Upvotes: 1
Selected Answer: E
- The code snippet uses withWatermark("time", "2 hours"), which means that Spark will only process records with a timestamp (the time field) that falls within a 2-hour window from the most recent watermark. Any records arriving more than 2 hours late will be ignored and won't be written to the orders table.
- The dropDuplicates(["customer_id", "order_id"]) call ensures that duplicate records (based on the composite key of customer_id and order_id) are removed within the 2-hour window. This guarantees no duplicates will be present in the orders table for data falling within the watermark range.
Comment 1411620 by EZZALDIN
- Upvotes: 2
Selected Answer: E E. The orders table will not contain duplicates, but records arriving more than 2 hours late will be ignored and missing from the table. This matches Spark’s actual behavior with watermark + dropDuplicates: the stateful deduplication ensures no duplicate (customer_id, order_id) rows are ever written to the Delta table, and the 2-hour watermark means any record that comes in more than 2 hours behind the stream’s current event-time will be dropped (i.e. not processed/output)
Comment 1340742 by RandomForest
- Upvotes: 1
Selected Answer: A A is correct. The watermark ensures that data arriving up to 2h late will be deduplicated in the incoming batch.
Comment 1335452 by arekm
- Upvotes: 1
Selected Answer: A A - since the same [customer_id, order_id] record can be emitted twice with the “time” value set apart by more than 2 hours. The target table will then contain duplicates (duplicates from the perspective of the composite key).
Comment 1315290 by benni_ale
- Upvotes: 1
For me, it's the onion
Comment 1309965 by cf56faf
- Upvotes: 2
Selected Answer: E Seems that E should be the correct answer. As time is the time it was queued in the source_system.
And withWatermark ignores records that have a “time” more than 2 hours old.
Comment 1303304 by Ananth4Sap
- Upvotes: 2
When watermarking is set to 2 hours, the system will wait for up to 2 hours for late data to arrive. Any data that arrives within this 2-hour window will be considered for processing and deduplication. However, data that arrives later than 2 hours after the event time will be considered too late and will be discarded. This ensures that the state store does not grow indefinitely, but it also means that any records arriving more than 2 hours late will not be included in the orders table.
Comment 1299924 by m79590530
- Upvotes: 2
Selected Answer: E Every Stream micro-batch is executed on all of the new data that arrived after the last run 2 hours ago by the .trigger(once=True) option. Deduplication is done for it based on the combined key fields values but all records older than 2 hours based on the ‘time’ field will be ignored thanks to the .withWatermark() option/function. So target table will have deduplicated data withOUT the late records arriving more than 2 hours later based on the 2 hours watermark buffer set for the readStream.
Comment 1296611 by shaojunni
- Upvotes: 2
Selected Answer: E Data arrive outside of watermark will be dropped.
Comment 1265048 by quaternion
- Upvotes: 3
Selected Answer: E withWatermark("time", "2 hours") ⇒ doesn't let records arriving more than 2 hours late be written; dropDuplicates ⇒ removes duplicate records from the records that are read
Comment 1225115 by QuangTrinh
- Upvotes: 1
Selected Answer: E Should be E.
Watermarking (withWatermark(“time”, “2 hours”)): This sets a 2-hour watermark on the time column. The watermark specifies the event time threshold for data completeness, meaning that data older than 2 hours will be considered late and may be dropped. Deduplication (dropDuplicates([“customer_id”, “order_id”])): This operation removes duplicates based on the composite key (customer_id and order_id). However, it only works within the window defined by the watermark.
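The disagreement between A and E comes down to how long deduplication state is kept. The sketch below is a pure-Python simulation of the watermark + dropDuplicates interaction (it is not Spark code; the helper name and the 2-hour constant are illustrative, mirroring the question): state for a key is evicted once the watermark passes it, so a duplicate enqueued more than 2 hours later is no longer matched and is written again, which is why A is the accepted answer.

```python
# Minimal pure-Python simulation of watermark-bounded deduplication
# (hypothetical helper, not the Spark API). State for a key survives only
# while its event time is within the watermark delay.

WATERMARK_HOURS = 2

def dedup_stream(records):
    """records: iterable of (customer_id, order_id, event_time_hours)."""
    state = {}          # (customer_id, order_id) -> last seen event time
    max_event_time = 0  # drives the watermark forward
    output = []
    for cust, order, t in records:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - WATERMARK_HOURS
        # Evict state older than the watermark (this is what keeps the
        # state store bounded).
        state = {k: v for k, v in state.items() if v >= watermark}
        key = (cust, order)
        if key not in state:
            output.append((cust, order, t))  # first sighting -> written out
        state[key] = t
    return output

# Duplicate of order (1, 100) arrives 3 hours later: its state entry was
# already evicted, so both copies reach the output table (answer A).
rows = dedup_stream([(1, 100, 0), (2, 200, 1), (1, 100, 3)])
```

A duplicate arriving within the 2-hour window, by contrast, still finds its key in state and is dropped, which is the guarantee the E voters are describing; the two camps are just describing different arrival gaps.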
Question AmykmnppUMNnrpjPxD8b
Question
A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write.
Which consideration will impact the decisions made by the engineer while migrating this workload?
Choices
- A: Databricks only allows foreign key constraints on hashed identifiers, which avoid collisions in highly-parallel writes.
- B: Databricks supports Spark SQL and JDBC; all logic can be directly migrated from the source system without refactoring.
- C: Committing to multiple tables simultaneously requires taking out multiple table locks and can lead to a state of deadlock.
- D: All Delta Lake transactions are ACID compliant against a single table, and Databricks does not enforce foreign key constraints.
- E: Foreign keys must reference a primary key field; multi-table inserts must leverage Delta Lake’s upsert functionality.
answer?
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1141650 by vctrhugo
- Upvotes: 6
Selected Answer: D In Databricks Delta Lake, transactions are ACID compliant at the table level, meaning that transactions apply to a single table. However, Delta Lake does not enforce foreign key constraints across tables. Therefore, the data engineer needs to be aware that Databricks does not automatically enforce referential integrity between tables through foreign key constraints, and it becomes the responsibility of the data engineer to manage these relationships appropriately.
Comment 1100412 by alexvno
- Upvotes: 2
Selected Answer: D Primary and foreign keys are informational only and are not enforced.
Comment 1071440 by 60ties
- Upvotes: 1
Selected Answer: D D makes more sense. Since there are no multi-table transactions or cross-table locks, and since primary key & foreign key constraints are informational only, there is no guarantee that the enforced relations of the star schema will remain in place after migration. This also means B cannot be right.
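Because Databricks records primary/foreign key constraints as informational only, the write-time validation the source system did for free has to move into the pipeline. A minimal sketch of that idea (pure Python with hypothetical in-memory tables standing in for Delta tables; the function name is illustrative):

```python
# Databricks does not enforce foreign keys, so referential integrity must
# be checked in pipeline code before the write. Hypothetical in-memory
# tables stand in for Delta tables here.

def validate_foreign_keys(fact_rows, dim_keys, fk_field):
    """Split incoming fact rows into valid rows and FK violations."""
    valid, violations = [], []
    for row in fact_rows:
        (valid if row[fk_field] in dim_keys else violations).append(row)
    return valid, violations

customers = {1, 2, 3}  # dimension-table primary keys
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 99},  # 99 has no parent row
]

good, bad = validate_foreign_keys(orders, customers, "customer_id")
```

Only the rows in `good` would then be written to the fact table; the violations can be quarantined or logged instead of silently breaking the schema's integrity.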
Question iVvTjn6OSiwkTMSyZHVu
Question
A data architect has heard about Delta Lake’s built-in versioning and time travel capabilities. For auditing purposes, they have a requirement to maintain a full record of all valid street addresses as they appear in the customers table.
The architect is interested in implementing a Type 1 table, overwriting existing records with new values and relying on Delta Lake time travel to support long-term auditing. A data engineer on the project feels that a Type 2 table will provide better performance and scalability.
Which piece of information is critical to this decision?
Choices
- A: Data corruption can occur if a query fails in a partially completed state because Type 2 tables require setting multiple fields in a single update.
- B: Shallow clones can be combined with Type 1 tables to accelerate historic queries for long-term versioning.
- C: Delta Lake time travel cannot be used to query previous versions of these tables because Type 1 changes modify data files in place.
- D: Delta Lake time travel does not scale well in cost or latency to provide a long-term versioning solution.
- E: Delta Lake only supports Type 0 tables; once records are inserted to a Delta Lake table, they cannot be modified.
answer?
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1339543 by lene
- Upvotes: 1
Selected Answer: D Vacuuming restricts time travel capability
Comment 1326203 by Sriramiyer92
- Upvotes: 3
Selected Answer: D Also, performing a VACUUM operation on the data will not suit the requirement well in the long run.
Comment 1309265 by vish9
- Upvotes: 1
Selected Answer: D D makes sense
Comment 1295212 by KB_Ai_Champ
- Upvotes: 1
It's C; time travel can't be performed on Type 1 tables.
Comment 1141649 by vctrhugo
- Upvotes: 4
Selected Answer: D Delta Lake’s time travel feature allows you to access previous versions of the data, which can be useful for auditing purposes. However, if you’re planning to use time travel as a long-term versioning solution, it’s important to know that it may not scale well in terms of cost or latency. This is because every time you perform a write operation, a new version of the data is created, which can consume significant storage over time. Additionally, querying older versions of the data may require scanning through many files, which can increase query latency.
Comment 1071443 by 60ties
- Upvotes: 4
Selected Answer: D D makes more sense
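The Type 2 alternative the engineer prefers keeps history in the table itself rather than in the transaction log. A rough sketch of that pattern (pure Python, hypothetical column names such as is_current and end_date; not a Delta MERGE): each change closes the current row and appends a new one, so the full address history survives VACUUM and stays cheap to query.

```python
# Minimal Type 2 slowly-changing-dimension sketch. Instead of overwriting
# (Type 1) and relying on time travel for history, each update closes the
# current row and appends a new versioned row.

def scd2_update(table, customer_id, new_address, as_of):
    for row in table:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["is_current"] = False
            row["end_date"] = as_of  # close out the old version
    table.append({"customer_id": customer_id, "address": new_address,
                  "start_date": as_of, "end_date": None, "is_current": True})

history = []
scd2_update(history, 1, "12 Oak St", "2024-01-01")
scd2_update(history, 1, "34 Elm Ave", "2024-06-01")
```

Auditing "which address was valid on date X" becomes a simple filter on start_date/end_date, with no dependence on how many table versions time travel still retains.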
Question OLEAOQTPoyUFbuesNCTS
Question
A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.
The user_ltv table has the following schema:
email STRING, age INT, ltv INT
The following view definition is executed:
//IMG//
An analyst who is not a member of the auditing group executes the following query:
SELECT * FROM user_ltv_no_minors
Which statement describes the results returned by this query?
Choices
- A: All columns will be displayed normally for those records that have an age greater than 17; records not meeting this condition will be omitted.
- B: All age values less than 18 will be returned as null values, all other columns will be returned with the values in user_ltv.
- C: All values for the age column will be returned as null values, all other columns will be returned with the values in user_ltv.
- D: All records from all columns will be displayed with the values in user_ltv.
- E: All columns will be displayed normally for those records that have an age greater than 18; records not meeting this condition will be omitted.
answer?
Answer: A Answer_ET: A Community answer A (88%) 12% Discussion
Comment 1113061 by dmov
- Upvotes: 8
Selected Answer: A Definitely A. It's greater than or equal to 18.
Comment 1349175 by 91d511b
- Upvotes: 1
Selected Answer: E
>= 18 means that an analyst who is NOT in the auditing group will only see records with age greater than or equal to 18.
Comment 1335457 by arekm
- Upvotes: 1
Selected Answer: A A - at first I thought that was a silly answer since 17.5 is still greater than 17, but not really what the condition says. However, if you look at the schema - the age is an integer - you can only have …, 17, 18, 19, …
Comment 1326207 by Sriramiyer92
- Upvotes: 1
Selected Answer: A Simple - Play of Words in option A and E. Read the option carefully.
Comment 1245707 by c00ccb7
- Upvotes: 1
Selected Answer: A Because "greater than 18" doesn't include 18, E is wrong:
All columns will be displayed normally for those records that have an age greater than 18; records not meeting this condition will be omitted.
Comment 1167555 by Tamele001
- Upvotes: 1
Option E
Comment 1111505 by divingbell17
- Upvotes: 4
Selected Answer: A A is correct. “greater than 17” is the equivalent to “equal or greater than 18”
Comment 1100368 by sodere
- Upvotes: 1
Selected Answer: E 18 not 17.
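The view definition itself is hidden behind an image; assuming it follows the common conditional-row-filter pattern (WHERE CASE WHEN is_member('auditing') THEN TRUE ELSE age >= 18 END, which the discussion implies), this pure-Python sketch mimics what a non-auditing analyst sees. Since age is an INT, "age >= 18" and "age greater than 17" select exactly the same rows, which is why option A's wording is correct.

```python
# Pure-Python mock of the assumed view filter: auditors see every row,
# everyone else only rows with age >= 18 (equivalently, age > 17 for an
# integer column). Not the actual Databricks view definition.

def query_view(rows, is_auditing_member):
    if is_auditing_member:
        return rows
    return [r for r in rows if r["age"] >= 18]

users = [
    {"email": "a@x.com", "age": 17, "ltv": 50},
    {"email": "b@x.com", "age": 18, "ltv": 75},
]

visible = query_view(users, is_auditing_member=False)
```

The 17-year-old's row is omitted entirely (not nulled out), matching answer A rather than B or C.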
Question qBCT91l3ozzToP7ZwHHN
Question
The data governance team is reviewing code used for deleting records for compliance with GDPR. The following logic has been implemented to propagate delete requests from the user_lookup table to the user_aggregates table.
//IMG//
Assuming that user_id is a unique identifying key and that all users that have requested deletion have been removed from the user_lookup table, which statement describes whether successfully executing the above logic guarantees that the records to be deleted from the user_aggregates table are no longer accessible and why?
Choices
- A: No; the Delta Lake DELETE command only provides ACID guarantees when combined with the MERGE INTO command.
- B: No; files containing deleted records may still be accessible with time travel until a VACUUM command is used to remove invalidated data files.
- C: Yes; the change data feed uses foreign keys to ensure delete consistency throughout the Lakehouse.
- D: Yes; Delta Lake ACID guarantees provide assurance that the DELETE command succeeded fully and permanently purged these records.
- E: No; the change data feed only tracks inserts and updates, not deleted records.
answer?
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1339565 by lene
- Upvotes: 1
Selected Answer: B VACUUM removes the files needed for time travel
Comment 1255590 by Hadiler
- Upvotes: 2
Selected Answer: B B records will be available in time travel until VACUUM will be executed
Comment 1100421 by alexvno
- Upvotes: 2
Selected Answer: B Delta time travel
Comment 1071471 by 60ties
- Upvotes: 2
Selected Answer: B B is best. The VACUUM command is needed to physically remove the data files containing the deleted records.
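The reason B is correct is a file-level detail of how Delta implements DELETE. The sketch below is a deliberately simplified pure-Python model (hypothetical class, not the Delta implementation): a delete rewrites surviving rows into a new file and merely drops the old file from the latest version's file list, so the deleted bytes remain on storage (readable via time travel) until VACUUM physically removes unreferenced files.

```python
# Toy model of Delta's copy-on-write delete. A DELETE only removes the old
# file from the table's "live" version; the file stays on storage until
# VACUUM physically deletes it.

class MiniDeltaTable:
    def __init__(self, rows):
        self.files = {"part-0": rows}   # physical files on storage
        self.live = {"part-0"}          # files in the latest table version

    def delete(self, predicate):
        new_file = f"part-{len(self.files)}"
        kept = [r for f in self.live for r in self.files[f]
                if not predicate(r)]
        self.files[new_file] = kept     # rewrite survivors to a new file
        self.live = {new_file}          # old file is dereferenced, NOT erased

    def vacuum(self):
        # Physically drop files no longer referenced by the live version.
        self.files = {f: r for f, r in self.files.items() if f in self.live}

t = MiniDeltaTable([{"user_id": 1}, {"user_id": 2}])
t.delete(lambda r: r["user_id"] == 1)
still_on_storage = any(r["user_id"] == 1
                       for rows in t.files.values() for r in rows)
t.vacuum()
gone = not any(r["user_id"] == 1
               for rows in t.files.values() for r in rows)
```

In the real system the retention window (default 7 days) also has to elapse before VACUUM will drop those files, which is precisely the GDPR gap the question is probing.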