Questions and Answers

Question qrrNao8PuTik19HKaZVV

Question

The following table consists of items found in user carts within an e-commerce website.

//IMG//

The following MERGE statement is used to update this table using an updates view, with schema evolution enabled on this table.

//IMG//

How would the following update be handled?

//IMG//

Choices

A: The update throws an error because changes to existing columns in the target schema are not supported.
B: The new nested Field is added to the target schema, and dynamically read as NULL for existing unmatched records.
C: The update is moved to a separate “rescued” column because it is missing a column expected in the target schema.
D: The new nested field is added to the target schema, and files underlying existing records are updated to include NULL values for the new field.

answer?

Answer: B Answer_ET: B Community answer B (100%) Discussion

Comment 1222837 by Freyr

Upvotes: 5

Selected Answer: B Correct Answer: B. In the question it is mentioned that the schema evolution is enabled. This option states that the new nested field is added to the target schema, and existing records not matching the new schema format are populated with NULL for the newly added columns. This behavior aligns with how schema evolution functions in Delta Lake, dynamically adapting the schema to include new fields.

Comment 1335572 by HairyTorso

Upvotes: 1

Selected Answer: B Seems like B is correct if below config is enabled:

SET spark.databricks.delta.schema.autoMerge.enabled = true;

Comment 1277909 by pppppppppie

Upvotes: 1

because B says for unmatched records. but my new records has 1001 which is existing record. we should give ans what will happen with this new record. it will update the record in target with NULL as coupon

Comment 1277905 by pppppppppie

Upvotes: 1

It has to be D

Comment 1221139 by MDWPartners

Upvotes: 4

Selected Answer: B schema evolution is enabled, so B.

Question KCFKe3rJ5QnVm0ksNseR

Question

An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id.

For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour.

Which solution meets these requirements?

Choices

A: Iterate through an ordered set of changes to the table, applying each in turn to create the current state of the table, (insert, update, delete), timestamp of change, and the values.
B: Use merge into to insert, update, or delete the most recent entry for each pk_id into a table, then propagate all changes throughout the system.
C: Deduplicate records in each batch by pk_id and overwrite the target table.
D: Use Delta Lake’s change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the Lakehouse.

answer?

Answer: D Answer_ET: D Community answer D (77%) B (23%) Discussion

Comment 1553215 by a85becd

Upvotes: 1

Selected Answer: B Delta Lake’s Change Data Feed (CDF) only works on Delta tables, as it is a feature explicitly tied to Delta Lake’s architecture. In the scenario described in the question, the source is an external system writing CDC logs to cloud object storage, and there is no mention of the data being stored in Delta tables initially.

Comment 1332257 by OnlyPraveen

Upvotes: 3

Selected Answer: D If its exact duplicate of Question #13 then we are missing a good choice of : “E. Ingest all log information into a bronze table; use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a silver table to recreate the current table state.”

Comment 1319318 by e904bf4

Upvotes: 2

Selected Answer: B The right answer is B because CDF only works on delta tables and here the source is external

Comment 1304874 by Huepig

Upvotes: 3

There is no right answer. Closest is B only after the CDC logs are ingested to a Bronze table and then use merge into a silver table.

Why not D? Because CDF only works on delta tables and not on external CDC logs.

Comment 1256275 by HelixAbdu

Upvotes: 1

The MERGE INTO statement in Delta Lake is a powerful feature designed to handle Change Data Capture (CDC) data efficiently. This approach meets both the auditing and analytical requirements.

CDF is not enabled by default. So these data is not generated by it to handel them.

Comment 1236283 by Ati1362

Upvotes: 2

Selected Answer: D agree with D

Comment 1224761 by BrianNguyen95

Upvotes: 2

Selected Answer: D Delta Lake provides built-in change data feed functionality. It captures changes (inserts, updates, deletes) and propagates them to dependent tables. By using Delta Lake, you can maintain historical records and propagate changes efficiently.

Comment 1222841 by Freyr

Upvotes: 3

Selected Answer: D Correct Answer: D Delta Lake’s change data feed feature is specifically designed to handle CDC scenarios. It processes data from external systems, tracking all changes (inserts, updates, deletes) and maintaining a detailed history of these changes. This feature allows for keeping a comprehensive log while also ensuring the most recent state is correctly reflected in analytical tables.

Question ZYlCl9a4PBryUdOQnQXj

Question

An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represent all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema:

user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT

New records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.

Which implementation can be used to efficiently update the described account_current table as part of each hourly batch job assuming there are millions of user accounts and tens of thousands of records processed hourly?

Choices

A: Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.
B: Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger available job to batch update newly detected files into the account_current table.
C: Overwrite the account_current table with each batch using the results of a query against the account_history table grouping by user_id and filtering for the max value of last_updated.
D: Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id write a merge statement to update or insert the most recent value for each user_id.

answer?

Answer: D Answer_ET: D Community answer D (71%) 14% 14% Discussion

Comment 1307253 by benni_ale

Upvotes: 1

Selected Answer: D B or D … associate course tells us to use auto loader when bilions of rows…

Comment 1288816 by RyanAck24

Upvotes: 1

Selected Answer: D D seems like the best option

Comment 1288190 by shaojunni

Upvotes: 1

Selected Answer: B A, D both wrong. They only take data from the latest update. It is too narrow. Same user_id can have several updates within an hour to update different fields. So use auto loader to apply all the updates within an hour is the only correct answer.

Comment 1268169 by fe3b2fc

Upvotes: 1

Selected Answer: A Answer is A. You’re meeting all the requirements with less overhead. It’s only updating on the most recent record, so duplicates are handled.

Answer D is too much overhead. They’re doing a full table scan for all records, which as the question stated, is millions of records.

Comment 1222863 by Freyr

Upvotes: 3

Selected Answer: D Correct Answer: D Similar to Option A, but specifically designed around the user_id, which is the primary key. This approach ensures that the account_current is always up-to-date with the latest information per user based on the primary key, reducing the risk of duplicate information and ensuring the integrity of the data with respect to the unique identifier.

Question grzKEzJUKeZKovXuBFwA

Question

The business intelligence team has a dashboard configured to track various summary metrics for retail stores. This includes total sales for the previous day alongside totals and averages for a variety of time periods. The fields required to populate this dashboard have the following schema:

//IMG//

For demand forecasting, the Lakehouse contains a validated table of all itemized sales updated incrementally in near real-time. This table, named products_per_order, includes the following fields:

//IMG//

Because reporting on long-term sales trends is less volatile, analysts using the new dashboard only require data to be refreshed once daily. Because the dashboard will be queried interactively by many users throughout a normal business day, it should return results quickly and reduce total compute associated with each materialization.

Which solution meets the expectations of the end users while controlling and limiting possible costs?

Choices

A: Populate the dashboard by configuring a nightly batch job to save the required values as a table overwritten with each update.
B: Use Structured Streaming to configure a live dashboard against the products_per_order table within a Databricks notebook.
C: Define a view against the products_per_order table and define the dashboard against this view.
D: Use the Delta Cache to persist the products_per_order table in memory to quickly update the dashboard with each query.

answer?

Answer: A Answer_ET: A Community answer A (100%) Discussion

Comment 1341580 by RandomForest

Upvotes: 1

Selected Answer: A A is correct: The data should only be refreshed once a day so nightly load will do just fine.

Comment 1297175 by 17kyu

Upvotes: 1

E #69 in professional one

Comment 1221146 by MDWPartners

Upvotes: 2

Selected Answer: A Seems correct

Question Oo2uVxBuPVvr2gJ6ZBlk

Question

A Delta lake table with CDF enabled table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.

Which approach would simplify the identification of these changed records?

Choices

A: Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
B: Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
C: Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.
D: Modify the overwrite logic to include a field populated by calling spark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.

answer?

Answer: C Answer_ET: C Community answer C (100%) Discussion

Comment 1341585 by RandomForest

Upvotes: 1

Selected Answer: C C is correct: Use the fact that CDF is enabled to implement a SCD Type 1 UPSERT logic.

Comment 1300362 by m79590530

Upvotes: 3

Selected Answer: C Why use overwrite for a table that has CDF enabled? Best is to modify it in a SCD Type 1 style by using UPSERTs via the MERGE INTO logic and then/also use its CDF system fields as the means to filter which records are current and which not by selecting only the records that have ‘insert’ or ‘update_postimage’ in their _change_type CDF system column.

vuthanhdatt's Second Brain

Explorer

8

Questions and Answers

Question qrrNao8PuTik19HKaZVV

Question

Choices

Comment 1222837 by Freyr

Comment 1335572 by HairyTorso

Comment 1277909 by pppppppppie

Comment 1277905 by pppppppppie

Comment 1221139 by MDWPartners

Question KCFKe3rJ5QnVm0ksNseR

Question

Choices

Comment 1553215 by a85becd

Comment 1332257 by OnlyPraveen

Comment 1319318 by e904bf4

Comment 1304874 by Huepig

Comment 1256275 by HelixAbdu

Comment 1236283 by Ati1362

Comment 1224761 by BrianNguyen95

Comment 1222841 by Freyr

Question ZYlCl9a4PBryUdOQnQXj

Question

Choices

Comment 1307253 by benni_ale

Comment 1288816 by RyanAck24

Comment 1288190 by shaojunni

Comment 1268169 by fe3b2fc

Comment 1222863 by Freyr

Question grzKEzJUKeZKovXuBFwA

Question

Choices

Comment 1341580 by RandomForest

Comment 1297175 by 17kyu

Comment 1221146 by MDWPartners

Question Oo2uVxBuPVvr2gJ6ZBlk

Question

Choices

Comment 1341585 by RandomForest

Comment 1300362 by m79590530

Graph View

Table of Contents