Questions and Answers
Question juMfFfxwOBnubvdwstdC
Question
A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the code below is to register a view of all sales that occurred in countries on the continent of Africa that appear in the geo_lookup table.
Before executing the code, running SHOW TABLES on the current database indicates the database contains only two tables: geo_lookup and sales.
//IMG//
What will be the outcome of executing these command cells in order in an interactive notebook?
Choices
- A: Both commands will succeed. Executing SHOW TABLES will show that countries_af and sales_af have been registered as views.
- B: Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or view named countries_af; if this entity exists, Cmd 2 will succeed.
- C: Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable representing a PySpark DataFrame.
- D: Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable containing a list of strings.
Answer: D
Answer_ET: D
Community answer: D (100%)
Discussion
Comment 1339901 by lene
- Upvotes: 1
Selected Answer: D
D is correct
Comment 1303561 by nedlo
- Upvotes: 1
I wonder if it is not B? I guess the name of a view can be the same as the name of a Python variable?
Comment 1255633 by Hadiler
- Upvotes: 4
Selected Answer: D
D is correct. It will not be a DataFrame, since collect() converts it to a list of strings.
Comment 1251593 by vexor3
- Upvotes: 1
Selected Answer: D
D is correct
Comment 1228728 by hpkr
- Upvotes: 1
Selected Answer: D
D is correct
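The actual cells are only in the missing image, but based on answer D they presumably follow a common notebook pattern, sketched below as pseudocode-style reconstruction (table, column, and filter names are assumptions; this only runs inside a Databricks notebook where a `spark` session exists):

```python
# Cmd 1 (Python) -- hypothetical reconstruction of the missing cell.
# .collect() returns a list of Row objects, and the comprehension
# extracts the first field, leaving a plain Python list of strings.
countries_af = [x[0] for x in spark.table("geo_lookup")
                                   .filter("continent = 'AF'")
                                   .select("country")
                                   .collect()]

# Cmd 2 (SQL) -- fails: countries_af is a Python variable, not a
# table or view, so the SQL engine cannot resolve it.
# CREATE VIEW sales_af AS
#   SELECT * FROM sales
#   WHERE country IN (SELECT country FROM countries_af)
```

This is why D beats C: after collect() and the comprehension, countries_af is no longer a DataFrame but a Python list of strings.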
Question PNZRk9m5O3zXLJTKH9qr
Question
The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the schema below:
item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING
The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify whether any of 30 keywords exist in this field.
A junior data engineer suggests converting this data to Delta Lake will improve query performance.
Which response to the junior data engineer’s suggestion is correct?
Choices
- A: Delta Lake statistics are not optimized for free text fields with high cardinality.
- B: Delta Lake statistics are only collected on the first 4 columns in a table.
- C: ZORDER ON review will need to be run to see performance gains.
- D: The Delta log creates a term matrix for free text fields to support selective filtering.
Answer: A
Answer_ET: A
Community answer: A (100%)
Discussion
Comment 1300340 by m79590530
- Upvotes: 1
Selected Answer: A
Delta Lake optimizations are not well suited for long TIMESTAMP or STRING fields and cannot provide good indexing, data skipping, or statistics logging for them.
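As a concrete mitigation (table name `reviews` is an assumption), the number of leading columns on which Delta collects file-level statistics can be tuned via the documented table property `delta.dataSkippingNumIndexedCols` (default 32), so the high-cardinality free-text column is skipped:

```sql
-- Collect file-level statistics only on the first 4 columns
-- (item_id, user_id, review_id, rating), skipping the long
-- free-text review column, which cannot benefit from data skipping:
ALTER TABLE reviews
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '4');
```

Alternatively, the long string column can be moved past the indexed column count; neither approach makes keyword search on free text fast, which is the point of answer A.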
Question LNwuzZxpzCf9WMaRHmww
Question
The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings.
The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization.
The compliance officer has recently learned about Delta Lake’s time travel functionality. They are concerned that this might allow continued access to deleted data.
Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?
Choices
- A: Because the VACUUM command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.
- B: Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the VACUUM job is run the following day.
- C: Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the VACUUM job is run 8 days later.
- D: Because Delta Lake’s delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.
Answer: C
Answer_ET: C
Community answer: C (86%), other (14%)
Discussion
Comment 1296960 by cales
- Upvotes: 1
Selected Answer: C
It is C: by default, VACUUM retains files no longer referenced in the current table version for 7 days. https://docs.databricks.com/en/delta/history.html#configure-data-retention-for-time-travel-queries
Comment 1255636 by Hadiler
- Upvotes: 2
Selected Answer: C
C is the correct answer
Comment 1252396 by vexor3
- Upvotes: 3
Selected Answer: C
C is correct
Comment 1237887 by 03355a2
- Upvotes: 1
Selected Answer: A
Since the team expects last week's data to be deleted on Sunday between 1am and 2am, the data will be available for approximately 24 hours, until the VACUUM command is run on Monday at 3am.
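The 7-day default behind answer C can be confirmed or changed with standard Delta commands (table name `user_data` is hypothetical):

```sql
-- VACUUM removes only data files older than the retention threshold
-- (default 7 days = 168 hours), so files rewritten by Sunday's delete
-- job survive Monday's VACUUM and stay reachable via time travel
-- until the VACUUM run roughly 8 days later.
VACUUM user_data;                  -- uses the default 168-hour threshold
VACUUM user_data RETAIN 168 HOURS; -- equivalent, stated explicitly

-- The threshold itself is a table property:
ALTER TABLE user_data
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 7 days');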
Question 0bVHT1yruwiMyPgpZaSP
Question
Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?
Choices
- A: configure
- B: fs
- C: workspace
- D: libraries
Answer: B
Answer_ET: B
Community answer: B (100%)
Discussion
Comment 1251596 by vexor3
- Upvotes: 1
Selected Answer: B
B is correct
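A minimal sketch of the fs command group in use (wheel filename and DBFS path are hypothetical, and exact flags depend on the installed CLI version, so treat this as illustration rather than a verified invocation):

```shell
# Copy a locally built wheel into DBFS-mounted object storage
# so a production job can reference it as a cluster library.
databricks fs cp ./dist/my_lib-0.1.0-py3-none-any.whl \
    dbfs:/mnt/prod/wheels/my_lib-0.1.0-py3-none-any.whl
```

The other choices manage CLI auth (configure), notebooks (workspace), and cluster library attachment (libraries); only fs moves files to DBFS.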
Question QE6kFMBDuqVYIfky05Zb
Question
An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id.
For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour.
Which solution meets these requirements?
Choices
- A: Create a separate history table for each pk_id; resolve the current state of the table by running a UNION ALL, filtering the history tables for the most recent state.
- B: Use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a bronze table, then propagate all changes throughout the system.
- C: Iterate through an ordered set of changes to the table, applying each in turn; rely on Delta Lake’s versioning ability to create an audit log.
- D: Use Delta Lake’s change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the Lakehouse.
- E: Ingest all log information into a bronze table; use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a silver table to recreate the current table state.
answer?
Answer: E
Answer_ET: E
Community answer: E (82%), D (18%)
Discussion
Comment 991557 by Eertyy
- Upvotes: 10
The answer given is correct
Comment 1222457 by imatheushenrique
- Upvotes: 5
E. This is the correct answer because it meets the requirements of maintaining a full record of all values that have ever been valid in the source system and recreating the current table state with only the most recent value for each record. The code ingests all log information into a bronze table, which preserves the raw CDC data as-is. Then it uses MERGE INTO to perform an upsert operation on a silver table, meaning it will insert new records, or update or delete existing records, based on the change type and the pk_id column. This way, the silver table will always reflect the current state of the source table, while the bronze table keeps the history of all changes.
Comment 1361624 by Tedet
- Upvotes: 2
Selected Answer: E
Bronze table (raw ingest): you start by ingesting all the change data capture (CDC) records into a bronze table.
Silver table (processed state): the silver table represents the most recent state of the data. You would use the MERGE INTO command to process the changes from the bronze table and update the silver table accordingly.
Audit trail: since you are ingesting all the data into the bronze table, you maintain a full history of changes that have occurred over time, which satisfies the auditing requirement.
Comment 1337965 by mwynn
- Upvotes: 1
Selected Answer: E
If you are getting external CDC data (Kafka, etc.), there is no need for CDF. Just ingest to bronze with pipelines.reset.allowed = false.
Comment 1334751 by arekm
- Upvotes: 1
On top of that, CDC with CDF is not automatic. You still need SQL or Python to read the changes and put them somewhere.
Comment 1311825 by benni_ale
- Upvotes: 1
Selected Answer: E
You can only read the change data feed for enabled tables; you must explicitly enable the change data feed option, for example with TBLPROPERTIES (delta.enableChangeDataFeed = true). This means it is a Delta feature, in other words a feature supported by Delta tables. The data to process in the question is external, so it is not a Delta table, and therefore it can't be D. Hopefully I am correct.
Comment 1297457 by benni_ale
- Upvotes: 1
Selected Answer: E
E. Databricks CDF is not set up to process external CDC. If you have external CDC, you could send it to bronze for auditing purposes and use bronze to derive silver, where you have only valid records.
Comment 1285468 by databrick_work
- Upvotes: 1
E is correct
Comment 1130898 by spaceexplorer
- Upvotes: 2
Selected Answer: E
The answer is E
Comment 1119489 by RafaelCFC
- Upvotes: 1
Selected Answer: E
Complementing kz_data's response, be aware that the data being consumed is not a Databricks CDC data feed object, but rather CDC coming from somewhere else, that is, just regular data. So, indeed, it can't be processed without another tool.
Comment 1118573 by kz_data
- Upvotes: 2
Selected Answer: E
Answer E is correct: as the CDC captured from the external database may contain duplicates for the same pk_id (key) due to multiple updates within the processed hour, we need to take the most recent update for each pk_id and then MERGE into a silver table.
Comment 1117617 by a560fe1
- Upvotes: 2
CDF captures changes only from a Delta table and is only forward-looking once enabled. The CDC logs are being written to object storage, so you would need to ingest those and merge into downstream tables; hence the answer is E.
Comment 1086072 by hamzaKhribi
- Upvotes: 3
Selected Answer: D
For me the answer is D: the question states that CDC logs are emitted to external storage, meaning they can be ingested into the bronze layer into a table with CDF enabled. In this case we let Databricks handle the complexity of following changes and only worry about data quality; with CDF enabled, Databricks will already produce the audit data for us via table_changes with the pre-image and post-image, and also give us the last updated value for our use case. Here is a similar example: https://www.databricks.com/blog/2021/06/09/how-to-simplify-cdc-with-delta-lakes-change-data-feed.html
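The core of answer E, collapsing an hour's worth of CDC records to the most recent change per pk_id before merging into the silver table, can be sketched in plain Python (field names ts, op, and fields are assumptions; in a real pipeline this dedup would typically be a ROW_NUMBER window inside the MERGE INTO source query):

```python
# Reduce one hourly batch of CDC log records to the latest change per
# primary key. Each surviving record is what MERGE INTO would apply to
# silver: op == "delete" maps to WHEN MATCHED ... DELETE, otherwise upsert.
def latest_changes(records):
    """records: dicts with keys pk_id, ts (change time), op, fields."""
    latest = {}
    for rec in sorted(records, key=lambda r: r["ts"]):
        latest[rec["pk_id"]] = rec  # a later change overwrites an earlier one
    return list(latest.values())

# Four changes in one batch: pk 1 updated after insert, pk 2 then deleted.
cdc_batch = [
    {"pk_id": 1, "ts": 1, "op": "insert", "fields": {"rating": 3.0}},
    {"pk_id": 1, "ts": 2, "op": "update", "fields": {"rating": 4.5}},
    {"pk_id": 2, "ts": 1, "op": "insert", "fields": {"rating": 2.0}},
    {"pk_id": 2, "ts": 3, "op": "delete", "fields": None},
]
current_state = latest_changes(cdc_batch)
```

All four raw records would still land in the bronze table for the audit trail; only the two survivors drive the silver upsert, which is exactly the bronze/silver split answer E describes.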