Questions and Answers
Question aRwr8ByLVNMYRW3M0lQj
Question
A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure. The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications. The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields. Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?
Choices
- A: The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.
- B: Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.
- C: Human labor in writing code is the largest cost associated with data engineering workloads; as such, automating table declaration logic should be a priority in all migration workloads.
- D: Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.
- E: Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.
Answer: D
Answer_ET: D
Community answer: D (100%)
Discussion
Comment 1114380 by RafaelCFC
- Upvotes: 14
Selected Answer: D A is wrong because Tungsten is a project for improving Spark's efficiency in memory and CPU usage; B is wrong because Parquet does not support editing files in place, only overwrite and create operations; C is wrong because completely automating schema declaration for tables results in reduced predictability of data types and data quality; E is false because unlucky sampling can lead Spark to bad inferences.
Comment 1173179 by hal2401me
- Upvotes: 7
From my exam today: both C and D are no longer available, so they can't be correct. E and A are available. E states "always accurate", so I hesitated to choose it. There was a new option stating something like "Delta Lake indexes the first 32 columns in the Delta log for Z-ordering and optimization" (not sure I remember it exactly, but it looked correct as a statement), and I chose this "new" option, because it should impact the schema decision by putting high-usage fields among the first 32 columns.
Comment 1363435 by Tedet
- Upvotes: 1
Selected Answer: D Explanation: Databricks can infer schema when reading data, but automatic schema inference doesn’t always guarantee the accuracy of data types. For complex or highly-nested structures, schema inference might not always align with the actual data quality or the needs of downstream applications, and manual type definition ensures that the schema is more consistent and predictable. While automatic inference is useful for quick analysis or exploratory work, manual schema definition provides better data quality assurance in production workloads, especially when dealing with large, complex data structures. Conclusion: This statement correctly emphasizes the importance of manual schema declaration to ensure data quality enforcement and consistency, especially when dealing with complex structures. Best option.
Comment 1141038 by guillesd
- Upvotes: 1
Selected Answer: D Only answer that makes sense
Comment 1134100 by AziLa
- Upvotes: 1
Correct Ans is D
Comment 1044827 by sturcu
- Upvotes: 2
Selected Answer: D correct
Comment 1007012 by hammer_1234_h
- Upvotes: 2
D is correct. We can use schema hints to enforce the schema information that we know and expect on top of an inferred schema.
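A minimal DDL sketch of the approach answer D argues for (the field names below are hypothetical; the question does not list the actual 100 fields): types are declared explicitly for the nested fields that downstream applications actually consume, instead of being left to inference.

```sql
-- Hypothetical field names: the point is that each type is stated explicitly,
-- so records that do not match the declared types fail data quality enforcement
-- instead of being silently accommodated by a widened inferred type.
CREATE TABLE silver_device_recordings (
  device_id   STRING NOT NULL,
  recorded_at TIMESTAMP,
  readings    STRUCT<temperature: DOUBLE, humidity: DOUBLE>,
  metadata    MAP<STRING, STRING>
) USING DELTA;
```

When ingesting with Auto Loader, a middle ground is the `cloudFiles.schemaHints` option, which pins the types of known critical fields while letting the remaining, unused fields be inferred.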
Question XkEQx9YFLGPipAGsgml0
Question
The data engineering team maintains the following code: //IMG//
Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and validated, which statement describes what will occur when this code is executed?
Choices
- A: A batch job will update the enriched_itemized_orders_by_account table, replacing only those rows that have different values than the current version of the table, using accountID as the primary key.
- B: The enriched_itemized_orders_by_account table will be overwritten using the current valid version of data in each of the three tables referenced in the join logic.
- C: An incremental job will leverage information in the state store to identify unjoined rows in the source tables and write these rows to the enriched_itemized_orders_by_account table.
- D: An incremental job will detect if new rows have been written to any of the source tables; if new rows are detected, all results will be recalculated and used to overwrite the enriched_itemized_orders_by_account table.
- E: No computation will occur until enriched_itemized_orders_by_account is queried; upon query materialization, results will be calculated using the current valid version of data in each of the three tables referenced in the join logic.
Answer: B
Answer_ET: B
Community answer: B (100%)
Discussion
Comment 1334822 by arekm
- Upvotes: 1
Selected Answer: B B - it is a batch overwrite, which means: whatever was there is gone.
Comment 1302995 by nedlo
- Upvotes: 1
Selected Answer: B I agree. It cannot be E because the write itself is an action.
Comment 1270025 by AndreFR
- Upvotes: 1
Selected Answer: B B, because the code has .mode("Overwrite")
Comment 1223843 by imatheushenrique
- Upvotes: 1
B is correct
Comment 1134104 by AziLa
- Upvotes: 2
Correct Ans is B
Comment 1121967 by Jay_98_11
- Upvotes: 2
Selected Answer: B correct
Comment 1044830 by sturcu
- Upvotes: 3
Selected Answer: B B is correct
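The code in the //IMG// screenshot is not reproduced here, but based on the discussion (a batch write using .mode("Overwrite") over a three-table join), answer B corresponds to logic of roughly this shape; all table and column names below are hypothetical:

```sql
-- Complete batch recomputation: every run rebuilds the target from the
-- current valid version of the three source tables and overwrites it in
-- full. No state store or incremental change detection is involved; those
-- concepts belong to Structured Streaming, not a plain overwrite.
CREATE OR REPLACE TABLE enriched_itemized_orders_by_account AS
SELECT a.account_id,
       a.account_name,
       o.order_id,
       i.item_id,
       i.quantity
FROM accounts a
JOIN orders o      ON a.account_id = o.account_id
JOIN order_items i ON o.order_id = i.order_id;
```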
Question rWc51GIkEqmmtq58Q2kw
Question
The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used by production data engineering workloads, while silver tables will be used to support both data engineering and machine learning workloads. Gold tables will largely serve business intelligence and reporting purposes. While personal identifying information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for all data at the silver and gold levels. The organization is interested in reducing security concerns while maximizing the ability to collaborate across diverse teams. Which statement exemplifies best practices for implementing this system?
Choices
- A: Isolating tables in separate databases based on data quality tiers allows for easy permissions management through database ACLs and allows physical separation of default storage locations for managed tables.
- B: Because databases on Databricks are merely a logical construct, choices around database organization do not impact security or discoverability in the Lakehouse.
- C: Storing all production tables in a single database provides a unified view of all data assets available throughout the Lakehouse, simplifying discoverability by granting all users view privileges on this database.
- D: Working in the default Databricks database provides the greatest security when working with managed tables, as these will be created in the DBFS root.
- E: Because all tables must live in the same storage containers used for the database they’re created in, organizations should be prepared to create between dozens and thousands of databases depending on their data isolation requirements.
Answer: A
Answer_ET: A
Community answer: A (100%)
Discussion
Comment 1334824 by arekm
- Upvotes: 2
Selected Answer: A
A - the most logical option.
B - a database is a logical construct, but under default settings tables are stored where the database is, so there is a security component to the choice.
C - it is never a good idea to store everything in one database, since databases let you group tables with a similar area of interest and manage permissions for them (like creating groups in Entra and assigning permissions to groups).
D - using a non-default database does not mean you cannot use managed tables, and you can still specify your own location; storing anything in the DBFS root is not a good idea anyway - Databricks even suggests using workspace files for code, not to mention data.
E - thousands of databases is nonsense; you can specify the location of individual tables.
Comment 1231912 by strayda
- Upvotes: 1
Selected Answer: A The most logical answer is A
Comment 1223844 by imatheushenrique
- Upvotes: 2
A is correct
Comment 1149885 by ojudz08
- Upvotes: 1
Selected Answer: A answer is A
Comment 1134147 by AziLa
- Upvotes: 2
Correct Ans is A
Comment 1080126 by Enduresoul
- Upvotes: 2
Selected Answer: A A is correct
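The practice described in answer A can be sketched as follows; the storage paths and group name below are hypothetical:

```sql
-- One database per data quality tier, each with its own default storage
-- location, so managed tables in different tiers are physically separated.
CREATE DATABASE bronze_db LOCATION 'abfss://bronze@storageacct.dfs.core.windows.net/';
CREATE DATABASE silver_db LOCATION 'abfss://silver@storageacct.dfs.core.windows.net/';

-- Permissions are then managed once per tier via database ACLs, e.g. giving
-- the ML team read access to silver without exposing raw PII in bronze.
GRANT USAGE, SELECT ON DATABASE silver_db TO `data-science-team`;
```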
Question 5e6xud5YDaQ5cXy0TSTv
Question
The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables. Which approach will ensure that this requirement is met?
Choices
- A: Whenever a database is being created, make sure that the LOCATION keyword is used
- B: When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.
- C: Whenever a table is being created, make sure that the LOCATION keyword is used.
- D: When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.
- E: When the workspace is being configured, make sure that external cloud object storage has been mounted.
Answer: C
Answer_ET: C
Community answer: C (90%), other (10%)
Discussion
Comment 1325620 by Sriramiyer92
- Upvotes: 2
Selected Answer: C Note: the EXTERNAL keyword is not mandatory; LOCATION is. Its presence implies that the table is external.
Comment 1324742 by carah
- Upvotes: 3
Selected Answer: C A is not correct. Consider a schema created with a LOCATION:
CREATE SCHEMA my_schema LOCATION 's3://…/my_schema';
Table location scenarios:
Table without LOCATION:
CREATE TABLE my_schema.my_table (id INT);
The table will be stored in the default warehouse directory (e.g., dbfs:/user/hive/warehouse/), not the schema's LOCATION.
Table with explicit LOCATION: if you want the table to be stored under the schema's LOCATION, you need to specify the location explicitly:
CREATE TABLE my_schema.my_table (id INT) LOCATION 's3://…/my_schema/my_table/';
So, if you want all tables under the schema to use the schema's LOCATION, explicitly specify the LOCATION for each table during creation.
Comment 1308645 by y2kal
- Upvotes: 1
It should be A, as the question states "all tables". Once an external DB is created, all the tables in it would by default be external.
Comment 1306496 by akashdesarda
- Upvotes: 1
Selected Answer: A A is correct. If a database is created using the LOCATION keyword, then by default all the tables created in it will use that location. They follow the …/_unity_catalog/tables/… path pattern.
Comment 1154661 by leopedroso1
- Upvotes: 2
C is the correct answer. According to the documentation, only LOCATION is needed to make a table external. Moreover, we can assume the keyword EXTERNAL is optional in the SQL statement.
https://docs.databricks.com/en/sql/language-manual/sql-ref-external-tables.html
Comment 1146837 by CY
- Upvotes: 3
'A' seems more appropriate. All the tables in the Lakehouse should be marked as external, which can be achieved using the LOCATION keyword at the database level instead of on each individual table.
Comment 1105614 by Yogi05
- Upvotes: 2
Why not D? I know C and D amount to the same thing, but D seems more precise.
Comment 1072451 by Laraujo2022
- Upvotes: 1
If you set a location at the database level, all tables under that database are automatically external tables, so in my opinion A is correct.
Comment 1057414 by Quadronoid
- Upvotes: 4
Selected Answer: C C is correct. The LOCATION keyword should be in the table's CREATE script.
Comment 1052836 by mouad_attaqi
- Upvotes: 3
C is correct; the keyword to be used is LOCATION, and the keyword EXTERNAL is optional.
Comment 1049282 by chokthewa
- Upvotes: 2
The correct answer is D.
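As most of the discussion concludes, it is the LOCATION clause that makes a table external; the EXTERNAL keyword is optional when LOCATION is present. A minimal sketch (the path is hypothetical):

```sql
-- External Delta table: the data lives at the given path and is not deleted
-- when the table is dropped. Writing EXTERNAL is allowed but redundant here,
-- since providing LOCATION already makes the table external.
CREATE TABLE sensor_events (
  id      INT,
  payload STRING
) USING DELTA
LOCATION 's3://my-bucket/tables/sensor_events';
```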
Question k4Edq9ImnquKaHGz0RZN
Question
To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-facing applications, production machine learning models, and ad hoc analytical queries. The data engineering team has been made aware of new requirements from a customer-facing application, which is the only downstream workload they manage entirely. As a result, an aggregate table used by numerous teams across the organization will need to have a number of fields renamed, and additional fields will also be added. Which of the solutions addresses the situation while minimally interrupting other teams in the organization without increasing the number of tables that need to be managed?
Choices
- A: Send all users notice that the schema for the table will be changing; include in the communication the logic necessary to revert the new table schema to match historic queries.
- B: Configure a new table with all the requisite fields and new names and use this as the source for the customer-facing application; create a view that maintains the original data schema and table name by aliasing select fields from the new table.
- C: Create a new table with the required schema and new fields and use Delta Lake’s deep clone functionality to sync up changes committed to one table to the corresponding table.
- D: Replace the current table definition with a logical view defined with the query logic currently writing the aggregate table; create a new table to power the customer-facing application.
- E: Add a table comment warning all users that the table schema and field names will be changing on a given date; overwrite the table in place to the specifications of the customer-facing application.
Answer: B
Answer_ET: B
Community answer: B (59%), D (37%), other (5%)
Discussion
Comment 1141070 by guillesd
- Upvotes: 7
Selected Answer: B B makes way more sense. The number of tables managed does not increase, since the old table won't be used anymore, and the view on top of the new table is not another table to manage; it just maintains the "original API" of the table to avoid breaking changes in downstream applications.
Comment 1172555 by alexvno
- Upvotes: 7
Selected Answer: D Create a view. It can't be B, as the question says "without increasing the number of tables that need to be managed".
Comment 1359147 by EZZALDIN
- Upvotes: 1
Selected Answer: B it recommends creating a view that maps the new table back to the original schema. This view lets other teams continue using the table as they always have, with no changes to their queries.
Comment 1339913 by RandomForest
- Upvotes: 1
Selected Answer: B This approach achieves the following key goals:
- Minimizes Disruption: By creating a view that mirrors the original schema, existing workloads that depend on the current schema remain uninterrupted. Other teams can continue their queries without needing to adjust their logic for the schema change.
- Meets New Requirements: The new table accommodates the changes required by the customer-facing application, ensuring that the application’s updated requirements are fulfilled.
- Avoids Table Duplication: Instead of maintaining multiple tables for the same dataset, this approach uses a combination of a new table and a view, reducing the overall management burden.
- Flexibility for Future Changes: Views can be adjusted as needed, providing a layer of abstraction. Future schema updates can be handled similarly without directly impacting dependent systems.
Comment 1335222 by HairyTorso
- Upvotes: 1
Selected Answer: B Create a view → the number of tables stays the same. Option D has overhead.
Comment 1334828 by arekm
- Upvotes: 1
Selected Answer: B B - but I was wondering between B & D.
I do not like D, since you replace the table with a view (query costs, plus you need to change the currently working workflow). Additionally, you create a table that does a similar thing to the view - why?
Comment 1317128 by Nicks_name
- Upvotes: 1
E. Not D because converting the aggregate table into a view might introduce performance overhead, as every access now potentially involves running complex query logic to reconstruct the desired dataset on the fly. This might not be ideal for performance-sensitive applications like business intelligence dashboards.
Comment 1312853 by vish007
- Upvotes: 2
Selected Answer: B Option D will increase the compute cost significantly, as all the downstream teams will run the view that contains the aggregate table's logic. Option B makes more sense, with less impact on storage and compute cost, which is the original ask for the data engineering team in the question.
Comment 1309919 by benni_ale
- Upvotes: 1
Selected Answer: D I am not sure whether B or D. I believe B increases the number of managed tables, as it states that a CREATE TABLE statement is run before a CREATE VIEW, and the fact that the CREATE VIEW will replace the current table is not really specified. Still, one could argue it would be dumb not to do it, but at this point I would say that D is more precise.
Comment 1308870 by b.b.da.costa
- Upvotes: 1
The problem with this question is whether the order of operations matters.
B: Create a table, then create a view. Teams are interrupted after the creation of the table. D: Create a view, then create a table. Teams are not interrupted because they are consuming the view first.
Comment 1307530 by kimberlyvsmith
- Upvotes: 1
Selected Answer: B B is correct. It does not create additional tables; the view mimics the old schema so as not to interrupt downstream consumers, and it ensures the aggregates are persisted to save on compute.
D is incorrect mostly because the aggregates are baked into the view, which is not optimal: each time downstream users query the view, the joins and aggregates have to be recomputed.
Comment 1295720 by shaojunni
- Upvotes: 1
Selected Answer: D D will not increase the number of tables. It will create a new table and replace the aggregation table with a view. B will create a new table and a new view matching the old table name and schema, with the aggregation table still there.
Comment 1283497 by KB_Ai_Champ
- Upvotes: 2
Option D is correct. Docs: https://docs.databricks.com/en/delta/update-schema.html. Also, the question specifically says they don't want to increase the number of managed tables!
Comment 1270071 by AndreFR
- Upvotes: 1
Selected Answer: B B is correct: no new tables, and minimal interruption of other teams in the organization. A and E are excluded because they interrupt other teams in the organisation; usually, answers that require user communication are wrong. C is excluded because deep clone is used for table creation, not after creation. D is excluded because it increases the number of tables.
Comment 1270030 by fe3b2fc
- Upvotes: 2
Selected Answer: A B, C, and D all state creating a new table, therefore increasing the number of tables to manage. This is exactly what the question says to avoid.
“minimally interrupting other teams in the organization without increasing the number of tables that need to be managed”
Answer A is the only one that makes sense and is pretty standard operation procedure for databases. E is wrong because you would never update a column comment to inform users of anything.
Comment 1260288 by faraaz132
- Upvotes: 2
Selected Answer: B B is correct. Why not D: because replacing the current table with a view creates an interruption, and the question asks for minimal interruption.
Comment 1248749 by pravieee
- Upvotes: 2
Selected Answer: B I would go for B.
With option B you will run the aggregations once and store them in a table, then present these aggregations in the old schema through a view.
With D the aggregations will be done twice: once for the old-schema view and once for the new table.
Comment 1202409 by ThoBustos
- Upvotes: 1
Selected Answer: B To me it's B, because by creating a new table plus the view that substitutes for the previous table, we still have one table. It seems to be the most efficient way to solve this. Not 100% sure though.
Comment 1173180 by hal2401me
- Upvotes: 2
Selected Answer: D In my exam today I chose D.
Comment 1118257 by IWantCerts
- Upvotes: 2
Selected Answer: B I think it’s B. D replaces original table definition with a view, which will run up compute costs for queries using the table.
Comment 1100122 by aksand13
- Upvotes: 4
Selected Answer: D D. B has both a new table and a view created.
Comment 1057432 by Quadronoid
- Upvotes: 1
Selected Answer: B B is definitely the best option
Comment 1049968 by chokthewa
- Upvotes: 2
B is suitable: it doesn't interrupt the end users and is handled purely in technical terms. The technical team will create a view with the field mapping.
Comment 1044835 by sturcu
- Upvotes: 1
Selected Answer: B B is correct.
Comment 1025261 by TheGhost21
- Upvotes: 1
Answer is D
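The winning answer B can be sketched as follows (table and field names are hypothetical): a new table carries the renamed and added fields for the customer-facing application, while a view keeps the original table name and schema for everyone else.

```sql
-- New table with the renamed and additional fields, populated by the same
-- aggregation logic as before and used by the customer-facing application.
CREATE TABLE enriched_orders_by_account_v2 (
  customer_account_id BIGINT,         -- renamed from accountID
  order_total         DECIMAL(18, 2),
  loyalty_tier        STRING          -- newly added field
) USING DELTA;

-- View that preserves the original table name and schema by aliasing the
-- renamed fields (created after dropping or renaming the original table),
-- so existing queries across the organization keep working unchanged.
CREATE OR REPLACE VIEW enriched_itemized_orders_by_account AS
SELECT customer_account_id AS accountID,
       order_total
FROM enriched_orders_by_account_v2;
```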