Questions and Answers
Question S9crePGbIfGIhbaKyCBi
Question
A Delta Lake table was created with the query below: //IMG//
Realizing that the original query had a typographical error, the following code was executed: ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store Which result will occur after running the second command?
Choices
- A: The table reference in the metastore is updated and no data is changed.
- B: The table name change is recorded in the Delta transaction log.
- C: All related files and metadata are dropped and recreated in a single ACID transaction.
- D: The table reference in the metastore is updated and all data files are moved.
- E: A new Delta transaction log is created for the renamed table.
answer?
Answer: A Answer_ET: A Community answer A (100%) Discussion
Comment 1167199 by hal2401me
- Upvotes: 8
Selected Answer: A I did a test: no data is changed, and neither the directory nor the file names change. The rename is not recorded in the transaction log either.
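For reference, a minimal sketch of a check like the one described above (it assumes the prod schema and table exist and that you run it in a Spark session attached to a metastore):

```python
# Hedged sketch of the test described above; assumes prod.sales_by_stor exists.
spark.sql("ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store")

# The new name resolves via the metastore, but the 'location' column is
# unchanged and no new commit appears for the rename itself.
spark.sql("DESCRIBE DETAIL prod.sales_by_store").select("location").show(truncate=False)
spark.sql("DESCRIBE HISTORY prod.sales_by_store").select("version", "operation").show()
```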
Comment 1166820 by Tamele001
- Upvotes: 2
B is the correct answer. When you alter a table name in Delta Lake, the change is logged in the transaction log that Delta Lake uses to maintain a versioned history of all changes to the table. This is how Delta Lake maintains ACID properties and ensures a consistent view of the data. The transaction log is key to supporting features like time travel, auditing, and rollbacks in Delta Lake. The metadata and the actual data remain intact, and the reference to the table in the metastore is updated to reflect the new name.
Comment 1135723 by adenis
- Upvotes: 4
Selected Answer: A A is Correct
Question AEGUCcKZpEIGyDwGTOGW
Question
The security team is exploring whether or not the Databricks secrets module can be leveraged for connecting to an external database. After testing the code with all Python variables defined as strings, they upload the password to the secrets module and configure the correct permissions for the currently active user. They then modify their code to the following (leaving all other variables unchanged). //IMG//
Which statement describes what will happen when the above code is executed?
Choices
- A: The connection to the external table will fail; the string “REDACTED” will be printed.
- B: An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the encoded password will be saved to DBFS.
- C: An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the password will be printed in plain text.
- D: The connection to the external table will succeed; the string value of password will be printed in plain text.
- E: The connection to the external table will succeed; the string “REDACTED” will be printed.
answer?
Answer: E Answer_ET: E Community answer E (100%) Discussion
Comment 1294571 by benni_ale
- Upvotes: 1
Selected Answer: E
Comment 1290646 by akashdesarda
- Upvotes: 4
Selected Answer: E Whatever we read using the dbutils.secrets module is always printed as "[REDACTED]", but when consumed in code, the underlying values are passed.
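A hedged illustration of that behavior, assuming a Databricks notebook where dbutils and spark are available; the scope, key, host, table, and user names below are placeholders, not values from the question's image:

```python
# Placeholder scope/key; they would be created beforehand with the secrets CLI.
password = dbutils.secrets.get(scope="jdbc-scope", key="db-password")

print(password)  # notebook output is redacted: [REDACTED]

# The same variable carries the real value when passed to an API, so the
# JDBC read below succeeds (connection details are placeholders).
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://<host>:5432/sales")
      .option("dbtable", "transactions")
      .option("user", "reporting_user")
      .option("password", password)
      .load())
```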
Comment 1224437 by imatheushenrique
- Upvotes: 1
E. The connection to the external table will succeed; the string “REDACTED” will be printed.
Comment 1144395 by PrashantTiwari
- Upvotes: 1
E is correct
Comment 1128062 by AziLa
- Upvotes: 1
correct ans is E
Comment 1121593 by Jay_98_11
- Upvotes: 2
Selected Answer: E E is correct
Comment 1114981 by ATLTennis
- Upvotes: 2
Selected Answer: E E is correct
Comment 1102676 by kz_data
- Upvotes: 2
Selected Answer: E Correct answer E
Comment 1040236 by sturcu
- Upvotes: 4
Selected Answer: E Correct: https://docs.databricks.com/en/external-data/jdbc.html
Comment 980601 by Brian9
- Upvotes: 3
https://learn.microsoft.com/en-us/azure/databricks/security/secrets/redaction
Option E - which is selected answer seems correct.
Comment 973574 by 8605246
- Upvotes: 1
This option is correct: although the password won't be printed out, the connection will still succeed.
Question UCAiGnwsC7FtrRe2PPSq
Question
The data engineering team maintains a table of aggregate statistics through nightly batch updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods, including the 7 previous days, year-to-date, and quarter-to-date. This table is named store_sales_summary and the schema is as follows: //IMG//
The table daily_store_sales contains all the information needed to update store_sales_summary. The schema for this table is: store_id INT, sales_date DATE, total_sales FLOAT If daily_store_sales is implemented as a Type 1 table and the total_sales column might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in the store_sales_summary table?
Choices
- A: Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each update.
- B: Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.
- C: Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.
- D: Implement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.
- E: Use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.
answer?
Answer: C Answer_ET: C Community answer C (49%) A (43%) 9% Discussion
Comment 1004270 by hammer_1234_h
- Upvotes: 12
The answer should be A. It is the safest way to generate accurate reports.
Comment 1410848 by ultimomassimo
- Upvotes: 1
People who claim a Type 1 SCD is not maintained by upsert need to do some more reading before posting anything here…
Comment 1366256 by Jamuro
- Upvotes: 1
Selected Answer: C It makes no sense to recalculate all aggregations over all historical data and then overwrite; that leads to a non-scalable solution, because all data would have to be recalculated just to avoid "deleting" rows in the target table. Option C lets us filter by date ranges or even partitions to merge only the target periods and get the same results in a much more performant, safer, and cheaper way.
Comment 1328241 by AlejandroU
- Upvotes: 1
Selected Answer: C The answer is C. Option A is correct in ensuring accuracy, as it recalculates the entire store_sales_summary table based on the full historical data in daily_store_sales. However, it is computationally expensive and may not scale well. Option C (upsert logic) could be a better choice in most real-world scenarios, as it focuses only on the records that have changed, reducing computational costs and minimizing disruption for downstream systems.
Comment 1326022 by Sriramiyer92
- Upvotes: 3
Selected Answer: A It is SCD Type 1, so clearly append and upsert logic should not be used.
Comment 1322419 by benni_ale
- Upvotes: 1
Selected Answer: A A as the table does not require history
Comment 1296208 by shaojunni
- Upvotes: 1
Selected Answer: A daily_store_sales is a Type 1 table; no history is maintained. You have to treat every record as a new record and do the aggregation for every store. Overwrite is much more efficient than upsert.
Comment 1236963 by Ati1362
- Upvotes: 3
Selected Answer: C I will go with c. upsert
Comment 1220888 by MDWPartners
- Upvotes: 4
Selected Answer: C A is not correct because the table is daily. If you overwrite, you delete all history. You need to insert/update to keep history.
Comment 1195899 by ThoBustos
- Upvotes: 4
Selected Answer: A Not sure if that’s right but I would go for A. What do you think?
Type 1: data is overwritten. Type 2: history is maintained; new data is inserted as new rows. Type 3: stores two versions per record, a previous and a current value.
- A: batch + overwrite → matches Type 1 requirements. YES
- B: batch + append new rows → would be for Type 2. NO
- C: batch + upsert → data is not being overwritten (which is required for Type 1). NO
- D: readStream + upsert → data is not being overwritten (which is required for Type 1). NO
- E: Change Data Feed to update → problem is manual edits + not overwriting (required for Type 1). NO
I have doubts around "which approach is the safest". Maybe because manual changes make it hard to track changes or do an upsert, overwriting is safer for making sure the stats are right.
Comment 1169411 by vikram12apr
- Upvotes: 2
Selected Answer: C Not A, because overwriting will only provide the daily data, not its history. Not B, because it will not fix the issue of incorrect sales amounts. This data fits batch processing, so neither D nor E. C will upsert only the changes while making sure we update records based on sales_date and store_id.
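For reference, a minimal sketch of that upsert (option C) with the Delta Lake Python API; the aggregate query is illustrative, since the real store_sales_summary columns are only shown in the schema image, so total_sales_ytd here is a stand-in:

```python
from delta.tables import DeltaTable

# Illustrative aggregate keyed by store_id; the real summary columns
# (previous day, 7-day, quarter-to-date, ...) are in the image above.
updates = spark.sql("""
    SELECT store_id, SUM(total_sales) AS total_sales_ytd
    FROM daily_store_sales
    WHERE sales_date >= DATE_TRUNC('year', current_date())
    GROUP BY store_id
""")

# Upsert: update existing store rows, insert any new stores.
(DeltaTable.forName(spark, "store_sales_summary").alias("t")
    .merge(updates.alias("u"), "t.store_id = u.store_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```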
Comment 1136213 by Rinscy
- Upvotes: 1
E, definitely, because it says that the total_sales column may be changed by manual auditing, not via a job, so streaming with CDF is the only option here!
Comment 1135999 by Somesh512
- Upvotes: 3
Selected Answer: A I would go with option A. Because there is manual auditing, values can change. It uses Type 1, hence replace the original data.
Comment 1132830 by spaceexplorer
- Upvotes: 1
Selected Answer: E It should be E, as Structured Streaming has built-in fault tolerance.
Comment 1132061 by Rinscy
- Upvotes: 2
It says Type 1, so A is the correct answer!
Comment 1110895 by divingbell17
- Upvotes: 2
The question is unclear whether the aggregated table needs to support a rolling history. Note the aggregated table does not have a date column to distinguish which date the summary is generated for, so one could assume the table is maintained only for the current snapshot.
Assuming the above - A would be the safest option as all stores and aggregates would need to be refreshed nightly
Comment 1108032 by dmov
- Upvotes: 3
Selected Answer: A A is correct because it's a static table that is written nightly through a batch job. The summary table does not maintain history, so an upsert results in extra, unnecessary records. Overwrite it nightly with updated aggregates for the required time periods.
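A hedged sketch of that overwrite pattern (option A); the aggregate below is a stand-in for the real logic, since the summary schema is only shown in the image:

```python
# Recompute the aggregates from the full daily_store_sales table, then
# atomically replace the summary table (Delta table overwrites are ACID).
summary = spark.sql("""
    SELECT store_id,
           SUM(CASE WHEN sales_date = date_sub(current_date(), 1)
                    THEN total_sales END) AS total_sales_previous_day,
           SUM(total_sales) AS total_sales_ytd
    FROM daily_store_sales
    WHERE sales_date >= DATE_TRUNC('year', current_date())
    GROUP BY store_id
""")

summary.write.format("delta").mode("overwrite").saveAsTable("store_sales_summary")
```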
Comment 1107219 by Luv4data
- Upvotes: 3
The answer is A. Note that the target table has columns that store quarter-to-date, previous-day sales, etc., which will result in daily updates, i.e., a large volume of records will be updated, hence it is better to overwrite than to update a large volume of records.
Comment 1099520 by alexvno
- Upvotes: 3
Selected Answer: C Batch processing so you need to update and insert - C
Comment 1080817 by Enduresoul
- Upvotes: 3
Selected Answer: C Answer C is correct. Answer E would do the job too, but the table schema and the question indicate that only one update is needed daily. Therefore a Structured Streaming job is way too expensive to achieve the outcome.
Comment 1062321 by Syd
- Upvotes: 1
Correct answer A Type 1 data is overwritten https://streamsets.com/blog/slowly-changing-dimensions-vs-change-data-capture/#:~:text=In%20Type%201%2C%20any%20new,change%20to%20maintain%20a%20history.
Comment 1044902 by sturcu
- Upvotes: 2
Selected Answer: E I would say that it is E. If the daily_store_sales table is implemented as a Type 1 table, this means that values are overwritten and we do not keep the history. So we would need to create a stream from CDF and apply those changes to the aggregated table.
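For what option E would look like, a hedged sketch; it assumes the table property delta.enableChangeDataFeed was set to true on daily_store_sales before the changes of interest:

```python
# Stream the change data feed; each row carries a _change_type column
# (insert, update_preimage, update_postimage, delete) that downstream
# logic would use to adjust the aggregates.
cdf = (spark.readStream
       .format("delta")
       .option("readChangeFeed", "true")
       .table("daily_store_sales"))
```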
Question tmkarXIgodNeHCWFJwEv
Question
A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented.
//IMG//
Which command should be removed from the notebook before scheduling it as a job?
Choices
- A: Cmd 2
- B: Cmd 3
- C: Cmd 4
- D: Cmd 5
- E: Cmd 6
answer?
Answer: E Answer_ET: E Community answer E (89%) 11% Discussion
Comment 1084674 by petrv
- Upvotes: 9
Selected Answer: E When scheduling a Databricks notebook as a job, it’s generally recommended to remove or modify commands that involve displaying output, such as using the display() function. Displaying data using display() is an interactive feature designed for exploration and visualization within the notebook interface and may not work well in a production job context.
The finalDF.explain() command, which provides the execution plan of the DataFrame transformations and actions, is often useful for debugging and optimizing queries. While it doesn’t display interactive visualizations like display(), it can still be informative for understanding how Spark is executing the operations on your DataFrame.
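A hedged stand-in for the contrast petrv describes (the real Cmd 2–6 are only in the image; finalDF below is hypothetical, and display() assumes a Databricks notebook context):

```python
# Hypothetical stand-in for the notebook's finalDF.
finalDF = spark.range(10).withColumnRenamed("id", "value")

finalDF.explain()   # analogue of Cmd 5: prints the query plan, harmless in a job
display(finalDF)    # analogue of Cmd 6: notebook-only rendering; remove before scheduling
```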
Comment 1303045 by Carkeys
- Upvotes: 1
Selected Answer: D Cmd 5 (finalDF.explain()) is used for debugging and understanding the logical and physical plans of a DataFrame. It provides insights into how Spark plans to execute the query but does not produce output that is necessary for the scheduled job. Including this command in a scheduled job is unnecessary and could clutter the job logs without adding value to the final output.
Comment 1299625 by benni_ale
- Upvotes: 1
Selected Answer: E If there were multiple solutions, then I would have gone for the .explain() method and printSchema() as well, since they do not contribute to any sort of ETL operation, but as a rule of thumb display() should always be omitted first, so → E
Comment 1265803 by 71dfab9
- Upvotes: 1
Selected Answer: E I agree with petrv and KhoaLe, but I will add that not displaying finalDF would be wise, as it could display and log PII data, and that to me is why I chose E. Like hal2401 said, commands 2, 5 & 6 can be removed as they don't manipulate the data.
Comment 1159782 by hal2401me
- Upvotes: 1
Selected Answer: E Perhaps it's a multiple-selection question in the exam; I'd select E and D. If single choice, then E.
Comment 1143938 by KhoaLe
- Upvotes: 2
Selected Answer: E Looking through all the steps, Cmd 2, 5, and 6 can be eliminated without impacting the whole process. However, in terms of duration cost, Cmd 2 and 5 do not have much impact, as they only show the current logical query plan. In contrast, display() in Cmd 6 is an action that actually executes the query, which will take much more time to run.
Comment 1099521 by alexvno
- Upvotes: 3
Selected Answer: E No display()
Comment 1070832 by 60ties
- Upvotes: 1
Selected Answer: D No actions on production scripts. D is best
Comment 1061164 by Karen1232123
- Upvotes: 2
Why not D?
Question HLH8a1LIrgeotGnTdQ8N
Question
The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads the data for their dashboards runs in a total of 10 minutes.
Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?
Choices
- A: Manually trigger a job anytime the business reporting team refreshes their dashboards
- B: Schedule a job to execute the pipeline once an hour on a new job cluster
- C: Schedule a Structured Streaming job with a trigger interval of 60 minutes
- D: Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster
- E: Configure a job that executes every time new data lands in a given directory
answer?
Answer: B Answer_ET: B Community answer B (87%) 13% Discussion
Comment 1110920 by divingbell17
- Upvotes: 10
Selected Answer: B B is correct, I think. With option C, the cluster remains on 24/7 with a 60-minute trigger, which is more costly.
If there were an option with Structured Streaming with trigger = availableNow and the job scheduled hourly, that would be even more efficient. https://www.databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
Comment 1269666 by robodog
- Upvotes: 1
Selected Answer: C C. The lowest cost is obtained by using a job cluster.
Comment 1160110 by Curious76
- Upvotes: 2
Selected Answer: C Databricks recommends using Structured Streaming with trigger AvailableNow for incremental workloads that do not have low latency requirements.
Comment 1131149 by spaceexplorer
- Upvotes: 4
Selected Answer: B B is correct
Comment 1099529 by alexvno
- Upvotes: 4
Selected Answer: B B: Job cluster is cheap; hourly = 60 minutes.
Comment 1076493 by aragorn_brego
- Upvotes: 2
Selected Answer: B Scheduling a job to execute the pipeline on an hourly basis aligns with the requirement for data to be updated every hour. Using a job cluster (which is brought up for the job and torn down upon completion) rather than a dedicated interactive cluster will usually be more cost-effective. This is because you are only paying for the compute resources when the job is running, which is 10 minutes out of every hour, rather than paying for an interactive cluster that would be up and running (and incurring costs) continuously.
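As a hedged sketch of what option B's configuration might look like, using Databricks Jobs API 2.1 fields; the job name, notebook path, node type, and runtime version below are placeholders:

```python
# Payload for POST /api/2.1/jobs/create; an ephemeral cluster is created per
# run and torn down when the ~10-minute pipeline finishes, so compute is
# only billed for the time the job actually runs.
job_settings = {
    "name": "hourly-dashboard-etl",
    "schedule": {
        "quartz_cron_expression": "0 0 * * * ?",  # top of every hour
        "timezone_id": "UTC",
    },
    "tasks": [{
        "task_key": "etl",
        "notebook_task": {"notebook_path": "/Repos/reporting/dashboard_etl"},
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
    }],
}
```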
Comment 1072406 by ofed
- Upvotes: 1
It’s either B or D. I think B, because we want the lowest cost.