Questions and Answers
Question S9crePGbIfGIhbaKyCBi
Question
A Delta Lake table was created with the query below: //IMG//
Realizing that the original query had a typographical error, the following code was executed: ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store Which result will occur after running the second command?
Choices
- A: The table reference in the metastore is updated and no data is changed.
- B: The table name change is recorded in the Delta transaction log.
- C: All related files and metadata are dropped and recreated in a single ACID transaction.
- D: The table reference in the metastore is updated and all data files are moved.
- E: A new Delta transaction log is created for the renamed table.
answer?
Answer: A Answer_ET: A Community answer A (100%) Discussion
Comment 1167199 by hal2401me
- Upvotes: 8
Selected Answer: A I did a test: no data is changed, and neither the directory nor the file names change. The rename is not recorded in the transaction log either.
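For reference, a minimal sketch of a check like the one described above (it assumes the prod schema and table exist and that you run it in a Spark session attached to a metastore):

```python
# Hedged sketch of the test described above; assumes prod.sales_by_stor exists.
spark.sql("ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store")

# The new name resolves via the metastore, but the 'location' column is
# unchanged and no new commit appears for the rename itself.
spark.sql("DESCRIBE DETAIL prod.sales_by_store").select("location").show(truncate=False)
spark.sql("DESCRIBE HISTORY prod.sales_by_store").select("version", "operation").show()
```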
Comment 1166820 by Tamele001
- Upvotes: 2
B is the correct answer. When you alter a table name in Delta Lake, the change is logged in the transaction log that Delta Lake uses to maintain a versioned history of all changes to the table. This is how Delta Lake maintains ACID properties and ensures a consistent view of the data. The transaction log is key to supporting features like time travel, auditing, and rollbacks in Delta Lake. The metadata and the actual data remain intact, and the reference to the table in the metastore is updated to reflect the new name.
Comment 1135723 by adenis
- Upvotes: 4
Selected Answer: A A is Correct
Question AEGUCcKZpEIGyDwGTOGW
Question
The security team is exploring whether or not the Databricks secrets module can be leveraged for connecting to an external database. After testing the code with all Python variables defined as strings, they upload the password to the secrets module and configure the correct permissions for the currently active user. They then modify their code to the following (leaving all other variables unchanged). //IMG//
Which statement describes what will happen when the above code is executed?
Choices
- A: The connection to the external table will fail; the string “REDACTED” will be printed.
- B: An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the encoded password will be saved to DBFS.
- C: An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the password will be printed in plain text.
- D: The connection to the external table will succeed; the string value of password will be printed in plain text.
- E: The connection to the external table will succeed; the string “REDACTED” will be printed.
answer?
Answer: E Answer_ET: E Community answer E (100%) Discussion
Comment 1294571 by benni_ale
- Upvotes: 1
Selected Answer: E
Comment 1290646 by akashdesarda
- Upvotes: 4
Selected Answer: E Whatever we read using the dbutils.secrets module is always printed as "[REDACTED]", but when consumed in code, the underlying values are passed.
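A hedged illustration of that behavior, assuming a Databricks notebook where dbutils and spark are available; the scope, key, host, table, and user names below are placeholders, not values from the question's image:

```python
# Placeholder scope/key; they would be created beforehand with the secrets CLI.
password = dbutils.secrets.get(scope="jdbc-scope", key="db-password")

print(password)  # notebook output is redacted: [REDACTED]

# The same variable carries the real value when passed to an API, so the
# JDBC read below succeeds (connection details are placeholders).
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://<host>:5432/sales")
      .option("dbtable", "transactions")
      .option("user", "reporting_user")
      .option("password", password)
      .load())
```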
Comment 1224437 by imatheushenrique
- Upvotes: 1
E. The connection to the external table will succeed; the string “REDACTED” will be printed.
Comment 1144395 by PrashantTiwari
- Upvotes: 1
E is correct
Comment 1128062 by AziLa
- Upvotes: 1
correct ans is E
Comment 1121593 by Jay_98_11
- Upvotes: 2
Selected Answer: E E is correct
Comment 1114981 by ATLTennis
- Upvotes: 2
Selected Answer: E E is correct
Comment 1102676 by kz_data
- Upvotes: 2
Selected Answer: E Correct answer E
Comment 1040236 by sturcu
- Upvotes: 4
Selected Answer: E Correct: https://docs.databricks.com/en/external-data/jdbc.html
Comment 980601 by Brian9
- Upvotes: 3
https://learn.microsoft.com/en-us/azure/databricks/security/secrets/redaction
Option E - which is selected answer seems correct.
Comment 973574 by 8605246
- Upvotes: 1
This option is correct: although the password won't be printed out, the connection will still succeed.
Question UCAiGnwsC7FtrRe2PPSq
Question
The data engineering team maintains a table of aggregate statistics through nightly batch updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods, including the 7 previous days, year-to-date, and quarter-to-date. This table is named store_sales_summary and the schema is as follows: //IMG//
The table daily_store_sales contains all the information needed to update store_sales_summary. The schema for this table is: store_id INT, sales_date DATE, total_sales FLOAT If daily_store_sales is implemented as a Type 1 table and the total_sales column might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in the store_sales_summary table?
Choices
- A: Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each update.
- B: Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.
- C: Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.
- D: Implement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.
- E: Use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.
answer?
Answer: C Answer_ET: C Community answer C (49%) A (43%) 9% Discussion
Comment 1004270 by hammer_1234_h
- Upvotes: 12
The answer should be A. It is the safest way to generate accurate reports.
Comment 1410848 by ultimomassimo
- Upvotes: 1
People who claim a Type 1 SCD is not maintained by upsert need to do some more reading before posting anything here…
Comment 1366256 by Jamuro
- Upvotes: 1
Selected Answer: C It makes no sense to recalculate all aggregations over all historical data and then overwrite; that leads to a non-scalable solution, because all data would have to be recalculated just to avoid "deleting" rows in the target table. Option C lets us filter by date ranges or even partitions to merge only the target periods and get the same results in a much more performant, safer, and cheaper way.
Comment 1328241 by AlejandroU
- Upvotes: 1
Selected Answer: C The answer is C. Option A is correct in ensuring accuracy, as it recalculates the entire store_sales_summary table based on the full historical data in daily_store_sales. However, it is computationally expensive and may not scale well. Option C (upsert logic) could be a better choice in most real-world scenarios, as it focuses only on the records that have changed, reducing computational costs and minimizing disruption for downstream systems.
Comment 1326022 by Sriramiyer92
- Upvotes: 3
Selected Answer: A It is SCD Type 1, so clearly append and upsert logic should not be used.
Comment 1322419 by benni_ale
- Upvotes: 1
Selected Answer: A A as the table does not require history
Comment 1296208 by shaojunni
- Upvotes: 1
Selected Answer: A daily_store_sales is a Type 1 table; no history is maintained. You have to treat every record as a new record and do the aggregation for every store. Overwrite is much more efficient than upsert.
Comment 1236963 by Ati1362
- Upvotes: 3
Selected Answer: C I will go with c. upsert
Comment 1220888 by MDWPartners
- Upvotes: 4
Selected Answer: C A is not correct because the table is daily. If you overwrite, you delete all history. You need to insert/update to keep history.
Comment 1195899 by ThoBustos
- Upvotes: 4
Selected Answer: A Not sure if that’s right but I would go for A. What do you think?
Type 1: data is overwritten. Type 2: history is maintained; new data is inserted as new rows. Type 3: stores two versions per record, a previous and a current value.
- A: batch + overwrite → matches Type 1 requirements. YES
- B: batch + append new rows → would be for Type 2. NO
- C: batch + upsert → data is not being overwritten (which is required for Type 1). NO
- D: readStream + upsert → data is not being overwritten (which is required for Type 1). NO
- E: Change Data Feed to update → problem is manual edits + not overwriting (required for Type 1). NO
I have doubts around "which approach is the safest". Maybe because manual changes make it hard to track changes or do an upsert, overwriting is safer for making sure the stats are right.
Comment 1169411 by vikram12apr
- Upvotes: 2
Selected Answer: C Not A, because overwriting will only provide the daily data, not its history. Not B, because it will not fix the issue of incorrect sales amounts. This data fits batch processing, so neither D nor E. C will upsert only the changes while making sure we update records based on sales_date and store_id.
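For reference, a minimal sketch of that upsert (option C) with the Delta Lake Python API; the aggregate query is illustrative, since the real store_sales_summary columns are only shown in the schema image, so total_sales_ytd here is a stand-in:

```python
from delta.tables import DeltaTable

# Illustrative aggregate keyed by store_id; the real summary columns
# (previous day, 7-day, quarter-to-date, ...) are in the image above.
updates = spark.sql("""
    SELECT store_id, SUM(total_sales) AS total_sales_ytd
    FROM daily_store_sales
    WHERE sales_date >= DATE_TRUNC('year', current_date())
    GROUP BY store_id
""")

# Upsert: update existing store rows, insert any new stores.
(DeltaTable.forName(spark, "store_sales_summary").alias("t")
    .merge(updates.alias("u"), "t.store_id = u.store_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```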
Comment 1136213 by Rinscy
- Upvotes: 1
E, definitely, because it says that the total_sales column may be changed by manual auditing, not via a job, so streaming with CDF is the only option here!
Comment 1135999 by Somesh512
- Upvotes: 3
Selected Answer: A I would go with option A. Because there is manual auditing, values can change. It uses Type 1, hence replace the original data.
Comment 1132830 by spaceexplorer
- Upvotes: 1
Selected Answer: E It should be E, as Structured Streaming has built-in fault tolerance.
Comment 1132061 by Rinscy
- Upvotes: 2
It says Type 1, so A is the correct answer!
Comment 1110895 by divingbell17
- Upvotes: 2
The question is unclear whether the aggregated table needs to support a rolling history. Note the aggregated table does not have a date column to distinguish which date the summary is generated for, so one could assume the table is maintained only for the current snapshot.
Assuming the above - A would be the safest option as all stores and aggregates would need to be refreshed nightly
Comment 1108032 by dmov
- Upvotes: 3
Selected Answer: A A is correct because it's a static table that is written nightly through a batch job. The summary table does not maintain history, so an upsert results in extra, unnecessary records. Overwrite it nightly with updated aggregates for the required time periods.
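A hedged sketch of that overwrite pattern (option A); the aggregate below is a stand-in for the real logic, since the summary schema is only shown in the image:

```python
# Recompute the aggregates from the full daily_store_sales table, then
# atomically replace the summary table (Delta table overwrites are ACID).
summary = spark.sql("""
    SELECT store_id,
           SUM(CASE WHEN sales_date = date_sub(current_date(), 1)
                    THEN total_sales END) AS total_sales_previous_day,
           SUM(total_sales) AS total_sales_ytd
    FROM daily_store_sales
    WHERE sales_date >= DATE_TRUNC('year', current_date())
    GROUP BY store_id
""")

summary.write.format("delta").mode("overwrite").saveAsTable("store_sales_summary")
```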
Comment 1107219 by Luv4data
- Upvotes: 3
The answer is A. Note that the target table has columns that store quarter-to-date, previous-day sales, etc., which will result in daily updates, i.e., a large volume of records will be updated, hence it is better to overwrite than to update a large volume of records.
Comment 1099520 by alexvno
- Upvotes: 3
Selected Answer: C Batch processing so you need to update and insert - C
Comment 1080817 by Enduresoul
- Upvotes: 3
Selected Answer: C Answer C is correct. Answer E would do the job too, but the table schema and the question indicate that only one update is needed daily. Therefore a Structured Streaming job is way too expensive to achieve the outcome.
Comment 1062321 by Syd
- Upvotes: 1
Correct answer A Type 1 data is overwritten https://streamsets.com/blog/slowly-changing-dimensions-vs-change-data-capture/#:~:text=In%20Type%201%2C%20any%20new,change%20to%20maintain%20a%20history.
Comment 1044902 by sturcu
- Upvotes: 2
Selected Answer: E I would say that it is E. If the daily_store_sales table is implemented as a Type 1 table, this means that values are overwritten and we do not keep the history. So we would need to create a stream from CDF and apply those changes to the aggregated table.
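For what option E would look like, a hedged sketch; it assumes the table property delta.enableChangeDataFeed was set to true on daily_store_sales before the changes of interest:

```python
# Stream the change data feed; each row carries a _change_type column
# (insert, update_preimage, update_postimage, delete) that downstream
# logic would use to adjust the aggregates.
cdf = (spark.readStream
       .format("delta")
       .option("readChangeFeed", "true")
       .table("daily_store_sales"))
```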
Question tmkarXIgodNeHCWFJwEv
Question
A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented.
//IMG//
Which command should be removed from the notebook before scheduling it as a job?
Choices
- A: Cmd 2
- B: Cmd 3
- C: Cmd 4
- D: Cmd 5
- E: Cmd 6
answer?
Answer: E Answer_ET: E Community answer E (89%) 11% Discussion
Comment 1084674 by petrv
- Upvotes: 9
Selected Answer: E When scheduling a Databricks notebook as a job, it’s generally recommended to remove or modify commands that involve displaying output, such as using the display() function. Displaying data using display() is an interactive feature designed for exploration and visualization within the notebook interface and may not work well in a production job context.
The finalDF.explain() command, which provides the execution plan of the DataFrame transformations and actions, is often useful for debugging and optimizing queries. While it doesn’t display interactive visualizations like display(), it can still be informative for understanding how Spark is executing the operations on your DataFrame.
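A hedged stand-in for the contrast petrv describes (the real Cmd 2–6 are only in the image; finalDF below is hypothetical, and display() assumes a Databricks notebook context):

```python
# Hypothetical stand-in for the notebook's finalDF.
finalDF = spark.range(10).withColumnRenamed("id", "value")

finalDF.explain()   # analogue of Cmd 5: prints the query plan, harmless in a job
display(finalDF)    # analogue of Cmd 6: notebook-only rendering; remove before scheduling
```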
Comment 1303045 by Carkeys
- Upvotes: 1
Selected Answer: D Cmd 5 (finalDF.explain()) is used for debugging and understanding the logical and physical plans of a DataFrame. It provides insights into how Spark plans to execute the query but does not produce output that is necessary for the scheduled job. Including this command in a scheduled job is unnecessary and could clutter the job logs without adding value to the final output.
Comment 1299625 by benni_ale
- Upvotes: 1
Selected Answer: E If there were multiple solutions, then I would have gone for the .explain() method and printSchema() as well, since they do not contribute to any sort of ETL operation, but as a rule of thumb display() should always be omitted first, so → E
Comment 1265803 by 71dfab9
- Upvotes: 1
Selected Answer: E I agree with petrv and KhoaLe, but I will add that not displaying finalDF would be wise, as it could display and log PII data, and that to me is why I chose E. Like hal2401 said, commands 2, 5 & 6 can be removed as they don't manipulate the data.
Comment 1159782 by hal2401me
- Upvotes: 1
Selected Answer: E Perhaps it's a multiple-selection question in the exam; I'd select E and D. If single choice, then E.
Comment 1143938 by KhoaLe
- Upvotes: 2
Selected Answer: E Looking through all the steps, Cmd 2, 5, and 6 can be eliminated without impacting the whole process. However, in terms of duration cost, Cmd 2 and 5 do not have much impact, as they only show the current logical query plan. In contrast, display() in Cmd 6 is an action that actually executes the query, which will take much more time to run.
Comment 1099521 by alexvno
- Upvotes: 3
Selected Answer: E No display()
Comment 1070832 by 60ties
- Upvotes: 1
Selected Answer: D No actions on production scripts. D is best
Comment 1061164 by Karen1232123
- Upvotes: 2
Why not D?
Question HLH8a1LIrgeotGnTdQ8N
Question
The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads the data for their dashboards runs in a total of 10 minutes.
Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?
Choices
- A: Manually trigger a job anytime the business reporting team refreshes their dashboards
- B: Schedule a job to execute the pipeline once an hour on a new job cluster
- C: Schedule a Structured Streaming job with a trigger interval of 60 minutes
- D: Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster
- E: Configure a job that executes every time new data lands in a given directory
answer?
Answer: B Answer_ET: B Community answer B (87%) 13% Discussion
Comment 1110920 by divingbell17
- Upvotes: 10
Selected Answer: B B is correct, I think. With option C, the cluster remains on 24/7 with a 60-minute trigger, which is more costly.
If there were an option with Structured Streaming with trigger = availableNow and the job scheduled hourly, that would be even more efficient. https://www.databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
Comment 1269666 by robodog
- Upvotes: 1
Selected Answer: C C. The lowest cost is obtained by using a job cluster.
Comment 1160110 by Curious76
- Upvotes: 2
Selected Answer: C Databricks recommends using Structured Streaming with trigger AvailableNow for incremental workloads that do not have low latency requirements.
Comment 1131149 by spaceexplorer
- Upvotes: 4
Selected Answer: B B is correct
Comment 1099529 by alexvno
- Upvotes: 4
Selected Answer: B B: Job cluster is cheap; hourly = 60 minutes.
Comment 1076493 by aragorn_brego
- Upvotes: 2
Selected Answer: B Scheduling a job to execute the pipeline on an hourly basis aligns with the requirement for data to be updated every hour. Using a job cluster (which is brought up for the job and torn down upon completion) rather than a dedicated interactive cluster will usually be more cost-effective. This is because you are only paying for the compute resources when the job is running, which is 10 minutes out of every hour, rather than paying for an interactive cluster that would be up and running (and incurring costs) continuously.
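As a hedged sketch of what option B's configuration might look like, using Databricks Jobs API 2.1 fields; the job name, notebook path, node type, and runtime version below are placeholders:

```python
# Payload for POST /api/2.1/jobs/create; an ephemeral cluster is created per
# run and torn down when the ~10-minute pipeline finishes, so compute is
# only billed for the time the job actually runs.
job_settings = {
    "name": "hourly-dashboard-etl",
    "schedule": {
        "quartz_cron_expression": "0 0 * * * ?",  # top of every hour
        "timezone_id": "UTC",
    },
    "tasks": [{
        "task_key": "etl",
        "notebook_task": {"notebook_path": "/Repos/reporting/dashboard_etl"},
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
    }],
}
```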
Comment 1072406 by ofed
- Upvotes: 1
It’s either B or D. I think B, because we want the lowest cost.