Questions and Answers
Question ntXzzV0x7LobKSuTWfDf
Question
A view is registered with the following code:
//IMG//
Both users and orders are Delta Lake tables.
Which statement describes the results of querying recent_orders?
Choices
- A: All logic will execute when the view is defined and store the result of joining tables to the DBFS; this stored data will be returned when the view is queried.
- B: Results will be computed and cached when the view is defined; these cached results will incrementally update as new records are inserted into source tables.
- C: All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.
- D: All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.
Answer: D · Answer_ET: D · Community answer: D (89%), other 11%
Discussion
Comment 1222889 by Freyr
- Upvotes: 5
Selected Answer: D Correct Answer: D. This option correctly describes the behavior of SQL views in Databricks. The view’s query is executed against the current state of the source tables at the moment the query begins, so any changes committed while the query is running will not be reflected in the results of the currently executing query.
Comment 1307257 by benni_ale
- Upvotes: 2
Selected Answer: D It’s D.
Comment 1226817 by Isio05
- Upvotes: 1
Selected Answer: D Also voting for D; the view’s results are recalculated each time it is queried.
Comment 1221438 by MDWPartners
- Upvotes: 1
Selected Answer: B It should be B
Question sdTAJVROJYfRIR3gv1nL
Question
A data ingestion task requires a 1 TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize and Auto-Compaction cannot be used.
Which strategy will yield the best performance without shuffling data?
Choices
- A: Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.
- B: Set spark.sql.shuffle.partitions to 2,048 partitions (1 TB * 1024 * 1024 / 512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.
- C: Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1 TB * 1024 * 1024 / 512), and then write to parquet.
- D: Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1 TB * 1024 * 1024 / 512), and then write to parquet.
Answer: A · Answer_ET: A · Community answer: A (52%), D (38%), other 10%
Discussion
Comment 1341591 by RandomForest
- Upvotes: 1
Selected Answer: D Correct answer is D. Why not the other options?
- A. spark.sql.files.maxPartitionBytes: this configuration controls how many bytes Spark reads per input partition during a file scan, not the output file size. It does not help in controlling Parquet file sizes during writing.
- B. Set spark.sql.shuffle.partitions and sort data: while sorting data can optimize performance in some cases, it introduces unnecessary overhead for this scenario. Additionally, spark.sql.shuffle.partitions controls the number of shuffle partitions, not directly the output partitioning of the data.
- C. Use spark.sql.adaptive.advisoryPartitionSizeInBytes: Adaptive Query Execution (AQE) optimizes queries at runtime, but this configuration does not directly control Parquet file sizes. It dynamically adjusts partition sizes for shuffle stages, not for the write output.
Comment 1339918 by lene
- Upvotes: 1
Selected Answer: A See arekm’s explanation.
Comment 1335531 by arekm
- Upvotes: 3
Selected Answer: A Definitely A: no repartitioning and therefore no subsequent shuffle (which is what the question rules out). The parameter defines how many bytes each input partition reads, so tasks read in those chunks; since only narrow transformations are performed (by definition, no shuffle), we simply write out what we read. The target file size is 512 MB and no shuffle occurred.
Comment 1320978 by temple1305
- Upvotes: 1
Selected Answer: C I think “execute the narrow transformations, coalesce to” are the key words here, because coalesce does not cause shuffling.
Comment 1310073 by cf56faf
- Upvotes: 2
Selected Answer: D It’s D, because A primarily affects the reading of the data
Comment 1302374 by Jugiboss
- Upvotes: 2
Selected Answer: A A does not shuffle while D shuffles
Comment 1300477 by m79590530
- Upvotes: 2
Selected Answer: A Answer A as narrow transformations like union, filter and map do not cause shuffle across partitions.
Comment 1297926 by Colje
- Upvotes: 1
Selected Answer: D The correct answer is D. Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB * 1024 * 1024 / 512), and then write to Parquet.
Explanation: In this case, the goal is to write a 1 TB dataset to Parquet with a target file size of 512 MB without incurring the overhead of data shuffling. To achieve optimal performance, we must balance the number of partitions to match the file size requirements while avoiding expensive shuffle operations.
- Narrow transformations: these transformations (such as map, filter) don’t require shuffling the data, which keeps the operation efficient.
- Repartition to 2,048 partitions: given that the desired part-file size is 512 MB and the total dataset size is 1 TB, repartitioning the dataset into 2,048 partitions ensures that each partition will be approximately 512 MB in size, which matches the target file size. This allows for an efficient write.
Comment 1288996 by pk07
- Upvotes: 2
Selected Answer: D Not A because spark.sql.files.maxPartitionBytes primarily affects the reading of data, not the writing. It determines the maximum size of a partition when reading files, not when writing them.
Comment 1288267 by shaojunni
- Upvotes: 1
Selected Answer: C A and D will not prevent shuffling data; C uses coalesce to reduce data shuffling.
Comment 1237812 by 03355a2
- Upvotes: 3
Selected Answer: A best performance without shuffling data
Comment 1229282 by hpkr
- Upvotes: 1
Selected Answer: D option D
Comment 1222892 by Freyr
- Upvotes: 1
Selected Answer: D Correct Answer D: Repartition to 2,048 partitions and write to Parquet
This option directly controls the number of output files by repartitioning the data into 2,048 partitions, assuming that 1TB/512MB per file roughly translates to 2,048 files. Repartitioning the data involves shuffling, but it’s a deliberate shuffle designed to achieve a specific partitioning beneficial for writing. After repartitioning, the data is written to Parquet files, each expected to be approximately 512 MB if the data is uniformly distributed across partitions.
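The 2,048-file figure quoted in several choices follows from simple arithmetic (1 TB expressed in MB, divided by the 512 MB target), which can be checked directly:

```python
# Part-file count arithmetic from the question: 1 TB at 512 MB per file.
dataset_size_mb = 1 * 1024 * 1024   # 1 TB expressed in MB
target_file_mb = 512
num_files = dataset_size_mb // target_file_mb
print(num_files)  # 2048
```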
Question Wrj7BXqh47XQHk2pI9Lh
Question
Which statement regarding stream-static joins and static Delta tables is correct?
Choices
- A: The checkpoint directory will be used to track updates to the static Delta table.
- B: Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job’s initialization.
- C: The checkpoint directory will be used to track state information for the unique keys present in the join.
- D: Stream-static joins cannot use static Delta tables because of consistency issues.
Answer: B · Answer_ET: B · Community answer: B (83%), A (17%)
Discussion
Comment 1335534 by arekm
- Upvotes: 2
Selected Answer: B All answers are wrong:
- A: a checkpoint directory tracking changes to a Delta table?
- B: each microbatch uses the state of the table at the time the query is executed, not at initialization
- C: unique keys? Stream-static joins are not stateful, so we are only looking at the current batch of records
- D: you can totally have stream-static joins, see: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#support-matrix-for-joins-in-streaming-queries
I believe they made a typo in B; that seems to be the only logical explanation.
Comment 1307707 by benni_ale
- Upvotes: 1
Selected Answer: A If you look at question 18, you find that the correct statement should be “Each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch.” That statement is not listed here, meaning B cannot be correct, leaving A as the only possible answer. The wrong part of B is that the latest version of the static Delta table is used at each micro-batch, rather than as of job initialization.
Comment 1221151 by MDWPartners
- Upvotes: 3
Selected Answer: B When Databricks processes a micro-batch of data in a stream-static join, the latest valid version of data from the static Delta table joins with the records present in the current micro-batch. Because the join is stateless, you do not need to configure watermarking and can process results with low latency. The data in the static Delta table used in the join should be slowly-changing. https://docs.databricks.com/en/transform/join.html#stream-static
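The documented behavior MDWPartners quotes (each micro-batch joins against the latest valid version of the static table) can be modeled with a toy pure-Python sketch. This is an illustration of the semantics only, not Spark code, and all names are invented:

```python
# Toy model of a stream-static join: the "static" side is re-read for each
# microbatch, so updates committed between batches are picked up.
static_table = {1: "v1"}  # stand-in for a slowly-changing static Delta table

def process_microbatch(batch_keys):
    # Join each streaming record against the CURRENT contents of the static table.
    return [(k, static_table.get(k)) for k in batch_keys]

print(process_microbatch([1]))  # [(1, 'v1')]
static_table[1] = "v2"          # static table updated between microbatches
print(process_microbatch([1]))  # [(1, 'v2')] -- the newer version is visible
```

Because the join is stateless, nothing about the static side needs to be tracked in the checkpoint, which is why choices A and C are wrong.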
Question a6wMp9vjIQJJsNXzy9WN
Question
A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.
Streaming DataFrame df has the following schema:
“device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT”
Code block:
//IMG//
Which line of code correctly fills in the blank within the code block to complete this task?
Choices
- A: to_interval(“event_time”, “5 minutes”).alias(“time”)
- B: window(“event_time”, “5 minutes”).alias(“time”)
- C: “event_time”
- D: lag(“event_time”, “10 minutes”).alias(“time”)
Answer: B · Answer_ET: B · Community answer: B (100%)
Discussion
Comment 1341594 by RandomForest
- Upvotes: 1
Selected Answer: B Correct answer is B: use the window function.
Comment 1300497 by m79590530
- Upvotes: 2
Selected Answer: B This is the standard syntax for non-overlapping time-interval windowed grouping on a time field in Structured Streaming. The .withWatermark() function defines the lateness threshold after which delayed records will be dropped/ignored.
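What Spark’s window("event_time", "5 minutes") does for non-overlapping (tumbling) windows can be illustrated in pure Python: each timestamp is assigned to the 5-minute bucket it falls in. This is a sketch of the bucketing arithmetic only, not the Spark API:

```python
from datetime import datetime, timedelta

def tumbling_window_start(ts: datetime, minutes: int = 5) -> datetime:
    """Return the start of the non-overlapping window containing ts."""
    epoch = datetime(1970, 1, 1)
    elapsed = int((ts - epoch).total_seconds())
    bucket_start = elapsed - elapsed % (minutes * 60)  # round down to boundary
    return epoch + timedelta(seconds=bucket_start)

# An event at 12:07:30 lands in the [12:05, 12:10) window.
print(tumbling_window_start(datetime(2024, 1, 1, 12, 7, 30)))
# 2024-01-01 12:05:00
```

Grouping by this bucket (plus device_id) and averaging temp and humidity is exactly what the question’s aggregation does.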
Question l7nbfkY1EoyT51MvKbJj
Question
A Structured Streaming job deployed to production has been resulting in higher than expected cloud storage costs. At present, during normal execution, each microbatch of data is processed in less than 3s; at least 12 times per minute, a microbatch is processed that contains 0 records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution.
Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?
Choices
- A: Set the trigger interval to 3 seconds; the default trigger interval is consuming too many records per batch, resulting in spill to disk that can increase volume costs.
- B: Use the trigger once option and configure a Databricks job to execute the query every 10 minutes; this approach minimizes costs for both compute and storage.
- C: Set the trigger interval to 10 minutes; each batch calls APIs in the source storage account, so decreasing trigger frequency to maximum allowable threshold should minimize this cost.
- D: Set the trigger interval to 500 milliseconds; setting a small but non-zero trigger interval ensures that the source is not queried too frequently.
Answer: C · Answer_ET: C · Community answer: C (100%)
Discussion
Comment 1230601 by Isio05
- Upvotes: 4
Selected Answer: C C.
- A: incorrect explanation
- B: trigger once is not the correct option here
- D: 500 milliseconds is already used; it’s the default trigger interval
Comment 1229289 by hpkr
- Upvotes: 1
Selected Answer: C Option C
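A back-of-envelope comparison of batch counts explains the storage-API cost argument behind answer C. This sketch assumes roughly one set of source-listing API calls per microbatch and uses the 500 ms default trigger mentioned in the discussion:

```python
# Microbatches per day at the default ~500 ms trigger vs. a 10-minute trigger.
# Assumption: each microbatch issues about one round of storage API calls.
seconds_per_day = 24 * 60 * 60

batches_at_default = int(seconds_per_day / 0.5)  # ~500 ms trigger
batches_at_10_min = int(seconds_per_day / 600)   # 10-minute trigger

print(batches_at_default)  # 172800
print(batches_at_10_min)   # 144
```

Cutting from ~172,800 to 144 batches per day, while still processing records well within the 10-minute requirement, is why C wins over the default settings.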