Questions and Answers
Question a4QZRzjWXsISJCJ1pN5z
Question
Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?
Choices
- A: configure
- B: fs
- C: jobs
- D: libraries
- E: workspace
Answer: B | Answer_ET: B | Community answer: B (77%), D (18%), other (5%)
Discussion
Comment 1182254 by arik90
- Upvotes: 6
Selected Answer: B databricks fs cp dist/<…>.whl dbfs:/some/place/appropriate
Comment 1328294 by AlejandroU
- Upvotes: 1
Selected Answer: D Answer D. The Databricks CLI libraries command is used to manage libraries, including installing custom Python wheels. Specifically, the install subcommand can be used to install a wheel. In contrast, Option B. fs: This command interacts with the Databricks File System (DBFS) to manage files, but it is primarily used for basic file operations (like cp, ls, rm), not specifically for uploading libraries.
Comment 1160168 by Curious76
- Upvotes: 1
Selected Answer: D Here's how you can use the libraries command to upload your wheel:
databricks libraries upload --file <path_to_wheel_file> --name <library_name>
Comment 1150831 by ojudz08
- Upvotes: 1
Selected Answer: C This is a bit tricky: the question asks to upload a custom Python Wheel, so you could use the fs command, but since it will be used in a production job, might the jobs command be needed to perform Databricks jobs operations? https://docs.databricks.com/en/dev-tools/cli/commands.html
Comment 1136011 by Somesh512
- Upvotes: 2
Selected Answer: B It's asking to upload to DBFS, not to install on a cluster.
Comment 1084732 by petrv
- Upvotes: 3
Selected Answer: B The question is about copying the file, not about installing it.
Comment 1080851 by Enduresoul
- Upvotes: 4
Selected Answer: B Answer B is correct: "… which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS …" The question asks how to upload the wheel, not how to install it or configure it in a job. https://docs.databricks.com/en/archive/dev-tools/cli/dbfs-cli.html
Comment 1076509 by aragorn_brego
- Upvotes: 2
Selected Answer: B The Databricks CLI fs command is used for interacting with the Databricks File System (DBFS). You can use it to put files into DBFS, which includes uploading custom Python Wheels to a directory in DBFS. The fs command has subcommands like cp that can be used to copy files from your local file system to DBFS, which is backed by object storage mounted with dbutils.fs.mount().
databricks fs cp my_package.whl dbfs:/mnt/my-mount-point/my_package.whl
Comment 1054639 by mouad_attaqi
- Upvotes: 1
Selected Answer: D It is done using the command: databricks libraries install
Comment 1053023 by sturcu
- Upvotes: 1
Selected Answer: D You can add a libraries section to the jobs command, but you install a wheel with the libraries command.
Question BiVI8jUjHORB8Pz7IHFU
Question
The business intelligence team has a dashboard configured to track various summary metrics for retail stores. This includes total sales for the previous day alongside totals and averages for a variety of time periods. The fields required to populate this dashboard have the following schema:
//IMG//
For demand forecasting, the Lakehouse contains a validated table of all itemized sales updated incrementally in near real-time. This table, named products_per_order, includes the following fields:
//IMG//
Because reporting on long-term sales trends is less volatile, analysts using the new dashboard only require data to be refreshed once daily. Because the dashboard will be queried interactively by many users throughout a normal business day, it should return results quickly and reduce total compute associated with each materialization.
Which solution meets the expectations of the end users while controlling and limiting possible costs?
Choices
- A: Populate the dashboard by configuring a nightly batch job to save the required values as a table overwritten with each update.
- B: Use Structured Streaming to configure a live dashboard against the products_per_order table within a Databricks notebook.
- C: Configure a webhook to execute an incremental read against products_per_order each time the dashboard is refreshed.
- D: Use the Delta Cache to persist the products_per_order table in memory to quickly update the dashboard with each query.
- E: Define a view against the products_per_order table and define the dashboard against this view.
Answer: A | Answer_ET: A | Community answer: A (93%), other (7%)
Discussion
Comment 1108784 by dmov
- Upvotes: 13
Selected Answer: A Looks like A to me, as long as they only need the aggregates based on the previous day.
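For reference, a minimal PySpark sketch of option A, assuming hypothetical column and table names (store_id, price, store_sales_summary); it would run as a nightly scheduled Databricks job:

```python
from pyspark.sql import functions as F

# Nightly batch: aggregate the validated products_per_order table once per day
# and overwrite a small summary table that the dashboard queries cheaply all day.
summary = (
    spark.table("products_per_order")
    .groupBy("store_id")                         # grouping column is assumed
    .agg(
        F.sum("price").alias("total_sales"),     # metric columns are assumed
        F.avg("price").alias("avg_sale"),
    )
)

summary.write.mode("overwrite").saveAsTable("store_sales_summary")  # hypothetical target table
```

Because the summary table is tiny compared to products_per_order, interactive dashboard queries against it stay fast, and the expensive aggregation runs only once per day.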
Comment 1332579 by srinivasa
- Upvotes: 1
Selected Answer: D Delta cache avoids having to read data from the table every time it’s queried during the day.
Question 418bK9t6VeYfcKLEbtG4
Question
The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema “customer_id LONG, predictions DOUBLE, date DATE”. //IMG//
The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day. Which code block accomplishes this task while minimizing potential compute costs?
Choices
- A: preds.write.mode("append").saveAsTable("churn_preds")
- B: preds.write.format("delta").save("/preds/churn_preds")
- C:
- D:
- E:
Answer: A | Answer_ET: A | Community answer: A (100%)
Discussion
Comment 1001405 by thxsgod
- Upvotes: 13
Selected Answer: A You need:
- Batch operation since it is at most once a day
- Append, since you need to keep track of past predictions
A is the correct answer. You don’t need to specify “format” when you use saveAsTable.
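For reference, a sketch of the overall flow, assuming the model is applied with mlflow.pyfunc.spark_udf; the model URI, feature table, and feature columns are placeholders, since the original application code is shown only as an image:

```python
import mlflow.pyfunc
from pyspark.sql import functions as F

# Load the registered production model as a Spark UDF (model URI is a placeholder).
predict = mlflow.pyfunc.spark_udf(
    spark, "models:/churn_model/Production", result_type="double"
)

# Apply the model once per day; feature table and columns are assumed.
preds = spark.table("customer_features").select(
    "customer_id",
    predict(F.struct("feature_1", "feature_2")).alias("predictions"),
    F.current_date().alias("date"),
)

# Option A: append so every daily batch of predictions is retained over time.
# saveAsTable writes Delta by default on Databricks, so no format is needed.
preds.write.mode("append").saveAsTable("churn_preds")
```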
Comment 1294572 by benni_ale
- Upvotes: 1
Selected Answer: A Batch, Append
Comment 1213850 by coercion
- Upvotes: 1
Selected Answer: A default table format is delta so no need to specify the format. As per the requirement, “append” mode is required to maintain the history. Default mode is “ErrorIfExists”
Comment 1121595 by Jay_98_11
- Upvotes: 1
Selected Answer: A A is correct
Comment 1102680 by kz_data
- Upvotes: 1
Selected Answer: A A is correct
Comment 1040240 by sturcu
- Upvotes: 1
Selected Answer: A Correct
Comment 1040239 by sturcu
- Upvotes: 1
Correct
Comment 1013178 by Eertyy
- Upvotes: 2
answer is B
Comment 992014 by buggumaster
- Upvotes: 1
Selected answer is wrong; no write format is specified in A.
Comment 992013 by buggumaster
- Upvotes: 1
Selected answer is wrong; no writeMode is specified in A.
Question X61fAqNn7kVxeq2PS6Lf
Question
A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.
Which strategy will yield the best performance without shuffling data?
Choices
- A: Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.
- B: Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.
- C: Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.
- D: Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.
- E: Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.
Answer: A | Answer_ET: A | Community answer: A (57%), D (24%), Other
Discussion
Comment 1076528 by aragorn_brego
- Upvotes: 10
Selected Answer: A This strategy aims to control the size of the output Parquet files without shuffling the data. The spark.sql.files.maxPartitionBytes parameter sets the maximum size of a partition that Spark will read. By setting it to 512 MB, you are aligning the read partition size with the desired output file size. Since the transformations are narrow (meaning they do not require shuffling), the number of partitions should roughly correspond to the number of output files when writing out to Parquet, assuming the data is evenly distributed and there is no data expansion during processing.
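For reference, a sketch of option A with hypothetical input and output paths; because only narrow transformations follow, the ~512 MB read partitions carry straight through to the Parquet part files:

```python
from pyspark.sql import functions as F

# Option A: align the read partition size with the target part-file size (512 MB).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

df = spark.read.json("/mnt/raw/events/")                  # hypothetical source path

# Narrow transformations only, so no shuffle changes the partition count.
cleaned = df.filter(F.col("event_type").isNotNull())      # assumed column

cleaned.write.mode("overwrite").parquet("/mnt/bronze/events/")  # hypothetical target path
```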
Comment 1130323 by Def21
- Upvotes: 6
Selected Answer: D D is the only one that does the trick.
Note: we cannot do shuffling.
Wrong answers:
A: spark.sql.files.maxPartitionBytes is about reading, not writing.(The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC. )
B: spark.sql.adaptive.advisoryPartitionSizeInBytes takes effect while shuffling and sorting does not make sense (The advisory size in bytes of the shuffle partition during adaptive optimization (when spark.sql.adaptive.enabled is true). It takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partition.)
C: Would work but spark.sql.adaptive.advisoryPartitionSizeInBytes would need shuffling.
E. spark.sql.shuffle.partitions (Configures the number of partitions to use when shuffling data for joins or aggregations.) is not about writing.
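For comparison, a sketch of option D with the same hypothetical paths; repartition(2048) would yield roughly 512 MB part files for evenly distributed data, but repartition itself performs a full shuffle of the 1 TB dataset, which is the main point of contention in this thread:

```python
# Option D: explicitly repartition before writing (hypothetical paths).
df = spark.read.json("/mnt/raw/events/")
transformed = df.filter("event_type IS NOT NULL")   # assumed narrow transformation

(
    transformed.repartition(2048)                   # full shuffle happens here
    .write.mode("overwrite")
    .parquet("/mnt/bronze/events_repartitioned/")
)
```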
Comment 1330235 by AlejandroU
- Upvotes: 1
Selected Answer: D Answer D. Explicitly repartitioning to 2,048 partitions ensures that the output files are close to the desired size of 512 MB, provided the data distribution is relatively even. Repartitioning directly addresses the problem by controlling the number of partitions, which directly affects the output file size. Why not option A? Option A misinterprets spark.sql.files.maxPartitionBytes: that setting controls the size of partitions when reading data, not the maximum size of files written to Parquet.
Comment 1323738 by temple1305
- Upvotes: 1
Selected Answer: C spark.sql.adaptive.advisoryPartitionSizeInBytes is the advisory size in bytes of the shuffle partition during adaptive optimization (when spark.sql.adaptive.enabled is true). It takes effect when Spark coalesces small shuffle partitions or splits a skewed shuffle partition. And then we do coalesce, without a shuffle, so it has to work!
Comment 1303293 by nedlo
- Upvotes: 2
Selected Answer: A I thought D, but the default number of partitions is 200, so you can't do coalesce(2048) (you can't increase the number of partitions through coalesce); it's not possible to do it without repartitioning and a shuffle. Only A can be done without a shuffle.
Comment 1284189 by sdas1
- Upvotes: 2
Option A: spark.sql.files.maxPartitionBytes controls the maximum size of partitions during reading on the Spark cluster; reducing this value could lead to more partitions and thus potentially more output files. The key point is that it works best when no shuffles occur, which aligns with the scenario of having narrow transformations only.
Comment 1169434 by vikram12apr
- Upvotes: 1
Selected Answer: A D is not correct, as it will create 2,048 target files of 0.5 MB each. Only A will do the job, as it will read this file in 2 partitions (1 TB = 512*2 MB), and since we are not doing any shuffling (not mentioned in the option), it will create that many partition files, i.e. 2 part files.
Comment 1167675 by hal2401me
- Upvotes: 1
Selected Answer: D ChatGPT says D: This strategy directly addresses the desired part-file size by repartitioning the data. It avoids shuffling during narrow transformations. Recommended for achieving the desired part-file size without unnecessary shuffling.
Comment 1162035 by Curious76
- Upvotes: 1
Selected Answer: D D is most suitable.
Comment 1143743 by vctrhugo
- Upvotes: 3
Selected Answer: A This approach ensures that each partition will be approximately the target part-file size, which can improve the efficiency of the data write. It also avoids the need for a shuffle operation, which can be expensive in terms of performance.
Comment 1136782 by adenis
- Upvotes: 1
Selected Answer: C C is correct.
Comment 1131156 by spaceexplorer
- Upvotes: 2
Selected Answer: A Rest of the answers trigger shuffles
Comment 1111424 by divingbell17
- Upvotes: 2
Selected Answer: A A is correct. The question states Which strategy will yield the best performance without shuffling data. The other options involve shuffling either manually or through AQE
Comment 1101365 by 911land
- Upvotes: 1
C is correct answer
Comment 1099580 by alexvno
- Upvotes: 1
Selected Answer: A
- spark.sql.files.maxPartitionBytes (default 128 MB): the maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.
Comment 1084755 by petrv
- Upvotes: 3
Selected Answer: C Here’s a breakdown of the reasons:
spark.sql.adaptive.advisoryPartitionSizeInBytes: This configuration parameter is designed to provide advisory partition sizes for the adaptive query execution framework. It can help in controlling the partition sizes without triggering unnecessary shuffling.
coalesce(2048): Coalescing to a specific number of partitions after the narrow transformations allows you to control the number of output files without triggering a shuffle. This helps achieve the target part-file size without incurring the overhead of a full shuffle.
Setting a specific target: The strategy outlines the goal of achieving a target part-file size of 512 MB, which aligns with the requirement.
Comment 1070716 by ocaj90
- Upvotes: 1
obviously D. It allows you to control both the number of partitions and the final part-file size, which aligns with the requirements. Option B shuffles partitions, which is not allowed.
Comment 1053027 by sturcu
- Upvotes: 2
Selected Answer: B The number of output files saved to the disk is equal to the number of partitions in the Spark executors when the write operation is performed.
Question NlfQlHKNB3JTkMFHr8hQ
Question
A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.
Streaming DataFrame df has the following schema:
“device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT”
Code block:
//IMG//
Choose the response that correctly fills in the blank within the code block to complete this task.
Choices
- A: withWatermark("event_time", "10 minutes")
- B: awaitArrival("event_time", "10 minutes")
- C: await("event_time + '10 minutes'")
- D: slidingWindow("event_time", "10 minutes")
- E: delayWrite("event_time", "10 minutes")
Answer: A | Answer_ET: A | Community answer: A (100%)
Discussion
Comment 1076530 by aragorn_brego
- Upvotes: 9
Selected Answer: A To handle late-arriving data in a streaming aggregation, you need to specify a watermark, which tells the streaming query how long to wait for late data. The withWatermark method is used for this purpose in Spark Structured Streaming. It defines the threshold for how late the data can be relative to the latest data that has been seen in the same window.
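For reference, a minimal sketch of the completed pipeline, assuming the blank sits between the streaming DataFrame and the grouped aggregation; the sink (table name, checkpoint path) is hypothetical, since it is not shown in the question's image:

```python
from pyspark.sql import functions as F

agg = (
    df.withWatermark("event_time", "10 minutes")      # keep state for 10 minutes of late data
    .groupBy(F.window("event_time", "5 minutes"))     # non-overlapping 5-minute windows
    .agg(
        F.avg("temp").alias("avg_temp"),
        F.avg("humidity").alias("avg_humidity"),
    )
)

# Hypothetical sink: each window is emitted once the watermark passes it.
query = (
    agg.writeStream.outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/weather_5min")
    .toTable("weather_5min_averages")
)
```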
Comment 1053462 by sturcu
- Upvotes: 9
Selected Answer: A withWatermark. The sliding window is done through the window function.
Comment 1265967 by 71dfab9
- Upvotes: 1
Selected Answer: A The withWatermark method is used in streaming DataFrames when processing real-time data streams. This method helps in managing stateful operations, such as aggregations, by specifying a time column to use for watermarking. Watermarking is a mechanism to handle late data (data that arrives later than expected) by defining a threshold time window beyond which late data is considered too late to be included in aggregations.
The slidingWindow function mentioned in D is not a standard function in Databricks or Apache Spark.
Comment 1066244 by Dileepvikram
- Upvotes: 3
Answer is A