Questions and Answers

Question a4QZRzjWXsISJCJ1pN5z

Question

Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python wheel to object storage mounted with DBFS for use with a production job?

Choices

  • A: configure
  • B: fs
  • C: jobs
  • D: libraries
  • E: workspace
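
For reference, a minimal sketch of the intended usage, with hypothetical local and DBFS paths: the fs command group copies files between the local machine and DBFS-backed storage.

    # Copy a locally built wheel into object storage mounted with DBFS
    # (both paths are hypothetical)
    databricks fs cp ./dist/my_package-0.1.0-py3-none-any.whl dbfs:/FileStore/wheels/my_package-0.1.0-py3-none-any.whl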

Question BiVI8jUjHORB8Pz7IHFU

Question

The business intelligence team has a dashboard configured to track various summary metrics for retail stores. This includes total sales for the previous day alongside totals and averages for a variety of time periods. The fields required to populate this dashboard have the following schema:

//IMG//

For demand forecasting, the Lakehouse contains a validated table of all itemized sales updated incrementally in near real-time. This table, named products_per_order, includes the following fields:

//IMG//

Because long-term sales trends are less volatile, analysts using the new dashboard require data to be refreshed only once daily. Because the dashboard will be queried interactively by many users throughout a normal business day, it should return results quickly and reduce the total compute associated with each materialization.

Which solution meets the expectations of the end users while controlling and limiting possible costs?

Choices

  • A: Populate the dashboard by configuring a nightly batch job to save the required values as a table overwritten with each update.
  • B: Use Structured Streaming to configure a live dashboard against the products_per_order table within a Databricks notebook.
  • C: Configure a webhook to execute an incremental read against products_per_order each time the dashboard is refreshed.
  • D: Use the Delta Cache to persist the products_per_order table in memory to quickly update the dashboard with each query.
  • E: Define a view against the products_per_order table and define the dashboard against this view.
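
For context, a minimal sketch of the nightly-overwrite approach described in option A, with hypothetical column and table names (the schemas above are shown only as images), run where spark is the ambient Databricks session: a scheduled job recomputes the summary once per day and overwrites a small table, so interactive dashboard queries read precomputed results instead of re-aggregating products_per_order on every refresh.

    from pyspark.sql import functions as F

    # Hypothetical nightly job: recompute the dashboard's summary metrics
    # and overwrite the result so interactive queries hit a small, static table.
    summary = (spark.table("products_per_order")
        .groupBy("store_id")                          # grouping column assumed
        .agg(F.sum("total").alias("total_sales"),     # metric columns assumed
             F.avg("total").alias("avg_sale")))

    (summary.write
        .mode("overwrite")                  # replace the previous day's snapshot
        .saveAsTable("dashboard_summary"))  # target table name hypothetical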

Question 418bK9t6VeYfcKLEbtG4

Question

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema “customer_id LONG, predictions DOUBLE, date DATE”:

//IMG//

The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day. Which code block accomplishes this task while minimizing potential compute costs?

Choices

  • A: preds.write.mode("append").saveAsTable("churn_preds")
  • B: preds.write.format("delta").save("/preds/churn_preds")
  • C: //IMG//
  • D: //IMG//
  • E: //IMG//
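
For context, a minimal sketch of the append pattern this question is testing, using the preds DataFrame described above: appending once per day writes only the new rows while keeping every prior day's predictions available for comparison across time.

    # Hypothetical daily write: append today's predictions so that all
    # historical predictions remain queryable in the same Delta table.
    (preds.write
        .format("delta")    # Delta is also the default for managed tables on Databricks
        .mode("append")     # retain all prior days' predictions
        .saveAsTable("churn_preds"))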

Question X61fAqNn7kVxeq2PS6Lf

Question

A data ingestion task requires a 1 TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto Optimize and Auto Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

Choices

  • A: Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to Parquet.
  • B: Set spark.sql.shuffle.partitions to 2,048 partitions (1 TB * 1024 * 1024 / 512 MB), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to Parquet.
  • C: Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1 TB * 1024 * 1024 / 512 MB), and then write to Parquet.
  • D: Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1 TB * 1024 * 1024 / 512 MB), and then write to Parquet.
  • E: Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to Parquet.
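
For context, the arithmetic behind the 2,048 figure: 1 TB = 1024 × 1024 MB = 1,048,576 MB, and 1,048,576 MB / 512 MB = 2,048 part-files. Below is a minimal sketch of the no-shuffle approach in option A, with hypothetical paths and column names, run where spark is the ambient Databricks session: sizing the input splits at 512 MB means that, after only narrow transformations, each task writes roughly one 512 MB part-file without any repartitioning.

    # Size input splits at ~512 MB so each task reads, transforms, and
    # writes roughly one part-file of the target size -- no shuffle needed.
    spark.conf.set("spark.sql.files.maxPartitionBytes", 512 * 1024 * 1024)

    df = spark.read.json("/mnt/raw/events/")       # source path hypothetical

    # Narrow transformations preserve the input partitioning.
    cleaned = df.filter(df["value"].isNotNull())   # column name hypothetical

    cleaned.write.parquet("/mnt/bronze/events/")   # destination path hypothetical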

Question NlfQlHKNB3JTkMFHr8hQ

Question

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.

Streaming DataFrame df has the following schema:

“device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT”

Code block:

//IMG//

Choose the response that correctly fills in the blank within the code block to complete this task.

Choices

  • A: withWatermark("event_time", "10 minutes")
  • B: awaitArrival("event_time", "10 minutes")
  • C: await("event_time + '10 minutes'")
  • D: slidingWindow("event_time", "10 minutes")
  • E: delayWrite("event_time", "10 minutes")
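
For context, a minimal sketch of the pattern option A completes, using the schema above (the original code block is shown only as an image): withWatermark bounds how long state is retained for late-arriving data, and a tumbling window produces the non-overlapping five-minute intervals.

    from pyspark.sql import functions as F

    # Tumbling (non-overlapping) 5-minute windows; state is retained until
    # the watermark passes 10 minutes beyond a window's end, admitting data
    # that arrives up to 10 minutes late.
    agg = (df
        .withWatermark("event_time", "10 minutes")
        .groupBy(F.window("event_time", "5 minutes"))
        .agg(F.avg("temp").alias("avg_temp"),
             F.avg("humidity").alias("avg_humidity")))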