Questions and Answers

Question OYJFSCV0O9Cc6OvRp1Ct

Question

Each configuration below is identical in that each cluster has 400 GB of RAM in total, 160 cores in total, and only one Executor per VM.

Given an extremely long-running job for which completion must be guaranteed, which cluster configuration will guarantee completion of the job in the event of one or more VM failures?

Choices

  • A: Total VMs: 8; 50 GB per Executor; 20 Cores per Executor
  • B: Total VMs: 16; 25 GB per Executor; 10 Cores per Executor
  • C: Total VMs: 1; 400 GB per Executor; 160 Cores per Executor
  • D: Total VMs: 4; 100 GB per Executor; 40 Cores per Executor
  • E: Total VMs: 2; 200 GB per Executor; 80 Cores per Executor
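As a quick arithmetic check (numbers taken from the choices above), every option describes the same total capacity, so the differentiator is how much of the cluster is lost when a single VM fails:

```python
# Each option multiplies out to the same 400 GB / 160 cores total;
# what differs is the blast radius of losing one VM.
configs = {
    "A": {"vms": 8,  "gb_per_executor": 50,  "cores_per_executor": 20},
    "B": {"vms": 16, "gb_per_executor": 25,  "cores_per_executor": 10},
    "C": {"vms": 1,  "gb_per_executor": 400, "cores_per_executor": 160},
    "D": {"vms": 4,  "gb_per_executor": 100, "cores_per_executor": 40},
    "E": {"vms": 2,  "gb_per_executor": 200, "cores_per_executor": 80},
}

for name, c in configs.items():
    assert c["vms"] * c["gb_per_executor"] == 400, name
    assert c["vms"] * c["cores_per_executor"] == 160, name
    # More, smaller VMs: losing one removes a smaller share of capacity.
    print(name, f"one VM failure removes {100 / c['vms']:.1f}% of the cluster")
```

With 16 VMs, a single failure removes only 6.25% of capacity, whereas with 1 or 2 VMs a failure removes half or all of the cluster, including the driver's ability to finish the job.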

Question sW5DCS0Awk0HAQFbWz5s

Question

A Delta Lake table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table.

Given the current implementation, which method can be used?

Choices

  • A: Execute a query to calculate the difference between the new version and the previous version using Delta Lake’s built-in versioning and time travel functionality.
  • B: Parse the Delta Lake transaction log to identify all newly written data files.
  • C: Parse the Spark event logs to identify those rows that were updated, inserted, or deleted.
  • D: Execute DESCRIBE HISTORY customer_churn_params to obtain the full operation metrics for the update, including a log of all records that have been added or modified.
  • E: Use Delta Lake’s change data feed to identify those records that have been updated, inserted, or deleted.
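To make the "difference between versions" concrete, here is a pure-Python toy of the comparison that time travel enables (in Delta Lake itself this would be a query over `customer_churn_params VERSION AS OF <n>` combined with `EXCEPT`, or the change data feed if it were enabled; the rows below are made up for illustration):

```python
# Toy diff of two table snapshots, keyed by customer_id.
# previous = version n-1, current = version n (the nightly overwrite).
previous = {
    1: {"churn_risk": 0.2},
    2: {"churn_risk": 0.7},
    3: {"churn_risk": 0.5},
}
current = {
    1: {"churn_risk": 0.2},  # unchanged
    2: {"churn_risk": 0.9},  # updated
    4: {"churn_risk": 0.1},  # inserted (customer 3 was deleted)
}

inserted = current.keys() - previous.keys()
deleted = previous.keys() - current.keys()
updated = {k for k in current.keys() & previous.keys() if current[k] != previous[k]}

print(sorted(inserted), sorted(deleted), sorted(updated))  # [4] [3] [2]
```

Because the table is overwritten nightly, both full snapshots still exist as table versions, which is why a time-travel comparison works even without the change data feed.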

Question ld7n3IalZrV7ZqbOYmBm

Question

A data team’s Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new promotion, and they would like to add a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows. Note that proposed changes are in bold.

Original query:

//IMG//

Proposed query:

//IMG//

Which step must also be completed to put the proposed query into production?

Choices

  • A: Specify a new checkpointLocation
  • B: Remove .option("mergeSchema", "true") from the streaming write
  • C: Increase the shuffle partitions to account for additional aggregates
  • D: Run REFRESH TABLE delta.`/item_agg`
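The crux of this question is that a stateful streaming query cannot resume from checkpointed state built for a different aggregation. A toy sketch of that incompatibility (field names and the example path are made up for illustration):

```python
# State written under the old checkpointLocation was built for the old
# running aggregate; the proposed query expects an extra state field.
old_checkpoint_state_fields = {"item_id", "total_sales"}
new_query_state_fields = {"item_id", "total_sales", "promo_code_uses"}

def can_resume(checkpoint_fields, query_fields):
    # A running-aggregate query can only resume from state whose schema
    # matches what the query expects to read back.
    return checkpoint_fields == query_fields

assert not can_resume(old_checkpoint_state_fields, new_query_state_fields)
# Hence the fix: point the proposed query at a fresh, empty checkpoint, e.g.
# .option("checkpointLocation", "/checkpoints/item_agg_v2")  # hypothetical path
```

Note that mergeSchema handles the sink table's schema, not the query's internal state, which is why a new checkpointLocation is still required.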

Question fZMUiuaH8CJZW9rLweqT

Question

When using the CLI or REST API to get results from jobs with multiple tasks, which statement correctly describes the response structure?

Choices

  • A: Each run of a job will have a unique job_id; all tasks within this job will have a unique job_id
  • B: Each run of a job will have a unique job_id; all tasks within this job will have a unique task_id
  • C: Each run of a job will have a unique orchestration_id; all tasks within this job will have a unique run_id
  • D: Each run of a job will have a unique run_id; all tasks within this job will have a unique task_id
  • E: Each run of a job will have a unique run_id; all tasks within this job will also have a unique run_id
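For reference, this is roughly the shape of a Jobs API 2.1 runs/get response for a multi-task job run: the parent run carries its own run_id, and each task run also carries its own run_id (the numeric values and task keys below are made up; only the field names matter):

```python
import json

# Illustrative multi-task run response; job_id identifies the job
# definition, while run_ids identify this particular execution.
response = json.loads("""
{
  "job_id": 1001,
  "run_id": 5001,
  "tasks": [
    {"run_id": 5002, "task_key": "ingest"},
    {"run_id": 5003, "task_key": "transform"}
  ]
}
""")

run_ids = [response["run_id"]] + [t["run_id"] for t in response["tasks"]]
# The parent run and every task run each get a distinct run_id.
assert len(run_ids) == len(set(run_ids))
print(run_ids)  # [5001, 5002, 5003]
```

This is why task output must be fetched per task: you pass each task's own run_id to the runs/get-output endpoint, not the parent run_id.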

Question XwirdpOBHjcFO9IfAUM2

Question

The data engineering team is configuring environments for development, testing, and production before beginning migration on a new data pipeline. The team requires extensive testing on both the code and data resulting from code execution, and the team wants to develop and test against data as similar to production data as possible.

A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team.

Which statement captures best practices for this situation?

Choices

  • A: All development, testing, and production code and data should exist in a single, unified workspace; creating separate environments for testing and development complicates administrative overhead.
  • B: In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.
  • C: As long as code in the development environment declares USE dev_db at the top of each notebook, there is no possibility of inadvertently committing changes back to production data sources.
  • D: Because Delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data; as such, it is generally safe to mount production data anywhere.
  • E: Because access to production data will always be verified using passthrough credentials, it is safe to mount data to any Databricks development environment.