Questions and Answers
Question OYJFSCV0O9Cc6OvRp1Ct
Question
Each configuration below is identical in that each cluster has 400 GB of RAM in total, 160 total cores, and only one Executor per VM.
Given an extremely long-running job for which completion must be guaranteed, which cluster configuration will be able to guarantee completion of the job in light of one or more VM failures?
Choices
- A: • Total VMs: 8 • 50 GB per Executor • 20 Cores / Executor
- B: • Total VMs: 16 • 25 GB per Executor • 10 Cores / Executor
- C: • Total VMs: 1 • 400 GB per Executor • 160 Cores/Executor
- D: • Total VMs: 4 • 100 GB per Executor • 40 Cores / Executor
- E: • Total VMs: 2 • 200 GB per Executor • 80 Cores / Executor
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1349180 by 91d511b
- Upvotes: 1
Selected Answer: B Total VMs = 16
Resources per VM: 25 GB RAM and 10 cores per executor.
Impact of a VM failure: losing one VM means losing only 6.25% of the cluster's resources.
Fault tolerance: excellent; the cluster can handle multiple VM failures (up to ~3 VMs) and still function effectively.
Best balance: with smaller VMs, the job remains highly fault-tolerant while using resources efficiently.
Comment 1286992 by shaojunni
- Upvotes: 2
16 VMs provide more redundancy, fault tolerance, and parallelism. But if the dataset is huge, 8 VMs may be better. The question is missing some information.
Comment 1244363 by c00ccb7
- Upvotes: 2
Selected Answer: B This setup ensures that the job can continue running and complete even if some VMs fail, as there are more VMs available to handle the workload
Comment 1217499 by ChayV
- Upvotes: 3
Selected Answer: B If a VM goes down, performance is degraded, so opt for VMs with memory distributed across more executors and an optimal number of cores per executor.
Comment 1173156 by hal2401me
- Upvotes: 4
Selected Answer: B In my exam today, I chose B (16 VMs) because of the "extremely long-running" requirement.
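The fault-tolerance argument in the comments above can be sketched with simple arithmetic: with one executor per VM, losing a VM costs 1/N of the cluster. A minimal Python sketch (configuration labels taken from the choices above):

```python
# Fraction of cluster resources (RAM/cores) lost when a single VM fails,
# for each candidate configuration. All clusters total 400 GB RAM / 160 cores,
# with exactly one executor per VM, so one VM failure costs 1/N of the cluster.
configs = {"A": 8, "B": 16, "C": 1, "D": 4, "E": 2}  # label -> total VMs

def loss_per_vm_failure(total_vms: int) -> float:
    """Fraction of resources lost when one VM (one executor) dies."""
    return 1 / total_vms

for label, vms in sorted(configs.items()):
    # e.g. B loses 6.25% per VM failure, C loses 100.00%
    print(f"{label}: {loss_per_vm_failure(vms):.2%} lost per VM failure")
```

Configuration B minimizes the blast radius of any single failure, which is why it is the best guarantee of completion for an extremely long-running job.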
Question sW5DCS0Awk0HAQFbWz5s
Question
A Delta Lake table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.
Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table.
Given the current implementation, which method can be used?
Choices
- A: Execute a query to calculate the difference between the new version and the previous version using Delta Lake’s built-in versioning and time travel functionality.
- B: Parse the Delta Lake transaction log to identify all newly written data files.
- C: Parse the Spark event logs to identify those rows that were updated, inserted, or deleted.
- D: Execute DESCRIBE HISTORY customer_churn_params to obtain the full operation metrics for the update, including a log of all records that have been added or modified.
- E: Use Delta Lake’s change data feed to identify those records that have been updated, inserted, or deleted.
Answer: A Answer_ET: A Community answer A (80%) E (20%) Discussion
Comment 1335492 by arekm
- Upvotes: 1
Selected Answer: A D - see the discussion under Jugiboss's comment.
Comment 1326376 by Sriramiyer92
- Upvotes: 2
Selected Answer: A CDF is particularly useful for incremental loads; in our case the table is overwritten. Hence A.
Comment 1302361 by Jugiboss
- Upvotes: 1
Selected Answer: E The best method to determine the difference between the new version and the previous version of the customer_churn_params table in Delta Lake is:
E. Use Delta Lake’s change data feed to identify those records that have been updated, inserted, or deleted.
This approach leverages Delta Lake’s built-in functionality to track changes at the record level, providing a clear view of what has changed between versions.
Comment 1296880 by cales
- Upvotes: 1
Selected Answer: E Change data feed allows checking for changes between versions.
Comment 1270611 by robodog
- Upvotes: 2
Selected Answer: A Answer is A. The easy way to get the difference between those versions is time travel by version number.
Comment 1256245 by HelixAbdu
- Upvotes: 2
Answer is A. There is no clue that CDF is enabled for the table
Comment 1244424 by c00ccb7
- Upvotes: 3
Selected Answer: A Answer A
Comment 1224722 by Deb9753
- Upvotes: 2
Answer : E
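Since the nightly job overwrites the table, Delta retains the previous version, and the diff can be computed with time travel plus set-difference queries (e.g. `SELECT * FROM customer_churn_params VERSION AS OF <new> EXCEPT SELECT * FROM customer_churn_params VERSION AS OF <new - 1>`). A plain-Python sketch of that same set logic, with the two versions modeled as sets of row tuples (the data is illustrative):

```python
# Two table "versions" as sets of (customer, churn_score) rows - illustrative
# data standing in for VERSION AS OF n-1 and VERSION AS OF n of the Delta table.
previous_version = {("cust1", 0.10), ("cust2", 0.55), ("cust3", 0.80)}
new_version      = {("cust1", 0.12), ("cust2", 0.55), ("cust4", 0.30)}

# EXCEPT semantics: rows present in one version but not the other.
added_or_changed = new_version - previous_version
removed          = previous_version - new_version

print(sorted(added_or_changed))
print(sorted(removed))
```

This works with the current implementation precisely because time travel needs no prior setup, whereas the change data feed (option E) would only help if CDF had already been enabled on the table before the update.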
Question ld7n3IalZrV7ZqbOYmBm
Question
A data team’s Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new promotion, and they would like to add a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows. Note that proposed changes are in bold.
Original query:
//IMG//
Proposed query:
//IMG//
Which step must also be completed to put the proposed query into production?
Choices
- A: Specify a new checkpointLocation
- B: Remove .option('mergeSchema', 'true') from the streaming write
- C: Increase the shuffle partitions to account for additional aggregates
- D: Run REFRESH TABLE delta.`/item_agg`
Answer: A Answer_ET: A Community answer A (100%) Discussion
Comment 1224724 by Deb9753
- Upvotes: 5
Answer: A When updating the schema of a streaming job, specifying a new checkpoint location ensures that the streaming query starts fresh with the new schema. This avoids issues that might arise from schema mismatches between the previous state and the new schema. This is especially relevant when adding new fields because the existing state might not be compatible with the new schema.
Comment 1342070 by SeRo42
- Upvotes: 1
Selected Answer: A Are filter conditions in count() allowed?
Comment 1270614 by robodog
- Upvotes: 1
Selected Answer: A A answer
Comment 1221093 by MDWPartners
- Upvotes: 2
Selected Answer: A This checkpoint location preserves all of the essential information that identifies a query. Each query must have a different checkpoint location. Multiple queries should never have the same location. For more information, see the Structured Streaming Programming Guide. https://docs.databricks.com/en/structured-streaming/query-recovery.html
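The point made in the comments above is that the checkpoint stores aggregation state written against the old schema, so restarting the changed query over it fails; a fresh checkpointLocation starts clean. A hypothetical plain-Python model of that behavior (the checkpoint paths, schemas, and error handling here are all illustrative, not Spark's actual implementation):

```python
# Model the checkpoint directory as a dict: location -> saved state.
# Saved state carries the schema it was written with; restoring it under a
# different aggregation schema is rejected, mimicking a state-schema mismatch.
checkpoints: dict = {}

def start_query(checkpoint_location: str, schema: tuple) -> dict:
    state = checkpoints.get(checkpoint_location)
    if state is None:
        # Fresh checkpoint: initialize empty state under the new schema.
        state = {"schema": schema, "aggregates": {}}
        checkpoints[checkpoint_location] = state
    elif state["schema"] != schema:
        raise RuntimeError(
            f"state schema {state['schema']} in {checkpoint_location!r} "
            f"does not match query schema {schema}"
        )
    return state

start_query("/chk/v1", ("item", "sales_count"))  # original query
try:
    # Restarting the modified query against the OLD checkpoint fails:
    start_query("/chk/v1", ("item", "sales_count", "promo_count"))
except RuntimeError as e:
    print("restart against old checkpoint failed:", e)

# Specifying a NEW checkpointLocation starts cleanly with the new schema:
start_query("/chk/v2", ("item", "sales_count", "promo_count"))
```

The trade-off, worth knowing for the exam, is that a new checkpoint discards the old running aggregates, so the query recomputes state from whatever source data is still replayable.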
Question fZMUiuaH8CJZW9rLweqT
Question
When using CLI or REST API to get results from jobs with multiple tasks, which statement correctly describes the response structure?
Choices
- A: Each run of a job will have a unique job_id; all tasks within this job will have a unique job_id
- B: Each run of a job will have a unique job_id; all tasks within this job will have a unique task_id
- C: Each run of a job will have a unique orchestration_id; all tasks within this job will have a unique run_id
- D: Each run of a job will have a unique run_id; all tasks within this job will have a unique task_id
- E: Each run of a job will have a unique run_id; all tasks within this job will also have a unique run_id
Answer: E Answer_ET: E Community answer E (65%) D (25%) 10% Discussion
Comment 1296897 by cales
- Upvotes: 6
Selected Answer: E The correct answer is E. https://docs.databricks.com/api/workspace/jobs/getrun You can visit the link and observe the response sample of the API: { "job_id": 11223344, "run_id": 455644833, ...
job_id is the unique id of the job; run_id is the unique id of the run.
Then each task will have its own unique run_id:
"tasks": [ { ... "run_id": 2112892, ... } ]
Comment 1373422 by mohadjhamad
- Upvotes: 1
Selected Answer: D Each job execution (run) gets a unique run_id
This run_id identifies a specific instance of a job run. It allows tracking of job execution details, logs, and results. Each task within that job run has a unique task_id
Multi-task jobs have multiple tasks, each assigned a distinct task_id. The task_id helps in monitoring and retrieving individual task details.
Comment 1335497 by arekm
- Upvotes: 1
Selected Answer: E E, but choosing between D and E is a play on words. Still, E seems a tiny bit more explicit. The bottom line: the run_id attribute of each task within the job will be different.
Comment 1332157 by AlejandroU
- Upvotes: 2
Selected Answer: E Answer E. Job Run (run_id): Each execution of a job has a unique run_id for the entire job. Task Run (run_id): For jobs with multiple tasks, each task also gets its own run_id, which is distinct from the job’s run_id. This run_id for tasks can be used to retrieve individual task outputs.
https://docs.databricks.com/api/workspace/jobs/getrun
“tasks: The list of tasks performed by the run. Each task has its own run_id which you can use to call JobsGetOutput to retrieve the run results. “
Comment 1320180 by Thameur01
- Upvotes: 2
Selected Answer: B B if the question is about /api/2.1/jobs/get; D if the question is about /api/2.1/jobs/runs/get.
Comment 1320178 by Thameur01
- Upvotes: 1
Selected Answer: D The question should clearly specify get job or get job run; there are differences in the responses.
Comment 1310020 by cf56faf
- Upvotes: 1
Selected Answer: D Must be D.
Comment 1304496 by benni_ale
- Upvotes: 2
Selected Answer: E E
Comment 1298591 by Kreshu
- Upvotes: 2
Selected Answer: E Correct answer is E
Comment 1290757 by thelio_team
- Upvotes: 3
tested answer is E : ‘tasks’: [{‘run_id’: *****, ‘task_key’: ’######’, …
Comment 1221096 by MDWPartners
- Upvotes: 2
Selected Answer: D Seems right
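The response shape quoted in the discussion above can be checked by parsing a sample payload modeled on the Jobs API runs/get response (field names follow the docs cited above; the numeric values are made up for illustration):

```python
import json

# Sample runs/get payload: the job run has its own run_id, and every task in
# "tasks" carries its own distinct run_id as well (values are illustrative).
payload = json.loads("""
{
  "job_id": 11223344,
  "run_id": 455644833,
  "tasks": [
    {"task_key": "ingest",    "run_id": 2112892},
    {"task_key": "transform", "run_id": 2112893}
  ]
}
""")

job_run_id = payload["run_id"]                          # unique per job run
task_run_ids = [t["run_id"] for t in payload["tasks"]]  # unique per task

print("job run:", job_run_id, "task runs:", task_run_ids)
```

Note that job_id identifies the job definition (stable across runs), while each execution and each task within it gets its own run_id, which is why E fits the runs/get response.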
Question XwirdpOBHjcFO9IfAUM2
Question
The data engineering team is configuring environments for development, testing, and production before beginning migration on a new data pipeline. The team requires extensive testing on both the code and data resulting from code execution, and the team wants to develop and test against data as similar to production data as possible.
A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team.
Which statement captures best practices for this situation?
Choices
- A: All development, testing, and production code and data should exist in a single, unified workspace; creating separate environments for testing and development complicates administrative overhead.
- B: In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.
- C: As long as code in the development environment declares USE dev_db at the top of each notebook, there is no possibility of inadvertently committing changes back to production data sources.
- D: Because Delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data; as such, it is generally safe to mount production data anywhere.
- E: Because access to production data will always be verified using passthrough credentials, it is safe to mount data to any Databricks development environment.
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1221097 by MDWPartners
- Upvotes: 5
Selected Answer: B Seems right
Comment 1339888 by lene
- Upvotes: 1
Selected Answer: B Shallow clones of prod data can be created in dev for testing purposes
Comment 1297915 by Colje
- Upvotes: 1
Selected Answer: B The correct answer is B. In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.
Explanation: Best practices for managing production, development, and testing environments involve minimizing the risk of unintended data modifications or deletions, especially when dealing with production data. The ideal setup includes:
- Limiting permissions: production data should only be accessible with read permissions in development or testing environments, to prevent accidental changes.
- Isolating environments: creating separate databases for development, testing, and production ensures clear boundaries and that development code cannot unintentionally affect production data.