Questions and Answers
Question OYJFSCV0O9Cc6OvRp1Ct
Question
Each configuration below is identical in that each cluster has 400 GB of RAM in total, 160 total cores, and only one Executor per VM.
Given an extremely long-running job for which completion must be guaranteed, which cluster configuration will be able to guarantee completion of the job in light of one or more VM failures?
Choices
- A: • Total VMs: 8 • 50 GB per Executor • 20 Cores / Executor
- B: • Total VMs: 16 • 25 GB per Executor • 10 Cores / Executor
- C: • Total VMs: 1 • 400 GB per Executor • 160 Cores/Executor
- D: • Total VMs: 4 • 100 GB per Executor • 40 Cores / Executor
- E: • Total VMs: 2 • 200 GB per Executor • 80 Cores / Executor
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1349180 by 91d511b
- Upvotes: 1
Selected Answer: B Total VMs = 16
Resources per VM: 25 GB RAM and 10 cores per executor.
Impact of a VM failure: losing one VM means losing only 6.25% of the cluster's resources.
Fault tolerance: excellent; the cluster can handle multiple VM failures (up to ~3 VMs) and still function effectively.
Best balance: with smaller VMs, the job remains highly fault-tolerant while using resources efficiently.
Comment 1286992 by shaojunni
- Upvotes: 2
16 VMs provide more redundancy, fault tolerance, and parallelism. But if the dataset is huge, 8 VMs may be better. The question is missing some information.
Comment 1244363 by c00ccb7
- Upvotes: 2
Selected Answer: B This setup ensures that the job can continue running and complete even if some VMs fail, as there are more VMs available to handle the workload
Comment 1217499 by ChayV
- Upvotes: 3
Selected Answer: B If a VM goes down, performance is degraded, so opt for VMs with memory distributed across more executors and an optimal number of cores per executor.
Comment 1173156 by hal2401me
- Upvotes: 4
Selected Answer: B In my exam today, I chose B (16 VMs) because of the "extremely long-running" requirement.
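The fault-tolerance argument in the comments above can be sketched with simple arithmetic: with one executor per VM, losing a VM costs 1/N of the cluster. A minimal Python sketch (configuration labels taken from the choices above):

```python
# Fraction of cluster resources (RAM/cores) lost when a single VM fails,
# for each candidate configuration. All clusters total 400 GB RAM / 160 cores,
# with exactly one executor per VM, so one VM failure costs 1/N of the cluster.
configs = {"A": 8, "B": 16, "C": 1, "D": 4, "E": 2}  # label -> total VMs

def loss_per_vm_failure(total_vms: int) -> float:
    """Fraction of resources lost when one VM (one executor) dies."""
    return 1 / total_vms

for label, vms in sorted(configs.items()):
    # e.g. B loses 6.25% per VM failure, C loses 100.00%
    print(f"{label}: {loss_per_vm_failure(vms):.2%} lost per VM failure")
```

Configuration B minimizes the blast radius of any single failure, which is why it is the best guarantee of completion for an extremely long-running job.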
Question sW5DCS0Awk0HAQFbWz5s
Question
A Delta Lake table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.
Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table.
Given the current implementation, which method can be used?
Choices
- A: Execute a query to calculate the difference between the new version and the previous version using Delta Lake’s built-in versioning and time travel functionality.
- B: Parse the Delta Lake transaction log to identify all newly written data files.
- C: Parse the Spark event logs to identify those rows that were updated, inserted, or deleted.
- D: Execute DESCRIBE HISTORY customer_churn_params to obtain the full operation metrics for the update, including a log of all records that have been added or modified.
- E: Use Delta Lake’s change data feed to identify those records that have been updated, inserted, or deleted.
Answer: A Answer_ET: A Community answer A (80%) E (20%) Discussion
Comment 1335492 by arekm
- Upvotes: 1
Selected Answer: A D - see the discussion under Jugiboss's comment.
Comment 1326376 by Sriramiyer92
- Upvotes: 2
Selected Answer: A CDF is particularly useful for incremental loads; in our case the table is overwritten. Hence A.
Comment 1302361 by Jugiboss
- Upvotes: 1
Selected Answer: E The best method to determine the difference between the new version and the previous version of the customer_churn_params table in Delta Lake is:
E. Use Delta Lake’s change data feed to identify those records that have been updated, inserted, or deleted.
This approach leverages Delta Lake’s built-in functionality to track changes at the record level, providing a clear view of what has changed between versions.
Comment 1296880 by cales
- Upvotes: 1
Selected Answer: E Change data feed allows checking for changes between versions.
Comment 1270611 by robodog
- Upvotes: 2
Selected Answer: A Answer is A. The easy way to get the difference between those versions is time travel by version number.
Comment 1256245 by HelixAbdu
- Upvotes: 2
Answer is A. There is no clue that CDF is enabled for the table
Comment 1244424 by c00ccb7
- Upvotes: 3
Selected Answer: A Answer A
Comment 1224722 by Deb9753
- Upvotes: 2
Answer : E
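Since the nightly job overwrites the table, Delta retains the previous version, and the diff can be computed with time travel plus set-difference queries (e.g. `SELECT * FROM customer_churn_params VERSION AS OF <new> EXCEPT SELECT * FROM customer_churn_params VERSION AS OF <new - 1>`). A plain-Python sketch of that same set logic, with the two versions modeled as sets of row tuples (the data is illustrative):

```python
# Two table "versions" as sets of (customer, churn_score) rows - illustrative
# data standing in for VERSION AS OF n-1 and VERSION AS OF n of the Delta table.
previous_version = {("cust1", 0.10), ("cust2", 0.55), ("cust3", 0.80)}
new_version      = {("cust1", 0.12), ("cust2", 0.55), ("cust4", 0.30)}

# EXCEPT semantics: rows present in one version but not the other.
added_or_changed = new_version - previous_version
removed          = previous_version - new_version

print(sorted(added_or_changed))
print(sorted(removed))
```

This works with the current implementation precisely because time travel needs no prior setup, whereas the change data feed (option E) would only help if CDF had already been enabled on the table before the update.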
Question ld7n3IalZrV7ZqbOYmBm
Question
A data team’s Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new promotion, and they would like to add a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows. Note that proposed changes are in bold.
Original query:
//IMG//
Proposed query:
//IMG//
Which step must also be completed to put the proposed query into production?
Choices
- A: Specify a new checkpointLocation
- B: Remove .option('mergeSchema', 'true') from the streaming write
- C: Increase the shuffle partitions to account for additional aggregates
- D: Run REFRESH TABLE delta.`/item_agg`
Answer: A Answer_ET: A Community answer A (100%) Discussion
Comment 1224724 by Deb9753
- Upvotes: 5
Answer: A When updating the schema of a streaming job, specifying a new checkpoint location ensures that the streaming query starts fresh with the new schema. This avoids issues that might arise from schema mismatches between the previous state and the new schema. This is especially relevant when adding new fields because the existing state might not be compatible with the new schema.
Comment 1342070 by SeRo42
- Upvotes: 1
Selected Answer: A Are filter conditions in count() allowed?
Comment 1270614 by robodog
- Upvotes: 1
Selected Answer: A A answer
Comment 1221093 by MDWPartners
- Upvotes: 2
Selected Answer: A This checkpoint location preserves all of the essential information that identifies a query. Each query must have a different checkpoint location. Multiple queries should never have the same location. For more information, see the Structured Streaming Programming Guide. https://docs.databricks.com/en/structured-streaming/query-recovery.html
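The point made in the comments above is that the checkpoint stores aggregation state written against the old schema, so restarting the changed query over it fails; a fresh checkpointLocation starts clean. A hypothetical plain-Python model of that behavior (the checkpoint paths, schemas, and error handling here are all illustrative, not Spark's actual implementation):

```python
# Model the checkpoint directory as a dict: location -> saved state.
# Saved state carries the schema it was written with; restoring it under a
# different aggregation schema is rejected, mimicking a state-schema mismatch.
checkpoints: dict = {}

def start_query(checkpoint_location: str, schema: tuple) -> dict:
    state = checkpoints.get(checkpoint_location)
    if state is None:
        # Fresh checkpoint: initialize empty state under the new schema.
        state = {"schema": schema, "aggregates": {}}
        checkpoints[checkpoint_location] = state
    elif state["schema"] != schema:
        raise RuntimeError(
            f"state schema {state['schema']} in {checkpoint_location!r} "
            f"does not match query schema {schema}"
        )
    return state

start_query("/chk/v1", ("item", "sales_count"))  # original query
try:
    # Restarting the modified query against the OLD checkpoint fails:
    start_query("/chk/v1", ("item", "sales_count", "promo_count"))
except RuntimeError as e:
    print("restart against old checkpoint failed:", e)

# Specifying a NEW checkpointLocation starts cleanly with the new schema:
start_query("/chk/v2", ("item", "sales_count", "promo_count"))
```

The trade-off, worth knowing for the exam, is that a new checkpoint discards the old running aggregates, so the query recomputes state from whatever source data is still replayable.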
Question fZMUiuaH8CJZW9rLweqT
Question
When using CLI or REST API to get results from jobs with multiple tasks, which statement correctly describes the response structure?
Choices
- A: Each run of a job will have a unique job_id; all tasks within this job will have a unique job_id
- B: Each run of a job will have a unique job_id; all tasks within this job will have a unique task_id
- C: Each run of a job will have a unique orchestration_id; all tasks within this job will have a unique run_id
- D: Each run of a job will have a unique run_id; all tasks within this job will have a unique task_id
- E: Each run of a job will have a unique run_id; all tasks within this job will also have a unique run_id
Answer: E Answer_ET: E Community answer E (65%) D (25%) 10% Discussion
Comment 1296897 by cales
- Upvotes: 6
Selected Answer: E The correct answer is E. https://docs.databricks.com/api/workspace/jobs/getrun You can visit the link and observe the response sample of the API: { "job_id": 11223344, "run_id": 455644833, ...
job_id is the unique id of the job; run_id is the unique id of the run.
Then each task will have its own unique run_id:
"tasks": [ { ... "run_id": 2112892, ... } ]
Comment 1373422 by mohadjhamad
- Upvotes: 1
Selected Answer: D Each job execution (run) gets a unique run_id
This run_id identifies a specific instance of a job run. It allows tracking of job execution details, logs, and results. Each task within that job run has a unique task_id
Multi-task jobs have multiple tasks, each assigned a distinct task_id. The task_id helps in monitoring and retrieving individual task details.
Comment 1335497 by arekm
- Upvotes: 1
Selected Answer: E E, but choosing between D and E is a play on words. Still, E seems a tiny bit more explicit. The bottom line: the run_id attribute of each task within the job will be different.
Comment 1332157 by AlejandroU
- Upvotes: 2
Selected Answer: E Answer E. Job Run (run_id): Each execution of a job has a unique run_id for the entire job. Task Run (run_id): For jobs with multiple tasks, each task also gets its own run_id, which is distinct from the job’s run_id. This run_id for tasks can be used to retrieve individual task outputs.
https://docs.databricks.com/api/workspace/jobs/getrun
“tasks: The list of tasks performed by the run. Each task has its own run_id which you can use to call JobsGetOutput to retrieve the run results. “
Comment 1320180 by Thameur01
- Upvotes: 2
Selected Answer: B B if the question is about /api/2.1/jobs/get; D if the question is about /api/2.1/jobs/runs/get.
Comment 1320178 by Thameur01
- Upvotes: 1
Selected Answer: D The question should clearly specify get job or get job run; there are differences in the responses.
Comment 1310020 by cf56faf
- Upvotes: 1
Selected Answer: D Must be D.
Comment 1304496 by benni_ale
- Upvotes: 2
Selected Answer: E E
Comment 1298591 by Kreshu
- Upvotes: 2
Selected Answer: E Correct answer is E
Comment 1290757 by thelio_team
- Upvotes: 3
tested answer is E : ‘tasks’: [{‘run_id’: *****, ‘task_key’: ’######’, …
Comment 1221096 by MDWPartners
- Upvotes: 2
Selected Answer: D Seems right
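The response shape quoted in the discussion above can be checked by parsing a sample payload modeled on the Jobs API runs/get response (field names follow the docs cited above; the numeric values are made up for illustration):

```python
import json

# Sample runs/get payload: the job run has its own run_id, and every task in
# "tasks" carries its own distinct run_id as well (values are illustrative).
payload = json.loads("""
{
  "job_id": 11223344,
  "run_id": 455644833,
  "tasks": [
    {"task_key": "ingest",    "run_id": 2112892},
    {"task_key": "transform", "run_id": 2112893}
  ]
}
""")

job_run_id = payload["run_id"]                          # unique per job run
task_run_ids = [t["run_id"] for t in payload["tasks"]]  # unique per task

print("job run:", job_run_id, "task runs:", task_run_ids)
```

Note that job_id identifies the job definition (stable across runs), while each execution and each task within it gets its own run_id, which is why E fits the runs/get response.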
Question XwirdpOBHjcFO9IfAUM2
Question
The data engineering team is configuring environments for development, testing, and production before beginning migration on a new data pipeline. The team requires extensive testing on both the code and data resulting from code execution, and the team wants to develop and test against data as similar to production data as possible.
A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team.
Which statement captures best practices for this situation?
Choices
- A: All development, testing, and production code and data should exist in a single, unified workspace; creating separate environments for testing and development complicates administrative overhead.
- B: In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.
- C: As long as code in the development environment declares USE dev_db at the top of each notebook, there is no possibility of inadvertently committing changes back to production data sources.
- D: Because Delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data; as such, it is generally safe to mount production data anywhere.
- E: Because access to production data will always be verified using passthrough credentials, it is safe to mount data to any Databricks development environment.
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1221097 by MDWPartners
- Upvotes: 5
Selected Answer: B Seems right
Comment 1339888 by lene
- Upvotes: 1
Selected Answer: B Shallow clones of prod data can be created in dev for testing purposes
Comment 1297915 by Colje
- Upvotes: 1
Selected Answer: B The correct answer is B. In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.
Explanation: Best practices for managing production, development, and testing environments involve minimizing the risk of unintended data modifications or deletions, especially when dealing with production data. The ideal setup includes:
- Limiting permissions: production data should only be accessible with read permissions in development or testing environments, to prevent accidental changes.
- Isolating environments: creating separate databases for development, testing, and production ensures clear boundaries and that development code cannot unintentionally affect production data.