Questions and Answers
Question DQGxvIYeVozLgJQu8WDe
Question
A data engineer wants to refactor the following DLT code, which includes multiple table definitions with very similar code.
//IMG//
In an attempt to programmatically create these tables using a parameterized table definition, the data engineer writes the following code.
//IMG//
The pipeline runs an update with this refactored code, but generates a different DAG showing incorrect configuration values for these tables.
How can the data engineer fix this?
Choices
- A: Wrap the for loop inside another table definition, using generalized names and properties to replace with those from the inner table definition.
- B: Convert the list of configuration values to a dictionary of table settings, using table names as keys.
- C: Move the table definition into a separate function, and make calls to this function using different input parameters inside the for loop.
- D: Load the configuration values for these tables from a separate file, located at a path provided by a pipeline parameter.
Answer: C | Answer_ET: C | Community answer: C (100%)
Discussion
Comment 1325505 by Thameur01
- Upvotes: 2
Selected Answer: C
Here is a correct implementation:

```python
import dlt

def create_table(t):
    @dlt.table(name=f"{t}_dataset")
    def table_definition():
        return spark.read.table(t)

tables = ["t1", "t2", "t3"]
for t in tables:
    create_table(t)
```
Comment 1323778 by benni_ale
- Upvotes: 2
Selected Answer: C The problem seems to be that the inner table function takes no parameter and closes over the loop variable `t`, which is not re-bound per iteration, so it won't hold the expected value by the time the function is evaluated. However, I have not tested it, as DLT is not available in the free membership. :(
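The images are not reproduced here, but the failure mode benni_ale describes matches Python's late-binding closures: functions defined in a loop all read the loop variable when they are finally called, not when they are defined. Below is a minimal sketch of the broken pattern, using placeholder table names (the real names are in the question's images); the factory-function fix is the one Thameur01 shows above.

```python
import dlt

tables = ["t1", "t2", "t3"]  # hypothetical names; the originals are in the image

# Broken pattern: every decorated function closes over the same variable `t`.
# DLT evaluates these functions after the loop has finished, so each body
# resolves `t` to its final value ("t3"), producing a wrong DAG.
for t in tables:
    @dlt.table(name=f"{t}_dataset")  # name= is evaluated eagerly, so it is correct
    def table_definition():
        return spark.read.table(t)   # late-bound: reads "t3" in every table
```

Moving the definition into a function (option C) works because the call binds `t` as a local argument, giving each table definition its own scope.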
Question ncAplwMCDiQwucWNcTcd
Question
Which statement characterizes the general programming model used by Spark Structured Streaming?
Choices
- A: Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.
- B: Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.
- C: Structured Streaming uses specialized hardware and I/O streams to achieve sub-second latency for data transfer.
- D: Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.
- E: Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.
Answer: D | Answer_ET: D | Community answer: D (100%)
Discussion
Comment 1334796 by arekm
- Upvotes: 1
Selected Answer: D D - see the explanation in comment 8605246
Comment 1222475 by imatheushenrique
- Upvotes: 2
D. Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.
Comment 1161219 by mardigras
- Upvotes: 2
Selected Answer: D Yes. answer is D
Comment 1121953 by Jay_98_11
- Upvotes: 1
Selected Answer: D vote for D
Comment 1040345 by sturcu
- Upvotes: 2
Selected Answer: D Correct. Structured Streaming treats the stream as a table that is continuously appended to
Comment 973661 by 8605246
- Upvotes: 4
correct; "The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended. This leads to a new stream processing model that is very similar to a batch processing model. You will express your streaming computation as a standard batch-like query, as on a static table, and Spark runs it as an incremental query on the unbounded input table." https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
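As a concrete illustration of the model the quoted guide describes, here is a minimal PySpark sketch using the built-in rate source (all names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unbounded-table-demo").getOrCreate()

# The rate source appends new rows continuously; Structured Streaming treats
# the stream as an unbounded input table.
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# A standard batch-like query, written as if against a static table; Spark
# runs it incrementally over the rows appended since the last trigger.
counts = stream_df.groupBy((F.col("value") % 10).alias("bucket")).count()

query = (counts.writeStream
         .outputMode("complete")   # emit the full result table each trigger
         .format("console")
         .start())

query.awaitTermination(30)  # let it run for ~30 seconds
query.stop()
```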
Question aaeirAs5Cm4derdxGkLE
Question
Which configuration parameter directly affects the size of a spark-partition upon ingestion of data into Spark?
Choices
- A: spark.sql.files.maxPartitionBytes
- B: spark.sql.autoBroadcastJoinThreshold
- C: spark.sql.files.openCostInBytes
- D: spark.sql.adaptive.coalescePartitions.minPartitionNum
- E: spark.sql.adaptive.advisoryPartitionSizeInBytes
Answer: A | Answer_ET: A | Community answer: A (100%)
Discussion
Comment 973662 by 8605246
- Upvotes: 5
correct; The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC. https://spark.apache.org/docs/latest/sql-performance-tuning.html
Comment 1121954 by Jay_98_11
- Upvotes: 3
Selected Answer: A correct
Comment 1040372 by sturcu
- Upvotes: 1
Selected Answer: A From the provided list, this fits best. In reality, partition size/number can be influenced by many settings
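For reference, a hedged sketch of how this setting changes the number of input partitions when reading a file-based source (the path is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default is 128 MB: input files are split into chunks of at most this size.
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# Lower the cap to 32 MB; the same input now yields roughly 4x as many,
# smaller spark-partitions.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(32 * 1024 * 1024))

# Illustrative path; the setting applies only to file-based sources such as
# Parquet, JSON, and ORC.
df = spark.read.parquet("/tmp/events.parquet")
print(df.rdd.getNumPartitions())
```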
Question fAKTHx7trU7vYFzbUQXB
Question
A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that, for tasks in a particular stage, the minimum and median times to complete a task are roughly the same, but the maximum duration is roughly 100 times as long as the minimum.
Which situation is causing the increased duration of the overall job?
Choices
- A: Task queueing resulting from improper thread pool assignment.
- B: Spill resulting from attached volume storage being too small.
- C: Network latency due to some cluster nodes being in different regions from the source data
- D: Skew caused by more data being assigned to a subset of spark-partitions.
- E: Credential validation errors while pulling data from an external system.
Answer: D | Answer_ET: D | Community answer: D (100%)
Discussion
Comment 1334801 by arekm
- Upvotes: 2
Selected Answer: D D - the other answers don't make sense. In particular C: all nodes of a cluster must be in the same region (at least on AWS and Azure; on GCP I don't know, but they have networks spanning regions, so maybe it is possible).
Comment 1319131 by benni_ale
- Upvotes: 1
Selected Answer: D D is correct
Comment 1269674 by AndreFR
- Upvotes: 1
A excluded: task queueing does not increase the duration of an individual task.
B excluded: spill is writing to storage when memory is insufficient (not when storage is insufficient).
C excluded: region placement cannot have a 100-times impact on duration.
E excluded: no errors are mentioned in the question.
Comment 1222478 by imatheushenrique
- Upvotes: 1
D. Skew caused by more data being assigned to a subset of spark-partitions.
Comment 1168595 by vikram12apr
- Upvotes: 3
Selected Answer: D Because a subset of tasks processes the majority of the data while the rest process very little, and the total execution time depends on the slowest tasks. Answer is D.
Comment 1121955 by Jay_98_11
- Upvotes: 1
Selected Answer: D correct
Comment 1118666 by kz_data
- Upvotes: 1
Selected Answer: D I think D is correct
Comment 1040373 by sturcu
- Upvotes: 1
Selected Answer: D D is correct
Comment 1013363 by Eertyy
- Upvotes: 3
D is the correct answer
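To confirm skew outside the Spark UI, one can count rows per spark-partition; a hedged sketch, assuming a DataFrame `df` is already loaded:

```python
from pyspark.sql import functions as F

# Rows per spark-partition: a handful of partitions holding most of the rows
# is what shows up in the UI as a max task duration far above the median.
(df.groupBy(F.spark_partition_id().alias("partition_id"))
   .count()
   .orderBy(F.desc("count"))
   .show(10))

# On recent Spark versions, adaptive query execution can split skewed
# shuffle partitions automatically:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```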
Question bUUaAnJWxDHlnnXRLqNh
Question
Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores and only one Executor per VM. Given a job with at least one wide transformation, which of the following cluster configurations will result in maximum performance?
Choices
- A: Total VMs: 1; 400 GB per Executor; 160 Cores per Executor
- B: Total VMs: 8; 50 GB per Executor; 20 Cores per Executor
- C: Total VMs: 16; 25 GB per Executor; 10 Cores per Executor
- D: Total VMs: 4; 100 GB per Executor; 40 Cores per Executor
- E: Total VMs: 2; 200 GB per Executor; 80 Cores per Executor
Answer: B | Answer_ET: B | Community answer: B (44%), A (29%), D (15%), other (12%)
Discussion
Comment 988395 by robson90
- Upvotes: 43
Option A; the question is about maximum performance. A wide transformation will often result in an expensive shuffle. With one executor this problem is resolved. https://docs.databricks.com/en/clusters/cluster-config-best-practices.html#complex-batch-etl
Comment 1422587 by Ashok_Choudhary_CT
- Upvotes: 1
Selected Answer: C How does Option (C) excel?
✅ More Executors (16 vs. 8 in Option B) → faster parallel execution.
✅ Fewer Cores per Executor (10 vs. 20 in Option B) → prevents CPU contention and scheduling delays.
✅ Better memory management (25 GB vs. 50 GB in Option B) → reduces GC overhead.
Final verdict: Option (C) is the "best" configuration for handling a job with wide transformations.
Comment 1410164 by capt2101akash
- Upvotes: 1
Selected Answer: A The question asks about the highest performance for one large wide transformation, which favors fewer, larger VMs/executors. Therefore, one should choose the largest possible option.
Comment 1353500 by shaswat1404
- Upvotes: 1
Selected Answer: C Overly large executors are bad due to large garbage collection (GC) overhead and inefficient parallelism. Option C provides the best balance of parallelism, memory utilization, and performance efficiency.
Comment 1351152 by fabiospont
- Upvotes: 1
Selected Answer: A A is correct: only one VM per job.
Comment 1335397 by hassan_1
- Upvotes: 2
Selected Answer: B As the question states one executor per VM, and the recommendation is not to use a single worker in production, the answer should be B.
Comment 1335203 by HairyTorso
- Upvotes: 4
Selected Answer: B From the Databricks guide on the number of workers: "Choosing the right number of workers requires some trials and iterations to figure out the compute and memory needs of a Spark job. Here are some guidelines to help you start:
- Never choose a single worker for a production job, as it will be the single point of failure
- Start with 2-4 workers for small workloads (for example, a job with no wide transformations like joins and aggregations)
- Start with 8-10 workers for medium to big workloads that involve wide transformations like joins and aggregations, then scale up if necessary"
https://www.databricks.com/discover/pages/optimize-data-workloads-guide#number-workers
Comment 1334804 by arekm
- Upvotes: 1
Selected Answer: A Maximum performance - A guarantees no shuffles between nodes in the cluster. Only processes on one VM.
Comment 1325025 by AlejandroU
- Upvotes: 1
Selected Answer: B Answer B offers a good balance with 8 executors, providing a decent amount of memory and cores per executor, allowing for significant parallel processing. Option C increases the number of executors further but at the cost of reduced memory and cores per executor, which might not be as effective for wide transformations.
Comment 1324187 by janeZ
- Upvotes: 2
Selected Answer: C for wide transformations, leveraging multiple executors typically results in better performance, resource utilization, and fault tolerance.
Comment 1320335 by Shakmak
- Upvotes: 2
Selected Answer: B B is the correct answer, based on https://www.databricks.com/discover/pages/optimize-data-workloads-guide#all-purpose
Comment 1309988 by AndreFR
- Upvotes: 3
Selected Answer: B Besides the fact that A & E do not provide enough parallelism & fault tolerance, I can't explain why, but the correct answer is B. I got the same question during the exam and scored 100% on that topic with answer B. (B is the answer provided by other sites similar to ExamTopics.)
Choosing between B, C & D is tricky!
Comment 1307463 by kimberlyvsmith
- Upvotes: 3
Selected Answer: B B
"Number of workers: Choosing the right number of workers requires some trials and iterations to figure out the compute and memory needs of a Spark job. Here are some guidelines to help you start:
- Never choose a single worker for a production job, as it will be the single point of failure
- Start with 2-4 workers for small workloads (for example, a job with no wide transformations like joins and aggregations)
- Start with 8-10 workers for medium to big workloads that involve wide transformations like joins and aggregations, then scale up if necessary"
Comment 1184214 by arik90
- Upvotes: 1
Selected Answer: A A wide transformation falls under complex ETL, which means Option A is correct; the documentation doesn't say to do otherwise in this scenario.
Comment 1145269 by PrashantTiwari
- Upvotes: 1
A is correct
Comment 1127760 by vikrampatel5
- Upvotes: 3
Selected Answer: A Option A: https://docs.databricks.com/en/clusters/cluster-config-best-practices.html#complex-batch-etl
Comment 1116490 by RafaelCFC
- Upvotes: 1
Selected Answer: A robson90’s response explains it perfectly and has documentation to support it.
Comment 1065040 by ofed
- Upvotes: 2
Option A
Comment 1062114 by ismoshkov
- Upvotes: 2
Selected Answer: A Our goal is top performance. Vertical scaling is more performant than horizontal, especially since we know we need cross-VM exchange. Option A.
Comment 1061975 by dp_learner
- Upvotes: 1
response A, as for complex batch ETL:
"More complex ETL jobs, such as processing that requires unions and joins across multiple tables, will probably work best when you can minimize the amount of data shuffled. Since reducing the number of workers in a cluster will help minimize shuffles, you should consider a smaller cluster like cluster A in the following diagram over a larger cluster like cluster D."
Comment 1013509 by Santitoxic
- Upvotes: 4
Selected Answer: D Considering the need for both memory and parallelism, option D seems to offer the best balance between resources and parallel processing. It provides a reasonable amount of memory and cores per Executor while maintaining a sufficient level of parallelism with 4 Executors. This configuration is likely to result in maximum performance for a job with at least one wide transformation.
Comment 1009649 by mwyopme
- Upvotes: 3
Sorry, response C = 16 VMs for maximizing the wide transformation
Comment 1009648 by mwyopme
- Upvotes: 1
Key message is: given a job with at least one wide transformation, for performance one should maximize the number of concurrent VMs. Selecting response B. 160/10 = 16 VMs
Comment 988351 by taif12340
- Upvotes: 1
Selected Answer: D Considering the need for both memory and parallelism, option D seems to offer the best balance between resources and parallel processing. It provides a reasonable amount of memory and cores per Executor while maintaining a sufficient level of parallelism with 4 Executors. This configuration is likely to result in maximum performance for a job with at least one wide transformation.
Comment 985731 by BrianNguyen95
- Upvotes: 1
The correct answer is E: Option E provides a substantial amount of memory and cores per executor, allowing the job to handle wide transformations efficiently. However, performance can also be influenced by factors like the nature of your specific workload, data distribution, and overall cluster utilization. It's a good practice to conduct benchmarking and performance testing with various configurations to determine the optimal setup for your specific use case.
Comment 975880 by stuart_gta1
- Upvotes: 2
C. More VMs help distribute the workload across the cluster, which results in better fault tolerance and increases the chances of job completion.
Comment 970043 by asmayassineg
- Upvotes: 1
Answer should be E: if at least one transformation is wide, one executor with 200 GB can do the job, and the rest of the tasks can be carried out on the other node.
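Whichever answer one favors, note that total parallelism and total memory are identical across all five options; the disagreement above is purely about shuffle locality versus fault tolerance and GC pressure. A quick sanity check over the numbers from the choices:

```python
# Numbers taken directly from the answer choices. Total cores and total RAM
# are constant; what differs is how much shuffle traffic crosses the network
# (fewer, larger VMs keep more of it local) versus fault tolerance and GC
# pressure (more, smaller VMs spread the risk).
configs = {
    "A": {"vms": 1,  "gb": 400, "cores": 160},
    "B": {"vms": 8,  "gb": 50,  "cores": 20},
    "C": {"vms": 16, "gb": 25,  "cores": 10},
    "D": {"vms": 4,  "gb": 100, "cores": 40},
    "E": {"vms": 2,  "gb": 200, "cores": 80},
}

for name, c in configs.items():
    print(f"{name}: {c['vms'] * c['cores']} total cores, "
          f"{c['vms'] * c['gb']} GB total RAM")
# Every option prints "160 total cores, 400 GB total RAM".
```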