Questions and Answers
Question DQGxvIYeVozLgJQu8WDe
Question
A data engineer wants to refactor the following DLT code, which includes multiple table definitions with very similar code.
//IMG//
In an attempt to programmatically create these tables using a parameterized table definition, the data engineer writes the following code.
//IMG//
The pipeline runs an update with this refactored code, but generates a different DAG showing incorrect configuration values for these tables.
How can the data engineer fix this?
Choices
- A: Wrap the for loop inside another table definition, using generalized names and properties to replace with those from the inner table definition.
- B: Convert the list of configuration values to a dictionary of table settings, using table names as keys.
- C: Move the table definition into a separate function, and make calls to this function using different input parameters inside the for loop.
- D: Load the configuration values for these tables from a separate file, located at a path provided by a pipeline parameter.
Answer: C | Answer_ET: C | Community answer: C (100%)
Discussion
Comment 1325505 by Thameur01
- Upvotes: 2
Selected Answer: C
Here is a correct implementation:

```python
import dlt

def create_table(t):
    @dlt.table(name=f"{t}_dataset")
    def table_definition():
        return spark.read.table(t)

tables = ["t1", "t2", "t3"]
for t in tables:
    create_table(t)
```
Comment 1323778 by benni_ale
- Upvotes: 2
Selected Answer: C The problem seems to be that the inner table function takes no parameter and closes over the loop variable `t`, which is not re-bound per iteration, so it won't hold the expected value by the time the function is evaluated. However, I have not tested it, as DLT is not available in the free membership. :(
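The images are not reproduced here, but the failure mode benni_ale describes matches Python's late-binding closures: functions defined in a loop all read the loop variable when they are finally called, not when they are defined. Below is a minimal sketch of the broken pattern, using placeholder table names (the real names are in the question's images); the factory-function fix is the one Thameur01 shows above.

```python
import dlt

tables = ["t1", "t2", "t3"]  # hypothetical names; the originals are in the image

# Broken pattern: every decorated function closes over the same variable `t`.
# DLT evaluates these functions after the loop has finished, so each body
# resolves `t` to its final value ("t3"), producing a wrong DAG.
for t in tables:
    @dlt.table(name=f"{t}_dataset")  # name= is evaluated eagerly, so it is correct
    def table_definition():
        return spark.read.table(t)   # late-bound: reads "t3" in every table
```

Moving the definition into a function (option C) works because the call binds `t` as a local argument, giving each table definition its own scope.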
Question ncAplwMCDiQwucWNcTcd
Question
Which statement characterizes the general programming model used by Spark Structured Streaming?
Choices
- A: Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.
- B: Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.
- C: Structured Streaming uses specialized hardware and I/O streams to achieve sub-second latency for data transfer.
- D: Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.
- E: Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.
Answer: D | Answer_ET: D | Community answer: D (100%)
Discussion
Comment 1334796 by arekm
- Upvotes: 1
Selected Answer: D D - see the explanation in comment 8605246
Comment 1222475 by imatheushenrique
- Upvotes: 2
D. Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.
Comment 1161219 by mardigras
- Upvotes: 2
Selected Answer: D Yes. answer is D
Comment 1121953 by Jay_98_11
- Upvotes: 1
Selected Answer: D vote for D
Comment 1040345 by sturcu
- Upvotes: 2
Selected Answer: D Correct. Structured Streaming treats the stream as a table that is continuously appended to
Comment 973661 by 8605246
- Upvotes: 4
correct; "The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended. This leads to a new stream processing model that is very similar to a batch processing model. You will express your streaming computation as a standard batch-like query, as on a static table, and Spark runs it as an incremental query on the unbounded input table." https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
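As a concrete illustration of the model the quoted guide describes, here is a minimal PySpark sketch using the built-in rate source (all names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unbounded-table-demo").getOrCreate()

# The rate source appends new rows continuously; Structured Streaming treats
# the stream as an unbounded input table.
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# A standard batch-like query, written as if against a static table; Spark
# runs it incrementally over the rows appended since the last trigger.
counts = stream_df.groupBy((F.col("value") % 10).alias("bucket")).count()

query = (counts.writeStream
         .outputMode("complete")   # emit the full result table each trigger
         .format("console")
         .start())

query.awaitTermination(30)  # let it run for ~30 seconds
query.stop()
```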
Question aaeirAs5Cm4derdxGkLE
Question
Which configuration parameter directly affects the size of a spark-partition upon ingestion of data into Spark?
Choices
- A: spark.sql.files.maxPartitionBytes
- B: spark.sql.autoBroadcastJoinThreshold
- C: spark.sql.files.openCostInBytes
- D: spark.sql.adaptive.coalescePartitions.minPartitionNum
- E: spark.sql.adaptive.advisoryPartitionSizeInBytes
Answer: A | Answer_ET: A | Community answer: A (100%)
Discussion
Comment 973662 by 8605246
- Upvotes: 5
correct; The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC. https://spark.apache.org/docs/latest/sql-performance-tuning.html
Comment 1121954 by Jay_98_11
- Upvotes: 3
Selected Answer: A correct
Comment 1040372 by sturcu
- Upvotes: 1
Selected Answer: A From the provided list, this fits best. In reality, partition size/number can be influenced by many settings
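For reference, a hedged sketch of how this setting changes the number of input partitions when reading a file-based source (the path is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default is 128 MB: input files are split into chunks of at most this size.
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# Lower the cap to 32 MB; the same input now yields roughly 4x as many,
# smaller spark-partitions.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(32 * 1024 * 1024))

# Illustrative path; the setting applies only to file-based sources such as
# Parquet, JSON, and ORC.
df = spark.read.parquet("/tmp/events.parquet")
print(df.rdd.getNumPartitions())
```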
Question fAKTHx7trU7vYFzbUQXB
Question
A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that, for tasks in a particular stage, the minimum and median times to complete a task are roughly the same, but the maximum duration is roughly 100 times as long as the minimum.
Which situation is causing the increased duration of the overall job?
Choices
- A: Task queueing resulting from improper thread pool assignment.
- B: Spill resulting from attached volume storage being too small.
- C: Network latency due to some cluster nodes being in different regions from the source data
- D: Skew caused by more data being assigned to a subset of spark-partitions.
- E: Credential validation errors while pulling data from an external system.
Answer: D | Answer_ET: D | Community answer: D (100%)
Discussion
Comment 1334801 by arekm
- Upvotes: 2
Selected Answer: D D - the other answers don't make sense. In particular C: all nodes of a cluster must be in the same region (at least on AWS and Azure; on GCP I don't know, but they have networks spanning regions, so maybe it is possible).
Comment 1319131 by benni_ale
- Upvotes: 1
Selected Answer: D D is correct
Comment 1269674 by AndreFR
- Upvotes: 1
A excluded: task queueing does not increase the duration of an individual task.
B excluded: spill is writing to storage when memory is insufficient (not when storage is insufficient).
C excluded: region placement cannot have a 100-times impact on duration.
E excluded: no errors are mentioned in the question.
Comment 1222478 by imatheushenrique
- Upvotes: 1
D. Skew caused by more data being assigned to a subset of spark-partitions.
Comment 1168595 by vikram12apr
- Upvotes: 3
Selected Answer: D Because a subset of tasks processes the majority of the data while the rest process very little, and the total execution time depends on the slowest tasks. Answer is D.
Comment 1121955 by Jay_98_11
- Upvotes: 1
Selected Answer: D correct
Comment 1118666 by kz_data
- Upvotes: 1
Selected Answer: D I think D is correct
Comment 1040373 by sturcu
- Upvotes: 1
Selected Answer: D D is correct
Comment 1013363 by Eertyy
- Upvotes: 3
D is the correct answer
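To confirm skew outside the Spark UI, one can count rows per spark-partition; a hedged sketch, assuming a DataFrame `df` is already loaded:

```python
from pyspark.sql import functions as F

# Rows per spark-partition: a handful of partitions holding most of the rows
# is what shows up in the UI as a max task duration far above the median.
(df.groupBy(F.spark_partition_id().alias("partition_id"))
   .count()
   .orderBy(F.desc("count"))
   .show(10))

# On recent Spark versions, adaptive query execution can split skewed
# shuffle partitions automatically:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```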
Question bUUaAnJWxDHlnnXRLqNh
Question
Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores and only one Executor per VM. Given a job with at least one wide transformation, which of the following cluster configurations will result in maximum performance?
Choices
- A: Total VMs: 1; 400 GB per Executor; 160 Cores per Executor
- B: Total VMs: 8; 50 GB per Executor; 20 Cores per Executor
- C: Total VMs: 16; 25 GB per Executor; 10 Cores per Executor
- D: Total VMs: 4; 100 GB per Executor; 40 Cores per Executor
- E: Total VMs: 2; 200 GB per Executor; 80 Cores per Executor
Answer: B | Answer_ET: B | Community answer: B (44%), A (29%), D (15%), other (12%)
Discussion
Comment 988395 by robson90
- Upvotes: 43
Option A; the question is about maximum performance. A wide transformation will often result in an expensive shuffle. With one executor this problem is resolved. https://docs.databricks.com/en/clusters/cluster-config-best-practices.html#complex-batch-etl
Comment 1422587 by Ashok_Choudhary_CT
- Upvotes: 1
Selected Answer: C How does Option (C) excel?
✅ More Executors (16 vs. 8 in Option B) → faster parallel execution.
✅ Fewer Cores per Executor (10 vs. 20 in Option B) → prevents CPU contention and scheduling delays.
✅ Better memory management (25 GB vs. 50 GB in Option B) → reduces GC overhead.
Final verdict: Option (C) is the "best" configuration for handling a job with wide transformations.
Comment 1410164 by capt2101akash
- Upvotes: 1
Selected Answer: A The question asks about the highest performance for one large wide transformation, which favors fewer, larger VMs/executors. Therefore, one should choose the largest possible option.
Comment 1353500 by shaswat1404
- Upvotes: 1
Selected Answer: C Overly large executors are bad due to large garbage collection (GC) overhead and inefficient parallelism. Option C provides the best balance of parallelism, memory utilization, and performance efficiency.
Comment 1351152 by fabiospont
- Upvotes: 1
Selected Answer: A A is correct: only one VM per job.
Comment 1335397 by hassan_1
- Upvotes: 2
Selected Answer: B As the question states one executor per VM, and the recommendation is not to use a single worker in production, the answer should be B.
Comment 1335203 by HairyTorso
- Upvotes: 4
Selected Answer: B From the Databricks guide on the number of workers: "Choosing the right number of workers requires some trials and iterations to figure out the compute and memory needs of a Spark job. Here are some guidelines to help you start:
- Never choose a single worker for a production job, as it will be the single point of failure
- Start with 2-4 workers for small workloads (for example, a job with no wide transformations like joins and aggregations)
- Start with 8-10 workers for medium to big workloads that involve wide transformations like joins and aggregations, then scale up if necessary"
https://www.databricks.com/discover/pages/optimize-data-workloads-guide#number-workers
Comment 1334804 by arekm
- Upvotes: 1
Selected Answer: A Maximum performance - A guarantees no shuffles between nodes in the cluster. Only processes on one VM.
Comment 1325025 by AlejandroU
- Upvotes: 1
Selected Answer: B Answer B offers a good balance with 8 executors, providing a decent amount of memory and cores per executor, allowing for significant parallel processing. Option C increases the number of executors further but at the cost of reduced memory and cores per executor, which might not be as effective for wide transformations.
Comment 1324187 by janeZ
- Upvotes: 2
Selected Answer: C for wide transformations, leveraging multiple executors typically results in better performance, resource utilization, and fault tolerance.
Comment 1320335 by Shakmak
- Upvotes: 2
Selected Answer: B B is the correct answer, based on https://www.databricks.com/discover/pages/optimize-data-workloads-guide#all-purpose
Comment 1309988 by AndreFR
- Upvotes: 3
Selected Answer: B Besides the fact that A & E do not provide enough parallelism & fault tolerance, I can't explain why, but the correct answer is B. I got the same question during the exam and scored 100% on that topic with answer B. (B is the answer provided by other sites similar to ExamTopics.)
Choosing between B, C & D is tricky!
Comment 1307463 by kimberlyvsmith
- Upvotes: 3
Selected Answer: B B
"Number of workers: Choosing the right number of workers requires some trials and iterations to figure out the compute and memory needs of a Spark job. Here are some guidelines to help you start:
- Never choose a single worker for a production job, as it will be the single point of failure
- Start with 2-4 workers for small workloads (for example, a job with no wide transformations like joins and aggregations)
- Start with 8-10 workers for medium to big workloads that involve wide transformations like joins and aggregations, then scale up if necessary"
Comment 1184214 by arik90
- Upvotes: 1
Selected Answer: A A wide transformation falls under complex ETL, which means Option A is correct; the documentation doesn't say to do otherwise in this scenario.
Comment 1145269 by PrashantTiwari
- Upvotes: 1
A is correct
Comment 1127760 by vikrampatel5
- Upvotes: 3
Selected Answer: A Option A: https://docs.databricks.com/en/clusters/cluster-config-best-practices.html#complex-batch-etl
Comment 1116490 by RafaelCFC
- Upvotes: 1
Selected Answer: A robson90’s response explains it perfectly and has documentation to support it.
Comment 1065040 by ofed
- Upvotes: 2
Option A
Comment 1062114 by ismoshkov
- Upvotes: 2
Selected Answer: A Our goal is top performance. Vertical scaling is more performant than horizontal, especially since we know we need cross-VM exchange. Option A.
Comment 1061975 by dp_learner
- Upvotes: 1
response A, as for complex batch ETL:
"More complex ETL jobs, such as processing that requires unions and joins across multiple tables, will probably work best when you can minimize the amount of data shuffled. Since reducing the number of workers in a cluster will help minimize shuffles, you should consider a smaller cluster like cluster A in the following diagram over a larger cluster like cluster D."
Comment 1013509 by Santitoxic
- Upvotes: 4
Selected Answer: D Considering the need for both memory and parallelism, option D seems to offer the best balance between resources and parallel processing. It provides a reasonable amount of memory and cores per Executor while maintaining a sufficient level of parallelism with 4 Executors. This configuration is likely to result in maximum performance for a job with at least one wide transformation.
Comment 1009649 by mwyopme
- Upvotes: 3
Sorry, response C = 16 VMs for maximizing the wide transformation
Comment 1009648 by mwyopme
- Upvotes: 1
Key message is: given a job with at least one wide transformation, for performance one should maximize the number of concurrent VMs. Selecting response B. 160/10 = 16 VMs
Comment 988351 by taif12340
- Upvotes: 1
Selected Answer: D Considering the need for both memory and parallelism, option D seems to offer the best balance between resources and parallel processing. It provides a reasonable amount of memory and cores per Executor while maintaining a sufficient level of parallelism with 4 Executors. This configuration is likely to result in maximum performance for a job with at least one wide transformation.
Comment 985731 by BrianNguyen95
- Upvotes: 1
The correct answer is E: Option E provides a substantial amount of memory and cores per executor, allowing the job to handle wide transformations efficiently. However, performance can also be influenced by factors like the nature of your specific workload, data distribution, and overall cluster utilization. It's a good practice to conduct benchmarking and performance testing with various configurations to determine the optimal setup for your specific use case.
Comment 975880 by stuart_gta1
- Upvotes: 2
C. More VMs help distribute the workload across the cluster, which results in better fault tolerance and increases the chances of job completion.
Comment 970043 by asmayassineg
- Upvotes: 1
Answer should be E: if at least one transformation is wide, one executor with 200 GB can do the job, and the rest of the tasks can be carried out on the other node.
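Whichever answer one favors, note that total parallelism and total memory are identical across all five options; the disagreement above is purely about shuffle locality versus fault tolerance and GC pressure. A quick sanity check over the numbers from the choices:

```python
# Numbers taken directly from the answer choices. Total cores and total RAM
# are constant; what differs is how much shuffle traffic crosses the network
# (fewer, larger VMs keep more of it local) versus fault tolerance and GC
# pressure (more, smaller VMs spread the risk).
configs = {
    "A": {"vms": 1,  "gb": 400, "cores": 160},
    "B": {"vms": 8,  "gb": 50,  "cores": 20},
    "C": {"vms": 16, "gb": 25,  "cores": 10},
    "D": {"vms": 4,  "gb": 100, "cores": 40},
    "E": {"vms": 2,  "gb": 200, "cores": 80},
}

for name, c in configs.items():
    print(f"{name}: {c['vms'] * c['cores']} total cores, "
          f"{c['vms'] * c['gb']} GB total RAM")
# Every option prints "160 total cores, 400 GB total RAM".
```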