Questions and Answers
Question xaEQtKPQWV8nDENSCbXq
Question
A junior developer complains that the code in their notebook isn’t producing the correct results in the development environment. A shared screenshot reveals that while they’re using a notebook versioned with Databricks Repos, they’re using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown. Which approach will allow this developer to review the current logic for this notebook?
Choices
- A: Use Repos to make a pull request and use the Databricks REST API to update the current branch to dev-2.3.9
- B: Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.
- C: Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch
- D: Merge all changes back to the main branch in the remote Git repository and clone the repo again
- E: Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository
Answer: B (Answer_ET: B). Community answer: B (100%).
Discussion
Comment 1506718 by codebender
- Upvotes: 1
Selected Answer: B The first step is to pull the latest commits from the remote.
Comment 1294569 by benni_ale
- Upvotes: 1
Selected Answer: B I would also say B, but could anyone explain how to pick that branch if it is not available from the dropdown?
Comment 1286663 by benni_ale
- Upvotes: 1
Selected Answer: B I would say B
Comment 1224436 by imatheushenrique
- Upvotes: 2
B. Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.
Comment 1128060 by AziLa
- Upvotes: 1
The correct answer is B.
Comment 1121592 by Jay_98_11
- Upvotes: 2
Selected Answer: B I vote for B as well.
Comment 1044666 by sturcu
- Upvotes: 1
Selected Answer: B B is correct
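For reference, the pull-and-checkout that answer B performs through the Repos UI can also be done programmatically with the Databricks Repos REST API (PATCH /api/2.0/repos/{repo_id}). A minimal sketch in Python, assuming you supply your own workspace URL, personal access token, and repo ID:

```python
import requests

# Placeholders you must fill in for your own workspace.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
REPO_ID = "<repo-id>"  # discoverable via GET /api/2.0/repos

# Check out dev-2.3.9 and update the repo to its latest remote commit.
resp = requests.patch(
    f"{WORKSPACE_URL}/api/2.0/repos/{REPO_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"branch": "dev-2.3.9"},
)
resp.raise_for_status()
print(resp.json())  # includes the repo's current branch and head commit
```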
Question 20N1c7BqnikzdlqTkbxP
Question
A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executor. When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?
Choices
- A: The Five Minute Load Average remains consistent/flat
- B: Bytes Received never exceeds 80 million bytes per second
- C: Total Disk Space remains constant
- D: Network I/O never spikes
- E: Overall cluster CPU utilization is around 25%
Answer: E (Answer_ET: E). Community answer: E (48%), D (33%), A (19%).
Discussion
Comment 991509 by BrianNguyen95
- Upvotes: 19
Option E: In a Spark cluster, the driver node is responsible for managing the execution of the Spark application, including scheduling tasks, managing the execution plan, and interacting with the cluster manager. If the overall cluster CPU utilization is low (e.g., around 25%), it may indicate that the driver node is not utilizing the available resources effectively and might be a bottleneck.
Comment 1364294 by Tedet
- Upvotes: 2
Selected Answer: A When you see the “Five Minute Load Average” remain consistent or flat, it could indicate that the driver is under heavy load and is struggling to keep up with the workload. In the case of a Spark cluster, if the driver is handling too much work, it can become a bottleneck and prevent the overall job from progressing efficiently.
Comment 1332442 by srinivasa
- Upvotes: 3
Selected Answer: A Consistent/Flat Five Minute Load Average: If the load average on the driver node remains consistent and does not fluctuate, it suggests that the driver is under constant, significant load. This could be a sign that the driver is performing a lot of work, potentially leading to a bottleneck.
Comment 1326817 by AlejandroU
- Upvotes: 2
Selected Answer: E Answer E. A low CPU usage could indicate that the driver isn’t working as efficiently as expected, which can lead to underutilization of the cluster and slower processing times.
Comment 1318289 by JB90
- Upvotes: 1
Selected Answer: E Only when the driver does all or most of the work will the overall cluster CPU utilization be this low, since the driver's CPU is 25% of the cluster's total CPU.
Comment 1303250 by nedlo
- Upvotes: 2
Selected Answer: E A bottleneck here, much like data skew, means one node is doing the majority of the work while the others sit idle, so E is correct.
Comment 1299700 by m79590530
- Upvotes: 1
Selected Answer: E D would mean the driver never sends large chunks of data to the worker nodes, but since Network I/O is not said to be zero, there can still be a constant flow of data between the driver and the workers, so it is not a measure of a driver bottleneck. Answer E, however, means one of the 4 cluster nodes is always working at 100%, which can only be the driver node, as it is always coordinating work across the executors.
Comment 1270130 by fe3b2fc
- Upvotes: 2
Selected Answer: D Executors talk to each other and across nodes; if the code/driver were working as intended, you would see spikes in network I/O while transferring data. If the code/driver were the issue, you would see a spike in CPU usage and little network traffic between nodes. The correct answer is D.
Comment 1227690 by lophonos
- Upvotes: 1
Selected Answer: E E is correct
Comment 1143155 by guillesd
- Upvotes: 1
Selected Answer: D If there is no I/O between the driver and executor nodes, then the executor nodes are not working.
Comment 1108701 by Patito
- Upvotes: 2
Selected Answer: D D seems to be right
Comment 1091957 by rok21
- Upvotes: 1
Selected Answer: E E is correct
Comment 1091782 by azurelearn2020
- Upvotes: 2
Selected Answer: E 25% indicates the cluster CPU is under-utilized.
Comment 1052869 by sturcu
- Upvotes: 3
Selected Answer: E If the overall cluster CPU utilization is around 25%, it means that only one out of the four nodes (driver + 3 executors) is using its full CPU capacity, while the other three nodes are idle or underutilized
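To make the symptom in answer E concrete, here is a minimal PySpark sketch (table and column names are hypothetical) contrasting a driver-bound anti-pattern with its distributed equivalent. On a 4-node cluster (driver + 3 executors of the same VM type), the first version keeps only the driver busy, so overall cluster CPU utilization hovers near 25%:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("sensor_readings")  # hypothetical table

# Driver-bound anti-pattern: every row is collected to the driver and
# aggregated in a single-threaded Python loop, leaving the three
# executors idle while the driver's CPU is pegged.
rows = df.collect()
total = sum(r["value"] for r in rows)

# Distributed equivalent: the aggregation runs on the executors, and
# cluster-wide CPU utilization rises accordingly.
total = df.agg(F.sum("value")).first()[0]
```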
Question rruD0fJCk76pvMu50jx7
Question
Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?
Choices
- A: In the Executor’s log file, by grepping for “predicate push-down”
- B: In the Stage’s Detail screen, in the Completed Stages table, by noting the size of data read from the Input column
- C: In the Storage Detail screen, by noting which RDDs are not stored on disk
- D: In the Delta Lake transaction log, by noting the column statistics
- E: In the Query Detail screen, by interpreting the Physical Plan
Answer: E (Answer_ET: E). Community answer: E (83%), B (17%).
Discussion
Comment 1364299 by Tedet
- Upvotes: 1
Selected Answer: E Predicate push-down is an optimization where conditions (such as filters) are pushed as close to the data source as possible (often to the database or file system level), reducing the amount of data read and processed. If predicate push-down isn’t being leveraged, it can result in reading unnecessary data, leading to performance degradation. Execute a query ⇒ Click View and go to Spark UI ⇒ Navigate to SQL/DataFrame tab in SparkUI ⇒ Click on any stage ⇒ Navigate to details to find Physical Plan
Comment 1353918 by shaswat1404
- Upvotes: 1
Selected Answer: B When predicate push-down is working properly, the amount of data read should be much lower, because the data source can filter out rows at read time based on the query predicates. If predicate push-down is not leveraged, stages may read a much larger volume of data than necessary, which can be observed in the Input column of the Stage's Detail screen; therefore B is the correct option. Not A: executor logs might contain some information, but they are not the most direct way to assess predicate push-down. Not C: the Storage Detail screen is for checking RDD caching and persistence, not predicate push-down. Not D: the transaction log holds metadata and statistics but is not viewed via the Spark UI for diagnosing query performance. Not E: while the physical plan in the Query Detail screen might show filter push-down, interpreting it requires more expertise, and the input data size metric (option B) is a more straightforward indicator.
Comment 1306886 by benni_ale
- Upvotes: 1
Selected Answer: E E
Comment 1293856 by dd1192d
- Upvotes: 2
Selected Answer: E E is correct : https://docs.datastax.com/en/dse/6.9/spark/predicate-push-down.html
Comment 1143091 by P1314
- Upvotes: 1
Selected Answer: E Query plan. The correct answer is E.
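As a quick illustration of answer E: the physical plan shown in the Query Detail screen can also be printed with explain(). For a file source such as Parquet, a pushed predicate shows up under PushedFilters in the scan node; an empty PushedFilters list for a filtered scan is the tell-tale sign that push-down was lost. This is a minimal sketch, and the path and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/events")  # hypothetical path
filtered = df.filter(df["event_date"] == "2023-01-01")

# In a healthy plan the scan node lists the predicate, e.g.:
#   PushedFilters: [IsNotNull(event_date), EqualTo(event_date,2023-01-01)]
# If PushedFilters is empty (for example, because the filter wraps the
# column in a Python UDF), the whole input is read and filtered later.
filtered.explain(mode="formatted")
```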
Question OMDtthj31Uvaf0fRLYnX
Question
Review the following error traceback: //IMG//
Which statement describes the error being raised?
Choices
- A: The code executed was PySpark but was executed in a Scala notebook.
- B: There is no column in the table named heartrateheartrateheartrate
- C: There is a type error because a column object cannot be multiplied.
- D: There is a type error because a DataFrame object cannot be multiplied.
- E: There is a syntax error because the heartrate column is not correctly identified as a column.
Answer: B (Answer_ET: B). Community answer: B (75%), E (25%).
Discussion
Comment 1005470 by CertPeople
- Upvotes: 8
Selected Answer: B It's B; there is no column with that name.
Comment 1091960 by rok21
- Upvotes: 5
Selected Answer: E E is correct
Comment 1143185 by guillesd
- Upvotes: 2
Selected Answer: B It's B. Regarding E, a syntax error would mean the query is invalid due to an incorrectly written SQL statement. That is not the case here; the column simply does not exist.
Comment 1121989 by Jay_98_11
- Upvotes: 1
Selected Answer: B https://sparkbyexamples.com/spark/spark-cannot-resolve-given-input-columns/
Comment 1088715 by Gulenur_GS
- Upvotes: 2
The answer is E, because df.select(3*df['heartrate']).show() returns results without error.
Comment 1088603 by Gulenur
- Upvotes: 2
Answer is E: df.select(3*df['heartrate']) returns a correct result without error.
Comment 1066322 by npc0001
- Upvotes: 2
Selected Answer: B Answer B
Comment 1066034 by Dileepvikram
- Upvotes: 2
Answer is B
Comment 1044884 by sturcu
- Upvotes: 2
Selected Answer: B No such column found
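The distinction the top-voted answers draw can be reproduced directly: multiplying a column object is perfectly valid, while referencing a column that does not exist raises an AnalysisException ("cannot resolve ... given input columns"), which is answer B's scenario rather than a syntax or type error. A minimal sketch with hypothetical data:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(60,), (72,)], ["heartrate"])

# Multiplying a column object works fine, so C and E do not apply.
df.select(3 * df["heartrate"]).show()

# Selecting a column that is not in the schema raises an
# AnalysisException ("cannot resolve ... given input columns"), i.e. B.
try:
    df.select(3 * F.col("heartrateheartrateheartrate")).show()
except AnalysisException as e:
    print(e)
```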
Question 84GCbbeykbwleCQ4zbKb
Question
Which distribution does Databricks support for installing custom Python code packages?
Choices
- A: sbt
- B: CRAN
- C: npm
- D: Wheels
- E: jars
Answer: D (Answer_ET: D). Community answer: D (100%).
Discussion
Comment 1299558 by benni_ale
- Upvotes: 1
Selected Answer: D I think D is correct
Comment 1159649 by hal2401me
- Upvotes: 4
Selected Answer: D https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/how-to/use-python-wheels-in-workflows
Comment 1099944 by sodere
- Upvotes: 1
Selected Answer: D https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/how-to/use-python-wheels-in-workflows
Comment 1099478 by alexvno
- Upvotes: 2
Selected Answer: D Wheels should be ok
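In practice, a custom wheel can be attached to a cluster as a library or installed notebook-scoped with the %pip magic. A minimal notebook-cell sketch (the wheel path is a hypothetical placeholder):

```python
# Databricks notebook cell: install a custom wheel scoped to this
# notebook's Python environment (path is a hypothetical placeholder).
%pip install /dbfs/FileStore/wheels/my_package-0.1.0-py3-none-any.whl
```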