Questions and Answers
Question slSue2y8VE49lfg0pryf
Question
Which statement regarding Spark configuration on the Databricks platform is true?
Choices
- A: The Databricks REST API can be used to modify the Spark configuration properties for an interactive cluster without interrupting jobs currently running on the cluster.
- B: Spark configurations set within a notebook will affect all SparkSessions attached to the same interactive cluster.
- C: Spark configuration properties can only be set for an interactive cluster by creating a global init script.
- D: Spark configuration properties set for an interactive cluster with the Clusters UI will impact all notebooks attached to that cluster.
- E: When the same Spark configuration property is set for an interactive cluster and a notebook attached to that cluster, the notebook setting will always be ignored.
answer?
Answer: D Answer_ET: D Community answer D (90%) Discussion
Comment 1087767 by hamzaKhribi
- Upvotes: 8
Selected Answer: D I tried it myself: setting a Spark conf in the cluster UI impacts all notebooks attached to that cluster. For example, I set the number of shuffle partitions to 4, and in every notebook, when I inspect the number of partitions, I find 4.
Comment 1335237 by Sriramiyer92
- Upvotes: 2
Selected Answer: D D is correct. Point to note on precedence: Spark configurations set programmatically in a notebook or script take precedence over cluster-wide configurations set in the Clusters UI.
Comment 1321796 by benni_ale
- Upvotes: 1
Selected Answer: D Changing the configuration via either the API or the UI triggers a restart of the cluster, so A is wrong. D is correct, given that "impact" does not mean the cluster can't be restarted.
Comment 1160230 by Curious76
- Upvotes: 1
Selected Answer: B A. Incorrect: Modifying configurations through the Databricks REST API while jobs are running can lead to unexpected behavior or disruption. It’s generally not recommended. C. Incorrect: While global init scripts can be used, it’s not the only way. Configurations can also be set within notebooks. D. Incorrect: Configurations set through the Clusters UI apply to the entire cluster, but they might not necessarily override configurations set within notebooks attached to the cluster. E. Incorrect: Notebook configurations can take precedence over cluster-level configurations for the same property, offering finer-grained control at the notebook level.
Comment 1141664 by vctrhugo
- Upvotes: 3
Selected Answer: D These settings are applied at the cluster level and affect all SparkSessions on the cluster.
Comment 1100388 by alexvno
- Upvotes: 4
Selected Answer: D A is wrong because the cluster will restart → D
Comment 1085492 by petrv
- Upvotes: 1
Selected Answer: A In Databricks, you can use the Databricks REST API to modify Spark configuration properties for an interactive cluster without interrupting currently running jobs. This allows you to dynamically adjust Spark configurations to optimize performance or meet specific requirements without the need to restart the cluster.
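The precedence behavior behind answer D (and against answer E) can be sketched in plain Python. This is a hedged simulation with dicts, not Databricks code: the config key is real, but the layering below only illustrates the behavior described in the discussion.

```python
# Hedged sketch: Spark config precedence on an interactive cluster,
# simulated with plain dicts. Cluster-UI values form a base layer that
# every attached notebook sees; a notebook's spark.conf.set(...) calls
# override it for that notebook's SparkSession only.

cluster_ui_conf = {"spark.sql.shuffle.partitions": "4"}  # set in the Clusters UI
notebook_conf = {}  # populated by spark.conf.set(...) calls in one notebook


def effective(key):
    """Notebook-level settings take precedence over cluster-wide ones."""
    return notebook_conf.get(key, cluster_ui_conf.get(key))


# Every notebook attached to the cluster initially sees the UI value (answer D):
assert effective("spark.sql.shuffle.partitions") == "4"

# A notebook can override the value for its own session, so the notebook
# setting is NOT always ignored (why answer E is wrong):
notebook_conf["spark.sql.shuffle.partitions"] = "8"
assert effective("spark.sql.shuffle.partitions") == "8"
```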
Question tV7c3Y99bBZympTgKibZ
Question
A developer has successfully configured their credentials for Databricks Repos and cloned a remote Git repository. They do not have privileges to make changes to the main branch, which is the only branch currently visible in their workspace.
Which approach allows this user to share their code updates without the risk of overwriting the work of their teammates?
Choices
- A: Use Repos to checkout all changes and send the git diff log to the team.
- B: Use Repos to create a fork of the remote repository, commit all changes, and make a pull request on the source repository.
- C: Use Repos to pull changes from the remote Git repository; commit and push changes to a branch that appeared as changes were pulled.
- D: Use Repos to merge all differences and make a pull request back to the remote repository.
- E: Use Repos to create a new branch, commit all changes, and push changes to the remote Git repository.
answer?
Answer: E Answer_ET: E Community answer E (63%) B (37%) Discussion
Comment 1330810 by AlejandroU
- Upvotes: 1
Selected Answer: E Answer E. Creating a new branch within the repository is the most effective and recommended approach for developers who lack write access to the main branch in Databricks Repos. While forking is a valid strategy in some situations, it’s generally less common and potentially more complex within the context of Databricks Repos, especially when the primary goal is to contribute to the existing shared repository.
Comment 1301422 by benni_ale
- Upvotes: 1
Selected Answer: E I changed my mind; I think it is E.
Comment 1299655 by benni_ale
- Upvotes: 1
Selected Answer: B Explanation:
The developer does not have privileges to make changes to the main branch of the remote repository, and it is the only branch visible in their workspace. To share code updates without risking overwriting teammates' work, the best approach is to create a personal copy (fork) of the remote repository. The forked repository lives under the developer's own account or workspace, giving them full control over it.
The developer commits all code updates to the fork. Since they have full privileges on their fork, they can create branches, commit changes, and manage the repository as needed.
Option E: creating a new branch and pushing changes to the remote repository requires write access to that repository, which the developer does not have.
By forking the repository, the developer avoids any permission issues and ensures that their work does not interfere with the main codebase until it is reviewed and approved by the team.
Comment 1166272 by hal2401me
- Upvotes: 4
Selected Answer: E E is the regular collaboration approach. B makes a fork, which breaks away from the collaborating teammates; there is no way they can make a pull request on the source repository after making changes to a fork.
Comment 1160498 by Curious76
- Upvotes: 2
Selected Answer: B Databricks Repos itself does not currently support creating forks directly within the platform. However, you can achieve a similar workflow using the following steps:
Use the git clone command in a terminal or IDE to create a local copy of the remote repository. This effectively creates a local fork. Make your changes in the local copy. Use git push to push your changes to a new remote repository you create on a Git hosting service like GitHub. Create a pull request from your new remote repository to the original repository on Databricks Repos.
This approach allows you to make changes to your own copy of the code, collaborate with others through code reviews, and propose your changes for integration into the main branch without directly modifying it.
Comment 1141661 by vctrhugo
- Upvotes: 3
Selected Answer: E This is a common workflow in collaborative development environments. The developer can create a new branch in their local repository, make changes, and then push the branch to the remote repository. This way, they can share their updates without modifying the main branch directly. After pushing the changes, they can create a pull request on the remote repository, allowing their teammates to review the changes before merging them into the main branch. This process ensures that the main branch remains stable and that all changes are reviewed and approved before they’re incorporated. It also prevents any accidental overwrites of teammates’ work.
Comment 1138381 by PrincipalJoe
- Upvotes: 3
Databricks Repos cannot be used to fork a repository
Comment 1137369 by adenis
- Upvotes: 3
Selected Answer: E E is correct
Comment 1136242 by Rinscy
- Upvotes: 2
E is correct. If you create a fork, you create another repository, so not B.
Comment 1134620 by Crocjun
- Upvotes: 2
Selected Answer: B B is correct
Comment 1131768 by spaceexplorer
- Upvotes: 2
Selected Answer: B B is correct
Comment 1111275 by dmov
- Upvotes: 1
Isn’t this B?
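The branch-based workflow from answer E can be sketched with plain git commands, driven here from Python against a throwaway local repository. Databricks Repos performs the same create-branch / commit / push steps through its UI; the repository, branch, and file names below are illustrative, and pushing plus opening the pull request is left as a comment since there is no real remote here.

```python
# Hedged sketch: share updates on a new branch instead of committing to
# the protected main branch (answer E). Runs against a temporary local
# repo; names are illustrative.
import pathlib
import subprocess
import tempfile

repo = tempfile.mkdtemp()


def git(*args):
    """Run a git command in the demo repo and return its stdout."""
    cmd = ("git", "-c", "user.email=dev@example.com", "-c", "user.name=dev") + args
    return subprocess.run(
        cmd, cwd=repo, capture_output=True, text=True, check=True
    ).stdout.strip()


git("init", "-q", "-b", "main", ".")
git("commit", "-q", "--allow-empty", "-m", "initial commit on main")

# Instead of committing to main (no write access), create a new branch:
git("checkout", "-q", "-b", "feature/my-update")
pathlib.Path(repo, "etl.py").write_text("print('updated logic')\n")
git("add", "etl.py")
git("commit", "-q", "-m", "add ETL update")

current = git("branch", "--show-current")
print(current)  # feature/my-update
# Next steps (omitted, no real remote): push the branch, then open a
# pull request so teammates review before merging into main.
```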
Question 7tBseGd9qKWWrnsSJpYV
Question
In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both DEEP and SHALLOW CLONE, development tables are created using SHALLOW CLONE.
A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The transaction logs for the source tables show that VACUUM was run the day before.
Which statement describes why the cloned tables are no longer working?
Choices
- A: Because Type 1 changes overwrite existing records, Delta Lake cannot guarantee data consistency for cloned tables.
- B: Running VACUUM automatically invalidates any shallow clones of a table; DEEP CLONE should always be used when a cloned table will be repeatedly queried.
- C: Tables created with SHALLOW CLONE are automatically deleted after their default retention threshold of 7 days.
- D: The metadata created by the CLONE operation is referencing data files that were purged as invalid by the VACUUM command.
- E: The data files compacted by VACUUM are not tracked by the cloned metadata; running REFRESH on the cloned table will pull in recent changes.
answer?
Answer: D Answer_ET: D Community answer D (88%) Discussion
Comment 1100398 by alexvno
- Upvotes: 8
Selected Answer: D Shallow clone: only duplicates the metadata of the table being cloned; the data files of the table itself are not copied. These clones are cheaper to create but are not self-contained and depend on the source from which they were cloned as the source of data. If the files in the source that the clone depends on are removed, for example with VACUUM, a shallow clone may become unusable. Therefore, shallow clones are typically used for short-lived use cases such as testing and experimentation.
Comment 1298677 by benni_ale
- Upvotes: 2
I was not sure whether B or D, but I think running the VACUUM command does not automatically invalidate shallow clones; it is just that the data referenced by the clone is no longer present. A shallow clone can still work even after a VACUUM command runs on the source table, so B is not completely correct.
Comment 1141658 by vctrhugo
- Upvotes: 3
Selected Answer: D In Delta Lake, the VACUUM command deletes data files that are no longer referenced by a Delta table and are older than the retention threshold. When a table is cloned using SHALLOW CLONE, the clone references the same data files as the original table but creates a new transaction log. If VACUUM is run on the original table, it can delete data files that are still being referenced by the cloned table’s metadata, causing the cloned table to stop working. This is because the VACUUM command doesn’t know about the cloned table’s references to the data files. Therefore, it’s important to be cautious when running VACUUM on tables that have clones.
Comment 1131771 by spaceexplorer
- Upvotes: 1
Selected Answer: D D is correct
Comment 1075152 by AzureDE2522
- Upvotes: 2
Selected Answer: D Please refer: https://docs.databricks.com/en/delta/clone.html#what-are-the-semantics-of-delta-clone-operations
Comment 1071206 by 60ties
- Upvotes: 2
Selected Answer: B B is best
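The failure mode behind answer D can be illustrated with a plain-Python analogy: a shallow clone copies only metadata (references to the source table's data files), so purging those files breaks the clone. This is a hedged simulation with ordinary files; Delta Lake's real mechanism is its transaction log, and the file names below are illustrative.

```python
# Hedged analogy: SHALLOW CLONE records pointers to the source's data
# files without copying them; VACUUM on the source purges files the
# source no longer references (e.g., files rewritten by Type 1 SCD
# overwrites), leaving the clone's metadata dangling.
import os
import tempfile

src = tempfile.mkdtemp()
data_file = os.path.join(src, "part-0000.parquet")
with open(data_file, "w") as f:
    f.write("rows...")

# SHALLOW CLONE: keep references to the source's files, copy nothing.
shallow_clone_metadata = [data_file]

# Type 1 SCD overwrite rewrites the file; VACUUM then purges the old one:
os.remove(data_file)

# The clone's metadata now references purged files (answer D):
readable = all(os.path.exists(p) for p in shallow_clone_metadata)
print(readable)  # False: the clone can no longer resolve its data files
```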
Question kUZpLaiA7Htn8FA1jRtV
Question
You are performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.
Which code block attempts to perform an invalid stream-static join?
Choices
- A: userLookup.join(streamingDF, ["userid"], how="inner")
- B: streamingDF.join(userLookup, ["user_id"], how="outer")
- C: streamingDF.join(userLookup, ["user_id"], how="left")
- D: streamingDF.join(userLookup, ["userid"], how="inner")
- E: userLookup.join(streamingDF, ["user_id"], how="right")
answer?
Answer: B Answer_ET: B Community answer B (77%) E (18%) Discussion
Comment 1080923 by Enduresoul
- Upvotes: 11
Selected Answer: B Answer B is correct: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#support-matrix-for-joins-in-streaming-queries
When we look at the supported join matrix for static and stream inputs, we can see that a stream-static outer join is not supported. Answer E is wrong because the static-stream right join is supported.
Comment 1339443 by lene
- Upvotes: 1
Selected Answer: B According to the support matrix for joins in streaming queries.
Comment 1331033 by AlejandroU
- Upvotes: 1
Selected Answer: B Answer B. We can directly discard options C and D since the streaming DataFrame (streamingDF) must be the left table in the join and the join type must be inner join or left outer join. Thus, the most directly invalid code block due to an unsupported join type is B.
Comment 1222431 by imatheushenrique
- Upvotes: 2
B. Consider matching all records from a static DataFrame on the left with a streaming DataFrame on the right. If a record from the static DF (left) does not match the stream DF (right), the system cannot emit null, because data keeps arriving on the stream DF (right) and we cannot guarantee whether a matching record will appear later. That is why a full outer join is not supported.
Comment 1173210 by hal2401me
- Upvotes: 4
Selected Answer: E In my exam today, B, C, and D were removed. I chose E because I recall that stream-static right joins are less well supported.
Comment 1160250 by Curious76
- Upvotes: 1
Selected Answer: B B is correct.
Comment 1141655 by vctrhugo
- Upvotes: 1
Selected Answer: B Specifically, outer joins are not supported with a static DataFrame on the right and a streaming DataFrame on the left. This is because it’s not possible to guarantee all necessary rows will be available in the streaming DataFrame for every micro-batch.
Comment 1120773 by kz_data
- Upvotes: 1
Selected Answer: D I think the correct answer is D.
Comment 1114416 by lexaneon
- Upvotes: 2
Selected Answer: B I believe B is correct.
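The support matrix the commenters cite can be encoded as a small lookup table, which makes it easy to check each option. This is a hedged sketch distilled from the Spark Structured Streaming programming guide's join matrix, not Spark code; note that in Spark, `how="outer"` means FULL OUTER.

```python
# Hedged sketch: which stream-static join types Spark Structured
# Streaming supports, keyed by (left input, right input). Distilled
# from the programming guide's support matrix.
SUPPORTED = {
    ("stream", "static"): {"inner", "left", "left_semi"},
    ("static", "stream"): {"inner", "right"},
}


def is_valid(left, right, how):
    """Return True if a join of this shape and type is supported."""
    return how in SUPPORTED.get((left, right), set())


# Option B: streamingDF.join(userLookup, ..., how="outer")
# -> stream-static FULL OUTER, unsupported, so B is the invalid join:
assert not is_valid("stream", "static", "outer")

# Option E: userLookup.join(streamingDF, ..., how="right")
# -> static-stream right outer, which IS supported:
assert is_valid("static", "stream", "right")
```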
Question 659pbR3NuntI8jfmEI8H
Question
Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators.
Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?
Choices
- A: Query’s detail screen and Job’s detail screen
- B: Stage’s detail screen and Executor’s log files
- C: Driver’s and Executor’s log files
- D: Executor’s detail screen and Executor’s log files
- E: Stage’s detail screen and Query’s detail screen
answer?
Answer: B Answer_ET: B Community answer B (93%) Discussion
Comment 1071383 by 60ties
- Upvotes: 7
Selected Answer: B B is correct
Comment 1141651 by vctrhugo
- Upvotes: 5
Selected Answer: B In the Spark UI, the Stage’s detail screen provides key metrics about each stage of a job, including the amount of data that has been spilled to disk. If you see a high number in the “Spill (Memory)” or “Spill (Disk)” columns, it’s an indication that a partition is spilling to disk.
The Executor’s log files can also provide valuable information about spill. If a task is spilling a lot of data, you’ll see messages in the logs like “Spilling UnsafeExternalSorter to disk” or “Task memory spill”. These messages indicate that the task ran out of memory and had to spill data to disk.
Comment 1294050 by dd1192d
- Upvotes: 1
Selected Answer: B
Comment 1101194 by jin1991
- Upvotes: 1
Selected Answer: E E is correct