Questions and Answers
Question slSue2y8VE49lfg0pryf
Question
Which statement regarding Spark configuration on the Databricks platform is true?
Choices
- A: The Databricks REST API can be used to modify the Spark configuration properties for an interactive cluster without interrupting jobs currently running on the cluster.
- B: Spark configurations set within a notebook will affect all SparkSessions attached to the same interactive cluster.
- C: Spark configuration properties can only be set for an interactive cluster by creating a global init script.
- D: Spark configuration properties set for an interactive cluster with the Clusters UI will impact all notebooks attached to that cluster.
- E: When the same Spark configuration property is set for an interactive cluster and a notebook attached to that cluster, the notebook setting will always be ignored.
answer?
Answer: D Answer_ET: D Community answer D (90%) Discussion
Comment 1087767 by hamzaKhribi
- Upvotes: 8
Selected Answer: D I tried it myself: setting a Spark conf in the cluster UI impacts all notebooks attached to that cluster. For example, I set the number of shuffle partitions to 4, and in every notebook, when I inspect the number of partitions, I find 4.
Comment 1335237 by Sriramiyer92
- Upvotes: 2
Selected Answer: D D is correct. Point to note on precedence: Spark configurations set programmatically in a notebook or script take precedence over cluster-wide configurations set in the Clusters UI.
Comment 1321796 by benni_ale
- Upvotes: 1
Selected Answer: D Changing the configuration via either the API or the UI triggers a restart of the cluster, so A is wrong. D is correct, given that "impact" does not mean the cluster can't be restarted.
Comment 1160230 by Curious76
- Upvotes: 1
Selected Answer: B A. Incorrect: Modifying configurations through the Databricks REST API while jobs are running can lead to unexpected behavior or disruption. It’s generally not recommended. C. Incorrect: While global init scripts can be used, it’s not the only way. Configurations can also be set within notebooks. D. Incorrect: Configurations set through the Clusters UI apply to the entire cluster, but they might not necessarily override configurations set within notebooks attached to the cluster. E. Incorrect: Notebook configurations can take precedence over cluster-level configurations for the same property, offering finer-grained control at the notebook level.
Comment 1141664 by vctrhugo
- Upvotes: 3
Selected Answer: D These settings are applied at the cluster level and affect all SparkSessions on the cluster.
Comment 1100388 by alexvno
- Upvotes: 4
Selected Answer: D A is wrong because the cluster will restart → D
Comment 1085492 by petrv
- Upvotes: 1
Selected Answer: A In Databricks, you can use the Databricks REST API to modify Spark configuration properties for an interactive cluster without interrupting currently running jobs. This allows you to dynamically adjust Spark configurations to optimize performance or meet specific requirements without the need to restart the cluster.
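The precedence behavior behind answer D (and against answer E) can be sketched in plain Python. This is a hedged simulation with dicts, not Databricks code: the config key is real, but the layering below only illustrates the behavior described in the discussion.

```python
# Hedged sketch: Spark config precedence on an interactive cluster,
# simulated with plain dicts. Cluster-UI values form a base layer that
# every attached notebook sees; a notebook's spark.conf.set(...) calls
# override it for that notebook's SparkSession only.

cluster_ui_conf = {"spark.sql.shuffle.partitions": "4"}  # set in the Clusters UI
notebook_conf = {}  # populated by spark.conf.set(...) calls in one notebook


def effective(key):
    """Notebook-level settings take precedence over cluster-wide ones."""
    return notebook_conf.get(key, cluster_ui_conf.get(key))


# Every notebook attached to the cluster initially sees the UI value (answer D):
assert effective("spark.sql.shuffle.partitions") == "4"

# A notebook can override the value for its own session, so the notebook
# setting is NOT always ignored (why answer E is wrong):
notebook_conf["spark.sql.shuffle.partitions"] = "8"
assert effective("spark.sql.shuffle.partitions") == "8"
```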
Question tV7c3Y99bBZympTgKibZ
Question
A developer has successfully configured their credentials for Databricks Repos and cloned a remote Git repository. They do not have privileges to make changes to the main branch, which is the only branch currently visible in their workspace.
Which approach allows this user to share their code updates without the risk of overwriting the work of their teammates?
Choices
- A: Use Repos to checkout all changes and send the git diff log to the team.
- B: Use Repos to create a fork of the remote repository, commit all changes, and make a pull request on the source repository.
- C: Use Repos to pull changes from the remote Git repository; commit and push changes to a branch that appeared as changes were pulled.
- D: Use Repos to merge all differences and make a pull request back to the remote repository.
- E: Use Repos to create a new branch, commit all changes, and push changes to the remote Git repository.
answer?
Answer: E Answer_ET: E Community answer E (63%) B (37%) Discussion
Comment 1330810 by AlejandroU
- Upvotes: 1
Selected Answer: E Answer E. Creating a new branch within the repository is the most effective and recommended approach for developers who lack write access to the main branch in Databricks Repos. While forking is a valid strategy in some situations, it’s generally less common and potentially more complex within the context of Databricks Repos, especially when the primary goal is to contribute to the existing shared repository.
Comment 1301422 by benni_ale
- Upvotes: 1
Selected Answer: E I changed my mind; I think it is E.
Comment 1299655 by benni_ale
- Upvotes: 1
Selected Answer: B Explanation:
The developer does not have privileges to make changes to the main branch of the remote repository, and it is the only branch visible in their workspace. To share code updates without risking overwriting teammates' work, the best approach is to create a personal copy (fork) of the remote repository. The forked repository lives under the developer's own account or workspace, giving them full control over it.
The developer commits all code updates to the fork. Since they have full privileges on their fork, they can create branches, commit changes, and manage the repository as needed.
Option E: creating a new branch and pushing changes to the remote repository requires write access to that repository, which the developer does not have.
By forking the repository, the developer avoids any permission issues and ensures that their work does not interfere with the main codebase until it is reviewed and approved by the team.
Comment 1166272 by hal2401me
- Upvotes: 4
Selected Answer: E E is the regular collaboration approach. B makes a fork, which breaks away from the collaborating teammates; there is no way they can make a pull request on the source repository after making changes to a fork.
Comment 1160498 by Curious76
- Upvotes: 2
Selected Answer: B Databricks Repos itself does not currently support creating forks directly within the platform. However, you can achieve a similar workflow using the following steps:
Use the git clone command in a terminal or IDE to create a local copy of the remote repository. This effectively creates a local fork. Make your changes in the local copy. Use git push to push your changes to a new remote repository you create on a Git hosting service like GitHub. Create a pull request from your new remote repository to the original repository on Databricks Repos.
This approach allows you to make changes to your own copy of the code, collaborate with others through code reviews, and propose your changes for integration into the main branch without directly modifying it.
Comment 1141661 by vctrhugo
- Upvotes: 3
Selected Answer: E This is a common workflow in collaborative development environments. The developer can create a new branch in their local repository, make changes, and then push the branch to the remote repository. This way, they can share their updates without modifying the main branch directly. After pushing the changes, they can create a pull request on the remote repository, allowing their teammates to review the changes before merging them into the main branch. This process ensures that the main branch remains stable and that all changes are reviewed and approved before they’re incorporated. It also prevents any accidental overwrites of teammates’ work.
Comment 1138381 by PrincipalJoe
- Upvotes: 3
Databricks Repos cannot be used to fork a repository
Comment 1137369 by adenis
- Upvotes: 3
Selected Answer: E E is correct
Comment 1136242 by Rinscy
- Upvotes: 2
E is correct. If you create a fork, you create another repository, so not B.
Comment 1134620 by Crocjun
- Upvotes: 2
Selected Answer: B B is correct
Comment 1131768 by spaceexplorer
- Upvotes: 2
Selected Answer: B B is correct
Comment 1111275 by dmov
- Upvotes: 1
Isn’t this B?
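The branch-based workflow from answer E can be sketched with plain git commands, driven here from Python against a throwaway local repository. Databricks Repos performs the same create-branch / commit / push steps through its UI; the repository, branch, and file names below are illustrative, and pushing plus opening the pull request is left as a comment since there is no real remote here.

```python
# Hedged sketch: share updates on a new branch instead of committing to
# the protected main branch (answer E). Runs against a temporary local
# repo; names are illustrative.
import pathlib
import subprocess
import tempfile

repo = tempfile.mkdtemp()


def git(*args):
    """Run a git command in the demo repo and return its stdout."""
    cmd = ("git", "-c", "user.email=dev@example.com", "-c", "user.name=dev") + args
    return subprocess.run(
        cmd, cwd=repo, capture_output=True, text=True, check=True
    ).stdout.strip()


git("init", "-q", "-b", "main", ".")
git("commit", "-q", "--allow-empty", "-m", "initial commit on main")

# Instead of committing to main (no write access), create a new branch:
git("checkout", "-q", "-b", "feature/my-update")
pathlib.Path(repo, "etl.py").write_text("print('updated logic')\n")
git("add", "etl.py")
git("commit", "-q", "-m", "add ETL update")

current = git("branch", "--show-current")
print(current)  # feature/my-update
# Next steps (omitted, no real remote): push the branch, then open a
# pull request so teammates review before merging into main.
```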
Question 7tBseGd9qKWWrnsSJpYV
Question
In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both DEEP and SHALLOW CLONE, development tables are created using SHALLOW CLONE.
A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The transaction logs for the source tables show that VACUUM was run the day before.
Which statement describes why the cloned tables are no longer working?
Choices
- A: Because Type 1 changes overwrite existing records, Delta Lake cannot guarantee data consistency for cloned tables.
- B: Running VACUUM automatically invalidates any shallow clones of a table; DEEP CLONE should always be used when a cloned table will be repeatedly queried.
- C: Tables created with SHALLOW CLONE are automatically deleted after their default retention threshold of 7 days.
- D: The metadata created by the CLONE operation is referencing data files that were purged as invalid by the VACUUM command.
- E: The data files compacted by VACUUM are not tracked by the cloned metadata; running REFRESH on the cloned table will pull in recent changes.
answer?
Answer: D Answer_ET: D Community answer D (88%) Discussion
Comment 1100398 by alexvno
- Upvotes: 8
Selected Answer: D Shallow clone: only duplicates the metadata of the table being cloned; the data files of the table itself are not copied. These clones are cheaper to create but are not self-contained and depend on the source from which they were cloned as the source of data. If the files in the source that the clone depends on are removed, for example with VACUUM, a shallow clone may become unusable. Therefore, shallow clones are typically used for short-lived use cases such as testing and experimentation.
Comment 1298677 by benni_ale
- Upvotes: 2
I was not sure whether B or D, but I think running the VACUUM command does not automatically invalidate shallow clones; it is just that the data referenced by the clone is no longer present. A shallow clone can still work even after a VACUUM command runs on the source table, so B is not completely correct.
Comment 1141658 by vctrhugo
- Upvotes: 3
Selected Answer: D In Delta Lake, the VACUUM command deletes data files that are no longer referenced by a Delta table and are older than the retention threshold. When a table is cloned using SHALLOW CLONE, the clone references the same data files as the original table but creates a new transaction log. If VACUUM is run on the original table, it can delete data files that are still being referenced by the cloned table’s metadata, causing the cloned table to stop working. This is because the VACUUM command doesn’t know about the cloned table’s references to the data files. Therefore, it’s important to be cautious when running VACUUM on tables that have clones.
Comment 1131771 by spaceexplorer
- Upvotes: 1
Selected Answer: D D is correct
Comment 1075152 by AzureDE2522
- Upvotes: 2
Selected Answer: D Please refer: https://docs.databricks.com/en/delta/clone.html#what-are-the-semantics-of-delta-clone-operations
Comment 1071206 by 60ties
- Upvotes: 2
Selected Answer: B B is best
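The failure mode behind answer D can be illustrated with a plain-Python analogy: a shallow clone copies only metadata (references to the source table's data files), so purging those files breaks the clone. This is a hedged simulation with ordinary files; Delta Lake's real mechanism is its transaction log, and the file names below are illustrative.

```python
# Hedged analogy: SHALLOW CLONE records pointers to the source's data
# files without copying them; VACUUM on the source purges files the
# source no longer references (e.g., files rewritten by Type 1 SCD
# overwrites), leaving the clone's metadata dangling.
import os
import tempfile

src = tempfile.mkdtemp()
data_file = os.path.join(src, "part-0000.parquet")
with open(data_file, "w") as f:
    f.write("rows...")

# SHALLOW CLONE: keep references to the source's files, copy nothing.
shallow_clone_metadata = [data_file]

# Type 1 SCD overwrite rewrites the file; VACUUM then purges the old one:
os.remove(data_file)

# The clone's metadata now references purged files (answer D):
readable = all(os.path.exists(p) for p in shallow_clone_metadata)
print(readable)  # False: the clone can no longer resolve its data files
```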
Question kUZpLaiA7Htn8FA1jRtV
Question
You are performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.
Which code block attempts to perform an invalid stream-static join?
Choices
- A: userLookup.join(streamingDF, ["userid"], how="inner")
- B: streamingDF.join(userLookup, ["user_id"], how="outer")
- C: streamingDF.join(userLookup, ["user_id"], how="left")
- D: streamingDF.join(userLookup, ["userid"], how="inner")
- E: userLookup.join(streamingDF, ["user_id"], how="right")
answer?
Answer: B Answer_ET: B Community answer B (77%) E (18%) Discussion
Comment 1080923 by Enduresoul
- Upvotes: 11
Selected Answer: B Answer B is correct: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#support-matrix-for-joins-in-streaming-queries
When we look at the supported join matrix for static and stream inputs, we can see that a stream-static outer join is not supported. Answer E is wrong because the static-stream right join is supported.
Comment 1339443 by lene
- Upvotes: 1
Selected Answer: B According to the support matrix for joins in streaming queries.
Comment 1331033 by AlejandroU
- Upvotes: 1
Selected Answer: B Answer B. We can directly discard options C and D since the streaming DataFrame (streamingDF) must be the left table in the join and the join type must be inner join or left outer join. Thus, the most directly invalid code block due to an unsupported join type is B.
Comment 1222431 by imatheushenrique
- Upvotes: 2
B. Consider matching all records from a static DataFrame on the left with a streaming DataFrame on the right. If a record from the static DF (left) does not match the stream DF (right), the system cannot emit null, because data keeps arriving on the stream DF (right) and we cannot guarantee whether a matching record will appear later. That is why a full outer join is not supported.
Comment 1173210 by hal2401me
- Upvotes: 4
Selected Answer: E In my exam today, B, C, and D were removed. I chose E because I recall that stream-static right joins are less well supported.
Comment 1160250 by Curious76
- Upvotes: 1
Selected Answer: B B is correct.
Comment 1141655 by vctrhugo
- Upvotes: 1
Selected Answer: B Specifically, outer joins are not supported with a static DataFrame on the right and a streaming DataFrame on the left. This is because it’s not possible to guarantee all necessary rows will be available in the streaming DataFrame for every micro-batch.
Comment 1120773 by kz_data
- Upvotes: 1
Selected Answer: D I think the correct answer is D.
Comment 1114416 by lexaneon
- Upvotes: 2
Selected Answer: B I believe B is correct.
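The support matrix the commenters cite can be encoded as a small lookup table, which makes it easy to check each option. This is a hedged sketch distilled from the Spark Structured Streaming programming guide's join matrix, not Spark code; note that in Spark, `how="outer"` means FULL OUTER.

```python
# Hedged sketch: which stream-static join types Spark Structured
# Streaming supports, keyed by (left input, right input). Distilled
# from the programming guide's support matrix.
SUPPORTED = {
    ("stream", "static"): {"inner", "left", "left_semi"},
    ("static", "stream"): {"inner", "right"},
}


def is_valid(left, right, how):
    """Return True if a join of this shape and type is supported."""
    return how in SUPPORTED.get((left, right), set())


# Option B: streamingDF.join(userLookup, ..., how="outer")
# -> stream-static FULL OUTER, unsupported, so B is the invalid join:
assert not is_valid("stream", "static", "outer")

# Option E: userLookup.join(streamingDF, ..., how="right")
# -> static-stream right outer, which IS supported:
assert is_valid("static", "stream", "right")
```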
Question 659pbR3NuntI8jfmEI8H
Question
Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators.
Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?
Choices
- A: Query’s detail screen and Job’s detail screen
- B: Stage’s detail screen and Executor’s log files
- C: Driver’s and Executor’s log files
- D: Executor’s detail screen and Executor’s log files
- E: Stage’s detail screen and Query’s detail screen
answer?
Answer: B Answer_ET: B Community answer B (93%) Discussion
Comment 1071383 by 60ties
- Upvotes: 7
Selected Answer: B B is correct
Comment 1141651 by vctrhugo
- Upvotes: 5
Selected Answer: B In the Spark UI, the Stage’s detail screen provides key metrics about each stage of a job, including the amount of data that has been spilled to disk. If you see a high number in the “Spill (Memory)” or “Spill (Disk)” columns, it’s an indication that a partition is spilling to disk.
The Executor’s log files can also provide valuable information about spill. If a task is spilling a lot of data, you’ll see messages in the logs like “Spilling UnsafeExternalSorter to disk” or “Task memory spill”. These messages indicate that the task ran out of memory and had to spill data to disk.
Comment 1294050 by dd1192d
- Upvotes: 1
Selected Answer: B
Comment 1101194 by jin1991
- Upvotes: 1
Selected Answer: E E is correct