Questions and Answers
Question qlOV5Egbf5GobqM3iS2Q
Question
The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.
A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.
Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to use these credentials?
Choices
- A: No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.
- B: “Read” permissions should be set on a secret key mapped to those credentials that will be used by a given team.
- C: “Read” permissions should be set on a secret scope containing only those credentials that will be used by a given team.
- D: “Manage” permissions should be set on a secret scope containing only those credentials that will be used by a given team.
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1229345 by hpkr
- Upvotes: 3
Selected Answer: C C is correct. Read permission on secret scope should work here.
Comment 1222914 by Freyr
- Upvotes: 3
Selected Answer: C Correct Answer: C This option is the best practice for managing access to sensitive data. By creating a secret scope dedicated to each team and setting “Read” permissions on the scope, you ensure that only the intended team members can access their respective credentials. This method aligns with security best practices by tightly controlling access based on group membership and reducing the risk of unauthorized access.
Comment 1221175 by MDWPartners
- Upvotes: 2
Selected Answer: C Seems C
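The answer above (a dedicated secret scope per team, with "Read" ACLs) can be sketched as a REST call. This is a hedged illustration: the endpoint path and field names are taken from the Secrets API as I understand it, and the scope and group names are hypothetical — verify against your workspace's API reference before use.

```python
# Sketch: grant a team group READ on a secret scope holding only that team's
# credentials. Only the request body is built here (nothing is sent); the
# endpoint is assumed to be POST /api/2.0/secrets/acls/put.
import json

def build_put_acl_request(scope: str, principal: str, permission: str = "READ") -> dict:
    """Build the JSON body for a secrets ACL put call.

    `principal` is a Databricks group name; READ is the minimum permission
    that lets members retrieve secrets with dbutils.secrets.get().
    """
    assert permission in {"READ", "WRITE", "MANAGE"}
    return {"scope": scope, "principal": principal, "permission": permission}

# Hypothetical example: the 'finance' group gets READ on its own scope only.
body = build_put_acl_request("finance-db-creds", "finance", "READ")
print(json.dumps(body))
```

Because the ACL is set at scope level, keeping one scope per team (rather than one shared scope) is what enforces least privilege here.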
Question fQWGrzJTwOcpJ8q6tSlP
Question
What is the retention of job run history?
Choices
- A: It is retained until you export or delete job run logs
- B: It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3
- C: It is retained for 60 days, during which you can export notebook run results to HTML
- D: It is retained for 60 days, after which logs are archived
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1230652 by Elotrovert
- Upvotes: 3
Selected Answer: C For most Databricks workspaces, the job run history is retained for 60 days.
Comment 1229346 by hpkr
- Upvotes: 2
Selected Answer: C C is correct
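Since run history expires after 60 days, teams that need notebook results longer must export them within the window. A minimal sketch follows; the endpoint path (`/api/2.0/jobs/runs/export`), the `views_to_export` parameter, and the run ID are assumptions based on the Jobs API and should be checked against the current API reference.

```python
# Sketch: assemble the request for exporting a notebook run's HTML output
# before the 60-day retention window lapses. Nothing is sent here.

def build_runs_export_request(run_id: int, views: str = "ALL") -> tuple:
    """Return (path, query params) for a jobs runs-export GET call."""
    assert views in {"CODE", "DASHBOARDS", "ALL"}
    return "/api/2.0/jobs/runs/export", {"run_id": run_id, "views_to_export": views}

path, params = build_runs_export_request(12345)  # 12345 is a hypothetical run ID
print(path, params)
```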
Question ZGMzTysKhRCXh2U2tBZ0
Question
A data engineer, User A, has promoted a new pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens.
Which statement describes the contents of the workspace audit logs concerning these events?
Choices
- A: Because the REST API was used for job creation and triggering runs, a Service Principal will be automatically used to identify these events.
- B: Because User A created the jobs, their identity will be associated with both the job creation events and the job run events.
- C: Because these events are managed separately, User A will have their identity associated with the job creation events and User B will have their identity associated with the job run events.
- D: Because the REST API was used for job creation and triggering runs, user identity will not be captured in the audit logs.
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1222416 by imatheushenrique
- Upvotes: 5
C, because the users each have their own personal access tokens.
Comment 1329185 by UrcoIbz
- Upvotes: 1
Selected Answer: C There is some ambiguity in this question, as it is not explained how the orchestration tool works.
If the tool just calls the job (run-now), the audit event tracks the user who triggers an on-demand job run.
But there are other possibilities, such as modifying a job configuration and setting a schedule trigger. In that situation, if the 'run as' field is not populated and the job is triggered automatically, the user who triggered the run is not tracked; the run is attributed to the job's owner instead.
I'm assuming the orchestration tool is doing a run-now, so option C should be the right option.
https://docs.databricks.com/api/azure/workspace/jobs/getrun https://docs.databricks.com/en/admin/account-settings/audit-logs.html#jobs
Comment 1302797 by Jugiboss
- Upvotes: 1
Selected Answer: C C is correct, same question earlier.
Comment 1300611 by m79590530
- Upvotes: 1
Selected Answer: C A job has a creator_user_name property as well as a run_as property, and the two can hold different identities.
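The key point in answer C is that REST API calls are attributed to the owner of the personal access token used, so job-creation events name User A while run-now events name User B. The sketch below only assembles the requests (nothing is sent); the workspace host, token values, and job settings are hypothetical, and the endpoint paths follow the Jobs API as I understand it.

```python
# Sketch: both users call the Jobs API, but each request carries that user's
# personal access token, so audit log entries attribute each action to the
# token's owner.

def build_jobs_api_call(endpoint: str, token: str, payload: dict) -> dict:
    """Assemble an authenticated Jobs API request (not sent here)."""
    return {
        "url": f"https://<workspace-host>{endpoint}",
        "headers": {"Authorization": f"Bearer {token}"},  # identity travels with the token
        "json": payload,
    }

# User A creates the jobs: jobs 'create' audit events record User A.
create = build_jobs_api_call("/api/2.1/jobs/create", "USER_A_PAT", {"name": "cdc-pipeline"})
# User B's orchestrator triggers runs: 'runNow' audit events record User B.
run = build_jobs_api_call("/api/2.1/jobs/run-now", "USER_B_PAT", {"job_id": 42})
```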
Question qKh9sW3dFmBGiZk1bIpp
Question
A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Stream job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. Recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB. Which of the following likely explains these smaller file sizes?
Choices
- A: Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations
- B: Z-order indices calculated on the table are preventing file compaction
- C: Bloom filter indices calculated on the table are preventing file compaction
- D: Databricks has autotuned to a smaller target file size based on the overall size of data in the table
- E: Databricks has autotuned to a smaller target file size based on the amount of data in each partition
answer?
Answer: A Answer_ET: A Community answer A (80%) E (20%) Discussion
Comment 999084 by cotardo2077
- Upvotes: 13
Selected Answer: A https://docs.databricks.com/en/delta/tune-file-size.html#autotune-table ‘Autotune file size based on workload’
Comment 1558229 by kishanu
- Upvotes: 1
Selected Answer: E Databricks Auto Optimize and Auto Compaction are designed to tune file sizes dynamically for better performance and efficiency in Delta Lake. These features do not use a fixed target file size like 1 GB; instead they autotune file sizes based on partition-level characteristics.
In this case, each partition has at least 1 GB of data and the overall table is large (10+ TB), yet you see many small files under 64 MB, which seems suboptimal at first. However, Databricks may intentionally use smaller file sizes within partitions when:
- the data change rate is high (as in a streaming CDC feed);
- smaller files allow faster reads, reduced shuffle, and quicker MERGE operations during structured streaming;
- the amount of new data per batch or micro-batch is small, leading to many smaller files, especially when auto compaction determines this improves job performance at runtime.
This makes option E the most accurate description of what's happening.
Comment 1410846 by AlHerd
- Upvotes: 1
Selected Answer: A An always-on Structured Streaming job that applies updates from a Change Data Capture (CDC) feed uses frequent MERGE operations to apply changes (inserts, updates, deletes) to the Delta table.
Because these MERGE operations are constant and high-frequency, Databricks may autotune to a smaller target file size to reduce the duration and overhead of each merge. This behaviour is described explicitly in the documentation.
So, with this in view, the correct answer is A
Comment 1410279 by EZZALDIN
- Upvotes: 1
Selected Answer: E The primary goal of Auto Optimize and Auto Compaction in a streaming job isn’t specifically to reduce MERGE duration. Instead, these features adjust file sizes based on the incremental volume of data being ingested in each micro‐batch within a partition. Even though each partition contains around 1 GB of data (from the original OPTIMIZE), the streaming job writes small batches that are compacted into smaller files (often under 64 MB) because that’s the amount of new data per batch.
So, Option E is more accurate: Databricks auto-tunes the target file size based on the amount of data in each partition (from each micro-batch), not specifically to speed up MERGE operations.
Comment 1362384 by Tedet
- Upvotes: 3
Selected Answer: E Option E is more accurate because Delta Lake’s Auto Optimize and Auto Compaction are designed to adjust file sizes based on the streaming data partitioning, which inherently leads to smaller files over time. The system auto-tunes file sizes as new, incremental data is ingested and partitioned. Option A is plausible, but optimizing file sizes for MERGE operations is not the core focus of Auto Optimize in this case. The system’s auto-tuning mechanism is more about managing file sizes based on the streaming data’s partition size and maintaining efficient reads/writes, rather than directly optimizing for MERGE performance.
Comment 1362381 by Tedet
- Upvotes: 1
Selected Answer: A The delta.autoCompact options behave as follows:
- auto (recommended): tunes the target file size while respecting other autotuning functionality. Requires Databricks Runtime 10.4 LTS or above.
- legacy: alias for true. Requires Databricks Runtime 10.4 LTS or above.
- true: uses 128 MB as the target file size; no dynamic sizing.
- false: turns off auto compaction. Can be set at the session level to override auto compaction for all Delta tables modified in the workload.
Comment 1357039 by rollno1
- Upvotes: 1
Selected Answer: E MERGE operations are not the main update mechanism in this scenario—it’s an incremental stream update, not batch MERGE. Larger partitions often result in smaller file sizes because: Frequent incremental writes cause small batch updates. Compaction happens at the partition level, not globally.
Comment 1261408 by Melik3
- Upvotes: 4
Selected Answer: A It is important here to understand the difference between the partition size and the data files. The partition size is 1 GB, which is caused by OPTIMIZE and is expected. Each partition contains data files; Databricks autotuned these data files to a smaller size to be able to run MERGE statements efficiently. That's why A is the correct answer.
Comment 1222463 by imatheushenrique
- Upvotes: 1
One of the purposes of an OPTIMIZE execution is the gain in MERGE operations, so: A. Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations
Comment 1146010 by RiktRikt007
- Upvotes: 2
How is A correct? While Databricks does have autotuning capabilities, it primarily considers the table size. In this case, the table is over 10 TB, which would typically lead to a target file size of 1 GB, not under 64 MB.
Comment 1145242 by PrashantTiwari
- Upvotes: 2
The target file size is based on the current size of the Delta table. For tables smaller than 2.56 TB, the autotuned target file size is 256 MB. For tables with a size between 2.56 TB and 10 TB, the target size will grow linearly from 256 MB to 1 GB. For tables larger than 10 TB, the target file size is 1 GB. Correct answer is A
Comment 1128123 by AziLa
- Upvotes: 1
correct ans is A
Comment 1121944 by Jay_98_11
- Upvotes: 2
Selected Answer: A A is correct
Comment 1118598 by kz_data
- Upvotes: 1
Selected Answer: A correct answer is A
Comment 1060812 by BIKRAM063
- Upvotes: 1
Selected Answer: A Auto Optimize reduces file sizes to less than 128 MB to facilitate quick merges
Comment 1049739 by sen411
- Upvotes: 1
E is the right answer, because the question asks why there are small files
Comment 1040308 by sturcu
- Upvotes: 1
Selected Answer: A Correct
Comment 1003404 by azurearch
- Upvotes: 1
A is correct answer
Comment 993911 by Eertyy
- Upvotes: 4
E is right answer
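The size-based autotuning rule quoted in the discussion (256 MB below 2.56 TB, growing linearly to 1 GB at 10 TB) can be worked through with simple arithmetic. This is purely illustrative, not a Databricks API: it shows why a 10+ TB table would target 1 GB files under size-based tuning, so the observed sub-64 MB files must come from workload-based (MERGE-oriented) autotuning instead, supporting answer A.

```python
# Illustrative sketch of the table-size-based target file size rule cited
# above: 256 MB for tables under 2.56 TB, linear growth to 1024 MB at 10 TB,
# and 1024 MB beyond that.

def size_based_target_file_mb(table_tb: float) -> float:
    if table_tb < 2.56:
        return 256.0
    if table_tb >= 10.0:
        return 1024.0
    # linear interpolation between (2.56 TB, 256 MB) and (10 TB, 1024 MB)
    frac = (table_tb - 2.56) / (10.0 - 2.56)
    return 256.0 + frac * (1024.0 - 256.0)

print(size_based_target_file_mb(10.5))  # a 10+ TB table targets 1024.0 MB, far above 64 MB
```

Since the table in the question is over 10 TB, size-based tuning alone cannot explain files under 64 MB; only the MERGE-workload autotuning in option A does.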
Question 4S9uSaSZHs3IdoxyIszC
Question
A distributed team of data analysts share computing resources on an interactive cluster with autoscaling configured. In order to better manage costs and query throughput, the workspace administrator is hoping to evaluate whether cluster upscaling is caused by many concurrent users or resource-intensive queries.
In which location can one review the timeline for cluster resizing events?
Choices
- A: Workspace audit logs
- B: Driver’s log file
- C: Ganglia
- D: Cluster Event Log
answer?
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1300612 by m79590530
- Upvotes: 1
Selected Answer: D Cluster lifecycle events are visible in the Cluster Event Log
Comment 1222413 by imatheushenrique
- Upvotes: 3
It's possible to see compute metrics with Ganglia, but the question asks about a timeline, so D, Cluster Event Log, seems correct.
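The same resize timeline shown in the Cluster Event Log UI can also be queried programmatically. A hedged sketch follows: the endpoint path (`/api/2.0/clusters/events`), the event-type names, and the cluster ID are assumptions based on the Clusters API and should be checked against the current API reference.

```python
# Sketch: build the request body for pulling a cluster's resize timeline via
# the Clusters API events endpoint. Nothing is sent here.

def build_cluster_events_payload(cluster_id: str, event_types=None, limit: int = 50) -> dict:
    """Build the JSON body for a clusters/events call filtered to resize events."""
    if event_types is None:
        # Resize-related events typically carry old/new worker counts and a cause,
        # which is what distinguishes user-driven from load-driven upscaling.
        event_types = ["RESIZING", "UPSCALING_COMPLETED"]
    return {"cluster_id": cluster_id, "event_types": event_types, "limit": limit}

payload = build_cluster_events_payload("0101-123456-abcdefgh")  # hypothetical cluster ID
print(payload)
```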