Questions and Answers
Question JzgN9me8Lptpl6ISTMsD
Question
Which statement describes Delta Lake Auto Compaction?
Choices
- A: An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB.
- B: Before a Jobs cluster terminates, OPTIMIZE is executed on all tables modified during the most recent job.
- C: Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.
- D: Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.
- E: An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 128 MB.
answer?
Answer: E Answer_ET: E Community answer E (56%), B (31%), other (13%) Discussion
Comment 1076321 by aragorn_brego
- Upvotes: 5
Selected Answer: A Delta Lake’s Auto Compaction feature is designed to improve the efficiency of data storage by reducing the number of small files in a Delta table. After data is written to a Delta table, an asynchronous job can be triggered to evaluate the file sizes. If it determines that there are a significant number of small files, it will automatically run the OPTIMIZE command, which coalesces these small files into larger ones, typically aiming for files around 1 GB in size for optimal performance.
E is incorrect because the statement is similar to A but with an incorrect default file size target.
Comment 1339891 by RandomForest
- Upvotes: 2
Selected Answer: E Delta Lake Auto Compaction is a feature that automatically detects opportunities to optimize small files. When a write operation is completed, an asynchronous job assesses whether the resulting files can be compacted into larger files (the default target size is 128 MB). If compaction is needed, the system executes an OPTIMIZE job in the background to improve file size and query performance.
This feature reduces the overhead of managing small files manually and improves storage and query efficiency. It aligns with Delta Lake’s goal of simplifying and optimizing data lake performance.
Comment 1338070 by mwynn
- Upvotes: 4
Selected Answer: E I think it is E because they are just asking us to generally describe the feature. Here is some info I gleaned from a DB Academy video:
- Compact small files on write with auto-optimize (tries to achieve a file size of 128 MB)
- Auto-Compact launches a new job after execution of the first Spark job (i.e. async), where it will try to compress files closer to 128 MB
Comment 1324338 by Nicks_name
- Upvotes: 1
Selected Answer: E There is a typo in the Databricks documentation about the job being synchronous, but the default size is explicitly mentioned as 128 MB.
Comment 1324217 by carah
- Upvotes: 3
Selected Answer: B Table property: delta.autoOptimize.autoCompact
- B: correct; although https://docs.databricks.com/en/delta/tune-file-size.html#auto-compaction-for-delta-lake-on-databricks does not mention OPTIMIZE, it is the best option
- A, E: wrong; auto compaction runs synchronously
- C: wrong; it describes the table setting delta.autoOptimize.optimizeWrite
- D: wrong; not related to file compaction
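The table properties carah mentions can be set directly on a Delta table. Below is a minimal sketch, assuming a Databricks runtime with an active `spark` session; the table name `sales` is illustrative, and the session-level config names are Databricks-specific:

```python
# Enable auto compaction and optimized writes for one table via the
# delta.autoOptimize.* table properties discussed above ("sales" is a
# hypothetical table name).
spark.sql("""
    ALTER TABLE sales SET TBLPROPERTIES (
        'delta.autoOptimize.autoCompact'   = 'true',
        'delta.autoOptimize.optimizeWrite' = 'true'
    )
""")

# Alternatively, enable the same behavior session-wide for new writes.
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
```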
Comment 1306361 by vish9
- Upvotes: 3
There appears to be a typo in databricks documentation
Comment 1305499 by rrprofessional
- Upvotes: 1
Enable auto compaction. By default it will use 128 MB as the target file size.
Comment 1292179 by akashdesarda
- Upvotes: 3
Selected Answer: B If you go through the docs, one thing is clear: it is not an async job, so we have to eliminate A & E. D is wrong; it has no special job with respect to partitions. Also, the 128 MB file size is a legacy config; the latest one is dynamic. So we are left with B.
Comment 1290017 by pk07
- Upvotes: 2
Selected Answer: E https://docs.databricks.com/en/delta/tune-file-size.html
Comment 1267036 by partha1022
- Upvotes: 2
Selected Answer: B Auto compaction is a synchronous job.
Comment 1252116 by Shailly
- Upvotes: 4
Selected Answer: B A and E are wrong because auto compaction is a synchronous operation!
I vote for B
As per documentation - “Auto compaction occurs after a write to a table has succeeded and runs synchronously on the cluster that has performed the write. Auto compaction only compacts files that haven’t been compacted previously.”
https://docs.delta.io/latest/optimizations-oss.html
Comment 1222474 by imatheushenrique
- Upvotes: 1
E. An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 128 MB. https://community.databricks.com/t5/data-engineering/what-is-the-difference-between-optimize-and-auto-optimize/td-p/21189
Comment 1149837 by ojudz08
- Upvotes: 2
Selected Answer: E E is the answer. Enabling the setting uses 128 MB as the target file size: https://learn.microsoft.com/en-us/azure/databricks/delta/tune-file-size
Comment 1136831 by DAN_H
- Upvotes: 1
Selected Answer: E The default file size is 128 MB in auto compaction.
Comment 1118630 by kz_data
- Upvotes: 1
E is correct, as the default file size is 128 MB in auto compaction, not 1 GB as with the normal OPTIMIZE statement.
Comment 1117283 by IWantCerts
- Upvotes: 1
Selected Answer: E 128 MB is the default.
Comment 1105598 by Yogi05
- Upvotes: 1
The question is more about auto compaction, hence the answer is E, as the default size for auto compaction is 128 MB.
Comment 1086152 by hamzaKhribi
- Upvotes: 1
Selected Answer: E The OPTIMIZE default target file size is 1 GB; however, in this question we are dealing with auto compaction, which when enabled runs OPTIMIZE with a 128 MB file size by default.
Comment 1060821 by BIKRAM063
- Upvotes: 1
Selected Answer: E E is correct. Auto compaction tries to optimize to a file size of 128 MB.
Comment 1040341 by sturcu
- Upvotes: 3
Selected Answer: E E is the best fit, although Databricks says that auto compaction runs synchronously.
Comment 1013356 by Eertyy
- Upvotes: 1
The correct answer is E.
Comment 999180 by cotardo2077
- Upvotes: 4
Selected Answer: E E fits best, but according to the docs it is a synchronous operation: “Auto compaction occurs after a write to a table has succeeded and runs synchronously on the cluster that has performed the write. Auto compaction only compacts files that haven’t been compacted previously.”
Comment 988234 by taif12340
- Upvotes: 3
Correct answer is E: Auto optimize consists of 2 complementary operations:
- Optimized writes: with this feature enabled, Databricks attempts to write out 128 MB files for each table partition.
- Auto compaction: this will check after an individual write, if files can further be compacted. If yes, it runs an OPTIMIZE job with 128 MB file sizes (instead of the 1 GB file size used in the standard OPTIMIZE)
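The 1 GB vs 128 MB distinction taif12340 describes corresponds to two separate target-size settings. Below is a minimal sketch for illustration, assuming a Databricks runtime with an active `spark` session; these `spark.databricks.delta.*` config names are Databricks-specific and may vary by runtime version:

```python
# Manual OPTIMIZE coalesces toward ~1 GB files by default; the target
# can be overridden (value in bytes).
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 1024 * 1024 * 1024)

# Auto compaction targets ~128 MB files by default.
spark.conf.set("spark.databricks.delta.autoCompact.maxFileSize", 128 * 1024 * 1024)
```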
Comment 983613 by BrianNguyen95
- Upvotes: 1
The correct answer is A.
Comment 973658 by 8605246
- Upvotes: 4
The correct answer is E; auto-compaction runs an asynchronous job to combine small files toward a default of 128 MB. https://learn.microsoft.com/en-us/azure/databricks/delta/tune-file-size
Question 5O19TS9IYkdPvmKA2Tq2
Question
Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?
Choices
- A: In the Executor’s log file, by grepping for “predicate push-down”
- B: In the Stage’s Detail screen, in the Completed Stages table, by noting the size of data read from the Input column
- C: In the Query Detail screen, by interpreting the Physical Plan
- D: In the Delta Lake transaction log. by noting the column statistics
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1387352 by lakime
- Upvotes: 1
Selected Answer: C Option C is correct because you can diagnose performance issues related to predicate push-down by examining the Physical Plan in the Query Detail screen of the Spark UI. Predicate push-down is a feature where filtering conditions (predicates) are applied as early as possible, typically in the data scan operation, to reduce the amount of data being read. If predicate push-down is not happening, the physical plan will show that the filter operation is being applied after the data is read, leading to inefficient queries.
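To see this in practice outside the Spark UI, the same physical plan can be printed from code. A minimal sketch, assuming an active `spark` session; the path and column name are illustrative:

```python
# Read a (hypothetical) Parquet dataset and apply a filter.
df = spark.read.parquet("/data/events")
filtered = df.filter(df.event_type == "click")

# In the formatted physical plan, a working push-down shows the
# predicate in the scan node's PushedFilters list; an empty list with a
# separate Filter node above the scan means the predicate was not
# pushed down and far more data is read than necessary.
filtered.explain(mode="formatted")
```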
Comment 1323666 by benni_ale
- Upvotes: 1
Selected Answer: C Look at the Physical Plan
Question 2s4qdbmVnfpoSUcYFnuw
Question
A data engineer needs to capture pipeline settings from an existing pipeline in the workspace, and use them to create and version a JSON file to create a new pipeline.
Which command should the data engineer enter in a web terminal configured with the Databricks CLI?
Choices
- A: Use list pipelines to get the specs for all pipelines; get the pipeline spec from the returned results; parse and use this to create a pipeline
- B: Stop the existing pipeline; use the returned settings in a reset command
- C: Use the get command to capture the settings for the existing pipeline; remove the pipeline_id and rename the pipeline; use this in a create command
- D: Use the clone command to create a copy of an existing pipeline; use the get JSON command to get the pipeline definition; save this to git
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1323672 by benni_ale
- Upvotes: 2
Selected Answer: C I say C from common logical sense; however, I have not properly tested it… I just don’t see any problems with it.
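Option C can be scripted end to end. Below is a minimal sketch driving the Databricks CLI from Python, assuming the newer unified CLI is installed and authenticated; the pipeline ID, new name, and exact flags are illustrative and may differ by CLI version:

```python
import json
import subprocess

# Capture the settings of the existing pipeline ("1234-abcd" is a
# placeholder pipeline ID).
out = subprocess.run(
    ["databricks", "pipelines", "get", "1234-abcd", "--output", "json"],
    capture_output=True, text=True, check=True,
)
pipeline = json.loads(out.stdout)
settings = pipeline.get("spec", pipeline)

# Remove the server-assigned ID and rename before re-creating.
settings.pop("pipeline_id", None)
settings["name"] = "my-new-pipeline"

# Write the JSON file (which can then be versioned in git) and use it
# to create the new pipeline.
with open("new_pipeline.json", "w") as f:
    json.dump(settings, f, indent=2)

subprocess.run(
    ["databricks", "pipelines", "create", "--json", "@new_pipeline.json"],
    check=True,
)
```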
Question orB34JO9sP88f3dof2O1
Question
Which REST API call can be used to review the notebooks configured to run as tasks in a multi-task job?
Choices
- A: /jobs/runs/list
- B: /jobs/list
- C: /jobs/runs/get
- D: /jobs/get
answer?
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1322804 by temple1305
- Upvotes: 1
Selected Answer: D multi-task: /jobs/get; single-task: /jobs/runs/get
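A minimal sketch of that call, assuming Jobs API 2.1 and the `requests` library; the host, token, and job ID are placeholders:

```python
import requests

host = "https://<workspace-host>"   # placeholder
token = "<personal-access-token>"   # placeholder

resp = requests.get(
    f"{host}/api/2.1/jobs/get",
    headers={"Authorization": f"Bearer {token}"},
    params={"job_id": 123},         # placeholder job ID
)
resp.raise_for_status()

# Each task of a multi-task job may carry a notebook_task block with
# the path of the notebook it runs.
for task in resp.json()["settings"]["tasks"]:
    notebook = task.get("notebook_task", {}).get("notebook_path")
    print(task["task_key"], notebook)
```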
Question QLHg2dUpagOBFEL4qtbz
Question
A Data Engineer wants to run unit tests using common Python testing frameworks on Python functions defined across several Databricks notebooks currently used in production.
How can the data engineer run unit tests against functions that work with data in production?
Choices
- A: Define and import unit test functions from a separate Databricks notebook
- B: Define and unit test functions using Files in Repos
- C: Run unit tests against non-production data that closely mirrors production
- D: Define unit tests and functions within the same notebook
answer?
Answer: B Answer_ET: B Community answer B (75%) C (25%) Discussion
Comment 1335585 by arekm
- Upvotes: 1
Selected Answer: B B - to test functions we need to import them in our unit tests. This means that storing functions in notebooks is not a good idea. You store them separately - as “files” - and import them in notebooks the same way you import them in unit tests.
Comment 1324012 by Thameur01
- Upvotes: 2
Selected Answer: B Databricks Repos is the recommended way to organize and manage code, including functions and unit tests, in a scalable and maintainable way. By defining your functions and unit tests in Files in Repos, you can:
- Modularize your code: functions can be organized into separate Python files or modules, making them reusable and easier to test.
- Use standard testing frameworks: frameworks like pytest or unittest can be used to write and execute unit tests against these functions.
- Integrate with version control: Files in Repos can be version-controlled using Git, ensuring traceability and collaboration.
- Test production-like data: with proper safeguards, you can design unit tests to test production-like data while maintaining modularity and separation from production pipelines.
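As a concrete illustration of the Files-in-Repos approach, here is a minimal sketch; the file names, function, and test are all hypothetical:

```python
# transforms.py -- a plain Python file in the repo, importable both from
# notebooks and from test modules (hypothetical example):
def add_tax(amount: float, rate: float = 0.2) -> float:
    """Return the amount with tax applied."""
    return amount * (1 + rate)


# test_transforms.py -- a standard pytest module in the same repo,
# runnable with `pytest` without touching production data:
from transforms import add_tax

def test_add_tax_default_rate():
    assert add_tax(100.0) == 120.0
```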
Comment 1323674 by benni_ale
- Upvotes: 1
Selected Answer: C It is true that for Python and R notebooks, Databricks recommends storing functions and their unit tests outside of notebooks, so given that the question is in a Python environment, some could argue B is the correct solution. Nevertheless, that is only advice, and Databricks more generally states that it is a best practice not to run unit tests against functions that work with data in production. This is especially important for functions that add, remove, or otherwise change data. To protect your production data from being compromised by your unit tests in unexpected ways, you should run unit tests against non-production data. So I would go for option C; alternatively, I would have gone for B. https://docs.databricks.com/en/notebooks/testing.html?utm_source=chatgpt.com#write-unit-tests