Questions and Answers
Question JzgN9me8Lptpl6ISTMsD
Question
Which statement describes Delta Lake Auto Compaction?
Choices
- A: An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB.
- B: Before a Jobs cluster terminates, OPTIMIZE is executed on all tables modified during the most recent job.
- C: Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.
- D: Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.
- E: An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 128 MB.
answer?
Answer: E Answer_ET: E Community answer E (56%), B (31%), other (13%) Discussion
Comment 1076321 by aragorn_brego
- Upvotes: 5
Selected Answer: A Delta Lake’s Auto Compaction feature is designed to improve the efficiency of data storage by reducing the number of small files in a Delta table. After data is written to a Delta table, an asynchronous job can be triggered to evaluate the file sizes. If it determines that there are a significant number of small files, it will automatically run the OPTIMIZE command, which coalesces these small files into larger ones, typically aiming for files around 1 GB in size for optimal performance.
E is incorrect because the statement is similar to A but with an incorrect default file size target.
Comment 1339891 by RandomForest
- Upvotes: 2
Selected Answer: E Delta Lake Auto Compaction is a feature that automatically detects opportunities to optimize small files. When a write operation is completed, an asynchronous job assesses whether the resulting files can be compacted into larger files (the default target size is 128 MB). If compaction is needed, the system executes an OPTIMIZE job in the background to improve file size and query performance.
This feature reduces the overhead of managing small files manually and improves storage and query efficiency. It aligns with Delta Lake’s goal of simplifying and optimizing data lake performance.
Comment 1338070 by mwynn
- Upvotes: 4
Selected Answer: E I think it is E because they are just asking us to generally describe the feature. Here is some info I gleaned from a DB Academy video:
- Compact small files on write with auto-optimize (tries to achieve a file size of 128 MB)
- Auto-Compact launches a new job after execution of the first Spark job (i.e. async), where it will try to compress files closer to 128 MB
Comment 1324338 by Nicks_name
- Upvotes: 1
Selected Answer: E There is a typo in the Databricks documentation about the job being synchronous, but the default size is explicitly mentioned as 128 MB.
Comment 1324217 by carah
- Upvotes: 3
Selected Answer: B Table property: delta.autoOptimize.autoCompact
- B: correct; although https://docs.databricks.com/en/delta/tune-file-size.html#auto-compaction-for-delta-lake-on-databricks does not mention OPTIMIZE, it is the best option
- A, E: wrong; auto compaction runs synchronously
- C: wrong; it describes the table setting delta.autoOptimize.optimizeWrite
- D: wrong; not related to file compaction
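The table properties carah mentions can be set directly on a Delta table. Below is a minimal sketch, assuming a Databricks runtime with an active `spark` session; the table name `sales` is illustrative, and the session-level config names are Databricks-specific:

```python
# Enable auto compaction and optimized writes for one table via the
# delta.autoOptimize.* table properties discussed above ("sales" is a
# hypothetical table name).
spark.sql("""
    ALTER TABLE sales SET TBLPROPERTIES (
        'delta.autoOptimize.autoCompact'   = 'true',
        'delta.autoOptimize.optimizeWrite' = 'true'
    )
""")

# Alternatively, enable the same behavior session-wide for new writes.
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
```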
Comment 1306361 by vish9
- Upvotes: 3
There appears to be a typo in databricks documentation
Comment 1305499 by rrprofessional
- Upvotes: 1
Enable auto compaction. By default it will use 128 MB as the target file size.
Comment 1292179 by akashdesarda
- Upvotes: 3
Selected Answer: B If you go through the docs, one thing is clear: it is not an async job, so we have to eliminate A & E. D is wrong; it has no special job with respect to partitions. Also, the 128 MB file size is a legacy config; the latest one is dynamic. So we are left with B.
Comment 1290017 by pk07
- Upvotes: 2
Selected Answer: E https://docs.databricks.com/en/delta/tune-file-size.html
Comment 1267036 by partha1022
- Upvotes: 2
Selected Answer: B Auto compaction is a synchronous job.
Comment 1252116 by Shailly
- Upvotes: 4
Selected Answer: B A and E are wrong because auto compaction is a synchronous operation!
I vote for B
As per documentation - “Auto compaction occurs after a write to a table has succeeded and runs synchronously on the cluster that has performed the write. Auto compaction only compacts files that haven’t been compacted previously.”
https://docs.delta.io/latest/optimizations-oss.html
Comment 1222474 by imatheushenrique
- Upvotes: 1
E. An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 128 MB. https://community.databricks.com/t5/data-engineering/what-is-the-difference-between-optimize-and-auto-optimize/td-p/21189
Comment 1149837 by ojudz08
- Upvotes: 2
Selected Answer: E E is the answer. Enabling the setting uses 128 MB as the target file size: https://learn.microsoft.com/en-us/azure/databricks/delta/tune-file-size
Comment 1136831 by DAN_H
- Upvotes: 1
Selected Answer: E The default file size is 128 MB in auto compaction.
Comment 1118630 by kz_data
- Upvotes: 1
E is correct, as the default file size is 128 MB in auto compaction, not 1 GB as with the normal OPTIMIZE statement.
Comment 1117283 by IWantCerts
- Upvotes: 1
Selected Answer: E 128 MB is the default.
Comment 1105598 by Yogi05
- Upvotes: 1
The question is more about auto compaction, hence the answer is E, as the default size for auto compaction is 128 MB.
Comment 1086152 by hamzaKhribi
- Upvotes: 1
Selected Answer: E The OPTIMIZE default target file size is 1 GB; however, in this question we are dealing with auto compaction, which when enabled runs OPTIMIZE with a 128 MB file size by default.
Comment 1060821 by BIKRAM063
- Upvotes: 1
Selected Answer: E E is correct. Auto compaction tries to optimize to a file size of 128 MB.
Comment 1040341 by sturcu
- Upvotes: 3
Selected Answer: E E is the best fit, although Databricks says that auto compaction runs synchronously.
Comment 1013356 by Eertyy
- Upvotes: 1
The correct answer is E.
Comment 999180 by cotardo2077
- Upvotes: 4
Selected Answer: E E fits best, but according to the docs it is a synchronous operation: “Auto compaction occurs after a write to a table has succeeded and runs synchronously on the cluster that has performed the write. Auto compaction only compacts files that haven’t been compacted previously.”
Comment 988234 by taif12340
- Upvotes: 3
Correct answer is E: Auto optimize consists of 2 complementary operations:
- Optimized writes: with this feature enabled, Databricks attempts to write out 128 MB files for each table partition.
- Auto compaction: this will check after an individual write, if files can further be compacted. If yes, it runs an OPTIMIZE job with 128 MB file sizes (instead of the 1 GB file size used in the standard OPTIMIZE)
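The 1 GB vs 128 MB distinction taif12340 describes corresponds to two separate target-size settings. Below is a minimal sketch for illustration, assuming a Databricks runtime with an active `spark` session; these `spark.databricks.delta.*` config names are Databricks-specific and may vary by runtime version:

```python
# Manual OPTIMIZE coalesces toward ~1 GB files by default; the target
# can be overridden (value in bytes).
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 1024 * 1024 * 1024)

# Auto compaction targets ~128 MB files by default.
spark.conf.set("spark.databricks.delta.autoCompact.maxFileSize", 128 * 1024 * 1024)
```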
Comment 983613 by BrianNguyen95
- Upvotes: 1
The correct answer is A.
Comment 973658 by 8605246
- Upvotes: 4
The correct answer is E; auto-compaction runs an asynchronous job to combine small files toward a default of 128 MB. https://learn.microsoft.com/en-us/azure/databricks/delta/tune-file-size
Question 5O19TS9IYkdPvmKA2Tq2
Question
Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?
Choices
- A: In the Executor’s log file, by grepping for “predicate push-down”
- B: In the Stage’s Detail screen, in the Completed Stages table, by noting the size of data read from the Input column
- C: In the Query Detail screen, by interpreting the Physical Plan
- D: In the Delta Lake transaction log. by noting the column statistics
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1387352 by lakime
- Upvotes: 1
Selected Answer: C Option C is correct because you can diagnose performance issues related to predicate push-down by examining the Physical Plan in the Query Detail screen of the Spark UI. Predicate push-down is a feature where filtering conditions (predicates) are applied as early as possible, typically in the data scan operation, to reduce the amount of data being read. If predicate push-down is not happening, the physical plan will show that the filter operation is being applied after the data is read, leading to inefficient queries.
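To see this in practice outside the Spark UI, the same physical plan can be printed from code. A minimal sketch, assuming an active `spark` session; the path and column name are illustrative:

```python
# Read a (hypothetical) Parquet dataset and apply a filter.
df = spark.read.parquet("/data/events")
filtered = df.filter(df.event_type == "click")

# In the formatted physical plan, a working push-down shows the
# predicate in the scan node's PushedFilters list; an empty list with a
# separate Filter node above the scan means the predicate was not
# pushed down and far more data is read than necessary.
filtered.explain(mode="formatted")
```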
Comment 1323666 by benni_ale
- Upvotes: 1
Selected Answer: C Look at the Physical Plan
Question 2s4qdbmVnfpoSUcYFnuw
Question
A data engineer needs to capture pipeline settings from an existing pipeline in the workspace, and use them to create and version a JSON file to create a new pipeline.
Which command should the data engineer enter in a web terminal configured with the Databricks CLI?
Choices
- A: Use list pipelines to get the specs for all pipelines; get the pipeline spec from the returned results; parse and use this to create a pipeline
- B: Stop the existing pipeline; use the returned settings in a reset command
- C: Use the get command to capture the settings for the existing pipeline; remove the pipeline_id and rename the pipeline; use this in a create command
- D: Use the clone command to create a copy of an existing pipeline; use the get JSON command to get the pipeline definition; save this to git
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1323672 by benni_ale
- Upvotes: 2
Selected Answer: C I say C from common logical sense; however, I have not properly tested it… I just don’t see any problems with it.
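Option C can be scripted end to end. Below is a minimal sketch driving the Databricks CLI from Python, assuming the newer unified CLI is installed and authenticated; the pipeline ID, new name, and exact flags are illustrative and may differ by CLI version:

```python
import json
import subprocess

# Capture the settings of the existing pipeline ("1234-abcd" is a
# placeholder pipeline ID).
out = subprocess.run(
    ["databricks", "pipelines", "get", "1234-abcd", "--output", "json"],
    capture_output=True, text=True, check=True,
)
pipeline = json.loads(out.stdout)
settings = pipeline.get("spec", pipeline)

# Remove the server-assigned ID and rename before re-creating.
settings.pop("pipeline_id", None)
settings["name"] = "my-new-pipeline"

# Write the JSON file (which can then be versioned in git) and use it
# to create the new pipeline.
with open("new_pipeline.json", "w") as f:
    json.dump(settings, f, indent=2)

subprocess.run(
    ["databricks", "pipelines", "create", "--json", "@new_pipeline.json"],
    check=True,
)
```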
Question orB34JO9sP88f3dof2O1
Question
Which REST API call can be used to review the notebooks configured to run as tasks in a multi-task job?
Choices
- A: /jobs/runs/list
- B: /jobs/list
- C: /jobs/runs/get
- D: /jobs/get
answer?
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1322804 by temple1305
- Upvotes: 1
Selected Answer: D multi-task: /jobs/get; single-task: /jobs/runs/get
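A minimal sketch of that call, assuming Jobs API 2.1 and the `requests` library; the host, token, and job ID are placeholders:

```python
import requests

host = "https://<workspace-host>"   # placeholder
token = "<personal-access-token>"   # placeholder

resp = requests.get(
    f"{host}/api/2.1/jobs/get",
    headers={"Authorization": f"Bearer {token}"},
    params={"job_id": 123},         # placeholder job ID
)
resp.raise_for_status()

# Each task of a multi-task job may carry a notebook_task block with
# the path of the notebook it runs.
for task in resp.json()["settings"]["tasks"]:
    notebook = task.get("notebook_task", {}).get("notebook_path")
    print(task["task_key"], notebook)
```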
Question QLHg2dUpagOBFEL4qtbz
Question
A Data Engineer wants to run unit tests using common Python testing frameworks on Python functions defined across several Databricks notebooks currently used in production.
How can the data engineer run unit tests against functions that work with data in production?
Choices
- A: Define and import unit test functions from a separate Databricks notebook
- B: Define and unit test functions using Files in Repos
- C: Run unit tests against non-production data that closely mirrors production
- D: Define unit tests and functions within the same notebook
answer?
Answer: B Answer_ET: B Community answer B (75%) C (25%) Discussion
Comment 1335585 by arekm
- Upvotes: 1
Selected Answer: B B - to test functions we need to import them in our unit tests. This means that storing functions in notebooks is not a good idea. You store them separately - as “files” - and import them in notebooks the same way you import them in unit tests.
Comment 1324012 by Thameur01
- Upvotes: 2
Selected Answer: B Databricks Repos is the recommended way to organize and manage code, including functions and unit tests, in a scalable and maintainable way. By defining your functions and unit tests in Files in Repos, you can:
- Modularize your code: functions can be organized into separate Python files or modules, making them reusable and easier to test.
- Use standard testing frameworks: frameworks like pytest or unittest can be used to write and execute unit tests against these functions.
- Integrate with version control: Files in Repos can be version-controlled using Git, ensuring traceability and collaboration.
- Test production-like data: with proper safeguards, you can design unit tests to test production-like data while maintaining modularity and separation from production pipelines.
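As a concrete illustration of the Files-in-Repos approach, here is a minimal sketch; the file names, function, and test are all hypothetical:

```python
# transforms.py -- a plain Python file in the repo, importable both from
# notebooks and from test modules (hypothetical example):
def add_tax(amount: float, rate: float = 0.2) -> float:
    """Return the amount with tax applied."""
    return amount * (1 + rate)


# test_transforms.py -- a standard pytest module in the same repo,
# runnable with `pytest` without touching production data:
from transforms import add_tax

def test_add_tax_default_rate():
    assert add_tax(100.0) == 120.0
```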
Comment 1323674 by benni_ale
- Upvotes: 1
Selected Answer: C It is true that for Python and R notebooks, Databricks recommends storing functions and their unit tests outside of notebooks, so given that the question is in a Python environment, some could argue B is the correct solution. Nevertheless, that is only advice, and Databricks more generally states that it is a best practice not to run unit tests against functions that work with data in production. This is especially important for functions that add, remove, or otherwise change data. To protect your production data from being compromised by your unit tests in unexpected ways, you should run unit tests against non-production data. So I would go for option C; alternatively, I would have gone for B. https://docs.databricks.com/en/notebooks/testing.html?utm_source=chatgpt.com#write-unit-tests