Questions and Answers
Question G1IIs43ohieuOtpgXvZx
Question
Which statement describes the default execution mode for Databricks Auto Loader?
Choices
- A: Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; the target table is materialized by directly querying all valid files in the source directory.
- B: New files are identified by listing the input directory; the target table is materialized by directly querying all valid files in the source directory.
- C: Webhooks trigger a Databricks job to run anytime new data arrives in a source directory; new data are automatically merged into target tables using rules inferred from the data.
- D: New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.
- E: Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and idempotently loaded into the target Delta Lake table.
Answer
Answer: D Answer_ET: D Community answer: D (90%), other (10%)
Discussion
Comment 1141632 by vctrhugo
- Upvotes: 7
Selected Answer: D “Auto Loader uses directory listing mode by default. In directory listing mode, Auto Loader identifies new files by listing the input directory.”
https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/directory-listing-mode
Comment 1335478 by arekm
- Upvotes: 2
Selected Answer: D Auto Loader supports:
- directory listing mode
- file notification mode
The first mode is the default. Answer E describes the second.
Comment 1134517 by Rinscy
- Upvotes: 2
D, definitely! Auto Loader is an optimized file source that overcomes these limitations and provides a seamless way for data teams to load raw data at low cost and latency with minimal DevOps effort. You just need to provide a source directory path and start a streaming job. The new Structured Streaming source, called "cloudFiles", can also automatically set up file notification services that subscribe to file events from the input directory and process new files as they arrive, with the option of also processing existing files in that directory.
Comment 1133248 by ranith
- Upvotes: 1
Selected Answer: D
Comment 1132887 by get_certified9
- Upvotes: 1
D is the answer. The default execution mode for Databricks Auto Loader is directory listing mode.
Comment 1131843 by spaceexplorer
- Upvotes: 1
Selected Answer: E E is the answer
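For concreteness, here is a minimal sketch of Auto Loader running in its default directory listing mode, as answer D describes. All paths, the schema location, and the target table name are hypothetical, and the cloudFiles.useNotifications option is shown only to make the default explicit.

```python
# Minimal sketch of Auto Loader in its default directory listing mode.
# All paths and the target table name are hypothetical; `spark` is the
# ambient SparkSession in a Databricks notebook.
df = (
    spark.readStream
    .format("cloudFiles")                            # the Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "false")  # default: directory listing
    .option("cloudFiles.schemaLocation", "s3://bucket/_schemas/landing")
    .load("s3://bucket/landing/")                    # new files found by listing
)

(
    df.writeStream
    .option("checkpointLocation", "s3://bucket/_checkpoints/landing")
    .trigger(availableNow=True)
    .toTable("bronze.landing_events")                # incremental, idempotent load
)
```

The checkpoint is what makes the load incremental and idempotent: files already processed are tracked there and never ingested twice.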
Question LsqagrTu1dqqMKpyA36U
Question
A Delta Lake table representing metadata about content posts from users has the following schema:
user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE
Based on the above schema, which column is a good candidate for partitioning the Delta Table?
Choices
- A: post_time
- B: latitude
- C: post_id
- D: user_id
- E: date
Answer
Answer: E Answer_ET: E Community answer: E (100%)
Discussion
Comment 1304490 by benni_ale
- Upvotes: 1
Selected Answer: E Date is usually the best candidate for time-series data, absent further specifics.
Comment 1141629 by vctrhugo
- Upvotes: 4
Selected Answer: E Partitioning a Delta Lake table on the date column is a common practice. This is because partitioning by date can significantly improve query performance when dealing with time-series data. It allows for efficient filtering of data based on time periods, which is a common requirement in many analytics workloads. Partitioning by date also helps manage the size of your partitions, as each partition will contain only the data for a specific date. This can lead to more efficient reads and writes, and can also make it easier to manage and maintain your data.
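As a sketch of the accepted answer, the snippet below writes the posts table partitioned by the low-cardinality date column; the DataFrame and table names are hypothetical.

```python
# Hypothetical sketch: persist the posts metadata partitioned by `date`.
# A low-cardinality date column keeps the partition count manageable and
# lets time-range queries skip whole partitions.
(
    posts_df.write
    .format("delta")
    .partitionBy("date")            # one partition per day
    .mode("append")
    .saveAsTable("social.posts")
)
```

By contrast, near-unique columns such as post_id, post_time, or latitude would produce huge numbers of tiny partitions, which is why date wins here.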
Question 02gUaptuYlAMQae3eZRi
Question
The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings. The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization. The compliance officer has recently learned about Delta Lake’s time travel functionality. They are concerned that this might allow continued access to deleted data. Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?
Choices
- A: Because the VACUUM command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.
- B: Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the VACUUM job is run the following day.
- C: Because Delta Lake time travel provides full access to the entire history of a table, deleted records can always be recreated by users with full admin privileges.
- D: Because Delta Lake’s delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.
- E: Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the VACUUM job is run 8 days later.
Answer
Answer: E Answer_ET: E Community answer: E (61%), A (39%)
Discussion
Comment 969790 by asmayassineg
- Upvotes: 20
Answer is E, default retention period is 7 days https://learn.microsoft.com/en-us/azure/databricks/delta/vacuum
Comment 1160222 by mardigras
- Upvotes: 12
Selected Answer: A The answer has to be A. The deletion is done on Sunday at 1 am, and the next day, Monday at 3 am, VACUUM is initiated, so one can only time travel for about 24 hours.
Comment 1556603 by Deep92
- Upvotes: 1
Selected Answer: E The team configured weekly deletions, and VACUUM removes each week's data on that cadence, so E is correct for me.
Comment 1361619 by Tedet
- Upvotes: 1
Selected Answer: E Read the question carefully: "all deletions from the previous week", "Every Monday at 3am, a batch job executes a series of VACUUM commands". Logically that is the 8th day. Apply the basics: the default retention threshold for VACUUM is 7 days, beyond which running VACUUM purges the data.
Comment 1350616 by fabiospont
- Upvotes: 1
Selected Answer: E E is correct, because the VACUUM retention threshold is 168 hours (7 days) after the delete statements.
Comment 1334744 by arekm
- Upvotes: 1
Selected Answer: E The Sunday delete puts the deleted records in the transaction log, and the retention clock starts ticking. Since the default for keeping history is 168 hours (7 days), the following Monday's VACUUM cannot remove them: the clock is only at hour 26 of 168.
Comment 1324383 by Anithec0der
- Upvotes: 1
Selected Answer: E I also thought the data would be removed immediately after the VACUUM command runs, but the delete only logically removes the data; the files are not physically removed until 7 days have passed, so E is correct.
Comment 1322341 by AlejandroU
- Upvotes: 1
Selected Answer: E Answer E. The retention period for time travel queries in Delta Lake is controlled by a 7-day default, not 24 hours. Hence, the statement (Option A) that deleted records may be accessible for around 24 hours is incorrect in the context of Delta Lake’s default retention period.
Comment 1297980 by benni_ale
- Upvotes: 1
Selected Answer: E The default retention period is 7 days, so the VACUUM command won't delete the files corresponding to the rows deleted on Sunday at 1 am, only files from the week before.
Comment 1292204 by tangerine141
- Upvotes: 2
Selected Answer: E Delta Lake’s default retention threshold for old data files (which allows time travel) is 7 days. This means that even after records are deleted, the files that previously contained those records are kept for 7 days before they are eligible for permanent deletion by the VACUUM command. The VACUUM command is responsible for permanently deleting the old data files after the retention period. Since the job runs every Monday, this means that data deleted during the previous week will not be fully purged until after the retention period has passed (which would be 8 days after the deletion, considering the weekly processing).
Comment 1290679 by akashdesarda
- Upvotes: 1
Selected Answer: E The delete job runs as a batch on Sunday for all requests made during the week, and VACUUM runs the next day. Since there is no mention of a changed retention period, it is 7 days: VACUUM deletes data older than 7 days, i.e. the previous week's data, not the current week's. The current week's data will be removed by the following week's VACUUM job.
Comment 1265869 by fe3b2fc
- Upvotes: 3
Selected Answer: E From the documentation: "The default retention threshold for data files after running VACUUM is 7 days." It doesn't matter that VACUUM is run the following day; on a default setup the retention period is still 7 days when the Monday VACUUM executes.
Comment 1237848 by 03355a2
- Upvotes: 3
Selected Answer: A They expect the previous week's deleted records to be removed on Sunday between 1 am and 2 am. Then the next day (Monday) at 3 am, roughly 24 hours later, the VACUUM command is run. This means the records from the previous week are only around for about 24 hours before the VACUUM removes them. They aren't waiting 8 days to run the command; therefore E is wrong.
Comment 1224442 by imatheushenrique
- Upvotes: 1
E. Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the VACUUM job is run 8 days later.
Comment 1213858 by coercion
- Upvotes: 1
Selected Answer: E The default retention period is 7 days, so data newly deleted on Sunday will remain available for the next 7 days (even though VACUUM runs on Monday, it deletes only data older than 7 days, not the data deleted yesterday, "Sunday").
Comment 1204722 by Tayari
- Upvotes: 1
Selected Answer: E The default retention threshold for data files after running VACUUM is 7 days.
Comment 1192232 by hedbergare
- Upvotes: 1
Selected Answer: E Answer is E
Comment 1191793 by juliom6
- Upvotes: 2
Selected Answer: A Although the data is deleted (DELETE) on Sunday, it can still be recovered via time travel; only the next day (Monday), when VACUUM runs, is that possibility removed. Consequently, the data can be recovered during that window of roughly 24 hours.
Comment 1145982 by RiktRikt007
- Upvotes: 1
Selected Answer: E If I have v0: create table, v1: insert 2 records, v2: insert 2 records, v3: delete 2 records, and then run the VACUUM command (with the default 7-day retention), the deleted records will still be there and you can access them using SELECT * FROM delta_table VERSION AS OF 2;
Comment 1130888 by spaceexplorer
- Upvotes: 1
Selected Answer: E Answer is E
Comment 1118559 by kz_data
- Upvotes: 1
Selected Answer: E Answer is E
Comment 1118558 by kz_data
- Upvotes: 1
Answer is E as the default retention period is 7 days
Comment 1117264 by RafaelCFC
- Upvotes: 1
Selected Answer: E Correct according to the documentation: https://docs.databricks.com/en/sql/language-manual/delta-vacuum.html
Comment 1086048 by hamzaKhribi
- Upvotes: 1
Correct answer is E. In this question the tables use default settings, and given that the Delta retention threshold is 7 days, the data will still be accessible for those 7 days.
Comment 1075927 by aragorn_brego
- Upvotes: 6
Selected Answer: E Delta Lake’s time travel feature allows you to query an older snapshot of a table. By default, Delta Lake retains a 7-day history for the table to support operations like time travel. When data is deleted from a Delta table, the actual data files are not immediately removed from the storage layer; they are just marked for deletion. The VACUUM command is used to clean up these files that are no longer in the state of the table, but it will not remove any files that fall within the retention period unless it is run with an override option to reduce the retention period.
Thus, if the deletions are processed on Sunday and the VACUUM command is run on Monday without overriding the default retention period, the deleted records would still be accessible via time travel for approximately 8 days (until the next run of the VACUUM command after the data has aged past the 7-day retention period).
Comment 1060781 by BIKRAM063
- Upvotes: 2
Answer is E
Comment 1040257 by sturcu
- Upvotes: 1
Selected Answer: E VACUUM's default retention is 7 days.
Comment 991549 by Eertyy
- Upvotes: 6
E is the right answer.
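To ground the retention math, here is an illustrative sketch (table name hypothetical). With default table settings, Monday's VACUUM cannot remove the files deleted on Sunday, because they are only about 26 hours into the 168-hour retention window.

```python
# With default settings, VACUUM removes only files older than the 7-day
# (168-hour) retention threshold, so Sunday's deleted files survive
# Monday's run and remain reachable via time travel until they age out.
spark.sql("VACUUM prod.user_data")  # table name hypothetical; 168-hour default

# Shortening retention (what option A implicitly assumes) requires disabling
# a safety check first, i.e. a non-default configuration:
# spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
# spark.sql("VACUUM prod.user_data RETAIN 24 HOURS")
```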
Question VYbv6i0YiCrD11hUNgc3
Question
A large company seeks to implement a near real-time solution involving hundreds of pipelines with parallel updates of many tables with extremely high volume and high velocity data.
Which of the following solutions would you implement to achieve this requirement?
Choices
- A: Use Databricks High Concurrency clusters, which leverage optimized cloud storage connections to maximize data throughput.
- B: Partition ingestion tables by a small time duration to allow for many data files to be written in parallel.
- C: Configure Databricks to save all data to attached SSD volumes instead of object storage, increasing file I/O significantly.
- D: Isolate Delta Lake tables in their own storage containers to avoid API limits imposed by cloud vendors.
- E: Store all tables in a single database to ensure that the Databricks Catalyst Metastore can load balance overall throughput.
Answer
Answer: A Answer_ET: A Community answer: A (45%), B (45%), other (9%)
Discussion
Comment 1322023 by natadatabricksadf
- Upvotes: 3
Selected Answer: B High Concurrency clusters are deprecated, so B. https://learn.microsoft.com/en-us/answers/questions/1688410/are-high-concurrency-clusters-deprecated-or-rename
Comment 1320684 by temple1305
- Upvotes: 3
Selected Answer: B High Concurrency clusters are deprecated, so B?
Comment 1286961 by shaojunni
- Upvotes: 2
Selected Answer: A “hundreds of pipelines with parallel updates of many tables” indicates updating many tables concurrently via many pipelines. A is the best solution for that. B is the answer for updating a few large tables with few partitions.
Comment 1267588 by practicioner
- Upvotes: 1
Selected Answer: B "Which of the following solutions" — I'm sure this is a multiple-choice question. Options A and B are correct together.
Comment 1224589 by BrianNguyen95
- Upvotes: 2
Selected Answer: B High volume and high-velocity data ingestion often becomes a bottleneck due to limited write parallelism. By partitioning ingestion tables based on small time durations (e.g., hourly or even minutes), you create many smaller partitions. This allows concurrent writes to different partitions, significantly increasing the overall throughput of your data ingestion.
Comment 1216696 by svik
- Upvotes: 1
Selected Answer: A Since multiple pipelines are being used, a High Concurrency cluster would give maximum resource utilization.
Comment 1193397 by Er5
- Upvotes: 1
A. B only helps improve ingestion performance for large tables.
Comment 1162060 by Curious76
- Upvotes: 2
Selected Answer: D Why not D?
Comment 1141625 by vctrhugo
- Upvotes: 1
Both options A and B could be relevant depending on the specific details of the use case. If the emphasis is on optimizing concurrent queries and overall data throughput, option A might be more appropriate. If the primary concern is parallel updates of tables with high-volume, high-velocity data, option B is a more targeted approach.
Comment 1138390 by PrincipalJoe
- Upvotes: 1
Selected Answer: B The best way to deal with high volume and high velocity data is to use partitioning
Comment 1119522 by bacckom
- Upvotes: 2
Selected Answer: A Databricks High Concurrency cluster
Comment 1086146 by petrv
- Upvotes: 1
Selected Answer: A
1. Partitioning by time: partitioning tables by a small time duration allows for efficient parallelism in data writes. Each time partition can be processed independently, enabling parallel updates to multiple partitions concurrently.
2. Optimizing for parallelism: by partitioning the tables based on time, data can be ingested and processed in parallel, providing the ability to handle high-volume, high-velocity data effectively.
Regarding option A, Databricks High Concurrency clusters are more focused on supporting a large number of concurrent users, which might not directly address the requirement for parallel updates of many tables with extremely high-volume, high-velocity data.
Comment 1076850 by aragorn_brego
- Upvotes: 4
Selected Answer: A High Concurrency clusters in Databricks are designed for multiple concurrent users and workloads. They provide fine-grained sharing of cluster resources and are optimized for operations such as running multiple parallel queries and updates. This would be suitable for a solution that involves many pipelines with parallel updates, especially with high volume and high velocity data.
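For readers weighing option B, here is a hedged sketch of time-bucketed ingestion (all names hypothetical): partitioning the ingestion table by a small time duration lets many parallel writers append to disjoint partitions.

```python
from pyspark.sql import functions as F

# Hypothetical sketch of option B: derive a coarse time bucket and partition
# the ingestion table by it, so concurrent pipelines write to disjoint paths.
events_df = spark.createDataFrame(
    [("2024-01-01 10:15:00", "click")], ["event_time", "action"]
).withColumn("event_time", F.to_timestamp("event_time"))

bucketed = events_df.withColumn(
    "ingest_hour", F.date_trunc("hour", F.col("event_time"))
)

(
    bucketed.write
    .format("delta")
    .partitionBy("ingest_hour")   # small time buckets => many parallel writers
    .mode("append")
    .saveAsTable("bronze.events")
)
```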
Question DTp4kPnjC3Wx52BA5wHy
Question
Which describes a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?
Choices
- A: Run source env/bin/activate in a notebook setup script
- B: Use b in a notebook cell
- C: Use %pip install in a notebook cell
- D: Use %sh pip install in a notebook cell
- E: Install libraries from PyPI using the cluster UI
Answer
Answer: C Answer_ET: C Community answer: C (100%)
Discussion
Comment 1092032 by JamesWright
- Upvotes: 3
C is correct
Comment 1076851 by aragorn_brego
- Upvotes: 3
Selected Answer: C In Databricks notebooks, you can use the %pip install command in a notebook cell to install a Python package. This will install the package on all nodes in the currently active cluster at the notebook level. It is a feature provided by Databricks to facilitate the installation of Python libraries for the notebook environment specifically.
Comment 1071845 by 60ties
- Upvotes: 3
Selected Answer: C C is correct
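For reference, a notebook-scoped install looks like the cell below; the package and version are arbitrary examples. %pip installs the package on every node of the active cluster, but the resulting environment is isolated to this notebook.

```python
# Databricks notebook cell: installs on all nodes of the currently active
# cluster, scoped to this notebook's Python environment (package arbitrary).
%pip install requests==2.31.0
```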