Questions and Answers
Question qHHJA9X6TZbWVXs5sxtj
Question
A dataset has been defined using Delta Live Tables and includes an expectations clause: CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW. What is the expected behavior when a batch of data containing data that violates these constraints is processed?
Choices
- A: Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.
- B: Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.
- C: Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
- D: Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
- E: Records that violate the expectation cause the job to fail.
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 858825 by XiltroX
- Upvotes: 17
Selected Answer: C I am simply appalled by the number of wrong answers in this series of questions. The statement in the question already says "ON VIOLATION DROP ROW", which means that if the condition is violated, nothing is saved to a quarantine table and a log of all invalid entries will be recorded. All invalid data that doesn't meet the condition will be dropped. So C is the correct answer.
Comment 861148 by rafahb
- Upvotes: 5
Selected Answer: C C is correct
Comment 1314198 by 806e7d2
- Upvotes: 2
Selected Answer: C In Delta Live Tables, expectations are used to enforce data quality rules. In this specific case, the expectation is that the timestamp column should be greater than ‘2020-01-01’. When a batch of data is processed, if a record violates this expectation, the following happens:
- Drop the violating rows: the rows that don't meet the expectation (timestamp > '2020-01-01') are dropped from the dataset.
- Logging of the violation: the fact that these rows were dropped due to the violation is recorded in the event log for audit and tracking purposes.
This ensures that only valid data (according to the expectation) is loaded into the final dataset, while invalid data is tracked.
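For reference, a minimal sketch of the same drop-on-violation rule expressed through the DLT Python API (the question uses the SQL CONSTRAINT ... EXPECT ... ON VIOLATION DROP ROW form; the source dataset and function names below are hypothetical):

```python
import dlt

# Rows failing the named expectation are dropped from the target dataset;
# the drop counts surface in the pipeline's data quality metrics / event log.
@dlt.table(comment="Events with valid timestamps only")
@dlt.expect_or_drop("valid_timestamp", "timestamp > '2020-01-01'")
def valid_events():
    return dlt.read_stream("raw_events")  # hypothetical upstream dataset
```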
Comment 1286697 by Stefan94
- Upvotes: 1
Selected Answer: C 100% C
Comment 1274043 by gdc.moser
- Upvotes: 1
Selected Answer: C C is the correct answer.
Comment 1244556 by 3fbc31b
- Upvotes: 1
Selected Answer: C C is the correct answer. The DROP ROW clause causes violating records NOT to be added to the destination; they are only recorded in the log.
Comment 1203448 by benni_ale
- Upvotes: 1
Selected Answer: C C is correct
Comment 1117088 by SerGrey
- Upvotes: 1
Selected Answer: C C is correct
Comment 1109363 by Garyn
- Upvotes: 3
Selected Answer: C C. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
Explanation:
The defined expectation specifies that if the timestamp is not greater than ‘2020-01-01’, the row will be considered in violation of the constraint. The ON VIOLATION DROP ROW clause states that rows that violate the constraint will be dropped from the target dataset. Additionally, the expectation clause will log these violations in the event log, indicating which records did not meet the specified constraint criteria. This behavior ensures that the rows failing the defined constraint are not included in the target dataset and are logged as invalid in the event log for reference or further investigation, maintaining data integrity within the dataset based on the specified constraints.
Comment 1071782 by Huroye
- Upvotes: 1
Who chooses these answers? The correct answer is C. The record is dropped. This is not about the default behavior; it is explicit.
Comment 1058871 by DavidRou
- Upvotes: 1
Selected Answer: C Right answer: C Invalid rows will be dropped as requested by the constraint and flagged as such in log files. If you need a quarantine table, you’ll have to write more code.
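As a rough illustration of that extra code, one common pattern (a sketch only, reusing the hypothetical raw_events source from the example above) is a second table whose filter is the inverse of the expectation, so it captures exactly the rows the main table drops:

```python
import dlt

@dlt.table(comment="Quarantine: rows that fail the valid_timestamp rule")
def quarantined_events():
    return (
        dlt.read_stream("raw_events")               # hypothetical upstream dataset
           .where("NOT (timestamp > '2020-01-01')")  # inverse of the expectation
    )
```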
Comment 998000 by vctrhugo
- Upvotes: 2
Selected Answer: C C. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
With the defined constraint and expectation clause, when a batch of data is processed, any records that violate the expectation (in this case, where the timestamp is not greater than ‘2020-01-01’) will be dropped from the target dataset. These dropped records will also be recorded as invalid in the event log, allowing for auditing and tracking of the data quality issues without causing the entire job to fail.
Comment 985347 by AndreFR
- Upvotes: 2
Selected Answer: C https://docs.databricks.com/en/delta-live-tables/expectations.html
Comment 946489 by Atnafu
- Upvotes: 1
C When a batch of data is processed in Delta Live Tables and contains data that violates the defined expectations or constraints, the expected behavior is that the records violating the expectation are dropped from the target dataset. Additionally, these violated records are recorded as invalid in the event log.
Comment 945751 by mehroosali
- Upvotes: 1
Selected Answer: C C is correct
Comment 869376 by SHINGX
- Upvotes: 2
B is correct. This question is number 35 on the practice test in the Databricks Partner Academy. https://partner-academy.databricks.com/ The correct answer there is "Records that violate the expectation are added to the target dataset and recorded as invalid in the event log".
Comment 859705 by surrabhi_4
- Upvotes: 4
Selected Answer: C option C
Question 7TMIYRliPFNxzXswLgmw
Question
Which of the following describes when to use the CREATE STREAMING LIVE TABLE (formerly CREATE INCREMENTAL LIVE TABLE) syntax over the CREATE LIVE TABLE syntax when creating Delta Live Tables (DLT) tables using SQL?
Choices
- A: CREATE STREAMING LIVE TABLE should be used when the subsequent step in the DLT pipeline is static.
- B: CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally.
- C: CREATE STREAMING LIVE TABLE is redundant for DLT and it does not need to be used.
- D: CREATE STREAMING LIVE TABLE should be used when data needs to be processed through complicated aggregations.
- E: CREATE STREAMING LIVE TABLE should be used when the previous step in the DLT pipeline is static.
answer?
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 861990 by 4be8126
- Upvotes: 7
Selected Answer: B B. CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally. The CREATE STREAMING LIVE TABLE syntax is used to create tables that read data incrementally, while the CREATE LIVE TABLE syntax is used to create tables that read data in batch mode. Delta Live Tables support both streaming and batch modes of processing data. When the data is streamed and needs to be processed incrementally, CREATE STREAMING LIVE TABLE should be used.
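Loosely, the same distinction also appears in the DLT Python API. A minimal sketch with hypothetical table and function names: an incremental (streaming) read of the source versus a full batch read.

```python
import dlt

@dlt.table()  # counterpart of CREATE STREAMING LIVE TABLE: incremental processing
def orders_incremental():
    return dlt.read_stream("orders_raw")  # hypothetical source, read as a stream

@dlt.table()  # counterpart of CREATE LIVE TABLE: recomputed as a batch
def orders_snapshot():
    return dlt.read("orders_raw")         # same source, read in full
```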
Comment 858827 by XiltroX
- Upvotes: 6
Selected Answer: B B is the correct answer.
Comment 1306208 by ajay1709
- Upvotes: 1
B. CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally. Streaming is used for incremental data.
Comment 1262766 by 80370eb
- Upvotes: 1
Selected Answer: B B. CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally. This syntax is used to define a Delta Live Table that processes data incrementally as new data arrives, which is essential for handling streaming data or large datasets that need to be processed in chunks rather than all at once.
Comment 1117090 by SerGrey
- Upvotes: 1
Selected Answer: B B is correct
Comment 1064881 by awofalus
- Upvotes: 1
Selected Answer: B B is correct
Comment 998001 by vctrhugo
- Upvotes: 3
Selected Answer: B B. CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally.
The CREATE STREAMING LIVE TABLE syntax is used when you want to create Delta Live Tables (DLT) tables that are designed for processing data incrementally. This is typically used when your data pipeline involves streaming or incremental data updates, and you want the table to stay up to date as new data arrives. It allows you to define tables that can handle data changes incrementally without the need for full table refreshes.
So, option B correctly describes when to use CREATE STREAMING LIVE TABLE over CREATE LIVE TABLE in the context of Delta Live Tables.
Comment 923275 by ZSun
- Upvotes: 2
This is an older version of the question; the previously separate STREAMING LIVE TABLE and LIVE TABLE forms have since been combined in current Databricks syntax.
Comment 859706 by surrabhi_4
- Upvotes: 4
Selected Answer: B option B
Question W6NTZ2EikiTdVVnfQAuQ
Question
A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run in the pipeline, and set up the pipeline to only ingest those new files with each run. Which of the following tools can the data engineer use to solve this problem?
Choices
- A: Unity Catalog
- B: Delta Lake
- C: Databricks SQL
- D: Data Explorer
- E: Auto Loader
answer?
Answer: E Answer_ET: E Community answer E (100%) Discussion
Comment 1314202 by 806e7d2
- Upvotes: 1
Selected Answer: E Auto Loader is a feature in Databricks that is specifically designed to efficiently ingest new files incrementally from cloud storage directories. It can handle the scenario where files accumulate in a shared directory, and you want to ingest only the new files since the previous run without reprocessing the entire dataset.
Auto Loader can use directory listing or file notification services to track new files that appear in the directory, enabling incremental processing of those files as they are added. It also supports schema inference and automatically manages the state of the files that have been processed, so you don't need to manually track which files have been ingested.
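A minimal Auto Loader sketch of the pattern described above, assuming a Databricks notebook where spark is predefined; the paths, file format, and table name are placeholders:

```python
# Auto Loader ("cloudFiles") keeps track of which files it has already ingested,
# so each run picks up only new files and leaves the originals in place.
df = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/mnt/checkpoints/source_schema")
          .load("/mnt/shared/source_dir"))

(df.writeStream
   .option("checkpointLocation", "/mnt/checkpoints/bronze")
   .trigger(availableNow=True)   # process whatever is new, then stop
   .toTable("bronze_events"))
```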
Comment 1262767 by 80370eb
- Upvotes: 1
Selected Answer: E E. Auto Loader Auto Loader is designed to incrementally ingest new data files as they appear in a directory, making it ideal for scenarios where files accumulate and need to be ingested without reprocessing previously ingested files. It automatically tracks which files have already been processed, ensuring that only new files are ingested with each pipeline run.
Comment 1203449 by benni_ale
- Upvotes: 1
Selected Answer: E E is correct
Comment 1117091 by SerGrey
- Upvotes: 1
Selected Answer: E E is correct
Comment 1071789 by Huroye
- Upvotes: 1
the data engineer needs to identify which files are new since the previous run. This seems to be an analysis effort. If that is the case, and I might be wrong, then DB SQL is the correct answer.
Comment 1058949 by DavidRou
- Upvotes: 1
Selected Answer: E Autoloader can help if you want to ingest data incrementally.
Comment 985350 by AndreFR
- Upvotes: 2
Selected Answer: E Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any additional setup.
https://docs.databricks.com/en/ingestion/auto-loader/index.html
Comment 859707 by surrabhi_4
- Upvotes: 3
Selected Answer: E option E
Comment 858828 by XiltroX
- Upvotes: 4
Selected Answer: E E is the correct answer.
Question b1t1aijUmn7cwVZsueFv
Question
Which of the following Structured Streaming queries is performing a hop from a Silver table to a Gold table?
Choices
- A:
- B:
- C:
- D:
- E:
answer?
Answer: E Answer_ET: E Community answer E (100%) Discussion
Comment 1203451 by benni_ale
- Upvotes: 3
Selected Answer: E E - Aggregations are performed from silver to gold
Comment 1137339 by agAshish
- Upvotes: 1
Answer should be A; for writeStream the data should be a stream only, not static.
Comment 1117092 by SerGrey
- Upvotes: 2
Selected Answer: E E is correct
Comment 1066559 by surya_lolla
- Upvotes: 4
The best practice is to use "complete" as the output mode instead of "append" when working with aggregated tables. Since the gold layer holds the final aggregated tables, the only option with the output mode set to complete is option E.
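A sketch of the kind of query option E represents (table, column, and path names below are hypothetical, and spark is assumed to be a Databricks notebook session): stream from the Silver table, aggregate, and write the Gold table in complete output mode so the aggregate is rewritten on each update.

```python
from pyspark.sql import functions as F

(spark.readStream
      .table("sales_silver")                        # hypothetical Silver table
      .groupBy("store_id")
      .agg(F.sum("amount").alias("total_amount"))   # gold-level aggregation
      .writeStream
      .outputMode("complete")                       # rewrite the aggregate each update
      .option("checkpointLocation", "/mnt/checkpoints/sales_gold")
      .toTable("sales_gold"))                       # hypothetical Gold table
```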
Comment 1058952 by DavidRou
- Upvotes: 1
Selected Answer: E E is the right answer. The "gold layer" is used to store aggregated, clean data; E is the only answer in which aggregation is performed.
Comment 1026390 by tocs
- Upvotes: 2
Selected Answer: E E as we’re doing an aggregation and we’re rewriting the whole table and not just appending.
Comment 971585 by GhaneshK
- Upvotes: 2
E is correct, as it is the one that includes a GROUP BY on store.
Comment 861162 by rafahb
- Upvotes: 3
Selected Answer: E E option
Comment 859711 by surrabhi_4
- Upvotes: 3
Selected Answer: E Option E
Comment 858829 by XiltroX
- Upvotes: 4
E is the correct answer.
Question 30ntmlnPGYfBKJRuMwga
Question
A data engineer has three tables in a Delta Live Tables (DLT) pipeline. They have configured the pipeline to drop invalid records at each table. They notice that some data is being dropped due to quality concerns at some point in the DLT pipeline. They would like to determine at which table in their pipeline the data is being dropped. Which of the following approaches can the data engineer take to identify the table that is dropping the records?
Choices
- A: They can set up separate expectations for each table when developing their DLT pipeline.
- B: They cannot determine which table is dropping the records.
- C: They can set up DLT to notify them via email when records are dropped.
- D: They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.
- E: They can navigate to the DLT pipeline page, click on the “Error” button, and review the present errors.
answer?
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 998004 by vctrhugo
- Upvotes: 16
Selected Answer: D D. They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.
To identify the table in a Delta Live Tables (DLT) pipeline where data is being dropped due to quality concerns, the data engineer can navigate to the DLT pipeline page, click on each table in the pipeline, and view the data quality statistics. These statistics often include information about records dropped, violations of expectations, and other data quality metrics. By examining the data quality statistics for each table in the pipeline, the data engineer can determine at which table the data is being dropped.
Comment 1276104 by CID2024
- Upvotes: 1
The correct answer is: D. They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics. Delta Live Tables provides detailed data quality statistics for each table in the pipeline. By navigating to the DLT pipeline page and clicking on each table, the data engineer can view these statistics and determine at which table the records are being dropped due to quality concerns. This allows them to identify and address the specific issues causing the data to be dropped.
Comment 1244558 by 3fbc31b
- Upvotes: 1
Selected Answer: D Correct answer is “D”.
Comment 1203454 by benni_ale
- Upvotes: 1
Selected Answer: D I would say D but I have never really tested it, still other solutions smell wrong
Comment 1137342 by agAshish
- Upvotes: 1
D is correct. By clicking on each table in the DLT pipeline page, the data engineer can access data quality statistics, error logs, or other information related to dropped records. This can help them pinpoint at which table in the pipeline the data is being dropped.
Comment 1089200 by Diewrine
- Upvotes: 2
Selected Answer: D E is for when an error occurs, but the pipeline is configured to drop some records, which will not result in an error.
Comment 1064887 by awofalus
- Upvotes: 1
Selected Answer: D D is correct
Comment 946516 by Atnafu
- Upvotes: 2
E When records are dropped due to quality concerns in a DLT pipeline, the errors are logged in the event log. The data engineer can navigate to the DLT pipeline page and click on the “Error” button to view the present errors. The errors will show the table where the records were dropped. Option A: Setting up separate expectations for each table will not help the data engineer determine which table is dropping the records.
Option B: The data engineer cannot determine which table is dropping the records without looking at the event log.
Option C: Setting up DLT to notify the data engineer via email when records are dropped will not help the data engineer determine which table is dropping the records.
Option D: Viewing the data quality statistics for each table will not help the data engineer determine which table is dropping the records.
Comment 896653 by prasioso
- Upvotes: 4
Selected Answer: D I think the answer is D. The pipeline is configured to drop invalid records, i.e. the SQL equivalent of a query with an ON VIOLATION DROP ROW clause. This will not result in a failed pipeline execution because there are no errors. Instead, you'd have to go to each table and review the quality characteristics.
Comment 879419 by [Removed]
- Upvotes: 3
Is this for v2 or v3
Comment 862401 by XiltroX
- Upvotes: 4
Selected Answer: D The correct answer is D