Questions and Answers
Question ZteZImx77lSHJs51Vigy
Question
A data engineer is attempting to drop a Spark SQL table my_table. The data engineer wants to delete all table metadata and data. They run the following command:
DROP TABLE IF EXISTS my_table
While the object no longer appears when they run SHOW TABLES, the data files still exist. Which of the following describes why the data files still exist and the metadata files were deleted?
Choices
- A: The table’s data was larger than 10 GB
- B: The table’s data was smaller than 10 GB
- C: The table was external
- D: The table did not have a location
- E: The table was managed
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1262417 by 80370eb
- Upvotes: 2
Selected Answer: C C. The table was external
When dropping an external table in Spark SQL, only the metadata is removed. The actual data files remain in their original location because they are not managed by Spark but by the external source.
Comment 1113405 by SerGrey
- Upvotes: 1
Selected Answer: C C is correct
Comment 1049077 by hemanthgvsk
- Upvotes: 1
THE QUESTION SHOULD BE “Which of the following describes why the metadata files still exist and the data files were deleted?”
Comment 997957 by vctrhugo
- Upvotes: 2
Selected Answer: C C. The table was external
The reason why the data files still exist while the metadata files were deleted is because the table was external. When a table is external in Spark SQL (or in other database systems), it means that the table metadata (such as schema information and table structure) is managed externally, and Spark SQL assumes that the data is managed and maintained outside of the system. Therefore, when you execute a DROP TABLE statement for an external table, it removes only the table metadata from the catalog, leaving the data files intact.
On the other hand, for managed tables (option E), Spark SQL manages both the metadata and the data files. When you drop a managed table, it deletes both the metadata and the associated data files, resulting in a complete removal of the table.
Comment 860585 by surrabhi_4
- Upvotes: 2
Selected Answer: C Option C
Comment 858081 by XiltroX
- Upvotes: 2
Selected Answer: C C is the correct answer. For external tables, you need to find the table's specific location (e.g., with DESCRIBE TABLE EXTENDED) and delete the files yourself.
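The behavior the discussion describes can be sketched in Spark SQL as follows; the table names and storage path are illustrative, not taken from the question.

```sql
-- Sketch (Spark SQL): managed vs. external table drop behavior.
-- Table names and the storage path are illustrative.

-- Managed table: Spark owns both the metadata and the data files.
CREATE TABLE managed_example (id INT, name STRING);
DROP TABLE IF EXISTS managed_example;   -- removes metadata AND data files

-- External table: the LOCATION clause means the data files are managed
-- outside of Spark.
CREATE TABLE external_example (id INT, name STRING)
LOCATION 's3://example-bucket/path/to/data';
DROP TABLE IF EXISTS external_example;  -- removes metadata only; files remain
```

Before dropping, DESCRIBE TABLE EXTENDED shows a Type of MANAGED or EXTERNAL along with the table's Location, which is where any leftover files would have to be deleted manually.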
Question 5i1wMxJ7eBiNt7BTjvnD
Question
A data engineer wants to create a data entity from a couple of tables. The data entity must be used by other data engineers in other sessions. It also must be saved to a physical location. Which of the following data entities should the data engineer create?
Choices
- A: Database
- B: Function
- C: View
- D: Temporary view
- E: Table
answer?
Answer: E Answer_ET: E Community answer E (80%) C (20%) Discussion
Comment 894193 by Bob123456
- Upvotes: 33
Questions says :
- The data entity must be used by other data engineers in other sessions.
- It also must be saved to a physical location.
A view doesn't store data in a physical location; of the options, only a table stores data in a physical location.
So the answer should be 'E', which is Table.
Comment 1133111 by Nika12
- Upvotes: 9
Selected Answer: E Just got 100% in the exam. Table was a correct answer.
Comment 1262422 by 80370eb
- Upvotes: 2
Selected Answer: E E. Table
Creating a table ensures that the data entity is saved to a physical location and can be used by other data engineers in different sessions. Tables persist data and metadata, making them suitable for long-term storage and sharing across sessions.
Comment 1244549 by 3fbc31b
- Upvotes: 1
Selected Answer: E The correct answer is "E". A view is not saved to a physical location; it only stores the text of a SELECT statement.
Comment 1203436 by benni_ale
- Upvotes: 2
Selected Answer: E physical location means table
Comment 1137327 by agAshish
- Upvotes: 1
E. A view doesn't have any location.
Comment 1113406 by SerGrey
- Upvotes: 1
Selected Answer: E E is correct
Comment 1109256 by Garyn
- Upvotes: 1
Selected Answer: E E. Table
Usage by Other Sessions: Tables in a database are persistent data structures that can be accessed by multiple users and sessions concurrently.
Saved to a Physical Location: Tables store data physically in a structured manner on disk or in a storage system, making them suitable for long-term storage.
Usage by Other Data Engineers: Other data engineers can query, access, and work with the data within the table, making it a feasible choice for shared access among multiple users or sessions.
While other entities like views or temporary views can provide different ways to represent or filter data, a table fits the criteria best when the data engineer requires a persistent physical storage entity accessible by other sessions and users for data manipulation, retrieval, and storage.
Comment 1093253 by RafaelCFC
- Upvotes: 4
Selected Answer: C I think the key to the answer is that it refers to the data entity, not to the data itself, when it mentions "the data entity must be used by other data engineers" and "it must be saved to a physical location". From this point of view, both C and E would be correct; however, creating a new table would mean materializing the relationship between "a couple of tables" into a static state. While this makes sense for many use cases, it would require either a Workflow or DLT to keep it current, which goes beyond the requested scope. C is the best answer for the requested scenario.
Comment 1084859 by Vikram1710
- Upvotes: 2
Key point to remember during answering this question: ” It also must be saved to a physical location”
So answer should be ‘E’ which is Table.
Comment 1084842 by rbeeraka
- Upvotes: 1
C is the right answer. View is a data entity and its definition is physically saved so other users can consume view
Comment 1071639 by Huroye
- Upvotes: 1
The correct answer is E because it has to be physically saved. View is in memory.
Comment 1064826 by awofalus
- Upvotes: 1
Selected Answer: E Correct : E
Comment 997958 by vctrhugo
- Upvotes: 2
Selected Answer: E E. Table
To create a data entity that can be used by other data engineers in other sessions and must be saved to a physical location, you should create a table. Tables in a database are physical storage structures that hold data, and they can be accessed and shared by multiple users and sessions. By creating a table, you provide a permanent and structured storage location for the data entity that can be used across different sessions and by other users as needed.
Options like databases (A) can be used to organize tables, views (C) can provide virtual representations of data, and temporary views (D) are temporary in nature and don’t save data to a physical location. Functions (B) are typically used for processing data or performing calculations, not for storing data.
Comment 993224 by [Removed]
- Upvotes: 1
Selected Answer: E View does not have a physical location so answer has to be E
Comment 972916 by Kartz130789
- Upvotes: 2
Selected Answer: E A view doesn't have a physical location.
Comment 951440 by ehsanmor18
- Upvotes: 2
The answer is E: “Table”
In the context described, creating a “Table” is the most suitable choice. Tables in SQL are data entities that exist independently of any session and are saved in a physical location. They can be accessed and manipulated by other data engineers in different sessions, which aligns with the requirements stated.
A “Database” is a collection of tables, views, and other database objects. A “Function” is a stored procedure that performs an operation. A “View” is a virtual table based on the result-set of an SQL statement, but it is not stored physically. A “Temporary view” is a feature that allows you to store the result of a query as a view that disappears once your session with the database is closed.
Comment 949868 by keksssd
- Upvotes: 1
Selected Answer: E answer e
Comment 946074 by Atnafu
- Upvotes: 1
C A view is a virtual table that is created from a query on one or more tables. Views are stored in the database and can be used by other data engineers in other sessions.
The other options are not correct.
Option A: A database is a collection of tables.
Option B: A function is a named block of code that can be executed.
Option D: A temporary view is a view that is only stored in memory and is not saved to a physical location.
Option E: A table is a physical collection of data.
Comment 945745 by mehroosali
- Upvotes: 1
Selected Answer: E View is wrong. it should be table.
Comment 873777 by hrabiabw
- Upvotes: 2
[C] - A view stores the text for a query typically against one or more data sources or tables in the metastore. You can query views from any part of the Databricks product, assuming you have permission to do so. https://learn.microsoft.com/en-us/azure/databricks/lakehouse/data-objects
Comment 861944 by 4be8126
- Upvotes: 2
The data engineer should create a table.
A table can be used by other data engineers in other sessions, and it can be saved to a physical location. Views and temporary views can also be used by other data engineers, but they do not have a physical location to store data. A function is a piece of code that can be called repeatedly with different arguments, but it does not store data. A database is a logical container that can hold multiple tables and views.
Comment 860586 by surrabhi_4
- Upvotes: 1
Selected Answer: C option C
Comment 858084 by XiltroX
- Upvotes: 1
Selected Answer: C C is the correct answer. Please check link below. https://learn.microsoft.com/en-us/azure/databricks/lakehouse/data-objects
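The distinction debated above can be sketched in Spark SQL; the entity names and the source table `sales` are illustrative.

```sql
-- Sketch (Spark SQL): persistence of each entity type.
-- Names and the source table `sales` are illustrative.

-- Table: the query result is materialized to data files in storage,
-- and the entity is visible to other users and sessions.
CREATE TABLE sales_summary AS
SELECT region, SUM(amount) AS total FROM sales GROUP BY region;

-- View: only the query text is saved to the metastore; no data files
-- are written. Visible across sessions, but not "physical".
CREATE VIEW sales_summary_v AS
SELECT region, SUM(amount) AS total FROM sales GROUP BY region;

-- Temporary view: scoped to the current session; other sessions
-- cannot see it, and it disappears when the session ends.
CREATE TEMP VIEW sales_summary_tmp AS
SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
```

Only the table satisfies both requirements in the question: visibility from other sessions and data saved to a physical location.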
Question p9b7g6fqIxbPOw3ZGgwh
Question
A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the quality level. Which of the following tools can the data engineer use to solve this problem?
Choices
- A: Unity Catalog
- B: Data Explorer
- C: Delta Lake
- D: Delta Live Tables
- E: Auto Loader
answer?
Answer: D Answer_ET: D Community answer D (95%) 5% Discussion
Comment 858087 by XiltroX
- Upvotes: 17
Selected Answer: D The answer is incorrect. The correct answer is Delta Live Tables, i.e. (D). https://docs.databricks.com/delta-live-tables/expectations.html
Comment 999903 by DQCR
- Upvotes: 8
Selected Answer: D Delta Live Tables is a declarative framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations to perform on your data and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling.
Quality is explicitly mentioned in the definition.
Comment 1314149 by 806e7d2
- Upvotes: 4
Selected Answer: D Delta Live Tables (DLT) is designed for building and managing data pipelines with built-in support for data quality monitoring and enforcement. It allows data engineers to define expectations (data quality rules) and automatically track if the ingested data meets these expectations. If data fails the specified rules, DLT can log the errors and either reject or quarantine the data, depending on the configured behavior.
Comment 1262423 by 80370eb
- Upvotes: 2
Selected Answer: D D. Delta Live Tables
Delta Live Tables provides features for automating data quality monitoring and ensuring that the data in the pipeline meets certain quality standards. It allows you to define expectations and monitor data quality as part of the data pipeline.
Comment 1203437 by benni_ale
- Upvotes: 1
Selected Answer: D delta live table
Comment 1113647 by SerGrey
- Upvotes: 1
Selected Answer: D Correct is D
Comment 1064830 by awofalus
- Upvotes: 1
Selected Answer: D Correct: D
Comment 1064829 by awofalus
- Upvotes: 1
Selected Answer: D D is correct
Comment 997973 by vctrhugo
- Upvotes: 3
Selected Answer: D D. Delta Live Tables
Delta Live Tables is a tool provided by Databricks that can help data engineers automate the monitoring of data quality. It is designed for managing data pipelines, monitoring data quality, and automating workflows. With Delta Live Tables, you can set up data quality checks and alerts to detect issues and anomalies in your data as it is ingested and processed in real-time. It provides a way to ensure that the data quality meets your desired standards and can trigger actions or notifications when issues are detected.
While the other tools mentioned may have their own purposes in a data engineering environment, Delta Live Tables is specifically designed for data quality monitoring and automation within the Databricks ecosystem.
Comment 946075 by Atnafu
- Upvotes: 1
D Delta Live Tables.
Delta Live Tables is a tool that can be used to automate the process of monitoring the quality level of data in a data pipeline. Delta Live Tables provides a number of features that can be used to monitor data quality, including:
- Data lineage: Delta Live Tables tracks the lineage of data as it flows through the data pipeline. This allows the data engineer to see where the data came from and how it has been transformed.
- Data quality checks: Delta Live Tables allows the data engineer to define data quality checks that can be run on the data as it is ingested. These checks can be used to identify data that is not meeting the expected quality standards.
- Alerts: Delta Live Tables can be configured to send alerts when data quality checks fail. This allows the data engineer to be notified of potential problems with the data pipeline.
Comment 889123 by Majjjj
- Upvotes: 1
Selected Answer: B The data engineer can use the Data Explorer tool to monitor the quality level of the ingested data. Data Explorer is a feature of Databricks that provides data profiling and data quality metrics to monitor the health of data pipelines.
Comment 861949 by 4be8126
- Upvotes: 1
Selected Answer: B B. Data Explorer can be used to monitor the quality level of data. It provides an interactive interface to analyze the data and define quality rules to identify issues. Data Explorer also offers automated validation rules that can be used to monitor data quality over time.
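The quality checks ("expectations") mentioned in the discussion look roughly like this in Delta Live Tables SQL; the table, constraint, and column names are illustrative.

```sql
-- Sketch (Delta Live Tables SQL): expectations for automated quality
-- monitoring. Table, constraint, and column names are illustrative.
CREATE OR REFRESH STREAMING LIVE TABLE clean_orders (
  -- Failing rows are dropped and counted in the pipeline's quality metrics.
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  -- Without an ON VIOLATION clause, failing rows are kept but logged.
  CONSTRAINT positive_amount EXPECT (amount > 0)
)
AS SELECT * FROM STREAM(LIVE.raw_orders);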
Question 9EBLqDOP5oWGRtAo7Nqr
Question
A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE. The table is configured to run in Production mode using the Continuous Pipeline Mode. Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?
Choices
- A: All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.
- B: All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.
- C: All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.
- D: All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.
- E: All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1109353 by Garyn
- Upvotes: 10
Selected Answer: C C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.
Explanation:
Continuous Pipeline Mode in Production mode implies that the pipeline continuously processes incoming data updates at set intervals, ensuring the datasets are kept up-to-date as new data arrives. Since the pipeline is set to Continuous Pipeline Mode, it will keep running and updating the datasets until it is manually shut down. The compute resources are allocated dynamically to process and update the datasets as needed, and they will be terminated when the pipeline is stopped or shut down. This mode allows for real-time or near-real-time updates to the datasets from the streaming/live tables, ensuring that the data remains current and reflects the changes occurring in the data sources.
Comment 1262425 by 80370eb
- Upvotes: 1
Selected Answer: C In Continuous Pipeline Mode, Delta Live Tables processes data continuously and updates datasets at regular intervals. Compute resources are used to handle these updates and are terminated when the pipeline is stopped.
Comment 1244550 by 3fbc31b
- Upvotes: 1
Selected Answer: C Correct answer is “C”.
Comment 1203439 by benni_ale
- Upvotes: 1
Selected Answer: C Go, Gianluca!
Comment 1113649 by SerGrey
- Upvotes: 2
Selected Answer: C Correct is C
Comment 1064838 by awofalus
- Upvotes: 1
Selected Answer: C Correct : C
Comment 997975 by vctrhugo
- Upvotes: 4
Selected Answer: C C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.
In the scenario described:
The Delta Live Table pipeline is configured in Production mode, which means it will continuously process data using the Continuous Pipeline Mode. There are both STREAMING LIVE TABLE datasets and LIVE TABLE datasets defined. When you click Start to update the pipeline in Continuous Pipeline Mode:
- All datasets, including both STREAMING LIVE TABLE and LIVE TABLE datasets, will be updated at set intervals.
- Compute resources will be deployed for the update, ensuring that the pipeline processes data.
- The compute resources will be terminated when the pipeline is stopped or shut down.
This setup allows for continuous data processing while efficiently managing compute resources, and the pipeline can be stopped when no longer needed.
Comment 987506 by Sandy_17
- Upvotes: 1
Selected Answer: C All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.
Comment 968112 by say88
- Upvotes: 4
No answer is correct. Prod Continuous mode processes data at set intervals until the pipeline is shut down. However, compute must be always-on and will not terminate. https://docs.databricks.com/delta-live-tables/updates.html#continuous-triggered
Comment 946077 by Atnafu
- Upvotes: 2
C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.
In a Delta Live Table pipeline running in Continuous Pipeline Mode, when you click Start to update the pipeline, the following outcome is expected:
All datasets defined using STREAMING LIVE TABLE and LIVE TABLE against Delta Lake table sources will be updated at set intervals. The compute resources will be deployed for the update process and will be active during the execution of the pipeline. The compute resources will be terminated when the pipeline is stopped or shut down. This mode allows for continuous and periodic updates to the datasets as new data arrives or changes in the underlying Delta Lake tables occur. The compute resources are provisioned and utilized during the update intervals to process the data and perform the necessary operations.
Comment 912061 by chays
- Upvotes: 3
Selected Answer: C Answer: C
Comment 888291 by Er5
- Upvotes: 2
Answer: C Pipeline mode - This specifies how the pipeline will be run. Choose the mode based on latency and cost requirements.
- Triggered pipelines run once and then shut down until the next manual or scheduled update.
- Continuous pipelines run continuously, ingesting new data as it arrives.
Comment 873827 by hrabiabw
- Upvotes: 3
Answer: D Official Databricks practice exam with answers - question 36
Comment 858091 by XiltroX
- Upvotes: 3
E is not the right answer. The correct answer is C https://www.databricks.com/product/delta-live-tables
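A pipeline of the shape described in the question might be defined as follows; the dataset names and source paths are illustrative, and note that continuous vs. triggered execution is a pipeline configuration setting, not part of the SQL itself.

```sql
-- Sketch (Delta Live Tables SQL): two streaming datasets plus three
-- live tables, as in the question. Names and paths are illustrative.
CREATE OR REFRESH STREAMING LIVE TABLE events_bronze
AS SELECT * FROM cloud_files('/mnt/raw/events', 'json');

CREATE OR REFRESH STREAMING LIVE TABLE clicks_bronze
AS SELECT * FROM cloud_files('/mnt/raw/clicks', 'json');

-- LIVE TABLE datasets defined against Delta Lake table sources.
CREATE OR REFRESH LIVE TABLE events_silver
AS SELECT * FROM LIVE.events_bronze WHERE event_id IS NOT NULL;

CREATE OR REFRESH LIVE TABLE clicks_silver
AS SELECT * FROM LIVE.clicks_bronze WHERE click_id IS NOT NULL;

CREATE OR REFRESH LIVE TABLE daily_counts
AS SELECT event_type, COUNT(*) AS n
FROM LIVE.events_silver GROUP BY event_type;
```

With the pipeline set to Production and Continuous mode, clicking Start keeps these datasets updating until the pipeline is stopped, at which point the update's compute resources are torn down.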
Question x9SRErNiiTJ0su46Snx4
Question
In order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing, which of the following two approaches is used by Spark to record the offset range of the data being processed in each trigger?
Choices
- A: Checkpointing and Write-ahead Logs
- B: Structured Streaming cannot record the offset range of the data being processed in each trigger.
- C: Replayable Sources and Idempotent Sinks
- D: Write-ahead Logs and Idempotent Sinks
- E: Checkpointing and Idempotent Sinks
answer?
Answer: A Answer_ET: A Community answer A (76%) E (24%) Discussion
Comment 1365133 by shanksund
- Upvotes: 2
Selected Answer: A Idempotent sinks are for ensuring no duplicates; that is not what the question is asking.
Comment 1357354 by avidlearner
- Upvotes: 3
Selected Answer: A The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure.
Comment 1353911 by MrCastro
- Upvotes: 2
Selected Answer: A A.
Really don’t understand people saying E.
Question is:
”… which of the following two approaches is used by Spark **** TO RECORD **** the offset range of the data being processed”
TO RECORD is the key here.
Idempotent sinks don’t record anything. It’s a feature to replay operations.
RECORDS of the operations are created with Checkpoints and Write Ahead Logs.
Comment 1349523 by edaf08e
- Upvotes: 1
Selected Answer: E E. Checkpointing and Idempotent Sinks
Comment 1344879 by SatuPatu
- Upvotes: 1
Selected Answer: E If the failure is handled by restarting and/or reprocessing, choose E; if the worker running the task crashes, choose A.
Comment 1316696 by Poutrata
- Upvotes: 1
Selected Answer: E E is correct
Comment 1315170 by NzmD
- Upvotes: 2
Selected Answer: E Correct answer is E
Comment 1314161 by 806e7d2
- Upvotes: 2
Selected Answer: E In Structured Streaming, Spark uses the following two mechanisms to reliably track the progress of the stream and ensure fault tolerance:
- Checkpointing: Spark maintains metadata about the processing state, including the offset range of the data processed in each trigger. This metadata is stored in a reliable storage system like HDFS, AWS S3, or Azure Data Lake. If a failure occurs, Spark can recover and resume processing from the last recorded state in the checkpoint.
- Idempotent sinks: Idempotent sinks ensure that output operations (e.g., writing data to storage or a database) can be re-executed without causing duplicate data or errors.
By combining idempotent sinks with checkpointing, Spark ensures that reprocessing data due to a failure does not compromise data integrity.
Comment 1290018 by Colje
- Upvotes: 3
Why the correct answer is E (Checkpointing and Idempotent Sinks):
- Checkpointing: Spark Structured Streaming uses checkpointing to track the state of the data being processed. Checkpoints allow the system to restart processing from where it left off in case of failure, ensuring reliability.
- Idempotent sinks: Idempotent sinks ensure that reprocessing the same data multiple times (in case of a failure or restart) doesn't lead to duplicate results. The sink can handle repeated writes of the same data without issues.
Why A (Checkpointing and Write-ahead Logs) is incorrect: Spark Structured Streaming does not use Write-ahead Logs (WAL) for tracking offsets or ensuring fault tolerance. While WALs are used in some systems for durability, Spark Structured Streaming relies on checkpointing and the concept of idempotent operations to ensure consistency and fault tolerance.
Comment 1276080 by CID2024
- Upvotes: 2
The correct answer is: E. Checkpointing and Idempotent Sinks In Structured Streaming, Spark uses checkpointing to reliably track the progress of the streaming data. Checkpointing saves the state of the streaming computation to a reliable storage system. Idempotent sinks ensure that even if data is reprocessed, the results remain consistent and correct, preventing duplicate data from being written.
Comment 1262428 by 80370eb
- Upvotes: 2
Selected Answer: A Checkpointing: Spark saves metadata, including offsets, in a checkpoint directory, allowing it to recover from failures by replaying data starting from the last checkpoint. Write-ahead Logs (WAL): Spark writes information about the data being processed to a log before the data is written to the sink. This ensures that even if a failure occurs, Spark can recover and reprocess the data from the log.
Comment 1244552 by 3fbc31b
- Upvotes: 1
Selected Answer: A A is the correct answer.
Comment 1211510 by squidy24
- Upvotes: 3
Selected Answer: A The answer is A
“Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. … Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs.” - Apache Spark Structured Streaming Programming Guide
Comment 1204515 by bita7
- Upvotes: 1
The answer is Checkpointing and Idempotent Sinks (E). How does Structured Streaming achieve end-to-end fault tolerance?
- First, Structured Streaming uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval.
- Next, the streaming sinks are designed to be idempotent—that is, multiple writes of the same data (as identified by the offset) do not result in duplicates being written to the sink.
Taken together, replayable data sources and idempotent sinks allow Structured Streaming to ensure end-to-end, exactly-once semantics under any failure condition.
Comment 1203440 by benni_ale
- Upvotes: 3
Selected Answer: A 1 checkpointing and write ahead logs to record the offset range of data being processed 2 checkpointing and idempotent sinks achieve end to end fault tolerance
Comment 1113651 by SerGrey
- Upvotes: 1
Selected Answer: A Correct is A
Comment 1001544 by juadaves
- Upvotes: 3
The answer is Checkpointing and idempotent sinks
How does Structured Streaming achieve end-to-end fault tolerance:
First, Structured Streaming uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval.
Next, the streaming sinks are designed to be idempotent—that is, multiple writes of the same data (as identified by the offset) do not result in duplicates being written to the sink.
Taken together, replayable data sources and idempotent sinks allow Structured Streaming to ensure end-to-end, exactly-once semantics under any failure condition.
Comment 997982 by vctrhugo
- Upvotes: 2
Selected Answer: A A. Checkpointing and Write-ahead Logs
To reliably track the exact progress of processing and handle failures in Spark Structured Streaming, Spark uses both checkpointing and write-ahead logs. Checkpointing allows Spark to periodically save the state of the streaming application to a reliable distributed file system, which can be used for recovery in case of failures. Write-ahead logs are used to record the offset range of data being processed, ensuring that the system can recover and reprocess data from the last known offset in the event of a failure.
Comment 957058 by akk_1289
- Upvotes: 4
A: The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. In the link, search for "The engine uses" and you'll find the answer. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#:~:text=The%20engine%20uses%20checkpointing%20and,being%20processed%20in%20each%20trigger.
Comment 946466 by Atnafu
- Upvotes: 3
A. Checkpointing and Write-ahead Logs.
Checkpointing is a process of periodically saving the state of the streaming computation to a durable storage system. This ensures that if the streaming computation fails, it can be restarted from the last checkpoint and resume processing from where it left off. Write-ahead logs are a type of log that records all changes made to a dataset. This allows Structured Streaming to recover from failures by replaying the write-ahead logs from the last checkpoint.
Comment 921268 by mimzzz
- Upvotes: 2
Why I think both A and E are correct: https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-streaming-exactly-once#:~:text=Use%20idempotent%20sinks
Comment 912063 by chays
- Upvotes: 3
Selected Answer: A Answer is A: The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#:~:text=The%20engine%20uses%20checkpointing%20and,being%20processed%20in%20each%20trigger.
Comment 896633 by prasioso
- Upvotes: 3
Selected Answer: A Answer is A. From Spark documentation: Every streaming source is assumed to have offsets to track the read position in the stream. The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger.
Comment 889126 by Majjjj
- Upvotes: 1
Selected Answer: E E. Checkpointing and Idempotent Sinks are the two approaches used by Spark to record the offset range of the data being processed in each trigger, enabling Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Checkpointing periodically checkpoints the state of the streaming query to a fault-tolerant storage system, while idempotent sinks ensure that data can be written multiple times to the sink without affecting the final result.
Comment 861956 by 4be8126
- Upvotes: 1
Selected Answer: E The answer is E. Checkpointing and Idempotent Sinks are used by Spark to record the offset range of the data being processed in each trigger. Checkpointing helps to recover the query from the point of failure and Idempotent Sinks ensure that the output of a streaming query is consistent even in the face of failures and retries.
Comment 858099 by XiltroX
- Upvotes: 4
Selected Answer: A E is a partial answer. The two correct answers are A and E. Structured Streaming is important because it uses these two methods to ensure fault tolerance and an exactly-once guarantee for the data.