Questions and Answers
Question ZteZImx77lSHJs51Vigy
Question
A data engineer is attempting to drop a Spark SQL table my_table. The data engineer wants to delete all table metadata and data. They run the following command:
DROP TABLE IF EXISTS my_table
While the object no longer appears when they run SHOW TABLES, the data files still exist. Which of the following describes why the data files still exist and the metadata files were deleted?
Choices
- A: The table’s data was larger than 10 GB
- B: The table’s data was smaller than 10 GB
- C: The table was external
- D: The table did not have a location
- E: The table was managed
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1262417 by 80370eb
- Upvotes: 2
Selected Answer: C C. The table was external
When dropping an external table in Spark SQL, only the metadata is removed. The actual data files remain in their original location because they are not managed by Spark but by the external source.
Comment 1113405 by SerGrey
- Upvotes: 1
Selected Answer: C C is correct
Comment 1049077 by hemanthgvsk
- Upvotes: 1
THE QUESTION SHOULD BE “Which of the following describes why the metadata files still exist and the data files were deleted?”
Comment 997957 by vctrhugo
- Upvotes: 2
Selected Answer: C C. The table was external
The reason why the data files still exist while the metadata files were deleted is because the table was external. When a table is external in Spark SQL (or in other database systems), it means that the table metadata (such as schema information and table structure) is managed externally, and Spark SQL assumes that the data is managed and maintained outside of the system. Therefore, when you execute a DROP TABLE statement for an external table, it removes only the table metadata from the catalog, leaving the data files intact.
On the other hand, for managed tables (option E), Spark SQL manages both the metadata and the data files. When you drop a managed table, it deletes both the metadata and the associated data files, resulting in a complete removal of the table.
Comment 860585 by surrabhi_4
- Upvotes: 2
Selected Answer: C Option C
Comment 858081 by XiltroX
- Upvotes: 2
Selected Answer: C C is the correct answer. For external tables, you need to find the table's specific location (e.g., with DESCRIBE TABLE EXTENDED) and delete the files yourself.
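The behavior the discussion describes can be sketched in Spark SQL as follows; the table names and storage path are illustrative, not taken from the question.

```sql
-- Sketch (Spark SQL): managed vs. external table drop behavior.
-- Table names and the storage path are illustrative.

-- Managed table: Spark owns both the metadata and the data files.
CREATE TABLE managed_example (id INT, name STRING);
DROP TABLE IF EXISTS managed_example;   -- removes metadata AND data files

-- External table: the LOCATION clause means the data files are managed
-- outside of Spark.
CREATE TABLE external_example (id INT, name STRING)
LOCATION 's3://example-bucket/path/to/data';
DROP TABLE IF EXISTS external_example;  -- removes metadata only; files remain
```

Before dropping, DESCRIBE TABLE EXTENDED shows a Type of MANAGED or EXTERNAL along with the table's Location, which is where any leftover files would have to be deleted manually.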
Question 5i1wMxJ7eBiNt7BTjvnD
Question
A data engineer wants to create a data entity from a couple of tables. The data entity must be used by other data engineers in other sessions. It also must be saved to a physical location. Which of the following data entities should the data engineer create?
Choices
- A: Database
- B: Function
- C: View
- D: Temporary view
- E: Table
answer?
Answer: E Answer_ET: E Community answer E (80%) C (20%) Discussion
Comment 894193 by Bob123456
- Upvotes: 33
Questions says :
- The data entity must be used by other data engineers in other sessions.
- It also must be saved to a physical location.
A view doesn't store data in a physical location; of the options, only a table stores data in a physical location.
So the answer should be 'E', which is Table.
Comment 1133111 by Nika12
- Upvotes: 9
Selected Answer: E Just got 100% in the exam. Table was a correct answer.
Comment 1262422 by 80370eb
- Upvotes: 2
Selected Answer: E E. Table
Creating a table ensures that the data entity is saved to a physical location and can be used by other data engineers in different sessions. Tables persist data and metadata, making them suitable for long-term storage and sharing across sessions.
Comment 1244549 by 3fbc31b
- Upvotes: 1
Selected Answer: E The correct answer is "E". A view is not saved to a physical location; it only stores the text of a SELECT statement.
Comment 1203436 by benni_ale
- Upvotes: 2
Selected Answer: E physical location means table
Comment 1137327 by agAshish
- Upvotes: 1
E. A view doesn't have any location.
Comment 1113406 by SerGrey
- Upvotes: 1
Selected Answer: E E is correct
Comment 1109256 by Garyn
- Upvotes: 1
Selected Answer: E E. Table
Usage by Other Sessions: Tables in a database are persistent data structures that can be accessed by multiple users and sessions concurrently.
Saved to a Physical Location: Tables store data physically in a structured manner on disk or in a storage system, making them suitable for long-term storage.
Usage by Other Data Engineers: Other data engineers can query, access, and work with the data within the table, making it a feasible choice for shared access among multiple users or sessions.
While other entities like views or temporary views can provide different ways to represent or filter data, a table fits the criteria best when the data engineer requires a persistent physical storage entity accessible by other sessions and users for data manipulation, retrieval, and storage.
Comment 1093253 by RafaelCFC
- Upvotes: 4
Selected Answer: C I think the key to the answer is that it refers to the data entity, not to the data itself, when it mentions "the data entity must be used by other data engineers" and "it must be saved to a physical location". From this point of view, both C and E would be correct; however, creating a new table would mean materializing the relationship between "a couple of tables" into a static state. While this makes sense for many use cases, it would require either a Workflow or DLT to keep it current, which goes beyond the requested scope. C is the best answer for the requested scenario.
Comment 1084859 by Vikram1710
- Upvotes: 2
Key point to remember during answering this question: ” It also must be saved to a physical location”
So answer should be ‘E’ which is Table.
Comment 1084842 by rbeeraka
- Upvotes: 1
C is the right answer. View is a data entity and its definition is physically saved so other users can consume view
Comment 1071639 by Huroye
- Upvotes: 1
The correct answer is E because it has to be physically saved. View is in memory.
Comment 1064826 by awofalus
- Upvotes: 1
Selected Answer: E Correct : E
Comment 997958 by vctrhugo
- Upvotes: 2
Selected Answer: E E. Table
To create a data entity that can be used by other data engineers in other sessions and must be saved to a physical location, you should create a table. Tables in a database are physical storage structures that hold data, and they can be accessed and shared by multiple users and sessions. By creating a table, you provide a permanent and structured storage location for the data entity that can be used across different sessions and by other users as needed.
Options like databases (A) can be used to organize tables, views (C) can provide virtual representations of data, and temporary views (D) are temporary in nature and don’t save data to a physical location. Functions (B) are typically used for processing data or performing calculations, not for storing data.
Comment 993224 by [Removed]
- Upvotes: 1
Selected Answer: E View does not have a physical location so answer has to be E
Comment 972916 by Kartz130789
- Upvotes: 2
Selected Answer: E A view doesn't have a physical location.
Comment 951440 by ehsanmor18
- Upvotes: 2
The answer is E: “Table”
In the context described, creating a “Table” is the most suitable choice. Tables in SQL are data entities that exist independently of any session and are saved in a physical location. They can be accessed and manipulated by other data engineers in different sessions, which aligns with the requirements stated.
A “Database” is a collection of tables, views, and other database objects. A “Function” is a stored procedure that performs an operation. A “View” is a virtual table based on the result-set of an SQL statement, but it is not stored physically. A “Temporary view” is a feature that allows you to store the result of a query as a view that disappears once your session with the database is closed.
Comment 949868 by keksssd
- Upvotes: 1
Selected Answer: E answer e
Comment 946074 by Atnafu
- Upvotes: 1
C A view is a virtual table that is created from a query on one or more tables. Views are stored in the database and can be used by other data engineers in other sessions.
The other options are not correct.
Option A: A database is a collection of tables.
Option B: A function is a named block of code that can be executed.
Option D: A temporary view is a view that is only stored in memory and is not saved to a physical location.
Option E: A table is a physical collection of data.
Comment 945745 by mehroosali
- Upvotes: 1
Selected Answer: E View is wrong. it should be table.
Comment 873777 by hrabiabw
- Upvotes: 2
[C] - A view stores the text for a query typically against one or more data sources or tables in the metastore. You can query views from any part of the Databricks product, assuming you have permission to do so. https://learn.microsoft.com/en-us/azure/databricks/lakehouse/data-objects
Comment 861944 by 4be8126
- Upvotes: 2
The data engineer should create a table.
A table can be used by other data engineers in other sessions, and it can be saved to a physical location. Views and temporary views can also be used by other data engineers, but they do not have a physical location to store data. A function is a piece of code that can be called repeatedly with different arguments, but it does not store data. A database is a logical container that can hold multiple tables and views.
Comment 860586 by surrabhi_4
- Upvotes: 1
Selected Answer: C option C
Comment 858084 by XiltroX
- Upvotes: 1
Selected Answer: C C is the correct answer. Please check link below. https://learn.microsoft.com/en-us/azure/databricks/lakehouse/data-objects
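The distinction debated above can be sketched in Spark SQL; the entity names and the source table `sales` are illustrative.

```sql
-- Sketch (Spark SQL): persistence of each entity type.
-- Names and the source table `sales` are illustrative.

-- Table: the query result is materialized to data files in storage,
-- and the entity is visible to other users and sessions.
CREATE TABLE sales_summary AS
SELECT region, SUM(amount) AS total FROM sales GROUP BY region;

-- View: only the query text is saved to the metastore; no data files
-- are written. Visible across sessions, but not "physical".
CREATE VIEW sales_summary_v AS
SELECT region, SUM(amount) AS total FROM sales GROUP BY region;

-- Temporary view: scoped to the current session; other sessions
-- cannot see it, and it disappears when the session ends.
CREATE TEMP VIEW sales_summary_tmp AS
SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
```

Only the table satisfies both requirements in the question: visibility from other sessions and data saved to a physical location.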
Question p9b7g6fqIxbPOw3ZGgwh
Question
A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the quality level. Which of the following tools can the data engineer use to solve this problem?
Choices
- A: Unity Catalog
- B: Data Explorer
- C: Delta Lake
- D: Delta Live Tables
- E: Auto Loader
answer?
Answer: D Answer_ET: D Community answer D (95%) 5% Discussion
Comment 858087 by XiltroX
- Upvotes: 17
Selected Answer: D The answer is incorrect. The correct answer is Delta Live Tables, i.e. (D). https://docs.databricks.com/delta-live-tables/expectations.html
Comment 999903 by DQCR
- Upvotes: 8
Selected Answer: D Delta Live Tables is a declarative framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations to perform on your data and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling.
Quality is explicitly mentioned in the definition.
Comment 1314149 by 806e7d2
- Upvotes: 4
Selected Answer: D Delta Live Tables (DLT) is designed for building and managing data pipelines with built-in support for data quality monitoring and enforcement. It allows data engineers to define expectations (data quality rules) and automatically track if the ingested data meets these expectations. If data fails the specified rules, DLT can log the errors and either reject or quarantine the data, depending on the configured behavior.
Comment 1262423 by 80370eb
- Upvotes: 2
Selected Answer: D D. Delta Live Tables
Delta Live Tables provides features for automating data quality monitoring and ensuring that the data in the pipeline meets certain quality standards. It allows you to define expectations and monitor data quality as part of the data pipeline.
Comment 1203437 by benni_ale
- Upvotes: 1
Selected Answer: D delta live table
Comment 1113647 by SerGrey
- Upvotes: 1
Selected Answer: D Correct is D
Comment 1064830 by awofalus
- Upvotes: 1
Selected Answer: D Correct: D
Comment 1064829 by awofalus
- Upvotes: 1
Selected Answer: D D is correct
Comment 997973 by vctrhugo
- Upvotes: 3
Selected Answer: D D. Delta Live Tables
Delta Live Tables is a tool provided by Databricks that can help data engineers automate the monitoring of data quality. It is designed for managing data pipelines, monitoring data quality, and automating workflows. With Delta Live Tables, you can set up data quality checks and alerts to detect issues and anomalies in your data as it is ingested and processed in real-time. It provides a way to ensure that the data quality meets your desired standards and can trigger actions or notifications when issues are detected.
While the other tools mentioned may have their own purposes in a data engineering environment, Delta Live Tables is specifically designed for data quality monitoring and automation within the Databricks ecosystem.
Comment 946075 by Atnafu
- Upvotes: 1
D Delta Live Tables.
Delta Live Tables is a tool that can be used to automate the process of monitoring the quality level of data in a data pipeline. Delta Live Tables provides a number of features that can be used to monitor data quality, including:
- Data lineage: Delta Live Tables tracks the lineage of data as it flows through the data pipeline. This allows the data engineer to see where the data came from and how it has been transformed.
- Data quality checks: Delta Live Tables allows the data engineer to define data quality checks that can be run on the data as it is ingested. These checks can be used to identify data that is not meeting the expected quality standards.
- Alerts: Delta Live Tables can be configured to send alerts when data quality checks fail. This allows the data engineer to be notified of potential problems with the data pipeline.
Comment 889123 by Majjjj
- Upvotes: 1
Selected Answer: B The data engineer can use the Data Explorer tool to monitor the quality level of the ingested data. Data Explorer is a feature of Databricks that provides data profiling and data quality metrics to monitor the health of data pipelines.
Comment 861949 by 4be8126
- Upvotes: 1
Selected Answer: B B. Data Explorer can be used to monitor the quality level of data. It provides an interactive interface to analyze the data and define quality rules to identify issues. Data Explorer also offers automated validation rules that can be used to monitor data quality over time.
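The quality checks ("expectations") mentioned in the discussion look roughly like this in Delta Live Tables SQL; the table, constraint, and column names are illustrative.

```sql
-- Sketch (Delta Live Tables SQL): expectations for automated quality
-- monitoring. Table, constraint, and column names are illustrative.
CREATE OR REFRESH STREAMING LIVE TABLE clean_orders (
  -- Failing rows are dropped and counted in the pipeline's quality metrics.
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  -- Without an ON VIOLATION clause, failing rows are kept but logged.
  CONSTRAINT positive_amount EXPECT (amount > 0)
)
AS SELECT * FROM STREAM(LIVE.raw_orders);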
Question 9EBLqDOP5oWGRtAo7Nqr
Question
A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE. The table is configured to run in Production mode using the Continuous Pipeline Mode. Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?
Choices
- A: All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.
- B: All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.
- C: All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.
- D: All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.
- E: All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1109353 by Garyn
- Upvotes: 10
Selected Answer: C C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.
Explanation:
Continuous Pipeline Mode in Production mode implies that the pipeline continuously processes incoming data updates at set intervals, ensuring the datasets are kept up-to-date as new data arrives. Since the pipeline is set to Continuous Pipeline Mode, it will keep running and updating the datasets until it is manually shut down. The compute resources are allocated dynamically to process and update the datasets as needed, and they will be terminated when the pipeline is stopped or shut down. This mode allows for real-time or near-real-time updates to the datasets from the streaming/live tables, ensuring that the data remains current and reflects the changes occurring in the data sources.
Comment 1262425 by 80370eb
- Upvotes: 1
Selected Answer: C In Continuous Pipeline Mode, Delta Live Tables processes data continuously and updates datasets at regular intervals. Compute resources are used to handle these updates and are terminated when the pipeline is stopped.
Comment 1244550 by 3fbc31b
- Upvotes: 1
Selected Answer: C Correct answer is “C”.
Comment 1203439 by benni_ale
- Upvotes: 1
Selected Answer: C Go, Gianluca!
Comment 1113649 by SerGrey
- Upvotes: 2
Selected Answer: C Correct is C
Comment 1064838 by awofalus
- Upvotes: 1
Selected Answer: C Correct : C
Comment 997975 by vctrhugo
- Upvotes: 4
Selected Answer: C C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.
In the scenario described:
The Delta Live Table pipeline is configured in Production mode, which means it will continuously process data using the Continuous Pipeline Mode. There are both STREAMING LIVE TABLE datasets and LIVE TABLE datasets defined. When you click Start to update the pipeline in Continuous Pipeline Mode:
- All datasets, including both STREAMING LIVE TABLE and LIVE TABLE datasets, will be updated at set intervals.
- Compute resources will be deployed for the update, ensuring that the pipeline processes data.
- The compute resources will be terminated when the pipeline is stopped or shut down.
This setup allows for continuous data processing while efficiently managing compute resources, and the pipeline can be stopped when no longer needed.
Comment 987506 by Sandy_17
- Upvotes: 1
Selected Answer: C All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.
Comment 968112 by say88
- Upvotes: 4
No answer is correct. Prod Continuous mode processes data at set intervals until the pipeline is shut down. However, compute must be always-on and will not terminate. https://docs.databricks.com/delta-live-tables/updates.html#continuous-triggered
Comment 946077 by Atnafu
- Upvotes: 2
C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.
In a Delta Live Table pipeline running in Continuous Pipeline Mode, when you click Start to update the pipeline, the following outcome is expected:
All datasets defined using STREAMING LIVE TABLE and LIVE TABLE against Delta Lake table sources will be updated at set intervals. The compute resources will be deployed for the update process and will be active during the execution of the pipeline. The compute resources will be terminated when the pipeline is stopped or shut down. This mode allows for continuous and periodic updates to the datasets as new data arrives or changes in the underlying Delta Lake tables occur. The compute resources are provisioned and utilized during the update intervals to process the data and perform the necessary operations.
Comment 912061 by chays
- Upvotes: 3
Selected Answer: C Answer: C
Comment 888291 by Er5
- Upvotes: 2
Answer: C Pipeline mode - This specifies how the pipeline will be run. Choose the mode based on latency and cost requirements.
- Triggered pipelines run once and then shut down until the next manual or scheduled update.
- Continuous pipelines run continuously, ingesting new data as it arrives.
Comment 873827 by hrabiabw
- Upvotes: 3
Answer: D Official Databricks practice exam with answers - question 36
Comment 858091 by XiltroX
- Upvotes: 3
E is not the right answer. The correct answer is C https://www.databricks.com/product/delta-live-tables
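A pipeline of the shape described in the question might be defined as follows; the dataset names and source paths are illustrative, and note that continuous vs. triggered execution is a pipeline configuration setting, not part of the SQL itself.

```sql
-- Sketch (Delta Live Tables SQL): two streaming datasets plus three
-- live tables, as in the question. Names and paths are illustrative.
CREATE OR REFRESH STREAMING LIVE TABLE events_bronze
AS SELECT * FROM cloud_files('/mnt/raw/events', 'json');

CREATE OR REFRESH STREAMING LIVE TABLE clicks_bronze
AS SELECT * FROM cloud_files('/mnt/raw/clicks', 'json');

-- LIVE TABLE datasets defined against Delta Lake table sources.
CREATE OR REFRESH LIVE TABLE events_silver
AS SELECT * FROM LIVE.events_bronze WHERE event_id IS NOT NULL;

CREATE OR REFRESH LIVE TABLE clicks_silver
AS SELECT * FROM LIVE.clicks_bronze WHERE click_id IS NOT NULL;

CREATE OR REFRESH LIVE TABLE daily_counts
AS SELECT event_type, COUNT(*) AS n
FROM LIVE.events_silver GROUP BY event_type;
```

With the pipeline set to Production and Continuous mode, clicking Start keeps these datasets updating until the pipeline is stopped, at which point the update's compute resources are torn down.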
Question x9SRErNiiTJ0su46Snx4
Question
In order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing, which of the following two approaches is used by Spark to record the offset range of the data being processed in each trigger?
Choices
- A: Checkpointing and Write-ahead Logs
- B: Structured Streaming cannot record the offset range of the data being processed in each trigger.
- C: Replayable Sources and Idempotent Sinks
- D: Write-ahead Logs and Idempotent Sinks
- E: Checkpointing and Idempotent Sinks
answer?
Answer: A Answer_ET: A Community answer A (76%) E (24%) Discussion
Comment 1365133 by shanksund
- Upvotes: 2
Selected Answer: A Idempotent sinks are for ensuring no duplicates; that is not what the question is asking.
Comment 1357354 by avidlearner
- Upvotes: 3
Selected Answer: A The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure.
Comment 1353911 by MrCastro
- Upvotes: 2
Selected Answer: A A.
Really don’t understand people saying E.
Question is:
”… which of the following two approaches is used by Spark **** TO RECORD **** the offset range of the data being processed”
TO RECORD is the key here.
Idempotent sinks don’t record anything. It’s a feature to replay operations.
RECORDS of the operations are created with Checkpoints and Write Ahead Logs.
Comment 1349523 by edaf08e
- Upvotes: 1
Selected Answer: E E. Checkpointing and Idempotent Sinks
Comment 1344879 by SatuPatu
- Upvotes: 1
Selected Answer: E If the failure is handled by restarting and/or reprocessing, choose E; if the worker running the task crashes, choose A.
Comment 1316696 by Poutrata
- Upvotes: 1
Selected Answer: E E is correct
Comment 1315170 by NzmD
- Upvotes: 2
Selected Answer: E Correct answer is E
Comment 1314161 by 806e7d2
- Upvotes: 2
Selected Answer: E In Structured Streaming, Spark uses the following two mechanisms to reliably track the progress of the stream and ensure fault tolerance:
- Checkpointing: Spark maintains metadata about the processing state, including the offset range of the data processed in each trigger. This metadata is stored in a reliable storage system like HDFS, AWS S3, or Azure Data Lake. If a failure occurs, Spark can recover and resume processing from the last recorded state in the checkpoint.
- Idempotent sinks: Idempotent sinks ensure that output operations (e.g., writing data to storage or a database) can be re-executed without causing duplicate data or errors.
By combining idempotent sinks with checkpointing, Spark ensures that reprocessing data due to a failure does not compromise data integrity.
Comment 1290018 by Colje
- Upvotes: 3
Why the correct answer is E (Checkpointing and Idempotent Sinks):
- Checkpointing: Spark Structured Streaming uses checkpointing to track the state of the data being processed. Checkpoints allow the system to restart processing from where it left off in case of failure, ensuring reliability.
- Idempotent sinks: Idempotent sinks ensure that reprocessing the same data multiple times (in case of a failure or restart) doesn't lead to duplicate results. The sink can handle repeated writes of the same data without issues.
Why A (Checkpointing and Write-ahead Logs) is incorrect: Spark Structured Streaming does not use Write-ahead Logs (WAL) for tracking offsets or ensuring fault tolerance. While WALs are used in some systems for durability, Spark Structured Streaming relies on checkpointing and the concept of idempotent operations to ensure consistency and fault tolerance.
Comment 1276080 by CID2024
- Upvotes: 2
The correct answer is: E. Checkpointing and Idempotent Sinks In Structured Streaming, Spark uses checkpointing to reliably track the progress of the streaming data. Checkpointing saves the state of the streaming computation to a reliable storage system. Idempotent sinks ensure that even if data is reprocessed, the results remain consistent and correct, preventing duplicate data from being written.
Comment 1262428 by 80370eb
- Upvotes: 2
Selected Answer: A Checkpointing: Spark saves metadata, including offsets, in a checkpoint directory, allowing it to recover from failures by replaying data starting from the last checkpoint. Write-ahead Logs (WAL): Spark writes information about the data being processed to a log before the data is written to the sink. This ensures that even if a failure occurs, Spark can recover and reprocess the data from the log.
Comment 1244552 by 3fbc31b
- Upvotes: 1
Selected Answer: A A is the correct answer.
Comment 1211510 by squidy24
- Upvotes: 3
Selected Answer: A The answer is A
“Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. … Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs.” - Apache Spark Structured Streaming Programming Guide
Comment 1204515 by bita7
- Upvotes: 1
The answer is Checkpointing and Idempotent Sinks (E). How does Structured Streaming achieve end-to-end fault tolerance?
- First, Structured Streaming uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval.
- Next, the streaming sinks are designed to be idempotent—that is, multiple writes of the same data (as identified by the offset) do not result in duplicates being written to the sink.
Taken together, replayable data sources and idempotent sinks allow Structured Streaming to ensure end-to-end, exactly-once semantics under any failure condition.
Comment 1203440 by benni_ale
- Upvotes: 3
Selected Answer: A 1 checkpointing and write ahead logs to record the offset range of data being processed 2 checkpointing and idempotent sinks achieve end to end fault tolerance
Comment 1113651 by SerGrey
- Upvotes: 1
Selected Answer: A Correct is A
Comment 1001544 by juadaves
- Upvotes: 3
The answer is Checkpointing and idempotent sinks
How does Structured Streaming achieve end-to-end fault tolerance:
First, Structured Streaming uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval.
Next, the streaming sinks are designed to be idempotent—that is, multiple writes of the same data (as identified by the offset) do not result in duplicates being written to the sink.
Taken together, replayable data sources and idempotent sinks allow Structured Streaming to ensure end-to-end, exactly-once semantics under any failure condition.
Comment 997982 by vctrhugo
- Upvotes: 2
Selected Answer: A A. Checkpointing and Write-ahead Logs
To reliably track the exact progress of processing and handle failures in Spark Structured Streaming, Spark uses both checkpointing and write-ahead logs. Checkpointing allows Spark to periodically save the state of the streaming application to a reliable distributed file system, which can be used for recovery in case of failures. Write-ahead logs are used to record the offset range of data being processed, ensuring that the system can recover and reprocess data from the last known offset in the event of a failure.
Comment 957058 by akk_1289
- Upvotes: 4
A: The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. In the link, search for "The engine uses" and you'll find the answer. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#:~:text=The%20engine%20uses%20checkpointing%20and,being%20processed%20in%20each%20trigger.
Comment 946466 by Atnafu
- Upvotes: 3
A. Checkpointing and Write-ahead Logs.
Checkpointing is a process of periodically saving the state of the streaming computation to a durable storage system. This ensures that if the streaming computation fails, it can be restarted from the last checkpoint and resume processing from where it left off. Write-ahead logs are a type of log that records all changes made to a dataset. This allows Structured Streaming to recover from failures by replaying the write-ahead logs from the last checkpoint.
Comment 921268 by mimzzz
- Upvotes: 2
Why I think both A and E are correct: https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-streaming-exactly-once#:~:text=Use%20idempotent%20sinks
Comment 912063 by chays
- Upvotes: 3
Selected Answer: A Answer is A: The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#:~:text=The%20engine%20uses%20checkpointing%20and,being%20processed%20in%20each%20trigger.
Comment 896633 by prasioso
- Upvotes: 3
Selected Answer: A Answer is A. From Spark documentation: Every streaming source is assumed to have offsets to track the read position in the stream. The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger.
Comment 889126 by Majjjj
- Upvotes: 1
Selected Answer: E E. Checkpointing and Idempotent Sinks are the two approaches used by Spark to record the offset range of the data being processed in each trigger, enabling Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Checkpointing periodically checkpoints the state of the streaming query to a fault-tolerant storage system, while idempotent sinks ensure that data can be written multiple times to the sink without affecting the final result.
Comment 861956 by 4be8126
- Upvotes: 1
Selected Answer: E The answer is E. Checkpointing and Idempotent Sinks are used by Spark to record the offset range of the data being processed in each trigger. Checkpointing helps to recover the query from the point of failure and Idempotent Sinks ensure that the output of a streaming query is consistent even in the face of failures and retries.
Comment 858099 by XiltroX
- Upvotes: 4
Selected Answer: A E is a partial answer. The two correct answers are A and E. Structured Streaming is important because it uses these two methods to ensure fault tolerance and an exactly-once guarantee for the data.