Questions and Answers
Question H15MG7w2KSjqnl96FurD
Question
A data engineer has three tables in a Delta Live Tables (DLT) pipeline. They have configured the pipeline to drop invalid records at each table. They notice that some data is being dropped due to quality concerns at some point in the DLT pipeline. They would like to determine at which table in their pipeline the data is being dropped.
Which approach can the data engineer take to identify the table that is dropping the records?
Choices
- A: They can set up separate expectations for each table when developing their DLT pipeline.
- B: They can navigate to the DLT pipeline page, click on the “Error” button, and review the present errors.
- C: They can set up DLT to notify them via email when records are dropped.
- D: They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.
Answer: D Answer_ET: D Discussion
Comment 1273332 by 9d4d68a
- Upvotes: 5
Repeated, Correct
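For context on option A and the per-table statistics in option D: DLT expectations are what feed those data quality statistics. Below is a minimal sketch of per-table expectations using the DLT Python API; the table names, the landing path, and the rules (id IS NOT NULL, amount >= 0) are hypothetical.

```python
# Minimal sketch of per-table expectations in a DLT pipeline (Python API).
# Table names, the landing path, and the expectation rules are hypothetical;
# `spark` is the session provided by the DLT runtime.
import dlt
from pyspark.sql import functions as F

@dlt.table
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")   # dropped rows are counted per table
def bronze_events():
    # Auto Loader ingest from a hypothetical landing path
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/landing/events"))

@dlt.table
@dlt.expect_or_drop("valid_amount", "amount >= 0")
def silver_events():
    return dlt.read_stream("bronze_events").withColumn("ingested_at", F.current_timestamp())
```

With an expectation declared on each table, the pipeline page shows records written and records dropped per table, which is how option D identifies where data is being lost.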
Question dj77z4lwOSYcRjblcFPJ
Question
What is used by Spark to record the offset range of the data being processed in each trigger in order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing?
Choices
- A: Checkpointing and Write-ahead Logs
- B: Replayable Sources and Idempotent Sinks
- C: Write-ahead Logs and Idempotent Sinks
- D: Checkpointing and Idempotent Sinks
Answer: D Answer_ET: A Community answer D (60%) A (40%) Discussion
Comment 1401251 by Lili97
- Upvotes: 2
Selected Answer: D Hello, is it usual to have duplicated questions? What is the point of paying if some questions are repeated?
Comment 1330847 by grygi
- Upvotes: 1
Selected Answer: A A is correct. I had this on the exam and, judging from my results, it seems so: I chose D and did not max this area, even though I was sure of all the other answers.
Comment 1327314 by MultiCloudIronMan
- Upvotes: 1
Selected Answer: D The correct answer is D. Checkpointing and Idempotent Sinks. In Structured Streaming, Spark uses checkpointing to reliably track the progress of the data being processed. Checkpointing saves the state of the streaming query, including the offset ranges of the data processed in each trigger. Idempotent sinks ensure that even if the same data is processed multiple times due to a failure and restart, the results remain consistent and correct.
Comment 1313097 by NzmD
- Upvotes: 1
Selected Answer: A Repeated!
Comment 1273345 by 9d4d68a
- Upvotes: 2
Repeated, Correct
The correct answer is A. Checkpointing and Write-ahead Logs. Checkpointing records the progress of streaming queries, while write-ahead logs (WALs) capture the data before it is processed, allowing Spark to recover and process data reliably in case of failures.
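Whichever option the exam intends, the mechanism is easy to see in code. Below is a minimal sketch of a streaming query whose checkpoint location holds the offset log (a write-ahead log of the offset range for each trigger) and the commit log, which is what lets a restarted query resume exactly where it left off. The paths and the rate test source are hypothetical, and `spark` is assumed to be an active SparkSession (e.g. in a Databricks notebook).

```python
# Minimal sketch of a Structured Streaming query with checkpointing.
# All paths are hypothetical; `spark` is the active SparkSession.
from pyspark.sql import functions as F

events = (spark.readStream
          .format("rate")                  # built-in test source emitting rows per second
          .option("rowsPerSecond", 5)
          .load())

query = (events
         .withColumn("bucket", F.col("value") % 10)
         .writeStream
         .format("delta")
         .outputMode("append")
         # The checkpoint directory stores the offset log (a write-ahead log of the
         # offset range for each trigger) and the commit log, so a restarted query
         # resumes from the last completed trigger instead of reprocessing everything.
         .option("checkpointLocation", "/tmp/checkpoints/rate_demo")
         .start("/tmp/tables/rate_demo"))
```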
Question R75OKQYdzOoHvmDwhR7b
Question
What describes the relationship between Gold tables and Silver tables?
Choices
- A: Gold tables are more likely to contain aggregations than Silver tables.
- B: Gold tables are more likely to contain valuable data than Silver tables.
- C: Gold tables are more likely to contain a less refined view of data than Silver tables.
- D: Gold tables are more likely to contain truthful data than Silver tables.
Answer: A Answer_ET: A Community answer A (100%) Discussion
Comment 1327316 by MultiCloudIronMan
- Upvotes: 1
Selected Answer: A Gold is final stage to feed analytics platforms
Comment 1273343 by 9d4d68a
- Upvotes: 2
Repeated, Correct
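As an illustration of “more likely to contain aggregations”: a Gold table is typically a business-level rollup of a cleaned Silver table. A minimal sketch, assuming hypothetical silver_sales and gold_daily_revenue tables with hypothetical columns:

```python
# Minimal sketch of a Gold table built from a Silver table (all names hypothetical).
from pyspark.sql import functions as F

silver_sales = spark.table("silver_sales")   # cleaned, record-level data

gold_daily_revenue = (silver_sales
                      .groupBy("order_date", "region")
                      .agg(F.sum("amount").alias("total_revenue"),
                           F.countDistinct("order_id").alias("orders")))

# Business-level aggregate, ready for BI/analytics consumption.
gold_daily_revenue.write.mode("overwrite").saveAsTable("gold_daily_revenue")
```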
Question 9Gm25ynrYzW2IyOJxUlu
Question
A data analyst has created a Delta table sales that is used by the entire data analysis team. They want help from the data engineering team to implement a series of tests to ensure the data is clean. However, the data engineering team uses Python for its tests rather than SQL. Which of the following commands could the data engineering team use to access sales in PySpark?
Choices
- A: SELECT * FROM sales
- B: There is no way to share data between PySpark and SQL.
- C: spark.sql(“sales”)
- D: spark.delta.table(“sales”)
- E: spark.table(“sales”)
Answer: E Answer_ET: E Community answer E (95%), other (5%) Discussion
Comment 946766 by Atnafu
- Upvotes: 14
E The spark.table() function in PySpark allows you to access tables registered in the catalog, including Delta tables. By specifying the table name (“sales”), the data engineering team can read the Delta table and perform various operations on it using PySpark.
Option A, SELECT * FROM sales, is a SQL syntax and cannot be directly used in PySpark.
Option B, “There is no way to share data between PySpark and SQL,” is incorrect. PySpark provides the capability to interact with data using both SQL and DataFrame/DataSet APIs.
Option C: spark.sql() is a valid way to execute SQL queries on registered tables in PySpark; however, in this case, the “sales” argument alone is not a valid SQL query.
Option D, spark.delta.table(“sales”), is a specific method provided by Delta Lake to access Delta tables directly. While it can be used to access the “sales” table, it is not the most common approach in PySpark.
Comment 1344724 by dhohigh
- Upvotes: 1
Selected Answer: E This answer is pure Python and is a simple solution to the question.
Comment 1272813 by 9d4d68a
- Upvotes: 1
To access the Delta table sales using PySpark, the data engineering team can use the following command:
E. spark.table(“sales”)
This command allows them to load the table into a PySpark DataFrame, which they can then use for their tests and data processing in Python. Note that the command spark.delta.table(“table name”) does not exist in PySpark. To access a Delta table, you should use:
spark.table(“table name”)
Or, if you need to use Delta-specific functionality, you would typically use Delta’s APIs or spark.read.format(“delta”).table(“table name”) to read the table into a DataFrame.
Comment 1262396 by 80370eb
- Upvotes: 1
Selected Answer: E E. spark.table(“sales”)
This command allows the team to access the table using PySpark, enabling them to implement their tests in Python.
Comment 1252366 by souldiv
- Upvotes: 1
spark.table() . E is the correct one
Comment 1203171 by benni_ale
- Upvotes: 1
Selected Answer: E E is correct
Comment 1189113 by benni_ale
- Upvotes: 2
Selected Answer: E e is correct
Comment 1177189 by Itmma
- Upvotes: 1
Selected Answer: E E is correct
Comment 1113192 by SerGrey
- Upvotes: 1
Selected Answer: E Correct answer is E
Comment 1109089 by Garyn
- Upvotes: 4
Selected Answer: E E. spark.table(“sales”)
The spark.table() function in PySpark allows access to a registered table within the SparkSession. In this case, “sales” is the name of the Delta table created by the data analyst, and the spark.table() function enables access to this table for performing data engineering tests using Python (PySpark).
Comment 1106007 by csd
- Upvotes: 1
C is correct Answer
Comment 1064788 by awofalus
- Upvotes: 1
Selected Answer: E Correct is E
Comment 1017350 by KalavathiP
- Upvotes: 1
Selected Answer: E E is correct
Comment 1016561 by d_b47
- Upvotes: 1
Selected Answer: E delta is default.
Comment 921433 by ThomasReps
- Upvotes: 2
Selected Answer: E It’s E. As stated by others, the default format is Delta.
If you try to run D, you get an error that there is no “delta” attribute on SparkSession: “AttributeError: ‘SparkSession’ object has no attribute ‘delta’”. If you want to state explicitly that it should be Delta, you need an “.option(format=‘delta’)” instead.
Comment 914071 by Dwarakkrishna
- Upvotes: 1
You access data in Delta tables by the table name or the table path, as shown in the following examples: people_df = spark.read.table(table_name)
display(people_df)
Comment 895846 by prasioso
- Upvotes: 1
I believe the answer is E as in databricks the default tables are delta tables hence spark.table should be enough. Have not seen a spark.delta.table function before.
Comment 892904 by Tickxit
- Upvotes: 2
Selected Answer: E E: spark.table or spark.read.table
Comment 889298 by softthinkers
- Upvotes: 1
Correct Answer is D, spark.delta.table(“sales”). The reason: it’s asking for a Delta table, not a normal table; if it were a normal table, then it would be spark.table(“sales”).
Comment 889065 by Majjjj
- Upvotes: 1
The correct answer is D.
The data engineering team can access the Delta table sales in PySpark by using the spark.delta.table command. This command is used to create a DataFrame based on a Delta table. Therefore, the correct command is spark.delta.table(“sales”).
Comment 876207 by Varma_Saraswathula
- Upvotes: 1
Option E - https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.SparkSession.table.html
Comment 875861 by naxacod574
- Upvotes: 1
Option E
Comment 868052 by azurearch
- Upvotes: 2
option E
Comment 867435 by SireeJ
- Upvotes: 1
Option: D
Comment 860627 by sdas1
- Upvotes: 2
Option E
Comment 860422 by knivesz
- Upvotes: 3
Selected Answer: E We create a table: create or replace table delta_su (id INT, nombre STRING). We insert data into the table and then retrieve the stored values with: spark.table(“delta_su”).show()
Comment 860276 by Retko
- Upvotes: 4
E is correct, spark.table(“sales”)
Comment 857990 by XiltroX
- Upvotes: 1
Selected Answer: C Correct answer is C
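Summing up the thread: spark.table() (or spark.read.table()) is the standard way to load a registered Delta table into a DataFrame, and SparkSession has no delta attribute. A minimal sketch, assuming the sales table is registered in the current schema and has a hypothetical order_id column for the example test:

```python
# Minimal sketch: accessing the registered Delta table "sales" from PySpark.
# Assumes an active SparkSession named `spark` (as in a Databricks notebook)
# and a hypothetical order_id column used for the example check.

sales_df = spark.table("sales")                        # equivalent to spark.read.table("sales")

# Explicitly going through the Delta reader also works for registered tables:
sales_df_delta = spark.read.format("delta").table("sales")

# Example "test" written in Python rather than SQL: no null order ids allowed.
assert sales_df.filter(sales_df["order_id"].isNull()).count() == 0
```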
Question JxCrj6gQstqtDXOGPCxg
Question
What describes when to use the CREATE STREAMING LIVE TABLE (formerly CREATE INCREMENTAL LIVE TABLE) syntax over the CREATE LIVE TABLE syntax when creating Delta Live Tables (DLT) tables using SQL?
Choices
- A: CREATE STREAMING LIVE TABLE should be used when the subsequent step in the DLT pipeline is static.
- B: CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally.
- C: CREATE STREAMING LIVE TABLE should be used when data needs to be processed through complicated aggregations.
- D: CREATE STREAMING LIVE TABLE should be used when the previous step in the DLT pipeline is static.
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1327318 by MultiCloudIronMan
- Upvotes: 1
Selected Answer: B Streaming data from source to destination
Comment 1273344 by 9d4d68a
- Upvotes: 2
Repeated, Correct
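For reference, the same distinction exists in the DLT Python API: reading the source incrementally versus recomputing it in full on each update. The sketch below is the Python analogue of the SQL syntax named in the question; the table names are hypothetical.

```python
# Minimal sketch of the streaming vs. non-streaming distinction in the DLT Python API.
# Table names are hypothetical.
import dlt

@dlt.table
def orders_incremental():
    # Analogue of CREATE STREAMING LIVE TABLE: reads the source incrementally,
    # processing only the data that arrived since the last update.
    return dlt.read_stream("orders_bronze")

@dlt.table
def orders_full_refresh():
    # Analogue of CREATE LIVE TABLE: recomputed from the full contents
    # of its source on each pipeline update.
    return dlt.read("orders_bronze")
```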