Questions and Answers
Question rGkP3h3op3fQrXe2VzHm
Question
A Delta table of weather records is partitioned by date and has the below schema:
date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT
To find all the records from within the Arctic Circle, you execute a query with the below filter:
latitude > 66.3
Which statement describes how the Delta engine identifies which files to load?
Choices
- A: All records are cached to an operational database and then the filter is applied
- B: The Parquet file footers are scanned for min and max statistics for the latitude column
- C: The Hive metastore is scanned for min and max statistics for the latitude column
- D: The Delta log is scanned for min and max statistics for the latitude column
answer?
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1323962 by Thameur01
- Upvotes: 2
Selected Answer: D As per the documentation, table statistics (e.g. min, max, count) can be fetched from the Delta log so the underlying data files of a Delta table do not need to be read. This works for numerical types, and timestamps are supposed to be supported as well.
Comment 1322523 by temple1305
- Upvotes: 2
Selected Answer: D The Delta table's log contains statistics for columns.
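The comments above describe Delta's data skipping: each `add` action in the Delta transaction log carries per-file min/max statistics, so the engine can skip any file whose latitude range cannot satisfy the filter. A minimal sketch of that pruning logic, using hypothetical log entries (the `stats`/`minValues`/`maxValues` field names follow the Delta log format, but the file names and values are invented):

```python
import json

# Hypothetical 'add' actions as they would appear in _delta_log/*.json.
# The 'stats' string holds per-file min/max values for each column.
log_entries = [
    {"add": {"path": "part-000.parquet",
             "stats": json.dumps({"minValues": {"latitude": 10.0},
                                  "maxValues": {"latitude": 45.0}})}},
    {"add": {"path": "part-001.parquet",
             "stats": json.dumps({"minValues": {"latitude": 60.0},
                                  "maxValues": {"latitude": 72.5}})}},
]

def files_matching(entries, column, threshold):
    """Keep only files whose max statistic can satisfy `column > threshold`."""
    selected = []
    for entry in entries:
        add = entry["add"]
        stats = json.loads(add["stats"])
        if stats["maxValues"][column] > threshold:
            selected.append(add["path"])
    return selected

print(files_matching(log_entries, "latitude", 66.3))  # ['part-001.parquet']
```

With the filter `latitude > 66.3`, only the second file can contain matching records; the first is skipped without ever reading its Parquet data.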
Question B9NUAzbbo6gLHRxTmJHP
Question
A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device. Streaming DataFrame df has the following schema: “device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT” Code block: //IMG//
Choose the response that correctly fills in the blank within the code block to complete this task.
Choices
- A: to_interval(“event_time”, “5 minutes”).alias(“time”)
- B: window(“event_time”, “5 minutes”).alias(“time”)
- C: “event_time”
- D: window(“event_time”, “10 minutes”).alias(“time”)
- E: lag(“event_time”, “10 minutes”).alias(“time”)
answer?
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1222466 by imatheushenrique
- Upvotes: 4
B: window(“event_time”, “5 minutes”).alias(“time”). In Structured Streaming, expressing such windows on event time is simply a special grouping using the window() function. For example, you can compute counts over 5-minute tumbling (non-overlapping) windows on the eventTime column.
Comment 1121947 by Jay_98_11
- Upvotes: 2
Selected Answer: B correct B
Comment 1118608 by kz_data
- Upvotes: 1
Selected Answer: B B is correct
Comment 1060815 by BIKRAM063
- Upvotes: 2
Selected Answer: B Window of 5 mins
Comment 1040324 by sturcu
- Upvotes: 2
Selected Answer: B B is correct: https://www.databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
Comment 1013343 by Eertyy
- Upvotes: 2
answer is B
Comment 1001477 by thxsgod
- Upvotes: 4
Selected Answer: B Correct, B.
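The window() call the comments describe assigns each event to a non-overlapping five-minute bucket keyed by event time. A plain-Python sketch of that tumbling-window grouping (no Spark here; the sample events are invented, and this only illustrates the bucketing and averaging that the window() grouping performs):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def window_start(ts, minutes=5):
    """Floor a timestamp to the start of its non-overlapping (tumbling) window."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second,
                          microseconds=ts.microsecond)

events = [  # (event_time, temp, humidity) — one record per minute
    (datetime(2024, 1, 1, 12, 0), 20.0, 50.0),
    (datetime(2024, 1, 1, 12, 4), 22.0, 54.0),
    (datetime(2024, 1, 1, 12, 5), 30.0, 60.0),
]

buckets = defaultdict(list)
for ts, temp, humidity in events:
    buckets[window_start(ts)].append((temp, humidity))

averages = {w: (sum(t for t, _ in rows) / len(rows),
                sum(h for _, h in rows) / len(rows))
            for w, rows in buckets.items()}
print(averages)  # 12:00 window -> (21.0, 52.0); 12:05 window -> (30.0, 60.0)
```

In Spark itself, the completed code block would read along the lines of `df.groupBy(window("event_time", "5 minutes").alias("time")).agg(avg("temp"), avg("humidity"))`, matching answer B.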
Question pKSF1v56p4M4zw9t2dDz
Question
A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create.
//IMG//
Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?
Choices
- A: The logic defined in the referenced notebook will be executed three times on the referenced existing all purpose cluster.
- B: The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.
- C: Three new jobs named “Ingest new data” will be defined in the workspace, but no jobs will be executed.
- D: One new job named “Ingest new data” will be defined in the workspace, but it will not be executed.
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1322906 by Ayomidetolu_A
- Upvotes: 1
Selected Answer: C C is the correct answer
Comment 1322659 by divyapsingh
- Upvotes: 2
Selected Answer: C C is the answer: calling the endpoint three times will create three jobs with the same name but different job IDs.
Comment 1322525 by temple1305
- Upvotes: 1
Selected Answer: C C correct, 3 jobs created, no executions
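The key point in the discussion is that 2.0/jobs/create only registers a job definition and returns a fresh job_id on every call; it never triggers a run (that would be 2.0/jobs/run-now). A toy model of those semantics (no real API calls; class and payload contents are invented for illustration):

```python
import itertools

class JobsAPI:
    """Toy model of the Databricks Jobs 2.0 create/run-now semantics."""
    def __init__(self):
        self._ids = itertools.count(100)
        self.jobs = {}   # job_id -> job settings
        self.runs = []   # job_ids that were actually executed

    def create(self, settings):
        # Each call defines a NEW job, even when the name is identical.
        job_id = next(self._ids)
        self.jobs[job_id] = settings
        return {"job_id": job_id}

    def run_now(self, job_id):
        self.runs.append(job_id)

api = JobsAPI()
payload = {"name": "Ingest new data", "notebook_task": {"notebook_path": "/ingest"}}
ids = [api.create(payload)["job_id"] for _ in range(3)]

print(len(api.jobs), len(api.runs))  # 3 jobs defined, 0 runs executed
```

Posting the same payload three times yields three distinct jobs named “Ingest new data” and zero runs, which is exactly answer C.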
Question u2MMMuckvN64axxtNdMh
Question
A view is registered with the following code:
//IMG//
Both users and orders are Delta Lake tables.
Which statement describes the results of querying recent_orders?
Choices
- A: The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.
- B: All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.
- C: All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.
- D: All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.
answer?
Answer: D Answer_ET: D Community answer D (80%) B (20%) Discussion
Comment 1335565 by arekm
- Upvotes: 1
Selected Answer: D D - with the view it is always at query time. Once it is run, the engine reads the data - meaning the data state is from when the query was run.
Comment 1323638 by benni_ale
- Upvotes: 1
Selected Answer: D View does not write to disk
Comment 1322660 by divyapsingh
- Upvotes: 2
Selected Answer: D Here a view is being created, not a table. A view only saves the SELECT query, which reads the valid version of the underlying tables as of the beginning of the SELECT statement run against the view.
Comment 1322528 by temple1305
- Upvotes: 1
Selected Answer: B B is correct; a view is just a query without materialization and returns the result as of the moment of execution.
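The behavior the majority answer relies on — a view stores only its query, which is re-executed against the current state of the source tables every time it is read — can be demonstrated with any SQL engine. A sketch using sqlite3 (table and column names invented; note that Delta's snapshot-as-of-query-start semantics from answer D are specific to Delta Lake and are not modeled here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, user_id INTEGER)")
conn.execute("CREATE VIEW recent_orders AS SELECT * FROM orders")

# Nothing is materialized at definition time: the view starts out empty...
assert conn.execute("SELECT COUNT(*) FROM recent_orders").fetchone()[0] == 0

# ...and later inserts into the source table are visible through it,
# because the view's logic runs at query time.
conn.execute("INSERT INTO orders VALUES (1, 42)")
print(conn.execute("SELECT COUNT(*) FROM recent_orders").fetchone()[0])  # 1
```

This rules out answers A and B (no results are stored when the view is defined); the remaining distinction between C and D is which table versions Delta pins, which per the discussion is the versions valid when the query began.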
Question V0ARoNS3VX2DAeGQdG4t
Question
A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.
Which code block attempts to perform an invalid stream-static join?
Choices
- A: userLookup.join(streamingDF, [“user_id”], how=“right”)
- B: streamingDF.join(userLookup, [“user_id”], how=“inner”)
- C: userLookup.join(streamingDF, [“user_id”], how=“inner”)
- D: userLookup.join(streamingDF, [“user_id”], how=“left”)
answer?
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1322531 by temple1305
- Upvotes: 6
Selected Answer: D D correct https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#support-matrix-for-joins-in-streaming-queries Per the support matrix, a left outer join with the static table on the left and the stream on the right is not supported.