Questions and Answers
Question rGkP3h3op3fQrXe2VzHm
Question
A Delta table of weather records is partitioned by date and has the below schema:
date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT
To find all the records from within the Arctic Circle, you execute a query with the below filter:
latitude > 66.3
Which statement describes how the Delta engine identifies which files to load?
Choices
- A: All records are cached to an operational database and then the filter is applied
- B: The Parquet file footers are scanned for min and max statistics for the latitude column
- C: The Hive metastore is scanned for min and max statistics for the latitude column
- D: The Delta log is scanned for min and max statistics for the latitude column
answer?
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1323962 by Thameur01
- Upvotes: 2
Selected Answer: D As per the documentation, table statistics (e.g. min, max, count) can be fetched from the Delta log so the underlying data files of a Delta table do not need to be read. This works for numerical types, and timestamps are supposed to be supported as well.
Comment 1322523 by temple1305
- Upvotes: 2
Selected Answer: D The Delta table's log contains statistics for columns.
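The comments above describe Delta's data skipping: each `add` action in the Delta transaction log carries per-file min/max statistics, so the engine can skip any file whose latitude range cannot satisfy the filter. A minimal sketch of that pruning logic, using hypothetical log entries (the `stats`/`minValues`/`maxValues` field names follow the Delta log format, but the file names and values are invented):

```python
import json

# Hypothetical 'add' actions as they would appear in _delta_log/*.json.
# The 'stats' string holds per-file min/max values for each column.
log_entries = [
    {"add": {"path": "part-000.parquet",
             "stats": json.dumps({"minValues": {"latitude": 10.0},
                                  "maxValues": {"latitude": 45.0}})}},
    {"add": {"path": "part-001.parquet",
             "stats": json.dumps({"minValues": {"latitude": 60.0},
                                  "maxValues": {"latitude": 72.5}})}},
]

def files_matching(entries, column, threshold):
    """Keep only files whose max statistic can satisfy `column > threshold`."""
    selected = []
    for entry in entries:
        add = entry["add"]
        stats = json.loads(add["stats"])
        if stats["maxValues"][column] > threshold:
            selected.append(add["path"])
    return selected

print(files_matching(log_entries, "latitude", 66.3))  # ['part-001.parquet']
```

With the filter `latitude > 66.3`, only the second file can contain matching records; the first is skipped without ever reading its Parquet data.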
Question B9NUAzbbo6gLHRxTmJHP
Question
A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device. Streaming DataFrame df has the following schema: “device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT” Code block: //IMG//
Choose the response that correctly fills in the blank within the code block to complete this task.
Choices
- A: to_interval(“event_time”, “5 minutes”).alias(“time”)
- B: window(“event_time”, “5 minutes”).alias(“time”)
- C: “event_time”
- D: window(“event_time”, “10 minutes”).alias(“time”)
- E: lag(“event_time”, “10 minutes”).alias(“time”)
answer?
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1222466 by imatheushenrique
- Upvotes: 4
B: window(“event_time”, “5 minutes”).alias(“time”). In Structured Streaming, expressing such windows on event time is simply a special grouping using the window() function. For example, you can compute counts over 5-minute tumbling (non-overlapping) windows on the eventTime column.
Comment 1121947 by Jay_98_11
- Upvotes: 2
Selected Answer: B correct B
Comment 1118608 by kz_data
- Upvotes: 1
Selected Answer: B B is correct
Comment 1060815 by BIKRAM063
- Upvotes: 2
Selected Answer: B Window of 5 mins
Comment 1040324 by sturcu
- Upvotes: 2
Selected Answer: B B is correct: https://www.databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
Comment 1013343 by Eertyy
- Upvotes: 2
answer is B
Comment 1001477 by thxsgod
- Upvotes: 4
Selected Answer: B Correct, B.
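The window() call the comments describe assigns each event to a non-overlapping five-minute bucket keyed by event time. A plain-Python sketch of that tumbling-window grouping (no Spark here; the sample events are invented, and this only illustrates the bucketing and averaging that the window() grouping performs):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def window_start(ts, minutes=5):
    """Floor a timestamp to the start of its non-overlapping (tumbling) window."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second,
                          microseconds=ts.microsecond)

events = [  # (event_time, temp, humidity) — one record per minute
    (datetime(2024, 1, 1, 12, 0), 20.0, 50.0),
    (datetime(2024, 1, 1, 12, 4), 22.0, 54.0),
    (datetime(2024, 1, 1, 12, 5), 30.0, 60.0),
]

buckets = defaultdict(list)
for ts, temp, humidity in events:
    buckets[window_start(ts)].append((temp, humidity))

averages = {w: (sum(t for t, _ in rows) / len(rows),
                sum(h for _, h in rows) / len(rows))
            for w, rows in buckets.items()}
print(averages)  # 12:00 window -> (21.0, 52.0); 12:05 window -> (30.0, 60.0)
```

In Spark itself, the completed code block would read along the lines of `df.groupBy(window("event_time", "5 minutes").alias("time")).agg(avg("temp"), avg("humidity"))`, matching answer B.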
Question pKSF1v56p4M4zw9t2dDz
Question
A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create.
//IMG//
Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?
Choices
- A: The logic defined in the referenced notebook will be executed three times on the referenced existing all purpose cluster.
- B: The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.
- C: Three new jobs named “Ingest new data” will be defined in the workspace, but no jobs will be executed.
- D: One new job named “Ingest new data” will be defined in the workspace, but it will not be executed.
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1322906 by Ayomidetolu_A
- Upvotes: 1
Selected Answer: C C is the correct answer
Comment 1322659 by divyapsingh
- Upvotes: 2
Selected Answer: C C is the answer: calling the endpoint three times will create three jobs with the same name but different job IDs.
Comment 1322525 by temple1305
- Upvotes: 1
Selected Answer: C C correct, 3 jobs created, no executions
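The key point in the discussion is that 2.0/jobs/create only registers a job definition and returns a fresh job_id on every call; it never triggers a run (that would be 2.0/jobs/run-now). A toy model of those semantics (no real API calls; class and payload contents are invented for illustration):

```python
import itertools

class JobsAPI:
    """Toy model of the Databricks Jobs 2.0 create/run-now semantics."""
    def __init__(self):
        self._ids = itertools.count(100)
        self.jobs = {}   # job_id -> job settings
        self.runs = []   # job_ids that were actually executed

    def create(self, settings):
        # Each call defines a NEW job, even when the name is identical.
        job_id = next(self._ids)
        self.jobs[job_id] = settings
        return {"job_id": job_id}

    def run_now(self, job_id):
        self.runs.append(job_id)

api = JobsAPI()
payload = {"name": "Ingest new data", "notebook_task": {"notebook_path": "/ingest"}}
ids = [api.create(payload)["job_id"] for _ in range(3)]

print(len(api.jobs), len(api.runs))  # 3 jobs defined, 0 runs executed
```

Posting the same payload three times yields three distinct jobs named “Ingest new data” and zero runs, which is exactly answer C.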
Question u2MMMuckvN64axxtNdMh
Question
A view is registered with the following code:
//IMG//
Both users and orders are Delta Lake tables.
Which statement describes the results of querying recent_orders?
Choices
- A: The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.
- B: All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.
- C: All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.
- D: All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.
answer?
Answer: D Answer_ET: D Community answer D (80%) B (20%) Discussion
Comment 1335565 by arekm
- Upvotes: 1
Selected Answer: D D - with the view it is always at query time. Once it is run, the engine reads the data - meaning the data state is from when the query was run.
Comment 1323638 by benni_ale
- Upvotes: 1
Selected Answer: D View does not write to disk
Comment 1322660 by divyapsingh
- Upvotes: 2
Selected Answer: D Here a view is being created, not a table. A view only saves the SELECT query, which reads the valid version of the underlying tables as of the beginning of the SELECT statement run against the view.
Comment 1322528 by temple1305
- Upvotes: 1
Selected Answer: B B is correct; a view is just a query without materialization and returns the result as of the moment of execution.
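The behavior the majority answer relies on — a view stores only its query, which is re-executed against the current state of the source tables every time it is read — can be demonstrated with any SQL engine. A sketch using sqlite3 (table and column names invented; note that Delta's snapshot-as-of-query-start semantics from answer D are specific to Delta Lake and are not modeled here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, user_id INTEGER)")
conn.execute("CREATE VIEW recent_orders AS SELECT * FROM orders")

# Nothing is materialized at definition time: the view starts out empty...
assert conn.execute("SELECT COUNT(*) FROM recent_orders").fetchone()[0] == 0

# ...and later inserts into the source table are visible through it,
# because the view's logic runs at query time.
conn.execute("INSERT INTO orders VALUES (1, 42)")
print(conn.execute("SELECT COUNT(*) FROM recent_orders").fetchone()[0])  # 1
```

This rules out answers A and B (no results are stored when the view is defined); the remaining distinction between C and D is which table versions Delta pins, which per the discussion is the versions valid when the query began.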
Question V0ARoNS3VX2DAeGQdG4t
Question
A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.
Which code block attempts to perform an invalid stream-static join?
Choices
- A: userLookup.join(streamingDF, [“user_id”], how=“right”)
- B: streamingDF.join(userLookup, [“user_id”], how=“inner”)
- C: userLookup.join(streamingDF, [“user_id”], how=“inner”)
- D: userLookup.join(streamingDF, [“user_id”], how=“left”)
answer?
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1322531 by temple1305
- Upvotes: 6
Selected Answer: D D correct https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#support-matrix-for-joins-in-streaming-queries Per the support matrix, a left outer join with the static table on the left and the stream on the right is not supported.