Questions and Answers
Question YCCp8V78FGdJhhI4wNc2
Question
An hourly batch job is configured to ingest data files from a cloud object storage container, where each batch represents all records produced by the source system in a given hour. The batch job that processes these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema: user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT. New records are all ingested into a table named account_history, which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id. Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?
Choices
- A: Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger once job to batch update newly detected files into the account_current table.
- B: Overwrite the account_current table with each batch using the results of a query against the account_history table grouping by user_id and filtering for the max value of last_updated.
- C: Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id; write a merge statement to update or insert the most recent value for each user_id.
- D: Use Delta Lake version history to get the difference between the latest version of account_history and one version prior, then write these records to account_current.
- E: Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.
answer?
Answer: C Answer_ET: C Community answer C (67%) B (24%) 6% Discussion
Comment 1117279 by RafaelCFC
- Upvotes: 10
Selected Answer: C My reasoning is thus: The application is based on batch processes, so A is wrong. Overwriting the table would destroy the Type 1 SCD behavior, so B is wrong. Comparing versions of account_history would not be efficient, as the whole data would be scanned, so D is wrong. ‘username’ is not a key column, so we have no guarantee that it’s unique, thus de-duplicating by it can yield wrongly grouped sets of rows, so E is not a safe bet, with the information we know. C is the best option.
Comment 1174493 by terrku
- Upvotes: 8
Selected Answer: B Type 1 table means the behavior is overwriting.
Comment 1361626 by Tedet
- Upvotes: 1
Selected Answer: C This option proposes filtering the records in the account_history table for the most recent records based on the last_updated field, which is exactly what you want to do to get the most recent value. The use of a MERGE statement ensures that only the most recent records are inserted or updated in the account_current table. This method avoids full overwrites of the account_current table and only updates records that have actually changed, which is efficient for large datasets. Conclusion: This is the most efficient approach because it ensures only the most recent data is merged, and it avoids unnecessary full table rebuilds.
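The dedup-then-merge logic described above can be illustrated without Spark. A minimal sketch, using plain Python dicts in place of Delta tables (table contents and field values are made up for illustration):

```python
# Sketch of option C's Type 1 upsert: keep the newest record per user_id
# from the hourly batch, then MERGE (upsert) into account_current.
# Plain dicts stand in for Delta tables; data is illustrative.

def latest_per_user(batch):
    """Deduplicate a batch, keeping the max last_updated per user_id."""
    latest = {}
    for rec in batch:
        uid = rec["user_id"]
        if uid not in latest or rec["last_updated"] > latest[uid]["last_updated"]:
            latest[uid] = rec
    return latest

def merge_into_current(account_current, batch):
    """MERGE semantics: update existing user_ids, insert new ones."""
    for uid, rec in latest_per_user(batch).items():
        account_current[uid] = rec  # upsert: no history retained (Type 1)
    return account_current

current = {1: {"user_id": 1, "username": "a", "last_updated": 100}}
hourly_batch = [
    {"user_id": 1, "username": "a2", "last_updated": 200},
    {"user_id": 1, "username": "a3", "last_updated": 300},
    {"user_id": 2, "username": "b", "last_updated": 250},
]
merge_into_current(current, hourly_batch)
print(current[1]["username"], len(current))  # prints: a3 2
```

Only the tens of thousands of batch records are touched per run, not the millions of existing accounts, which is why the merge approach scales where a full overwrite does not.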
Comment 1301966 by Ananth4Sap
- Upvotes: 3
C is correct. B is wrong: filtering for the max value of last_updated and then overwriting means we would miss some valid records. Two valid approaches: 1. filter for the max value of last_updated, then merge (option C); 2. use a window function on last_updated, filter, then overwrite (not among the options).
Comment 1299094 by benni_ale
- Upvotes: 3
Selected Answer: C A: No. A batch job is required, so Auto Loader and Structured Streaming are unnecessarily complex solutions. B: No. A full overwrite of the table is not efficient. C: Yes. It filters and merges on the key, using as little data as reasonably possible in the merge statement. D: No. A version-diff operation is very inefficient for this purpose. E: No. username is not a key.
Comment 1261176 by Dhusanth
- Upvotes: 1
Selected Answer: C C is correct
Comment 1259331 by faraaz132
- Upvotes: 1
C is correct because: a record might have multiple changes, and we need to select the most recent change that happened to that record. For that we use the max login date, rank it with a window function, filter on rank = 1, and use the result for the UPSERT operation.
Comment 1224651 by Karunakaran_R
- Upvotes: 1
I think B; a Type 1 table must overwrite the data.
Comment 1219870 by Freyr
- Upvotes: 3
C is correct. A Type 1 table means that it performs an “upsert” operation without maintaining history, based on the merge condition. This means that new records are inserted, and existing records are updated. As a result, the merge process does not retain historical records.
Therefore, the correct answer is C.
Comment 1145233 by PrashantTiwari
- Upvotes: 1
C is correct
Comment 1136757 by DAN_H
- Upvotes: 2
Selected Answer: C answer is C
Comment 1130915 by spaceexplorer
- Upvotes: 1
Selected Answer: C answer is C
Comment 1118582 by kz_data
- Upvotes: 1
Selected Answer: C Correct answer is C
Comment 1114699 by ATLTennis
- Upvotes: 1
Selected Answer: D D is the optimal way to identify the changes since the last data refresh.
Comment 1040298 by sturcu
- Upvotes: 2
Selected Answer: E We need to filter on the last hour and deduplicate records, then merge. C is not correct; filtering on the max login date makes no sense.
Comment 1013251 by Eertyy
- Upvotes: 1
correct answer is C
Comment 1001431 by thxsgod
- Upvotes: 3
Selected Answer: C Correct
Question tunrJHVCdlciidWUf8Tl
Question
Which statement describes Delta Lake optimized writes?
Choices
- A: Before a Jobs cluster terminates, OPTIMIZE is executed on all tables modified during the most recent job.
- B: An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB.
- C: A shuffle occurs prior to writing to try to group similar data together resulting in fewer files instead of each executor writing multiple files based on directory partitions.
- D: Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1307715 by benni_ale
- Upvotes: 1
Selected Answer: C C thanks
Comment 1294146 by Farid77
- Upvotes: 1
B is correct. C is wrong: OPTIMIZE is a separate process from the write.
Comment 1251625 by vexor3
- Upvotes: 2
Selected Answer: C C is correct
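For reference, the pre-write shuffle described in option C is toggled by configuration rather than run as a separate job. A hedged sketch of how it is typically enabled (property names as documented for Databricks/Delta runtimes; verify against your runtime version — this fragment assumes an existing SparkSession named `spark` and a table named `my_table`):

```python
# Config fragment (illustrative, not standalone): enabling Delta Lake
# optimized writes, i.e. the shuffle before write described in option C.
# Assumes an existing SparkSession `spark` on a Delta-enabled runtime.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

# Or per table, via a table property:
spark.sql("""
    ALTER TABLE my_table
    SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true')
""")
```

This is distinct from running OPTIMIZE after the fact: optimized writes rearrange data across executors before the write so that fewer, larger files are produced in the first place.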
Question 7kWoGfFgk7PjaHW86rLe
Question
Which statement characterizes the general programming model used by Spark Structured Streaming?
Choices
- A: Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.
- B: Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.
- C: Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.
- D: Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.
answer?
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1251627 by vexor3
- Upvotes: 2
Selected Answer: D D is correct
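Option D's mental model can be mimicked without Spark: each micro-batch is appended to a conceptually unbounded input table, and the query result is updated incrementally. A pure-Python illustration of the model (not the actual Structured Streaming API):

```python
# Pure-Python illustration of the Structured Streaming model (option D):
# new data arriving in a stream is modeled as rows appended to an
# unbounded table, against which a query result is maintained.
input_table = []    # the conceptually unbounded input table
running_count = {}  # incrementally maintained query result

def process_micro_batch(new_rows):
    input_table.extend(new_rows)  # the stream = appended rows
    for row in new_rows:          # incremental update, not a full rescan
        running_count[row["region"]] = running_count.get(row["region"], 0) + 1

process_micro_batch([{"region": "us"}, {"region": "eu"}])
process_micro_batch([{"region": "us"}])
print(running_count)  # {'us': 2, 'eu': 1}
```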
Question kTA7ztVOtDkA131KYQAO
Question
Which configuration parameter directly affects the size of a spark-partition upon ingestion of data into Spark?
Choices
- A: spark.sql.files.maxPartitionBytes
- B: spark.sql.autoBroadcastJoinThreshold
- C: spark.sql.adaptive.advisoryPartitionSizeInBytes
- D: spark.sql.adaptive.coalescePartitions.minPartitionNum
answer?
Answer: A Answer_ET: A Community answer A (100%) Discussion
Comment 1339925 by lene
- Upvotes: 1
Selected Answer: A From the Spark configuration docs: spark.sql.files.maxPartitionBytes (default 134217728, i.e. 128 MB) is the maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.
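The effect of this setting can be estimated with simple arithmetic. A rough sketch (simplified: Spark's actual planner also weighs spark.sql.files.openCostInBytes and the available parallelism):

```python
# Rough estimate of Spark input partition count as governed by
# spark.sql.files.maxPartitionBytes (default 128 MB). Simplified model:
# the real planner also considers file-open cost and core count.
MAX_PARTITION_BYTES = 128 * 1024 * 1024  # default 134217728

def estimated_partitions(total_input_bytes):
    # ceiling division: each partition packs at most MAX_PARTITION_BYTES
    return -(-total_input_bytes // MAX_PARTITION_BYTES)

ten_gb = 10 * 1024**3
print(estimated_partitions(ten_gb))  # prints: 80
```

Lowering the value yields more, smaller partitions (more parallelism, more overhead); raising it yields fewer, larger ones.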
Comment 1251628 by vexor3
- Upvotes: 1
Selected Answer: A A is correct
Question eetgilNb1NoyG6cEfqPC
Question
A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median time to complete a task as roughly the same, but the max duration for a task to be roughly 100 times as long as the minimum.
Which situation is causing increased duration of the overall job?
Choices
- A: Task queueing resulting from improper thread pool assignment.
- B: Spill resulting from attached volume storage being too small.
- C: Network latency due to some cluster nodes being in different regions from the source data
- D: Skew caused by more data being assigned to a subset of spark-partitions.
answer?
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1307716 by benni_ale
- Upvotes: 1
Selected Answer: D D is ok
Comment 1251630 by vexor3
- Upvotes: 1
Selected Answer: D D is correct
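The symptom in the question (min and median task durations roughly equal, max around 100x the minimum) is the classic signature of skew: most tasks receive similar amounts of data while a few receive far more. A small sketch of flagging this pattern from per-task durations (the threshold is illustrative, not a Spark API):

```python
# Illustrative skew check on per-task durations from one stage (option D).
# Skew signature: min ~ median, but max far above median, because a few
# spark-partitions received most of the data. The ratio is a made-up knob.
def looks_skewed(durations_ms, ratio=10):
    s = sorted(durations_ms)
    median = s[len(s) // 2]
    # max far above median, while min is close to median
    return s[-1] > ratio * median and median < 2 * s[0]

balanced = [100, 110, 105, 120, 115]
skewed   = [100, 110, 105, 120, 10_000]
print(looks_skewed(balanced), looks_skewed(skewed))  # prints: False True
```

Queueing, spill, or network latency would tend to shift the whole duration distribution rather than produce one extreme outlier task.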