Questions and Answers
Question upyTGCa5ytq15hW7rnKN
Question
Which statement describes a key benefit of an end-to-end test?
Choices
- A: Makes it easier to automate your test suite
- B: Pinpoints errors in the building blocks of your application
- C: Provides testing coverage for all code paths and branches
- D: Closely simulates real world usage of your application
- E: Ensures code is optimized for a real-life workflow
Answer: D | Answer_ET: D | Community answer: D (86%), other 14%
Discussion
Comment 1141641 by vctrhugo
- Upvotes: 4
Selected Answer: D End-to-end testing is a methodology used to test whether the flow of an application is performing as designed from start to finish. The purpose of carrying out end-to-end tests is to identify system dependencies and to ensure that the right information is passed between various system components and systems. The entire application is tested in a real-world scenario such as communicating with the database, network, hardware, and other applications. Therefore, it closely simulates real-world usage of the application. Other options are benefits of different types of testing, not specifically end-to-end testing.
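To make the contrast between the answer choices concrete, here is a minimal, hypothetical pytest sketch (the function and pipeline names are illustrative, not from the question): the unit test pinpoints errors in a single building block (choice B's benefit), while the end-to-end test drives the whole flow the way a real user or job would (choice D).

```python
# Hypothetical sketch: unit test vs. end-to-end test. Names are illustrative only.

def clean_review(text: str) -> str:
    """One building block: normalize a raw review string."""
    return text.strip().lower()

def review_pipeline(raw_reviews: list[str]) -> dict:
    """The full flow: clean, filter, and summarize reviews end to end."""
    cleaned = [clean_review(r) for r in raw_reviews]
    kept = [r for r in cleaned if r]
    return {"count": len(kept), "reviews": kept}

def test_clean_review_unit():
    # Unit test: pinpoints errors in a single component.
    assert clean_review("  Great Product  ") == "great product"

def test_review_pipeline_end_to_end():
    # End-to-end test: closely simulates real-world usage of the whole flow.
    result = review_pipeline(["  Great Product  ", "", "Too slow"])
    assert result == {"count": 2, "reviews": ["great product", "too slow"]}
```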
Comment 1131807 by spaceexplorer
- Upvotes: 2
Selected Answer: D D is correct
Comment 1130250 by Def21
- Upvotes: 1
Selected Answer: B End-to-end tests use an example scenario and do not necessarily follow complex real-world usage. Unit tests are component-wise; end-to-end tests span multiple components.
Question vnGqnW4JNf2fsXR481t4
Question
The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating that the job run request has been submitted successfully includes a run_id field.
Which statement describes what the number alongside this field represents?
Choices
- A: The job_id and number of times the job has been run are concatenated and returned.
- B: The total number of jobs that have been run in the workspace.
- C: The number of times the job definition has been run in this workspace.
- D: The job_id is returned in this field.
- E: The globally unique ID of the newly triggered run.
Answer: E | Answer_ET: E | Community answer: E (100%)
Discussion
Comment 1141640 by vctrhugo
- Upvotes: 4
Selected Answer: E The number alongside the “run_id” field represents the globally unique identifier assigned to the newly triggered run of the job. Each run of a job in Databricks is assigned a unique run_id, allowing you to track and reference that specific execution of the job.
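A minimal sketch of what this looks like against the Jobs REST API (which the CLI wraps); the workspace URL, token, and job_id below are placeholders:

```python
# Sketch: trigger a run of an existing job and read back its run_id.
# The host, token, and job_id values are placeholders.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

resp = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": 123},  # existing job definition to trigger
)
resp.raise_for_status()

# The response contains run_id: the globally unique ID of the newly triggered run.
run_id = resp.json()["run_id"]
print(run_id)
```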
Comment 1130258 by Def21
- Upvotes: 1
Selected Answer: E Verified from Databricks UI
Question cudI4e05eqWozPIBqSdV
Question
The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE.
The following code correctly imports the production model, loads the customers table containing the customer_id key column into a DataFrame, and defines the feature columns needed for the model.
//IMG//
Which code block will output a DataFrame with the schema “customer_id LONG, predictions DOUBLE”?
Choices
- A: df.map(lambda x: model(x[columns])).select("customer_id, predictions")
- B: df.select("customer_id", model(*columns).alias("predictions"))
- C: model.predict(df, columns)
- D: df.select("customer_id", pandas_udf(model, columns).alias("predictions"))
- E: df.apply(model, columns).select("customer_id, predictions")
Answer: B | Answer_ET: B | Community answer: B (100%)
Discussion
Comment 1076846 by aragorn_brego
- Upvotes: 7
Selected Answer: B This code block applies the Spark UDF created from the MLflow model to the DataFrame df by selecting the existing customer_id column and the new column produced by the model, which is aliased to predictions. The model(*columns) part is where the UDF is applied to the columns specified in the columns list, and alias(“predictions”) is used to name the output column of the model’s predictions. This will result in a DataFrame with the desired schema: “customer_id LONG, predictions DOUBLE”.
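For reference, a sketch of how such a Spark UDF is typically created from a logged MLflow model and applied as in choice B; the model URI, feature column names, and table name are placeholders for the setup shown in the question's image:

```python
# Sketch: load a production MLflow model as a Spark UDF and apply it column-wise.
# The model URI, columns, and table name are placeholders for the question's setup code.
import mlflow.pyfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

model = mlflow.pyfunc.spark_udf(spark, "models:/<model-name>/Production", result_type="double")
columns = ["feature_1", "feature_2", "feature_3"]

df = spark.table("customers")

# Choice B: keep customer_id and add the model's output as a DOUBLE column named predictions.
preds_df = df.select("customer_id", model(*columns).alias("predictions"))
```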
Comment 1111560 by divingbell17
- Upvotes: 6
Selected Answer: B B is correct. It's a Spark UDF, not a pandas UDF.
Comment 1071629 by 60ties
- Upvotes: 2
I think it is B
Question yJ0rC7YlgKCEiTIJLPLH
Question
A nightly batch job is configured to ingest all data files from a cloud object storage container where records are stored in a nested directory structure YYYY/MM/DD. The data for each date represents all records that were processed by the source system on that date, noting that some records may be delayed as they await moderator approval. Each entry represents a user review of a product and has the following schema:
user_id STRING, review_id BIGINT, product_id BIGINT, review_timestamp TIMESTAMP, review_text STRING
The ingestion job is configured to append all data for the previous date to a target table reviews_raw with an identical schema to the source system. The next step in the pipeline is a batch write to propagate all new records inserted into reviews_raw to a table where data is fully deduplicated, validated, and enriched.
Which solution minimizes the compute costs to propagate this batch of data?
Choices
- A: Perform a batch read on the reviews_raw table and perform an insert-only merge using the natural composite key user_id, review_id, product_id, review_timestamp.
- B: Configure a Structured Streaming read against the reviews_raw table using the trigger once execution mode to process new records as a batch job.
- C: Use Delta Lake version history to get the difference between the latest version of reviews_raw and one version prior, then write these records to the next table.
- D: Filter all records in the reviews_raw table based on the review_timestamp; batch append those records produced in the last 48 hours.
- E: Reprocess all records in reviews_raw and overwrite the next table in the pipeline.
Answer: A | Answer_ET: A | Community answer: A (59%), B (32%), other 9%
Discussion
Comment 1173211 by alexvno
- Upvotes: 5
Selected Answer: A Deduplication , so insert-only merge
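A sketch of the insert-only merge described in A, using the Delta Lake Python API; the target table name reviews_deduped is a placeholder, while the key columns follow the question:

```python
# Sketch: insert-only merge from reviews_raw into a deduplicated downstream table.
# Assumes delta-spark is available and reviews_deduped is an existing Delta table (placeholder name).
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

new_reviews = spark.table("reviews_raw")
target = DeltaTable.forName(spark, "reviews_deduped")

(target.alias("t")
    .merge(
        new_reviews.alias("s"),
        """t.user_id = s.user_id AND t.review_id = s.review_id
           AND t.product_id = s.product_id AND t.review_timestamp = s.review_timestamp""",
    )
    # Insert-only: never update existing rows, only add rows whose composite key is new.
    .whenNotMatchedInsertAll()
    .execute())
```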
Comment 1119510 by bacckom
- Upvotes: 5
Selected Answer: A Should we consider deduplication? As for time travel, I don't think it can be used to deduplicate the target table.
Comment 1346910 by aarora
- Upvotes: 1
Selected Answer: C To minimize compute costs, the most efficient approach is to leverage Delta Lake's version history to identify only the new records added since the previous ingestion and process those. Here's why this solution works best:
- Delta Lake versioning: Delta Lake tracks changes to the data through its transaction log. By comparing the latest version of the table with the previous version, you can identify only the records that were appended (new data for the previous date).
- Efficient processing: By working only with the delta (new records), you avoid scanning the entire reviews_raw table, which reduces compute and storage I/O costs.
- Accurate and optimized: This approach ensures no unnecessary reprocessing of older data while still capturing any delayed records. It works well for use cases involving deduplication and validation.
Comment 1331328 by Hienlv1
- Upvotes: 1
Selected Answer: C I think C is the correct answer: use the time travel feature to get the previous version and compare it to the current version to figure out which records need to be inserted, instead of a full scan during the read as in option A. The goal is to minimize compute costs while propagating only new records inserted into the reviews_raw table to the next table in the pipeline.
Comment 1326356 by Sriramiyer92
- Upvotes: 1
Selected Answer: A As for D, the 48-hour point is just added to confuse us. A is enough.
Comment 1296869 by cales
- Upvotes: 1
Selected Answer: B “The next step in the pipeline is a batch write to propagate all new records inserted into reviews_raw to a table where data is fully deduplicated, validated, and enriched.” The deduplication will be performed in the following step. Answer B should fit better with cost minimization
Comment 1296620 by shaojunni
- Upvotes: 1
Selected Answer: A A batch read loads the full table but guarantees no duplication through the merge. Trigger once only loads new data, but you still have to run a merge to guarantee no duplication in the whole target table, and B does not indicate that.
Comment 1288337 by RyanAck24
- Upvotes: 1
Selected Answer: A A is Correct
Comment 1286910 by shaojunni
- Upvotes: 1
Selected Answer: B B is correct: trigger once is the Structured Streaming option for batch-style jobs, but it is much more efficient than a full batch read.
Comment 1132960 by spaceexplorer
- Upvotes: 1
Selected Answer: B B is correct
Comment 1132594 by ranith
- Upvotes: 2
B should be correct when looking at cost minimization: a batch read would scan the whole reviews_raw table, which is unnecessary since historical data is not changed. If a review is delayed awaiting moderator approval, it is still inserted as a new record, so capturing only the new data is sufficient.
Comment 1111563 by divingbell17
- Upvotes: 4
Selected Answer: B B should be correct. https://www.databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
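For comparison, a sketch of the incremental, trigger-once style read described in B; the target table name and checkpoint path are placeholders:

```python
# Sketch: Structured Streaming read of reviews_raw processed as a single batch.
# The target table name and checkpoint location are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(spark.readStream
    .table("reviews_raw")                     # only records appended since the last checkpoint are read
    .writeStream
    .option("checkpointLocation", "/checkpoints/reviews_silver")
    .trigger(availableNow=True)               # batch-style run: process what is available, then stop
                                              # (trigger(once=True) is the older equivalent)
    .toTable("reviews_silver"))
```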
Question K2IQOnQrwhpeAaeJ4PC4
Question
Which statement describes Delta Lake optimized writes?
Choices
- A: Before a Jobs cluster terminates, OPTIMIZE is executed on all tables modified during the most recent job.
- B: An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB.
- C: Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.
- D: Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.
- E: A shuffle occurs prior to writing to try to group similar data together resulting in fewer files instead of each executor writing multiple files based on directory partitions.
Answer: E | Answer_ET: E | Community answer: E (100%)
Discussion
Comment 1141636 by vctrhugo
- Upvotes: 1
Selected Answer: E Optimized writes improve file size as data is written and benefit subsequent reads on the table.
Optimized writes are most effective for partitioned tables, as they reduce the number of small files written to each partition. Writing fewer large files is more efficient than writing many small files, but you might still see an increase in write latency because data is shuffled before being written.
Comment 1115941 by lexaneon
- Upvotes: 3
Selected Answer: E https://docs.databricks.com/en/delta/tune-file-size.html#optimized-writes
Comment 1100430 by alexvno
- Upvotes: 3
Selected Answer: E Optimized writes are most effective for partitioned tables, as they reduce the number of small files written to each partition. Writing fewer large files is more efficient than writing many small files, but you might still see an increase in write latency because data is shuffled before being written.
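A sketch of how optimized writes (the pre-write shuffle described in E) are typically enabled, either per table or per session; the table name is a placeholder:

```python
# Sketch: enabling Delta Lake optimized writes. The table name is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Per-table: set the Delta table property.
spark.sql("""
    ALTER TABLE reviews_silver
    SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
""")

# Or per-session: enable optimized writes for all Delta writes in this SparkSession.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
```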