Questions and Answers
Question lZOpomtDJfrMn9QUlWqN
Question
In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directory, incrementally process JSON files as they arrive in a source directory, and automatically evolve the schema of the table when new fields are detected.
The function is displayed below with a blank:
//IMG//
Which response correctly fills in the blank to meet the specified requirements?
Choices
- A:
- B:
- C:
- D:
- E:
answer?
Answer: E Answer_ET: E Community answer E (100%) Discussion
Comment 1323161 by benni_ale
- Upvotes: 2
Selected Answer: E Schema evolution = the mergeSchema option is needed; incremental = checkpointing is needed; real-time = writeStream with the default trigger. The only option that covers all of these is E.
Comment 1264181 by 35fd6dd
- Upvotes: 2
Selected Answer: E write is not for Spark Structured Streaming.
Comment 1220639 by Freyr
- Upvotes: 4
Selected Answer: E Reference: https://docs.databricks.com/en/ingestion/auto-loader/schema.html
writeStream: ensures real-time streaming write capabilities, which is essential for near real-time workloads. checkpointLocation: necessary for fault tolerance and tracking progress. mergeSchema: ensures automatic schema evolution, allowing new columns to be detected and added to the target table.
Why is option C incorrect? It uses write instead of writeStream, which is for batch processing, making it inappropriate for real-time streaming.
Why is option B incorrect? Although it includes checkpointLocation and mergeSchema, the addition of trigger(once=True) is not necessary in this context and is better suited for batch-like processing.
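Putting the pieces from option E together, the complete helper would look roughly like this. This is a sketch assuming a Databricks runtime; the function and parameter names are illustrative, not taken from the original image:

```python
# Sketch of the Auto Loader helper described in the question (option E).
# Assumes a Databricks runtime; function and parameter names are
# illustrative, not from the original image.
def ingest_json_stream(spark, source_dir, checkpoint_path, table_name):
    return (spark.readStream
        .format("cloudFiles")                                  # Auto Loader source
        .option("cloudFiles.format", "json")                   # incrementally pick up JSON files
        .option("cloudFiles.schemaLocation", checkpoint_path)  # schema inference and tracking
        .load(source_dir)
        .writeStream                                           # streaming (near real-time) write
        .option("checkpointLocation", checkpoint_path)         # fault tolerance / progress
        .option("mergeSchema", "true")                         # evolve table schema on new fields
        .toTable(table_name))
```

With the default trigger, this runs continuously as files arrive; the shared checkpoint path covers both schema tracking and stream progress, as the comments above note.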
Comment 1169467 by vikram12apr
- Upvotes: 2
Selected Answer: E readStream and writeStream share the schema via the checkpoint location, so cloudFiles.schemaLocation needs to be the same path as checkpointLocation so that we do not need to specify the schema manually. Also, mergeSchema set to true makes sure that if any new column is detected, it is added to the target table.
https://docs.databricks.com/en/ingestion/auto-loader/schema.html
Comment 1166127 by hal2401me
- Upvotes: 2
Selected Answer: E https://notebooks.databricks.com/demos/auto-loader/01-Auto-loader-schema-evolution-Ingestion.html
Comment 1076631 by aragorn_brego
- Upvotes: 1
Selected Answer: E This response correctly fills in the blank to meet the specified requirements of using Databricks Auto Loader for automatic schema detection and evolution in a near real-time streaming context.
Comment 1067007 by AzureDE2522
- Upvotes: 3
Selected Answer: E Please refer: https://docs.databricks.com/en/ingestion/auto-loader/schema.html
Comment 1066273 by Dileepvikram
- Upvotes: 1
It does not mention writing as a stream, it mentions writing incrementally, so option C looks correct to me.
Comment 1054657 by mouad_attaqi
- Upvotes: 2
Selected Answer: E Correct answer is E; it is a streaming write, and the default outputMode is append (so it is optional in this case).
Comment 1053478 by sturcu
- Upvotes: 1
There is a typo in the statement. Is it schema or checkpoint? The provided answer is not correct. It has to be a writeStream, with mode append.
Question TxiZ4DALxAD91qf40gNR
Question
The data engineering team maintains the following code:
//IMG//
Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?
Choices
- A: The silver_customer_sales table will be overwritten by aggregated values calculated from all records in the gold_customer_lifetime_sales_summary table as a batch job.
- B: A batch job will update the gold_customer_lifetime_sales_summary table, replacing only those rows that have different values than the current version of the table, using customer_id as the primary key.
- C: The gold_customer_lifetime_sales_summary table will be overwritten by aggregated values calculated from all records in the silver_customer_sales table as a batch job.
- D: An incremental job will leverage running information in the state store to update aggregate values in the gold_customer_lifetime_sales_summary table.
- E: An incremental job will detect if new rows have been written to the silver_customer_sales table; if new rows are detected, all aggregates will be recalculated and used to overwrite the gold_customer_lifetime_sales_summary table.
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1076635 by aragorn_brego
- Upvotes: 7
Selected Answer: C The code performs a batch aggregation on the silver_customer_sales table grouped by customer_id. It calculates the first and last transaction dates, the average sales, the total number of distinct orders, and the lifetime value of sales for each customer. The .mode("overwrite") operation specifies that the output table gold_customer_lifetime_sales_summary should be overwritten with the result of this aggregation. This means that every time this code runs, it replaces the existing gold_customer_lifetime_sales_summary table with a new version that reflects the current state of the silver_customer_sales table.
Comment 1166128 by hal2401me
- Upvotes: 1
Selected Answer: C C. There is nothing implying streaming here.
Comment 1066277 by Dileepvikram
- Upvotes: 1
C is the answer
Comment 1054660 by mouad_attaqi
- Upvotes: 3
Selected Answer: C Correct Answer is C, it is an overwrite mode
Comment 1053479 by sturcu
- Upvotes: 4
Selected Answer: C it does overwrite, so no incremental load
Question rEbnCBpqJHz4ngwrdGXB
Question
The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as “unmanaged”) Delta Lake tables.
Which approach will ensure that this requirement is met?
Choices
- A: When a database is being created, make sure that the LOCATION keyword is used.
- B: When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.
- C: When data is saved to a table, make sure that a full file path is specified alongside the Delta format.
- D: When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.
- E: When the workspace is being configured, make sure that external cloud object storage has been mounted.
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1053480 by sturcu
- Upvotes: 8
None of the provided options is right. It should be: when a table is created, make sure LOCATION is provided.
Comment 1141688 by vctrhugo
- Upvotes: 5
Selected Answer: C In Delta Lake, an external (or unmanaged) table is a table whose data is stored in a location specified by the user rather than in the metastore's default managed storage. When you save data to an external table, you need to specify the full file path where the data will be stored. This makes the table "external" because the data itself is not managed by the metastore; only the metadata is. This is why specifying a full file path alongside the Delta format when saving data to a table will ensure that the table is configured as an external Delta Lake table.
Comment 1326069 by Sriramiyer92
- Upvotes: 3
Selected Answer: C Folks, note: while creating a table, use of the EXTERNAL keyword is not mandatory; mentioning LOCATION and providing a path is mandatory. Option C does not explicitly say that the LOCATION keyword is used, but since the path is provided, it implies the use of the LOCATION keyword indirectly. The devil is in the details. :)
Comment 1282365 by hjy
- Upvotes: 1
The CREATE EXTERNAL TABLE statement is used in Hive, so C is correct.
Comment 1128469 by jkhan2405
- Upvotes: 2
Selected Answer: C C is correct.
Comment 1092035 by JamesWright
- Upvotes: 1
C is correct
Comment 1076640 by aragorn_brego
- Upvotes: 3
Selected Answer: C Here’s why the other options may not ensure the requirement is met: D. Delta Lake does not use the EXTERNAL keyword in the same way as some other SQL-based systems. In Delta Lake, whether a table is external is determined by where the data files are stored, not by a keyword in the CREATE TABLE statement.
%sql CREATE TABLE f1_demo.results_external USING DELTA LOCATION '/mnt/formula1dl/demo/results_external'
Comment 1066286 by Dileepvikram
- Upvotes: 1
possible answer is C
Comment 1064017 by Laraujo2022
- Upvotes: 1
I think it should be A, because when a database is created using a LOCATION, all tables within this database are automatically assigned as unmanaged tables.
Comment 1057418 by sturcu
- Upvotes: 1
Selected Answer: C provide path (LOCATION)
Comment 1054665 by mouad_attaqi
- Upvotes: 1
Selected Answer: C C is a plausible answer, as in this case we are writing the data to an external location.
Question uctSwbSCqG0NFxyKqKjp
Question
An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable: //IMG//
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order. If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?
Choices
- A: Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.
- B: Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.
- C: Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.
- D: Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, the operation will fail.
- E: Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.
answer?
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1013185 by Eertyy
- Upvotes: 18
B. Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.
Explanation:
In the provided code, the .dropDuplicates([“customer_id”,“order_id”]) operation is performed on the data loaded from the Parquet files. This operation ensures that only unique records, based on the composite key of “customer_id” and “order_id,” are retained in the DataFrame before writing to the “orders” table.
However, this operation does not consider duplicates that may already exist in the “orders” table. It only filters duplicates from the current batch of data. If there are duplicates in the “orders” table from previous batches, they will remain in the table.
So, newly written records will not have duplicates within the batch being written, but duplicates from previous batches may still exist in the target table.
Comment 1334738 by arekm
- Upvotes: 2
Selected Answer: B No duplicates in the current batch; that is obvious. Duplicates may still happen, since the source occasionally produces duplicates hours apart. This means one record can be generated by the source and processed on day 1, and its duplicate on day 2. Since there is no logic checking whether the corresponding record exists in the target, you get duplicates there, given that we use append mode.
Comment 1323897 by Anithec0der
- Upvotes: 1
Selected Answer: B Yes, B is the correct answer, because the current code looks for duplicates in the current DataFrame based on the composite keys, not for duplicates already in the target table. If we want to insert only rows that are not in the target table, we can use Databricks' MERGE INTO statement.
Comment 1286674 by benni_ale
- Upvotes: 1
Selected Answer: B The append method does not take into consideration any key in the target table; it simply adds all rows of the input table to the target table.
Comment 1236309 by panya
- Upvotes: 1
Yes it should be B
Comment 1224439 by imatheushenrique
- Upvotes: 1
B. Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.
Using merge this problem would not happen
Comment 1169377 by DavidRou
- Upvotes: 1
Selected Answer: B B is the right answer. The above code only removes duplicates from the batch being processed; no logic is applied to already-saved records.
Comment 1121604 by Jay_98_11
- Upvotes: 1
Selected Answer: B B is correct
Comment 1111078 by 5ffcd04
- Upvotes: 1
Selected Answer: B Answer B
Comment 1102689 by kz_data
- Upvotes: 1
Selected Answer: B B is correct
Comment 1084634 by vivekla
- Upvotes: 1
correct B
Comment 1040242 by sturcu
- Upvotes: 1
Selected Answer: B Correct
Comment 1038956 by Starvosxant
- Upvotes: 1
Correct. B
Comment 1001408 by thxsgod
- Upvotes: 2
Selected Answer: B Correct
Question qKkhM5fANqR6MXxhBbkC
Question
The marketing team is looking to share data in an aggregate table with the sales organization, but the field names used by the teams do not match, and a number of marketing-specific fields have not been approved for the sales org.
Which of the following solutions addresses the situation while emphasizing simplicity?
Choices
- A: Create a view on the marketing table selecting only those fields approved for the sales team; alias the names of any fields that should be standardized to the sales naming conventions.
- B: Create a new table with the required schema and use Delta Lake’s DEEP CLONE functionality to sync up changes committed to one table to the corresponding table.
- C: Use a CTAS statement to create a derivative table from the marketing table; configure a production job to propagate changes.
- D: Add a parallel table write to the current production pipeline, updating a new sales table that varies as required from the marketing table.
- E: Instruct the marketing team to download results as a CSV and email them to the sales organization.
answer?
Answer: A Answer_ET: A Community answer A (100%) Discussion
Comment 1255507 by Hadiler
- Upvotes: 1
Selected Answer: A A is the simplest one
Comment 1141686 by vctrhugo
- Upvotes: 4
Selected Answer: A Creating a view is a simple and efficient way to provide access to a subset of data from a table. In this case, the view can be configured to include only the fields that have been approved for the sales team. Additionally, any fields that need to be renamed to match the sales team’s naming conventions can be aliased in the view. This approach does not require the creation of additional tables or the configuration of jobs to sync data, making it a relatively straightforward solution. However, it’s important to note that views do not physically store data, so any changes to the underlying marketing table will be reflected in the view. This means that the sales team will always have access to the most up-to-date approved data.
Comment 1131748 by spaceexplorer
- Upvotes: 1
Selected Answer: A A is the simplest
Comment 1108932 by dmov
- Upvotes: 2
Selected Answer: A Looks like A to me