Questions and Answers
Question 8ly8s8V6CLa7IbV5CAYi
Question
A data engineer needs to create a table in Databricks using data from a CSV file at location /path/to/csv.
They run the following command:
//IMG//
Which of the following lines of code fills in the above blank to successfully complete the task?
Choices
- A: None of these lines of code are needed to successfully complete the task
- B: USING CSV
- C: FROM CSV
- D: USING DELTA
- E: FROM "path/to/csv"
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1170507 by fifirifi
- Upvotes: 2
Selected Answer: B Correct answer: B. Explanation: To create a table in Databricks using data from a CSV file, the correct syntax after specifying the table name and schema (if applicable) is the USING CSV clause, which defines the format of the source data. This clause tells Databricks that the data source format is CSV. The command would typically look like the sketch below.
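A minimal sketch of what that command could look like, run from a Databricks notebook where spark is predefined. The table name and OPTIONS values are hypothetical; only the path comes from the question:

# Hypothetical sketch: my_csv_table and the OPTIONS are assumptions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_csv_table
    USING CSV
    OPTIONS (header = 'true', inferSchema = 'true')
    LOCATION '/path/to/csv'
""")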
Comment 1156035 by Bob123456
- Upvotes: 1
I have a question
Why can't the answer be USING DELTA?
Comment 1089742 by kz_data
- Upvotes: 1
Selected Answer: B B is correct
Comment 1083495 by 55f31c8
- Upvotes: 3
Selected Answer: B https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
Comment 1050168 by meow_akk
- Upvotes: 3
Ans B: USING CSV is correct; that is the right syntax.
Comment 1048892 by kishanu
- Upvotes: 2
Selected Answer: B USING CSV
Question S9PIxGSsSLWrOOBQqUfT
Question
A data engineer has realized that they made a mistake when making a daily update to a table. They need to use Delta time travel to restore the table to a version that is 3 days old. However, when the data engineer attempts to time travel to the older version, they are unable to restore the data because the data files have been deleted. Which of the following explains why the data files are no longer present?
Choices
- A: The VACUUM command was run on the table
- B: The TIME TRAVEL command was run on the table
- C: The DELETE HISTORY command was run on the table
- D: The OPTIMIZE command was run on the table
- E: The HISTORY command was run on the table
Answer: A Answer_ET: A Community answer A (100%) Discussion
Comment 1072323 by Feroz_Raza
- Upvotes: 12
Selected Answer: A There is no DELETE HISTORY command in Databricks. The VACUUM command can remove history, and we can also specify the retention period with the VACUUM command; the default retention period is 7 days. To change the default retention period you can run the following command:
ALTER TABLE your_table SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 30 days');
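A minimal sketch of the VACUUM call that would have removed the older files (table name hypothetical; 168 hours is just the 7-day default written explicitly):

# Hypothetical table name; DRY RUN previews deletions before committing to them.
spark.sql("VACUUM my_delta_table RETAIN 168 HOURS DRY RUN")  # list files that would be deleted
spark.sql("VACUUM my_delta_table RETAIN 168 HOURS")          # delete them, breaking older time travel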
Comment 1411230 by devbila
- Upvotes: 1
Selected Answer: A The response is A
Comment 1288950 by mateo_vansweevelt
- Upvotes: 1
Selected Answer: A It's VACUUM.
Comment 997867 by vctrhugo
- Upvotes: 3
Selected Answer: A A. The VACUUM command was run on the table
The VACUUM command in Delta Lake is used to clean up and remove unnecessary data files that are no longer needed for time travel or query purposes. When you run VACUUM with certain retention settings, it can delete older data files, which might include versions of data that are older than the specified retention period. If the data engineer is unable to restore the table to a version that is 3 days old because the data files have been deleted, it’s likely because the VACUUM command was run on the table, removing the older data files as part of data cleanup.
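For context, a minimal sketch of the time travel the engineer might have attempted (table name hypothetical); both statements fail once VACUUM has deleted the underlying data files:

# Hypothetical table name; query and restore a snapshot from 3 days ago.
spark.sql("SELECT * FROM my_table TIMESTAMP AS OF current_timestamp() - INTERVAL 3 DAYS")
spark.sql("RESTORE TABLE my_table TO TIMESTAMP AS OF current_timestamp() - INTERVAL 3 DAYS")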
Comment 1272797 by 9d4d68a
- Upvotes: 1
Selected Answer: A A. The VACUUM command was run on the table
The VACUUM command in Delta Lake is used to clean up old data files that are no longer needed, which could include files that are older than a certain retention period. If the data engineer is unable to restore data to a version that is 3 days old, it is likely because the VACUUM command has deleted the old data files beyond the retention period.
Comment 1262388 by 80370eb
- Upvotes: 1
Selected Answer: A vacuum command is used to remove the history of the table.
Comment 1249372 by ranjan24
- Upvotes: 1
A is the correct answer
Comment 1244536 by 3fbc31b
- Upvotes: 1
Selected Answer: A There is no DELETE HISTORY command anywhere in Databricks. The VACUUM command removes files older than the value that is set. The default value is 7 days.
Comment 1231261 by potaryxkug
- Upvotes: 1
A is the correct answer
Comment 1227527 by mascarenhaslucas
- Upvotes: 1
The answer is A!
Comment 1182680 by bettermakeme
- Upvotes: 1
Answer is A. Just finished exam-got 100% [Databricks Associate Exam Practice Exams] All questions came from Databricks Certified Data Engineer Associate https://www.udemy.com/share/10aEFa3@9M_uT6vrKbnl68tOK96kfy-YWitjwzLTlVCrzPs-0hGUu8fyX8V4Tn_x_y65bwLm/
Comment 1177170 by Itmma
- Upvotes: 1
Selected Answer: A A is correct
Comment 1104703 by SerGrey
- Upvotes: 1
Selected Answer: A A is correct
Comment 1073382 by Huroye
- Upvotes: 2
I agree with the first post. A is the correct answer. There is no such thing as a Delete History Command
Comment 1064769 by awofalus
- Upvotes: 1
Selected Answer: A right answer is A
Comment 1057375 by vivekrrr
- Upvotes: 1
I think B is the answer, please let me know if that's not correct
Comment 1057373 by vivekrrr
- Upvotes: 1
but VACUUM removes anything that's older than 7 days by default, right?
Comment 1028763 by VijayKula
- Upvotes: 1
Selected Answer: A Answer is A, VACUUM
Comment 1022510 by Sriramiyer92
- Upvotes: 1
Reading Material: https://learn.microsoft.com/en-us/azure/databricks/delta/vacuum#example-syntax-for-vacuum
Comment 1017342 by KalavathiP
- Upvotes: 2
Selected Answer: A A is correct
Comment 988569 by cpalmier
- Upvotes: 2
A is correct! Does a DELETE HISTORY command even exist?
Comment 946760 by Atnafu
- Upvotes: 1
A. When the data engineer attempted to time travel to an older version of the table, the data files were no longer present because the VACUUM command was run on the table. The VACUUM command in Delta Lake is used to clean up files that are no longer necessary for the current version of the table. It permanently removes older versions of data files and transaction log files that are no longer needed for queries or time travel. By running the VACUUM command, the data engineer inadvertently deleted the data files of the version they were trying to restore, making it impossible to access that specific version of the table through Delta time travel. Syntax:
VACUUM [db_name.]table_name [RETAIN num_hrs] [DRY RUN]
Comment 889038 by Majjjj
- Upvotes: 4
Selected Answer: A The most likely reason why the data files are no longer present when the data engineer attempts to time travel to an older version of a Delta table is that the VACUUM command was run on the table. The VACUUM command removes files that are no longer in use by the Delta table, including files that are required for time travel. Therefore, if the VACUUM command is run on a Delta table, it can make it impossible to use time travel to recover older versions of the table.
Comment 876194 by Varma_Saraswathula
- Upvotes: 1
Agreed, A is the answer
Comment 875845 by naxacod574
- Upvotes: 1
Vacuum
Comment 863851 by Data_4ever
- Upvotes: 3
Selected Answer: A The VACUUM command is the only way to explicitly remove history information from a Delta table. No command like DELETE HISTORY exists in Databricks.
Comment 861138 by upliftinghut
- Upvotes: 1
Selected Answer: A Vacuum not Delete
Comment 860617 by sdas1
- Upvotes: 1
option A
Comment 860261 by knivesz
- Upvotes: 1
Selected Answer: A VACUUM removes the old files, leaving only the most recent ones available.
Comment 859620 by surrabhi_4
- Upvotes: 1
Selected Answer: A option A
Comment 857962 by XiltroX
- Upvotes: 3
C is the wrong answer; the correct answer should be A. Running the VACUUM command can make time travel impossible (depending on the conditions passed to the VACUUM command). https://docs.databricks.com/sql/language-manual/delta-vacuum.html
Comment 854102 by h79
- Upvotes: 2
Wrong answer; there is no DELETE HISTORY command. https://docs.databricks.com/delta/history.html
Comment 851909 by azurearch
- Upvotes: 4
Wrong answer; the VACUUM command removes the history.
Question CesFh273ckQ2F8rdN7u6
Question
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.
The code block used by the data engineer is below:
//IMG//
If the data engineer only wants the query to process all of the available data in as many batches as required, which of the following lines of code should the data engineer use to fill in the blank?
Choices
- A: processingTime(1)
- B: trigger(availableNow=True)
- C: trigger(parallelBatch=True)
- D: trigger(processingTime="once")
- E: trigger(continuous="once")
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1203843 by benni_ale
- Upvotes: 1
Selected Answer: B B is correct.
Comment 1170512 by fifirifi
- Upvotes: 4
Selected Answer: B Correct answer: B. Explanation: In Structured Streaming, if a data engineer wants to process all the available data in as many batches as required, without an explicit trigger interval, they can use trigger(availableNow=True). The availableNow option specifies that the query should process all the data that is currently available and not wait for more data to arrive.
Comment 1101417 by AndreFR
- Upvotes: 1
Selected Answer: B It's the only answer with correct syntax.
Comment 1083536 by 55f31c8
- Upvotes: 2
Selected Answer: B https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamWriter.trigger.html
Comment 1052752 by kbaba101
- Upvotes: 4
B. From the docs: availableNow (bool, optional) — if set to True, sets a trigger that processes all available data in multiple batches and then terminates the query. Only one trigger can be set.
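A minimal sketch of the completed write (source, checkpoint, and target paths are hypothetical):

# Hypothetical paths; availableNow=True drains all currently available data
# in as many micro-batches as needed, then stops the query.
(spark.readStream.format("delta").load("/path/to/source")
    .writeStream
    .format("delta")
    .trigger(availableNow=True)
    .option("checkpointLocation", "/path/to/checkpoint")
    .start("/path/to/target"))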
Comment 1050963 by meow_akk
- Upvotes: 4
Sorry, the answer is B: https://stackoverflow.com/questions/71061809/trigger-availablenow-for-delta-source-streaming-queries-in-pyspark-databricks
For batch-style processing we use availableNow.
Comment 1050169 by meow_akk
- Upvotes: 1
Correct Ans is D: %python
(spark.readStream.format("delta").load("<delta_table_path>")
    .writeStream
    .format("delta")
    .trigger(processingTime="5 seconds")  # added line of code that defines the trigger processing time
    .outputMode("append")
    .option("checkpointLocation", "<checkpoint_path>")
    .options(**writeConfig)
    .start())
https://kb.databricks.com/streaming/optimize-streaming-transactions-with-trigger
Question P9uirovIcYtBaOoaKUhD
Question
A data engineer has developed a data pipeline to ingest data from a JSON source using Auto Loader, but the engineer has not provided any type inference or schema hints in their pipeline. Upon reviewing the data, the data engineer has noticed that all of the columns in the target table are of the string type despite some of the fields only including float or boolean values.
Which of the following describes why Auto Loader inferred all of the columns to be of the string type?
Choices
- A: There was a type mismatch between the specific schema and the inferred schema
- B: JSON data is a text-based format
- C: Auto Loader only works with string data
- D: All of the fields had at least one null value
- E: Auto Loader cannot infer the schema of ingested data
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1101423 by AndreFR
- Upvotes: 2
Selected Answer: B https://docs.databricks.com/en/ingestion/auto-loader/schema.html#how-does-auto-loader-schema-inference-work
By default, Auto Loader schema inference seeks to avoid schema evolution issues due to type mismatches. For formats that don’t encode data types (JSON and CSV), Auto Loader infers all columns as strings (including nested fields in JSON files).
Comment 1095309 by nedlo
- Upvotes: 2
Selected Answer: B It's B. "By default, Auto Loader schema inference seeks to avoid schema evolution issues due to type mismatches. For formats that don't encode data types (JSON and CSV), Auto Loader infers all columns as strings (including nested fields in JSON files). For formats with typed schema (Parquet and Avro), Auto Loader samples a subset of files and merges the schemas of individual files." https://docs.databricks.com/en/ingestion/auto-loader/schema.html
Comment 1083553 by 55f31c8
- Upvotes: 2
Selected Answer: B https://docs.databricks.com/en/ingestion/auto-loader/schema.html#how-does-auto-loader-schema-inference-work
Comment 1050170 by meow_akk
- Upvotes: 2
The correct answer is: B. JSON data is a text-based format
JSON data is a text-based format that uses strings to represent all values. When Auto Loader infers the schema of JSON data, it assumes that all values are strings. This is because Auto Loader cannot determine the type of a value based on its string representation.
https://docs.databricks.com/en/ingestion/auto-loader/schema.html
For example, the JSON string "true" logically represents a boolean value. However, Auto Loader would infer the type of this value as string, because it cannot determine that the value is a boolean from its string representation alone.
In order to get Auto Loader to infer the correct types for columns, the data engineer can provide type inference or schema hints. Type inference hints can be used to specify the types of specific columns. Schema hints can be used to provide the entire schema of the data.
Therefore, the correct answer is B. JSON data is a text-based format.
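A minimal sketch of such a schema hint, assuming hypothetical paths and column names (price, is_active):

# Hypothetical paths and columns; cloudFiles.schemaHints overrides the
# all-strings inference for just the listed columns.
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/path/to/schema")
    .option("cloudFiles.schemaHints", "price FLOAT, is_active BOOLEAN")
    .load("/path/to/json"))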
Question d1ZWko5b9hIrr1TAQDDz
Question
A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.
The table is configured to run in Development mode using the Continuous Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?
Choices
- A: All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.
- B: All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist until the pipeline is shut down.
- C: All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.
- D: All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.
- E: All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.
Answer: E Answer_ET: E Community answer E (68%) B (26%) other (5%) Discussion
Comment 1050171 by meow_akk
- Upvotes: 11
Ans E: Development and production modes. You can optimize pipeline execution by switching between development and production modes, using the toggle buttons in the Pipelines UI. By default, pipelines run in development mode.
When you run your pipeline in development mode, the Delta Live Tables system does the following:
Reuses a cluster to avoid the overhead of restarts. By default, clusters run for two hours when development mode is enabled. You can change this with the pipelines.clusterShutdown.delay setting in your compute configuration.
Disables pipeline retries so you can immediately detect and fix errors.
In production mode, the Delta Live Tables system does the following:
Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.
Retries execution in the event of specific errors, for example, a failure to start a cluster.
https://docs.databricks.com/en/delta-live-tables/updates.html#optimize-execution
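A minimal sketch of the two dataset kinds from the question, using the Python DLT API instead of SQL (dataset names are hypothetical; dlt and spark are available inside a DLT pipeline):

import dlt

@dlt.table  # streaming dataset, analogous to STREAMING LIVE TABLE in SQL
def bronze_events():
    return spark.readStream.table("source_events")  # hypothetical streaming source

@dlt.table  # batch-computed dataset, analogous to LIVE TABLE in SQL
def event_counts():
    return dlt.read("bronze_events").groupBy("event_type").count()

In continuous mode, both kinds keep updating at set intervals until the pipeline is stopped; in development mode the cluster is reused rather than terminated, which is what makes E the better answer.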
Comment 1356925 by 1017857
- Upvotes: 1
Selected Answer: B testing why testing
Comment 1353128 by JuarezNJunior
- Upvotes: 1
Selected Answer: D According to the document downloaded from the official website, the correct answer is letter D. PracticeExam-DataEngineerAssociate.pdf
Comment 1336624 by AnirbanRC
- Upvotes: 1
Selected Answer: E Continuous and Development Mode. Hence E
Comment 1327282 by MultiCloudIronMan
- Upvotes: 1
Selected Answer: B The correct answer is B. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist until the pipeline is shut down. In Continuous Pipeline Mode, the pipeline continuously processes data at set intervals, and the compute resources remain active until the pipeline is manually shut down.
Comment 1315945 by 806e7d2
- Upvotes: 1
Selected Answer: B The pipeline runs in Continuous Pipeline Mode, so datasets will be updated at set intervals. Since the pipeline is in Development Mode, the compute resources will persist until manually shut down.
Comment 1267020 by 7a22144
- Upvotes: 1
E is correct !
Option B: “All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist until the pipeline is shut down.”
This option correctly reflects that the pipeline continues running, updating datasets at intervals, and only stops when manually shut down. Compute resources persist throughout this process. Option E: “All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.”
While this is very similar, it adds the phrase “to allow for additional testing,” which might imply that the resources are persisting just for testing purposes. This can be misleading because the primary reason for resource persistence in Continuous mode is to keep the pipeline active and processing data, not solely for testing.
Comment 1244568 by 3fbc31b
- Upvotes: 1
Selected Answer: E The answer is E. In development mode the compute resources persist after the update finishes, to allow for additional testing.
Comment 1203844 by benni_ale
- Upvotes: 1
Selected Answer: E E, as the cluster actually persists for additional testing, which is what distinguishes it from B.
Comment 1110192 by Garyn
- Upvotes: 2
Selected Answer: E E. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.
Explanation:
In Development mode, Delta Live Tables persistently updates datasets at set intervals. The pipeline continuously processes incoming data until manually stopped or shut down.
Compute resources, including the cluster used for processing, persist without automatic restarts or retries (as it is the behavior in Development mode). This persistence allows for ongoing processing of data, enabling additional testing or continued data processing until the pipeline is manually shut down.
Therefore, option E accurately captures the behavior expected in Development mode, emphasizing the continuous update of datasets and the persistence of compute resources until the pipeline is manually terminated.
Comment 1089743 by kz_data
- Upvotes: 2
Selected Answer: E E seems the correct answer
Comment 1089109 by nedlo
- Upvotes: 2
Selected Answer: B Why E? The cluster persists with the same functionality as before, not specifically for "additional testing".
Comment 1084405 by 55f31c8
- Upvotes: 2
Selected Answer: E https://docs.databricks.com/en/delta-live-tables/updates.html#continuous-vs-triggered-pipeline-execution
Comment 1053328 by anandpsg101
- Upvotes: 2
Selected Answer: E E is correct
Comment 1049399 by SD5713
- Upvotes: 2
Selected Answer: E E. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.