Questions and Answers
Question 8ly8s8V6CLa7IbV5CAYi
Question
A data engineer needs to create a table in Databricks using data from a CSV file at location /path/to/csv.
They run the following command:
//IMG//
Which of the following lines of code fills in the above blank to successfully complete the task?
Choices
- A: None of these lines of code are needed to successfully complete the task
- B: USING CSV
- C: FROM CSV
- D: USING DELTA
- E: FROM "path/to/csv"
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1170507 by fifirifi
- Upvotes: 2
Selected Answer: B Correct answer: B. Explanation: To create a table in Databricks using data from a CSV file, the correct syntax after specifying the table name and schema (if applicable) is the USING CSV clause, which defines the format of the source data. This clause tells Databricks that the data source format is CSV. The command would typically look like the sketch below.
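A minimal sketch of what that command could look like, run from a Databricks notebook where spark is predefined. The table name and OPTIONS values are hypothetical; only the path comes from the question:

# Hypothetical sketch: my_csv_table and the OPTIONS are assumptions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_csv_table
    USING CSV
    OPTIONS (header = 'true', inferSchema = 'true')
    LOCATION '/path/to/csv'
""")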
Comment 1156035 by Bob123456
- Upvotes: 1
I have a question
Why can't the answer be USING DELTA?
Comment 1089742 by kz_data
- Upvotes: 1
Selected Answer: B B is correct
Comment 1083495 by 55f31c8
- Upvotes: 3
Selected Answer: B https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
Comment 1050168 by meow_akk
- Upvotes: 3
Ans B: USING CSV is correct; that is the right syntax.
Comment 1048892 by kishanu
- Upvotes: 2
Selected Answer: B USING CSV
Question S9PIxGSsSLWrOOBQqUfT
Question
A data engineer has realized that they made a mistake when making a daily update to a table. They need to use Delta time travel to restore the table to a version that is 3 days old. However, when the data engineer attempts to time travel to the older version, they are unable to restore the data because the data files have been deleted. Which of the following explains why the data files are no longer present?
Choices
- A: The VACUUM command was run on the table
- B: The TIME TRAVEL command was run on the table
- C: The DELETE HISTORY command was run on the table
- D: The OPTIMIZE command was run on the table
- E: The HISTORY command was run on the table
Answer: A Answer_ET: A Community answer A (100%) Discussion
Comment 1072323 by Feroz_Raza
- Upvotes: 12
Selected Answer: A There is no DELETE HISTORY command in Databricks. The VACUUM command can remove history, and we can also specify the retention period with the VACUUM command; the default retention period is 7 days. To change the default retention period you can run the following command:
ALTER TABLE your_table SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 30 days');
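A minimal sketch of the VACUUM call that would have removed the older files (table name hypothetical; 168 hours is just the 7-day default written explicitly):

# Hypothetical table name; DRY RUN previews deletions before committing to them.
spark.sql("VACUUM my_delta_table RETAIN 168 HOURS DRY RUN")  # list files that would be deleted
spark.sql("VACUUM my_delta_table RETAIN 168 HOURS")          # delete them, breaking older time travel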
Comment 1411230 by devbila
- Upvotes: 1
Selected Answer: A The response is A
Comment 1288950 by mateo_vansweevelt
- Upvotes: 1
Selected Answer: A It's VACUUM.
Comment 997867 by vctrhugo
- Upvotes: 3
Selected Answer: A A. The VACUUM command was run on the table
The VACUUM command in Delta Lake is used to clean up and remove unnecessary data files that are no longer needed for time travel or query purposes. When you run VACUUM with certain retention settings, it can delete older data files, which might include versions of data that are older than the specified retention period. If the data engineer is unable to restore the table to a version that is 3 days old because the data files have been deleted, it’s likely because the VACUUM command was run on the table, removing the older data files as part of data cleanup.
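For context, a minimal sketch of the time travel the engineer might have attempted (table name hypothetical); both statements fail once VACUUM has deleted the underlying data files:

# Hypothetical table name; query and restore a snapshot from 3 days ago.
spark.sql("SELECT * FROM my_table TIMESTAMP AS OF current_timestamp() - INTERVAL 3 DAYS")
spark.sql("RESTORE TABLE my_table TO TIMESTAMP AS OF current_timestamp() - INTERVAL 3 DAYS")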
Comment 1272797 by 9d4d68a
- Upvotes: 1
Selected Answer: A A. The VACUUM command was run on the table
The VACUUM command in Delta Lake is used to clean up old data files that are no longer needed, which could include files that are older than a certain retention period. If the data engineer is unable to restore data to a version that is 3 days old, it is likely because the VACUUM command has deleted the old data files beyond the retention period.
Comment 1262388 by 80370eb
- Upvotes: 1
Selected Answer: A vacuum command is used to remove the history of the table.
Comment 1249372 by ranjan24
- Upvotes: 1
A is the correct answer
Comment 1244536 by 3fbc31b
- Upvotes: 1
Selected Answer: A There is no DELETE HISTORY command anywhere in Databricks. The VACUUM command removes files older than the value that is set. The default value is 7 days.
Comment 1231261 by potaryxkug
- Upvotes: 1
A is the correct answer
Comment 1227527 by mascarenhaslucas
- Upvotes: 1
The answer is A!
Comment 1182680 by bettermakeme
- Upvotes: 1
Answer is A. Just finished exam-got 100% [Databricks Associate Exam Practice Exams] All questions came from Databricks Certified Data Engineer Associate https://www.udemy.com/share/10aEFa3@9M_uT6vrKbnl68tOK96kfy-YWitjwzLTlVCrzPs-0hGUu8fyX8V4Tn_x_y65bwLm/
Comment 1177170 by Itmma
- Upvotes: 1
Selected Answer: A A is correct
Comment 1104703 by SerGrey
- Upvotes: 1
Selected Answer: A A is correct
Comment 1073382 by Huroye
- Upvotes: 2
I agree with the first post. A is the correct answer. There is no such thing as a Delete History Command
Comment 1064769 by awofalus
- Upvotes: 1
Selected Answer: A right answer is A
Comment 1057375 by vivekrrr
- Upvotes: 1
I think B is the answer, please let me know if that's not correct
Comment 1057373 by vivekrrr
- Upvotes: 1
but VACUUM removes anything that's older than 7 days by default, right?
Comment 1028763 by VijayKula
- Upvotes: 1
Selected Answer: A Answer is A, VACUUM
Comment 1022510 by Sriramiyer92
- Upvotes: 1
Reading Material: https://learn.microsoft.com/en-us/azure/databricks/delta/vacuum#example-syntax-for-vacuum
Comment 1017342 by KalavathiP
- Upvotes: 2
Selected Answer: A A is correct
Comment 988569 by cpalmier
- Upvotes: 2
A is correct! Does a DELETE HISTORY command even exist?
Comment 946760 by Atnafu
- Upvotes: 1
A. When the data engineer attempted to time travel to an older version of the table, the data files were no longer present because the VACUUM command was run on the table. The VACUUM command in Delta Lake is used to clean up files that are no longer necessary for the current version of the table. It permanently removes older versions of data files and transaction log files that are no longer needed for queries or time travel. By running the VACUUM command, the data engineer inadvertently deleted the data files of the version they were trying to restore, making it impossible to access that specific version of the table through Delta time travel. Syntax:
VACUUM [db_name.]table_name [RETAIN num_hrs] [DRY RUN]
Comment 889038 by Majjjj
- Upvotes: 4
Selected Answer: A The most likely reason why the data files are no longer present when the data engineer attempts to time travel to an older version of a Delta table is that the VACUUM command was run on the table. The VACUUM command removes files that are no longer in use by the Delta table, including files that are required for time travel. Therefore, if the VACUUM command is run on a Delta table, it can make it impossible to use time travel to recover older versions of the table.
Comment 876194 by Varma_Saraswathula
- Upvotes: 1
Agreed, A is the answer
Comment 875845 by naxacod574
- Upvotes: 1
Vacuum
Comment 863851 by Data_4ever
- Upvotes: 3
Selected Answer: A The VACUUM command is the only way to explicitly remove history information from a Delta table. No command like DELETE HISTORY exists in Databricks.
Comment 861138 by upliftinghut
- Upvotes: 1
Selected Answer: A Vacuum not Delete
Comment 860617 by sdas1
- Upvotes: 1
option A
Comment 860261 by knivesz
- Upvotes: 1
Selected Answer: A VACUUM removes the old files, leaving only the most recent ones available.
Comment 859620 by surrabhi_4
- Upvotes: 1
Selected Answer: A option A
Comment 857962 by XiltroX
- Upvotes: 3
C is the wrong answer; the correct answer should be A. Running the VACUUM command can make time travel impossible (depending on the conditions passed to the VACUUM command). https://docs.databricks.com/sql/language-manual/delta-vacuum.html
Comment 854102 by h79
- Upvotes: 2
Wrong answer; there is no DELETE HISTORY command. https://docs.databricks.com/delta/history.html
Comment 851909 by azurearch
- Upvotes: 4
Wrong answer; the VACUUM command removes the history.
Question CesFh273ckQ2F8rdN7u6
Question
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.
The code block used by the data engineer is below:
//IMG//
If the data engineer only wants the query to process all of the available data in as many batches as required, which of the following lines of code should the data engineer use to fill in the blank?
Choices
- A: processingTime(1)
- B: trigger(availableNow=True)
- C: trigger(parallelBatch=True)
- D: trigger(processingTime="once")
- E: trigger(continuous="once")
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1203843 by benni_ale
- Upvotes: 1
Selected Answer: B B is correct.
Comment 1170512 by fifirifi
- Upvotes: 4
Selected Answer: B Correct answer: B. Explanation: In Structured Streaming, if a data engineer wants to process all the available data in as many batches as required, without an explicit trigger interval, they can use trigger(availableNow=True). The availableNow option specifies that the query should process all the data that is currently available and not wait for more data to arrive.
Comment 1101417 by AndreFR
- Upvotes: 1
Selected Answer: B It's the only answer with correct syntax.
Comment 1083536 by 55f31c8
- Upvotes: 2
Selected Answer: B https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamWriter.trigger.html
Comment 1052752 by kbaba101
- Upvotes: 4
B. From the docs: availableNow (bool, optional) — if set to True, sets a trigger that processes all available data in multiple batches and then terminates the query. Only one trigger can be set.
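A minimal sketch of the completed write (source, checkpoint, and target paths are hypothetical):

# Hypothetical paths; availableNow=True drains all currently available data
# in as many micro-batches as needed, then stops the query.
(spark.readStream.format("delta").load("/path/to/source")
    .writeStream
    .format("delta")
    .trigger(availableNow=True)
    .option("checkpointLocation", "/path/to/checkpoint")
    .start("/path/to/target"))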
Comment 1050963 by meow_akk
- Upvotes: 4
Sorry, the answer is B: https://stackoverflow.com/questions/71061809/trigger-availablenow-for-delta-source-streaming-queries-in-pyspark-databricks
For batch-style processing we use availableNow.
Comment 1050169 by meow_akk
- Upvotes: 1
Correct Ans is D: %python
(spark.readStream.format("delta").load("<delta_table_path>")
    .writeStream
    .format("delta")
    .trigger(processingTime="5 seconds")  # added line of code that defines the trigger processing time
    .outputMode("append")
    .option("checkpointLocation", "<checkpoint_path>")
    .options(**writeConfig)
    .start())
https://kb.databricks.com/streaming/optimize-streaming-transactions-with-trigger
Question P9uirovIcYtBaOoaKUhD
Question
A data engineer has developed a data pipeline to ingest data from a JSON source using Auto Loader, but the engineer has not provided any type inference or schema hints in their pipeline. Upon reviewing the data, the data engineer has noticed that all of the columns in the target table are of the string type despite some of the fields only including float or boolean values.
Which of the following describes why Auto Loader inferred all of the columns to be of the string type?
Choices
- A: There was a type mismatch between the specific schema and the inferred schema
- B: JSON data is a text-based format
- C: Auto Loader only works with string data
- D: All of the fields had at least one null value
- E: Auto Loader cannot infer the schema of ingested data
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1101423 by AndreFR
- Upvotes: 2
Selected Answer: B https://docs.databricks.com/en/ingestion/auto-loader/schema.html#how-does-auto-loader-schema-inference-work
By default, Auto Loader schema inference seeks to avoid schema evolution issues due to type mismatches. For formats that don’t encode data types (JSON and CSV), Auto Loader infers all columns as strings (including nested fields in JSON files).
Comment 1095309 by nedlo
- Upvotes: 2
Selected Answer: B It's B. "By default, Auto Loader schema inference seeks to avoid schema evolution issues due to type mismatches. For formats that don't encode data types (JSON and CSV), Auto Loader infers all columns as strings (including nested fields in JSON files). For formats with typed schema (Parquet and Avro), Auto Loader samples a subset of files and merges the schemas of individual files." https://docs.databricks.com/en/ingestion/auto-loader/schema.html
Comment 1083553 by 55f31c8
- Upvotes: 2
Selected Answer: B https://docs.databricks.com/en/ingestion/auto-loader/schema.html#how-does-auto-loader-schema-inference-work
Comment 1050170 by meow_akk
- Upvotes: 2
The correct answer is: B. JSON data is a text-based format
JSON data is a text-based format that uses strings to represent all values. When Auto Loader infers the schema of JSON data, it assumes that all values are strings. This is because Auto Loader cannot determine the type of a value based on its string representation.
https://docs.databricks.com/en/ingestion/auto-loader/schema.html
For example, the JSON string "true" logically represents a boolean value. However, Auto Loader would infer the type of this value as string, because it cannot determine that the value is a boolean from its string representation alone.
In order to get Auto Loader to infer the correct types for columns, the data engineer can provide type inference or schema hints. Type inference hints can be used to specify the types of specific columns. Schema hints can be used to provide the entire schema of the data.
Therefore, the correct answer is B. JSON data is a text-based format.
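A minimal sketch of such a schema hint, assuming hypothetical paths and column names (price, is_active):

# Hypothetical paths and columns; cloudFiles.schemaHints overrides the
# all-strings inference for just the listed columns.
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/path/to/schema")
    .option("cloudFiles.schemaHints", "price FLOAT, is_active BOOLEAN")
    .load("/path/to/json"))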
Question d1ZWko5b9hIrr1TAQDDz
Question
A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.
The table is configured to run in Development mode using the Continuous Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?
Choices
- A: All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.
- B: All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist until the pipeline is shut down.
- C: All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.
- D: All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.
- E: All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.
Answer: E Answer_ET: E Community answer E (68%) B (26%) other (5%) Discussion
Comment 1050171 by meow_akk
- Upvotes: 11
Ans E: Development and production modes. You can optimize pipeline execution by switching between development and production modes, using the toggle buttons in the Pipelines UI. By default, pipelines run in development mode.
When you run your pipeline in development mode, the Delta Live Tables system does the following:
Reuses a cluster to avoid the overhead of restarts. By default, clusters run for two hours when development mode is enabled. You can change this with the pipelines.clusterShutdown.delay setting in your compute configuration.
Disables pipeline retries so you can immediately detect and fix errors.
In production mode, the Delta Live Tables system does the following:
Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.
Retries execution in the event of specific errors, for example, a failure to start a cluster.
https://docs.databricks.com/en/delta-live-tables/updates.html#optimize-execution
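A minimal sketch of the two dataset kinds from the question, using the Python DLT API instead of SQL (dataset names are hypothetical; dlt and spark are available inside a DLT pipeline):

import dlt

@dlt.table  # streaming dataset, analogous to STREAMING LIVE TABLE in SQL
def bronze_events():
    return spark.readStream.table("source_events")  # hypothetical streaming source

@dlt.table  # batch-computed dataset, analogous to LIVE TABLE in SQL
def event_counts():
    return dlt.read("bronze_events").groupBy("event_type").count()

In continuous mode, both kinds keep updating at set intervals until the pipeline is stopped; in development mode the cluster is reused rather than terminated, which is what makes E the better answer.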
Comment 1356925 by 1017857
- Upvotes: 1
Selected Answer: B testing why testing
Comment 1353128 by JuarezNJunior
- Upvotes: 1
Selected Answer: D According to the document downloaded from the official website, the correct answer is letter D. PracticeExam-DataEngineerAssociate.pdf
Comment 1336624 by AnirbanRC
- Upvotes: 1
Selected Answer: E Continuous and Development Mode. Hence E
Comment 1327282 by MultiCloudIronMan
- Upvotes: 1
Selected Answer: B The correct answer is B. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist until the pipeline is shut down. In Continuous Pipeline Mode, the pipeline continuously processes data at set intervals, and the compute resources remain active until the pipeline is manually shut down.
Comment 1315945 by 806e7d2
- Upvotes: 1
Selected Answer: B The pipeline runs in Continuous Pipeline Mode, so datasets will be updated at set intervals. Since the pipeline is in Development Mode, the compute resources will persist until manually shut down.
Comment 1267020 by 7a22144
- Upvotes: 1
E is correct !
Option B: “All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist until the pipeline is shut down.”
This option correctly reflects that the pipeline continues running, updating datasets at intervals, and only stops when manually shut down. Compute resources persist throughout this process. Option E: “All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.”
While this is very similar, it adds the phrase “to allow for additional testing,” which might imply that the resources are persisting just for testing purposes. This can be misleading because the primary reason for resource persistence in Continuous mode is to keep the pipeline active and processing data, not solely for testing.
Comment 1244568 by 3fbc31b
- Upvotes: 1
Selected Answer: E The answer is E. In development mode the compute resources persist after the update finishes, to allow for additional testing.
Comment 1203844 by benni_ale
- Upvotes: 1
Selected Answer: E E, as the cluster actually persists for additional testing, which is what distinguishes it from B.
Comment 1110192 by Garyn
- Upvotes: 2
Selected Answer: E E. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.
Explanation:
In Development mode, Delta Live Tables persistently updates datasets at set intervals. The pipeline continuously processes incoming data until manually stopped or shut down.
Compute resources, including the cluster used for processing, persist without automatic restarts or retries (as it is the behavior in Development mode). This persistence allows for ongoing processing of data, enabling additional testing or continued data processing until the pipeline is manually shut down.
Therefore, option E accurately captures the behavior expected in Development mode, emphasizing the continuous update of datasets and the persistence of compute resources until the pipeline is manually terminated.
Comment 1089743 by kz_data
- Upvotes: 2
Selected Answer: E E seems the correct answer
Comment 1089109 by nedlo
- Upvotes: 2
Selected Answer: B Why E? The cluster persists with the same functionality as before, not specifically for "additional testing".
Comment 1084405 by 55f31c8
- Upvotes: 2
Selected Answer: E https://docs.databricks.com/en/delta-live-tables/updates.html#continuous-vs-triggered-pipeline-execution
Comment 1053328 by anandpsg101
- Upvotes: 2
Selected Answer: E E is correct
Comment 1049399 by SD5713
- Upvotes: 2
Selected Answer: E E. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.