Questions and Answers

Question G1IIs43ohieuOtpgXvZx

Question

Which statement describes the default execution mode for Databricks Auto Loader?

Choices

  • A: Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; the target table is materialized by directly querying all valid files in the source directory.
  • B: New files are identified by listing the input directory; the target table is materialized by directly querying all valid files in the source directory.
  • C: Webhooks trigger a Databricks job to run anytime new data arrives in a source directory; new data are automatically merged into target tables using rules inferred from the data.
  • D: New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.
  • E: Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and idempotently loaded into the target Delta Lake table.
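
For reference, a minimal Auto Loader sketch in PySpark (the source path, checkpoint location, and table name are hypothetical). By default, Auto Loader discovers new files by listing the input directory and, using its checkpoint, loads each file into the target Delta table exactly once; file notification mode must be enabled explicitly with the cloudFiles.useNotifications option.

  # Hypothetical paths and table name; default (directory listing) discovery mode.
  (spark.readStream
      .format("cloudFiles")                     # Auto Loader source
      .option("cloudFiles.format", "json")      # format of the arriving files
      .load("/mnt/landing/events/")             # input directory that is listed for new files
      .writeStream
      .option("checkpointLocation", "/mnt/checkpoints/events/")  # tracks ingested files for idempotent loads
      .toTable("bronze_events"))                # incremental writes into a Delta table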

Question LsqagrTu1dqqMKpyA36U

Question

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

Based on the above schema, which column is a good candidate for partitioning the Delta table?

Choices

  • A: post_time
  • B: latitude
  • C: post_id
  • D: user_id
  • E: date
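
For reference, a sketch of writing this table partitioned on its date column (the DataFrame and table name are hypothetical). A date column works well because it is low-cardinality and commonly appears in query filters, whereas high-cardinality columns such as post_id, user_id, or the float coordinates would produce a very large number of small partitions.

  # posts_df is a hypothetical DataFrame matching the schema above.
  (posts_df.write
      .format("delta")
      .partitionBy("date")          # low-cardinality column that queries frequently filter on
      .mode("append")
      .saveAsTable("user_posts"))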

Question 02gUaptuYlAMQae3eZRi

Question

The data engineering team has configured a job to process customer requests to be forgotten (to have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings. The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday; the total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization.

The compliance officer has recently learned about Delta Lake’s time travel functionality and is concerned that it might allow continued access to deleted data. Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?

Choices

  • A: Because the VACUUM command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.
  • B: Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the VACUUM job is run the following day.
  • C: Because Delta Lake time travel provides full access to the entire history of a table, deleted records can always be recreated by users with full admin privileges.
  • D: Because Delta Lake’s delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.
  • E: Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the VACUUM job is run 8 days later.
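
For reference, the VACUUM behavior at issue, shown via spark.sql (the table name is hypothetical). By default, Delta Lake retains data files for 7 days (168 hours) after they stop being referenced; VACUUM only removes files older than that threshold, and time travel into deleted data remains possible until those files are actually removed.

  # Hypothetical table name; the default retention threshold is 7 days (168 hours).
  spark.sql("VACUUM user_messages")                    # removes only unreferenced files older than 7 days
  spark.sql("VACUUM user_messages RETAIN 168 HOURS")   # the same threshold, stated explicitly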

Question VYbv6i0YiCrD11hUNgc3

Question

A large company wants to implement a near real-time solution in which hundreds of pipelines update many tables in parallel with extremely high-volume, high-velocity data.

Which of the following solutions would you implement to meet this requirement?

Choices

  • A: Use Databricks High Concurrency clusters, which leverage optimized cloud storage connections to maximize data throughput.
  • B: Partition ingestion tables by a small time duration to allow for many data files to be written in parallel.
  • C: Configure Databricks to save all data to attached SSD volumes instead of object storage, increasing file I/O significantly.
  • D: Isolate Delta Lake tables in their own storage containers to avoid API limits imposed by cloud vendors.
  • E: Store all tables in a single database to ensure that the Databricks Catalyst Metastore can load balance overall throughput.

Question DTp4kPnjC3Wx52BA5wHy

Question

Which describes a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?

Choices

  • A: Run source env/bin/activate in a notebook setup script
  • B: Use b in a notebook cell
  • C: Use %pip install in a notebook cell
  • D: Use %sh pip install in a notebook cell
  • E: Install libraries from PyPI using the cluster UI
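
For reference, a notebook cell using the %pip magic (the package and version pin are hypothetical). %pip installs the package into the notebook-scoped Python environment on every node of the active cluster, making the library available to that notebook without affecting other notebooks attached to the same cluster.

  %pip install requests==2.31.0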