Questions and Answers
Question Z0l0xrNhqM3foGy9Q1vJ
Question
The data engineering team is configuring environments for development, testing, and production before beginning migration on a new data pipeline. The team requires extensive testing on both the code and data resulting from code execution, and the team wants to develop and test against data as similar to production data as possible.
A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team.
Which statement captures best practices for this situation?
Choices
- A: All development, testing, and production code and data should exist in a single, unified workspace; creating separate environments for testing and development complicates administrative overhead.
- B: In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.
- C: Because access to production data will always be verified using passthrough credentials, it is safe to mount data to any Databricks development environment.
- D: Because Delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data; as such, it is generally safe to mount production data anywhere.
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1300569 by m79590530
- Upvotes: 2
Selected Answer: B Production data should be maximally secured against intentional and unintentional modifications by developers or workspace/UC admins. Setting it up with read-only access, and in a different catalog or schema/database per environment, is the best approach.
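For illustration, a minimal Unity Catalog sketch of the pattern option B describes: read-only access to production data plus an isolated database per environment. The catalog, schema, and group names (prod, dev, developers) are hypothetical.

```sql
-- Hypothetical names: catalog prod, schema prod.sales, group `developers`.
-- Grant the development team read-only access to production data.
GRANT USE CATALOG ON CATALOG prod TO `developers`;
GRANT USE SCHEMA ON SCHEMA prod.sales TO `developers`;
GRANT SELECT ON SCHEMA prod.sales TO `developers`;

-- Keep interactive development writes in an isolated, environment-specific schema.
CREATE SCHEMA IF NOT EXISTS dev.sales
COMMENT 'Isolated development counterpart of prod.sales';
```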
Question 7Y7X4WIWBYlfDDNFaRXV
Question
The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables.
Which approach will ensure that this requirement is met?
Choices
- A: Whenever a database is being created, make sure that the LOCATION keyword is used.
- B: When the workspace is being configured, make sure that external cloud object storage has been mounted.
- C: Whenever a table is being created, make sure that the LOCATION keyword is used.
- D: When tables are created, make sure that the UNMANAGED keyword is used in the CREATE TABLE statement.
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1307436 by benni_ale
- Upvotes: 1
Selected Answer: C The LOCATION keyword in a CREATE TABLE (or CTAS) statement is the only way to create external tables.
Comment 1300572 by m79590530
- Upvotes: 2
Selected Answer: C Creating a table with the LOCATION keyword makes it an external table. By contrast, creating the database/schema with the LOCATION keyword lets us set a specific location for a schema/database that still contains managed tables, provided the tables inside it are created without the LOCATION keyword. That approach places managed tables at specific locations while fully leveraging Databricks Lakehouse automatic optimizations and performance tuning for them.
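As a hedged sketch of option C, the statements below show how supplying LOCATION at table creation yields an external Delta table; the table names and cloud storage paths are made up for illustration.

```sql
-- Hypothetical table name and storage path: supplying LOCATION makes this an external table.
CREATE TABLE sales.transactions (
  transaction_id BIGINT,
  amount DOUBLE,
  transaction_date DATE
)
USING DELTA
LOCATION 'abfss://data@examplestorage.dfs.core.windows.net/tables/transactions';

-- A CTAS statement also produces an external table when LOCATION is supplied.
CREATE TABLE sales.transactions_2023
USING DELTA
LOCATION 'abfss://data@examplestorage.dfs.core.windows.net/tables/transactions_2023'
AS SELECT * FROM sales.transactions WHERE year(transaction_date) = 2023;
```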
Question sKVH8zmOupf3cNL9gwvU
Question
The marketing team is looking to share data in an aggregate table with the sales organization, but the field names used by the teams do not match, and a number of marketing-specific fields have not been approved for the sales org.
Which of the following solutions addresses the situation while emphasizing simplicity?
Choices
- A: Create a view on the marketing table selecting only those fields approved for the sales team; alias the names of any fields that should be standardized to the sales naming conventions.
- B: Create a new table with the required schema and use Delta Lake’s DEEP CLONE functionality to sync up changes committed to one table to the corresponding table.
- C: Use a CTAS statement to create a derivative table from the marketing table; configure a production job to propagate changes.
- D: Add a parallel table write to the current production pipeline, updating a new sales table that varies as required from the marketing table.
Answer: A Answer_ET: A Community answer A (100%) Discussion
Comment 1400014 by lakime
- Upvotes: 1
Selected Answer: A Yup, no overhead, easy
Comment 1281234 by db22
- Upvotes: 3
Selected Answer: A Creating a view is the simplest and most effective solution
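As a sketch of option A, the view below exposes only the columns approved for the sales org and aliases them to the sales naming conventions; all table and column names here are hypothetical.

```sql
-- Hypothetical names: marketing.campaign_aggregate is the source table,
-- sales.campaign_performance is the view shared with the sales org.
CREATE VIEW sales.campaign_performance AS
SELECT
  campaign_id,
  mktg_channel   AS channel,         -- renamed to the sales convention
  mktg_spend_usd AS spend_usd,       -- renamed to the sales convention
  conversions
FROM marketing.campaign_aggregate;
-- Marketing-specific fields that are not approved for sales are simply omitted.
```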
Question Tnq33F79NW6VTm2d2P1v
Question
A Delta Lake table representing metadata about content posts from users has the following schema:
user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE
This table is partitioned by the date column. A query is run with the following filter:
longitude < 20 & longitude > -20
Which statement describes how data will be filtered?
Choices
- A: Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.
- B: No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.
- C: The Delta Engine will scan the parquet file footers to identify each row that meets the filter criteria.
- D: Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1366219 by lakime
- Upvotes: 1
Selected Answer: D Yup - data skipping D)
Comment 1251666 by vexor3
- Upvotes: 2
Selected Answer: D D is correct
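To illustrate the data-skipping behavior behind option D, a sketch using a hypothetical table name (user_posts); the OPTIMIZE step is an optional extra for better skipping, not part of the question.

```sql
-- The filter from the question; date is the partition column, longitude is not.
SELECT *
FROM user_posts
WHERE longitude < 20 AND longitude > -20;

-- Delta Lake records per-file min/max statistics (by default on the first 32
-- columns) in the transaction log, so data files whose longitude range falls
-- entirely outside (-20, 20) can be skipped without being read.

-- Clustering related longitude values into the same files can further improve
-- skipping on this non-partition column:
OPTIMIZE user_posts ZORDER BY (longitude);
```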
Question JQ6JtZ32bfDu2NzFgGO0
Question
A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence applications. All the company’s data is stored in regional cloud storage in the United States.
The workspace administrator at the company is uncertain about where the Databricks workspace used by the contractors should be deployed.
Assuming that all data governance considerations are accounted for, which statement accurately informs this decision?
Choices
- A: Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines must be deployed in the region where the data is stored.
- B: Databricks workspaces do not rely on any regional infrastructure; as such, the decision should be made based upon what is most convenient for the workspace administrator.
- C: Cross-region reads and writes can incur significant costs and latency; whenever possible, compute should be deployed in the same region the data is stored.
- D: Databricks notebooks send all executable code from the user’s browser to virtual machines over the open internet; whenever possible, choosing a workspace region near the end users is the most secure.
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1301566 by benni_ale
- Upvotes: 1
Selected Answer: C C is correct
Comment 1229314 by hpkr
- Upvotes: 2
Selected Answer: C option C