Questions and Answers
Question M9TBKjpsMuwBYBlFAOqc
Question
An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code: df = spark.read.format("parquet").load(f"/mnt/source/{date}") Which code block should be used to create the date Python variable used in the above code block?
Choices
- A: date = spark.conf.get("date")
- B: input_dict = input() date = input_dict["date"]
- C: import sys date = sys.argv[1]
- D: date = dbutils.notebooks.getParam("date")
- E: dbutils.widgets.text("date", "null") date = dbutils.widgets.get("date")
Answer: E | Answer_ET: E | Community answer: E (96%)
Discussion
Comment 1173064 by hal2401me
- Upvotes: 10
Selected Answer: E dbutils.widgets. Just passed the exam with a score >80%. ExamTopics covers about 90% of the questions; there were 5 questions I didn't see here. But friends, you need to read the discussions and test yourself: many answers given here, even the most-voted ones, no longer exist in the exam - not the question, but the answer. Wish you all good luck, friends!
Comment 1410425 by ultimomassimo
- Upvotes: 1
Selected Answer: E E 100% There is no such thing as dbutils.notebooks.getParam, so no idea why some halfwits suggest D…
Comment 1361428 by shoaibmohammed3
- Upvotes: 1
Selected Answer: E dbutils.widgets is where all the params are stored; use .get to fetch them.
Comment 1358833 by johnserafim
- Upvotes: 1
Selected Answer: D D is correct!
The question states that the upstream system passes the date as a parameter to the Databricks Jobs API. In Databricks, when a parameter is passed to a notebook via the Jobs API, it can be retrieved using the dbutils.notebooks.getParam() method.
Option D directly retrieves the parameter value using this method, which is the correct approach for this scenario.
Comment 1351567 by EelkeV
- Upvotes: 1
Selected Answer: E It is the way to fill in a parameter in a notebook
Comment 1335691 by HairyTorso
- Upvotes: 2
Selected Answer: E Around question #130 they just repeat themselves. So it’s not 226 but around 130… Shame
Comment 1290627 by akashdesarda
- Upvotes: 1
Selected Answer: E The Jobs API allows sending parameters via job parameters. These must match the notebook's parameter names, and they can then be read using dbutils.widgets.get.
Comment 1265036 by HorskKos
- Upvotes: 4
E is correct because: A gets a configuration value from the Spark session; B reads a value from manual input, which is not relevant for a job run; C (sys.argv) gets the arguments used to run a Python script from the command line, which is unrelated; D is a function I couldn't find anywhere on the web, so I assume it doesn't exist. Therefore E is correct, though passing a date as a string parameter is bad practice; it's better to derive it with the datetime library and then use it in the code.
Comment 1247581 by Shailly
- Upvotes: 1
Answer is E. Even though the value is passed from an upstream system, you can create parameters using widgets inside the notebook and use the value as an input from the Databricks Jobs API.
Comment 1226795 by Isio05
- Upvotes: 1
Selected Answer: E Widgets are used to create parameters in notebook that can be then utilized by e.g. jobs
Comment 1222432 by imatheushenrique
- Upvotes: 1
E. dbutils.widgets.text("date", "null") date = dbutils.widgets.get("date")
Comment 1198680 by AziLa
- Upvotes: 1
correct ans is E
Comment 1195398 by Sosicha
- Upvotes: 1
Are you reading the question? It asks about an upstream system that has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. Upstream systems usually don't use widgets; widgets are made for humans. Only C and D are correct, but D is better, so D.
Comment 1159633 by hal2401me
- Upvotes: 1
Selected Answer: E Vote for E, dbutils.widgets.
Comment 1128049 by AziLa
- Upvotes: 1
Correct Ans is E
Comment 1121585 by Jay_98_11
- Upvotes: 2
Selected Answer: E E is correct
Comment 1113402 by RafaelCFC
- Upvotes: 2
Selected Answer: E In https://docs.databricks.com/en/notebooks/notebook-workflows.html#dbutilsnotebook-api the "run" example is an equivalent use case to E.
Comment 1102660 by kz_data
- Upvotes: 2
Selected Answer: E E is correct
Comment 1027115 by chokthewa
- Upvotes: 1
I think D is correct answer, refer to https://docs.databricks.com/en/notebooks/notebook-workflows.html#dbutilsnotebook-api
Comment 981132 by BrianNguyen95
- Upvotes: 3
E is correct answer
Comment 977584 by lokvamsi
- Upvotes: 1
Selected Answer: E Correct. Ans: E
Comment 969962 by Happy_Prince
- Upvotes: 2
Correct
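The consensus answer (E) can be sketched end to end. Below is a minimal, hedged example of how the notebook would read a job parameter named date; the mount path comes from the question, while the default value and sample date are illustrative only.

```python
# Minimal sketch of answer E: declare a "date" widget and read it.
# When the Jobs API passes a notebook parameter named "date", dbutils.widgets.get
# returns that value; the default ("null") is only used for interactive runs.
dbutils.widgets.text("date", "null")
date = dbutils.widgets.get("date")   # e.g. "2024-03-15" when supplied by the job

df = spark.read.format("parquet").load(f"/mnt/source/{date}")
```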
Question sk2J5xKeNzrykM27zMEC
Question
A Delta table of weather records is partitioned by date and has the below schema: date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT To find all the records from within the Arctic Circle, you execute a query with the below filter: latitude > 66.3 Which statement describes how the Delta engine identifies which files to load?
Choices
- A: All records are cached to an operational database and then the filter is applied
- B: The Parquet file footers are scanned for min and max statistics for the latitude column
- C: All records are cached to attached storage and then the filter is applied
- D: The Delta log is scanned for min and max statistics for the latitude column
- E: The Hive metastore is scanned for min and max statistics for the latitude column
Answer: D | Answer_ET: D | Community answer: D (90%)
Discussion
Comment 988204 by taif12340
- Upvotes: 22
Answer D:
In the Transaction log, Delta Lake captures statistics for each data file of the table. These statistics indicate per file:
- Total number of records
- Minimum value in each column of the first 32 columns of the table
- Maximum value in each column of the first 32 columns of the table
- Null value counts for each column of the first 32 columns of the table
When a query with a selective filter is executed against the table, the query optimizer uses these statistics to generate the query result. It leverages them to identify data files that may contain records matching the conditional filter. For the query in the question, the transaction log is scanned for min and max statistics for the latitude column.
Comment 1365581 by johnserafim
- Upvotes: 2
Selected Answer: B B is correct!
Delta Lake stores min/max statistics for each column in the Parquet file footers. The engine scans these footers to determine if a file contains any data that satisfies the latitude > 66.3 condition. If the minimum latitude in a file is greater than 66.3, the file is loaded. If the maximum latitude is less than or equal to 66.3, the file is skipped.
Comment 1290673 by akashdesarda
- Upvotes: 3
Selected Answer: D The points above are correct. If this were just a Parquet table, the Parquet file footers would be used. But since this is a Delta table, the Delta log is used to scan for and skip files, using the stats written in the transaction log.
Comment 1268752 by AndreFR
- Upvotes: 2
Answer D :
Delta data skipping automatically collects the stats (min, max, etc.) for the first 32 columns for each underlying Parquet file when you write data into a Delta table. Databricks takes advantage of this information (minimum and maximum values) at query time to skip unnecessary files in order to speed up the queries.
https://www.databricks.com/discover/pages/optimize-data-workloads-guide#delta-data
Comment 1267459 by saravanan289
- Upvotes: 2
Selected Answer: D Delta table stores file statistics in transaction log
Comment 1237632 by 03355a2
- Upvotes: 2
Selected Answer: D No explanation needed, this is where the information is stored.
Comment 1224441 by imatheushenrique
- Upvotes: 1
D. The Delta log is scanned for min and max statistics for the latitude column
Comment 1213855 by coercion
- Upvotes: 1
Selected Answer: D Delta log collects statistics like min value, max value, no of records, no of files for each transaction that happens on the table for the first 32 columns (default value)
Comment 1204721 by Tayari
- Upvotes: 1
Selected Answer: D D is the answer
Comment 1183436 by arik90
- Upvotes: 1
Selected Answer: D Based on the docs it's D; I don't know why B is showing here.
Comment 1172271 by alexvno
- Upvotes: 1
Selected Answer: D Delta log first
Comment 1170375 by DavidRou
- Upvotes: 1
Selected Answer: D Statistics on first 32 columns of a table are computed and written in the Delta Log by default.
Comment 1162470 by vikram12apr
- Upvotes: 1
Selected Answer: D D is the right answer
Comment 1161000 by Curious76
- Upvotes: 1
Selected Answer: D D is the answer
Comment 1152224 by kkravets
- Upvotes: 1
Selected Answer: D D is correct one
Comment 1145974 by RiktRikt007
- Upvotes: 2
I checked the delta log, and it does store stats: "stats":"{"numRecords":1,"minValues":{"id":1,"name":"one","age":11},"maxValues":{"id":1,"name":"one","age":11},"nullCount":{"id":0,"name":0,"age":0}}"
Comment 1128075 by AziLa
- Upvotes: 1
correct ans is D
Comment 1121654 by Jay_98_11
- Upvotes: 1
Selected Answer: D D for sure
Comment 1118554 by kz_data
- Upvotes: 1
Selected Answer: D I think the correct answer is D
Comment 1116057 by ranith
- Upvotes: 1
_delta_log contains the max and min of each column for the first 30 odd columns in a table for each partition. Also there is nothing called parquet file footers. Correct answer is D.
Comment 1102119 by chowchowchow
- Upvotes: 1
ChatGPT votes B as well
Comment 1086046 by hamzaKhribi
- Upvotes: 1
For me D is correct, as statistics for the first 32 columns are collected in the delta log
Comment 1060780 by BIKRAM063
- Upvotes: 1
D is correct, Transaction log will be scanned
Comment 1043874 by jms309
- Upvotes: 1
Selected Answer: D D is the correct answer
Comment 991545 by Eertyy
- Upvotes: 3
D is correct answer
Comment 970495 by tusharl
- Upvotes: 3
D is correct Answer
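The per-file statistics the voters refer to can be inspected directly. This is a minimal sketch, assuming a hypothetical table location /mnt/weather; the "add" actions in the _delta_log JSON files carry a stats string with minValues/maxValues that the engine uses to decide which files to load.

```python
# Minimal sketch: look at the min/max statistics Delta records per data file.
# /mnt/weather is a hypothetical path; point it at any Delta table you own.
import json

# Each line in a _delta_log commit file is an action; "add" actions describe data files.
actions = spark.read.json("/mnt/weather/_delta_log/*.json")
adds = actions.where("add IS NOT NULL").select("add.path", "add.stats").collect()

for row in adds:
    if row["stats"] is None:
        continue  # stats may be absent if they were never collected for this file
    stats = json.loads(row["stats"])
    lat_min = stats["minValues"]["latitude"]
    lat_max = stats["maxValues"]["latitude"]
    # For the filter latitude > 66.3, a file whose max latitude is <= 66.3 can be
    # skipped without ever opening its Parquet footer.
    print(row["path"], lat_min, lat_max)
```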
Question 7wcwoe7HFa64NKQXGWGy
Question
The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.
A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.
Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to use these credentials?
Choices
- A: “Manage” permissions should be set on a secret key mapped to those credentials that will be used by a given team.
- B: “Read” permissions should be set on a secret key mapped to those credentials that will be used by a given team.
- C: “Read” permissions should be set on a secret scope containing only those credentials that will be used by a given team.
- D: “Manage” permissions should be set on a secret scope containing only those credentials that will be used by a given team.
- E: No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.
Answer: C | Answer_ET: C | Community answer: C (100%)
Discussion
Comment 1141644 by vctrhugo
- Upvotes: 4
Selected Answer: C In Databricks, secret scopes are used to manage and organize secrets. By setting “Read” permissions on a secret scope containing the credentials, you allow the team to access the necessary credentials without granting unnecessary privileges. This approach ensures that the teams have the minimum necessary access to the credentials required for connecting to the external database. “Manage” permissions would provide more access than needed for just using the credentials.
Options A and B suggest setting permissions on individual secret keys, which might work, but using a secret scope for organizational purposes is a cleaner and more scalable solution.
Comment 1136731 by Somesh512
- Upvotes: 2
Selected Answer: C Access is at scope level and not key level
Comment 1086019 by petrv
- Upvotes: 1
Selected Answer: C In summary, while technically feasible, setting “Read” permissions on a secret key might not be the most efficient or scalable solution when dealing with multiple teams and their corresponding credentials. Using secret scopes provides a more organized and maintainable approach for managing secrets in Databricks.
Comment 1080934 by Enduresoul
- Upvotes: 3
Selected Answer: C Answer C is correct: https://docs.databricks.com/en/security/auth-authz/access-control/secret-acl.html#secret-access-control “Access control for secrets is managed at the secret scope level”
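As a concrete illustration of answer C, each team could get its own scope with READ permission. The sketch below is hypothetical: the scope, group, key names, and JDBC connection details are invented for the example, and it assumes the appropriate JDBC driver is installed on the cluster.

```python
# Minimal sketch of answer C: grant READ on a per-team scope, then read the
# credentials inside a notebook. Scope, group, key, and JDBC details are hypothetical.
#
# ACLs are set outside the notebook, e.g. with the (legacy) Databricks CLI:
#   databricks secrets put-acl --scope finance-team-scope --principal finance-team --permission READ

user = dbutils.secrets.get(scope="finance-team-scope", key="external-db-user")
pwd = dbutils.secrets.get(scope="finance-team-scope", key="external-db-password")

# Secret values are redacted in notebook output but can be used in a JDBC read:
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://external-db.example.com:5432/sales")
      .option("dbtable", "transactions")
      .option("user", user)
      .option("password", pwd)
      .load())
```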
Question VeABWHbIxbNNJR0z2ajA
Question
Which indicators would you look for in the Spark UI’s Storage tab to signal that a cached table is not performing optimally? Assume you are using Spark’s MEMORY_ONLY storage level.
Choices
- A: Size on Disk is < Size in Memory
- B: The RDD Block Name includes the “*” annotation signaling a failure to cache
- C: Size on Disk is > 0
- D: The number of Cached Partitions > the number of Spark Partitions
- E: On Heap Memory Usage is within 75% of Off Heap Memory Usage
Answer: C | Answer_ET: C | Community answer: C (100%)
Discussion
Comment 1141643 by vctrhugo
- Upvotes: 7
Selected Answer: C C. Size on Disk is > 0
When using Spark’s MEMORY_ONLY storage level, the ideal scenario is that the data is fully cached in memory, and the Size on Disk should be 0 (indicating that the data is not spilled to disk). If the Size on Disk is greater than 0, it suggests that some data has been spilled to disk, which can lead to degraded performance as reading from disk is slower than reading from memory.
Comment 1323543 by benni_ale
- Upvotes: 1
Selected Answer: C I think is C
Comment 1226802 by Isio05
- Upvotes: 2
Selected Answer: C In this case any data on disk means that cache is not performing optimally
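A quick way to reproduce what the Storage tab shows is to cache something with MEMORY_ONLY and materialize it; the DataFrame below is synthetic and only for illustration.

```python
# Minimal sketch: cache with MEMORY_ONLY, materialize it, then check the
# Spark UI's Storage tab. With MEMORY_ONLY, "Size on Disk" should stay 0;
# anything greater than 0 signals the cache is not behaving as intended.
from pyspark import StorageLevel

df = spark.range(1_000_000)           # synthetic data, purely for illustration
df.persist(StorageLevel.MEMORY_ONLY)
df.count()                            # an action is needed to populate the cache

print(df.storageLevel)                # confirms disk is not part of the storage level
```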
Question CNRCo07OuYSxwtCFTb2l
Question
What is the first line of a Databricks Python notebook when viewed in a text editor?
Choices
- A: %python
- B: // Databricks notebook source
- C: # Databricks notebook source
- D: -- Databricks notebook source
- E: # MAGIC %python
Answer: C | Answer_ET: C | Community answer: C (100%)
Discussion
Comment 1119503 by bacckom
- Upvotes: 8
Selected Answer: C Python: # Databricks notebook source; SQL: -- Databricks notebook source; Scala: // Databricks notebook source; R: # Databricks notebook source
Comment 1237014 by Ati1362
- Upvotes: 2
Selected Answer: C C is the answer
Comment 1111556 by divingbell17
- Upvotes: 2
Selected Answer: C https://docs.databricks.com/en/notebooks/notebook-export-import.html#import-a-file-and-convert-it-to-a-notebook
Comment 1076842 by aragorn_brego
- Upvotes: 2
Selected Answer: C This is the correct line that you would find at the top of a Databricks notebook when viewed in a text editor, especially for Python notebooks. The # symbol is used for comments in Python, and the comment # Databricks notebook source is used by Databricks to indicate the start of the notebook’s source code in the plain text file.
These lines are comments in the respective languages (Scala uses // and SQL uses -- for single-line comments) and indicate the beginning of the Databricks notebook content in the text file.
Comment 1075100 by AWSMaster69
- Upvotes: 1
Selected Answer: C The Answer is C, Just downloaded a notebook from Databricks and viewed it in a text editor.
Comment 1071617 by 60ties
- Upvotes: 1
Selected Answer: C Answer is C
Comment 1071615 by 60ties
- Upvotes: 1
// Databricks notebook source - Scala
# Databricks notebook source - Python
-- Databricks notebook source - SQL
Answer is C
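To see the marker in context, here is a small sketch of what an exported Python notebook source file looks like; the cell contents are invented, while the "# Databricks notebook source" header, "# COMMAND ----------" separators, and "# MAGIC" prefixes are the standard markers.

```python
# Minimal sketch: the text of an exported Databricks Python notebook (.py source file).
# The first line is exactly the marker from answer C; cells are separated by
# "# COMMAND ----------" and non-Python cells are wrapped with "# MAGIC".
exported = """\
# Databricks notebook source
print("first cell")

# COMMAND ----------

# MAGIC %sql
# MAGIC SELECT 1
"""

print(exported.splitlines()[0])   # -> # Databricks notebook source
```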