Questions and Answers
Question eXSbr0mUty1itWZG2bH0
Question
Which of the following describes the relationship between Gold tables and Silver tables?
Choices
- A: Gold tables are more likely to contain aggregations than Silver tables.
- B: Gold tables are more likely to contain valuable data than Silver tables.
- C: Gold tables are more likely to contain a less refined view of data than Silver tables.
- D: Gold tables are more likely to contain more data than Silver tables.
- E: Gold tables are more likely to contain truthful data than Silver tables.
Answer: A Answer_ET: A Community answer A (88%) 12% Discussion
Comment 997985 by vctrhugo
- Upvotes: 6
Selected Answer: A A. Gold tables are more likely to contain aggregations than Silver tables.
In some data processing pipelines, especially those following a typical “Bronze-Silver-Gold” data lakehouse architecture, Silver tables are often considered a more refined version of the raw or Bronze data. Silver tables may include data cleansing, schema enforcement, and some initial transformations.
Gold tables, on the other hand, typically represent a stage where data is further enriched, aggregated, and processed to provide valuable insights for analytical purposes. This could indeed involve more aggregations compared to Silver tables.
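To make the Silver-to-Gold step concrete, a Gold table is often just a business-level aggregation built on top of a Silver table. A minimal sketch follows; all table and column names are hypothetical, and on Databricks the statement would be run via `spark.sql`:

```python
# Hypothetical Silver -> Gold step: the Gold table is an aggregation over
# cleaned Silver data. Names are illustrative, not from any real schema.
GOLD_DAILY_SALES_SQL = """
CREATE OR REPLACE TABLE gold_daily_sales AS
SELECT order_date,
       region,
       SUM(amount) AS total_amount,  -- aggregation: the hallmark of Gold
       COUNT(*)    AS order_count
FROM   silver_orders                 -- cleaned, deduplicated Silver table
GROUP  BY order_date, region
"""

# On Databricks this would be executed as: spark.sql(GOLD_DAILY_SALES_SQL)
```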
Comment 1314163 by 806e7d2
- Upvotes: 1
Selected Answer: A In the medallion architecture commonly used in Delta Lake and Databricks, the relationship between Gold and Silver tables is as follows:
Bronze Tables: Raw, unprocessed data directly ingested from the source. Silver Tables: Cleaned and enriched data, often with transformations applied for a more refined view. Gold Tables: Highly refined data, typically containing business-level aggregations, metrics, and summaries that are ready for analytics or reporting.
Comment 1280088 by 6aa83ae
- Upvotes: 1
Selected Answer: A Gold data is often refined and aggregated.
Comment 1276082 by CID2024
- Upvotes: 1
A. Gold tables are more likely to contain aggregations than Silver tables. In the Delta Lake architecture, Silver tables typically contain cleaned and enriched data that has been transformed from raw data (Bronze tables). Gold tables, on the other hand, are often used for business-level aggregates, reporting, and analytics. They are built on top of Silver tables and provide a more refined and aggregated view of the data, making them more likely to contain aggregations.
Comment 1262429 by 80370eb
- Upvotes: 1
Selected Answer: A In the typical data pipeline architecture, Gold tables are often the final layer and contain aggregated, high-value insights that are ready for reporting and analysis. Silver tables usually contain more detailed and refined data that is processed from the raw or Bronze tables but may not yet be aggregated.
Comment 1253899 by Bujji1234
- Upvotes: 1
A
This gold data is often highly refined and aggregated, containing data that powers analytics, machine learning, and production applications. While all tables in the lakehouse should serve an important purpose, gold tables represent data that has been transformed into knowledge, rather than just information.
Analysts largely rely on gold tables for their core responsibilities, and data shared with a customer would rarely be stored outside this level.
Comment 1246608 by rukerrc
- Upvotes: 1
Correct answer is B!
Comment 1244554 by 3fbc31b
- Upvotes: 2
Selected Answer: B Correct answer is B. I saw this from the Databricks practice test. It’s a little blurry, but this is the correct answer.
Comment 1226040 by jaromarg
- Upvotes: 1
In some data processing pipelines, particularly those following a “Bronze-Silver-Gold” data lakehouse architecture, Silver tables are indeed considered a more refined version of raw or Bronze data. Gold tables, which represent the final stage of data processing, typically contain highly refined, aggregated, and ready-to-consume data. Therefore, it’s common for Gold tables to contain aggregations, as they often represent the final, summarized, and aggregated view of the data. On the other hand, Silver tables may contain partially aggregated or cleansed data but are not typically the final destination for aggregated data. “Gold tables are more likely to contain aggregations than Silver tables” is accurate, making option A a valid choice.
Comment 1221969 by Dusica
- Upvotes: 1
A; raw data = bronze data > silver data > golden data. C is the exact opposite, and wrong.
Comment 1214402 by carlosmps
- Upvotes: 1
Raw Data > Bronze Data > Silver Data > Golden Data
Comment 1203442 by benni_ale
- Upvotes: 1
Selected Answer: A correct is A
Comment 1113652 by SerGrey
- Upvotes: 1
Selected Answer: A Correct is A
Comment 1064865 by awofalus
- Upvotes: 1
Selected Answer: A Correct: A
Comment 978342 by Inhaler_boy
- Upvotes: 2
To me it seems A and E are equally correct. "Truthful" is not well defined in the question. But the Gold layer typically has more rules and transformations in order to be consumed by business and reports. So it could be interpreted as more "truthful". Or am I wrong here?
Comment 946467 by Atnafu
- Upvotes: 1
B. Two types of tables in the Delta Lake data lake architecture: Gold tables are the most refined and valuable tables in the data lake, while Silver tables are less refined and less valuable. Gold tables are typically used for downstream analysis and reporting, while Silver tables are typically used for data exploration and experimentation.
Gold tables typically contain the most refined, high-quality, and valuable data in an organization’s data architecture. They often represent the final output or result of data processing pipelines, where data has undergone extensive cleansing, transformation, and aggregation. Gold tables are typically used for critical business analysis, reporting, and decision-making processes.
Option A: Gold tables are not necessarily more likely to contain aggregations than Silver tables. Option C: Gold tables are more likely to contain a more refined view of data than Silver tables. Option D: Gold tables are not necessarily more likely to contain more data than Silver tables.
Comment 861959 by 4be8126
- Upvotes: 1
Selected Answer: B The correct answer is B. Gold tables are typically considered to be the most valuable and trusted data assets in an organization. They represent the final, refined view of the data after all cleaning, transformations, and enrichments have been performed. Silver tables are the intermediate tables that feed into the Gold tables, and are typically used to perform data cleansing, filtering, and enrichment before the data is promoted to Gold.
Comment 861133 by rafahb
- Upvotes: 2
Selected Answer: A A is correct
Comment 859698 by surrabhi_4
- Upvotes: 4
Selected Answer: A Option A
Comment 858105 by XiltroX
- Upvotes: 4
Selected Answer: A THE ANSWER C IS INCORRECT! Silver tables usually contain data that is commonly a little more refined than Bronze tables. Meaning they contain data that is likely cleaned and contains no duplicates. Gold tables usually contain aggregate or “corrected” data.
Question quFCk94vGp9eM47FRkbI
Question
Which of the following describes the relationship between Bronze tables and raw data?
Choices
- A: Bronze tables contain less data than raw data files.
- B: Bronze tables contain more truthful data than raw data.
- C: Bronze tables contain aggregates while raw data is unaggregated.
- D: Bronze tables contain a less refined view of data than raw data.
- E: Bronze tables contain raw data with a schema applied.
Answer: E Answer_ET: E Community answer E (100%) Discussion
Comment 858146 by XiltroX
- Upvotes: 14
Selected Answer: E Bronze tables are basically raw ingested data, often with schema borrowed from the original data source or table. Correct answer is E.
Comment 1314190 by 806e7d2
- Upvotes: 1
Selected Answer: E In the medallion architecture, Bronze tables are the first stage in the data pipeline and directly represent raw data ingested into the system. The raw data is stored in its original form but typically has a schema applied to make it queryable and usable within a structured data processing framework like Delta Lake.
Why E is correct: Bronze tables contain the raw data as-is but with a defined schema to enable easier downstream processing and integration. This schema provides structure to the otherwise unstructured or semi-structured raw data.
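The "raw data with a schema applied" idea can be sketched as a Bronze ingestion step. This is only an illustration: the path, table, and field names are made up, and on Databricks the predefined `spark` session would run the commented lines:

```python
# Hypothetical Bronze ingestion: raw JSON files are read with an explicit
# schema (DDL string form), then persisted as a Delta table. The applied
# schema is what distinguishes the Bronze table from the raw files.
BRONZE_SCHEMA_DDL = "event_id STRING, event_ts TIMESTAMP, payload STRING"

# bronze_df = (spark.read
#                .schema(BRONZE_SCHEMA_DDL)   # schema applied to raw data
#                .json("/mnt/landing/events/"))
# bronze_df.write.format("delta").mode("append").saveAsTable("bronze_events")
```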
Comment 1291626 by joaosanti
- Upvotes: 1
Selected Answer: E Correct is E
Comment 1203443 by benni_ale
- Upvotes: 2
Selected Answer: E Still, I am not sure about the schema, as I thought that correct types are usually defined in Silver, while in Bronze everything is stored as strings.
Comment 1113654 by SerGrey
- Upvotes: 4
Selected Answer: E Correct is E
Comment 1064869 by awofalus
- Upvotes: 1
Selected Answer: E E is correct
Comment 1058848 by DavidRou
- Upvotes: 3
Selected Answer: E E is the right answer. Bronze data are simply a more structured (in terms of schema) version of raw data to be found in the “landing area”.
Comment 997986 by vctrhugo
- Upvotes: 2
Selected Answer: E E. Bronze tables contain raw data with a schema applied.
In a typical data processing pipeline following a “Bronze-Silver-Gold” data lakehouse architecture, Bronze tables are the initial stage where raw data is ingested and transformed into a structured format with a schema applied. The schema provides structure and meaning to the raw data, making it more usable and accessible for downstream processing.
Therefore, Bronze tables contain the raw data but in a structured and schema-enforced format, which makes them distinct from the unprocessed, unstructured raw data files.
Comment 958794 by akk_1289
- Upvotes: 3
Ans : E
The Bronze layer is where we land all the data from external source systems. The table structures in this layer correspond to the source system table structures “as-is,” along with any additional metadata columns that capture the load date/time, process ID, etc. The focus in this layer is quick Change Data Capture and the ability to provide an historical archive of source (cold storage), data lineage, auditability, reprocessing if needed without rereading the data from the source system. https://www.databricks.com/glossary/medallion-architecture#:~:text=Bronze%20layer%20%28raw%20data%29
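The "source structure as-is, plus metadata columns" pattern described above can be sketched like this. The column names and expressions are hypothetical, and the commented line shows how they might be applied with PySpark's `selectExpr` on Databricks:

```python
# Hypothetical Bronze metadata columns capturing load time, lineage, and
# the process that performed the load, as the comment above describes.
BRONZE_METADATA_COLUMNS = {
    "_ingest_ts":   "current_timestamp()",    # load date/time
    "_source_file": "_metadata.file_path",    # lineage back to the raw file
    "_process_id":  "'nightly_batch_01'",     # illustrative process ID literal
}

# On Databricks, these could be appended to the as-is source columns:
# bronze_df = raw_df.selectExpr(
#     "*", *[f"{expr} AS {name}" for name, expr in BRONZE_METADATA_COLUMNS.items()])
```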
Comment 958791 by akk_1289
- Upvotes: 1
Ans: E https://www.databricks.com/glossary/medallion-architecture#:~:text=Bronze%20layer%20%28raw%20data%29
Comment 946471 by Atnafu
- Upvotes: 2
E Bronze tables are the foundation of the Delta Lake data lake architecture. They are created from raw data files and contain a schema that describes the data. This makes it easy to query and analyze the data in Bronze tables.
Raw data files, on the other hand, do not have a schema applied. This means that it can be difficult to query and analyze the data in raw data files.
Option A: Bronze tables typically contain more data than raw data files, because they include the schema.
Option B: There is no indication that Bronze tables contain more truthful data than raw data.
Option C: Bronze tables can contain aggregates, but they do not have to.
Option D: Bronze tables typically contain a more refined view of data than raw data, because they include the schema.
Comment 861139 by rafahb
- Upvotes: 2
Selected Answer: E E option
Comment 859699 by surrabhi_4
- Upvotes: 3
Selected Answer: E Option E
Question JCWgsp4MCn7dK07gf27M
Question
Which of the following is hosted completely in the control plane of the classic Databricks architecture?
Choices
- A: Worker node
- B: JDBC data source
- C: Databricks web application
- D: Databricks Filesystem
- E: Driver node
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 997858 by vctrhugo
- Upvotes: 15
Selected Answer: C C. Databricks web application
In the classic Databricks architecture, the control plane includes components like the Databricks web application, the Databricks REST API, and the Databricks Workspace. These components are responsible for managing and controlling the Databricks environment, including cluster provisioning, notebook management, access control, and job scheduling.
The other options, such as worker nodes, JDBC data sources, Databricks Filesystem (DBFS), and driver nodes, are typically part of the data plane or the execution environment, which is separate from the control plane. Worker nodes are responsible for executing tasks and computations, JDBC data sources are used to connect to external databases, DBFS is a distributed file system for data storage, and driver nodes are responsible for coordinating the execution of Spark jobs.
Comment 854101 by h79
- Upvotes: 15
I disagree with this answer. I think it's the Databricks web app that is always in the control plane.
Comment 1339005 by Tedet
- Upvotes: 1
Selected Answer: C Refer to architecture of Lakehouse
Comment 1312056 by 806e7d2
- Upvotes: 2
Selected Answer: C In the classic Databricks architecture, the Databricks web application (which includes the Databricks user interface, job scheduling, and management components) is hosted entirely in the control plane. This control plane is managed by Databricks and contains the user-facing services and APIs that allow users to interact with the Databricks environment.
Here’s why the other options don’t fit:
A. Worker node and E. Driver node: Both the driver and worker nodes are part of the data plane, where actual data processing occurs. B. JDBC data source: This is external to Databricks, typically hosted wherever the source database resides, and does not exist within the control plane. D. Databricks Filesystem (DBFS): While DBFS metadata might be managed by the control plane, the actual data is stored in the data plane (often within the customer’s cloud account, such as in AWS S3 or Azure Data Lake Storage).
Comment 889031 by Majjjj
- Upvotes: 3
Selected Answer: C The control plane in the classic Databricks architecture is responsible for managing the Databricks workspace, user and group management, and cluster management, among other things. The Databricks web application is a part of the control plane that enables users to interact with the workspace, create and manage clusters, and work with notebooks, jobs, and data. Worker nodes and driver nodes are part of the data plane, which is responsible for executing data processing tasks. JDBC data sources and the Databricks Filesystem are services that are used by both the control plane and the data plane.
Comment 1274179 by afzalmp40
- Upvotes: 1
Selected Answer: C C is correct
Comment 1271032 by Teja_50
- Upvotes: 1
Selected Answer: C C is correct
Comment 1262384 by 80370eb
- Upvotes: 1
Selected Answer: C It's C - the control plane has the Databricks web application.
Comment 1249369 by ranjan24
- Upvotes: 1
It's C. The other options, such as worker nodes, JDBC data sources, Databricks Filesystem (DBFS), and driver nodes, are typically part of the data plane or the execution environment, which is separate from the control plane.
Comment 1227522 by mascarenhaslucas
- Upvotes: 1
Selected Answer: C The answer is C! According to the Databricks documentation, a cluster consists of one driver node and zero or more worker nodes; by default, the driver node uses the same instance type as the worker node.
Comment 1200789 by pierrickaosis
- Upvotes: 1
Who decides the correct answer on this website? CertiIQ says C; ITExams says E… For me it's C.
Comment 1188033 by benni_ale
- Upvotes: 1
Selected Answer: C Nodes are on the Data Plane. I think the Web App is the only one in the Control Pane.
Comment 1177153 by Itmma
- Upvotes: 1
Selected Answer: C C is correct
Comment 1166586 by kirshoff
- Upvotes: 1
Answer is C: https://docs.databricks.com/en/_images/databricks-architecture-aws.png
Comment 1137284 by agAshish
- Upvotes: 1
E. Driver Node is the correct answer. In the classic Databricks architecture, the control plane includes components responsible for managing and coordinating the execution of tasks. The driver node is part of the control plane, and it handles the coordination and execution of the overall Spark application.
Comment 1105826 by poundmanluffy
- Upvotes: 2
Selected Answer: C Webapplication always resides in Control Plane
Comment 1104696 by SerGrey
- Upvotes: 1
Selected Answer: C C is correct
Comment 1086064 by CHHIPA
- Upvotes: 1
Selected Answer: C
Comment 1081285 by Ajinkyavsawant7
- Upvotes: 1
Selected Answer: C C is correct
Comment 1071065 by Huroye
- Upvotes: 1
It is the web UI. C
Comment 1022306 by Sriramiyer92
- Upvotes: 2
Reading material: https://learn.microsoft.com/en-us/azure/databricks/getting-started/overview
Comment 1017338 by KalavathiP
- Upvotes: 1
Selected Answer: C Correct ans C
Comment 1016523 by d_b47
- Upvotes: 1
Selected Answer: C web application
Comment 991828 by Lipon23
- Upvotes: 1
Selected Answer: C Databricks Web App for sure
Comment 966937 by Gajen100
- Upvotes: 1
Selected Answer: C Databricks web application
Comment 955078 by Jannat_13
- Upvotes: 2
Can someone help me with access to all the questions?
Comment 886972 by rickwolfe86
- Upvotes: 3
Selected Answer: C Option C
Comment 873288 by rafahb
- Upvotes: 2
Selected Answer: C Option C
Comment 867429 by SireeJ
- Upvotes: 1
Option: C
Comment 863834 by Data_4ever
- Upvotes: 1
Selected Answer: C Databricks web application is hosted in Control plane
Comment 859600 by surrabhi_4
- Upvotes: 2
Selected Answer: C Option C
Comment 851907 by azurearch
- Upvotes: 3
https://learn.microsoft.com/en-us/azure/databricks/getting-started/overview databricks webapp is in control plane. driver nodes are in data plane
Question wlXDUQeT2XxwyiowLnS5
Question
Which of the following tools is used by Auto Loader to process data incrementally?
Choices
- A: Checkpointing
- B: Spark Structured Streaming
- C: Data Explorer
- D: Unity Catalog
- E: Databricks SQL
Answer: B Answer_ET: B Community answer B (88%) 12% Discussion
Comment 858148 by XiltroX
- Upvotes: 10
Selected Answer: B B is the correct answer. Checkpointing is a method that is part of structured streaming.
Comment 1344888 by SatuPatu
- Upvotes: 2
Selected Answer: B Spark Structured Streaming is for incremental loading. Checkpointing is for failover.
Comment 1325502 by res3
- Upvotes: 3
Selected Answer: B Databricks uses Apache Spark Structured Streaming to back numerous products associated with ingestion workloads, including:
- Auto Loader
- COPY INTO
- Delta Live Tables pipelines
- Materialized views and streaming tables in Databricks SQL
Source: https://docs.databricks.com/en/ingestion/streaming.html
Comment 1321190 by heystatgal
- Upvotes: 3
Selected Answer: A B. Spark Structured Streaming: Spark Structured Streaming is a key underlying technology for Auto Loader to process streaming data. However, checkpointing is the specific mechanism that allows Auto Loader to track incremental progress. While Structured Streaming is essential for real-time data processing, checkpointing is the mechanism used to track what data has been processed.
Comment 1262732 by 80370eb
- Upvotes: 1
Selected Answer: B B. Spark Structured Streaming
Auto Loader uses Spark Structured Streaming to incrementally and efficiently process new data as it arrives, enabling scalable and reliable data ingestion in Databricks.
Comment 1124897 by RBKasemodel
- Upvotes: 2
The answer should be A. Auto Loader is used by Structured Streaming to process data incrementally, not the other way around.
Comment 1113656 by SerGrey
- Upvotes: 1
Selected Answer: B Correct is B
Comment 1064870 by awofalus
- Upvotes: 1
Selected Answer: B B is correct
Comment 1039270 by anandpsg101
- Upvotes: 1
Selected Answer: B B is correct
Comment 997987 by vctrhugo
- Upvotes: 2
Selected Answer: B B. Spark Structured Streaming
The Auto Loader process in Databricks is typically used in conjunction with Spark Structured Streaming to process data incrementally. Spark Structured Streaming is a real-time data processing framework that allows you to process data streams incrementally as new data arrives. The Auto Loader is a feature in Databricks that works with Structured Streaming to automatically detect and process new data files as they are added to a specified data source location. It allows for incremental data processing without the need for manual intervention.
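A minimal Auto Loader sketch, assuming a Databricks runtime where `spark` is predefined; the paths and the format choice are hypothetical. The `cloudFiles` source is what ties Auto Loader to Structured Streaming, and the schema/checkpoint locations are where incremental progress is tracked:

```python
# Options for an incremental Auto Loader read; cloudFiles.schemaLocation is
# where Auto Loader keeps its file-discovery state (the RocksDB store
# mentioned in the docs). All paths here are hypothetical.
def autoloader_options(schema_location: str) -> dict:
    return {
        "cloudFiles.format": "json",
        "cloudFiles.schemaLocation": schema_location,
    }

# On Databricks this would drive a Structured Streaming read/write:
# df = (spark.readStream.format("cloudFiles")
#         .options(**autoloader_options("/mnt/_schemas/events"))
#         .load("/mnt/landing/events"))
# (df.writeStream.format("delta")
#    .option("checkpointLocation", "/mnt/_checkpoints/events")
#    .toTable("bronze_events"))
```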
Comment 958799 by akk_1289
- Upvotes: 2
ans:A How does Auto Loader track ingestion progress? As files are discovered, their metadata is persisted in a scalable key-value store (RocksDB) in the checkpoint location of your Auto Loader pipeline. This key-value store ensures that data is processed exactly once.
In case of failures, Auto Loader can resume from where it left off by information stored in the checkpoint location and continue to provide exactly-once guarantees when writing data into Delta Lake. You don’t need to maintain or manage any state yourself to achieve fault tolerance or exactly-once semantics. https://docs.databricks.com/ingestion/auto-loader/index.html
Comment 958796 by akk_1289
- Upvotes: 1
ans:B How does Auto Loader track ingestion progress? As files are discovered, their metadata is persisted in a scalable key-value store (RocksDB) in the checkpoint location of your Auto Loader pipeline. This key-value store ensures that data is processed exactly once.
In case of failures, Auto Loader can resume from where it left off by information stored in the checkpoint location and continue to provide exactly-once guarantees when writing data into Delta Lake. You don’t need to maintain or manage any state yourself to achieve fault tolerance or exactly-once semantics. https://docs.databricks.com/ingestion/auto-loader/index.html
Comment 946478 by Atnafu
- Upvotes: 3
B Auto Loader uses Spark Structured Streaming to process data incrementally. Spark Structured Streaming is a streaming engine that can be used to process data as it arrives. This makes it ideal for processing data that is being generated in real time.
Option A: Checkpointing is a technique used to ensure that data is not lost in case of a failure. It is not used to process data incrementally.
Option C: Data Explorer is a data exploration tool that can be used to explore data. It is not used to process data incrementally.
Option D: Unity Catalog is a metadata management tool that can be used to store and manage metadata about data assets. It is not used to process data incrementally.
Option E: Databricks SQL is a SQL engine that can be used to query data. It is not used to process data incrementally.
Comment 859702 by surrabhi_4
- Upvotes: 2
Selected Answer: B Option B
Question Y508IVfMmb6eNStkjkXi
Question
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table. The code block used by the data engineer is below: //IMG//
If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the following lines of code should the data engineer use to fill in the blank?
Choices
- A: trigger(“5 seconds”)
- B: trigger()
- C: trigger(once=“5 seconds”)
- D: trigger(processingTime=“5 seconds”)
- E: trigger(continuous=“5 seconds”)
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 858823 by XiltroX
- Upvotes: 5
D is the correct answer
Comment 861969 by 4be8126
- Upvotes: 5
Selected Answer: D The correct line of code to fill in the blank to execute a micro-batch to process data every 5 seconds is:
D. trigger(processingTime=“5 seconds”)
Option A (“trigger(“5 seconds”)”) would not work because it does not specify that the trigger should be a processing time trigger, which is necessary to trigger a micro-batch processing at regular intervals.
Option B (“trigger()”) would not work because it would use the default trigger, which is not a processing time trigger.
Option C (“trigger(once=“5 seconds”)”) would not work because it would only trigger the query once, not at regular intervals.
Option E (“trigger(continuous=“5 seconds”)”) would not work because it would trigger the query to run continuously, without any pauses in between, which is not what the data engineer wants.
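The trigger variants discussed above can be summarized as keyword arguments to `DataStreamWriter.trigger()`; this is a sketch of the call shapes rather than something runnable against a cluster:

```python
# Trigger call shapes for Structured Streaming; exactly one may be set per
# query. Note that `once` takes a boolean, not a duration, which is another
# reason option C's trigger(once="5 seconds") is invalid.
TRIGGER_CALLS = {
    "micro-batch every 5 seconds": {"processingTime": "5 seconds"},  # option D
    "one micro-batch, then stop":  {"once": True},
    "experimental low latency":    {"continuous": "5 seconds"},
}

# On Databricks / Spark:
# (df.writeStream.format("delta")
#    .trigger(**TRIGGER_CALLS["micro-batch every 5 seconds"])
#    .start())
```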
Comment 1291855 by Raghu_Dasara
- Upvotes: 1
D is the correct answer: ProcessingTime. See https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/triggers (which also covers Continuous Processing).
Comment 1203445 by benni_ale
- Upvotes: 1
Selected Answer: D correct syntax is D
Comment 1064874 by awofalus
- Upvotes: 1
Selected Answer: D Correct: D
Comment 997991 by vctrhugo
- Upvotes: 2
Selected Answer: D
ProcessingTime trigger with a two-second micro-batch interval:
df.writeStream
  .format("console")
  .trigger(processingTime='2 seconds')
  .start()
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
Comment 985343 by AndreFR
- Upvotes: 1
Selected Answer: D https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
Comment 946484 by Atnafu
- Upvotes: 1
D
val query = sourceTable
  .writeStream
  .format("delta")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start(destinationTable)
Comment 861140 by rafahb
- Upvotes: 2
Selected Answer: D D is correct
Comment 859703 by surrabhi_4
- Upvotes: 3
Selected Answer: D Option D