Questions and Answers
Question 4LLWPsJoV1fKhFEWcnpe
Question
When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VM’s resources?
Choices
- A: The five Minute Load Average remains consistent/flat
- B: Bytes Received never exceeds 80 million bytes per second
- C: Network I/O never spikes
- D: Total Disk Space remains constant
- E: CPU Utilization is around 75%
Answer: E
Community answer: E (100%)
Discussion
Comment 1062817 by sturcu
- Upvotes: 7
Selected Answer: E I would look at max CPU utilization and max memory usage. Having 75% CPU usage signifies proper utilization of CPU resources.
Comment 1141671 by vctrhugo
- Upvotes: 2
Selected Answer: E Proper utilization of VM resources, especially in a distributed computing environment like Spark, often involves efficient usage of CPU resources. A CPU utilization around 75% indicates that the CPU is being utilized without being fully saturated, allowing room for additional processing without causing excessive contention.
Comment 1100376 by alexvno
- Upvotes: 1
Selected Answer: E 75% good
Comment 1076694 by aragorn_brego
- Upvotes: 3
Selected Answer: E An average CPU utilization around 75% is a good indicator of proper utilization of the VM’s resources in a distributed computing environment. It suggests that the CPUs are being actively used for computation without being maxed out, which could indicate a bottleneck. It leaves some headroom to handle additional load without causing excessive queuing or delays.
Question lahZOckwB9uht3pLJ1xF
Question
Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?
Choices
- A: Regex
- B: Julia
- C: pyspark.ml.feature
- D: Scala Datasets
- E: C++
Answer: A
Community answer: A (89%), other (11%)
Discussion
Comment 1141668 by vctrhugo
- Upvotes: 3
Selected Answer: A It allows us to define patterns that match the structure of the log entries and capture relevant data.
Comment 1076699 by aragorn_brego
- Upvotes: 4
Selected Answer: A Regular expressions (regex) can be used to identify and extract patterns from text data, which makes them very useful for parsing log files like the Spark Driver’s log4j output. By defining specific regex patterns, you can search for error messages, timestamps, specific log levels, or any other text that follows a particular format within the log files.
Comment 1057429 by sturcu
- Upvotes: 3
Selected Answer: A Regex to extract text
Comment 1057124 by hm358
- Upvotes: 2
Selected Answer: A regex is for string identification
Comment 1056075 by mouad_attaqi
- Upvotes: 4
Selected Answer: A Using regex, we can identify key and value areas
Comment 1053521 by sturcu
- Upvotes: 1
Why C++, why not Python or Java? Plus there are tools for parsing log4j output, like Chainsaw and xmlstarlet.
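As the comments above note, regex patterns can pull log levels, timestamps, and messages out of driver logs. A minimal sketch with Python's `re` module, assuming a typical default log4j line format (the sample line and pattern below are illustrative, not taken from the question):

```python
import re

# Named groups capture the timestamp, level, logger name, and message
# from a log4j-style line such as "23/11/05 14:02:11 ERROR Source: msg".
LOG_PATTERN = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>INFO|WARN|ERROR) "
    r"(?P<source>\S+): (?P<msg>.*)$"
)

line = "23/11/05 14:02:11 ERROR TaskSchedulerImpl: Lost executor 2 on 10.0.0.5"
m = LOG_PATTERN.match(line)
if m:
    print(m.group("level"))   # ERROR
    print(m.group("source"))  # TaskSchedulerImpl
```

The same pattern can be applied line by line over a log file to filter for, say, only ERROR entries.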
Question 60cj5Jh2N1zHOhH7R4gy
Question
You are testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function.
assert(myIntegrate(lambda x: x*x, 0, 3)[0] == 9)
Which kind of test would the above line exemplify?
Choices
- A: Unit
- B: Manual
- C: Functional
- D: Integration
- E: End-to-end
Answer: A
Community answer: A (75%), C (25%)
Discussion
Comment 1207354 by Nickff
- Upvotes: 3
Selected Answer: A Answer is A, unit test
Comment 1180359 by barnac1es
- Upvotes: 3
Selected Answer: C I think it should be Functional Test
Comment 1141667 by vctrhugo
- Upvotes: 3
Selected Answer: A A. Unit
Comment 1111480 by divingbell17
- Upvotes: 3
Selected Answer: A A is correct
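The assert in the question checks one function against one known input/output pair in isolation, which is the defining shape of a unit test. A sketch of what this could look like, assuming `myIntegrate` is a hypothetical routine returning a `(value, error)` tuple, which is why the question's assert indexes `[0]`:

```python
# Hypothetical stand-in for the myIntegrate under test: a midpoint-rule
# approximation returning (value, estimated_error).
def myIntegrate(f, a, b, n=100_000):
    h = (b - a) / n
    value = sum(f(a + (i + 0.5) * h) for i in range(n)) * h
    return value, h  # crude error proxy

def test_integrate_x_squared():
    # Unit test: one function, one known input, one expected output.
    # Integral of x^2 over [0, 3] is 27/3 = 9.
    value, _ = myIntegrate(lambda x: x * x, 0, 3)
    assert abs(value - 9) < 1e-3

test_integrate_x_squared()
```

A functional or integration test, by contrast, would exercise several components together or a user-visible behavior, not a single function in isolation.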
Question vMuyBdca92FBzYgQVb5D
Question
A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on Task A.
If task A fails during a scheduled run, which statement describes the results of this run?
Choices
- A: Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.
- B: Tasks B and C will attempt to run as configured; any changes made in task A will be rolled back due to task failure.
- C: Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task A failed, all commits will be rolled back automatically.
- D: Tasks B and C will be skipped; some logic expressed in task A may have been committed before task failure.
- E: Tasks B and C will be skipped; task A will not commit any changes because of stage failure.
Answer: D
Community answer: D (100%)
Discussion
Comment 1056076 by mouad_attaqi
- Upvotes: 6
Selected Answer: D D is correct, tasks B and C will definitely be skipped. Since Task A is a notebook, the ACID logic is at the cell level, so some logic may have been executed before the failing cell.
Comment 1076705 by aragorn_brego
- Upvotes: 4
Selected Answer: D In Databricks job execution, if a task that other tasks depend on fails, the dependent tasks will not be executed. Since Tasks B and C depend on the successful completion of Task A, they will be skipped if Task A fails. However, if Task A performs any operations that commit changes before the failure occurs (such as writing to a Delta table), those changes remain and are not automatically rolled back unless the logic within Task A specifically includes rollback mechanisms for partial failures.
Comment 1066315 by Dileepvikram
- Upvotes: 3
D is the answer
Comment 1053525 by sturcu
- Upvotes: 3
Selected Answer: D Some ops in task A may have finished before the failure
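The semantics the answer describes can be sketched as a tiny pure-Python job runner (all names here are hypothetical, for illustration only): a failed upstream task causes its dependents to be skipped, but side effects committed before the failure are not rolled back.

```python
# Minimal sketch of dependency-graph run semantics (answer D).
def run_job(tasks, deps):
    """tasks: {name: callable(effects)}; deps: {name: [upstream names]}.
    Assumes tasks are listed in topological order."""
    status, effects = {}, []
    for name in tasks:
        # Skip if any upstream task did not succeed.
        if any(status.get(up) != "success" for up in deps.get(name, [])):
            status[name] = "skipped"
            continue
        try:
            tasks[name](effects)
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status, effects

def task_a(effects):
    effects.append("A: wrote batch 1")  # committed before the failure
    raise RuntimeError("A fails mid-run")

status, effects = run_job(
    {"A": task_a, "B": lambda e: e.append("B ran"), "C": lambda e: e.append("C ran")},
    {"B": ["A"], "C": ["A"]},
)
# status -> A failed, B and C skipped; effects still contain A's earlier write.
```

In a real Databricks job the "effects" would be, for example, Delta table writes that completed in earlier notebook cells, which the platform does not undo when a later cell fails.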
Question yCkzkxWy84EEsEVpEYQv
Question
A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the below code is to register a view of all sales that occurred in countries on the continent of Africa that appear in the geo_lookup table. Before executing the code, running SHOW TABLES on the current database indicates the database contains only two tables: geo_lookup and sales. //IMG//
Which statement correctly describes the outcome of executing these command cells in order in an interactive notebook?
Choices
- A: Both commands will succeed. Executing show tables will show that countries_af and sales_af have been registered as views.
- B: Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or view named countries_af: if this entity exists, Cmd 2 will succeed.
- C: Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable representing a PySpark DataFrame.
- D: Both commands will fail. No new variables, tables, or views will be created.
- E: Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable containing a list of strings.
Answer: E
Community answer: E (91%), other (9%)
Discussion
Comment 1075925 by aragorn_brego
- Upvotes: 11
Selected Answer: E Cmd 1 is a PySpark command that collects the list of countries from the ‘geo_lookup’ table where the continent is Africa (‘AF’). This command will execute successfully, resulting in countries_af being a list of country names (strings) in Python’s local memory.
Cmd 2 is an SQL command intended to create a view named ‘sales_af’ from the ‘sales’ table, filtered by the cities in the countries_af list. However, this will fail because the countries_af variable exists in the Python environment and is not recognized in the SQL context. SQL does not have access to Python variables directly; they are two separate execution contexts within a Databricks notebook. There is no table or view named countries_af that SQL can reference; it is merely a Python list variable.
The other options are incorrect because they either assume cross-contextual operation between Python and SQL within a Databricks notebook (which is not possible in the way described in the commands), or they do not correctly interpret the outcome of running the commands.
Comment 1292755 by benni_ale
- Upvotes: 1
Selected Answer: E E, the collect method outputs strings, so the Python variable will be a list of strings, which cannot be referenced as a Spark table as in Cmd 2
Comment 1224440 by imatheushenrique
- Upvotes: 1
E. Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable containing a list of strings.
Comment 1191739 by juliom6
- Upvotes: 3
Selected Answer: E E is correct.
%sql
create table geo_lookup (continent varchar(2), country varchar(15));
insert into geo_lookup (continent, country) values ('AF','Nigeria'), ('AF','Kenya');
create table sales (city varchar(15), continent varchar(2));
insert into sales (city, continent) values ('Nigeria','AF'), ('Kenya','AF');
%python
countries_af = [x[0] for x in spark.table('geo_lookup').filter("continent='AF'").select('country').collect()]
%sql
create view sales_af as select * from sales where city in countries_af and continent = "AF";
ParseException: [PARSE_SYNTAX_ERROR] Syntax error at or near 'in'.(line 4, pos 11)
i.e., countries_af is a Python list of strings and can't be used inside a SQL statement
Comment 1150223 by leopedroso1
- Upvotes: 1
By simulating this code in Databricks we can see an error being thrown in the SQL statement:
ParseException: [PARSE_SYNTAX_ERROR] Syntax error at or near 'IN'.(line 1, pos 38)
SQL: SELECT * FROM backup.sales WHERE CITY IN countries_af AND CONTINENT = "AF"
Comment 1145943 by RiktRikt007
- Upvotes: 1
Selected Answer: B B describes the actual flow of Spark SQL, while E reflects the question's context. The question states the database contains no other tables, so Cmd 2's search for a table or view named countries_af will find nothing. B hedges with "if this entity exists, Cmd 2 will succeed", but since the question is really about language interoperability, most of us selected E.
Comment 1144403 by PrashantTiwari
- Upvotes: 2
E is correct
Comment 1121649 by Jay_98_11
- Upvotes: 2
Selected Answer: E vote for E
Comment 1118553 by kz_data
- Upvotes: 1
Selected Answer: E E is correct answer
Comment 1062069 by ismoshkov
- Upvotes: 1
Selected Answer: B https://docs.databricks.com/en/notebooks/notebooks-code.html#mix-languages Variables defined in one language (and hence in the REPL for that language) are not available in the REPL of another language
Comment 1040245 by sturcu
- Upvotes: 1
Selected Answer: E correct
Comment 1000703 by lucasasterio
- Upvotes: 2
Selected Answer: E correct
Comment 991538 by Eertyy
- Upvotes: 2
E is the right answer
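For completeness, the failure the commenters reproduce has standard workarounds: interpolate the list into the SQL string from Python, or register the intermediate result as a temp view so a later %sql cell can reference it. A sketch of the first approach (pure Python; the commented-out Spark calls assume a live session and use the question's table names):

```python
# What Cmd 1 leaves behind in the Python REPL: a plain list of strings.
countries_af = ["Nigeria", "Kenya"]

# Option 1: build the SQL IN clause from the Python list, then run it
# with spark.sql() from a %python cell.
in_clause = ", ".join(f"'{c}'" for c in countries_af)
query = (
    "CREATE OR REPLACE VIEW sales_af AS "
    f"SELECT * FROM sales WHERE city IN ({in_clause}) AND continent = 'AF'"
)
# spark.sql(query)

# Option 2: instead of collect()-ing in Cmd 1, register a temp view that
# a later %sql cell can join or filter against:
# spark.table("geo_lookup").filter("continent = 'AF'") \
#      .createOrReplaceTempView("countries_af")
```

Either way, the point of the question stands: a bare Python variable is invisible to the SQL context, so it must be turned into SQL text or a registered view before %sql can use it.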