Questions and Answers
Question lF89nQz3ngxh6PM1Yc1R
Question
A data engineer wants to improve the performance of SQL queries in Amazon Athena that run against a sales data table.
The data engineer wants to understand the execution plan of a specific SQL statement. The data engineer also wants to see the computational cost of each operation in a SQL query.
Which statement does the data engineer need to run to meet these requirements?
Choices
- A: EXPLAIN SELECT * FROM sales;
- B: EXPLAIN ANALYZE FROM sales;
- C: EXPLAIN ANALYZE SELECT * FROM sales;
- D: EXPLAIN FROM sales;
Answer: C Answer_ET: C Community answer: C (100%)
Discussion
Comment 1233772 by FunkyFresco
- Upvotes: 5
Selected Answer: C Use EXPLAIN ANALYZE. https://docs.aws.amazon.com/athena/latest/ug/athena-explain-statement.html
Comment 1241795 by HunkyBunky
- Upvotes: 1
Selected Answer: C explain analyze + select * from table
Comment 1230811 by tgv
- Upvotes: 4
Selected Answer: C
A - Only partially meets the requirements, as it does not include computational costs.
B - Incorrect syntax and does not meet the requirements.
C - Fully meets the requirements by providing both the execution plan and the computational costs.
D - Incorrect syntax and does not meet the requirements.
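For reference, a minimal boto3 sketch of running the chosen statement. The database name and S3 output location are placeholders for illustration; EXPLAIN ANALYZE returns the distributed execution plan together with per-operation runtime statistics (CPU, rows, bytes) as ordinary query result rows.

```python
# Minimal sketch: run EXPLAIN ANALYZE in Athena via boto3 and print the plan.
# "sales_db" and the output bucket are assumed names, not from the question.
import time
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="EXPLAIN ANALYZE SELECT * FROM sales",
    QueryExecutionContext={"Database": "sales_db"},                      # assumed database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},   # assumed bucket
)
query_id = response["QueryExecutionId"]

# Poll until the statement finishes, then print the plan with runtime statistics.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
    print(row["Data"][0].get("VarCharValue", ""))
```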
Question ypALFrZLCZIyw8BIdHLU
Question
A data engineer needs to schedule a workflow that runs a set of AWS Glue jobs every day. The data engineer does not require the Glue jobs to run or finish at a specific time. Which solution will run the Glue jobs in the MOST cost-effective way?
Choices
- A: Choose the FLEX execution class in the Glue job properties.
- B: Use the Spot Instance type in Glue job properties.
- C: Choose the STANDARD execution class in the Glue job properties.
- D: Choose the latest version in the GlueVersion field in the Glue job properties.
Answer: A Answer_ET: A Community answer: A (100%)
Discussion
Comment 1226790 by pypelyncar
- Upvotes: 7
Selected Answer: A The FLEX execution class leverages spare capacity within the AWS infrastructure to run Glue jobs at a discounted price compared to the standard execution class. Since the data engineer doesn't have specific time constraints, utilizing spare capacity is ideal for cost savings. Today it is simply a checkbox that opts the job into spare capacity; it also means we don't know when the job will finish, so increasing the job timeout is recommended.
Comment 1137887 by TonyStark0122
- Upvotes: 6
A. Choose the FLEX execution class in the Glue job properties.
Explanation: The FLEX execution class in AWS Glue allows jobs to use idle resources within the Glue service, which can significantly reduce costs compared to the STANDARD execution class. With FLEX, Glue jobs run when resources are available, which is a cost-effective approach for jobs that don’t need to be completed within a specific timeframe.
Comment 1254500 by GabrielSGoncalves
- Upvotes: 1
Selected Answer: A FLEX is how you lower Glue cost when you don't have urgency to run ETLs.
Comment 1209087 by k350Secops
- Upvotes: 3
Selected Answer: A As noted, a FLEX job comes out cheaper than using a Spot Instance.
Comment 1188287 by lucas_rfsb
- Upvotes: 1
Selected Answer: A I’d go with A
Comment 1137269 by lalitjhawar
- Upvotes: 5
A Flex allows you to optimize your costs on your non-urgent or non-time sensitive data integration workloads such as testing, and one-time data loads. With Flex, AWS Glue jobs run on spare compute capacity instead of dedicated hardware. The start and runtimes of jobs using Flex can vary because spare compute resources aren’t readily available and can be reclaimed during the run of a job
https://aws.amazon.com/blogs/big-data/introducing-aws-glue-flex-jobs-cost-savings-on-etl-workloads/
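As a rough illustration of the FLEX option discussed above, the sketch below creates a daily-scheduled Glue job with boto3. The job name, IAM role, script location, and cron expression are placeholders; ExecutionClass="FLEX" (supported for Glue 3.0+ Spark jobs) is the property the question is about.

```python
# Minimal sketch: create a Glue job that runs on spare capacity (FLEX) and a
# daily schedule. All names, ARNs, and the schedule are hypothetical.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="daily-sales-etl",                                   # hypothetical job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",        # hypothetical role
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-scripts/etl.py"},
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
    ExecutionClass="FLEX",   # run on spare capacity at a lower price; start time not guaranteed
    Timeout=480,             # minutes; generous timeout because FLEX start/finish can vary
)

# Daily trigger: the job just needs to run each day, not at an exact time.
glue.create_trigger(
    Name="daily-sales-etl-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 3 * * ? *)",   # 03:00 UTC every day (hypothetical)
    Actions=[{"JobName": "daily-sales-etl"}],
    StartOnCreation=True,
)
```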
Question XL69xrSDiTPxT7DGhuc4
Question
A company plans to provision a log delivery stream within a VPC. The company configured the VPC flow logs to publish to Amazon CloudWatch Logs. The company needs to send the flow logs to Splunk in near real time for further analysis.
Which solution will meet these requirements with the LEAST operational overhead?
Choices
- A: Configure an Amazon Kinesis Data Streams data stream to use Splunk as the destination. Create a CloudWatch Logs subscription filter to send log events to the data stream.
- B: Create an Amazon Kinesis Data Firehose delivery stream to use Splunk as the destination. Create a CloudWatch Logs subscription filter to send log events to the delivery stream.
- C: Create an Amazon Kinesis Data Firehose delivery stream to use Splunk as the destination. Create an AWS Lambda function to send the flow logs from CloudWatch Logs to the delivery stream.
- D: Configure an Amazon Kinesis Data Streams data stream to use Splunk as the destination. Create an AWS Lambda function to send the flow logs from CloudWatch Logs to the data stream.
Answer: B Answer_ET: B Community answer: B (100%)
Discussion
Comment 1230814 by tgv
- Upvotes: 6
Selected Answer: B Kinesis Data Firehose has built-in support for Splunk as a destination, making the integration straightforward. Using a CloudWatch Logs subscription filter directly to Firehose simplifies the data flow, eliminating the need for additional Lambda functions or custom integrations.
Comment 1242310 by bakarys
- Upvotes: 4
Selected Answer: B Creating an Amazon Kinesis Data Firehose delivery stream to use Splunk as the destination and creating a CloudWatch Logs subscription filter to send log events to the delivery stream would meet these requirements with the least operational overhead.
Amazon Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics services. It can capture, transform, and deliver streaming data to Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, generic HTTP endpoints, and service providers like Splunk.
CloudWatch Logs subscription filters allow you to send real-time log events to Kinesis Data Firehose and are ideal for scenarios where you want to forward the logs to other services for further analysis.
Options A and D involve Kinesis Data Streams, which would require additional management and operational overhead. Option C involves creating a Lambda function, which also adds operational overhead. Therefore, option B is the best choice.
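A minimal sketch of the Firehose-plus-subscription-filter setup with boto3. All names, ARNs, and the Splunk HEC endpoint and token are placeholders, and the referenced IAM roles are assumed to already exist with the required permissions (Firehose writing failed events to S3, and CloudWatch Logs putting records to Firehose).

```python
# Minimal sketch: Firehose delivery stream with a Splunk destination, plus a
# CloudWatch Logs subscription filter for the VPC flow log group.
# Every name, ARN, endpoint, and token below is a placeholder.
import boto3

firehose = boto3.client("firehose")
logs = boto3.client("logs")

firehose.create_delivery_stream(
    DeliveryStreamName="vpc-flow-logs-to-splunk",
    SplunkDestinationConfiguration={
        "HECEndpoint": "https://splunk.example.com:8088",    # placeholder HEC endpoint
        "HECEndpointType": "Raw",
        "HECToken": "00000000-0000-0000-0000-000000000000",  # placeholder token
        "S3BackupMode": "FailedEventsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/FirehoseBackupRole",
            "BucketARN": "arn:aws:s3:::flow-log-backup-bucket",
        },
    },
)

# Subscription filter: no Lambda needed, CloudWatch Logs delivers directly to Firehose.
logs.put_subscription_filter(
    logGroupName="/vpc/flow-logs",                            # placeholder log group
    filterName="flow-logs-to-firehose",
    filterPattern="",                                         # forward every event
    destinationArn="arn:aws:firehose:us-east-1:123456789012:deliverystream/vpc-flow-logs-to-splunk",
    roleArn="arn:aws:iam::123456789012:role/CWLtoFirehoseRole",
)
```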
Question Vn7FPeskwiBIURn1NckS
Question
A company has a data lake on AWS. The data lake ingests sources of data from business units. The company uses Amazon Athena for queries. The storage layer is Amazon S3 with an AWS Glue Data Catalog as a metadata repository.
The company wants to make the data available to data scientists and business analysts. However, the company first needs to manage fine-grained, column-level data access for Athena based on the user roles and responsibilities.
Which solution will meet these requirements?
Choices
- A: Set up AWS Lake Formation. Define security policy-based rules for the users and applications by IAM role in Lake Formation.
- B: Define an IAM resource-based policy for AWS Glue tables. Attach the same policy to IAM user groups.
- C: Define an IAM identity-based policy for AWS Glue tables. Attach the same policy to IAM roles. Associate the IAM roles with IAM groups that contain the users.
- D: Create a resource share in AWS Resource Access Manager (AWS RAM) to grant access to IAM users.
Answer: A Answer_ET: A Community answer: A (100%)
Discussion
Comment 1244008 by Ja13
- Upvotes: 6
Selected Answer: A Correct Solution: A. Set up AWS Lake Formation. Define security policy-based rules for the users and applications by IAM role in Lake Formation.
Explanation: AWS Lake Formation: This service simplifies and automates the process of securing and managing data lakes. It allows you to define fine-grained access control policies at the database, table, and column levels. Security Policy-Based Rules: Lake Formation allows you to create policies that specify which users or roles have access to specific data, including column-level access controls. This makes it easier to manage access based on roles and responsibilities.
Comment 1329435 by HagarTheHorrible
- Upvotes: 1
Selected Answer: A Lake Formation for any fine-grained access.
Comment 1234146 by HunkyBunky
- Upvotes: 1
Selected Answer: A A - Lake formation
Comment 1230815 by tgv
- Upvotes: 4
Selected Answer: A Lake Formation supports fine-grained access control, including column-level permissions.
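A minimal sketch of a column-level Lake Formation grant with boto3. The role, database, table, and column names are placeholders; once the table is governed by Lake Formation rather than IAM-only policies, Athena enforces this grant for queries issued under that role.

```python
# Minimal sketch: grant SELECT on specific columns of a Data Catalog table to
# an analyst IAM role via Lake Formation. Names and ARNs are placeholders.
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/BusinessAnalystRole"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",                         # placeholder database
            "Name": "sales",                                    # placeholder table
            "ColumnNames": ["region", "order_date", "amount"],  # only these columns are visible
        }
    },
    Permissions=["SELECT"],
)
```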
Question n5PPm7nmYeLrZ1gXh6P1
Question
A company has developed several AWS Glue extract, transform, and load (ETL) jobs to validate and transform data from Amazon S3. The ETL jobs load the data into Amazon RDS for MySQL in batches once every day. The ETL jobs use a DynamicFrame to read the S3 data.
The ETL jobs currently process all the data that is in the S3 bucket. However, the company wants the jobs to process only the daily incremental data.
Which solution will meet this requirement with the LEAST coding effort?
Choices
- A: Create an ETL job that reads the S3 file status and logs the status in Amazon DynamoDB.
- B: Enable job bookmarks for the ETL jobs to update the state after a run to keep track of previously processed data.
- C: Enable job metrics for the ETL jobs to help keep track of processed objects in Amazon CloudWatch.
- D: Configure the ETL jobs to delete processed objects from Amazon S3 after each run.
Answer: B Answer_ET: B Community answer: B (100%)
Discussion
Comment 1230816 by tgv
- Upvotes: 8
Selected Answer: B AWS Glue job bookmarks are designed to handle incremental data processing by automatically tracking the state.
Comment 1249191 by andrologin
- Upvotes: 1
Selected Answer: B AWS Glue Bookmarks can be used to pin where the data processing last stopped hence help with incremental processing.
Comment 1241794 by HunkyBunky
- Upvotes: 1
Selected Answer: B B - bookmarks is a key
Comment 1240723 by bakarys
- Upvotes: 3
Selected Answer: B The solution that will meet this requirement with the least coding effort is Option B: Enable job bookmarks for the ETL jobs to update the state after a run to keep track of previously processed data.
AWS Glue job bookmarks help ETL jobs to keep track of data that has already been processed during previous runs. By enabling job bookmarks, the ETL jobs can skip the processed data and only process the new, incremental data. This feature is designed specifically for this use case and requires minimal coding effort.
Options A, C, and D would require additional coding and operational effort. Option A would require creating a new ETL job and managing a DynamoDB table. Option C would involve setting up job metrics and CloudWatch, which doesn’t directly address processing incremental data. Option D would involve deleting data from S3 after processing, which might not be desirable if the original data needs to be retained. Therefore, Option B is the most suitable solution.
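A minimal sketch of a bookmark-aware Glue job script (PySpark, so it only runs inside the Glue job environment). The database and table names are placeholders; bookmarks must also be enabled on the job itself (the --job-bookmark-option job-bookmark-enable argument), the DynamicFrame read must pass a transformation_ctx, and the script must call job.commit() so the new state is saved for the next daily run.

```python
# Minimal sketch of a Glue PySpark job that uses job bookmarks to process only
# new S3 data each day. "sales_db" / "daily_sales" are hypothetical catalog names.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# transformation_ctx is what lets the bookmark remember which S3 objects were read.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="daily_sales",
    transformation_ctx="read_daily_sales",
)

# ... validate/transform and write the batch to Amazon RDS for MySQL via a JDBC connection ...

job.commit()  # persists the bookmark state so the next run skips already-processed data
```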