Questions and Answers
Question P2ON0ECNRy7k0rDkxllw
Question
A data engineer needs to debug an AWS Glue job that reads from Amazon S3 and writes to Amazon Redshift. The data engineer enabled the bookmark feature for the AWS Glue job. The data engineer has set the maximum concurrency for the AWS Glue job to 1.
The AWS Glue job is successfully writing the output to Amazon Redshift. However, the Amazon S3 files that were loaded during previous runs of the AWS Glue job are being reprocessed by subsequent runs.
What is the likely reason the AWS Glue job is reprocessing the files?
Choices
- A: The AWS Glue job does not have the s3:GetObjectAcl permission that is required for bookmarks to work correctly.
- B: The maximum concurrency for the AWS Glue job is set to 1.
- C: The data engineer incorrectly specified an older version of AWS Glue for the Glue job.
- D: The AWS Glue job does not have a required commit statement.
Answer: D | Answer_ET: D | Community answer: D (61%), A (39%)
Discussion
Comment 1243389 by lool
- Upvotes: 8
Selected Answer: D https://docs.aws.amazon.com/glue/latest/dg/glue-troubleshooting-errors.html#error-job-bookmarks-reprocess-data
Comment 1308487 by AgboolaKun
- Upvotes: 2
Selected Answer: D A “commit” statement within your AWS Glue job script is absolutely required to update the job bookmark and properly track processed data, preventing the reprocessing of old data when running the job again; essentially, if you don’t include the commit statement, the job will not remember where it left off and may process data multiple times. For more information about job.commit(), please reference this documentation - https://docs.aws.amazon.com/glue/latest/dg/glue-troubleshooting-errors.html#error-job-bookmarks-reprocess-data
Comment 1301595 by rsmf
- Upvotes: 2
Selected Answer: D B is the right answer.
Comment 1300318 by mohamedTR
- Upvotes: 2
Selected Answer: A Commit statements are relevant to transactional operations in databases like Redshift but are not related to S3 bookmarks or Glue’s tracking mechanism for processed files.
Comment 1293090 by proserv
- Upvotes: 2
Selected Answer: D Ensure that your job run script ends with the following commit:
job.commit()
When you include this object, AWS Glue records the timestamp and path of the job run. If you run the job again with the same path, AWS Glue processes only the new files. If you don’t include this object and job bookmarks are enabled, the job reprocesses the already processed files along with the new files and creates redundancy in the job’s target data store.
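To make the commit requirement concrete, here is a minimal sketch of a bookmark-enabled Glue PySpark script; the S3 path and transformation_ctx name are hypothetical. The key points are job.init() at the start and job.commit() at the end.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)

# job.init() loads the bookmark state for this job run.
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical source: with bookmarks enabled and a transformation_ctx set,
# only files not seen by previous runs are read.
frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/input/"]},
    format="csv",
    transformation_ctx="s3_source",
)

# ... transforms and the write to Amazon Redshift would go here ...

# Without this final commit, the bookmark state is never persisted and
# every run reprocesses the same S3 files (the behavior in the question).
job.commit()
```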
Comment 1285443 by azure_bimonster
- Upvotes: 1
Selected Answer: A I would go with option A.
Comment 1279301 by EJGisME
- Upvotes: 1
Selected Answer: A A. The AWS Glue job does not have the s3:GetObjectAcl permission that is required for bookmarks to work correctly.
Comment 1268231 by mzansikiller
- Upvotes: 1
Selected Answer: A Answer A
this is a job bookmarks permissions issue
Comment 1262573 by antun3ra
- Upvotes: 4
Selected Answer: A For AWS Glue bookmarks to function correctly, the job needs the necessary permissions to read and write bookmark data, including the s3:GetObjectAcl permission. If these permissions are not correctly set, the job may not be able to track which files have already been processed, leading to reprocessing of previously processed files.
Comment 1250093 by andrologin
- Upvotes: 2
Selected Answer: D AWS Glue Job requires the commit statement to save the last successful run/processing
Comment 1240833 by HunkyBunky
- Upvotes: 3
Selected Answer: D For me - D looks correct
Comment 1239451 by Alagong
- Upvotes: 3
Selected Answer: A The commit statement (Option D) is not required for AWS Glue jobs. AWS Glue commits any open transactions to the database when all the script statements finish running.
Comment 1239061 by Bmaster
- Upvotes: 4
D is good
Question JG1Sy4gdmRfIa3jod11G
Question
An ecommerce company wants to use AWS to migrate data pipelines from an on-premises environment into the AWS Cloud. The company currently uses a third-party tool in the on-premises environment to orchestrate data ingestion processes.
The company wants a migration solution that does not require the company to manage servers. The solution must be able to orchestrate Python and Bash scripts. The solution must not require the company to refactor any code.
Which solution will meet these requirements with the LEAST operational overhead?
Choices
- A: AWS Lambda
- B: Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
- C: AWS Step Functions
- D: AWS Glue
Answer: B | Answer_ET: B | Community answer: B (86%), other (14%)
Discussion
Comment 1242444 by HunkyBunky
- Upvotes: 2
Selected Answer: B B - because company want to use same tool on premises and least operational overhead
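As an illustration of why MWAA fits, here is a minimal Airflow DAG sketch of the kind MWAA runs (assuming Airflow 2.4 or later); the script names and paths are hypothetical. Existing Bash and Python logic is wrapped in operators rather than refactored.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def run_ingestion():
    # Existing Python ingestion logic would be called here unchanged.
    print("ingesting data...")


with DAG(
    dag_id="onprem_pipeline_migration",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # "schedule" is the Airflow 2.4+ argument name
    catchup=False,
):
    # Hypothetical Bash script carried over from the on-premises pipeline.
    # The trailing space stops Airflow from treating the .sh path as a Jinja template file.
    extract = BashOperator(
        task_id="extract",
        bash_command="bash /usr/local/airflow/dags/scripts/extract.sh ",
    )

    load = PythonOperator(
        task_id="load",
        python_callable=run_ingestion,
    )

    extract >> load
```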
Comment 1241595 by didorins
- Upvotes: 1
Selected Answer: C "The company wants a migration solution that does not require the company to manage servers." How is it Amazon Managed Workflows for Apache Airflow and not Step Functions, when Step Functions is the more serverless of the two?
Comment 1241497 by Ja13
- Upvotes: 3
Selected Answer: B
Comment 1240834 by HunkyBunky
- Upvotes: 1
Selected Answer: B B - best fits in task requirements
Comment 1239064 by Bmaster
- Upvotes: 2
My Choice is B
Question l6atByHKAMPCauQHRONx
Question
A data engineer needs Amazon Athena queries to finish faster. The data engineer notices that all the files the Athena queries use are currently stored in uncompressed .csv format. The data engineer also notices that users perform most queries by selecting a specific column.
Which solution will MOST speed up the Athena query performance?
Choices
- A: Change the data format from .csv to JSON format. Apply Snappy compression.
- B: Compress the .csv files by using Snappy compression.
- C: Change the data format from .csv to Apache Parquet. Apply Snappy compression.
- D: Compress the .csv files by using gzip compression.
Answer: C | Answer_ET: C | Community answer: C (100%)
Discussion
Comment 1125641 by milofficial
- Upvotes: 11
Selected Answer: C If the exam would only have these kinds of questions everyone would be blessed
Comment 1137892 by TonyStark0122
- Upvotes: 6
C. Change the data format from .csv to Apache Parquet. Apply Snappy compression.
Explanation: Apache Parquet is a columnar storage format optimized for analytical queries. It is highly efficient for query performance, especially when queries involve selecting specific columns, as it allows for column pruning and predicate pushdown optimizations.
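As an illustration of the conversion itself (not part of the question), here is a minimal pyarrow sketch that rewrites a hypothetical CSV file as Snappy-compressed Parquet; at scale the same conversion could be done with a Glue job or an Athena CTAS statement.

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Hypothetical local copy of one of the uncompressed .csv objects.
table = pv.read_csv("events_2024-06-01.csv")

# Parquet stores each column separately, so Athena can read only the
# selected column; Snappy compression keeps the scan size (and cost) down.
pq.write_table(table, "events_2024-06-01.parquet", compression="snappy")
```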
Comment 1410025 by Scotty_Nguyen
- Upvotes: 1
Selected Answer: C C is correct
Comment 1254504 by GabrielSGoncalves
- Upvotes: 1
Selected Answer: C C is the way to do it, based on the best practices recommended by AWS (https://aws.amazon.com/pt/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/)
Comment 1210121 by hnk
- Upvotes: 1
Selected Answer: C C is correct
Comment 1209089 by k350Secops
- Upvotes: 1
Selected Answer: C switching to Apache Parquet format with Snappy compression offers the most significant improvement in Athena query performance, especially for queries that select specific columns
Comment 1207115 by d8945a1
- Upvotes: 1
Selected Answer: C Parquet is a columnar storage format, and the question specifies that users perform most queries by selecting a specific column.
Comment 1191578 by wa212
- Upvotes: 2
Selected Answer: C https://aws.amazon.com/jp/blogs/news/top-10-performance-tuning-tips-for-amazon-athena/
Comment 1158040 by Alcee
- Upvotes: 1
C easy
Question NN6uiHStJJsoMvogsNkU
Question
A retail company stores data from a product lifecycle management (PLM) application in an on-premises MySQL database. The PLM application frequently updates the database when transactions occur.
The company wants to gather insights from the PLM application in near real time. The company wants to integrate the insights with other business datasets and to analyze the combined dataset by using an Amazon Redshift data warehouse.
The company has already established an AWS Direct Connect connection between the on-premises infrastructure and AWS.
Which solution will meet these requirements with the LEAST development effort?
Choices
- A: Run a scheduled AWS Glue extract, transform, and load (ETL) job to get the MySQL database updates by using a Java Database Connectivity (JDBC) connection. Set Amazon Redshift as the destination for the ETL job.
- B: Run a full load plus CDC task in AWS Database Migration Service (AWS DMS) to continuously replicate the MySQL database changes. Set Amazon Redshift as the destination for the task.
- C: Use the Amazon AppFlow SDK to build a custom connector for the MySQL database to continuously replicate the database changes. Set Amazon Redshift as the destination for the connector.
- D: Run scheduled AWS DataSync tasks to synchronize data from the MySQL database. Set Amazon Redshift as the destination for the tasks.
Answer: B | Answer_ET: B | Community answer: B (100%)
Discussion
Comment 1254560 by andrologin
- Upvotes: 1
Selected Answer: B AWS DMS allows for change data capture that will have the destination updated at near real time with changes from the source database
Comment 1254524 by Fredrik1
- Upvotes: 1
Selected Answer: B Should B. Makes most sense.
Comment 1253127 by Chelseajcole
- Upvotes: 1
ChatGPT: Option B (AWS DMS) is the most suitable with the least development effort. AWS DMS supports continuous data replication with CDC capabilities, making it well-suited for near real-time data integration from MySQL to Amazon Redshift. It handles schema conversion and simplifies the setup process compared to custom development or scheduled ETL jobs. Given the existing AWS Direct Connect, AWS DMS can efficiently replicate MySQL updates to Redshift with minimal latency, meeting the company’s requirement for near real-time insights integration. Therefore, option B is the correct choice.
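For reference, here is a minimal boto3 sketch of a full load plus CDC task. It assumes the MySQL source endpoint, Redshift target endpoint, and replication instance already exist; the ARNs and schema name are placeholders.

```python
import json

import boto3

dms = boto3.client("dms")

# Replicate every table in a hypothetical "plm" schema.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-plm-schema",
            "object-locator": {"schema-name": "plm", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

response = dms.create_replication_task(
    ReplicationTaskIdentifier="plm-mysql-to-redshift",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",
    # Full load first, then ongoing change data capture for near real time.
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)
print(response["ReplicationTask"]["ReplicationTaskArn"])
```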
Comment 1239075 by Bmaster
- Upvotes: 3
B is good. DMS+CDC…
Question MYA0pP4zVyD0unCKwb2K
Question
A marketing company uses Amazon S3 to store clickstream data. The company queries the data at the end of each day by using a SQL JOIN clause on S3 objects that are stored in separate buckets.
The company creates key performance indicators (KPIs) based on the objects. The company needs a serverless solution that will give users the ability to query data by partitioning the data. The solution must maintain the atomicity, consistency, isolation, and durability (ACID) properties of the data.
Which solution will meet these requirements MOST cost-effectively?
Choices
- A: Amazon S3 Select
- B: Amazon Redshift Spectrum
- C: Amazon Athena
- D: Amazon EMR
Answer: C | Answer_ET: C | Community answer: C (82%), B (18%)
Discussion
Comment 1241500 by Ja13
- Upvotes: 5
Selected Answer: C C. Amazon Athena
Here’s why Amazon Athena is suitable:
- Serverless: Amazon Athena is a serverless query service that allows you to run SQL queries directly on data stored in Amazon S3 without the need to manage infrastructure.
- Partitioning: Athena supports querying data by partitioning, which can significantly improve query performance by limiting the amount of data scanned.
- ACID properties: Although Amazon S3 itself does not provide ACID properties, Amazon Athena ensures consistency in query results and durability of the data stored in S3 through its managed query execution.
- Cost-effective: With Amazon Athena, you only pay for the queries you run and the amount of data scanned, making it a cost-effective choice compared to managing infrastructure or using dedicated services like Amazon Redshift Spectrum or Amazon EMR.
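To illustrate the partition-pruning point, here is a minimal boto3 sketch that runs an Athena query restricted to a single partition; the database, table, partition key, and S3 results location are placeholders.

```python
import boto3

athena = boto3.client("athena")

# The WHERE clause on the partition key (dt) limits the scan to one day of
# clickstream data, which is what keeps per-query cost low.
response = athena.start_query_execution(
    QueryString=(
        "SELECT page, COUNT(*) AS views "
        "FROM clickstream "
        "WHERE dt = '2024-06-01' "
        "GROUP BY page"
    ),
    QueryExecutionContext={"Database": "marketing"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```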
Comment 1282428 by EJGisME
- Upvotes: 2
Selected Answer: C C. Amazon Athena
Comment 1279466 by EJGisME
- Upvotes: 2
Selected Answer: C Amazon Redshift Spectrum is not serverless.
Comment 1250809 by andrologin
- Upvotes: 2
Selected Answer: B Athena is cost effective as it only charges for queries run
Comment 1240838 by HunkyBunky
- Upvotes: 1
C - cheapest solution in this case
Comment 1239081 by Bmaster
- Upvotes: 1
C is good.