Questions and Answers
Question P2ON0ECNRy7k0rDkxllw
Question
A data engineer needs to debug an AWS Glue job that reads from Amazon S3 and writes to Amazon Redshift. The data engineer enabled the bookmark feature for the AWS Glue job. The data engineer has set the maximum concurrency for the AWS Glue job to 1.
The AWS Glue job is successfully writing the output to Amazon Redshift. However, the Amazon S3 files that were loaded during previous runs of the AWS Glue job are being reprocessed by subsequent runs.
What is the likely reason the AWS Glue job is reprocessing the files?
Choices
- A: The AWS Glue job does not have the s3:GetObjectAcl permission that is required for bookmarks to work correctly.
- B: The maximum concurrency for the AWS Glue job is set to 1.
- C: The data engineer incorrectly specified an older version of AWS Glue for the Glue job.
- D: The AWS Glue job does not have a required commit statement.
Answer: D | Answer_ET: D | Community answer: D (61%), A (39%)
Discussion
Comment 1243389 by lool
- Upvotes: 8
Selected Answer: D https://docs.aws.amazon.com/glue/latest/dg/glue-troubleshooting-errors.html#error-job-bookmarks-reprocess-data
Comment 1308487 by AgboolaKun
- Upvotes: 2
Selected Answer: D A “commit” statement within your AWS Glue job script is absolutely required to update the job bookmark and properly track processed data, preventing the reprocessing of old data when running the job again; essentially, if you don’t include the commit statement, the job will not remember where it left off and may process data multiple times. For more information about job.commit(), please reference this documentation - https://docs.aws.amazon.com/glue/latest/dg/glue-troubleshooting-errors.html#error-job-bookmarks-reprocess-data
Comment 1301595 by rsmf
- Upvotes: 2
Selected Answer: D B is the right answer.
Comment 1300318 by mohamedTR
- Upvotes: 2
Selected Answer: A Commit statements are relevant to transactional operations in databases like Redshift but are not related to S3 bookmarks or Glue’s tracking mechanism for processed files.
Comment 1293090 by proserv
- Upvotes: 2
Selected Answer: D Ensure that your job run script ends with the following commit:
job.commit()
When you include this object, AWS Glue records the timestamp and path of the job run. If you run the job again with the same path, AWS Glue processes only the new files. If you don’t include this object and job bookmarks are enabled, the job reprocesses the already processed files along with the new files and creates redundancy in the job’s target data store.
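To make the commit requirement concrete, here is a minimal sketch of a bookmark-enabled Glue PySpark script; the S3 path and transformation_ctx name are hypothetical. The key points are job.init() at the start and job.commit() at the end.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)

# job.init() loads the bookmark state for this job run.
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical source: with bookmarks enabled and a transformation_ctx set,
# only files not seen by previous runs are read.
frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/input/"]},
    format="csv",
    transformation_ctx="s3_source",
)

# ... transforms and the write to Amazon Redshift would go here ...

# Without this final commit, the bookmark state is never persisted and
# every run reprocesses the same S3 files (the behavior in the question).
job.commit()
```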
Comment 1285443 by azure_bimonster
- Upvotes: 1
Selected Answer: A I would go with option A.
Comment 1279301 by EJGisME
- Upvotes: 1
Selected Answer: A A. The AWS Glue job does not have the s3:GetObjectAcl permission that is required for bookmarks to work correctly.
Comment 1268231 by mzansikiller
- Upvotes: 1
Selected Answer: A Answer A
this is a job bookmarks permissions issue
Comment 1262573 by antun3ra
- Upvotes: 4
Selected Answer: A For AWS Glue bookmarks to function correctly, the job needs the necessary permissions to read and write bookmark data, including the s3:GetObjectAcl permission. If these permissions are not correctly set, the job may not be able to track which files have already been processed, leading to reprocessing of previously processed files.
Comment 1250093 by andrologin
- Upvotes: 2
Selected Answer: D AWS Glue Job requires the commit statement to save the last successful run/processing
Comment 1240833 by HunkyBunky
- Upvotes: 3
Selected Answer: D For me - D looks correct
Comment 1239451 by Alagong
- Upvotes: 3
Selected Answer: A The commit statement (Option D) is not required for AWS Glue jobs. AWS Glue commits any open transactions to the database when all the script statements finish running.
Comment 1239061 by Bmaster
- Upvotes: 4
D is good
Question JG1Sy4gdmRfIa3jod11G
Question
An ecommerce company wants to use AWS to migrate data pipelines from an on-premises environment into the AWS Cloud. The company currently uses a third-party tool in the on-premises environment to orchestrate data ingestion processes.
The company wants a migration solution that does not require the company to manage servers. The solution must be able to orchestrate Python and Bash scripts. The solution must not require the company to refactor any code.
Which solution will meet these requirements with the LEAST operational overhead?
Choices
- A: AWS Lambda
- B: Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
- C: AWS Step Functions
- D: AWS Glue
Answer: B | Answer_ET: B | Community answer: B (86%), other (14%)
Discussion
Comment 1242444 by HunkyBunky
- Upvotes: 2
Selected Answer: B B - because company want to use same tool on premises and least operational overhead
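As an illustration of why MWAA fits, here is a minimal Airflow DAG sketch of the kind MWAA runs (assuming Airflow 2.4 or later); the script names and paths are hypothetical. Existing Bash and Python logic is wrapped in operators rather than refactored.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def run_ingestion():
    # Existing Python ingestion logic would be called here unchanged.
    print("ingesting data...")


with DAG(
    dag_id="onprem_pipeline_migration",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # "schedule" is the Airflow 2.4+ argument name
    catchup=False,
):
    # Hypothetical Bash script carried over from the on-premises pipeline.
    # The trailing space stops Airflow from treating the .sh path as a Jinja template file.
    extract = BashOperator(
        task_id="extract",
        bash_command="bash /usr/local/airflow/dags/scripts/extract.sh ",
    )

    load = PythonOperator(
        task_id="load",
        python_callable=run_ingestion,
    )

    extract >> load
```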
Comment 1241595 by didorins
- Upvotes: 1
Selected Answer: C "The company wants a migration solution that does not require the company to manage servers." How is it Amazon Managed Workflows for Apache Airflow and not Step Functions, when Step Functions is the more serverless of the two?
Comment 1241497 by Ja13
- Upvotes: 3
Selected Answer: B
Comment 1240834 by HunkyBunky
- Upvotes: 1
Selected Answer: B B - best fits in task requirements
Comment 1239064 by Bmaster
- Upvotes: 2
My Choice is B
Question l6atByHKAMPCauQHRONx
Question
A data engineer needs Amazon Athena queries to finish faster. The data engineer notices that all the files the Athena queries use are currently stored in uncompressed .csv format. The data engineer also notices that users perform most queries by selecting a specific column.
Which solution will MOST speed up the Athena query performance?
Choices
- A: Change the data format from .csv to JSON format. Apply Snappy compression.
- B: Compress the .csv files by using Snappy compression.
- C: Change the data format from .csv to Apache Parquet. Apply Snappy compression.
- D: Compress the .csv files by using gzip compression.
Answer: C | Answer_ET: C | Community answer: C (100%)
Discussion
Comment 1125641 by milofficial
- Upvotes: 11
Selected Answer: C If the exam would only have these kinds of questions everyone would be blessed
Comment 1137892 by TonyStark0122
- Upvotes: 6
C. Change the data format from .csv to Apache Parquet. Apply Snappy compression.
Explanation: Apache Parquet is a columnar storage format optimized for analytical queries. It is highly efficient for query performance, especially when queries involve selecting specific columns, as it allows for column pruning and predicate pushdown optimizations.
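As an illustration of the conversion itself (not part of the question), here is a minimal pyarrow sketch that rewrites a hypothetical CSV file as Snappy-compressed Parquet; at scale the same conversion could be done with a Glue job or an Athena CTAS statement.

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Hypothetical local copy of one of the uncompressed .csv objects.
table = pv.read_csv("events_2024-06-01.csv")

# Parquet stores each column separately, so Athena can read only the
# selected column; Snappy compression keeps the scan size (and cost) down.
pq.write_table(table, "events_2024-06-01.parquet", compression="snappy")
```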
Comment 1410025 by Scotty_Nguyen
- Upvotes: 1
Selected Answer: C C is correct
Comment 1254504 by GabrielSGoncalves
- Upvotes: 1
Selected Answer: C C is the way to do it, based on the best practices recommended by AWS (https://aws.amazon.com/pt/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/)
Comment 1210121 by hnk
- Upvotes: 1
Selected Answer: C C is correct
Comment 1209089 by k350Secops
- Upvotes: 1
Selected Answer: C switching to Apache Parquet format with Snappy compression offers the most significant improvement in Athena query performance, especially for queries that select specific columns
Comment 1207115 by d8945a1
- Upvotes: 1
Selected Answer: C Parquet is a columnar storage format, and the question specifies that users perform most queries by selecting a specific column.
Comment 1191578 by wa212
- Upvotes: 2
Selected Answer: C https://aws.amazon.com/jp/blogs/news/top-10-performance-tuning-tips-for-amazon-athena/
Comment 1158040 by Alcee
- Upvotes: 1
C easy
Question NN6uiHStJJsoMvogsNkU
Question
A retail company stores data from a product lifecycle management (PLM) application in an on-premises MySQL database. The PLM application frequently updates the database when transactions occur.
The company wants to gather insights from the PLM application in near real time. The company wants to integrate the insights with other business datasets and to analyze the combined dataset by using an Amazon Redshift data warehouse.
The company has already established an AWS Direct Connect connection between the on-premises infrastructure and AWS.
Which solution will meet these requirements with the LEAST development effort?
Choices
- A: Run a scheduled AWS Glue extract, transform, and load (ETL) job to get the MySQL database updates by using a Java Database Connectivity (JDBC) connection. Set Amazon Redshift as the destination for the ETL job.
- B: Run a full load plus CDC task in AWS Database Migration Service (AWS DMS) to continuously replicate the MySQL database changes. Set Amazon Redshift as the destination for the task.
- C: Use the Amazon AppFlow SDK to build a custom connector for the MySQL database to continuously replicate the database changes. Set Amazon Redshift as the destination for the connector.
- D: Run scheduled AWS DataSync tasks to synchronize data from the MySQL database. Set Amazon Redshift as the destination for the tasks.
Answer: B | Answer_ET: B | Community answer: B (100%)
Discussion
Comment 1254560 by andrologin
- Upvotes: 1
Selected Answer: B AWS DMS allows for change data capture that will have the destination updated at near real time with changes from the source database
Comment 1254524 by Fredrik1
- Upvotes: 1
Selected Answer: B Should B. Makes most sense.
Comment 1253127 by Chelseajcole
- Upvotes: 1
ChatGPT: Option B (AWS DMS) is the most suitable with the least development effort. AWS DMS supports continuous data replication with CDC capabilities, making it well-suited for near real-time data integration from MySQL to Amazon Redshift. It handles schema conversion and simplifies the setup process compared to custom development or scheduled ETL jobs. Given the existing AWS Direct Connect, AWS DMS can efficiently replicate MySQL updates to Redshift with minimal latency, meeting the company’s requirement for near real-time insights integration. Therefore, option B is the correct choice.
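For reference, here is a minimal boto3 sketch of a full load plus CDC task. It assumes the MySQL source endpoint, Redshift target endpoint, and replication instance already exist; the ARNs and schema name are placeholders.

```python
import json

import boto3

dms = boto3.client("dms")

# Replicate every table in a hypothetical "plm" schema.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-plm-schema",
            "object-locator": {"schema-name": "plm", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

response = dms.create_replication_task(
    ReplicationTaskIdentifier="plm-mysql-to-redshift",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",
    # Full load first, then ongoing change data capture for near real time.
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)
print(response["ReplicationTask"]["ReplicationTaskArn"])
```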
Comment 1239075 by Bmaster
- Upvotes: 3
B is good. DMS+CDC…
Question MYA0pP4zVyD0unCKwb2K
Question
A marketing company uses Amazon S3 to store clickstream data. The company queries the data at the end of each day by using a SQL JOIN clause on S3 objects that are stored in separate buckets.
The company creates key performance indicators (KPIs) based on the objects. The company needs a serverless solution that will give users the ability to query data by partitioning the data. The solution must maintain the atomicity, consistency, isolation, and durability (ACID) properties of the data.
Which solution will meet these requirements MOST cost-effectively?
Choices
- A: Amazon S3 Select
- B: Amazon Redshift Spectrum
- C: Amazon Athena
- D: Amazon EMR
Answer: C | Answer_ET: C | Community answer: C (82%), B (18%)
Discussion
Comment 1241500 by Ja13
- Upvotes: 5
Selected Answer: C C. Amazon Athena
Here’s why Amazon Athena is suitable:
- Serverless: Amazon Athena is a serverless query service that allows you to run SQL queries directly on data stored in Amazon S3 without the need to manage infrastructure.
- Partitioning: Athena supports querying data by partitioning, which can significantly improve query performance by limiting the amount of data scanned.
- ACID properties: Although Amazon S3 itself does not provide ACID properties, Amazon Athena ensures consistency in query results and durability of the data stored in S3 through its managed query execution.
- Cost-effective: With Amazon Athena, you only pay for the queries you run and the amount of data scanned, making it a cost-effective choice compared to managing infrastructure or using dedicated services like Amazon Redshift Spectrum or Amazon EMR.
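To illustrate the partition-pruning point, here is a minimal boto3 sketch that runs an Athena query restricted to a single partition; the database, table, partition key, and S3 results location are placeholders.

```python
import boto3

athena = boto3.client("athena")

# The WHERE clause on the partition key (dt) limits the scan to one day of
# clickstream data, which is what keeps per-query cost low.
response = athena.start_query_execution(
    QueryString=(
        "SELECT page, COUNT(*) AS views "
        "FROM clickstream "
        "WHERE dt = '2024-06-01' "
        "GROUP BY page"
    ),
    QueryExecutionContext={"Database": "marketing"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```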
Comment 1282428 by EJGisME
- Upvotes: 2
Selected Answer: C C. Amazon Athena
Comment 1279466 by EJGisME
- Upvotes: 2
Selected Answer: C Amazon Redshift Spectrum is not serverless.
Comment 1250809 by andrologin
- Upvotes: 2
Selected Answer: B Athena is cost effective as it only charges for queries run
Comment 1240838 by HunkyBunky
- Upvotes: 1
C - cheapest solution in this case
Comment 1239081 by Bmaster
- Upvotes: 1
C is good.