Questions and Answers
Question SzadKFLfb5b1S658kDok
Question
A company uses Amazon EMR as an extract, transform, and load (ETL) pipeline to transform data that comes from multiple sources. A data engineer must orchestrate the pipeline to maximize performance.
Which AWS service will meet this requirement MOST cost effectively?
Choices
- A: Amazon EventBridge
- B: Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
- C: AWS Step Functions
- D: AWS Glue Workflows
Answer: C. Answer_ET: C. Community answer: C (76%), 12%, 12%.
Discussion
Comment 1230571 by artworkad
- Upvotes: 5
Selected Answer: C Glue Workflows is only for orchestrating Glue jobs. C (Step Functions) is for orchestration across different AWS services.
Comment 1269398 by hcong
- Upvotes: 1
Selected Answer: B Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is the best service for orchestrating complex data pipelines, especially for workloads already using Amazon EMR. Airflow is a powerful workflow orchestration tool that can be integrated with various AWS services, including EMR, to provide flexible scheduling, task dependency management, and monitoring capabilities. Using a hosted Airflow service (MWAA) can reduce administrative overhead while maintaining a familiar workflow orchestration environment.
Comment 1253680 by chrispchrisp
- Upvotes: 3
Selected Answer: C B is not cost-effective, and D only orchestrates Glue jobs and crawlers within AWS Glue itself. Hence C is correct; Step Functions is cost-effective and can link together your different AWS services.
Comment 1249177 by andrologin
- Upvotes: 2
Selected Answer: C This is EMR, not Glue, so Glue Workflows does not apply; hence Step Functions. EventBridge is best suited for event-driven architecture.
Comment 1249139 by LR2023
- Upvotes: 1
Selected Answer: B https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/
Comment 1240685 by bakarys
- Upvotes: 2
Selected Answer: D The most cost-effective AWS service for orchestrating an ETL pipeline that maximizes performance is D. AWS Glue Workflows.
AWS Glue is a fully managed ETL service that makes it easy to move data between your data stores. AWS Glue simplifies and automates the difficult and time-consuming tasks of data discovery, conversion, mapping, and job scheduling. AWS Glue Workflows allows you to orchestrate complex ETL jobs involving multiple crawlers, jobs, and triggers.
While the other services mentioned (Amazon EventBridge, Amazon MWAA, and AWS Step Functions) can be used for workflow orchestration, they are not specifically designed for ETL workloads and may not be as cost-effective for this use case. AWS Glue is designed for ETL workloads, and its workflows feature is specifically designed for orchestrating ETL jobs, making it the most suitable and cost-effective choice.
Comment 1233782 by HunkyBunky
- Upvotes: 1
Selected Answer: C C - because AWS Glue Workflows can be used only for Glue-based ETL jobs
Comment 1230798 by tgv
- Upvotes: 2
Selected Answer: C While AWS Glue Workflows are excellent for orchestrating Glue-specific ETL tasks, AWS Step Functions is more suitable for orchestrating an Amazon EMR-based ETL pipeline due to its greater flexibility, broader integration capabilities, and effective cost management. Therefore, the correct choice remains [C]
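Note: for reference, Step Functions has a native EMR integration (elasticmapreduce:addStep.sync), so a single state can submit an ETL step to an existing cluster and wait for it to finish. Below is a minimal boto3 sketch of that pattern; the cluster ID, role ARN, script path, and state machine name are hypothetical placeholders, not values from the question.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Amazon States Language definition: add one ETL step to an existing EMR
# cluster and wait for it to complete (the .sync integration pattern).
definition = {
    "StartAt": "RunEmrEtlStep",
    "States": {
        "RunEmrEtlStep": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
            "Parameters": {
                "ClusterId": "j-EXAMPLECLUSTER",  # hypothetical cluster ID
                "Step": {
                    "Name": "transform-multi-source-data",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": ["spark-submit", "s3://example-bucket/etl/transform.py"],
                    },
                },
            },
            "End": True,
        }
    },
}

# The execution role must allow the EMR AddJobFlowSteps/DescribeStep calls.
response = sfn.create_state_machine(
    name="emr-etl-orchestration",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsEmrRole",  # hypothetical role
)
print(response["stateMachineArn"])
```

Step Functions charges per state transition, which is part of why the discussion favors it over MWAA (an always-on Airflow environment) for a pipeline this simple.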
Question 9KEyo1OeYMdXLmlbCDMM
Question
An online retail company stores Application Load Balancer (ALB) access logs in an Amazon S3 bucket. The company wants to use Amazon Athena to query the logs to analyze traffic patterns.
A data engineer creates an unpartitioned table in Athena. As the amount of data gradually increases, the response time for queries also increases. The data engineer wants to improve the query performance in Athena.
Which solution will meet these requirements with the LEAST operational effort?
Choices
- A: Create an AWS Glue job that determines the schema of all ALB access logs and writes the partition metadata to AWS Glue Data Catalog.
- B: Create an AWS Glue crawler that includes a classifier that determines the schema of all ALB access logs and writes the partition metadata to AWS Glue Data Catalog.
- C: Create an AWS Lambda function to transform all ALB access logs. Save the results to Amazon S3 in Apache Parquet format. Partition the metadata. Use Athena to query the transformed data.
- D: Use Apache Hive to create bucketed tables. Use an AWS Lambda function to transform all ALB access logs.
Answer: B. Answer_ET: B. Community answer: B (90%), 10%.
Discussion
Comment 1234751 by PGGuy
- Upvotes: 5
Selected Answer: B Creating an AWS Glue crawler (Option B) is the most straightforward and least operationally intensive approach to automatically determine the schema, partition the data, and keep the AWS Glue Data Catalog updated. This ensures Athena queries are optimized without requiring extensive manual management or additional processing steps.
Comment 1249179 by andrologin
- Upvotes: 1
Selected Answer: C An AWS Glue crawler with classifiers allows you to determine the schema of the files/data, which can then be used to partition the data for Athena query optimization.
Comment 1230799 by tgv
- Upvotes: 4
Selected Answer: B An AWS Glue crawler can automatically determine the schema of the logs, infer partitions, and update the Glue Data Catalog. Crawlers can be scheduled to run at intervals, minimizing manual intervention.
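Note: as a rough illustration of option B, a crawler pointed at the ALB log prefix can be created and scheduled with a couple of boto3 calls. The bucket path, IAM role, database, and classifier names below are hypothetical, and ALB access logs typically need a custom grok classifier, which is assumed to already exist here.

```python
import boto3

glue = boto3.client("glue")

# Crawler over the ALB access-log prefix: it infers the schema, discovers
# partitions, and writes the metadata to the AWS Glue Data Catalog.
glue.create_crawler(
    Name="alb-access-logs-crawler",                          # hypothetical name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # hypothetical role
    DatabaseName="alb_logs_db",
    Classifiers=["alb-access-log-classifier"],  # custom grok classifier assumed to exist
    Targets={
        "S3Targets": [
            {"Path": "s3://example-alb-logs-bucket/AWSLogs/123456789012/elasticloadbalancing/"}
        ]
    },
    # Run daily so newly delivered log prefixes are registered automatically.
    Schedule="cron(0 1 * * ? *)",
)

glue.start_crawler(Name="alb-access-logs-crawler")
```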
Question D3XBg1g4cPunXac7Mmm5
Question
A company has a business intelligence platform on AWS. The company uses an AWS Storage Gateway Amazon S3 File Gateway to transfer files from the company’s on-premises environment to an Amazon S3 bucket.
A data engineer needs to set up a process that will automatically launch an AWS Glue workflow to run a series of AWS Glue jobs when each file transfer finishes successfully.
Which solution will meet these requirements with the LEAST operational overhead?
Choices
- A: Determine when the file transfers usually finish based on previous successful file transfers. Set up an Amazon EventBridge scheduled event to initiate the AWS Glue jobs at that time of day.
- B: Set up an Amazon EventBridge event that initiates the AWS Glue workflow after every successful S3 File Gateway file transfer event.
- C: Set up an on-demand AWS Glue workflow so that the data engineer can start the AWS Glue workflow when each file transfer is complete.
- D: Set up an AWS Lambda function that will invoke the AWS Glue Workflow. Set up an event for the creation of an S3 object as a trigger for the Lambda function.
Answer: B. Answer_ET: B. Community answer: B (80%), 10%, 10%.
Discussion
Comment 1230802 by tgv
- Upvotes: 5
Selected Answer: B Using EventBridge directly to trigger the AWS Glue workflow upon S3 events is straightforward and leverages AWS’s event-driven architecture, requiring minimal maintenance.
Comment 1249181 by andrologin
- Upvotes: 1
Selected Answer: C An event-driven architecture based on S3 file creation can only be EventBridge.
Comment 1242304 by bakarys
- Upvotes: 2
Selected Answer: B Setting up an Amazon EventBridge event that initiates the AWS Glue workflow after every successful S3 File Gateway file transfer event would meet these requirements with the least operational overhead.
This solution is event-driven and does not require manual intervention or reliance on a schedule that might not align with the actual completion time of the file transfers. The AWS Glue workflow is triggered automatically when a new file is added to the S3 bucket, ensuring that the AWS Glue workflow starts processing the new data as soon as it’s available.
Comment 1240704 by bakarys
- Upvotes: 1
Selected Answer: D The solution that will meet these requirements with the least operational overhead is Option D.
Setting up an AWS Lambda function that will invoke the AWS Glue Workflow, and setting up an event for the creation of an S3 object as a trigger for the Lambda function, will ensure that the workflow is automatically initiated each time a file transfer is successfully completed. This approach requires minimal operational overhead as it automates the process and does not require manual intervention or scheduling based on estimated completion times.
Options A and C involve manual intervention or assumptions about transfer times, which could lead to inefficiencies or inaccuracies. Option B is not feasible because Amazon EventBridge does not directly support triggering events based on S3 File Gateway file transfer events. Therefore, Option D is the most suitable solution.
Comment 1234753 by PGGuy
- Upvotes: 1
Selected Answer: B Setting up an Amazon EventBridge event (Option B) to initiate the AWS Glue workflow after every successful S3 File Gateway file transfer event is the most efficient solution. It provides real-time automation with minimal operational overhead, ensuring that the Glue workflow starts immediately after the file transfer is complete.
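Note: for context on option B, a Glue workflow can be started by EventBridge when the workflow's first trigger is of type EVENT and an EventBridge rule targets the workflow. The boto3 sketch below assumes hypothetical bucket, workflow, job, and role names, and assumes EventBridge notifications are enabled on the S3 bucket that the File Gateway writes to.

```python
import json
import boto3

glue = boto3.client("glue")
events = boto3.client("events")

# 1) The workflow's starting trigger must be of type EVENT so that
#    EventBridge can launch the workflow.
glue.create_trigger(
    Name="start-on-file-transfer",            # hypothetical trigger name
    WorkflowName="nightly-bi-workflow",       # hypothetical existing workflow
    Type="EVENT",
    Actions=[{"JobName": "first-glue-job"}],  # hypothetical first job in the series
)

# 2) Rule that matches S3 "Object Created" events for the landing bucket.
events.put_rule(
    Name="s3-file-gateway-upload",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["example-file-gateway-bucket"]}},
    }),
)

# 3) Point the rule at the Glue workflow (role must allow glue:NotifyEvent).
events.put_targets(
    Rule="s3-file-gateway-upload",
    Targets=[{
        "Id": "glue-workflow",
        "Arn": "arn:aws:glue:us-east-1:123456789012:workflow/nightly-bi-workflow",
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeInvokeGlueRole",  # hypothetical
    }],
)
```

This is also why option D is considered extra overhead in the discussion: the Lambda function only re-implements what the EventBridge rule already does natively.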
Question qs0SfxzxHTOyetBSxoCb
Question
A retail company uses Amazon Aurora PostgreSQL to process and store live transactional data. The company uses an Amazon Redshift cluster for a data warehouse.
An extract, transform, and load (ETL) job runs every morning to update the Redshift cluster with new data from the PostgreSQL database. The company has grown rapidly and needs to cost optimize the Redshift cluster.
A data engineer needs to create a solution to archive historical data. The data engineer must be able to run analytics queries that effectively combine data from live transactional data in PostgreSQL, current data in Redshift, and archived historical data. The solution must keep only the most recent 15 months of data in Amazon Redshift to reduce costs.
Which combination of steps will meet these requirements? (Choose two.)
Choices
- A: Configure the Amazon Redshift Federated Query feature to query live transactional data that is in the PostgreSQL database.
- B: Configure Amazon Redshift Spectrum to query live transactional data that is in the PostgreSQL database.
- C: Schedule a monthly job to copy data that is older than 15 months to Amazon S3 by using the UNLOAD command. Delete the old data from the Redshift cluster. Configure Amazon Redshift Spectrum to access historical data in Amazon S3.
- D: Schedule a monthly job to copy data that is older than 15 months to Amazon S3 Glacier Flexible Retrieval by using the UNLOAD command. Delete the old data from the Redshift cluster. Configure Redshift Spectrum to access historical data from S3 Glacier Flexible Retrieval.
- E: Create a materialized view in Amazon Redshift that combines live, current, and historical data from different sources.
Answer: A. Answer_ET: A. Community answer: A (93%), 7%.
Discussion
Comment 1230684 by lalitjhawar
- Upvotes: 7
Option A (A): Configuring Amazon Redshift Federated Query allows Redshift to directly query the live transactional data in the PostgreSQL database without needing to import it. This ensures that you can access the most recent live data efficiently.
Option C (C): Scheduling a monthly job to copy data older than 15 months to Amazon S3 and then using Amazon Redshift Spectrum to access this historical data provides a cost-effective way to manage storage. This ensures that only the most recent 15 months of data are kept in Amazon Redshift, reducing storage costs. The historical data is still accessible via Redshift Spectrum for analytics queries.
Comment 1398916 by Palee
- Upvotes: 1
Selected Answer: D Options A and D. Option C doesn't talk about archiving historical data.
Comment 1323939 by Vidhi212
- Upvotes: 2
Selected Answer: A The correct combination of steps is:
A. Configure the Amazon Redshift Federated Query feature to query live transactional data that is in the PostgreSQL database.
This feature allows Amazon Redshift to directly query live transactional data in the PostgreSQL database without moving the data, enabling seamless integration with the data warehouse. C. Schedule a monthly job to copy data that is older than 15 months to Amazon S3 by using the UNLOAD command. Delete the old data from the Redshift cluster. Configure Amazon Redshift Spectrum to access historical data in Amazon S3.
This step archives older data to Amazon S3, which is more cost-effective than storing it in Redshift. Redshift Spectrum allows querying this archived data directly from S3, ensuring analytics queries can still access historical data.
Comment 1322618 by SambitParida
- Upvotes: 1
Selected Answer: A A & C. Redshift spectrum cant read from glacier
Comment 1301570 by rsmf
- Upvotes: 1
Selected Answer: A A & C is the best choice
Comment 1294147 by mohamedTR
- Upvotes: 1
Selected Answer: A A & C: allows exporting Redshift data to Amazon S3 and ability to frequent access
Comment 1233788 by HunkyBunky
- Upvotes: 1
Selected Answer: A A / C is the best choice.
Comment 1230883 by artworkad
- Upvotes: 4
Selected Answer: A AC is correct. D is not correct, because Redshift Spectrum cannot read from S3 Glacier Flexible Retrieval.
Comment 1230807 by tgv
- Upvotes: 4
Selected Answer: A Choice A ensures that live transactional data from PostgreSQL can be accessed directly within Redshift queries.
Choice C archives historical data in Amazon S3, reducing storage costs in Redshift while still making the data accessible via Redshift Spectrum.
(to Admin: I can’t select multiple answers on the voting comment)
Comment 1230611 by GHill1982
- Upvotes: 2
Correct answer is A and C.
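Note: to make A and C concrete, the sketch below uses the Redshift Data API to run the relevant SQL: an external schema for Federated Query against the Aurora PostgreSQL database, an UNLOAD of rows older than 15 months to S3 in Parquet, and an external schema for Redshift Spectrum over the archive. All cluster, table, endpoint, secret, and IAM role identifiers are hypothetical.

```python
import boto3

rsd = boto3.client("redshift-data")

def run_sql(sql: str) -> None:
    """Submit one statement to the Redshift cluster via the Data API."""
    rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",  # hypothetical cluster
        Database="dev",
        DbUser="admin",
        Sql=sql,
    )

# A) Federated Query: external schema over the live Aurora PostgreSQL database.
run_sql("""
CREATE EXTERNAL SCHEMA IF NOT EXISTS postgres_live
FROM POSTGRES DATABASE 'retaildb' SCHEMA 'public'
URI 'aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:aurora-creds'
""")

# C) Archive rows older than 15 months to S3 as Parquet, then delete them locally.
run_sql("""
UNLOAD ('SELECT * FROM sales WHERE sale_date < DATEADD(month, -15, CURRENT_DATE)')
TO 's3://example-archive-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET
""")
run_sql("DELETE FROM sales WHERE sale_date < DATEADD(month, -15, CURRENT_DATE)")

# C) Spectrum external schema so the archived data stays queryable from Redshift.
run_sql("""
CREATE EXTERNAL SCHEMA IF NOT EXISTS sales_archive
FROM DATA CATALOG DATABASE 'sales_archive_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS
""")
```

With both external schemas in place, a single Redshift query can join live PostgreSQL data, current Redshift tables, and the archived S3 data, which is why the discussion treats E (a materialized view) as unnecessary on its own.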
Question 9Mnl748IBDSbnORjNFvG
Question
A manufacturing company has many IoT devices in facilities around the world. The company uses Amazon Kinesis Data Streams to collect data from the devices. The data includes device ID, capture date, measurement type, measurement value, and facility ID. The company uses facility ID as the partition key.
The company’s operations team recently observed many WriteThroughputExceeded exceptions. The operations team found that some shards were heavily used but other shards were generally idle.
How should the company resolve the issues that the operations team observed?
Choices
- A: Change the partition key from facility ID to a randomly generated key.
- B: Increase the number of shards.
- C: Archive the data on the producer’s side.
- D: Change the partition key from facility ID to capture date.
Answer: A. Answer_ET: A. Community answer: A (100%).
Discussion
Comment 1230808 by tgv
- Upvotes: 6
Selected Answer: A The best solution to resolve the issue of uneven shard usage and WriteThroughputExceeded exceptions is to balance the load more evenly across the shards. This can be effectively achieved by changing the partition key to something that ensures a more uniform distribution of data across the shards.
Comment 1242308 by bakarys
- Upvotes: 2
Selected Answer: A The correct answer is A. Change the partition key from facility ID to a randomly generated key.
Amazon Kinesis Data Streams uses the partition key that you specify to segregate the data records in the stream into shards. If the company uses the facility ID as the partition key, and if some facilities produce more data than others, then the data will be unevenly distributed across the shards. This can lead to some shards being heavily used while others are idle, and can cause WriteThroughputExceeded exceptions.
By changing the partition key to a randomly generated key, the data records are more likely to be evenly distributed across all the shards, which helps avoid the issue of some shards being heavily used while others are idle. This solution requires the least operational overhead and does not involve increasing costs (as in option B), archiving data (which might not be desirable or feasible, as in option C), or changing to a partition key that might also lead to uneven distribution (as in option D).
Comment 1241374 by didorins
- Upvotes: 2
Selected Answer: A D is not good, because you’re effectively making things worse by partitioning by date. My answer is A
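Note: to illustrate option A, the producer can supply a high-cardinality partition key such as a UUID instead of the facility ID so that records spread evenly across shards. The stream name and record layout below are hypothetical; the trade-off is that per-facility ordering within a shard is no longer guaranteed.

```python
import json
import uuid
import boto3

kinesis = boto3.client("kinesis")

def put_measurement(record: dict) -> None:
    """Write one IoT measurement with a random partition key.

    Using uuid4 instead of the facility ID spreads records evenly across
    shards, avoiding the hot shards behind WriteThroughputExceeded errors.
    """
    kinesis.put_record(
        StreamName="iot-measurements",           # hypothetical stream name
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=str(uuid.uuid4()),
    )

put_measurement({
    "device_id": "sensor-042",
    "capture_date": "2024-06-01T12:00:00Z",
    "measurement_type": "temperature",
    "measurement_value": 21.7,
    "facility_id": "facility-eu-01",
})
```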