Questions and Answers
Question ULHo9HLLi3Q5IqiL0hzi
Question
A company loads transaction data for each day into Amazon Redshift tables at the end of each day. The company wants to have the ability to track which tables have been loaded and which tables still need to be loaded. A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table. The data engineer creates an AWS Lambda function to publish the details of the load statuses to DynamoDB. How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB table?
Choices
- A: Use a second Lambda function to invoke the first Lambda function based on Amazon CloudWatch events.
- B: Use the Amazon Redshift Data API to publish an event to Amazon EventBridge. Configure an EventBridge rule to invoke the Lambda function.
- C: Use the Amazon Redshift Data API to publish a message to an Amazon Simple Queue Service (Amazon SQS) queue. Configure the SQS queue to invoke the Lambda function.
- D: Use a second Lambda function to invoke the first Lambda function based on AWS CloudTrail events.
Answer: B Answer_ET: B Community answer B (88%) 13% Discussion
Comment 1127208 by milofficial
- Upvotes: 13
Selected Answer: B https://docs.aws.amazon.com/redshift/latest/mgmt/data-api-monitoring-events.html
Comment 1137923 by TonyStark0122
- Upvotes: 9
The most appropriate way for the data engineer to invoke the Lambda function to write load statuses to the DynamoDB table is:
B. Use the Amazon Redshift Data API to publish an event to Amazon EventBridge. Configure an EventBridge rule to invoke the Lambda function.
Explanation: Option B leverages the Amazon Redshift Data API to publish events to Amazon EventBridge, which provides a serverless event bus service for handling events across AWS services. By configuring an EventBridge rule to invoke the Lambda function in response to events published by the Redshift Data API, the data engineer can ensure that the Lambda function is triggered whenever there is a new transaction data load in Amazon Redshift. This approach offers a straightforward and scalable solution for tracking table load statuses without relying on additional Lambda functions or services.
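For illustration, here is a minimal sketch of the Lambda half of the pattern option B describes: an EventBridge rule on Redshift Data API events invokes a Lambda function that records the load status in DynamoDB. The table name, event fields, and rule pattern below are assumptions for the sketch, not details from the question, and the exact event shape should be verified against the Redshift Data API monitoring documentation.

```python
import json
import boto3

# Example EventBridge event pattern for the rule (verify the exact
# source/detail-type strings against the Redshift Data API docs):
# {
#   "source": ["aws.redshift-data"],
#   "detail-type": ["Redshift Data Statement Status Change"]
# }

dynamodb = boto3.resource("dynamodb")
# Hypothetical table name; the question does not specify one.
status_table = dynamodb.Table("redshift_load_status")


def lambda_handler(event, context):
    """Triggered by an EventBridge rule matching Redshift Data API events.

    Writes the load status carried in the event detail to DynamoDB.
    The detail fields used here (statementId, state) are assumptions
    about the event payload.
    """
    detail = event.get("detail", {})
    status_table.put_item(
        Item={
            "statement_id": detail.get("statementId", "unknown"),
            "state": detail.get("state", "UNKNOWN"),
            "event_time": event.get("time", ""),
        }
    )
    return {"statusCode": 200, "body": json.dumps("status recorded")}
```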
Comment 1395673 by MephiboshethGumani
- Upvotes: 1
Selected Answer: B The data engineer should use Amazon EventBridge (formerly CloudWatch Events) to trigger the Lambda function based on a schedule or on events that correspond to the completion of the data load process in Amazon Redshift.
Comment 1321607 by altonh
- Upvotes: 2
Selected Answer: D The statement in B is inaccurate. You don’t ‘use the Amazon Redshift Data API to publish’ an event to EventBridge; the Redshift Data API has no function to write to EventBridge. Instead, the statement should be “Use EventBridge to monitor Data API events…” Perhaps this is a typo.
But if I assume there are no typos in any of the statements, then I would go for D. Although it is not a perfect solution, the CloudTrail events carry more information than the Redshift Data API events.
Comment 1268439 by taxo
- Upvotes: 1
This job doesn’t need a real-time check.
Comment 1235417 by John2025
- Upvotes: 1
Why not use SQS to keep the API changes in a queue?
Comment 1226983 by pypelyncar
- Upvotes: 1
I’m not 100% sure of B or C; this is a tricky question. The reason is that neither SQS nor EventBridge has a direct, native connection to the Redshift Data API, and there is no way for it to publish events by itself. So either SQS or EventBridge eventually needs a “proxy” (e.g., a Lambda function) in order to publish or process events for those two targets. In both cases we need something to publish those events from Redshift, so yes, we need a Lambda function between the Redshift Data API and SQS or EventBridge. Neither B nor C seems 100% right on its own. I think this question is a good candidate to be “Choose two options,” but as written neither is fully correct; both are valid if you assume an adapter function sits between the two pieces.
Question Iu0xZ0VfliO1qEqOGtO5
Question
A company receives test results from testing facilities that are located around the world. The company stores the test results in millions of 1 KB JSON files in an Amazon S3 bucket. A data engineer needs to process the files, convert them into Apache Parquet format, and load them into Amazon Redshift tables. The data engineer uses AWS Glue to process the files, AWS Step Functions to orchestrate the processes, and Amazon EventBridge to schedule jobs.
The company recently added more testing facilities. The time required to process files is increasing. The data engineer must reduce the data processing time.
Which solution will MOST reduce the data processing time?
Choices
- A: Use AWS Lambda to group the raw input files into larger files. Write the larger files back to Amazon S3. Use AWS Glue to process the files. Load the files into the Amazon Redshift tables.
- B: Use the AWS Glue dynamic frame file-grouping option to ingest the raw input files. Process the files. Load the files into the Amazon Redshift tables.
- C: Use the Amazon Redshift COPY command to move the raw input files from Amazon S3 directly into the Amazon Redshift tables. Process the files in Amazon Redshift.
- D: Use Amazon EMR instead of AWS Glue to group the raw input files. Process the files in Amazon EMR. Load the files into the Amazon Redshift tables.
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1336642 by minhhnh
- Upvotes: 1
Selected Answer: B The key requirement is to reduce processing time for millions of small JSON files stored in Amazon S3. The solution needs to address the inefficiencies caused by the large number of small files while leveraging the existing AWS Glue and Amazon Redshift setup.
Comment 1265764 by aragon_saa
- Upvotes: 1
Selected Answer: B Answer is B
Comment 1265701 by matt200
- Upvotes: 1
Selected Answer: B Option B: Use the AWS Glue dynamic frame file-grouping option to ingest the raw input files. Process the files. Load the files into the Amazon Redshift tables.
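For reference, here is a minimal sketch of the file-grouping approach from option B inside a Glue job. The S3 paths and group size are placeholders, and in practice the Parquet output would then be loaded into Redshift (for example, via COPY); the sketch only covers the small-file grouping and conversion step.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# groupFiles/groupSize coalesce millions of small JSON objects into
# larger in-memory groups, reducing per-file task overhead on read.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://example-test-results-bucket/raw/"],  # placeholder path
        "groupFiles": "inPartition",
        "groupSize": "134217728",  # ~128 MB per group; tune as needed
    },
    format="json",
)

# Write the grouped data back out as Parquet (placeholder output path);
# loading into the Redshift tables would follow as a separate step.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-test-results-bucket/parquet/"},
    format="parquet",
)
```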
Question 0cMk3YyDdi2I9p7JQY7R
Question
A data engineer uses Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to run data pipelines in an AWS account.
A workflow recently failed to run. The data engineer needs to use Apache Airflow logs to diagnose the failure of the workflow.
Which log type should the data engineer use to diagnose the cause of the failure?
Choices
- A: YourEnvironmentName-WebServer
- B: YourEnvironmentName-Scheduler
- C: YourEnvironmentName-DAGProcessing
- D: YourEnvironmentName-Task
Answer: D Answer_ET: D Community answer D (82%) C (18%) Discussion
Comment 1264521 by 150b64e
- Upvotes: 6
Selected Answer: D https://pupuweb.com/amazon-dea-c01-which-apache-airflow-log-type-should-you-use-to-diagnose-workflow-failures-in-amazon-mwaa/
When a workflow fails to run in Amazon MWAA, the task logs (YourEnvironmentName-Task) are the most relevant for diagnosing the issue. Task logs contain detailed information about the execution of individual tasks within the workflow, including any error messages or stack traces that can help pinpoint the cause of the failure.
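As a quick illustration, MWAA publishes these task logs to a CloudWatch Logs group named after the environment, so they can be searched directly. The environment name and filter pattern below are placeholders for the sketch.

```python
import boto3

logs = boto3.client("logs")

# MWAA writes task logs to a log group of the form
# airflow-<EnvironmentName>-Task; the name here is a placeholder.
response = logs.filter_log_events(
    logGroupName="airflow-YourEnvironmentName-Task",
    filterPattern="ERROR",  # narrow to error lines from failed task attempts
    limit=50,
)

for event in response.get("events", []):
    print(event["timestamp"], event["message"])
```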
Comment 1265000 by teo2157
- Upvotes: 3
Selected Answer: D Agree with D based on 150b64e comments
Comment 1261445 by Shanmahi
- Upvotes: 2
Selected Answer: C Reference ⇒ https://aws.amazon.com/managed-workflows-for-apache-airflow/
Question UvwwSX81mqcpmbINLquT
Question
A finance company uses Amazon Redshift as a data warehouse. The company stores the data in a shared Amazon S3 bucket. The company uses Amazon Redshift Spectrum to access the data that is stored in the S3 bucket. The data comes from certified third-party data providers. Each third-party data provider has unique connection details.
To comply with regulations, the company must ensure that none of the data is accessible from outside the company’s AWS environment.
Which combination of steps should the company take to meet these requirements? (Choose two.)
Choices
- A: Replace the existing Redshift cluster with a new Redshift cluster that is in a private subnet. Use an interface VPC endpoint to connect to the Redshift cluster. Use a NAT gateway to give Redshift access to the S3 bucket.
- B: Create an AWS CloudHSM hardware security module (HSM) for each data provider. Encrypt each data provider’s data by using the corresponding HSM for each data provider.
- C: Turn on enhanced VPC routing for the Amazon Redshift cluster. Set up an AWS Direct Connect connection and configure a connection between each data provider and the finance company’s VPC.
- D: Define table constraints for the primary keys and the foreign keys.
- E: Use federated queries to access the data from each data provider. Do not upload the data to the S3 bucket. Perform the federated queries through a gateway VPC endpoint.
Answer: AC Answer_ET: AC Community answer AC (45%) AE (36%) CE (18%) Discussion
Comment 1336874 by BigMrT
- Upvotes: 2
Selected Answer: CE A doesn’t make sense considering the NAT gateway, since that is usually used to route traffic to the internet. If it were an S3 gateway endpoint it would make more sense. E makes sense if the configurations are correct.
Comment 1330177 by kailu
- Upvotes: 4
Selected Answer: AE Shouldn’t it be E and not C? Federated Queries: This method allows Redshift to query data directly from external sources without needing to store the data in Amazon S3. By using federated queries, the company can query third-party data sources without moving data into S3, reducing the attack surface. Gateway VPC Endpoint: A gateway VPC endpoint allows secure access to S3 from within the VPC without routing traffic over the public internet. This is crucial for maintaining compliance with regulations by ensuring that no data leaves the AWS environment.
Comment 1327307 by paali
- Upvotes: 2
Selected Answer: AC Why do we need a NAT gateway when we could use a gateway VPC endpoint or interface endpoints for S3 instead?
Comment 1326041 by hk0308
- Upvotes: 1
Selected Answer: AC None of the answers fully satisfy the constraints. A and C both fail to address how the S3 bucket will be accessed through the VPC.
Comment 1286078 by EJGisME
- Upvotes: 2
Selected Answer: AC A. Replace the existing Redshift cluster with a new Redshift cluster that is in a private subnet. Use an interface VPC endpoint to connect to the Redshift cluster. Use a NAT gateway to give Redshift access to the S3 bucket. C. Turn on enhanced VPC routing for the Amazon Redshift cluster. Set up an AWS Direct Connect connection and configure a connection between each data provider and the finance company’s VPC.
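For context, here is a minimal sketch of the VPC plumbing discussed above: enabling enhanced VPC routing on the cluster (option C) and adding an S3 gateway endpoint, which several commenters note is the usual way to keep Redshift-to-S3 traffic inside the VPC rather than using a NAT gateway. The cluster identifier, VPC ID, region, and route table ID are placeholders.

```python
import boto3

redshift = boto3.client("redshift")
ec2 = boto3.client("ec2")

# Turn on enhanced VPC routing so COPY/UNLOAD and Spectrum traffic
# follows the VPC network path instead of the public internet.
redshift.modify_cluster(
    ClusterIdentifier="finance-dw-cluster",  # placeholder identifier
    EnhancedVpcRouting=True,
)

# Gateway VPC endpoint for S3 so the cluster can reach the shared bucket
# without a NAT gateway or internet gateway (IDs and region are placeholders).
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```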
Question gZLQNKxvlJKrveq37BGa
Question
Files from multiple data sources arrive in an Amazon S3 bucket on a regular basis. A data engineer wants to ingest new files into Amazon Redshift in near real time when the new files arrive in the S3 bucket.
Which solution will meet these requirements?
Choices
- A: Use the query editor v2 to schedule a COPY command to load new files into Amazon Redshift.
- B: Use the zero-ETL integration between Amazon Aurora and Amazon Redshift to load new files into Amazon Redshift.
- C: Use AWS Glue job bookmarks to extract, transform, and load (ETL) new files into Amazon Redshift.
- D: Use S3 Event Notifications to invoke an AWS Lambda function that loads new files into Amazon Redshift.
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1358482 by italiancloud2025
- Upvotes: 1
Selected Answer: D A: No, Query Editor v2 isn’t designed for event-driven near-real-time loads. B: No, the zero-ETL integration applies to Aurora, not to files arriving in S3. C: No, Glue job bookmarks are for batch ETL jobs, not near-real-time processing. D: Yes, because using S3 Event Notifications to trigger a Lambda function enables near-real-time ingestion via COPY commands into Redshift.
Comment 1278699 by dashapetr
- Upvotes: 3
Selected Answer: D Seems like the trigger on upload would be the fastest option
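For illustration, here is a minimal sketch of the Lambda handler behind option D: an S3 event notification triggers the function, which issues a COPY statement through the Redshift Data API. The cluster, database, table, IAM role, and secret ARN are placeholders, and the event parsing assumes the standard S3 event notification record shape.

```python
import boto3

redshift_data = boto3.client("redshift-data")


def lambda_handler(event, context):
    """Triggered by an S3 event notification when a new file arrives.

    Loads each new object into Redshift with a COPY statement via the
    Redshift Data API. All identifiers below are placeholders.
    """
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]  # may be URL-encoded in real events
        redshift_data.execute_statement(
            ClusterIdentifier="example-cluster",
            Database="dev",
            SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:example",
            Sql=(
                f"COPY staging.new_files FROM 's3://{bucket}/{key}' "
                "IAM_ROLE 'arn:aws:iam::123456789012:role/example-copy-role' "
                "FORMAT AS JSON 'auto';"
            ),
        )
```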