Questions and Answers
Question eedud0JQ5JiPlsYPEvyV
Question
A data engineer is building an automated extract, transform, and load (ETL) ingestion pipeline by using AWS Glue. The pipeline ingests compressed files that are in an Amazon S3 bucket. The ingestion pipeline must support incremental data processing.
Which AWS Glue feature should the data engineer use to meet this requirement?
Choices
- A: Workflows
- B: Triggers
- C: Job bookmarks
- D: Classifiers
Answer: C | Answer_ET: C | Community answer: C (100%)
Discussion
Comment 1250840 by andrologin
- Upvotes: 2
Selected Answer: C
AWS Glue job bookmarks are used to implement incremental processing.
Comment 1241522 by Ja13
- Upvotes: 1
Selected Answer: C
C. Job bookmarks
Here’s why job bookmarks are the appropriate feature:
- Incremental processing: Job bookmarks in AWS Glue track the last processed state of data in Amazon S3. They enable the ETL job to resume from where it left off after interruptions or on subsequent runs, ensuring that only new or modified data since the last successful run is processed.
- Automated ETL: Job bookmarks work seamlessly within AWS Glue ETL jobs, allowing the job to manage the state of processed data without manual intervention.
- Support for compressed files: AWS Glue natively reads compressed files from Amazon S3, so the ingestion pipeline can handle compressed data formats efficiently.
Comment 1240852 by HunkyBunky
- Upvotes: 1
Selected Answer: C
C is right.
Comment 1239120 by Bmaster
- Upvotes: 4
C is the correct answer.
https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
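For reference, a minimal PySpark sketch of a Glue job that uses bookmarks (this assumes the job was created with --job-bookmark-option job-bookmark-enable; the bucket paths are hypothetical):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # loads the bookmark state from the last run

# transformation_ctx gives the bookmark a stable key for this source, so
# only S3 objects added since the last successful run are read.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/input/"]},  # hypothetical
    format="json",
    transformation_ctx="source",
)

glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/output/"},  # hypothetical
    format="parquet",
    transformation_ctx="sink",
)

job.commit()  # advances the bookmark past the files processed in this run
```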
Question j0H4UnaIK98Fih4Eieke
Question
A banking company uses an application to collect large volumes of transactional data. The company uses Amazon Kinesis Data Streams for real-time analytics. The company’s application uses the PutRecord action to send data to Kinesis Data Streams.
A data engineer has observed network outages during certain times of day. The data engineer wants to configure exactly-once delivery for the entire processing pipeline.
Which solution will meet this requirement?
Choices
- A: Design the application so it can remove duplicates during processing by embedding a unique ID in each record at the source.
- B: Update the checkpoint configuration of the Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) data collection application to avoid duplicate processing of events.
- C: Design the data source so events are not ingested into Kinesis Data Streams multiple times.
- D: Stop using Kinesis Data Streams. Use Amazon EMR instead. Use Apache Flink and Apache Spark Streaming in Amazon EMR.
Answer: A | Answer_ET: A | Community answer: A (100%)
Discussion
Comment 1387429 by Ramdi1
- Upvotes: 1
Selected Answer: A
Amazon Kinesis Data Streams does not provide exactly-once delivery natively. It ensures at-least-once delivery, meaning that under certain conditions (e.g., network failures, retries), duplicate records can occur. To achieve exactly-once processing, deduplication must be handled at the application level.
Comment 1285245 by PashoQ
- Upvotes: 1
Selected Answer: A
A. Design the application so it can remove duplicates during processing by embedding a unique ID in each record at the source.
Comment 1244467 by Ja13
- Upvotes: 2
Selected Answer: A
A. Design the application so it can remove duplicates during processing by embedding a unique ID in each record at the source.
Explanation:
- Exactly-once delivery: Ensuring exactly-once delivery is a challenge in distributed systems, especially in the presence of network outages and retries. By embedding a unique ID in each record at the source, you can track and identify duplicate records during processing. This enables idempotent processing, where duplicates are detected and discarded so that each record is processed exactly once.
- De-duplication logic: Deduplication based on unique IDs ensures that even if the same record is ingested multiple times due to retries or network issues, it will be processed only once by the downstream applications.
Comment 1240798 by bakarys
- Upvotes: 3
Selected Answer: A
A. Design the application so it can remove duplicates during processing by embedding a unique ID in each record at the source.
This approach ensures that even if a record is sent more than once due to network outages or other issues, it will only be processed once, because the unique ID can be used to identify and remove duplicates. This is a common pattern for achieving exactly-once processing semantics in distributed systems.
The other options do not guarantee exactly-once delivery across the entire pipeline. Option B is partially correct, but it only avoids duplicate processing within the Amazon Managed Service for Apache Flink application, not across the entire pipeline. Option C is not always feasible, because network issues and other factors can lead to events being ingested into Kinesis Data Streams multiple times. Option D involves changing the entire technology stack, which is not necessary and could introduce additional complexity and cost.
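The pattern the comments describe, embedding a unique ID at the source and deduplicating downstream, can be sketched with boto3 as follows (the stream name is hypothetical, and the in-memory set stands in for a durable store such as DynamoDB):

```python
import json
import uuid

import boto3

kinesis = boto3.client("kinesis")

def put_record(stream_name: str, payload: dict) -> None:
    """Producer side: embed a unique ID so consumers can deduplicate retries."""
    payload["record_id"] = str(uuid.uuid4())
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(payload).encode("utf-8"),
        PartitionKey=payload["record_id"],
    )

seen_ids = set()  # stand-in for a durable store such as DynamoDB

def process_record(raw: bytes) -> None:
    """Consumer side: idempotent processing keyed on the embedded record_id."""
    record = json.loads(raw)
    if record["record_id"] in seen_ids:
        return  # duplicate delivery from a retry; drop it
    seen_ids.add(record["record_id"])
    # ... downstream processing happens exactly once per unique record ...
```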
Question O49O72oWO4VRwFiLsOMs
Question
A company stores logs in an Amazon S3 bucket. When a data engineer attempts to access several log files, the data engineer discovers that some files have been unintentionally deleted.
The data engineer needs a solution that will prevent unintentional file deletion in the future.
Which solution will meet this requirement with the LEAST operational overhead?
Choices
- A: Manually back up the S3 bucket on a regular basis.
- B: Enable S3 Versioning for the S3 bucket.
- C: Configure replication for the S3 bucket.
- D: Use an Amazon S3 Glacier storage class to archive the data that is in the S3 bucket.
Answer: B | Answer_ET: B | Community answer: B (100%)
Discussion
Comment 1302984 by 0c2d840
- Upvotes: 1
B seems to be the correct answer. The side effect is that versioning log files can be expensive, because every change to a file will be versioned.
Comment 1240887 by sdas1
- Upvotes: 3
S3 Versioning keeps multiple versions of an object in the same bucket. When you enable versioning, every time an object is overwritten or deleted, a new version of that object is created, and the previous version is retained. This ensures that no data is lost permanently due to accidental deletions or overwrites.
Comment 1240854 by HunkyBunky
- Upvotes: 2
Selected Answer: B
B is the right answer.
Comment 1239857 by Bmaster
- Upvotes: 3
B is good.
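Enabling versioning is a one-call configuration change; a minimal boto3 sketch (the bucket name is hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# With versioning enabled, a delete only adds a delete marker; prior
# object versions are retained and can be restored.
s3.put_bucket_versioning(
    Bucket="example-log-bucket",  # hypothetical bucket name
    VersioningConfiguration={"Status": "Enabled"},
)
```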
Question t6je9nAx2nSvKcfNDsLQ
Question
A manufacturing company collects sensor data from its factory floor to monitor and enhance operational efficiency. The company uses Amazon Kinesis Data Streams to publish the data that the sensors collect to a data stream. Then Amazon Kinesis Data Firehose writes the data to an Amazon S3 bucket. The company needs to display a real-time view of operational efficiency on a large screen in the manufacturing facility.
Which solution will meet these requirements with the LOWEST latency?
Choices
- A: Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to process the sensor data. Use a connector for Apache Flink to write data to an Amazon Timestream database. Use the Timestream database as a source to create a Grafana dashboard.
- B: Configure the S3 bucket to send a notification to an AWS Lambda function when any new object is created. Use the Lambda function to publish the data to Amazon Aurora. Use Aurora as a source to create an Amazon QuickSight dashboard.
- C: Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to process the sensor data. Create a new Data Firehose delivery stream to publish data directly to an Amazon Timestream database. Use the Timestream database as a source to create an Amazon QuickSight dashboard.
- D: Use AWS Glue bookmarks to read sensor data from the S3 bucket in real time. Publish the data to an Amazon Timestream database. Use the Timestream database as a source to create a Grafana dashboard.
Answer: A | Answer_ET: A | Community answer: A (97%), other (3%)
Discussion
Comment 1180881 by fceb2c1
- Upvotes: 12
Selected Answer: A
https://aws.amazon.com/blogs/database/near-real-time-processing-with-amazon-kinesis-amazon-timestream-and-grafana/
Look at the architecture diagram.
Comment 1125644 by milofficial
- Upvotes: 7
Selected Answer: A
Real time → no QuickSight. And using bookmarks to read sensor data in real time is just as absurd as flat-earth theory. A it is.
Comment 1558831 by Salmanbutt786
- Upvotes: 1
Selected Answer: A
A is correct.
C is close to A, but creating an additional Data Firehose delivery stream adds unnecessary complexity. Writing directly to Amazon Timestream from Apache Flink, as in option A, is more straightforward and ensures lower latency.
Comment 1410032 by Scotty_Nguyen
- Upvotes: 1
Selected Answer: A
A is correct.
Comment 1291381 by Adrifersilva
- Upvotes: 3
Selected Answer: A
- Real-time performance: Grafana is known for its excellent real-time data visualization capabilities. It's often used for operational dashboards that require frequent updates.
- Integration: Grafana works well with time-series databases and streaming data sources.
Comment 1269047 by deepcloud
- Upvotes: 5
Selected Answer: A
Firehose cannot use Timestream as a destination. The answer is A.
Comment 1265555 by samadal
- Upvotes: 1
Option A processes data in Flink and then sends it to Timestream. This is advantageous when complex data processing is required in Flink, but the processing step can add latency.
Option C performs data processing in Flink, sends the data directly to Timestream without additional steps, and provides dashboards via QuickSight. Since the dashboard can be served as soon as data arrives in Timestream, latency is likely to be lower.
Therefore, option C is preferable: it performs the data processing, publishes data directly to Timestream, and provides fast dashboards using QuickSight.
Comment 1265186 by teo2157
- Upvotes: 4
Selected Answer: A
Amazon QuickSight is primarily designed for business intelligence and data visualization, and it can provide near-real-time views depending on the data refresh rate. However, it is not typically used for real-time streaming visualization with very low latency. For real-time dashboards with very low latency, tools like Grafana are more suitable. You can use Amazon Managed Grafana to set up the dashboard, so you are still using an AWS service, which is usually preferable on these exams.
Comment 1259589 by jyrajan69
- Upvotes: 2
Based on this it should be C; why use an open-source app when you can use an AWS service?
Comment 1207849 by Just_Ninja
- Upvotes: 1
The question is: which solution will meet these requirements with the LOWEST latency? So only A can be the right answer.
Comment 1191887 by LanoraMoe
- Upvotes: 1
I go with option A. Kinesis Data Firehose can connect to three AWS destinations so far: S3, Redshift, and OpenSearch.
Comment 1177603 by certplan
- Upvotes: 1
Option A:
- Involves additional steps: Option A requires writing data to Amazon Timestream after processing with Apache Flink, potentially introducing additional latency compared to a more direct approach like Option C.
- Grafana integration: While Grafana is a powerful visualization tool, setting up and configuring Grafana dashboards might require additional effort compared to using Amazon QuickSight, which offers more straightforward integration with AWS services like Amazon Timestream.
Comment 1177600 by certplan
- Upvotes: 2
C. - Processing Sensor Data with Amazon Flink: Similar to option A, this approach uses Amazon Managed Service for Apache Flink to process sensor data, providing real-time analytics or transformation capabilities.
- Data Firehose Delivery Stream to Timestream: Sets up a new Amazon Data Firehose delivery stream to publish processed data directly to Amazon Timestream. Data Firehose is a fully managed service for delivering real-time streaming data to destinations such as data lakes, databases, and analytics services.
- Timestream Database as a Source for QuickSight Dashboard: Similar to option B, the data stored in Amazon Timestream serves as the data source for creating an Amazon QuickSight dashboard.
Comment 1177595 by certplan
- Upvotes: 1
Considerations:
Option A utilizes Amazon Managed Service for Apache Flink to process sensor data and then writes the processed data to Amazon Timestream. From there, the Timestream database serves as a source to create a Grafana dashboard. Thus the data goes through Apache Flink for processing, then to Timestream, and finally to Grafana. “Each additional step introduces potential latency”.
Option C processes sensor data using Amazon Managed Service for Apache Flink and then publishes data directly to Amazon Timestream via a Data Firehose delivery stream. Finally, it uses Timestream as a source to create an Amazon QuickSight dashboard.
So, in terms of latency, both options involve processing data in real-time using Apache Flink. However, Option C has a more direct data flow by publishing data directly to Timestream, potentially reducing latency compared to Option A, where the data has to go through an additional step of writing to Timestream.
Comment 1137897 by TonyStark0122
- Upvotes: 2
A. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to process the sensor data. Use a connector for Apache Flink to write data to an Amazon Timestream database. Use the Timestream database as a source to create a Grafana dashboard.
Explanation: Amazon Managed Service for Apache Flink provides real-time stream processing capabilities, which can process sensor data with low latency. By using Apache Flink connectors, the processed data can be efficiently written to Amazon Timestream, which is optimized for time-series data storage and querying.
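For context, the write side of option A boils down to Timestream's WriteRecords API. A real pipeline would use the Flink Timestream connector mentioned in the option; this boto3 sketch only illustrates the call (the database and table names are hypothetical):

```python
import time

import boto3

timestream = boto3.client("timestream-write")

def write_sensor_reading(sensor_id: str, efficiency: float) -> None:
    """Write one operational-efficiency data point to Timestream."""
    timestream.write_records(
        DatabaseName="factory",               # hypothetical database
        TableName="operational_efficiency",   # hypothetical table
        Records=[{
            "Dimensions": [{"Name": "sensor_id", "Value": sensor_id}],
            "MeasureName": "efficiency",
            "MeasureValue": str(efficiency),
            "MeasureValueType": "DOUBLE",
            "Time": str(int(time.time() * 1000)),  # milliseconds since epoch
        }],
    )
```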
Question Cb8LQnxiimgbNW143umY
Question
A telecommunications company collects network usage data throughout each day at a rate of several thousand data points each second. The company runs an application to process the usage data in real time. The company aggregates and stores the data in an Amazon Aurora DB instance.
Sudden drops in network usage usually indicate a network outage. The company must be able to identify sudden drops in network usage so the company can take immediate remedial actions.
Which solution will meet this requirement with the LEAST latency?
Choices
- A: Create an AWS Lambda function to query Aurora for drops in network usage. Use Amazon EventBridge to automatically invoke the Lambda function every minute.
- B: Modify the processing application to publish the data to an Amazon Kinesis data stream. Create an Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) application to detect drops in network usage.
- C: Replace the Aurora database with an Amazon DynamoDB table. Create an AWS Lambda function to query the DynamoDB table for drops in network usage every minute. Use DynamoDB Accelerator (DAX) between the processing application and DynamoDB table.
- D: Create an AWS Lambda function within the Database Activity Streams feature of Aurora to detect drops in network usage.
Answer: B | Answer_ET: B | Community answer: B (77%), D (23%)
Discussion
Comment 1295754 by Adrifersilva
- Upvotes: 3
Selected Answer: B
Regarding D, Database Activity Streams in Aurora are primarily for auditing database activities, not for analyzing application data.
Comment 1262585 by antun3ra
- Upvotes: 2
Selected Answer: B
B is the correct answer.
Comment 1261862 by portland
- Upvotes: 2
Selected Answer: B
B reduces latency because the data is analyzed before it even gets to the Aurora DB.
Comment 1247828 by sdas1
- Upvotes: 2
Option D is the optimal choice because it leverages Aurora’s Database Activity Streams to enable real-time monitoring and immediate response to changes in network usage data. This approach ensures the least latency in detecting and responding to sudden drops in network usage, crucial for the telecommunications company to take immediate remedial actions during network outages.
Comment 1247827 by sdas1
- Upvotes: 2
Option D
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/DBActivityStreams.Overview.html
In Amazon Aurora, you start a database activity stream at the cluster level. All DB instances within your cluster have database activity streams enabled.
Your Aurora DB cluster pushes activities to an Amazon Kinesis data stream in near real time. The Kinesis stream is created automatically. From Kinesis, you can configure AWS services such as Amazon Data Firehose and AWS Lambda to consume the stream and store the data.
Comment 1244501 by Ja13
- Upvotes: 3
Selected Answer: B
The best solution to identify sudden drops in network usage with the least latency is:
B. Modify the processing application to publish the data to an Amazon Kinesis data stream. Create an Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) application to detect drops in network usage.
This approach ensures real-time processing with minimal latency and allows immediate detection and response to network usage drops.
Comment 1240857 by HunkyBunky
- Upvotes: 3
Selected Answer: D
I guess D. The question asks which solution helps to identify sudden drops so the company can take immediate action.
Comment 1239861 by Bmaster
- Upvotes: 2
B is good.
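The detection rule itself is simple; here is a toy Python sketch of the kind of logic a Managed Flink application would evaluate over the stream (the window size and threshold are made up for illustration):

```python
from collections import deque

class DropDetector:
    """Flags a sudden drop when a new reading falls well below the
    trailing average of recent readings."""

    def __init__(self, window_size: int = 60, drop_ratio: float = 0.5):
        self.window = deque(maxlen=window_size)  # recent usage readings
        self.drop_ratio = drop_ratio             # alert under 50% of baseline

    def observe(self, usage: float) -> bool:
        baseline = sum(self.window) / len(self.window) if self.window else None
        self.window.append(usage)
        # Alert when the reading is far below the recent baseline.
        return baseline is not None and usage < self.drop_ratio * baseline

detector = DropDetector()
for reading in [100.0] * 60 + [20.0]:  # simulated usage stream
    if detector.observe(reading):
        print("Possible network outage detected")
```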