Questions and Answers
Question 4r5fcqVbLrOjfQLtuux9
Question
A company has a production AWS account that runs company workloads. The company’s security team created a security AWS account to store and analyze security logs from the production AWS account. The security logs in the production AWS account are stored in Amazon CloudWatch Logs. The company needs to use Amazon Kinesis Data Streams to deliver the security logs to the security AWS account. Which solution will meet these requirements?
Choices
- A: Create a destination data stream in the production AWS account. In the security AWS account, create an IAM role that has cross-account permissions to Kinesis Data Streams in the production AWS account.
- B: Create a destination data stream in the security AWS account. Create an IAM role and a trust policy to grant CloudWatch Logs the permission to put data into the stream. Create a subscription filter in the security AWS account.
- C: Create a destination data stream in the production AWS account. In the production AWS account, create an IAM role that has cross-account permissions to Kinesis Data Streams in the security AWS account.
- D: Create a destination data stream in the security AWS account. Create an IAM role and a trust policy to grant CloudWatch Logs the permission to put data into the stream. Create a subscription filter in the production AWS account.
answer?
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1194563 by Christina666
- Upvotes: 6
Selected Answer: D Cross-Account Delivery: Kinesis Data Streams in the security account ensures the logs reside in the designated security-focused environment. CloudWatch Logs Integration: Granting CloudWatch Logs permissions to put records into the Kinesis Data Stream directly establishes a streamlined and secure data flow from the production account. Filtering Controls: The subscription filter in the production account provides precise control over which log events are sent to the security account.
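For reference, a minimal boto3 sketch of the setup this comment describes (all stream names, role names, ARNs, and account IDs below are hypothetical placeholders, and the IAM role with its trust policy for logs.amazonaws.com is assumed to be created separately):

```python
# Hedged sketch of option D: destination stream and CloudWatch Logs destination
# in the security account, subscription filter in the production account.
import json
import boto3

SECURITY_ACCOUNT_ID = "222222222222"    # assumed security account
PRODUCTION_ACCOUNT_ID = "111111111111"  # assumed production account

# --- In the SECURITY account: stream, Logs destination, destination policy ---
kinesis = boto3.client("kinesis", region_name="us-east-1")
logs_sec = boto3.client("logs", region_name="us-east-1")

kinesis.create_stream(StreamName="security-log-stream", ShardCount=1)

# The role "CWLtoKinesisRole" (created separately) trusts logs.amazonaws.com
# and allows kinesis:PutRecord on the destination stream.
destination = logs_sec.put_destination(
    destinationName="securityLogsDestination",
    targetArn=f"arn:aws:kinesis:us-east-1:{SECURITY_ACCOUNT_ID}:stream/security-log-stream",
    roleArn=f"arn:aws:iam::{SECURITY_ACCOUNT_ID}:role/CWLtoKinesisRole",
)

# Allow the production account to create subscription filters against this destination.
access_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": PRODUCTION_ACCOUNT_ID},
        "Action": "logs:PutSubscriptionFilter",
        "Resource": destination["destination"]["arn"],
    }],
}
logs_sec.put_destination_policy(
    destinationName="securityLogsDestination",
    accessPolicy=json.dumps(access_policy),
)

# --- In the PRODUCTION account: subscription filter on the log group ---
logs_prod = boto3.client("logs", region_name="us-east-1")  # production-account credentials
logs_prod.put_subscription_filter(
    logGroupName="/aws/security-logs",
    filterName="ToSecurityAccount",
    filterPattern="",  # empty pattern forwards all log events
    destinationArn=destination["destination"]["arn"],
)
```

Because the destination and its access policy live in the security account, the production account only needs the subscription filter on its log group, which is why option D places the filter there.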
Comment 1346223 by Salam9
- Upvotes: 2
Selected Answer: D https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters-AccountLevel.html#DestinationKinesisExample-AccountLevel
Comment 1179496 by certplan
- Upvotes: 2
- Cross-Account Access:
- AWS Documentation: [Cross-Account Access] https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html
- This documentation provides detailed instructions on how to set up cross-account access using IAM roles and trust policies, which is essential for allowing CloudWatch Logs in one AWS account to put data into a Kinesis Data Stream in another AWS account.
- Configuring CloudWatch Logs Subscription Filters:
- AWS Documentation: [Subscription Filters for Amazon CloudWatch Logs] https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters.html
- This documentation explains how to create subscription filters for CloudWatch Logs, which enable you to route log data to various destinations, including Kinesis Data Streams. Placing the subscription filter in the production AWS account ensures that only the relevant security logs are sent to the Kinesis Data Stream in the security AWS account.
Comment 1170894 by GiorgioGss
- Upvotes: 2
Selected Answer: D https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CrossAccountSubscriptions-Kinesis.html
Comment 1142142 by Aesthet
- Upvotes: 1
Both ChatGPT and I agree with answer D.
Question KLgXUAqR0p6UmCGuHpBf
Question
A company uses Amazon S3 to store semi-structured data in a transactional data lake. Some of the data files are small, but other data files are tens of terabytes. A data engineer must perform a change data capture (CDC) operation to identify changed data from the data source. The data source sends a full snapshot as a JSON file every day and ingests the changed data into the data lake. Which solution will capture the changed data MOST cost-effectively?
Choices
- A: Create an AWS Lambda function to identify the changes between the previous data and the current data. Configure the Lambda function to ingest the changes into the data lake.
- B: Ingest the data into Amazon RDS for MySQL. Use AWS Database Migration Service (AWS DMS) to write the changed data to the data lake.
- C: Use an open source data lake format to merge the data source with the S3 data lake to insert the new data and update the existing data.
- D: Ingest the data into an Amazon Aurora MySQL DB instance that runs Aurora Serverless. Use AWS Database Migration Service (AWS DMS) to write the changed data to the data lake.
answer?
Answer: C Answer_ET: C Community answer C (93%) 7% Discussion
Comment 1170898 by GiorgioGss
- Upvotes: 7
Selected Answer: C https://aws.amazon.com/blogs/big-data/implement-a-cdc-based-upsert-in-a-data-lake-using-apache-iceberg-and-aws-glue/
Comment 1339524 by plutonash
- Upvotes: 1
Selected Answer: A Generally, AWS questions never give preference to a non-AWS solution over an AWS service, so even if C could be better, the answer is A.
Comment 1262346 by influxy
- Upvotes: 1
Comment 1218982 by FunkyFresco
- Upvotes: 2
Selected Answer: C I'll go with Delta or something like that. It is C.
Comment 1179805 by certplan
- Upvotes: 2
Relative to cost, here are docs for the reason for option C: https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html https://aws.amazon.com/blogs/big-data/ https://docs.aws.amazon.com/glue/latest/dg/welcome.html https://docs.aws.amazon.com/emr/
Here are docs for reasons the others are not correct: https://aws.amazon.com/lambda/pricing/ https://aws.amazon.com/rds/pricing/ https://aws.amazon.com/dms/pricing/
Comment 1163736 by damaldon
- Upvotes: 1
Answ. D You can migrate data from any MySQL-compatible database (MySQL, MariaDB, or Amazon Aurora MySQL) using AWS Database Migration Service. https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html
Comment 1127563 by [Removed]
- Upvotes: 4
Selected Answer: C This is a tricky one. Although option A seems like the best choice since it uses an AWS service, I believe using the Delta/Iceberg APIs would be easier than writing custom code in Lambda.
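For illustration, a hedged PySpark-on-Iceberg sketch of the kind of merge option C implies (the catalog, table, path, and column names are made up; it assumes a Spark session already configured with the Iceberg runtime and the AWS Glue Data Catalog, as in the blog post linked above):

```python
# Hypothetical daily CDC upsert: merge the full JSON snapshot into an Iceberg table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-cdc-upsert").getOrCreate()

# Load today's full JSON snapshot from the data source.
snapshot_df = spark.read.json("s3://example-bucket/snapshots/2024-01-01/")
snapshot_df.createOrReplaceTempView("daily_snapshot")

# MERGE performs the CDC: update rows that changed, insert rows that are new.
spark.sql("""
    MERGE INTO glue_catalog.datalake.customers AS target
    USING daily_snapshot AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```

The table format handles the changed-data detection and file rewrites, which is what makes this approach cheaper than running a Lambda diff or staging the data in a relational database.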
Question 4ajoskCfp3cQI0udeXFx
Question
A data engineer runs Amazon Athena queries on data that is in an Amazon S3 bucket. The Athena queries use AWS Glue Data Catalog as a metadata table. The data engineer notices that the Athena query plans are experiencing a performance bottleneck. The data engineer determines that the cause of the performance bottleneck is the large number of partitions that are in the S3 bucket. The data engineer must resolve the performance bottleneck and reduce Athena query planning time. Which solutions will meet these requirements? (Choose two.)
Choices
- A: Create an AWS Glue partition index. Enable partition filtering.
- B: Bucket the data based on a column that the data have in common in a WHERE clause of the user query.
- C: Use Athena partition projection based on the S3 bucket prefix.
- D: Transform the data that is in the S3 bucket to Apache Parquet format.
- E: Use the Amazon EMR S3DistCP utility to combine smaller objects in the S3 bucket into larger objects.
answer?
Answer: AC Answer_ET: AC Community answer AC (73%) 14% 9% Discussion
Comment 1137343 by rralucard_
- Upvotes: 7
Selected Answer: AC https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/ Optimizing Partition Processing using partition projection Processing partition information can be a bottleneck for Athena queries when you have a very large number of partitions and aren’t using AWS Glue partition indexing. You can use partition projection in Athena to speed up query processing of highly partitioned tables and automate partition management. Partition projection helps minimize this overhead by allowing you to query partitions by calculating partition information rather than retrieving it from a metastore. It eliminates the need to add partitions’ metadata to the AWS Glue table.
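As a concrete illustration of option A, a small boto3 sketch (the database, table, index, and key names are placeholders) that creates a Glue partition index; partition filtering is then enabled on the table through the partition_filtering.enabled table property:

```python
# Hedged sketch: create a partition index so Athena can look up only the
# partitions matching the query filter instead of listing all of them.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_partition_index(
    DatabaseName="sales_db",
    TableName="orders",
    PartitionIndex={
        "Keys": ["region", "dt"],   # a subset of the table's partition keys, in order
        "IndexName": "region_dt_idx",
    },
)

# Partition filtering is then switched on by setting the table property
# partition_filtering.enabled = true (for example via ALTER TABLE SET TBLPROPERTIES).
```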
Comment 1355458 by Mahidbdwh
- Upvotes: 2
Selected Answer: AC Bucketing does not address the problem of having a large number of partitions in the metadata, which is the root cause of the query planning bottleneck. Converting to a columnar format like Apache Parquet will not directly reduce the overhead associated with managing a large number of partitions. Combining small objects will not mitigate the planning overhead that comes from a large number of partitions in the Data Catalog. Hence A and C.
Comment 1339666 by SMALLAM
- Upvotes: 1
Selected Answer: AE https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
Comment 1227563 by pypelyncar
- Upvotes: 1
Selected Answer: AC Creating an AWS Glue partition index and enabling partition filtering can significantly improve query performance when dealing with large datasets with many partitions. The partition index allows Athena to quickly identify the relevant partitions for a query, reducing the time spent scanning unnecessary data. Partition filtering further optimizes the query by only scanning the partitions that match the filter conditions. Athena partition projection based on the S3 bucket prefix is another effective technique to improve query performance. By leveraging the bucket prefix structure, Athena can prune partitions that are not relevant to the query, reducing the amount of data that needs to be scanned and processed. This approach is particularly useful when the data is organized in a hierarchical structure within the S3 bucket.
Comment 1213759 by VerRi
- Upvotes: 1
Selected Answer: AC D is not correct because the issue is related to partitioning.
Comment 1204844 by HunkyBunky
- Upvotes: 1
Selected Answer: AC I guess A/C, because we are faced with a query planning performance bottleneck, so the indexing should be improved.
Comment 1203041 by khchan123
- Upvotes: 2
A. Creating an AWS Glue partition index and enabling partition filtering can help improve query performance by allowing Athena to prune unnecessary partitions from the query plan. This can reduce the number of partitions that need to be scanned, resulting in faster query planning times.
C. Athena partition projection allows you to define a partition scheme based on the S3 bucket prefix. This can help reduce the number of partitions that need to be scanned, as Athena can use the prefix to determine which partitions are relevant to the query. This can also help improve query performance and reduce planning times.
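And for option C, a hedged sketch (the bucket, database, table, and partition column are placeholders) that enables partition projection through Athena DDL submitted with boto3:

```python
# Hypothetical sketch: configure partition projection so Athena computes
# partition values from the S3 prefix pattern instead of querying the metastore
# during query planning.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ddl = """
ALTER TABLE sales_db.orders SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.range' = '2023/01/01,NOW',
  'projection.dt.format' = 'yyyy/MM/dd',
  'storage.location.template' = 's3://example-bucket/orders/${dt}/'
)
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
```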
Comment 1195093 by okechi
- Upvotes: 1
The right answer is BD
Comment 1194568 by Christina666
- Upvotes: 3
Selected Answer: AD A. Create an AWS Glue partition index. Enable partition filtering. Targeted Optimization: Partition indexes within the Glue Data Catalog help Athena efficiently identify the relevant partitions, significantly reducing query planning time. Partition filtering further refines the search during query execution. D. Transform the data that is in the S3 bucket to Apache Parquet format. Efficient Columnar Format: Parquet’s columnar storage and built-in metadata often allow Athena to skip over large portions of data irrelevant to the query, leading to faster query planning and execution.
Comment 1181325 by fceb2c1
- Upvotes: 4
Selected Answer: AC Keyword: Athena query planning time
See explanation in the link: https://www.myexamcollection.com/Data-Engineer-Associate-vce-questions.htm
B & D are related to analytical queries performance, not about “query planning” performance.
Comment 1173449 by ottarg
- Upvotes: 2
Just finished the exam and I went with AD. I agree with GiorgioGss, but the reason I picked A over C was because the table is already using the Glue catalog. If we use the indexes, there's no reason to use C, as we already have the partitions indexed. There's no reason to pick B if we have C selected. Thus I picked D to optimize the query, e.g. if I'm only selecting a subset of the columns.
Comment 1171090 by GiorgioGss
- Upvotes: 1
Strange question… it could be any of A, B, C, or D.
Comment 1137345 by rralucard_
- Upvotes: 1
If your table stored in an AWS Glue Data Catalog has tens and hundreds of thousands and millions of partitions, you can enable partition indexes on the table. With partition indexes, only the metadata for the partition value in the query’s filter is retrieved from the catalog instead of retrieving all the partitions’ metadata. The result is faster queries for such highly partitioned tables. The following table compares query runtimes between a partitioned table with no partition indexing and with partition indexing. The table contains approximately 100,000 partitions and uncompressed text data. The orders table is partitioned by the o_custkey column.
Comment 1127567 by [Removed]
- Upvotes: 2
Selected Answer: BD https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
Question uQvw7iigkdZEguu48hkD
Question
A data engineer must manage the ingestion of real-time streaming data into AWS. The data engineer wants to perform real-time analytics on the incoming streaming data by using time-based aggregations over a window of up to 30 minutes. The data engineer needs a solution that is highly fault tolerant. Which solution will meet these requirements with the LEAST operational overhead?
Choices
- A: Use an AWS Lambda function that includes both the business and the analytics logic to perform time-based aggregations over a window of up to 30 minutes for the data in Amazon Kinesis Data Streams.
- B: Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to analyze the data that might occasionally contain duplicates by using multiple types of aggregations.
- C: Use an AWS Lambda function that includes both the business and the analytics logic to perform aggregations for a tumbling window of up to 30 minutes, based on the event timestamp.
- D: Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to analyze the data by using multiple types of aggregations to perform time-based analytics over a window of up to 30 minutes.
answer?
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1139312 by rralucard_
- Upvotes: 7
Selected Answer: D D. Amazon Managed Service for Apache Flink for Time-Based Analytics over 30 Minutes: This option correctly identifies the use of Amazon Managed Service for Apache Flink for performing time-based analytics over a window of up to 30 minutes. Apache Flink is adept at handling such scenarios, providing capabilities for complex event processing, time-windowed aggregations, and maintaining state over time. This option would offer high fault tolerance and minimal operational overhead due to the managed nature of the service.
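For illustration, a hedged PyFlink sketch of the kind of job option D describes (the table, stream, and column names are placeholders, and the Kinesis connector options are abbreviated); Amazon Managed Service for Apache Flink runs this type of application in a managed, fault-tolerant way:

```python
# Hypothetical 30-minute tumbling-window aggregation over a Kinesis-backed table.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source table backed by the incoming Kinesis data stream (options abbreviated).
t_env.execute_sql("""
    CREATE TABLE events (
        device_id  STRING,
        metric     DOUBLE,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '1' MINUTE
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'incoming-events',
        'aws.region' = 'us-east-1',
        'format' = 'json'
    )
""")

# Time-based aggregation over 30-minute tumbling windows.
result = t_env.sql_query("""
    SELECT
        device_id,
        TUMBLE_START(event_time, INTERVAL '30' MINUTE) AS window_start,
        COUNT(*)    AS event_count,
        AVG(metric) AS avg_metric
    FROM events
    GROUP BY device_id, TUMBLE(event_time, INTERVAL '30' MINUTE)
""")

result.execute().print()
```

A Lambda-based approach cannot hold a 30-minute window because a single invocation is capped at 15 minutes, which is the other reason options A and C fall short.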
Comment 1344641 by div_div
- Upvotes: 2
Selected Answer: D Lambda cannot be used because its maximum processing time limit is 15 minutes. The remaining two options relate to Flink, and with Flink we can perform time-series and windowed aggregations.
Comment 1255929 by Linuslin
- Upvotes: 2
This link is not AWS documentation, but I think it's worth a look: https://amandeep-singh-johar.medium.com/real-time-stream-processing-with-apache-flink-153992840f16
Comment 1212424 by Just_Ninja
- Upvotes: 2
Selected Answer: D Show the Docs
Comment 1207204 by DevoteamAnalytix
- Upvotes: 1
Selected Answer: D https://docs.aws.amazon.com/managed-flink/latest/java/how-operators.html#how-operators-agg
Comment 1186186 by harrura
- Upvotes: 2
This is crazy; the answers from the bot are wrong, so please don't rely on them. Please take the time to open the discussions and look at the reasoning.
Question pV6eYyZ7owMPeqbYAtvw
Question
A company is planning to upgrade its Amazon Elastic Block Store (Amazon EBS) General Purpose SSD storage from gp2 to gp3. The company wants to prevent any interruptions in its Amazon EC2 instances that will cause data loss during the migration to the upgraded storage. Which solution will meet these requirements with the LEAST operational overhead?
Choices
- A: Create snapshots of the gp2 volumes. Create new gp3 volumes from the snapshots. Attach the new gp3 volumes to the EC2 instances.
- B: Create new gp3 volumes. Gradually transfer the data to the new gp3 volumes. When the transfer is complete, mount the new gp3 volumes to the EC2 instances to replace the gp2 volumes.
- C: Change the volume type of the existing gp2 volumes to gp3. Enter new values for volume size, IOPS, and throughput.
- D: Use AWS DataSync to create new gp3 volumes. Transfer the data from the original gp2 volumes to the new gp3 volumes.
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1177062 by GiorgioGss
- Upvotes: 6
Selected Answer: C https://aws.amazon.com/blogs/storage/migrate-your-amazon-ebs-volumes-from-gp2-to-gp3-and-save-up-to-20-on-costs/
Comment 1181328 by fceb2c1
- Upvotes: 6
Selected Answer: C Option C: Check section under “To modify an Amazon EBS volume using the AWS Management Console“ in GiorgioGss’s link Amazon EBS Elastic Volumes enable you to modify your volume type from gp2 to gp3 without detaching volumes or restarting instances (requirements for modification), which means that there are no interruptions to your applications during modification.
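For reference, a minimal boto3 sketch of the in-place modification described above (the volume ID and performance values are placeholders):

```python
# Hedged sketch of option C: Elastic Volumes changes the type of an attached
# volume in place, so the instance keeps running with no detach and no data loss.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.modify_volume(
    VolumeId="vol-0123456789abcdef0",
    VolumeType="gp3",
    Iops=3000,        # gp3 baseline
    Throughput=125,   # MiB/s, gp3 baseline
)

# Optionally poll until the modification leaves the 'modifying' state.
resp = ec2.describe_volumes_modifications(VolumeIds=["vol-0123456789abcdef0"])
print(resp["VolumesModifications"][0]["ModificationState"])
```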
Comment 1347105 by lcsantos99
- Upvotes: 1
Selected Answer: C The correct answer is C.
Comment 1139792 by rralucard_
- Upvotes: 2
Selected Answer: C Option C is the most straightforward and efficient approach to upgrading from gp2 to gp3 EBS volumes, providing an in-place upgrade path with minimal operational overhead and no interruption in service.