Questions and Answers
Question eBkb3ssamRqN1UPPLzkV
Question
An online retail company has an application that runs on Amazon EC2 instances that are in a VPC. The company wants to collect flow logs for the VPC and analyze network traffic.
Which solution will meet these requirements MOST cost-effectively?
Choices
- A: Publish flow logs to Amazon CloudWatch Logs. Use Amazon Athena for analytics.
- B: Publish flow logs to Amazon CloudWatch Logs. Use an Amazon OpenSearch Service cluster for analytics.
- C: Publish flow logs to Amazon S3 in text format. Use Amazon Athena for analytics.
- D: Publish flow logs to Amazon S3 in Apache Parquet format. Use Amazon Athena for analytics.
answer?
Answer: D Answer_ET: D Community answer D (93%) 7% Discussion
Comment 1230819 by tgv
- Upvotes: 6
Selected Answer: D Flow Logs can be published to S3 in Parquet format: https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-s3.html#flow-logs-s3-path
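For illustration, a minimal boto3 sketch of publishing flow logs straight to S3 as Parquet; the VPC ID and bucket ARN are placeholder values.

```python
import boto3

ec2 = boto3.client("ec2")

# Publish VPC flow logs directly to S3 in Apache Parquet format.
# The VPC ID and destination bucket ARN below are placeholders.
ec2.create_flow_logs(
    ResourceType="VPC",
    ResourceIds=["vpc-0123456789abcdef0"],
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::example-flow-logs-bucket/vpc-flow-logs/",
    DestinationOptions={
        "FileFormat": "parquet",           # store records as Parquet instead of plain text
        "HiveCompatiblePartitions": True,  # partition layout Athena can use directly
        "PerHourPartition": True,          # hourly partitions to narrow Athena scans
    },
)
```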
Comment 1234755 by PGGuy
- Upvotes: 5
Selected Answer: D Publishing flow logs to Amazon S3 in Apache Parquet format and using Amazon Athena for analytics (D) is the most cost-effective solution. This approach minimizes storage costs due to the efficient compression of Parquet, and optimizes query performance and cost in Athena due to the reduced data size and optimized columnar storage.
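To sketch the analytics half, a hedged boto3 example that runs an Athena query over the Parquet flow-log data; the database, table, and output location names are assumptions and would need to exist already (for example, a table defined over the flow-log S3 prefix).

```python
import boto3

athena = boto3.client("athena")

# Query the Parquet flow-log table (database, table, and output names are placeholders).
response = athena.start_query_execution(
    QueryString="""
        SELECT srcaddr, dstaddr, SUM(bytes) AS total_bytes
        FROM vpc_flow_logs_parquet
        GROUP BY srcaddr, dstaddr
        ORDER BY total_bytes DESC
        LIMIT 20
    """,
    QueryExecutionContext={"Database": "flow_logs_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```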
Comment 1258718 by jyrajan69
- Upvotes: 2
The question clearly says MOST cost-effective, so comparing C and D, it has to be C.
Comment 1249669 by LR2023
- Upvotes: 1
Selected Answer: B Flow logs can be published to S3, but option D says in Parquet format - the logs are not automatically converted into Parquet… https://aws.amazon.com/solutions/implementations/centralized-logging-with-opensearch/
Comment 1236241 by HunkyBunky
- Upvotes: 2
Selected Answer: D Apache parquet and S3 = most cost-effective solution
Question j0er2KlKvcfUvm5wf8Zt
Question
A retail company stores transactions, store locations, and customer information tables in four reserved ra3.4xlarge Amazon Redshift cluster nodes. All three tables use even table distribution.
The company updates the store location table only once or twice every few years.
A data engineer notices that Redshift queues are slowing down because the whole store location table is constantly being broadcast to all four compute nodes for most queries. The data engineer wants to speed up the query performance by minimizing the broadcasting of the store location table.
Which solution will meet these requirements in the MOST cost-effective way?
Choices
- A: Change the distribution style of the store location table from EVEN distribution to ALL distribution.
- B: Change the distribution style of the store location table to KEY distribution based on the column that has the highest dimension.
- C: Add a join column named store_id into the sort key for all the tables.
- D: Upgrade the Redshift reserved node to a larger instance size in the same instance family.
answer?
Answer: A Answer_ET: A Community answer A (100%) Discussion
Comment 1249195 by andrologin
- Upvotes: 2
Selected Answer: A ALL distribution is optimal for slowly changing dimension tables, which are generally small in size, because it allows for optimal joins.
Comment 1240742 by bakarys
- Upvotes: 2
Selected Answer: A The most cost-effective solution to speed up the query performance by minimizing the broadcasting of the store location table would be:
A. Change the distribution style of the store location table from EVEN distribution to ALL distribution.
In Amazon Redshift, the ALL distribution style replicates the entire table to all nodes in the cluster, which eliminates the need to redistribute the data when executing a query. This can significantly improve query performance. Given that the store location table is updated only once or twice every few years, the overhead of maintaining the replicated data would be minimal. This makes it a cost-effective solution for improving the query performance.
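As a small illustration of the change itself, a boto3 Redshift Data API sketch; the cluster, database, user, and table names are placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Switch the small, rarely updated dimension table to ALL distribution so a full
# copy lives on every compute node and no longer needs to be broadcast at query time.
# Cluster, database, user, and table names below are placeholders.
redshift_data.execute_statement(
    ClusterIdentifier="retail-cluster",
    Database="retail_dw",
    DbUser="admin",
    Sql="ALTER TABLE store_location ALTER DISTSTYLE ALL;",
)
```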
Comment 1234756 by PGGuy
- Upvotes: 4
Selected Answer: A Changing the distribution style of the store location table to ALL distribution (A) is the most cost-effective solution. It directly addresses the issue of broadcasting by ensuring the entire table is available on each node, significantly improving join performance without incurring substantial additional costs.
Comment 1230821 by tgv
- Upvotes: 2
Selected Answer: A Using ALL distribution means the table is replicated to all nodes, eliminating the need for broadcasting during queries. Since the store location table is updated infrequently, this will significantly speed up queries without incurring frequent update costs.
Question rkajapRRoYHwJVHKbNfW
Question
A company has a data warehouse that contains a table that is named Sales. The company stores the table in Amazon Redshift. The table includes a column that is named city_name. The company wants to query the table to find all rows that have a city_name that starts with “San” or “El”.
Which SQL query will meet this requirement?
Choices
- A: Select * from Sales where city_name ~ ‘$(San|El)*’;
- B: Select * from Sales where city_name ~ ‘^(San|El)*’;
- C: Select * from Sales where city_name ~’$(San&El)*’;
- D: Select * from Sales where city_name ~ ‘^(San&El)*’;
answer?
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1253703 by chrispchrisp
- Upvotes: 7
Selected Answer: B Regex Patterns for everyone’s reference
- . : Matches any single character.
- * : Matches zero or more of the preceding element.
- + : Matches one or more of the preceding element.
- [abc] : Matches any of the enclosed characters.
- [^abc] : Matches any character not enclosed.
- ^ : Matches the start of a string.
- $ : Matches the end of a string.
- | : Logical OR operator.
- (abc) : Matches 'abc' and remembers the match.
Answer is B
Comment 1249199 by andrologin
- Upvotes: 1
Selected Answer: B Regex patterns: ^ is used to match the start of the text/string; | is used as an OR operator.
Comment 1240748 by bakarys
- Upvotes: 1
Selected Answer: B B. Select * from Sales where city_name ~ ‘^(San|El)*’;
This query uses a regular expression pattern with the ~ operator. The caret ^ at the beginning of the pattern indicates that the match must start at the beginning of the string. (San|El) matches either “San” or “El”, and * means zero or more of the preceding element. So this query will return all rows where city_name starts with either “San” or “El”.
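For a quick sanity check of the pattern, a sketch that issues answer B's query through the Redshift Data API via boto3; the cluster, database, and user names are placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Answer B's pattern: ^ anchors at the start of city_name and (San|El) matches either prefix.
# Note the trailing * makes the group optional; dropping it ('^(San|El)') makes the prefix required.
redshift_data.execute_statement(
    ClusterIdentifier="retail-cluster",  # placeholder
    Database="retail_dw",                # placeholder
    DbUser="admin",                      # placeholder
    Sql="SELECT * FROM Sales WHERE city_name ~ '^(San|El)*';",
)
```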
Comment 1234148 by HunkyBunky
- Upvotes: 1
Selected Answer: B B - because of the regexp
Comment 1232616 by JohnYang
- Upvotes: 3
Selected Answer: B ^ asserts the position at the start of the string. (San|El) matches either “San” or “El”.
Comment 1230822 by tgv
- Upvotes: 2
Selected Answer: B ~: This operator indicates the use of a regular expression. ^: This symbol signifies the start of the string. (San|El): This pattern matches strings that start with either “San” or “El”.
Question qcP7hpvIXqnPUhakugeB
Question
A company needs to send customer call data from its on-premises PostgreSQL database to AWS to generate near real-time insights. The solution must capture and load updates from operational data stores that run in the PostgreSQL database. The data changes continuously.
A data engineer configures an AWS Database Migration Service (AWS DMS) ongoing replication task. The task reads changes in near real time from the PostgreSQL source database transaction logs for each table. The task then sends the data to an Amazon Redshift cluster for processing.
The data engineer discovers latency issues during the change data capture (CDC) of the task. The data engineer thinks that the PostgreSQL source database is causing the high latency.
Which solution will confirm that the PostgreSQL database is the source of the high latency?
Choices
- A: Use Amazon CloudWatch to monitor the DMS task. Examine the CDCIncomingChanges metric to identify delays in the CDC from the source database.
- B: Verify that logical replication of the source database is configured in the postgresql.conf configuration file.
- C: Enable Amazon CloudWatch Logs for the DMS endpoint of the source database. Check for error messages.
- D: Use Amazon CloudWatch to monitor the DMS task. Examine the CDCLatencySource metric to identify delays in the CDC from the source database.
answer?
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1230825 by tgv
- Upvotes: 5
Selected Answer: D CDCLatencySource Metric: This metric measures the latency between the source database and the DMS task. It shows how long it takes for changes to be read from the source database’s transaction logs.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Monitoring.html#CHAP_Monitoring.Metrics
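A hedged boto3 sketch of pulling that metric from CloudWatch; the replication instance and task identifiers are placeholders, and the dimension names are my assumption of the standard DMS task metric dimensions.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# CDCLatencySource measures how far behind the task is when reading changes from
# the source database. Replication instance/task identifiers below are placeholders.
end = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/DMS",
    MetricName="CDCLatencySource",
    Dimensions=[
        {"Name": "ReplicationInstanceIdentifier", "Value": "dms-replication-instance"},
        {"Name": "ReplicationTaskIdentifier", "Value": "postgres-to-redshift-task"},
    ],
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=300,
    Statistics=["Average", "Maximum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```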
Comment 1241838 by HunkyBunky
- Upvotes: 1
Selected Answer: D only D makes sense
Comment 1235875 by sdas1
- Upvotes: 1
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Troubleshooting_Latency.html A high CDCLatencySource metric indicates that the process of capturing changes from the source is delayed. Answer is D
Question O9JuY09dDIEtmoCbGzly
Question
A lab uses IoT sensors to monitor humidity, temperature, and pressure for a project. The sensors send 100 KB of data every 10 seconds. A downstream process will read the data from an Amazon S3 bucket every 30 seconds.
Which solution will deliver the data to the S3 bucket with the LEAST latency?
Choices
- A: Use Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose to deliver the data to the S3 bucket. Use the default buffer interval for Kinesis Data Firehose.
- B: Use Amazon Kinesis Data Streams to deliver the data to the S3 bucket. Configure the stream to use 5 provisioned shards.
- C: Use Amazon Kinesis Data Streams and call the Kinesis Client Library to deliver the data to the S3 bucket. Use a 5 second buffer interval from an application.
- D: Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) and Amazon Kinesis Data Firehose to deliver the data to the S3 bucket. Use a 5 second buffer interval for Kinesis Data Firehose.
answer?
Answer: C Answer_ET: C Community answer C (59%) D (23%) A (18%) Discussion
Comment 1230839 by tgv
- Upvotes: 8
Selected Answer: C C - This option ensures low latency by using a short buffer interval (5 seconds). The use of KCL allows for customized processing logic and timely delivery of data to S3. This makes it a strong candidate for minimal latency.
D - While this option provides low latency with a 5-second buffer interval, it introduces unnecessary complexity by using Apache Flink for what seems to be a straightforward data ingestion task. This option is overkill for the given use case and may add more operational overhead than necessary.
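The KCL itself is a Java library, so as a rough approximation of option C's buffer-and-flush idea, here is a plain boto3 consumer that reads one shard and writes to S3 roughly every 5 seconds; the stream and bucket names are placeholders, and a real KCL application would also handle checkpointing and multiple shards.

```python
import time
import boto3

kinesis = boto3.client("kinesis")
s3 = boto3.client("s3")

STREAM = "sensor-stream"          # placeholder stream name
BUCKET = "example-sensor-bucket"  # placeholder bucket name

# Read from a single shard and flush the buffered records to S3 every ~5 seconds.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

buffer, last_flush = [], time.time()
while True:
    result = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    iterator = result["NextShardIterator"]
    buffer.extend(record["Data"].decode("utf-8") for record in result["Records"])

    if buffer and time.time() - last_flush >= 5:
        key = f"sensor-data/{int(time.time())}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body="\n".join(buffer).encode("utf-8"))
        buffer, last_flush = [], time.time()

    time.sleep(0.2)  # stay under the per-shard GetRecords limits
```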
Comment 1230890 by artworkad
- Upvotes: 5
Selected Answer: D Kinesis Data Streams cannot deliver directly to S3. Data has to go through Firehose. A would work but is not the lowest latency. I would go with D, as we can set the buffer interval to a low value. We do not need Flink, though; that's a bit confusing.
Comment 1316237 by Eleftheriia
- Upvotes: 2
Selected Answer: A Why couldn't it be A? https://aws.amazon.com/blogs/big-data/optimize-downstream-data-processing-with-amazon-data-firehose-and-amazon-emr-running-apache-spark/ It uses Data Firehose + Kinesis Data Streams.
Comment 1303498 by Parandhaman_Margan
- Upvotes: 1
Answer: D
Comment 1254085 by andrologin
- Upvotes: 2
Selected Answer: C Use Data Streams and the KCL. Option A would be right, but Firehose's default buffer interval rules it out. D adds extra components that are not needed to deliver the data.
Comment 1249712 by LR2023
- Upvotes: 2
Selected Answer: A https://aws.amazon.com/about-aws/whats-new/2023/12/amazon-kinesis-data-firehose-zero-buffering/
Comment 1245768 by 4bc91ae
- Upvotes: 1
It's C - option D uses 1/ Analytics, which summarizes the data and hence adds delay, then passes it to 2/ Firehose for delivery, and the option doesn't say Firehose is using zero buffering.
Comment 1236436 by sdas1
- Upvotes: 1
Firehose uses multi-part upload for S3 destination when you configure a buffer time interval less than 60 seconds to offer lower latencies. Due to multi-part upload for S3 destination, you will see some increase in S3 PUT API costs if you choose a buffer time interval less than 60 seconds.
Comment 1230804 by GHill1982
- Upvotes: 3
Selected Answer: C I think the answer is C. Kinesis Data Firehose has a minimum buffer interval of 60 seconds (1 minute) or 1 MB of data.