Questions and Answers
Question eBkb3ssamRqN1UPPLzkV
Question
An online retail company has an application that runs on Amazon EC2 instances that are in a VPC. The company wants to collect flow logs for the VPC and analyze network traffic.
Which solution will meet these requirements MOST cost-effectively?
Choices
- A: Publish flow logs to Amazon CloudWatch Logs. Use Amazon Athena for analytics.
- B: Publish flow logs to Amazon CloudWatch Logs. Use an Amazon OpenSearch Service cluster for analytics.
- C: Publish flow logs to Amazon S3 in text format. Use Amazon Athena for analytics.
- D: Publish flow logs to Amazon S3 in Apache Parquet format. Use Amazon Athena for analytics.
answer?
Answer: D Answer_ET: D Community answer D (93%) 7% Discussion
Comment 1230819 by tgv
- Upvotes: 6
Selected Answer: D Flow Logs can be published to S3 in Parquet format: https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-s3.html#flow-logs-s3-path
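For illustration, a minimal boto3 sketch of publishing flow logs straight to S3 as Parquet; the VPC ID and bucket ARN are placeholder values.

```python
import boto3

ec2 = boto3.client("ec2")

# Publish VPC flow logs directly to S3 in Apache Parquet format.
# The VPC ID and destination bucket ARN below are placeholders.
ec2.create_flow_logs(
    ResourceType="VPC",
    ResourceIds=["vpc-0123456789abcdef0"],
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::example-flow-logs-bucket/vpc-flow-logs/",
    DestinationOptions={
        "FileFormat": "parquet",           # store records as Parquet instead of plain text
        "HiveCompatiblePartitions": True,  # partition layout Athena can use directly
        "PerHourPartition": True,          # hourly partitions to narrow Athena scans
    },
)
```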
Comment 1234755 by PGGuy
- Upvotes: 5
Selected Answer: D Publishing flow logs to Amazon S3 in Apache Parquet format and using Amazon Athena for analytics (D) is the most cost-effective solution. This approach minimizes storage costs due to the efficient compression of Parquet, and optimizes query performance and cost in Athena due to the reduced data size and optimized columnar storage.
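To sketch the analytics half, a hedged boto3 example that runs an Athena query over the Parquet flow-log data; the database, table, and output location names are assumptions and would need to exist already (for example, a table defined over the flow-log S3 prefix).

```python
import boto3

athena = boto3.client("athena")

# Query the Parquet flow-log table (database, table, and output names are placeholders).
response = athena.start_query_execution(
    QueryString="""
        SELECT srcaddr, dstaddr, SUM(bytes) AS total_bytes
        FROM vpc_flow_logs_parquet
        GROUP BY srcaddr, dstaddr
        ORDER BY total_bytes DESC
        LIMIT 20
    """,
    QueryExecutionContext={"Database": "flow_logs_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```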
Comment 1258718 by jyrajan69
- Upvotes: 2
The question clearly says MOST cost-effective, so comparing C and D, it has to be C.
Comment 1249669 by LR2023
- Upvotes: 1
Selected Answer: B Flow logs can be published to S3, but option D says in Parquet format - the logs are not automatically converted into Parquet… https://aws.amazon.com/solutions/implementations/centralized-logging-with-opensearch/
Comment 1236241 by HunkyBunky
- Upvotes: 2
Selected Answer: D Apache parquet and S3 = most cost-effective solution
Question j0er2KlKvcfUvm5wf8Zt
Question
A retail company stores transactions, store locations, and customer information tables in four reserved ra3.4xlarge Amazon Redshift cluster nodes. All three tables use even table distribution.
The company updates the store location table only once or twice every few years.
A data engineer notices that Redshift queues are slowing down because the whole store location table is constantly being broadcast to all four compute nodes for most queries. The data engineer wants to speed up the query performance by minimizing the broadcasting of the store location table.
Which solution will meet these requirements in the MOST cost-effective way?
Choices
- A: Change the distribution style of the store location table from EVEN distribution to ALL distribution.
- B: Change the distribution style of the store location table to KEY distribution based on the column that has the highest dimension.
- C: Add a join column named store_id into the sort key for all the tables.
- D: Upgrade the Redshift reserved node to a larger instance size in the same instance family.
answer?
Answer: A Answer_ET: A Community answer A (100%) Discussion
Comment 1249195 by andrologin
- Upvotes: 2
Selected Answer: A ALL distribution is optimal for slowly changing dimension tables, which are generally small in size, because it allows for optimal joins.
Comment 1240742 by bakarys
- Upvotes: 2
Selected Answer: A The most cost-effective solution to speed up the query performance by minimizing the broadcasting of the store location table would be:
A. Change the distribution style of the store location table from EVEN distribution to ALL distribution.
In Amazon Redshift, the ALL distribution style replicates the entire table to all nodes in the cluster, which eliminates the need to redistribute the data when executing a query. This can significantly improve query performance. Given that the store location table is updated only once or twice every few years, the overhead of maintaining the replicated data would be minimal. This makes it a cost-effective solution for improving the query performance.
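As a small illustration of the change itself, a boto3 Redshift Data API sketch; the cluster, database, user, and table names are placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Switch the small, rarely updated dimension table to ALL distribution so a full
# copy lives on every compute node and no longer needs to be broadcast at query time.
# Cluster, database, user, and table names below are placeholders.
redshift_data.execute_statement(
    ClusterIdentifier="retail-cluster",
    Database="retail_dw",
    DbUser="admin",
    Sql="ALTER TABLE store_location ALTER DISTSTYLE ALL;",
)
```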
Comment 1234756 by PGGuy
- Upvotes: 4
Selected Answer: A Changing the distribution style of the store location table to ALL distribution (A) is the most cost-effective solution. It directly addresses the issue of broadcasting by ensuring the entire table is available on each node, significantly improving join performance without incurring substantial additional costs.
Comment 1230821 by tgv
- Upvotes: 2
Selected Answer: A Using ALL distribution means the table is replicated to all nodes, eliminating the need for broadcasting during queries. Since the store location table is updated infrequently, this will significantly speed up queries without incurring frequent update costs.
Question rkajapRRoYHwJVHKbNfW
Question
A company has a data warehouse that contains a table that is named Sales. The company stores the table in Amazon Redshift. The table includes a column that is named city_name. The company wants to query the table to find all rows that have a city_name that starts with “San” or “El”.
Which SQL query will meet this requirement?
Choices
- A: Select * from Sales where city_name ~ ‘$(San|El)*’;
- B: Select * from Sales where city_name ~ ‘^(San|El)*’;
- C: Select * from Sales where city_name ~’$(San&El)*’;
- D: Select * from Sales where city_name ~ ‘^(San&El)*’;
answer?
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1253703 by chrispchrisp
- Upvotes: 7
Selected Answer: B Regex Patterns for everyone’s reference
- . : Matches any single character.
- * : Matches zero or more of the preceding element.
- + : Matches one or more of the preceding element.
- [abc] : Matches any of the enclosed characters.
- [^abc] : Matches any character not enclosed.
- ^ : Matches the start of a string.
- $ : Matches the end of a string.
- | : Logical OR operator.
- (abc) : Matches 'abc' and remembers the match.
Answer is B
Comment 1249199 by andrologin
- Upvotes: 1
Selected Answer: B Regex patterns: ^ is used to match the start of the text/string; | is used as an OR operator.
Comment 1240748 by bakarys
- Upvotes: 1
Selected Answer: B B. Select * from Sales where city_name ~ ‘^(San|El)*’;
This query uses a regular expression pattern with the ~ operator. The caret ^ at the beginning of the pattern indicates that the match must start at the beginning of the string. (San|El) matches either “San” or “El”, and * means zero or more of the preceding element. So this query will return all rows where city_name starts with either “San” or “El”.
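For a quick sanity check of the pattern, a sketch that issues answer B's query through the Redshift Data API via boto3; the cluster, database, and user names are placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Answer B's pattern: ^ anchors at the start of city_name and (San|El) matches either prefix.
# Note the trailing * makes the group optional; dropping it ('^(San|El)') makes the prefix required.
redshift_data.execute_statement(
    ClusterIdentifier="retail-cluster",  # placeholder
    Database="retail_dw",                # placeholder
    DbUser="admin",                      # placeholder
    Sql="SELECT * FROM Sales WHERE city_name ~ '^(San|El)*';",
)
```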
Comment 1234148 by HunkyBunky
- Upvotes: 1
Selected Answer: B B - because of the regexp
Comment 1232616 by JohnYang
- Upvotes: 3
Selected Answer: B ^ asserts the position at the start of the string. (San|El) matches either “San” or “El”.
Comment 1230822 by tgv
- Upvotes: 2
Selected Answer: B ~: This operator indicates the use of a regular expression. ^: This symbol signifies the start of the string. (San|El): This pattern matches strings that start with either “San” or “El”.
Question qcP7hpvIXqnPUhakugeB
Question
A company needs to send customer call data from its on-premises PostgreSQL database to AWS to generate near real-time insights. The solution must capture and load updates from operational data stores that run in the PostgreSQL database. The data changes continuously.
A data engineer configures an AWS Database Migration Service (AWS DMS) ongoing replication task. The task reads changes in near real time from the PostgreSQL source database transaction logs for each table. The task then sends the data to an Amazon Redshift cluster for processing.
The data engineer discovers latency issues during the change data capture (CDC) of the task. The data engineer thinks that the PostgreSQL source database is causing the high latency.
Which solution will confirm that the PostgreSQL database is the source of the high latency?
Choices
- A: Use Amazon CloudWatch to monitor the DMS task. Examine the CDCIncomingChanges metric to identify delays in the CDC from the source database.
- B: Verify that logical replication of the source database is configured in the postgresql.conf configuration file.
- C: Enable Amazon CloudWatch Logs for the DMS endpoint of the source database. Check for error messages.
- D: Use Amazon CloudWatch to monitor the DMS task. Examine the CDCLatencySource metric to identify delays in the CDC from the source database.
answer?
Answer: D Answer_ET: D Community answer D (100%) Discussion
Comment 1230825 by tgv
- Upvotes: 5
Selected Answer: D CDCLatencySource Metric: This metric measures the latency between the source database and the DMS task. It shows how long it takes for changes to be read from the source database’s transaction logs.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Monitoring.html#CHAP_Monitoring.Metrics
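A hedged boto3 sketch of pulling that metric from CloudWatch; the replication instance and task identifiers are placeholders, and the dimension names are my assumption of the standard DMS task metric dimensions.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# CDCLatencySource measures how far behind the task is when reading changes from
# the source database. Replication instance/task identifiers below are placeholders.
end = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/DMS",
    MetricName="CDCLatencySource",
    Dimensions=[
        {"Name": "ReplicationInstanceIdentifier", "Value": "dms-replication-instance"},
        {"Name": "ReplicationTaskIdentifier", "Value": "postgres-to-redshift-task"},
    ],
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=300,
    Statistics=["Average", "Maximum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```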
Comment 1241838 by HunkyBunky
- Upvotes: 1
Selected Answer: D only D makes sense
Comment 1235875 by sdas1
- Upvotes: 1
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Troubleshooting_Latency.html A high CDCLatencySource metric indicates that the process of capturing changes from the source is delayed. Answer is D
Question O9JuY09dDIEtmoCbGzly
Question
A lab uses IoT sensors to monitor humidity, temperature, and pressure for a project. The sensors send 100 KB of data every 10 seconds. A downstream process will read the data from an Amazon S3 bucket every 30 seconds.
Which solution will deliver the data to the S3 bucket with the LEAST latency?
Choices
- A: Use Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose to deliver the data to the S3 bucket. Use the default buffer interval for Kinesis Data Firehose.
- B: Use Amazon Kinesis Data Streams to deliver the data to the S3 bucket. Configure the stream to use 5 provisioned shards.
- C: Use Amazon Kinesis Data Streams and call the Kinesis Client Library to deliver the data to the S3 bucket. Use a 5 second buffer interval from an application.
- D: Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) and Amazon Kinesis Data Firehose to deliver the data to the S3 bucket. Use a 5 second buffer interval for Kinesis Data Firehose.
answer?
Answer: C Answer_ET: C Community answer C (59%) D (23%) A (18%) Discussion
Comment 1230839 by tgv
- Upvotes: 8
Selected Answer: C C - This option ensures low latency by using a short buffer interval (5 seconds). The use of KCL allows for customized processing logic and timely delivery of data to S3. This makes it a strong candidate for minimal latency.
D - While this option provides low latency with a 5-second buffer interval, it introduces unnecessary complexity by using Apache Flink for what seems to be a straightforward data ingestion task. This option is overkill for the given use case and may add more operational overhead than necessary.
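The KCL itself is a Java library, so as a rough approximation of option C's buffer-and-flush idea, here is a plain boto3 consumer that reads one shard and writes to S3 roughly every 5 seconds; the stream and bucket names are placeholders, and a real KCL application would also handle checkpointing and multiple shards.

```python
import time
import boto3

kinesis = boto3.client("kinesis")
s3 = boto3.client("s3")

STREAM = "sensor-stream"          # placeholder stream name
BUCKET = "example-sensor-bucket"  # placeholder bucket name

# Read from a single shard and flush the buffered records to S3 every ~5 seconds.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

buffer, last_flush = [], time.time()
while True:
    result = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    iterator = result["NextShardIterator"]
    buffer.extend(record["Data"].decode("utf-8") for record in result["Records"])

    if buffer and time.time() - last_flush >= 5:
        key = f"sensor-data/{int(time.time())}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body="\n".join(buffer).encode("utf-8"))
        buffer, last_flush = [], time.time()

    time.sleep(0.2)  # stay under the per-shard GetRecords limits
```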
Comment 1230890 by artworkad
- Upvotes: 5
Selected Answer: D Kinesis Data Streams cannot deliver directly to S3. Data has to go through Firehose. A would work but is not the lowest latency. I would go with D, as we can set the buffer interval to a low value. We do not need Flink, though; that's a bit confusing.
Comment 1316237 by Eleftheriia
- Upvotes: 2
Selected Answer: A Why couldn't it be A? https://aws.amazon.com/blogs/big-data/optimize-downstream-data-processing-with-amazon-data-firehose-and-amazon-emr-running-apache-spark/ It uses Data Firehose + Kinesis Data Streams.
Comment 1303498 by Parandhaman_Margan
- Upvotes: 1
Answer: D
Comment 1254085 by andrologin
- Upvotes: 2
Selected Answer: C Use Data Streams and the KCL. Option A would be right, but Firehose's default buffer interval rules it out. D adds extra components that are not needed to deliver the data.
Comment 1249712 by LR2023
- Upvotes: 2
Selected Answer: A https://aws.amazon.com/about-aws/whats-new/2023/12/amazon-kinesis-data-firehose-zero-buffering/
Comment 1245768 by 4bc91ae
- Upvotes: 1
It's C - option D uses 1/ Analytics, which summarizes the data and hence adds delay, then passes it to 2/ Firehose for delivery, and the option doesn't say Firehose is using zero buffering.
Comment 1236436 by sdas1
- Upvotes: 1
Firehose uses multi-part upload for S3 destination when you configure a buffer time interval less than 60 seconds to offer lower latencies. Due to multi-part upload for S3 destination, you will see some increase in S3 PUT API costs if you choose a buffer time interval less than 60 seconds.
Comment 1230804 by GHill1982
- Upvotes: 3
Selected Answer: C I think the answer is C. Kinesis Data Firehose has a minimum buffer interval of 60 seconds (1 minute) or 1 MB of data.