Questions and Answers
Question MKrVF9K3aA6KlbZET9Dx
Question
A media company wants to improve a system that recommends media content to customers based on user behavior and preferences. To improve the recommendation system, the company needs to incorporate insights from third-party datasets into the company’s existing analytics platform. The company wants to minimize the effort and time required to incorporate third-party datasets. Which solution will meet these requirements with the LEAST operational overhead?
Choices
- A: Use API calls to access and integrate third-party datasets from AWS Data Exchange.
- B: Use API calls to access and integrate third-party datasets from AWS DataSync.
- C: Use Amazon Kinesis Data Streams to access and integrate third-party datasets from AWS CodeCommit repositories.
- D: Use Amazon Kinesis Data Streams to access and integrate third-party datasets from Amazon Elastic Container Registry (Amazon ECR).
Answer: A Answer_ET: A Community answer A (93%) 7% Discussion
Comment 1198355 by KelvinPun
- Upvotes: 13
Selected Answer: A AWS DataSync is a data transfer service designed to simplify, automate, and accelerate moving data between on-premises storage systems and AWS storage services, as well as between different AWS storage services. Its primary role is not accessing third-party datasets but efficiently transferring large volumes of data. In contrast, AWS Data Exchange is designed specifically for discovering and subscribing to third-party data in the cloud, providing direct API access to these datasets, which aligns perfectly with the company’s need to integrate this data into its recommendation system with minimal overhead.
Comment 1330879 by ttpro1995
- Upvotes: 3
Selected Answer: A Yeah, AWS wants people to buy data from its marketplace. So, … you know.
Comment 1288853 by Shubham1989
- Upvotes: 1
Should be AWS Data Exchange.
Comment 1250160 by lunachi4
- Upvotes: 1
Selected Answer: A I will go with A. Kinesis Data Streams involves more operational overhead.
Comment 1245911 by Manohar24
- Upvotes: 1
Selected Answer: A A is correct
Comment 1223805 by sudohogan
- Upvotes: 2
A is correct; DataSync doesn’t really rely on API calls.
Comment 1223303 by nanaw770
- Upvotes: 1
Selected Answer: A A is the correct answer.
Comment 1215808 by 0060594
- Upvotes: 1
Selected Answer: A AWS Data Exchange
Comment 1208679 by k350Secops
- Upvotes: 1
Selected Answer: A Options B, C, and D involve Amazon Kinesis Data Streams or other services that are not directly suited to integrating third-party datasets. These options would require additional configuration, data-processing steps, or infrastructure management, resulting in higher operational overhead compared with directly leveraging AWS Data Exchange’s capabilities through API calls (option A).
Comment 1176326 by hsnin
- Upvotes: 4
Selected Answer: A AWS Data Exchange is a service that makes it easy to share and manage data permissions from other organizations.
Comment 1172533 by kj07
- Upvotes: 1
I will go with A.
Comment 1170191 by Josa2
- Upvotes: 2
Selected Answer: B There is no info or guarantee this third-party dataset is available in AWS to be part of a data-share, hence the more assertive answer is B
Comment 1167771 by GiorgioGss
- Upvotes: 3
Selected Answer: A A for me. “You can also discover and subscribe to new third-party data sets available through AWS Data Exchange” https://docs.aws.amazon.com/data-exchange/latest/userguide/what-is.html
Comment 1148917 by ceramem
- Upvotes: 2
A Data exchange is primarily designed for this purpose.
Comment 1137864 by TonyStark0122
- Upvotes: 3
A Data exchange is primarily designed for this purpose.
Comment 1137262 by lalitjhawar
- Upvotes: 3
A Data Exchange is AWS’s official third-party dataset repository: https://aws.amazon.com/data-exchange
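To make option A concrete, here is a minimal sketch of the API calls involved, assuming boto3 and an active Data Exchange subscription; the bucket name and the choice of the first dataset and revision are hypothetical placeholders, not part of the original question:

```python
import boto3

dx = boto3.client("dataexchange")

# Datasets the account is entitled to through Data Exchange subscriptions.
datasets = dx.list_data_sets(Origin="ENTITLED")["DataSets"]
data_set_id = datasets[0]["Id"]  # assumes at least one entitled dataset

# Grab the most recent revision of the chosen dataset.
revisions = dx.list_data_set_revisions(DataSetId=data_set_id)["Revisions"]
revision_id = revisions[0]["Id"]

# Create and start a job that copies the revision's assets into our bucket
# (bucket name is a placeholder), where the analytics platform can read them.
job = dx.create_job(
    Type="EXPORT_REVISIONS_TO_S3",
    Details={
        "ExportRevisionsToS3": {
            "DataSetId": data_set_id,
            "RevisionDestinations": [
                {"Bucket": "analytics-platform-landing", "RevisionId": revision_id}
            ],
        }
    },
)
dx.start_job(JobId=job["Id"])
```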
Question anmqEXJWq0fWcXuASMXo
Question
A company is migrating a legacy application to an Amazon S3 based data lake. A data engineer reviewed data that is associated with the legacy application. The data engineer found that the legacy data contained some duplicate information. The data engineer must identify and remove duplicate information from the legacy application data. Which solution will meet these requirements with the LEAST operational overhead?
Choices
- A: Write a custom extract, transform, and load (ETL) job in Python. Use the DataFrame.drop_duplicates() function by importing the Pandas library to perform data deduplication.
- B: Write an AWS Glue extract, transform, and load (ETL) job. Use the FindMatches machine learning (ML) transform to transform the data to perform data deduplication.
- C: Write a custom extract, transform, and load (ETL) job in Python. Import the Python dedupe library. Use the dedupe library to perform data deduplication.
- D: Write an AWS Glue extract, transform, and load (ETL) job. Import the Python dedupe library. Use the dedupe library to perform data deduplication.
Answer: B Answer_ET: B Community answer B (85%) A (15%) Discussion
Comment 1140101 by rralucard_
- Upvotes: 6
Selected Answer: B Option B, writing an AWS Glue ETL job with the FindMatches ML transform, is likely to meet the requirements with the least operational overhead. This solution leverages a managed service (AWS Glue) and incorporates a built-in ML transform specifically designed for deduplication, thus minimizing the need for manual setup, maintenance, and machine learning expertise.
Comment 1328213 by JP
- Upvotes: 2
Selected Answer: A I disagree with B. That option requires additional effort just to train the ML model with labeled data. Option A is as simple as using the robust Pandas library.
Comment 1260940 by V0811
- Upvotes: 1
Selected Answer: B 100 % B
Comment 1170879 by GiorgioGss
- Upvotes: 4
Selected Answer: B B. https://docs.aws.amazon.com/glue/latest/dg/machine-learning.html “Find matches Finds duplicate records in the source data. You teach this machine learning transform by labeling example datasets to indicate which rows match. The machine learning transform learns which rows should be matches the more you teach it with example labeled data.”
Comment 1142138 by Aesthet
- Upvotes: 1
Remove duplicates from already migrated data - probably D. Remove duplicates from data before migration - A is preferable.
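For reference, here is a minimal sketch of what option B’s Glue ETL job might look like, assuming a FindMatches transform has already been created and taught with labeled examples in the Glue console; the catalog database, table name, transform ID, and S3 path are hypothetical placeholders:

```python
from awsglue.context import GlueContext
from awsglue.ml import FindMatches
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the legacy application's data from the Glue Data Catalog
# (database and table names are placeholders).
legacy = glue_context.create_dynamic_frame.from_catalog(
    database="legacy_app", table_name="records"
)

# Apply a FindMatches ML transform that was created and trained beforehand;
# the transform ID is a placeholder.
matched = FindMatches.apply(frame=legacy, transformId="tfm-0123456789abcdef")

# FindMatches tags likely duplicates with a shared match ID rather than
# deleting rows; write the labeled output back to the lake (path is a placeholder).
glue_context.write_dynamic_frame.from_options(
    frame=matched,
    connection_type="s3",
    connection_options={"path": "s3://example-data-lake/deduped/"},
    format="parquet",
)
```

Because FindMatches groups likely duplicates under a match ID instead of removing them, a short follow-up step (for example, keeping one record per match group) completes the deduplication.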
Question UUGBzJe20I95UfeKQPcG
Question
A company is building an analytics solution. The solution uses Amazon S3 for data lake storage and Amazon Redshift for a data warehouse. The company wants to use Amazon Redshift Spectrum to query the data that is in Amazon S3. Which actions will provide the FASTEST queries? (Choose two.)
Choices
- A: Use gzip compression to compress individual files to sizes that are between 1 GB and 5 GB.
- B: Use a columnar storage file format.
- C: Partition the data based on the most common query predicates.
- D: Split the data into files that are less than 10 KB.
- E: Use file formats that are not splittable.
Answer: BC Answer_ET: BC Community answer BC (100%) Discussion
Comment 1170883 by GiorgioGss
- Upvotes: 6
Selected Answer: BC https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-performance.html
Comment 1139271 by rralucard_
- Upvotes: 5
Selected Answer: BC B. Use a columnar storage file format: This is an excellent approach. Columnar storage formats like Parquet and ORC are highly recommended for use with Redshift Spectrum. They store data in columns, which allows Spectrum to scan only the needed columns for a query, significantly improving query performance and reducing the amount of data scanned.
C. Partition the data based on the most common query predicates: Partitioning data in S3 based on commonly used query predicates (like date, region, etc.) allows Redshift Spectrum to skip large portions of data that are irrelevant to a particular query. This can lead to substantial performance improvements, especially for large datasets.
Comment 1245258 by andrologin
- Upvotes: 1
Selected Answer: BC Partitioning helps filter the data, and columnar storage is optimized for analytical (OLAP) queries.
Comment 1227404 by pypelyncar
- Upvotes: 3
Selected Answer: BC Redshift Spectrum is optimized for querying data stored in columnar formats like Parquet or ORC. These formats store each data column separately, allowing Redshift Spectrum to scan only the relevant columns for a specific query, significantly improving performance compared to row-oriented formats. Partitioning organizes data files in S3 based on specific column values (e.g., date, region). When your queries filter or join data based on these partitioning columns (common query predicates), Redshift Spectrum can quickly locate the relevant data files, minimizing the amount of data scanned and accelerating query execution.
Comment 1207703 by d8945a1
- Upvotes: 1
Selected Answer: BC https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
Comment 1179397 by certplan
- Upvotes: 1
- Partitioning: AWS documentation for Amazon Redshift Spectrum highlights the importance of partitioning data based on commonly used query predicates to improve query performance. By partitioning data, Redshift Spectrum can prune unnecessary partitions during query execution, reducing the amount of data scanned and improving overall query performance. This guidance can be found in the AWS documentation for Amazon Redshift Spectrum under “Using Partitioning to Improve Query Performance”: https://docs.aws.amazon.com/redshift/latest/dg/c-using-spectrum-partitioning.html
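To illustrate both B and C, here is a minimal sketch of producing the S3 layout that Spectrum scans fastest: columnar Parquet, partitioned on a common query predicate. It assumes pandas with pyarrow installed (and s3fs for S3 paths); the bucket path and column names are hypothetical:

```python
import pandas as pd

# Sample data standing in for the data lake contents.
df = pd.DataFrame(
    {
        "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
        "region": ["us-east-1", "eu-west-1", "us-east-1"],
        "views": [120, 45, 300],
    }
)

# Writes Parquet under s3://example-data-lake/events/event_date=2024-01-01/...,
# so Redshift Spectrum can prune partitions that a query's WHERE clause excludes
# and read only the columns the query references.
df.to_parquet(
    "s3://example-data-lake/events/",
    partition_cols=["event_date"],
    index=False,
)
```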
Question I07dgTyOtcTFYZgTuwet
Question
A company uses Amazon RDS to store transactional data. The company runs an RDS DB instance in a private subnet. A developer wrote an AWS Lambda function with default settings to insert, update, or delete data in the DB instance. The developer needs to give the Lambda function the ability to connect to the DB instance privately without using the public internet. Which combination of steps will meet this requirement with the LEAST operational overhead? (Choose two.)
Choices
- A: Turn on the public access setting for the DB instance.
- B: Update the security group of the DB instance to allow only Lambda function invocations on the database port.
- C: Configure the Lambda function to run in the same subnet that the DB instance uses.
- D: Attach the same security group to the Lambda function and the DB instance. Include a self-referencing rule that allows access through the database port.
- E: Update the network ACL of the private subnet to include a self-referencing rule that allows access through the database port.
Answer: CD Answer_ET: CD Community answer CD (62%) BD (24%) 14% Discussion
Comment 1185183 by Alagong
- Upvotes: 6
Selected Answer: CD Option B only modifies the inbound rules of the DB instance’s security group; it does not modify the outbound rules of the Lambda function’s security group. Additionally, option B alone does not facilitate a private connection from the Lambda function to the DB instance, so the Lambda function would still need to use the public internet to access it. Therefore, that option does not fulfill the requirements.
Comment 1348602 by rr01
- Upvotes: 1
Selected Answer: BD I would go with B and D, as C would add operational overhead in my opinion.
Comment 1321836 by altonh
- Upvotes: 1
Selected Answer: BC D is wrong. It is a bad security practice for a DB to share SG with the client. C is correct compared to the other opinions (A & E).
Comment 1294149 by proserv
- Upvotes: 1
Selected Answer: BD B and D. C is wrong: while you want the Lambda function to access the RDS instance privately, it does not need to run in the same subnet. As long as both are in the same VPC, the Lambda function can connect.
Comment 1223034 by tgv
- Upvotes: 2
Selected Answer: CD I will go with C and D on this one, because in my opinion B is not correctly phrased.
The correct way to phrase it would be something like:
Update the security group of the RDS instance to allow inbound traffic on the database port (3306) only from the security group associated with the Lambda function.
Comment 1206802 by sdas1
- Upvotes: 2
While placing the Lambda function in the same subnet as the DB instance would technically allow them to communicate privately within the same network, it introduces additional complexity and operational overhead. Lambda functions typically run in AWS-managed VPCs, and configuring them to run in a specific subnet might require manual intervention and ongoing maintenance.
Comment 1203747 by Snape
- Upvotes: 3
Selected Answer: BD
Comment 1188693 by lucas_rfsb
- Upvotes: 1
Selected Answer: CD I would go with CD, since it’s less operational effort, in my opinion
Comment 1186505 by arvehisa
- Upvotes: 4
Selected Answer: CD B: requires updating the security group, and there may be other applications besides the Lambda function that need to access the DB. D: it works and reuses the security group, which means less operational overhead.
Comment 1186157 by harrura
- Upvotes: 2
A is not an option, as it exposes the data to the public. B is not an option, as we don’t want the Lambda function to be the only entity accessing the DB; there can be many other apps, and restricting access this way is not scalable.
Comment 1179443 by certplan
- Upvotes: 1
B. - While updating the security group of the DB instance to allow only Lambda function invocations on the database port may seem like a viable solution, it’s not the most efficient approach. This option overlooks the need for the Lambda function to be able to communicate securely with the DB instance within the same VPC/subnet.
- Reference: Amazon RDS documentation on security groups
Comment 1179433 by certplan
- Upvotes: 1
AWS Lambda supports VPC configurations, allowing you to run Lambda functions within your own VPC. This enables private connectivity between Lambda functions and resources within the VPC, such as RDS DB instances. Reference: AWS Lambda documentation on VPC configurations: [AWS Lambda VPC Settings](https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html)
AWS security groups provide a flexible and scalable way to control traffic to your instances or resources. By attaching the same security group to both the Lambda function and the RDS DB instance, you can ensure they share the same set of rules for inbound and outbound traffic.
Self-referencing rules within security groups enable instances within the same security group to communicate with each other over specified ports.
Reference: AWS documentation on security groups and self-referencing rules: [Security Groups for Your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html)
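As an illustration of the VPC configuration described above, here is a minimal boto3 sketch that attaches an existing Lambda function to the DB instance’s private subnet and shared security group; the function name, subnet ID, and security group ID are placeholders:

```python
import boto3

lambda_client = boto3.client("lambda")

# Attach the function to the VPC so its traffic to the DB instance stays
# private. All identifiers below are hypothetical placeholders.
lambda_client.update_function_configuration(
    FunctionName="rds-data-writer",            # hypothetical function name
    VpcConfig={
        "SubnetIds": ["subnet-0abc1234"],      # the DB instance's private subnet
        "SecurityGroupIds": ["sg-1234567890"], # same SG as the DB instance
    },
)
```

Note that the function’s execution role also needs the EC2 network-interface permissions granted by the AWSLambdaVPCAccessExecutionRole managed policy.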
Comment 1179432 by certplan
- Upvotes: 4
So, there could be a justified argument for the following:
C. Configure the Lambda function to run in the same subnet that the DB instance uses: By running the Lambda function in the same subnet as the RDS DB instance, you enable them to communicate privately within the same network, eliminating the need for public internet access and reducing operational overhead.
D. Attach the same security group to the Lambda function and the DB instance. Include a self-referencing rule that allows access through the database port: By attaching the same security group to both the Lambda function and the RDS DB instance, and including a self-referencing rule that allows access through the database port, you ensure secure communication between them within the same VPC without exposing the database to the public internet. This approach minimizes operational overhead by centralizing security management and simplifying access control.
Comment 1179415 by certplan
- Upvotes: 1
Here’s how you would implement this:
Attach the same security group to both the Lambda function and the RDS DB instance: Ensure that both resources are associated with the same security group.
Create an inbound rule in the security group: Configure the security group to allow inbound traffic on the database port (e.g., 3306 for MySQL) from the security group itself.
For example, if the security group ID is sg-1234567890 and the database port is 3306, the inbound rule would look something like this:
Type: Custom TCP Rule
Protocol: TCP
Port Range: 3306 (or the port your database uses)
Source: sg-1234567890 (the security group ID itself)
This rule allows the Lambda function, which is also part of the same security group, to communicate with the RDS DB instance through the specified port. It effectively creates a loopback or self-referencing rule within the security group, allowing internal communication between resources while maintaining security boundaries.
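The same rule can be created programmatically; here is a minimal boto3 sketch using the placeholder security group ID and port from the comment above:

```python
import boto3

ec2 = boto3.client("ec2")

# Allow members of sg-1234567890 to reach each other on the database port.
ec2.authorize_security_group_ingress(
    GroupId="sg-1234567890",
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 3306,
            "ToPort": 3306,
            # Referencing the group itself makes the rule "self-referencing".
            "UserIdGroupPairs": [{"GroupId": "sg-1234567890"}],
        }
    ],
)
```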
Comment 1179413 by certplan
- Upvotes: 2
The phrase “Include a self-referencing rule that allows access through the database port” refers to configuring the security group associated with the resources (in this case, the Lambda function and the RDS DB instance) to allow inbound traffic from the resources themselves on a specific port, typically the port used for database communication.
In AWS security groups, a self-referencing rule means allowing traffic from the security group itself. This setup is often used to facilitate communication between resources within the same security group or VPC without needing to specify individual IP addresses.
Comment 1170884 by GiorgioGss
- Upvotes: 1
Selected Answer: BC When you want Lambda to “privately” connect to a resource (RDS in this case) that sits inside a VPC, then you deploy Lambda inside VPC. = C Then you attach a proper IAM role to lambda. Then, to be more secure you open the RDS security group only on the specific port: MySQL/Aurora MySQL: 3306 SQL Server: 1433 PostgreSQL: 5432 Oracle: 1521
Comment 1147314 by BartoszGolebiowski24
- Upvotes: 1
what does “Include a self-referencing rule that allows access through the database port.” mean?
Comment 1139272 by rralucard_
- Upvotes: 1
Selected Answer: BC B. Update the security group of the DB instance to allow only Lambda function invocations on the database port: Modifying the security group of the RDS instance to allow incoming connections on the database port (e.g., port 3306 for MySQL, 5432 for PostgreSQL) from the Lambda function is a crucial step. This ensures that the RDS instance can accept connections from the Lambda function.
C. Configure the Lambda function to run in the same subnet that the DB instance uses: Placing the Lambda function in the same VPC and subnet as the RDS instance ensures private connectivity. AWS Lambda needs to be configured with a VPC configuration that includes the subnet(s) and security group(s) that allow access to the RDS instance.
Question BF47CG7McNndhYmimi0r
Question
A company has a frontend ReactJS website that uses Amazon API Gateway to invoke REST APIs. The APIs perform the functionality of the website. A data engineer needs to write a Python script that can be occasionally invoked through API Gateway. The code must return results to API Gateway. Which solution will meet these requirements with the LEAST operational overhead?
Choices
- A: Deploy a custom Python script on an Amazon Elastic Container Service (Amazon ECS) cluster.
- B: Create an AWS Lambda Python function with provisioned concurrency.
- C: Deploy a custom Python script that can integrate with API Gateway on Amazon Elastic Kubernetes Service (Amazon EKS).
- D: Create an AWS Lambda function. Ensure that the function is warm by scheduling an Amazon EventBridge rule to invoke the Lambda function every 5 minutes by using mock events.
Answer: B Answer_ET: B Community answer B (83%) D (17%) Discussion
Comment 1398908 by MephiboshethGumani
- Upvotes: 2
Selected Answer: D D. Create an AWS Lambda function. Ensure that the function is warm by scheduling an Amazon EventBridge rule to invoke the Lambda function every 5 minutes by using mock events.
Comment 1306087 by royalrum
- Upvotes: 1
I am thinking B. Don’t you think provisioned concurrency adds additional cost even when the function is not in active use, which is unnecessary for an occasionally invoked function?
Comment 1227412 by pypelyncar
- Upvotes: 4
Selected Answer: B B and D are both ok. Still, since it says LEAST operational overhead, then keep it simple. B then.
Comment 1204449 by HunkyBunky
- Upvotes: 2
Selected Answer: B B - simple and clear
Comment 1187366 by lucas_rfsb
- Upvotes: 1
Selected Answer: B I would go in B
Comment 1170888 by GiorgioGss
- Upvotes: 1
Selected Answer: B Although D seems like a good choice, the question asks for the LEAST operational overhead, which points to B.
Comment 1163730 by damaldon
- Upvotes: 1
Answ. B You can create a web API with an HTTP endpoint for your Lambda function by using Amazon API Gateway. API Gateway provides tools for creating and documenting web APIs that route HTTP requests to Lambda functions. You can secure access to your API with authentication and authorization controls. Your APIs can serve traffic over the internet or can be accessible only within your VPC. https://docs.aws.amazon.com/lambda/latest/dg/services-apigateway.html
Comment 1137328 by rralucard_
- Upvotes: 2
Selected Answer: B B. AWS Lambda functions can be easily integrated with Amazon API Gateway to create RESTful APIs. This integration allows API Gateway to directly invoke the Lambda function when the API endpoint is hit.
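To round out option B, here is a minimal sketch of the Python handler such a Lambda function might use, assuming Lambda proxy integration in API Gateway (which expects a statusCode/headers/body response); the business logic and query parameter are hypothetical placeholders:

```python
import json

def lambda_handler(event, context):
    # event carries the HTTP request details from API Gateway
    # (path, queryStringParameters, body, and so on).
    name = (event.get("queryStringParameters") or {}).get("name", "world")

    # Proxy integration requires a JSON-serializable dict in this shape;
    # API Gateway passes statusCode, headers, and body back to the caller.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```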