Questions and Answers

Question uVdTM49xE1hTJEPm2gBE

Question

A company has implemented a lake house architecture in Amazon Redshift. The company needs to give users the ability to authenticate into the Amazon Redshift query editor by using a third-party identity provider (IdP).

A data engineer must set up the authentication mechanism.

What is the first step the data engineer should take to meet this requirement?

Choices

  • A: Register the third-party IdP as an identity provider in the configuration settings of the Redshift cluster.
  • B: Register the third-party IdP as an identity provider from within Amazon Redshift.
  • C: Register the third-party IdP as an identity provider for AWS Secrets Manager. Configure Amazon Redshift to use Secrets Manager to manage user credentials.
  • D: Register the third-party IdP as an identity provider for AWS Certificate Manager (ACM). Configure Amazon Redshift to use ACM to manage user credentials.
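
For context on what registering a third-party IdP "from within Amazon Redshift" can look like in practice, below is a minimal, hypothetical sketch that issues Redshift's CREATE IDENTITY PROVIDER statement through the Redshift Data API with boto3. The cluster name, provider name, namespace, and all Azure AD parameter values are placeholders, not details taken from the question.

```python
import json

import boto3

# Hypothetical sketch only: register a third-party IdP (Azure AD here) from
# within Amazon Redshift by running CREATE IDENTITY PROVIDER via the Data API.
client = boto3.client("redshift-data")

idp_params = {
    "issuer": "https://login.microsoftonline.com/<tenant-id>/v2.0",  # placeholder tenant
    "client_id": "<azure-ad-application-id>",                        # placeholder
    "client_secret": "<azure-ad-client-secret>",                     # placeholder
    "audience": ["<azure-ad-audience>"],                             # placeholder
}

sql = (
    "CREATE IDENTITY PROVIDER example_idp TYPE azure "  # example provider name
    "NAMESPACE 'example_ns' "                           # example namespace
    f"PARAMETERS '{json.dumps(idp_params)}'"
)

client.execute_statement(
    ClusterIdentifier="example-cluster",  # placeholder cluster identifier
    Database="dev",                       # placeholder database
    DbUser="awsuser",                     # admin user that can run this DDL
    Sql=sql,
)
```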

Question 1a338l71iO2SKBFAXhNl

Question

A company currently uses a provisioned Amazon EMR cluster that includes general purpose Amazon EC2 instances. The EMR cluster uses EMR managed scaling between one and five task nodes for the company’s long-running Apache Spark extract, transform, and load (ETL) job. The company runs the ETL job every day.

When the company runs the ETL job, the EMR cluster quickly scales up to five nodes. The EMR cluster often reaches maximum CPU usage, but the memory usage remains under 30%.

The company wants to modify the EMR cluster configuration to reduce the cost of running the daily ETL job.

Which solution will meet these requirements MOST cost-effectively?

Choices

  • A: Increase the maximum number of task nodes for EMR managed scaling to 10.
  • B: Change the task node type from general purpose EC2 instances to memory optimized EC2 instances.
  • C: Switch the task node type from general purpose EC2 instances to compute optimized EC2 instances.
  • D: Reduce the scaling cooldown period for the provisioned EMR cluster.
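
To make the levers in these choices concrete, the sketch below uses boto3 to adjust an existing EMR cluster's managed scaling limits and to add a compute optimized task instance group. The cluster ID, instance type, and capacity numbers are illustrative assumptions, not values from the scenario.

```python
import boto3

emr = boto3.client("emr")

CLUSTER_ID = "j-EXAMPLEID12345"  # placeholder cluster ID

# Adjust the EMR managed scaling limits for the cluster (compare choice A,
# which raises the maximum).
emr.put_managed_scaling_policy(
    ClusterId=CLUSTER_ID,
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 1,
            "MaximumCapacityUnits": 5,
        }
    },
)

# Add a compute optimized task instance group (compare choice C); c5.2xlarge is
# only an example of the compute optimized family.
emr.add_instance_groups(
    JobFlowId=CLUSTER_ID,
    InstanceGroups=[
        {
            "Name": "task-compute-optimized",
            "InstanceRole": "TASK",
            "InstanceType": "c5.2xlarge",
            "InstanceCount": 1,
        }
    ],
)
```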

Question 49QDnPfaUjMeGru4qTz4

Question

A company uploads .csv files to an Amazon S3 bucket. The company’s data platform team has set up an AWS Glue crawler to perform data discovery and to create the tables and schemas.

An AWS Glue job writes processed data from the tables to an Amazon Redshift database. The AWS Glue job handles column mapping and creates the Amazon Redshift tables in the Redshift database appropriately.

If the company reruns the AWS Glue job for any reason, duplicate records are introduced into the Amazon Redshift tables. The company needs a solution that will update the Redshift tables without duplicates.

Which solution will meet these requirements?

Choices

  • A: Modify the AWS Glue job to copy the rows into a staging Redshift table. Add SQL commands to update the existing rows with new values from the staging Redshift table.
  • B: Modify the AWS Glue job to load the previously inserted data into a MySQL database. Perform an upsert operation in the MySQL database. Copy the results to the Amazon Redshift tables.
  • C: Use Apache Spark’s DataFrame dropDuplicates() API to eliminate duplicates. Write the data to the Redshift tables.
  • D: Use the AWS Glue ResolveChoice built-in transform to select the value of the column from the most recent record.
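
As an illustration of the staging-table pattern that choice A describes, here is a minimal AWS Glue sketch that loads the transformed data into a staging Redshift table and runs a delete-then-insert merge as a post-action. The Glue connection name, table names, key column (id), and S3 temp path are assumptions made for the example.

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Toy DataFrame standing in for the job's transformed output.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
dyf = DynamicFrame.fromDF(df, glue_context, "dyf")

# Delete-then-insert merge that Redshift runs after the staging load finishes.
# Table names and the key column (id) are placeholders.
merge_sql = """
    BEGIN;
    DELETE FROM public.target_table
    USING public.staging_table
    WHERE public.target_table.id = public.staging_table.id;
    INSERT INTO public.target_table SELECT * FROM public.staging_table;
    DROP TABLE public.staging_table;
    END;
"""

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="example-redshift-connection",      # placeholder Glue connection
    connection_options={
        "dbtable": "public.staging_table",                  # write to staging first
        "database": "dev",                                  # placeholder database
        "postactions": merge_sql,                           # merge into the target table
    },
    redshift_tmp_dir="s3://example-temp-bucket/redshift/",  # placeholder temp location
)
```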

Question T7rfRHxj8NtTPis414k4

Question

A company is using Amazon Redshift to build a data warehouse solution. The company is loading hundreds of files into a fact table that is in a Redshift cluster.

The company wants the data warehouse solution to achieve the greatest possible throughput. The solution must use cluster resources optimally when the company loads data into the fact table.

Which solution will meet these requirements?

Choices

  • A: Use multiple COPY commands to load the data into the Redshift cluster.
  • B: Use S3DistCp to load multiple files into Hadoop Distributed File System (HDFS). Use an HDFS connector to ingest the data into the Redshift cluster.
  • C: Use a number of INSERT statements equal to the number of Redshift cluster nodes. Load the data in parallel into each node.
  • D: Use a single COPY command to load the data into the Redshift cluster.
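
For reference, a single COPY command pointed at a common Amazon S3 prefix lets Redshift split the files across node slices and load them in parallel. The sketch below runs such a command through the Redshift Data API; the cluster, table, bucket, and IAM role ARN are placeholders.

```python
import boto3

client = boto3.client("redshift-data")

# One COPY that points at the common prefix loads all of the files in a single
# parallel operation. Cluster, table, bucket, and role ARN are placeholders.
copy_sql = """
    COPY public.fact_sales
    FROM 's3://example-bucket/fact-files/'
    IAM_ROLE 'arn:aws:iam::111122223333:role/example-redshift-copy-role'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""

client.execute_statement(
    ClusterIdentifier="example-cluster",  # placeholder cluster identifier
    Database="dev",                       # placeholder database
    DbUser="awsuser",                     # placeholder database user
    Sql=copy_sql,
)
```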

Question x4CMK1CDTJYduX0iE8pr

Question

A company ingests data from multiple data sources and stores the data in an Amazon S3 bucket. An AWS Glue extract, transform, and load (ETL) job transforms the data and writes the transformed data to an Amazon S3 based data lake. The company uses Amazon Athena to query the data that is in the data lake.

The company needs to identify matching records even when the records do not have a common unique identifier.

Which solution will meet this requirement?

Choices

  • A: Use Amazon Macie pattern matching as part of the ETL job.
  • B: Train and use the AWS Glue PySpark Filter class in the ETL job.
  • C: Partition tables and use the ETL job to partition the data on a unique identifier.
  • D: Train and use the AWS Lake Formation FindMatches transform in the ETL job.
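
As background on the FindMatches transform named in choice D, here is a minimal sketch of applying a previously created and trained FindMatches ML transform inside an AWS Glue PySpark job. The catalog database, table name, transform ID, and output path are placeholders for the example.

```python
from awsglue.context import GlueContext
from awsglueml.transforms import FindMatches
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the crawled data-lake table from the Data Catalog.
# Database and table names are placeholders.
records = glue_context.create_dynamic_frame.from_catalog(
    database="example_datalake_db",
    table_name="example_records",
)

# Apply a FindMatches ML transform that was created and trained beforehand; it
# groups records that likely refer to the same entity even though they share
# no unique identifier. The transform ID is a placeholder.
matched = FindMatches.apply(
    frame=records,
    transformId="tfm-0123456789abcdef0123456789abcdef",
)

# Write the match results back to the S3 data lake for querying with Athena.
glue_context.write_dynamic_frame.from_options(
    frame=matched,
    connection_type="s3",
    connection_options={"path": "s3://example-datalake-bucket/matched/"},
    format="parquet",
)
```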