Questions and Answers
Question jdF2V1IHrIh9F64AgOdg
Question
A company uses AWS Glue Data Catalog to index data that is uploaded to an Amazon S3 bucket every day. The company uses a daily batch process in an extract, transform, and load (ETL) pipeline to upload data from external sources into the S3 bucket.
The company runs a daily report on the S3 data. Some days, the company runs the report before all the daily data has been uploaded to the S3 bucket. A data engineer must be able to send a message that identifies any incomplete data to an existing Amazon Simple Notification Service (Amazon SNS) topic.
Which solution will meet this requirement with the LEAST operational overhead?
Choices
- A: Create data quality checks for the source datasets that the daily reports use. Create a new AWS managed Apache Airflow cluster. Run the data quality checks by using Airflow tasks that run data quality queries on the columns' data types and the presence of null values. Configure Airflow Directed Acyclic Graphs (DAGs) to send an email notification to the SNS topic that informs the data engineer about the incomplete datasets.
- B: Create data quality checks on the source datasets that the daily reports use. Create a new Amazon EMR cluster. Use Apache Spark SQL to create Apache Spark jobs in the EMR cluster that run data quality queries on the columns' data types and the presence of null values. Orchestrate the ETL pipeline by using an AWS Step Functions workflow. Configure the workflow to send an email notification to the SNS topic that informs the data engineer about the incomplete datasets.
- C: Create data quality checks on the source datasets that the daily reports use. Create data quality actions by using AWS Glue workflows to confirm the completeness and consistency of the datasets. Configure the data quality actions to create an event in Amazon EventBridge if a dataset is incomplete. Configure EventBridge to send the event to the Amazon SNS topic to inform the data engineer about the incomplete datasets.
- D: Create AWS Lambda functions that run data quality queries on the columns' data types and the presence of null values. Orchestrate the ETL pipeline by using an AWS Step Functions workflow that runs the Lambda functions. Configure the Step Functions workflow to send an email notification to the SNS topic that informs the data engineer about the incomplete datasets.
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1328393 by 7a1d491
- Upvotes: 2
Selected Answer: C C has the LEAST operational overhead.
Comment 1317302 by emupsx1
- Upvotes: 1
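For reference, a minimal boto3 sketch of how option C could be wired up: a Glue Data Quality completeness ruleset plus an EventBridge rule that forwards evaluation-result events to the existing SNS topic. The ruleset, rule, table, and topic names are invented for the example, and the event source/detail-type values and DQDL rules should be verified against the current Glue Data Quality documentation.

```python
import boto3

glue = boto3.client("glue")
events = boto3.client("events")

# Hypothetical DQDL ruleset that flags an incomplete daily dataset.
glue.create_data_quality_ruleset(
    Name="daily-report-completeness",  # assumed name
    Ruleset='Rules = [ RowCount > 1000, Completeness "order_id" > 0.99 ]',
    TargetTable={"DatabaseName": "reporting_db", "TableName": "daily_orders"},
)

# Route Glue Data Quality evaluation results to the existing SNS topic.
# Filtering on failed results can be added per the documented event schema.
events.put_rule(
    Name="dq-incomplete-dataset",  # assumed rule name
    EventPattern=(
        '{"source": ["aws.glue-dataquality"],'
        ' "detail-type": ["Data Quality Evaluation Results Available"]}'
    ),
)
events.put_targets(
    Rule="dq-incomplete-dataset",
    Targets=[{"Id": "notify-sns",
              "Arn": "arn:aws:sns:us-east-1:111122223333:existing-topic"}],
)
```

The SNS topic's resource policy must also allow EventBridge to publish to it; that part is omitted here.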
Question UPm9nHOaC1Ol2lifCyzb
Question
A company stores customer data that contains personally identifiable information (PII) in an Amazon Redshift cluster. The company’s marketing, claims, and analytics teams need to be able to access the customer data.
The marketing team should have access to obfuscated claim information but should have full access to customer contact information. The claims team should have access to customer information for each claim that the team processes. The analytics team should have access only to obfuscated PII data.
Which solution will enforce these data access requirements with the LEAST administrative overhead?
Choices
- A: Create a separate Redshift cluster for each team. Load only the required data for each team. Restrict access to clusters based on the teams.
- B: Create views that include required fields for each of the data requirements. Grant the teams access only to the view that each team requires.
- C: Create a separate Amazon Redshift database role for each team. Define masking policies that apply for each team separately. Attach appropriate masking policies to each team role.
- D: Move the customer data to an Amazon S3 bucket. Use AWS Lake Formation to create a data lake. Use fine-grained security capabilities to grant each team appropriate permissions to access the data.
answer?
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1312694 by michele_scar
- Upvotes: 3
Selected Answer: C It's the only answer that matches both the least administrative overhead and the masking requirement.
Comment 1308012 by tucobbad
- Upvotes: 4
Selected Answer: C To me, C seems the best approach, as Redshift has a dynamic data masking feature: https://docs.aws.amazon.com/redshift/latest/dg/t_ddm.html
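As a rough illustration of option C, the statements below follow the dynamic data masking pattern from the linked docs, submitted here through the Redshift Data API. The role, policy, table, column, and cluster names are placeholders, and the exact masking-policy syntax should be checked against the documentation.

```python
import boto3

rsd = boto3.client("redshift-data")

statements = [
    # A role for the analytics team (name is an assumption for the example).
    "CREATE ROLE analytics_role;",
    # Fully redact a PII column.
    """CREATE MASKING POLICY mask_ssn_full
       WITH (ssn VARCHAR(11))
       USING ('XXX-XX-XXXX'::VARCHAR(11));""",
    # The analytics team only ever sees the obfuscated value.
    """ATTACH MASKING POLICY mask_ssn_full
       ON customers(ssn)
       TO ROLE analytics_role;""",
]

for sql in statements:
    rsd.execute_statement(
        ClusterIdentifier="customer-cluster",  # placeholder cluster name
        Database="dev",
        DbUser="admin",                        # or use SecretArn instead
        Sql=sql,
    )
```

Separate policies (for example, partial masking of claim fields for the marketing role) would be attached to the other team roles in the same way.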
Question t5GGzFQalaXm9CLcW8en
Question
A financial company recently added more features to its mobile app. The new features required the company to create a new topic in an existing Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster.
A few days after the company added the new topic, Amazon CloudWatch raised an alarm on the RootDiskUsed metric for the MSK cluster.
How should the company address the CloudWatch alarm?
Choices
- A: Expand the storage of the MSK broker. Configure the MSK cluster storage to expand automatically.
- B: Expand the storage of the Apache ZooKeeper nodes.
- C: Update the MSK broker instance to a larger instance type. Restart the MSK cluster.
- D: Specify the Target Volume-in-GiB parameter for the existing topic.
answer?
Answer: A Answer_ET: A Community answer A (100%) Discussion
Comment 1341184 by MerryLew
- Upvotes: 1
Selected Answer: A "RootDiskUsed" is the percentage of the root disk used by the broker. Expanding storage and enabling automatic scaling seems like the best bet.
Comment 1317307 by emupsx1
- Upvotes: 2
Selected Answer: A https://docs.aws.amazon.com/msk/latest/developerguide/metrics-details.html
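A hedged boto3 sketch of option A: bump the brokers' EBS volume size once, then register the cluster with Application Auto Scaling so storage expands automatically. The cluster ARN, volume sizes, and utilization target are placeholders, and the scalable dimension and predefined metric names should be confirmed against the MSK auto-scaling documentation.

```python
import boto3

CLUSTER_ARN = "arn:aws:kafka:us-east-1:111122223333:cluster/demo/abc-123"  # placeholder

kafka = boto3.client("kafka")
aas = boto3.client("application-autoscaling")

# One-off expansion of every broker's EBS volume.
current = kafka.describe_cluster(ClusterArn=CLUSTER_ARN)["ClusterInfo"]["CurrentVersion"]
kafka.update_broker_storage(
    ClusterArn=CLUSTER_ARN,
    CurrentVersion=current,
    TargetBrokerEBSVolumeInfo=[{"KafkaBrokerNodeId": "All", "VolumeSizeGB": 1000}],
)

# Let Application Auto Scaling grow storage automatically from here on.
aas.register_scalable_target(
    ServiceNamespace="kafka",
    ResourceId=CLUSTER_ARN,
    ScalableDimension="kafka:broker-storage:VolumeSize",
    MinCapacity=1000,
    MaxCapacity=4000,
)
aas.put_scaling_policy(
    PolicyName="msk-storage-autoscaling",
    ServiceNamespace="kafka",
    ResourceId=CLUSTER_ARN,
    ScalableDimension="kafka:broker-storage:VolumeSize",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,  # target broker storage utilization (%)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "KafkaBrokerStorageUtilization"
        },
    },
)
```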
Question 8N1yJt5DkQQRKJ9K6abi
Question
A data engineer is building a data pipeline on AWS by using AWS Glue extract, transform, and load (ETL) jobs. The data engineer needs to process data from Amazon RDS and MongoDB, perform transformations, and load the transformed data into Amazon Redshift for analytics. The data updates must occur every hour.
Which combination of tasks will meet these requirements with the LEAST operational overhead? (Choose two.)
Choices
- A: Configure AWS Glue triggers to run the ETL jobs every hour.
- B: Use AWS Glue DataBrew to clean and prepare the data for analytics.
- C: Use AWS Lambda functions to schedule and run the ETL jobs every hour.
- D: Use AWS Glue connections to establish connectivity between the data sources and Amazon Redshift.
- E: Use the Redshift Data API to load transformed data into Amazon Redshift.
answer?
Answer: AD Answer_ET: AD Community answer AD (89%), other (7%) Discussion
Comment 1128365 by rralucard_
- Upvotes: 7
Selected Answer: AD AWS Glue triggers provide a simple and integrated way to schedule ETL jobs. By configuring these triggers to run hourly, the data engineer can ensure that the data processing and updates occur as required without the need for external scheduling tools or custom scripts. This approach is directly integrated with AWS Glue, reducing the complexity and operational overhead. AWS Glue supports connections to various data sources, including Amazon RDS and MongoDB. By using AWS Glue connections, the data engineer can easily configure and manage the connectivity between these data sources and Amazon Redshift. This method leverages AWS Glue’s built-in capabilities for data source integration, thus minimizing operational complexity and ensuring a seamless data flow from the sources to the destination (Amazon Redshift).
Comment 1226989 by pypelyncar
- Upvotes: 6
Selected Answer: AD
A. Configure AWS Glue triggers to run the ETL jobs every hour. Reduced code complexity: Glue triggers eliminate the need to write custom code for scheduling ETL jobs, which simplifies the pipeline and reduces maintenance overhead. Scalability and integration: Glue triggers work seamlessly with Glue ETL jobs, ensuring efficient scheduling and execution within the Glue ecosystem.
D. Use AWS Glue connections to establish connectivity between the data sources and Amazon Redshift. Pre-built connectors: Glue connections offer pre-built connectors for various data sources such as RDS and Redshift, which eliminates manual configuration and simplifies data source access within the ETL jobs. Centralized management: Glue connections are managed centrally within the Glue service, streamlining connection management and reducing operational overhead.
Comment 1356172 by saransh_001
- Upvotes: 2
Selected Answer: AD A. AWS Glue provides a built-in mechanism to trigger ETL jobs at scheduled intervals, such as every hour. Using Glue triggers minimizes the need for additional custom code or services, reducing operational overhead. D. AWS Glue connections simplify the process of establishing secure and reliable connections to various data sources (Amazon RDS, MongoDB) and the destination (Amazon Redshift). This approach reduces the need for manually configuring connection settings and makes the ETL pipeline easier to maintain.
Comment 1270594 by San_Juan
- Upvotes: 1
Selected Answer: AC A: because the question says the jobs are built in Glue and must run every hour. C: because you can schedule and run the jobs with Lambda functions every hour.
B: discarded, because the question says the data engineer is using Glue; DataBrew is for cleaning data without code, but it seems the data engineer is writing code to transform the data. D: discarded, because the connections are not directly related to the question, which asks how to run the Glue jobs every hour, so the connections don't seem relevant. E: discarded, because the data sources are RDS and MongoDB, not Redshift, so you cannot use the Redshift Data API to get the data and transform it.
Comment 1264231 by sachin
- Upvotes: 1
AE D is not valid, as it should read "Use AWS Glue connections to establish connectivity between the data sources (including Amazon Redshift) and the Glue job."
Comment 1206008 by DevoteamAnalytix
- Upvotes: 3
Selected Answer: AD I was not sure about A, but in the AWS console ⇒ Glue ⇒ Triggers ⇒ Add Trigger I found the trigger type "Schedule - Fire the trigger on a timer."
Comment 1186975 by lucas_rfsb
- Upvotes: 1
Selected Answer: CD I found this question confusing. In which step would the transformation itself be implemented? I could be wrong, but with Glue triggers we would only run the job, not the transformation logic itself. So I would go with C and D.
Comment 1176597 by milofficial
- Upvotes: 3
Selected Answer: AD Not a clear question - B would kinda make sense - but AD seems to be more correct
Comment 1167963 by GiorgioGss
- Upvotes: 4
Selected Answer: AD A is obvious, and for D see https://docs.aws.amazon.com/glue/latest/dg/console-connections.html
Comment 1137937 by TonyStark0122
- Upvotes: 3
A. Configure AWS Glue triggers to run the ETL jobs every hour. D. Use AWS Glue connections to establish connectivity between the data sources and Amazon Redshift.
Explanation:
Option A: Configuring AWS Glue triggers allows the ETL jobs to be scheduled and run automatically every hour without the need for manual intervention. This reduces operational overhead by automating the data processing pipeline.
Option D: Using AWS Glue connections simplifies connectivity between the data sources (Amazon RDS and MongoDB) and Amazon Redshift. Glue connections abstract away the details of connection configuration, making it easier to manage and maintain the data pipeline.
Comment 1127218 by milofficial
- Upvotes: 2
Selected Answer: AB Lambda triggers for Glue jobs make me dizzy
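To make A and D concrete, here is a minimal boto3 sketch that creates a JDBC connection and an hourly scheduled Glue trigger. The connection, job, endpoint, and credential values are placeholders, not part of the question.

```python
import boto3

glue = boto3.client("glue")

# D: a Glue connection to the Redshift cluster (similar connections would
# be created for the RDS and MongoDB sources).
glue.create_connection(
    ConnectionInput={
        "Name": "redshift-target",              # placeholder
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://example-cluster:5439/dev",
            "USERNAME": "etl_user",
            "PASSWORD": "REPLACE_ME",           # prefer Secrets Manager in practice
        },
    }
)

# A: run the existing ETL job at the top of every hour.
glue.create_trigger(
    Name="hourly-etl-trigger",                  # placeholder
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",
    Actions=[{"JobName": "rds-mongodb-to-redshift"}],  # placeholder job name
    StartOnCreation=True,
)
```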
Question U7V6JGtJgE5QlSKO9IfW
Question
A data engineer needs to build an enterprise data catalog based on the company’s Amazon S3 buckets and Amazon RDS databases. The data catalog must include storage format metadata for the data in the catalog.
Which solution will meet these requirements with the LEAST effort?
Choices
- A: Use an AWS Glue crawler to scan the S3 buckets and RDS databases and build a data catalog. Use data stewards to inspect the data and update the data catalog with the data format.
- B: Use an AWS Glue crawler to build a data catalog. Use AWS Glue crawler classifiers to recognize the format of data and store the format in the catalog.
- C: Use Amazon Macie to build a data catalog and to identify sensitive data elements. Collect the data format information from Macie.
- D: Use scripts to scan data elements and to assign data classifications based on the format of the data.
answer?
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1331258 by imymoco
- Upvotes: 1
https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html
Comment 1317309 by emupsx1
- Upvotes: 1
Selected Answer: B https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html
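For B, a minimal boto3 sketch of a crawler that covers both the S3 buckets and an RDS database through a Glue connection; built-in classifiers detect common storage formats automatically, and a custom grok classifier is added only to show where one would plug in. All names, paths, and the schedule are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Optional custom classifier for a format the built-in classifiers miss.
glue.create_classifier(
    GrokClassifier={
        "Name": "custom-log-classifier",        # placeholder
        "Classification": "app-logs",
        "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{GREEDYDATA:message}",
    }
)

glue.create_crawler(
    Name="enterprise-catalog-crawler",          # placeholder
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="enterprise_catalog",
    Classifiers=["custom-log-classifier"],
    Targets={
        "S3Targets": [{"Path": "s3://example-data-bucket/"}],
        "JdbcTargets": [{"ConnectionName": "rds-connection", "Path": "salesdb/%"}],
    },
    Schedule="cron(0 3 * * ? *)",               # optional daily refresh
)
glue.start_crawler(Name="enterprise-catalog-crawler")
```

The resulting Data Catalog tables record the detected classification and format metadata without any manual inspection, which is what makes B the lowest-effort option.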