Questions and Answers
Question jw5igyacjyPIFGDEjzXe
Question
A company wants to use machine learning (ML) to perform analytics on data that is in an Amazon S3 data lake. The company has two data transformation requirements that will give consumers within the company the ability to create reports.
The company must perform daily transformations on 300 GB of data that is in a variety of formats and arrives in Amazon S3 at a scheduled time. The company must also perform one-time transformations of terabytes of archived data that is in the S3 data lake. The company uses Amazon Managed Workflows for Apache Airflow (Amazon MWAA) Directed Acyclic Graphs (DAGs) to orchestrate processing.
Which combination of tasks should the company schedule in the Amazon MWAA DAGs to meet these requirements MOST cost-effectively? (Choose two.)
Choices
- A: For daily incoming data, use AWS Glue crawlers to scan and identify the schema.
- B: For daily incoming data, use Amazon Athena to scan and identify the schema.
- C: For daily incoming data, use Amazon Redshift to perform transformations.
- D: For daily and archived data, use Amazon EMR to perform data transformations.
- E: For archived data, use Amazon SageMaker to perform data transformations.
Answer: AD
Answer_ET: AD
Community answer: AD (100%)
Discussion
Comment 1241451 by Ja13
- Upvotes: 5
A. For daily incoming data, use AWS Glue crawlers to scan and identify the schema.
D. For daily and archived data, use Amazon EMR to perform data transformations.
Here's why:
A: AWS Glue crawlers are well suited for scanning and identifying the schema of data in S3. They are cost-effective and efficient for daily incoming data.
D: Amazon EMR is a cost-effective solution for performing large-scale data transformations. It can handle both the daily transformations of 300 GB of data and the one-time transformations of terabytes of archived data efficiently.
Comment 1254086 by andrologin
- Upvotes: 2
Selected Answer: AD Glue crawlers for identifying the schema, EMR to run batch processing on the data
Comment 1241804 by HunkyBunky
- Upvotes: 1
A / D - Looks good to me
Comment 1241454 by Ja13
- Upvotes: 2
Selected Answer: AD According to ChatGPT
Comment 1230846 by tgv
- Upvotes: 3
Selected Answer: AD A. For daily incoming data, use AWS Glue crawlers to scan and identify the schema. This is cost-effective and simplifies the process of managing metadata.
D. For daily and archived data, use Amazon EMR to perform data transformations. EMR is suitable for both large-scale and regular transformations, offering flexibility and cost efficiency.
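For reference, a minimal sketch of how choices A and D could be wired together in an MWAA DAG, assuming the Amazon provider package for Airflow 2.x; the crawler name, IAM role, S3 paths, schedule, cluster ID, and Spark script below are placeholders, not details from the question:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue_crawler import GlueCrawlerOperator
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

# Spark step submitted to an existing EMR cluster (placeholder script path)
SPARK_STEP = [
    {
        "Name": "daily-transform",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/scripts/daily_transform.py"],
        },
    }
]

with DAG(
    dag_id="daily_s3_transform",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",  # run shortly after the scheduled S3 arrival
    catchup=False,
) as dag:
    # Choice A: crawl the newly arrived files to register/update the schema
    crawl = GlueCrawlerOperator(
        task_id="crawl_daily_data",
        config={
            "Name": "daily_orders_crawler",  # placeholder crawler name
            "Role": "GlueCrawlerRole",  # placeholder IAM role
            "DatabaseName": "data_lake",
            "Targets": {"S3Targets": [{"Path": "s3://example-bucket/daily/"}]},
        },
    )

    # Choice D: submit the Spark transformation as a step on an EMR cluster
    transform = EmrAddStepsOperator(
        task_id="emr_transform",
        job_flow_id="j-XXXXXXXXXXXXX",  # placeholder EMR cluster ID
        steps=SPARK_STEP,
    )

    crawl >> transform
```

The one-time transformation of archived data could reuse the same EMR step pattern in a DAG that is triggered manually rather than on a schedule.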
Question POeLfaoMm1IkmxZNH7ll
Question
A retail company uses AWS Glue for extract, transform, and load (ETL) operations on a dataset that contains information about customer orders. The company wants to implement specific validation rules to ensure data accuracy and consistency.
Which solution will meet these requirements?
Choices
- A: Use AWS Glue job bookmarks to track the data for accuracy and consistency.
- B: Create custom AWS Glue Data Quality rulesets to define specific data quality checks.
- C: Use the built-in AWS Glue Data Quality transforms for standard data quality validations.
- D: Use AWS Glue Data Catalog to maintain a centralized data schema and metadata repository.
Answer: B
Answer_ET: B
Community answer: B (100%)
Discussion
Comment 1241805 by HunkyBunky
- Upvotes: 2
Selected Answer: B Only B - makes sense
Comment 1241457 by Ja13
- Upvotes: 4
Selected Answer: B B. Create custom AWS Glue Data Quality rulesets to define specific data quality checks.
Custom AWS Glue Data Quality rulesets allow you to define precise data quality checks tailored to your specific needs, ensuring that the data meets the required standards of accuracy and consistency. This approach provides flexibility to implement a wide range of validation rules based on your business requirements.
Comment 1230850 by tgv
- Upvotes: 3
Selected Answer: B This option provides the necessary flexibility to define and implement custom validation rules tailored to the company’s specific requirements for data accuracy and consistency.
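For reference, a minimal sketch of choice B: a custom AWS Glue Data Quality ruleset written in DQDL and registered through the boto3 Glue client. The database, table, rules, and IAM role are illustrative assumptions, not details from the question:

```python
import boto3

glue = boto3.client("glue")

# Custom ruleset written in DQDL (Data Quality Definition Language);
# the checks below are example validation rules for an orders table
ruleset = """
Rules = [
    IsComplete "order_id",
    IsUnique "order_id",
    ColumnValues "order_status" in ["PENDING", "SHIPPED", "DELIVERED", "CANCELLED"],
    ColumnValues "order_total" >= 0,
    Completeness "customer_id" > 0.95
]
"""

# Register the ruleset against a Data Catalog table (placeholder names)
glue.create_data_quality_ruleset(
    Name="orders-validation-ruleset",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "retail", "TableName": "orders"},
)

# Evaluate the ruleset; results can be used to gate downstream ETL steps
run = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={"GlueTable": {"DatabaseName": "retail", "TableName": "orders"}},
    Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",  # placeholder role
    RulesetNames=["orders-validation-ruleset"],
)
print(run["RunId"])
```

The same DQDL rules can also be attached to a Glue ETL job through the Evaluate Data Quality transform in Glue Studio, which is the approach implied by the answer when validation must run inline with the ETL pipeline.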