Questions and Answers
Question GkmZ6p9vxzHjbFdz4D2Z
Question
A company uses Amazon Athena for one-time queries against data that is in Amazon S3. The company has several use cases. The company must implement permission controls to separate query processes and access to query history among users, teams, and applications that are in the same AWS account. Which solution will meet these requirements?
Choices
- A: Create an S3 bucket for each use case. Create an S3 bucket policy that grants permissions to appropriate individual IAM users. Apply the S3 bucket policy to the S3 bucket.
- B: Create an Athena workgroup for each use case. Apply tags to the workgroup. Create an IAM policy that uses the tags to apply appropriate permissions to the workgroup.
- C: Create an IAM role for each use case. Assign appropriate permissions to the role for each use case. Associate the role with Athena.
- D: Create an AWS Glue Data Catalog resource policy that grants permissions to appropriate individual IAM users for each use case. Apply the resource policy to the specific tables that Athena uses.
Answer: B Answer_ET: B Community answer B (100%) Discussion
Comment 1125636 by milofficial
- Upvotes: 17
Selected Answer: B Haha they copied this from the old DA Specialty. It’s B
https://docs.aws.amazon.com/athena/latest/ug/user-created-workgroups.html
Comment 1137885 by TonyStark0122
- Upvotes: 14
B. Create an Athena workgroup for each use case. Apply tags to the workgroup. Create an IAM policy that uses the tags to apply appropriate permissions to the workgroup.
Explanation: Athena workgroups allow you to isolate and manage different workloads, users, and permissions. By creating a separate workgroup for each use case, you can control access to query history, manage permissions, and enforce resource usage limits independently for each workload. Applying tags to workgroups allows you to categorize and organize them based on the use case, which simplifies policy management.
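For readers who want to see what option B looks like in practice, here is a minimal sketch using boto3; the workgroup name, tag key/value, output location, and policy actions are illustrative assumptions, not part of the question.

```python
import json
import boto3

athena = boto3.client("athena")

# Create one workgroup per use case and tag it (names and tags are assumptions).
athena.create_work_group(
    Name="marketing-adhoc",
    Configuration={
        "ResultConfiguration": {"OutputLocation": "s3://example-athena-results/marketing/"},
        "PublishCloudWatchMetricsEnabled": True,
    },
    Tags=[{"Key": "team", "Value": "marketing"}],
)

# Tag-based IAM policy to attach to the team's role or user: the condition
# limits Athena query actions to workgroups tagged team=marketing.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "athena:StartQueryExecution",
            "athena:GetQueryExecution",
            "athena:GetQueryResults",
            "athena:ListQueryExecutions",
        ],
        "Resource": "*",
        "Condition": {"StringEquals": {"aws:ResourceTag/team": "marketing"}},
    }],
}
print(json.dumps(policy, indent=2))
```

Because query history is scoped to the workgroup, separating use cases into tagged workgroups also keeps each team's history isolated from the others.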
Comment 1409673 by Scotty_Nguyen
- Upvotes: 1
Selected Answer: B B is correct
Comment 1245917 by Manohar24
- Upvotes: 2
Selected Answer: B B is correct.
Comment 1209086 by k350Secops
- Upvotes: 2
Selected Answer: B The only other answer that's confusing is C, but it's not the one. Creating separate IAM roles for each use case and associating them with Athena would not provide the necessary isolation and access control for query processes and query history.
Comment 1182440 by dev_vicente
- Upvotes: 1
Selected Answer: B B is more granular
Question 9RGcb52pOGtimcTYEV9O
Question
A data engineer needs to build an extract, transform, and load (ETL) job. The ETL job will process daily incoming .csv files that users upload to an Amazon S3 bucket. The size of each S3 object is less than 100 MB. Which solution will meet these requirements MOST cost-effectively?
Choices
- A: Write a custom Python application. Host the application on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.
- B: Write a PySpark ETL script. Host the script on an Amazon EMR cluster.
- C: Write an AWS Glue PySpark job. Use Apache Spark to transform the data.
- D: Write an AWS Glue Python shell job. Use pandas to transform the data.
Answer: D Answer_ET: D Community answer D (65%) C (35%) Discussion
Comment 1184529 by halogi
- Upvotes: 10
Selected Answer: C AWS Glue Python Shell Job is billed $0.29 per DPU-Hour for each job with flexible execution and $0.44 per DPU-Hour for each job with standard execution. Source: https://aws.amazon.com/glue/pricing/
Comment 1152392 by atu1789
- Upvotes: 9
Selected Answer: D Option D (write an AWS Glue Python shell job and use pandas to transform the data) is the most cost-effective solution for the described scenario.
AWS Glue’s Python shell jobs are a good fit for smaller-scale ETL tasks, especially when dealing with .csv files that are less than 100 MB each. The use of pandas, a powerful and efficient data manipulation library in Python, makes it an ideal tool for processing and transforming these types of files. This approach avoids the overhead and additional costs associated with more complex solutions like Amazon EKS or EMR, which are generally more suited for larger-scale, more complex data processing tasks.
Given the requirements – processing daily incoming small-sized .csv files – this solution provides the necessary functionality with minimal resources, aligning well with the goal of cost-effectiveness.
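As a rough illustration of option D, the body of a Glue Python shell job could look like the sketch below; the bucket names, prefix, and transformation logic are placeholder assumptions.

```python
from io import BytesIO

import boto3
import pandas as pd

s3 = boto3.client("s3")

SOURCE_BUCKET = "example-incoming-csv"   # assumption: bucket receiving the daily uploads
TARGET_BUCKET = "example-curated-data"   # assumption: output bucket

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder transformation: normalize column names and drop empty rows.
    df.columns = [c.strip().lower() for c in df.columns]
    return df.dropna(how="all")

# Each object is under 100 MB, so pandas can process it comfortably in memory.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix="2023-01-01/"):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=SOURCE_BUCKET, Key=obj["Key"])["Body"].read()
        df = transform(pd.read_csv(BytesIO(body)))
        s3.put_object(
            Bucket=TARGET_BUCKET,
            Key=obj["Key"],
            Body=df.to_csv(index=False).encode("utf-8"),
        )
```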
Comment 1347367 by YUICH
- Upvotes: 2
Selected Answer: D It is important not to compare just the “price per DPU hour,” but to consider the total cost by factoring in overhead for job startup, minimum DPU count, execution time, and data volume. For a relatively lightweight workload—such as processing approximately 100 MB of CSV files on a daily basis—option (D), using an AWS Glue Python shell job, is the most cost-effective choice.
Comment 1248620 by LR2023
- Upvotes: 3
Selected Answer: D going with D https://docs.aws.amazon.com/whitepapers/latest/aws-glue-best-practices-build-performant-data-pipeline/additional-considerations.html
Comment 1228736 by pypelyncar
- Upvotes: 7
Selected Answer: D This is a good candidate to have two valid options, for real; Spark and Python shell have similar approaches. I would go with pandas, although it's close to 50/50 and it could be Spark. I hope not to find this question in the exam.
Comment 1214850 by VerRi
- Upvotes: 3
Selected Answer: C PySpark with Spark (flexible execution): $0.44/hr for 1 DPU. Python shell with pandas: $0.44/hr for 1 DPU.
Comment 1206732 by cloudata
- Upvotes: 6
Selected Answer: D Python Shell is cheaper and can handle small to medium tasks. https://docs.aws.amazon.com/whitepapers/latest/aws-glue-best-practices-build-performant-data-pipeline/additional-considerations.html
Comment 1204707 by chakka90
- Upvotes: 3
D. Even though PySpark is cheap per DPU-hour, you have to use a minimum of 2 DPUs, which would increase the cost anyway, so I feel that D should be correct.
Comment 1203460 by khchan123
- Upvotes: 4
Selected Answer: D D.
While AWS Glue PySpark jobs are scalable and suitable for large workloads, C may be overkill for processing small .csv files (less than 100 MB each). The overhead of using Apache Spark may not be cost-effective for this specific use case.
Comment 1202991 by Leo87656789
- Upvotes: 4
Selected Answer: D Option D:
Even though the Python Shell Job is more expensive on a DPU-Hour basis, you can select the option “1/16 DPU” in the Job details for a Python Shell Job, which is definetly cheaper than a Pyspark job.
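A back-of-the-envelope comparison of the DPU math discussed in this thread, as a sketch; the $0.44/DPU-hour rate, the 1/16 DPU and 2 DPU figures, and the 10-minute runtime are assumptions taken from the comments and may not match current Glue pricing.

```python
# Rough cost sketch (rates, DPU counts, and runtime are assumptions, not authoritative pricing).
RATE_PER_DPU_HOUR = 0.44      # standard Glue rate cited in the discussion
RUNTIME_HOURS = 10 / 60       # assume a 10-minute daily job

python_shell_dpus = 1 / 16    # Python shell jobs can run at 1/16 DPU
spark_min_dpus = 2            # Glue Spark jobs have a higher minimum DPU count

python_shell_cost = python_shell_dpus * RUNTIME_HOURS * RATE_PER_DPU_HOUR
spark_cost = spark_min_dpus * RUNTIME_HOURS * RATE_PER_DPU_HOUR

print(f"Python shell: ~${python_shell_cost:.4f} per run")    # ~$0.0046
print(f"Spark (2 DPU minimum): ~${spark_cost:.4f} per run")  # ~$0.1467
```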
Comment 1188279 by lucas_rfsb
- Upvotes: 6
Selected Answer: C AWS Glue Python Shell Job is billed $0.29 per DPU-Hour for each job with flexible execution and $0.44 per DPU-Hour for each job with standard execution. Source: https://aws.amazon.com/glue/pricing/
Comment 1177379 by GiorgioGss
- Upvotes: 4
Selected Answer: D D is cheaper than C. Not as scalable, but cheaper…
Comment 1138573 by rralucard_
- Upvotes: 5
Selected Answer: C AWS Glue is a fully managed ETL service, which means you don’t need to manage infrastructure, and it automatically scales to handle your data processing needs. This reduces operational overhead and cost.
PySpark, as a part of AWS Glue, is a powerful and widely-used framework for distributed data processing, and it’s well-suited for handling data transformations on a large scale.
Question u5mWVT3watK8Ki7sFWun
Question
A data engineer creates an AWS Glue Data Catalog table by using an AWS Glue crawler that is named Orders. The data engineer wants to add the following new partitions:
s3://transactions/orders/order_date=2023-01-01
s3://transactions/orders/order_date=2023-01-02
The data engineer must edit the metadata to include the new partitions in the table without scanning all the folders and files in the location of the table.
Which data definition language (DDL) statement should the data engineer use in Amazon Athena?
Choices
- A: ALTER TABLE Orders ADD PARTITION(order_date='2023-01-01') LOCATION 's3://transactions/orders/order_date=2023-01-01'; ALTER TABLE Orders ADD PARTITION(order_date='2023-01-02') LOCATION 's3://transactions/orders/order_date=2023-01-02';
- B: MSCK REPAIR TABLE Orders;
- C: REPAIR TABLE Orders;
- D: ALTER TABLE Orders MODIFY PARTITION(order_date='2023-01-01') LOCATION 's3://transactions/orders/2023-01-01'; ALTER TABLE Orders MODIFY PARTITION(order_date='2023-01-02') LOCATION 's3://transactions/orders/2023-01-02';
Answer: A Answer_ET: A Community answer A (100%) Discussion
Comment 1243973 by Ja13
- Upvotes: 5
Selected Answer: A Why the other options are incorrect:
- Option B (MSCK REPAIR TABLE Orders): This command is used to repair the partitions of a table by scanning all the files in the specified location. This is not efficient if you know the specific partitions you want to add, as it will scan the entire table location.
- Option C (REPAIR TABLE Orders): This is not a valid Athena DDL command.
- Option D (ALTER TABLE Orders MODIFY PARTITION): This command is used to modify the location of existing partitions, not to add new ones. It would not work for adding new partitions.
Comment 1230570 by artworkad
- Upvotes: 4
Selected Answer: A A is correct as per https://docs.aws.amazon.com/athena/latest/ug/alter-table-add-partition.html
Comment 1230567 by artworkad
- Upvotes: 1
A is correct as per https://docs.aws.amazon.com/athena/latest/ug/alter-table-add-partition.html
Comment 1230471 by tgv
- Upvotes: 4
Selected Answer: A A is correct because it uses the appropriate DDL statements to add the new partitions directly without scanning all folders and files, meeting the requirements stated in the question. B is incorrect because while it would update the partitions, it would involve scanning all files and folders. C is incorrect because REPAIR TABLE is not a valid command. D is incorrect because it modifies partitions instead of adding new ones.
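For reference, the option A DDL can also be submitted programmatically; below is a minimal boto3 sketch that combines both partitions into a single ALTER TABLE statement (a variation on the two separate statements shown in option A, since Athena accepts one statement per query execution). The database name and query result location are assumptions.

```python
import boto3

athena = boto3.client("athena")

DDL = """
ALTER TABLE Orders ADD
  PARTITION (order_date='2023-01-01') LOCATION 's3://transactions/orders/order_date=2023-01-01'
  PARTITION (order_date='2023-01-02') LOCATION 's3://transactions/orders/order_date=2023-01-02'
"""

# Submit the DDL; Athena registers only the two named partitions in the
# Glue Data Catalog without scanning the rest of the table location.
response = athena.start_query_execution(
    QueryString=DDL,
    QueryExecutionContext={"Database": "transactions_db"},                    # assumption
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},   # assumption
)
print(response["QueryExecutionId"])
```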
Question JN0MELepQLjJs6tvi69T
Question
A company stores 10 to 15 TB of uncompressed .csv files in Amazon S3. The company is evaluating Amazon Athena as a one-time query engine.
The company wants to transform the data to optimize query runtime and storage costs.
Which file format and compression solution will meet these requirements for Athena queries?
Choices
- A: .csv format compressed with zip
- B: JSON format compressed with bzip2
- C: Apache Parquet format compressed with Snappy
- D: Apache Avro format compressed with LZO
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1230532 by tgv
- Upvotes: 6
Selected Answer: C Parquet provides efficient columnar storage, enabling Athena to read only the necessary data for queries, which reduces scan times and speeds up query performance. Snappy compression offers a good balance between compression speed and efficiency, reducing storage costs without significantly impacting query times.
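A minimal sketch of the CSV-to-Parquet-with-Snappy conversion using pandas/pyarrow; the S3 paths are placeholders, and at 10 to 15 TB the conversion would in practice be done with a distributed engine (Glue, EMR, or an Athena CTAS statement) rather than a single pandas process.

```python
import pandas as pd

# Read one of the uncompressed .csv objects (path is a placeholder; requires s3fs).
df = pd.read_csv("s3://example-bucket/raw/orders.csv")

# Write columnar Parquet with Snappy compression; Athena then scans only the
# columns a query references, reducing both scan cost and storage.
df.to_parquet(
    "s3://example-bucket/curated/orders.parquet",  # placeholder output path
    engine="pyarrow",
    compression="snappy",
    index=False,
)
```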
Comment 1230569 by artworkad
- Upvotes: 3
Selected Answer: C Parquet + Snappy
Question gDxEtvdwCnNwQ96GwkvW
Question
A company uses Apache Airflow to orchestrate the company’s current on-premises data pipelines. The company runs SQL data quality check tasks as part of the pipelines. The company wants to migrate the pipelines to AWS and to use AWS managed services.
Which solution will meet these requirements with the LEAST amount of refactoring?
Choices
- A: Set up AWS Outposts in the AWS Region that is nearest to the location where the company uses Airflow. Migrate the servers into Outposts hosted Amazon EC2 instances. Update the pipelines to interact with the Outposts hosted EC2 instances instead of the on-premises pipelines.
- B: Create a custom Amazon Machine Image (AMI) that contains the Airflow application and the code that the company needs to migrate. Use the custom AMI to deploy Amazon EC2 instances. Update the network connections to interact with the newly deployed EC2 instances.
- C: Migrate the existing Airflow orchestration configuration into Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Create the data quality checks during the ingestion to validate the data quality by using SQL tasks in Airflow.
- D: Convert the pipelines to AWS Step Functions workflows. Recreate the data quality checks in SQL as Python based AWS Lambda functions.
Answer: C Answer_ET: C Community answer C (100%) Discussion
Comment 1240682 by bakarys
- Upvotes: 3
Selected Answer: C The solution that will meet these requirements with the least amount of refactoring is Option C: Migrate the existing Airflow orchestration configuration into Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Create the data quality checks during the ingestion to validate the data quality by using SQL tasks in Airflow.
Amazon Managed Workflows for Apache Airflow (MWAA) is a fully managed service that makes it easy to run open-source versions of Apache Airflow on AWS. It allows you to build workflows to design and visualize pipelines, automate complex tasks, and monitor executions. Since the company is already using Apache Airflow for orchestration, migrating to Amazon MWAA would require minimal refactoring.
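To illustrate why option C needs so little refactoring, the existing SQL data quality tasks can remain ordinary Airflow SQL check operators once the DAGs run on MWAA; below is a minimal DAG sketch. The connection ID, table, check query, and exact operator import path are assumptions and depend on the Airflow and provider versions installed in the MWAA environment.

```python
from datetime import datetime

from airflow import DAG
# Import path assumes the common SQL provider package is available in the environment.
from airflow.providers.common.sql.operators.sql import SQLCheckOperator

with DAG(
    dag_id="orders_quality_checks",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Fails the task if the first row of the query result contains a falsy value,
    # i.e. if any orders arrived with a NULL order_date.
    null_order_date_check = SQLCheckOperator(
        task_id="check_no_null_order_dates",
        conn_id="analytics_db",  # assumption: existing database connection in Airflow
        sql="SELECT COUNT(*) = 0 FROM orders WHERE order_date IS NULL",
    )
```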
Comment 1233780 by HunkyBunky
- Upvotes: 3
Selected Answer: C Amazon MWAA, because we already use Apache Airflow
Comment 1230796 by tgv
- Upvotes: 3
Selected Answer: C Amazon MWAA is a managed service for running Apache Airflow. It allows migrating existing Airflow configurations with minimal changes. Data quality checks can continue to be implemented as SQL tasks in Airflow, similar to the current setup.