Questions and Answers
Question pZCjbhNLVZwieOp04fol
Question
A company is migrating its database servers from Amazon EC2 instances that run Microsoft SQL Server to Amazon RDS for Microsoft SQL Server DB instances. The company’s analytics team must export large data elements every day until the migration is complete. The data elements are the result of SQL joins across multiple tables. The data must be in Apache Parquet format. The analytics team must store the data in Amazon S3. Which solution will meet these requirements in the MOST operationally efficient way?
Choices
- A: Create a view in the EC2 instance-based SQL Server databases that contains the required data elements. Create an AWS Glue job that selects the data directly from the view and transfers the data in Parquet format to an S3 bucket. Schedule the AWS Glue job to run every day.
- B: Schedule SQL Server Agent to run a daily SQL query that selects the desired data elements from the EC2 instance-based SQL Server databases. Configure the query to direct the output .csv objects to an S3 bucket. Create an S3 event that invokes an AWS Lambda function to transform the output format from .csv to Parquet.
- C: Use a SQL query to create a view in the EC2 instance-based SQL Server databases that contains the required data elements. Create and run an AWS Glue crawler to read the view. Create an AWS Glue job that retrieves the data and transfers the data in Parquet format to an S3 bucket. Schedule the AWS Glue job to run every day.
- D: Create an AWS Lambda function that queries the EC2 instance-based databases by using Java Database Connectivity (JDBC). Configure the Lambda function to retrieve the required data, transform the data into Parquet format, and transfer the data into an S3 bucket. Use Amazon EventBridge to schedule the Lambda function to run every day.
Answer
Answer: A
Answer_ET: A
Community answer: A (53%), C (47%)
Discussion
Comment 1176923 by taka5094
- Upvotes: 7
Selected Answer: C Choice A is almost the same approach, but it doesn’t use the AWS Glue crawler, so you have to manage the view’s metadata manually.
Comment 1194630 by Christina666
- Upvotes: 7
Selected Answer: C
- Leveraging SQL views: Creating a view on the source database simplifies the data extraction process and keeps your SQL logic centralized.
- Glue crawler efficiency: Using a Glue crawler to automatically discover and catalog the view’s metadata reduces manual setup.
- Glue job for ETL: A dedicated Glue job is well-suited for the data transformation (to Parquet) and loading into S3. Glue jobs offer built-in scheduling capabilities.
- Operational efficiency: This approach minimizes custom code and leverages native AWS services for data movement and cataloging.
Comment 1409578 by Eltanany
- Upvotes: 1
Selected Answer: A I’ll go with A
Comment 1356063 by Certified101
- Upvotes: 1
Selected Answer: A A is correct - no need for crawler
Comment 1339534 by plutonash
- Upvotes: 3
Selected Answer: A The crawler is not necessary; using a Glue job to read the data from SQL Server and transfer it to S3 in Apache Parquet format is enough.
Comment 1321942 by mtrianac
- Upvotes: 3
Selected Answer: A No, in this case, using an AWS Glue Crawler is not necessary. The schema is already defined in the SQL Server database, as the created view contains the required structure (columns and data types). AWS Glue can directly connect to the database via JDBC, extract the data, transform it into Parquet format, and store it in S3 without additional steps.
A crawler is useful if you’re working with data that doesn’t have a predefined schema (e.g., files in S3) or if you need the data to be cataloged for services like Amazon Athena. However, for this ETL flow, using just a Glue Job simplifies the process and reduces operational complexity.
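For reference, a minimal sketch of the kind of Glue job option A describes: read the pre-joined view over JDBC and write Parquet to S3. The JDBC URL, view name, credentials, and bucket are hypothetical placeholders; in practice the connection details would come from a Glue connection or AWS Secrets Manager rather than being hard-coded.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the pre-joined view straight from SQL Server over JDBC;
# the view already defines the schema, so no crawler is needed.
view_df = (
    glue_context.spark_session.read.format("jdbc")
    .options(
        url="jdbc:sqlserver://ec2-host:1433;databaseName=analytics",  # hypothetical endpoint
        dbtable="dbo.daily_export_view",                              # hypothetical view
        driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
        user="glue_reader",        # hypothetical credentials; use Secrets Manager in practice
        password="placeholder",
    )
    .load()
)

# Write the result to the analytics bucket in Parquet format.
view_df.write.mode("overwrite").parquet("s3://example-analytics-bucket/daily-export/")
job.commit()
```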
Comment 1307313 by michele_scar
- Upvotes: 1
Selected Answer: A The Glue crawler is unnecessary because the schema is already defined in the SQL database.
Comment 1306640 by leonardoFelipe
- Upvotes: 3
Selected Answer: A Usually, views aren’t true objects in a DBMS; they’re just a “nickname” for a specific query string, unlike materialized views. So my question is: can a Glue crawler understand their metadata? I’d go with A.
Comment 1240163 by bakarys
- Upvotes: 3
Selected Answer: A Option A involves creating a view in the EC2 instance-based SQL Server databases that contains the required data elements. An AWS Glue job is then created to select the data directly from the view and transfer the data in Parquet format to an S3 bucket. This job is scheduled to run every day. This approach is operationally efficient as it leverages managed services (AWS Glue) and does not require additional transformation steps.
Option D involves creating an AWS Lambda function that queries the EC2 instance-based databases using JDBC. The Lambda function is configured to retrieve the required data, transform the data into Parquet format, and transfer the data into an S3 bucket. This approach could work, but managing and scheduling Lambda functions could add operational overhead compared to using managed services like AWS Glue.
Comment 1177068 by GiorgioGss
- Upvotes: 2
Selected Answer: C Just because it decouples the whole architecture, I will go with C.
Comment 1163915 by Felix_G
- Upvotes: 1
Option C seems to be the most operationally efficient: It leverages Glue for both schema discovery (via the crawler) and data transfer (via the Glue job). The Glue job can directly handle the Parquet format conversion. Scheduling the Glue job ensures regular data export without manual intervention.
Comment 1140135 by rralucard_
- Upvotes: 3
Selected Answer: A Option A (Creating a view in the EC2 instance-based SQL Server databases and creating an AWS Glue job that selects data from the view, transfers it in Parquet format to S3, and schedules the job to run every day) seems to be the most operationally efficient solution. It leverages AWS Glue’s ETL capabilities for direct data extraction and transformation, minimizes manual steps, and effectively automates the process.
Comment 1139395 by evntdrvn76
- Upvotes: 2
A. Create a view in the EC2 instance-based SQL Server databases that contains the required data elements. Create an AWS Glue job that selects the data directly from the view and transfers the data in Parquet format to an S3 bucket. Schedule the AWS Glue job to run every day. This solution is operationally efficient for exporting data in the required format.
Question 2URAied5iX7GSbkTIUW8
Question
A financial company wants to implement a data mesh. The data mesh must support centralized data governance, data analysis, and data access control. The company has decided to use AWS Glue for data catalogs and extract, transform, and load (ETL) operations. Which combination of AWS services will implement a data mesh? (Choose two.)
Choices
- A: Use Amazon Aurora for data storage. Use an Amazon Redshift provisioned cluster for data analysis.
- B: Use Amazon S3 for data storage. Use Amazon Athena for data analysis.
- C: Use AWS Glue DataBrew for centralized data governance and access control.
- D: Use Amazon RDS for data storage. Use Amazon EMR for data analysis.
- E: Use AWS Lake Formation for centralized data governance and access control.
Answer
Answer: BE
Answer_ET: BE
Community answer: BE (100%)
Discussion
Comment 1176328 by hsnin
- Upvotes: 7
Selected Answer: BE The answer is B and E. The data mesh implementation uses Amazon S3 and Athena for data storage and analysis, and AWS Lake Formation for centralized data governance and access control. When combined with AWS Glue, you can efficiently manage your data.
Comment 1558779 by ninomfr64
- Upvotes: 1
Selected Answer: BE S3 (storage), Lake Formation (governance), and Athena (analytics) can be used to implement a data mesh. In real life you would use DataZone or, nowadays, SageMaker Unified Studio.
Comment 1288854 by Shubham1989
- Upvotes: 1
Selected Answer: BE S3 is the best storage for a data lake, and AWS Lake Formation is best for management.
Comment 1223304 by nanaw770
- Upvotes: 2
Selected Answer: BE B and E are the correct answers.
Comment 1170198 by Josa2
- Upvotes: 3
Selected Answer: BE Sometimes I think ExamTopics uses us to calibrate the right answers. Given the goal statement and the service outlines and objectives, there is no way the answer could be anything other than B and E.
Comment 1226760 by pypelyncar
- Upvotes: 1
Selected Answer: BE
B (Amazon S3 + Athena): Amazon S3 is a highly scalable and cost-effective object storage service, perfect for storing the large datasets commonly found in financial institutions. S3 acts as the central data lake where data from different domains can reside in raw or processed form. Athena provides a serverless interactive query service that lets data analysts query data stored in S3 directly with standard SQL, which simplifies data exploration and analysis without managing servers.
E (AWS Lake Formation): Lake Formation helps establish data ownership, access control, and lineage for data products within the data mesh, ensuring data quality, security, and compliance with regulations. It lets you define granular access policies for each data domain so that only authorized users can access specific datasets, which aligns with the need for centralized control in a data mesh.
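As an illustration of the analysis side, here is a minimal sketch of how an analyst in this kind of data mesh might query the S3 data lake through Athena with boto3. The database, table, and output bucket are hypothetical placeholders, and Lake Formation permissions on the Glue Data Catalog table govern what the query is actually allowed to see.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit a standard SQL query against data stored in S3; Lake Formation
# fine-grained permissions are enforced on the catalog table behind it.
response = athena.start_query_execution(
    QueryString="SELECT account_id, balance FROM transactions LIMIT 10",  # hypothetical table
    QueryExecutionContext={"Database": "finance_domain"},                  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # hypothetical bucket
)
print(response["QueryExecutionId"])  # poll get_query_execution / get_query_results for output
```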
Comment 1254180 by Fredrik1
- Upvotes: 1
Must be B and E
Comment 1182092 by minhtien1707
- Upvotes: 1
Selected Answer: BE I think so.
Comment 1163084 by alexua
- Upvotes: 2
B and E. C is not correct: “AWS Glue DataBrew is a visual data preparation tool that makes it easier for data analysts and data scientists to clean and normalize data to prepare it for analytics and machine learning (ML).”
Comment 1158033 by Alcee
- Upvotes: 1
B and E
Comment 1137870 by TonyStark0122
- Upvotes: 2
BE Given the requirements for implementing a data mesh architecture with centralized data governance, data analysis, and data access control, the two better choices from the options provided would be:
B. Use Amazon S3 for data storage. Use Amazon Athena for data analysis.
E. Use AWS Lake Formation for centralized data governance and access control.
Comment 1125616 by milofficial
- Upvotes: 4
Selected Answer: BE Textbook question, the keyword data mesh means S3, the keyword data governance means LakeFormation
Question PO08YlrEEUelQ5bVBUlg
Question
A data engineering team is using an Amazon Redshift data warehouse for operational reporting. The team wants to prevent performance issues that might result from long-running queries. A data engineer must choose a system table in Amazon Redshift to record anomalies when a query optimizer identifies conditions that might indicate performance issues. Which table views should the data engineer use to meet this requirement?
Choices
- A: STL_USAGE_CONTROL
- B: STL_ALERT_EVENT_LOG
- C: STL_QUERY_METRICS
- D: STL_PLAN_INFO
Answer
Answer: B
Answer_ET: B
Community answer: B (100%)
Discussion
Comment 1163916 by Felix_G
- Upvotes: 5
B STL_ALERT_EVENT_LOG records any alerts/notifications related to queries or user-defined performance thresholds. This would capture optimizer alerts about potential performance issues.
STL_PLAN_INFO provides detailed info on execution plans. The optimizer statistics and warnings provide insight into problematic query plans.
STL_USAGE_CONTROL limits user activity but does not log anomalies.
STL_QUERY_METRICS has execution stats but no plan diagnostics.
By enabling alerts and checking STL_ALERT_EVENT_LOG and STL_PLAN_INFO, the data engineer can best detect and troubleshoot queries flagged by the optimizer as problematic before they impair performance. This meets the requirement to catch potentially long-running queries.
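For context, a minimal sketch of pulling recent optimizer alerts from STL_ALERT_EVENT_LOG with the Redshift Data API. The cluster identifier, database, and user are hypothetical placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

# Recent optimizer alerts: the event column describes the detected condition,
# the solution column suggests a recommended fix.
sql = """
    SELECT query, event, solution, event_time
    FROM stl_alert_event_log
    WHERE event_time > DATEADD(day, -1, GETDATE())
    ORDER BY event_time DESC;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="reporting-cluster",  # hypothetical cluster
    Database="analytics",                   # hypothetical database
    DbUser="data_engineer",                 # hypothetical user
    Sql=sql,
)
print(response["Id"])  # statement ID; fetch rows later with get_statement_result
```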
Comment 1177070 by GiorgioGss
- Upvotes: 5
Selected Answer: B https://docs.aws.amazon.com/redshift/latest/dg/r_STL_ALERT_EVENT_LOG.html
Comment 1329054 by HagarTheHorrible
- Upvotes: 1
Selected Answer: B STL_USAGE_CONTROL relates to usage limits and controls; it doesn’t focus on performance issues or anomalies related to query optimization.
Comment 1227567 by pypelyncar
- Upvotes: 1
Selected Answer: B this table records alerts that are generated by the Amazon Redshift system when it detects certain conditions that might indicate performance issues. These alerts are triggered by the query optimizer when it detects suboptimal query plans or other issues that could affect performance.
Comment 1138084 by rralucard_
- Upvotes: 1
Selected Answer: B https://docs.aws.amazon.com/redshift/latest/dg/cm_chap_system-tables.html Use the STL_ALERT_EVENT_LOG table view to meet this requirement. This system table in Amazon Redshift is designed to record anomalies when the query optimizer identifies conditions that might indicate performance issues.
Question paRzd9EgGbiHsKTNAVkC
Question
A data engineer must ingest a source of structured data that is in .csv format into an Amazon S3 data lake. The .csv files contain 15 columns. Data analysts need to run Amazon Athena queries on one or two columns of the dataset. The data analysts rarely query the entire file. Which solution will meet these requirements MOST cost-effectively?
Choices
- A: Use an AWS Glue PySpark job to ingest the source data into the data lake in .csv format.
- B: Create an AWS Glue extract, transform, and load (ETL) job to read from the .csv structured data source. Configure the job to ingest the data into the data lake in JSON format.
- C: Use an AWS Glue PySpark job to ingest the source data into the data lake in Apache Avro format.
- D: Create an AWS Glue extract, transform, and load (ETL) job to read from the .csv structured data source. Configure the job to write the data into the data lake in Apache Parquet format.
Answer
Answer: D
Answer_ET: D
Community answer: D (100%)
Discussion
Comment 1311027 by imymoco
- Upvotes: 1
Why not B? I think Athena is also able to handle JSON.
Comment 1227568 by pypelyncar
- Upvotes: 2
Selected Answer: D Athena is optimized for querying data stored in Parquet format. It can efficiently scan only the necessary columns for a specific query, reducing the amount of data processed. This translates to faster query execution times and lower query costs for data analysts who primarily focus on one or two columns
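To make option D concrete, here is a minimal sketch of a Glue PySpark ETL job that reads the .csv source and writes it to the data lake in Parquet. The S3 paths are hypothetical placeholders.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the structured .csv source (hypothetical S3 path).
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-raw-bucket/source-csv/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Write columnar Parquet so Athena scans only the one or two columns queried.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://example-data-lake/source-parquet/"},
    format="parquet",
)
job.commit()
```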
Comment 1218989 by FunkyFresco
- Upvotes: 2
Selected Answer: D Most cost-effective: they are going to use only one or two columns, so a columnar format fits.
Comment 1177071 by GiorgioGss
- Upvotes: 3
Selected Answer: D MOST cost-effectively = parquet
Comment 1134424 by atu1789
- Upvotes: 2
Selected Answer: D Glue + Parquet for cost effectiveness
Question GqVcYZ9SIxMhYSMcyP1v
Question
A company has five offices in different AWS Regions. Each office has its own human resources (HR) department that uses a unique IAM role. The company stores employee records in a data lake that is based on Amazon S3 storage. A data engineering team needs to limit access to the records. Each HR department should be able to access records for only employees who are within the HR department’s Region. Which combination of steps should the data engineering team take to meet this requirement with the LEAST operational overhead? (Choose two.)
Choices
- A: Use data filters for each Region to register the S3 paths as data locations.
- B: Register the S3 path as an AWS Lake Formation location.
- C: Modify the IAM roles of the HR departments to add a data filter for each department’s Region.
- D: Enable fine-grained access control in AWS Lake Formation. Add a data filter for each Region.
- E: Create a separate S3 bucket for each Region. Configure an IAM policy to allow S3 access. Restrict access based on Region.
Answer
Answer: BD
Answer_ET: BD
Community answer: BD (100%)
Discussion
Comment 1138090 by rralucard_
- Upvotes: 5
Selected Answer: BD https://docs.aws.amazon.com/lake-formation/latest/dg/data-filters-about.html https://docs.aws.amazon.com/lake-formation/latest/dg/access-control-fine-grained.html
Comment 1311017 by ctndba
- Upvotes: 1
B: It is the initial step for leveraging the fine-grained control of Lake Formation. D: It gives granular-level control and meets the requirement.
Comment 1227572 by pypelyncar
- Upvotes: 3
Selected Answer: BD Registering the S3 path as an AWS Lake Formation location is the first step in leveraging Lake Formation’s data governance and access control capabilities. This allows the data engineering team to centrally manage and govern the data stored in the S3 data lake. Enabling fine-grained access control in AWS Lake Formation and adding a data filter for each Region is the key step to achieve the desired access control. Data filters in Lake Formation allow you to define row-level and column-level access policies based on specific conditions or attributes, such as the Region in this case
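A minimal sketch of steps B and D with boto3: register the S3 location with Lake Formation, then add a row-level data filter for one Region. The bucket ARN, account ID, database, table, and region column are hypothetical placeholders.

```python
import boto3

lakeformation = boto3.client("lakeformation", region_name="us-east-1")

# Step B: register the data lake's S3 path with Lake Formation.
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::example-employee-data-lake",  # hypothetical bucket
    UseServiceLinkedRole=True,
)

# Step D: a row-level data filter that exposes only eu-west-1 employee records.
lakeformation.create_data_cells_filter(
    TableData={
        "TableCatalogId": "111122223333",       # hypothetical account ID
        "DatabaseName": "hr",                   # hypothetical database
        "TableName": "employee_records",        # hypothetical table
        "Name": "eu_west_1_hr_filter",
        "RowFilter": {"FilterExpression": "office_region = 'eu-west-1'"},  # hypothetical column
        "ColumnWildcard": {"ExcludedColumnNames": []},  # all columns remain visible
    }
)
# The filter would then be granted to the eu-west-1 HR department's IAM role via
# lakeformation.grant_permissions, and the same pattern repeated for the other four Regions.
```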
Comment 1137344 by rralucard_
- Upvotes: 1
If your table stored in an AWS Glue Data Catalog has tens and hundreds of thousands and millions of partitions, you can enable partition indexes on the table. With partition indexes, only the metadata for the partition value in the query’s filter is retrieved from the catalog instead of retrieving all the partitions’ metadata. The result is faster queries for such highly partitioned tables. The following table compares query runtimes between a partitioned table with no partition indexing and with partition indexing. The table contains approximately 100,000 partitions and uncompressed text data. The orders table is partitioned by the o_custkey column.
Comment 1134423 by atu1789
- Upvotes: 1
Selected Answer: BD BD makes sense