Questions and Answers
Question BjhseyipXH7EkMgyx54t
Question
A company needs to set up a data catalog and metadata management for data sources that run in the AWS Cloud. The company will use the data catalog to maintain the metadata of all the objects that are in a set of data stores. The data stores include structured sources such as Amazon RDS and Amazon Redshift. The data stores also include semistructured sources such as JSON files and .xml files that are stored in Amazon S3. The company needs a solution that will update the data catalog on a regular basis. The solution also must detect changes to the source metadata. Which solution will meet these requirements with the LEAST operational overhead?
Choices
- A: Use Amazon Aurora as the data catalog. Create AWS Lambda functions that will connect to the data catalog. Configure the Lambda functions to gather the metadata information from multiple sources and to update the Aurora data catalog. Schedule the Lambda functions to run periodically.
- B: Use the AWS Glue Data Catalog as the central metadata repository. Use AWS Glue crawlers to connect to multiple data stores and to update the Data Catalog with metadata changes. Schedule the crawlers to run periodically to update the metadata catalog.
- C: Use Amazon DynamoDB as the data catalog. Create AWS Lambda functions that will connect to the data catalog. Configure the Lambda functions to gather the metadata information from multiple sources and to update the DynamoDB data catalog. Schedule the Lambda functions to run periodically.
- D: Use the AWS Glue Data Catalog as the central metadata repository. Extract the schema for Amazon RDS and Amazon Redshift sources, and build the Data Catalog. Use AWS Glue crawlers for data that is in Amazon S3 to infer the schema and to automatically update the Data Catalog.
Answer: B (community vote: B 94%)
Discussion
Comment 1227602 by pypelyncar
- Upvotes: 7
Selected Answer: B The AWS Glue Data Catalog is a purpose-built, fully managed service designed to serve as a central metadata repository for your data sources. It provides a unified view of your data across various sources, including structured databases (like Amazon RDS and Amazon Redshift) and semi-structured data formats (like JSON and XML files in Amazon S3).
Comment 1217547 by valuedate
- Upvotes: 3
Selected Answer: B Glue Data Catalog with crawlers.
Comment 1210757 by hnk
- Upvotes: 1
Selected Answer: A B is the obvious answer
Comment 1177162 by GiorgioGss
- Upvotes: 4
Selected Answer: B A and C are out for obvious reasons; D is out because it involves manual schema extraction.
Comment 1138398 by rralucard_
- Upvotes: 3
Selected Answer: B Option B, using the AWS Glue Data Catalog with AWS Glue Crawlers, is the best solution to meet the requirements with the least operational overhead. It provides a fully managed, integrated solution for cataloging both structured and semistructured data across various AWS data stores without the need for extensive manual configuration or custom coding.
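For concreteness, here is a minimal boto3 sketch of the crawler-based approach the discussion settles on: a scheduled Glue crawler that catalogs an S3 prefix and a JDBC source (for example an RDS database) and writes detected schema changes back to the Data Catalog. The crawler name, role ARN, database, connection name, and paths are illustrative placeholders, not values from the question.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler that catalogs semistructured S3 data and a JDBC source,
# runs every 6 hours, and updates the catalog when source metadata changes.
glue.create_crawler(
    Name="central-catalog-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="central_catalog",
    Targets={
        "S3Targets": [{"Path": "s3://example-bucket/raw/"}],
        "JdbcTargets": [{"ConnectionName": "rds-connection", "Path": "appdb/%"}],
    },
    Schedule="cron(0 0/6 * * ? *)",  # periodic re-crawl
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",      # propagate schema changes
        "DeleteBehavior": "DEPRECATE_IN_DATABASE",   # flag removed objects
    },
)
```

A Redshift source would be attached the same way, through a second JDBC connection, so the entire catalog stays current without custom Lambda code.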
Question zGwxDru3gJkGE71xfScM
Question
A company stores data from an application in an Amazon DynamoDB table that operates in provisioned capacity mode. The workloads of the application have predictable throughput load on a regular schedule. Every Monday, there is an immediate increase in activity early in the morning. The application has very low usage during weekends. The company must ensure that the application performs consistently during peak usage times. Which solution will meet these requirements in the MOST cost-effective way?
Choices
- A: Increase the provisioned capacity to the maximum capacity that is currently present during peak load times.
- B: Divide the table into two tables. Provision each table with half of the provisioned capacity of the original table. Spread queries evenly across both tables.
- C: Use AWS Application Auto Scaling to schedule higher provisioned capacity for peak usage times. Schedule lower capacity during off-peak times.
- D: Change the capacity mode from provisioned to on-demand. Configure the table to scale up and scale down based on the load on the table.
Answer: C (community vote: C 89%)
Discussion
Comment 1138401 by rralucard_
- Upvotes: 5
Selected Answer: C Option C, using AWS Application Auto Scaling to schedule higher provisioned capacity for peak usage times and lower capacity during off-peak times, is the most cost-effective solution for the described scenario. It allows the company to align their DynamoDB capacity costs with actual usage patterns, scaling up only when needed and scaling down during low-usage periods.
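As a rough illustration of the scheduled-scaling approach (the table name, capacity numbers, and cron times are assumptions, not values from the question), Application Auto Scaling can register the table's capacity as a scalable target and attach scheduled actions for the Monday peak and the weekend trough:

```python
import boto3

aas = boto3.client("application-autoscaling")

# Register the table's read capacity as a scalable target.
aas.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/app-table",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    MinCapacity=100,
    MaxCapacity=4000,
)

# Raise the floor before the Monday-morning spike.
aas.put_scheduled_action(
    ServiceNamespace="dynamodb",
    ScheduledActionName="monday-morning-scale-up",
    ResourceId="table/app-table",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    Schedule="cron(0 5 ? * MON *)",
    ScalableTargetAction={"MinCapacity": 2000, "MaxCapacity": 4000},
)

# Drop capacity for the low-traffic weekend.
aas.put_scheduled_action(
    ServiceNamespace="dynamodb",
    ScheduledActionName="weekend-scale-down",
    ResourceId="table/app-table",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    Schedule="cron(0 0 ? * SAT *)",
    ScalableTargetAction={"MinCapacity": 100, "MaxCapacity": 500},
)
```

Scheduled actions change the provisioned floor and ceiling ahead of a known spike, which avoids the reaction lag of target tracking alone while keeping the cheaper provisioned pricing model.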
Comment 1364827 by Rakiko
- Upvotes: 1
Selected Answer: C My guess is C as it stands for Cat
Comment 1247697 by sdas1
- Upvotes: 1
C
https://docs.aws.amazon.com/wellarchitected/latest/serverless-applications-lens/capacity.html
DynamoDB auto scaling modifies provisioned throughput settings only when the actual workload stays elevated or depressed for a sustained period of several minutes. This means that provisioned capacity is probably best for you if you have relatively predictable application traffic, run applications whose traffic is consistent, and ramps up or down gradually.
Comment 1227605 by pypelyncar
- Upvotes: 4
Selected Answer: C Application Auto Scaling allows you to dynamically adjust provisioned capacity based on usage patterns. You only pay for the capacity you use, reducing costs compared to keeping a high, fixed capacity throughout the week.
Comment 1194643 by Christina666
- Upvotes: 3
Selected Answer: C Regarding D (changing the capacity mode from provisioned to on-demand): on-demand mode is great for unpredictable workloads. In this case, with predictable patterns, you'd likely pay more with on-demand than with a well-managed, scheduled, provisioned mode.
Comment 1187753 by lucas_rfsb
- Upvotes: 1
Selected Answer: D As I understand it, this should be D.
Comment 1181393 by FuriouZ
- Upvotes: 3
Selected Answer: C Obviously better than B because of peak scaling
Comment 1179931 by jpmadan
- Upvotes: 1
Selected Answer: B D Excerpts from documentation: This means that provisioned capacity is probably best for you if you have relatively predictable application traffic, run applications whose traffic is consistent, and ramps up or down gradually. Whereas on-demand capacity mode is probably best when you have new tables with unknown workloads, unpredictable application traffic and also if you only want to pay exactly for what you use. The on-demand pricing model is ideal for bursty, new, or unpredictable workloads whose traffic can spike in seconds or minutes, and when under-provisioned capacity would impact the user experience.
https://docs.aws.amazon.com/wellarchitected/latest/serverless-applications-lens/capacity.html
Question d9ebTm9Cl0ZMgEngvQ3O
Question
A company is planning to migrate on-premises Apache Hadoop clusters to Amazon EMR. The company also needs to migrate a data catalog into a persistent storage solution. The company currently stores the data catalog in an on-premises Apache Hive metastore on the Hadoop clusters. The company requires a serverless solution to migrate the data catalog. Which solution will meet these requirements MOST cost-effectively?
Choices
- A: Use AWS Database Migration Service (AWS DMS) to migrate the Hive metastore into Amazon S3. Configure AWS Glue Data Catalog to scan Amazon S3 to produce the data catalog.
- B: Configure a Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use AWS Glue Data Catalog to store the company’s data catalog as an external data catalog.
- C: Configure an external Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use Amazon Aurora MySQL to store the company’s data catalog.
- D: Configure a new Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use the new metastore as the company’s data catalog.
Answer: B (community vote: B 82%, A 18%)
Discussion
Comment 1309418 by Asmunk
- Upvotes: 1
Selected Answer: B A and D can be discarded because of the added steps. This link documents this exact use case: https://aws.amazon.com/blogs/big-data/migrate-and-deploy-your-apache-hive-metastore-on-amazon-emr/ C is also discarded because of the serverless keyword; although Aurora can be serverless, that is not specified in the choice.
Comment 1194644 by Christina666
- Upvotes: 2
Selected Answer: B
Serverless and cost-efficient: AWS Glue Data Catalog offers a serverless metadata repository, reducing operational overhead and making it cost-effective. Using it as an external data catalog means you don't have to manage additional database infrastructure.
Seamless migration: Migrating your existing Hive metastore to Amazon EMR ensures compatibility with your current Hadoop setup. EMR is designed to run Hadoop workloads, facilitating this process.
Flexibility: An external data catalog in AWS Glue offers flexibility and separation of concerns. Your metastore remains managed by EMR for your Hadoop workloads, while Glue provides a centralized catalog for broader AWS data sources.
Comment 1190356 by nyaopoko
- Upvotes: 1
B is answer! By leveraging AWS Glue Data Catalog as an external data catalog and migrating the existing Hive metastore into Amazon EMR, the company can achieve a serverless, persistent, and cost-effective solution for storing and managing their data catalog.
Comment 1190247 by arvehisa
- Upvotes: 2
Selected Answer: B B. https://aws.amazon.com/jp/blogs/big-data/migrate-and-deploy-your-apache-hive-metastore-on-amazon-emr/
Comment 1187761 by lucas_rfsb
- Upvotes: 2
Selected Answer: A I will go with A. Besides DMS being the typical service for migrations, it is the only choice that explicitly addresses how the migration itself will be performed. The other choices would require a script or a Glue ETL job, but that migration logic is never spelled out.
Comment 1182594 by LeoSantos121212121212121
- Upvotes: 2
I will go with A
Comment 1179930 by jpmadan
- Upvotes: 1
Selected Answer: B serverless catalog in AWS == glue
Comment 1167445 by damaldon
- Upvotes: 1
B. Set up an AWS Glue ETL job which extracts metadata from your Hive metastore (MySQL) and loads it into your AWS Glue Data Catalog. This method requires an AWS Glue connection to the Hive metastore as a JDBC source. An ETL script is provided to extract metadata from the Hive metastore and write it to AWS Glue Data Catalog. https://github.com/aws-samples/aws-glue-samples/blob/master/utilities/Hive_metastore_migration/README.md
Comment 1138404 by rralucard_
- Upvotes: 3
Selected Answer: B https://aws.amazon.com/blogs/big-data/migrate-and-deploy-your-apache-hive-metastore-on-amazon-emr/ Option B is likely the most suitable. Migrating the Hive metastore into Amazon EMR and using AWS Glue Data Catalog as an external catalog provides a balance between leveraging the scalable and managed services of AWS (like EMR and Glue Data Catalog) and ensuring a smooth transition from the on-premises setup. This approach leverages the serverless nature of AWS Glue Data Catalog, minimizing operational overhead and potentially reducing costs compared to managing database servers.
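For reference, a minimal sketch of the EMR side of option B, using the documented hive-site setting that points Hive at the Glue Data Catalog so the catalog persists outside the cluster. The release label, instance types, and role names are placeholders.

```python
import boto3

emr = boto3.client("emr")

# Launch an EMR cluster whose Hive metastore is the AWS Glue Data Catalog,
# so the migrated catalog lives in a serverless, persistent store.
emr.run_job_flow(
    Name="hive-on-glue-catalog",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}],
    Configurations=[
        {
            "Classification": "hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            },
        }
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```

The one-time metastore migration itself (for example the Glue ETL script from the aws-glue-samples repository linked above) then loads the existing Hive metadata into the Data Catalog.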
Question v2gTHTPupqiov1tlO8FR
Question
A company uses an Amazon Redshift provisioned cluster as its database. The Redshift cluster has five reserved ra3.4xlarge nodes and uses key distribution. A data engineer notices that one of the nodes frequently has a CPU load over 90%. SQL Queries that run on the node are queued. The other four nodes usually have a CPU load under 15% during daily operations. The data engineer wants to maintain the current number of compute nodes. The data engineer also wants to balance the load more evenly across all five compute nodes. Which solution will meet these requirements?
Choices
- A: Change the sort key to be the data column that is most often used in a WHERE clause of the SQL SELECT statement.
- B: Change the distribution key to the table column that has the largest dimension.
- C: Upgrade the reserved node from ra3.4xlarge to ra3.16xlarge.
- D: Change the primary key to be the data column that is most often used in a WHERE clause of the SQL SELECT statement.
Answer: B (community vote: B 92%)
Discussion
Comment 1138409 by rralucard_
- Upvotes: 7
Selected Answer: B https://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html Option B, changing the distribution key, is the most effective solution to balance the load more evenly across all five compute nodes. Selecting an appropriate distribution key that aligns with the query patterns and data characteristics can result in a more uniform distribution of data and workloads, thus reducing the likelihood of one node being overutilized while others are underutilized.
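As a rough sketch of what "changing the distribution key" looks like in practice (the cluster, database, user, table, and column names are hypothetical), the Redshift Data API can be used to check skew and then redistribute the hot table:

```python
import boto3

rsd = boto3.client("redshift-data")

def run_sql(sql: str) -> str:
    """Submit a statement asynchronously and return its statement id."""
    resp = rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",
        Database="analytics",
        DbUser="admin",
        Sql=sql,
    )
    return resp["Id"]

# 1. Find skewed tables: a large skew_rows value means rows pile up on a few slices.
run_sql('SELECT "table", diststyle, skew_rows FROM svv_table_info ORDER BY skew_rows DESC;')

# 2. Redistribute on a high-cardinality column that is also used in joins and filters,
#    which spreads rows (and query work) across all five nodes.
run_sql("ALTER TABLE sales ALTER DISTSTYLE KEY DISTKEY customer_id;")
```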
Comment 1228214 by pypelyncar
- Upvotes: 2
Selected Answer: B In a Redshift cluster with key distribution, data is distributed across compute nodes based on the values of the distribution key. An uneven distribution can lead to skewed workloads on specific nodes. By choosing the table column with the largest dimension (most distinct values) as the distribution key, you ensure a more even spread of data across all nodes. This balances the processing load on each node when queries access that column.
Comment 1203386 by khchan123
- Upvotes: 2
Selected Answer: B The correct solution is B. Change the distribution key to the table column that has the largest dimension. This will help to distribute the data more evenly across the nodes, reducing the load on the heavily utilized node.
Comment 1194645 by Christina666
- Upvotes: 1
Selected Answer: A (Gemini result)
Understanding the problem: the scenario describes a Redshift cluster with uneven load distribution, which indicates potential issues with either the distribution style or the sort key.
Key distribution: the cluster uses key distribution, meaning a specific column is designated as the distribution key; data rows with matching distribution key values are placed on the same node.
Sort key: a sort key determines the order in which data is physically stored within a table's blocks on a node. A well-chosen sort key can significantly optimize query performance, especially when queries often filter by that column.
Comment 1167457 by damaldon
- Upvotes: 2
B. With “Key distribution”. The rows are distributed according to the values in one column. The leader node places matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns. This way, matching values from the common columns are physically stored together. https://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html
Question 3XaG6pEqqVrKGQC61bLP
Question
A security company stores IoT data that is in JSON format in an Amazon S3 bucket. The data structure can change when the company upgrades the IoT devices. The company wants to create a data catalog that includes the IoT data. The company’s analytics department will use the data catalog to index the data. Which solution will meet these requirements MOST cost-effectively?
Choices
- A: Create an AWS Glue Data Catalog. Configure an AWS Glue Schema Registry. Create a new AWS Glue workload to orchestrate the ingestion of the data that the analytics department will use into Amazon Redshift Serverless.
- B: Create an Amazon Redshift provisioned cluster. Create an Amazon Redshift Spectrum database for the analytics department to explore the data that is in Amazon S3. Create Redshift stored procedures to load the data into Amazon Redshift.
- C: Create an Amazon Athena workgroup. Explore the data that is in Amazon S3 by using Apache Spark through Athena. Provide the Athena workgroup schema and tables to the analytics department.
- D: Create an AWS Glue Data Catalog. Configure an AWS Glue Schema Registry. Create AWS Lambda user defined functions (UDFs) by using the Amazon Redshift Data API. Create an AWS Step Functions job to orchestrate the ingestion of the data that the analytics department will use into Amazon Redshift Serverless.
Answer: A (community vote: A 80%, C 20%)
Discussion
Comment 1138433 by rralucard_
- Upvotes: 9
Selected Answer: A Option A, creating an AWS Glue Data Catalog with Glue Schema Registry and orchestrating data ingestion into Amazon Redshift Serverless using AWS Glue, appears to be the most cost-effective and suitable solution. It offers a serverless approach to manage the evolving data schema of the IoT data and efficiently supports data analytics needs without the overhead of managing a provisioned database cluster or complex orchestration setups.
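For illustration, a minimal boto3 sketch of option A's catalog pieces, with the registry, schema, bucket, and role names invented for the example: register a JSON schema with a compatibility mode that tolerates device upgrades, and crawl the S3 data into the Data Catalog on a schedule so the analytics department can index it.

```python
import boto3

glue = boto3.client("glue")

# Schema Registry: track the evolving IoT event structure explicitly.
glue.create_registry(RegistryName="iot-telemetry")
glue.create_schema(
    RegistryId={"RegistryName": "iot-telemetry"},
    SchemaName="device-events",
    DataFormat="JSON",
    Compatibility="BACKWARD",  # tolerate additive changes from device upgrades
    SchemaDefinition='{"type": "object", "properties": {"deviceId": {"type": "string"}}}',
)

# Data Catalog: crawl the JSON objects in S3 and keep the tables up to date.
glue.create_crawler(
    Name="iot-json-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="iot_catalog",
    Targets={"S3Targets": [{"Path": "s3://iot-data-bucket/events/"}]},
    Schedule="cron(0 * * * ? *)",  # hourly, so new device schemas are indexed
)
```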
Comment 1213874 by VerRi
- Upvotes: 1
Selected Answer: A Athena is not able to create a new data catalog.
Comment 1207917 by sdas1
- Upvotes: 1
Option C
Cost-effectiveness: Amazon Athena allows you to query data directly from Amazon S3 without any infrastructure setup or management. You pay only for the queries you run, making it cost-effective, especially for sporadic or exploratory analysis.
Flexibility: Since the data structure can change with IoT device upgrades, Athena allows flexible querying and analysis of the data regardless of its structure. You don't need to define a fixed schema upfront, so you can adapt to changes seamlessly.
Apache Spark support: Athena supports querying data using Apache Spark, which is powerful for processing and analyzing large datasets. This ensures that the analytics department can use Spark for more advanced analytics if needed.
https://www.youtube.com/watch?v=Q93NZJBFSWw
Comment 1203387 by khchan123
- Upvotes: 2
Selected Answer: A The correct solution is A. Create an AWS Glue Data Catalog. Configure an AWS Glue Schema Registry. Create a new AWS Glue workload to orchestrate the ingestion of the data that the analytics department will use into Amazon Redshift Serverless.
Option C (Amazon Athena and Apache Spark) is suitable for ad-hoc querying and exploration but may not be the best choice for the analytics department’s ongoing data analysis needs, as Athena is designed for interactive querying rather than complex data transformations.
Comment 1197154 by chris_spencer
- Upvotes: 4
Selected Answer: A The objective is to create a data catalog that includes the IoT data and AWS Glue Data Catalog is the best option for this requirement. https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html
C is incorrect. While Athena makes it easy to read from S3 using SQL, it does not crawl the data source and create a data catalog.
Comment 1194717 by Christina666
- Upvotes: 2
Selected Answer: C
Why option C is the most cost-effective:
Serverless and pay-as-you-go: Athena is a serverless query service, so you only pay for the queries the analytics department runs. There is no need to provision and manage always-running clusters.
Flexible schema handling: Athena works well with semi-structured data like JSON and can handle schema evolution on the fly, which suits a scenario where IoT data structures might change.
Spark integration: Integrating Apache Spark with Athena provides rich capabilities for data processing and transformation.
Ease of use for analytics: Athena's familiar SQL interface and ability to directly query S3 data make it convenient for the analytics department.
Comment 1187769 by lucas_rfsb
- Upvotes: 2
Selected Answer: C Options A, B, and D involve setting up additional infrastructure (e.g., AWS Glue, Redshift clusters, Lambda functions) which may incur unnecessary costs and complexity for the given requirements. Option C, on the other hand, utilizes a serverless and scalable solution directly querying data in S3, making it the most cost-effective choice.