Questions and Answers
Question 0x7ffq2mLrDlC49omsho
Question
A data engineer needs to join data from multiple sources to perform a one-time analysis job. The data is stored in Amazon DynamoDB, Amazon RDS, Amazon Redshift, and Amazon S3. Which solution will meet this requirement MOST cost-effectively?
Choices
- A: Use an Amazon EMR provisioned cluster to read from all sources. Use Apache Spark to join the data and perform the analysis.
- B: Copy the data from DynamoDB, Amazon RDS, and Amazon Redshift into Amazon S3. Run Amazon Athena queries directly on the S3 files.
- C: Use Amazon Athena Federated Query to join the data from all data sources.
- D: Use Redshift Spectrum to query data from DynamoDB, Amazon RDS, and Amazon S3 directly from Redshift.
Answer: C | Answer_ET: C | Community answer: C (100%)
Discussion
Comment 1187087 by lucas_rfsb
- Upvotes: 7
Selected Answer: C I would go for C because Federated Query is designed for exactly this purpose, and we don't need to add or duplicate resources in S3. That said, because Athena is more optimized for S3, this can be considered a tricky question: there are additional trade-offs, such as data governance, which in my opinion is easier when data is centralized in S3.
Comment 1227006 by pypelyncar
- Upvotes: 4
Selected Answer: C
- Serverless processing: Athena is a serverless query service, meaning you only pay for the queries you run. This eliminates the need to provision and manage compute resources as with EMR clusters, making it ideal for one-time jobs.
- Federated query capability: Athena Federated Query allows you to directly query data from various sources such as DynamoDB, RDS, Redshift, and S3 without physically moving the data. This eliminates data movement costs and simplifies the analysis process.
- Reduced cost for large datasets: Compared to copying data to S3, which can be expensive for large datasets, Athena Federated Query avoids unnecessary data movement, reducing overall costs.
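To make the federated approach concrete, here is a minimal boto3 sketch of what option C might look like, assuming the DynamoDB and RDS connectors have already been deployed and registered as Athena data source catalogs; the catalog, database, table, and bucket names below are hypothetical.

```python
import boto3

# Minimal sketch of an Athena Federated Query joining DynamoDB, RDS, and S3 data.
# Assumes connectors are deployed and registered as catalogs named "dynamo" and
# "rds" (hypothetical names) and that an S3 location exists for query results.
athena = boto3.client("athena")

query = """
SELECT o.order_id, c.customer_name, o.total
FROM "dynamo"."default"."orders" o            -- DynamoDB table via connector
JOIN "rds"."sales"."customers" c              -- RDS table via connector
  ON o.customer_id = c.customer_id
JOIN "awsdatacatalog"."analytics"."events" e  -- data already in S3 (Glue Data Catalog)
  ON e.order_id = o.order_id
"""

response = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```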
Comment 1178946 by certplan
- Upvotes: 2
Amazon Athena Federated Query allows you to query data from multiple federated data sources including relational databases, NoSQL databases, and object stores directly from Athena. While this might seem like an efficient way to join data from different sources without the need for copying data into Amazon S3, it’s essential to consider the cost implications.
AWS documentation on Amazon Athena Federated Query [1] explains that while Federated Query enables you to query data from external data sources without data movement, it does not eliminate data transfer costs. Depending on the data sources involved (such as Amazon RDS, DynamoDB, etc.), there might be data transfer costs associated with querying data directly from these sources.
[1] Amazon Athena Federated Query Documentation: https://docs.aws.amazon.com/athena/latest/ug/federated-data-sources.html
Comment 1178945 by certplan
- Upvotes: 1
Data Storage Costs: Storing data in Amazon S3 is generally cheaper compared to the other AWS storage options like Amazon Redshift or Amazon RDS.
Compute Costs: Amazon Athena is a serverless query service that allows you to query data directly from S3 without the need for provisioning or managing infrastructure. You only pay for the queries you run, which can be more cost-effective than provisioning an EMR cluster (option A) or using Redshift Spectrum (option D), both of which involve compute resources that you might not fully utilize.
Data Transfer Costs: Option B involves copying the data once into S3, and then there are no additional data transfer costs for querying the data using Athena. In contrast, options A and D would involve data transfer costs as data is moved between different services.
Amazon Athena Pricing: https://aws.amazon.com/athena/pricing/ Amazon S3 Pricing: https://aws.amazon.com/s3/pricing/
Comment 1178944 by certplan
- Upvotes: 1
Point: “perform a one-time analysis job”
Option C (Amazon Athena Federated Query) might seem appealing, but it’s generally more suited for querying data from external sources without copying the data into S3. However, since the data is already within AWS services, copying it to S3 and using Athena directly would likely be more cost-effective.
Comment 1127578 by [Removed]
- Upvotes: 4
Selected Answer: C You can query these sources by using Federated Queries, which is a native feature of Athena. The other options may increase costs and operational overhead, as they use more than one service to achieve the same result
https://docs.aws.amazon.com/athena/latest/ug/connectors-available.html
Question fZERnQM2VWOyXoFLR6Bl
Question
A company is planning to use a provisioned Amazon EMR cluster that runs Apache Spark jobs to perform big data analysis. The company requires high reliability. A big data team must follow best practices for running cost-optimized and long-running workloads on Amazon EMR. The team must find a solution that will maintain the company’s current level of performance. Which combination of resources will meet these requirements MOST cost-effectively? (Choose two.)
Choices
- A: Use Hadoop Distributed File System (HDFS) as a persistent data store.
- B: Use Amazon S3 as a persistent data store.
- C: Use x86-based instances for core nodes and task nodes.
- D: Use Graviton instances for core nodes and task nodes.
- E: Use Spot Instances for all primary nodes.
Answer: BD | Answer_ET: BD | Community answer: BD (100%)
Discussion
Comment 1127581 by [Removed]
- Upvotes: 9
Selected Answer: BD HDFS is not recommended for persistent storage because once a cluster is terminated, all HDFS data is lost. Also, long-running workloads can fill the disk space quickly. Thus, S3 is the best option since it’s highly available, durable, and scalable.
AWS Graviton-based instances cost up to 20% less than comparable x86-based Amazon EC2 instances: https://aws.amazon.com/ec2/graviton/
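As a rough illustration of B + D together, here is a minimal boto3 sketch of provisioning such a cluster with S3 (via EMRFS) as the persistent store and Graviton (m6g) instance types; the release label, bucket, role names, and instance sizes are hypothetical.

```python
import boto3

# Minimal sketch of an EMR cluster following options B + D: S3 for persistent
# storage/logs and Graviton (m6g) instances for core and task nodes.
# Bucket names, roles, and sizes below are hypothetical.
emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="spark-analysis",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://example-emr-logs/",  # logs persist outside the cluster lifecycle
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m6g.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m6g.2xlarge", "InstanceCount": 3},
            # Task nodes can additionally use Spot for savings (not the primary node).
            {"InstanceRole": "TASK", "InstanceType": "m6g.2xlarge", "InstanceCount": 2,
             "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```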
Comment 1411307 by sam_pre
- Upvotes: 1
Selected Answer: BD Cost-effective + high reliability → S3. Graviton → lower cost.
Comment 1330987 by ttpro1995
- Upvotes: 1
Selected Answer: BD Rule of thumb: pick the AWS in-house solution provided for that service. Graviton is the AWS processor, and EMRFS on S3 is the AWS-native persistent store.
Comment 1227007 by pypelyncar
- Upvotes: 3
Selected Answer: BD
S3 is a given. For Graviton:
- Cost-effectiveness: Graviton instances are ARM-based instances specifically designed for cloud workloads. They offer significant cost savings compared to x86-based instances while delivering comparable or better performance for many Apache Spark workloads.
- Performance: Graviton instances are optimized for Spark workloads and can deliver the same level of performance as x86-based instances in many cases. Additionally, EMR offers a performance-optimized Spark runtime built for Graviton instances.
Comment 1195088 by okechi
- Upvotes: 1
My answer is BE
Comment 1178953 by certplan
- Upvotes: 2
A. - AWS recommends using Amazon S3 as a persistent data store for Amazon EMR due to its scalability, durability, and cost-effectiveness. Storing data in HDFS would require managing and maintaining additional infrastructure, which may incur higher costs in terms of storage, management, and scalability compared to using Amazon S3. AWS documentation emphasizes the benefits of integrating Amazon EMR with Amazon S3 for cost optimization and efficiency.
D. - While Graviton instances may offer cost savings in certain scenarios, they might not always be the most cost-effective option depending on the specific workload requirements and availability of compatible software. x86-based instances are more commonly supported by a broader range of software and frameworks, which could result in better performance and compatibility in some cases. Additionally, AWS documentation on instance types and pricing can provide insights into the cost-effectiveness of Graviton instances compared to x86-based instances.
Comment 1170872 by GiorgioGss
- Upvotes: 3
Selected Answer: BD B and D.
Question YzpsleuQTEBuBqIgJ5dR
Question
A company wants to implement real-time analytics capabilities. The company wants to use Amazon Kinesis Data Streams and Amazon Redshift to ingest and process streaming data at the rate of several gigabytes per second. The company wants to derive near real-time insights by using existing business intelligence (BI) and analytics tools. Which solution will meet these requirements with the LEAST operational overhead?
Choices
- A: Use Kinesis Data Streams to stage data in Amazon S3. Use the COPY command to load data from Amazon S3 directly into Amazon Redshift to make the data immediately available for real-time analysis.
- B: Access the data from Kinesis Data Streams by using SQL queries. Create materialized views directly on top of the stream. Refresh the materialized views regularly to query the most recent stream data.
- C: Create an external schema in Amazon Redshift to map the data from Kinesis Data Streams to an Amazon Redshift object. Create a materialized view to read data from the stream. Set the materialized view to auto refresh.
- D: Connect Kinesis Data Streams to Amazon Kinesis Data Firehose. Use Kinesis Data Firehose to stage the data in Amazon S3. Use the COPY command to load the data from Amazon S3 to a table in Amazon Redshift.
Answer: C | Answer_ET: C | Community answer: C (56%), D (39%), other (4%)
Discussion
Comment 1185507 by blackgamer
- Upvotes: 8
Selected Answer: C The answer is C. It can provide near real-time insight analysis. Refer the article from AWS - https://aws.amazon.com/blogs/big-data/real-time-analytics-with-amazon-redshift-streaming-ingestion/
Comment 1174660 by helpaws
- Upvotes: 7
Selected Answer: C The key phrase here is near real-time. If the solution involves S3 and the COPY command, it is not going to be near real-time.
Comment 1410789 by melligeri
- Upvotes: 1
Selected Answer: C https://aws.amazon.com/blogs/big-data/real-time-analytics-with-amazon-redshift-streaming-ingestion/#:~:text=Before%20the%20launch,the%20data%20stream.
Comment 1402198 by Rpathak4
- Upvotes: 2
Selected Answer: D ✅ Use Kinesis Data Firehose to load data into Redshift via S3 for the simplest and most scalable solution. ✅ Firehose automatically batches, transforms, and loads data with no manual intervention required. ✅ Achieves near real-time analytics with minimal operational effort.
Comment 1398891 by MephiboshethGumani
- Upvotes: 1
Selected Answer: D Creating an external schema and using materialized views directly on top of Kinesis Data Streams is also not an ideal choice because this approach can add complexity and doesn’t leverage fully managed solutions like Kinesis Data Firehose. The manual management of data refresh rates adds operational overhead.
Comment 1388256 by Eltanany
- Upvotes: 1
Selected Answer: C Refer to the article from AWS - https://aws.amazon.com/blogs/big-data/real-time-analytics-with-amazon-redshift-streaming-ingestion/
Comment 1364713 by jesusmoh
- Upvotes: 1
Selected Answer: D option D provides a streamlined, efficient, and low-overhead approach to achieving real-time analytics with the specified technologies.
Comment 1339500 by plutonash
- Upvotes: 2
Selected Answer: D
- A: Kinesis Data Streams to stage data in Amazon S3 — not really easy.
- B: SQL queries directly against Kinesis Data Streams — that functionality does not exist.
- C: external schema from Redshift to Kinesis Data Streams — that functionality does not exist.
- D: near real-time = Kinesis Data Firehose.
Comment 1337845 by subbie
- Upvotes: 1
Selected Answer: C https://aws.amazon.com/blogs/big-data/real-time-analytics-with-amazon-redshift-streaming-ingestion/
Comment 1337844 by subbie
- Upvotes: 1
Selected Answer: B https://aws.amazon.com/blogs/big-data/real-time-analytics-with-amazon-redshift-streaming-ingestion/
Comment 1330421 by haby
- Upvotes: 1
Selected Answer: A A for me.
- C: Redshift does not natively support direct mapping to Kinesis Data Streams; some extra configuration is needed.
- D: There will be a 60-second latency when using Firehose, so it is "near" real time, not real time.
Comment 1329012 by HagarTheHorrible
- Upvotes: 1
Selected Answer: D Redshift does not natively support direct mapping to Kinesis Data Streams. Materialized views cannot directly query streaming data from Kinesis.
Comment 1321791 by altonh
- Upvotes: 1
Selected Answer: C See https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion-getting-started.html
Comment 1307752 by Asen_Cat
- Upvotes: 2
Selected Answer: D D could be the most standard way to handle this case. How C would actually be implemented is questionable to me.
Comment 1307740 by heavenlypearl
- Upvotes: 1
Selected Answer: C Amazon Redshift can automatically refresh materialized views with up-to-date data from its base tables when materialized views are created with or altered to have the autorefresh option. Amazon Redshift autorefreshes materialized views as soon as possible after base tables changes.
https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-refresh.html
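A minimal sketch of option C, following the streaming-ingestion pattern from the linked documentation and run here through the Redshift Data API; the cluster, database, stream, and IAM role identifiers are hypothetical.

```python
import boto3

# Minimal sketch of option C via the Redshift Data API: create an external
# schema over Kinesis Data Streams, then a materialized view with AUTO REFRESH.
# Cluster, database, stream, and IAM role identifiers are hypothetical.
rsd = boto3.client("redshift-data")

create_schema = """
CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-streaming-role'
"""

create_mv = """
CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(kinesis_data) AS payload
FROM kinesis_schema."example-clickstream"
"""

resp = rsd.batch_execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="awsuser",
    Sqls=[create_schema, create_mv],
)
print(resp["Id"])  # statement id; BI tools can then query clickstream_mv
```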
Comment 1306078 by royalrum
- Upvotes: 1
Firehose is near real time; you can set your buffer size and stream to either Redshift or S3 directly. Since delivering directly to Redshift is not in the option, use S3…
Comment 1302529 by Shatheesh
- Upvotes: 1
Selected Answer: D Option D uses Kinesis Data Firehose, a fully managed service that automatically handles ingestion of data from Kinesis Data Streams.
Comment 1283727 by markill123
- Upvotes: 4
Selected Answer: D Here’s why D is the best choice:
- Kinesis Data Firehose is a fully managed service that automatically handles the ingestion of data from Kinesis Data Streams and stages it in S3, which significantly reduces operational overhead compared to managing custom data ingestion pipelines.
- S3 as a staging area: Using Amazon S3 as a staging location allows for flexible data management, high durability, and direct loading into Redshift without needing to manage complex buffering or data handling processes.
- COPY command: The COPY command in Amazon Redshift is highly optimized for loading large datasets efficiently, making it a common and effective method to load bulk data from S3 into Redshift for near real-time analysis.
- Firehose to Redshift: Firehose can automatically buffer, batch, and transform data before loading it into Redshift, reducing manual intervention and ensuring data is readily available for real-time analytics.
Comment 1279873 by shammous
- Upvotes: 2
Selected Answer: D Option C has an issue: Redshift does not natively support direct querying or mapping of Kinesis Data Streams. D is the only correct option.
Comment 1260946 by V0811
- Upvotes: 2
Selected Answer: D Option D
Comment 1241955 by bakarys
- Upvotes: 1
Selected Answer: A Option A (using Kinesis Data Streams to stage data in Amazon S3 and loading it directly into Amazon Redshift) is the most straightforward and efficient approach. It minimizes operational overhead and ensures immediate availability of data for analysis. Options B and C introduce additional complexity and may not provide the same level of efficiency
Comment 1207701 by d8945a1
- Upvotes: 2
Selected Answer: C Materialized views in Redshift with auto refresh are the best option for near real time.
Comment 1194553 by Christina666
- Upvotes: 3
Selected Answer: C Using materialized views with auto-refresh directly on a Redshift external schema of Kinesis Data Stream offers the most streamlined and efficient approach for near real-time insights using existing BI tools.
Comment 1181036 by fceb2c1
- Upvotes: 5
Selected Answer: C https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion-getting-started.html
C is correct (KDS → Redshift). D is wrong because it has more operational overhead (KDS → KDF → S3 → Redshift).
Comment 1178959 by certplan
- Upvotes: 2
Amazon Kinesis Data Firehose: It’s designed to reliably load streaming data into data lakes and data stores with minimal configuration and management overhead. It handles tasks like buffering, scaling, and delivering data to destinations like Amazon S3 and Amazon Redshift automatically.
Amazon S3 as a staging area: Storing data in Amazon S3 provides a scalable and durable solution for data storage without needing to manage infrastructure. It also allows for easy integration with other AWS services and existing BI and analytics tools.
Amazon Redshift: While Redshift requires some setup and management, loading data from Amazon S3 using the COPY command is a straightforward process. Once data is loaded into Redshift, existing BI and analytics tools can query the data directly, enabling near real-time insights.
Minimal operational overhead: This solution minimizes operational overhead because much of the management tasks, such as scaling, buffering, and delivery of data, are handled by Amazon Kinesis Data Firehose. Additionally, using Amazon S3 as a staging area simplifies data storage and integration with other services.
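For comparison, here is a minimal sketch of the Firehose leg of option D: a delivery stream that reads from the Kinesis data stream and buffers into S3, from where a Redshift COPY would load the data. All names, ARNs, and buffer settings below are hypothetical.

```python
import boto3

# Minimal sketch of option D's ingestion path: Kinesis Data Streams -> Firehose -> S3.
# The COPY from S3 into Redshift would be run separately.
# Stream, bucket, and role ARNs below are hypothetical.
firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="usage-events-to-s3",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/example-events",
        "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
        "BucketARN": "arn:aws:s3:::example-staging-bucket",
        "Prefix": "staging/",
        # Buffering controls delivery latency; this is why Firehose is "near" real time.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 60},
    },
)
```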
Comment 1178960 by certplan
- Upvotes: 1
By considering the characteristics and capabilities of each AWS service and approach, along with insights from AWS documentation, it becomes evident that option D offers the most streamlined and operationally efficient solution for the scenario described.
This idea/concept is also straight out of the Amazon Solutions Architect course material.
Comment 1178958 by certplan
- Upvotes: 1
Point: “Which solution will meet these requirements with the LEAST operational overhead?”
C. - This approach involves creating an external schema in Amazon Redshift to map data from Kinesis Data Streams, which adds complexity compared to directly loading data from Amazon S3 using Amazon Kinesis Data Firehose.
- While materialized views with auto-refresh can provide near real-time insights, managing them and ensuring proper synchronization with the streaming data source may require more operational effort.
- AWS documentation for Amazon Redshift primarily focuses on traditional data loading methods and querying, with limited guidance on integrating with real-time data sources like Kinesis Data Streams.
Comment 1170877 by GiorgioGss
- Upvotes: 3
Selected Answer: D I think D. It could be C but because of “LEAST operational overhead” I will go with D.
Comment 1142075 by Aesthet
- Upvotes: 1
Both ChatGPT and I think D is correct (100%).
Question VWEmgZx5KhS4j9NJkKmY
Question
A company uses an Amazon QuickSight dashboard to monitor usage of one of the company’s applications. The company uses AWS Glue jobs to process data for the dashboard. The company stores the data in a single Amazon S3 bucket. The company adds new data every day. A data engineer discovers that dashboard queries are becoming slower over time. The data engineer determines that the root cause of the slowing queries is long-running AWS Glue jobs. Which actions should the data engineer take to improve the performance of the AWS Glue jobs? (Choose two.)
Choices
- A: Partition the data that is in the S3 bucket. Organize the data by year, month, and day.
- B: Increase the AWS Glue instance size by scaling up the worker type.
- C: Convert the AWS Glue schema to the DynamicFrame schema class.
- D: Adjust AWS Glue job scheduling frequency so the jobs run half as many times each day.
- E: Modify the IAM role that grants access to AWS Glue to grant access to all S3 features.
Answer: AB | Answer_ET: AB | Community answer: AB (100%)
Discussion
Comment 1139119 by rralucard_
- Upvotes: 10
Selected Answer: AB A. Partition the data that is in the S3 bucket. Organize the data by year, month, and day.
• Partitioning data in Amazon S3 can significantly improve query performance. By organizing the data by year, month, and day, AWS Glue and Amazon QuickSight can scan only the relevant partitions of data, which reduces the amount of data read and processed. This approach is particularly effective for time-series data, where queries often target specific time ranges.
B. Increase the AWS Glue instance size by scaling up the worker type.
• Scaling up the worker type can provide more computational resources to the AWS Glue jobs, enabling them to process data faster. This can be especially beneficial when dealing with large datasets or complex transformations. It’s important to monitor the performance improvements and cost implications of scaling up.
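As an illustration of option A, here is a minimal PySpark sketch (for example, inside the Glue job) that rewrites the data to S3 partitioned by year, month, and day; the bucket paths and the event_time column are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal sketch of option A: rewrite the data partitioned by year/month/day so
# downstream queries can prune partitions. Paths and the "event_time" column
# are hypothetical.
spark = SparkSession.builder.appName("repartition-usage-data").getOrCreate()

df = spark.read.parquet("s3://example-usage-bucket/raw/")

df = (df.withColumn("year", F.year("event_time"))
        .withColumn("month", F.month("event_time"))
        .withColumn("day", F.dayofmonth("event_time")))

(df.write
   .mode("overwrite")
   .partitionBy("year", "month", "day")   # Hive-style layout: year=2024/month=5/day=1/
   .parquet("s3://example-usage-bucket/curated/"))
```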
Comment 1178965 by certplan
- Upvotes: 2
- Partition the Data in Amazon S3:
AWS documentation on optimizing Amazon S3 performance: https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html
AWS Glue documentation on partitioning data for AWS Glue jobs: https://docs.aws.amazon.com/glue/latest/dg/how-it-works.html#how-partitioning-works
Best practices for partitioning in Amazon S3: https://docs.aws.amazon.com/AmazonS3/latest/userguide/best-practices-partitioning.html
- Optimizing AWS Glue Job Settings:
- AWS Glue documentation on optimizing job performance: https://docs.aws.amazon.com/glue/latest/dg/best-practices.html
- AWS Glue documentation on scaling AWS Glue job resources: https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-glue-job-cloudwatch-metrics.html
By referring to these documentation resources, the data engineer can gain insights into best practices and recommendations provided by AWS for optimizing AWS Glue jobs, thereby justifying the suggested actions to address the issue of slowing job performance.
Question PaiMHb1V6C6Hjp02dRBS
Question
A data engineer needs to use AWS Step Functions to design an orchestration workflow. The workflow must parallel process a large collection of data files and apply a specific transformation to each file. Which Step Functions state should the data engineer use to meet these requirements?
Choices
- A: Parallel state
- B: Choice state
- C: Map state
- D: Wait state
Answer: C | Answer_ET: C | Community answer: C (100%)
Discussion
Comment 1274297 by GabrielSGoncalves
- Upvotes: 1
Selected Answer: C It is clearly the Map state.
Comment 1227011 by pypelyncar
- Upvotes: 3
Selected Answer: C The Map state allows you to define a single execution path for processing a collection of data items in parallel. This aligns perfectly with the data engineer’s requirement of parallel processing a large collection of data files
Comment 1224455 by FunkyFresco
- Upvotes: 1
Selected Answer: C to execute in parallel
Comment 1216115 by sveni1502
- Upvotes: 3
Selected Answer: C C is Correct To meet the requirement of parallel processing a large collection of data files and applying a specific transformation to each file, the data engineer should use the Map state in AWS Step Functions. The Map state is specifically designed to run a set of tasks in parallel for each element in a collection or array. Each element (in this case, each data file) is processed independently and in parallel, allowing the workflow to take advantage of parallel processing.
Comment 1187116 by lucas_rfsb
- Upvotes: 1
Selected Answer: C C, Map state is correct
Comment 1142096 by Aesthet
- Upvotes: 1
With Step Functions, you can orchestrate large-scale parallel workloads to perform tasks, such as on-demand processing of semi-structured data. These parallel workloads let you concurrently process large-scale data sources stored in Amazon S3. For example, you might process a single JSON or CSV file that contains large amounts of data. Or you might process a large set of Amazon S3 objects.
To set up a large-scale parallel workload in your workflows, include a Map state in Distributed mode.
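A minimal sketch of such a workflow created with boto3: a Map state in Distributed mode that applies one transformation task per file. The Lambda function ARN, role ARN, and input shape ($.files) are hypothetical.

```python
import json
import boto3

# Minimal sketch of a Step Functions workflow whose Map state fans out one
# transformation per file. Lambda ARN, role ARN, and input shape are hypothetical.
definition = {
    "StartAt": "TransformEachFile",
    "States": {
        "TransformEachFile": {
            "Type": "Map",                 # iterates over the items in $.files in parallel
            "ItemsPath": "$.files",
            "MaxConcurrency": 100,
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
                "StartAt": "Transform",
                "States": {
                    "Transform": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform-file",
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="parallel-file-transform",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/example-stepfunctions-role",
)
```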
Comment 1142077 by Aesthet
- Upvotes: 1
C is correct. Map state is designed precisely for the requirement described. It allows you to iterate over a collection of items, processing each item individually. The Map state can automatically manage the iteration and execute the specified transformation on each item in parallel, making it the perfect choice for parallel processing of a large collection of data files.
Comment 1140093 by rralucard_
- Upvotes: 2
Selected Answer: C The Map state is specifically designed for processing a collection of items (like data files) in parallel. It allows you to apply a transformation or a set of steps to each item in the input array independently. The Map state automatically iterates over each item in the array and performs the defined steps. This makes it ideal for scenarios where you need to process a large number of files in a similar manner, as in your requirement.