What is event-driven programming?
Event-driven programming is a programming paradigm where the flow of the program is determined by events, such as user interactions, messages from other systems, or sensor outputs. In event-driven systems, the program waits for events to occur, then reacts to those events.
A real-world example of event-driven programming is an ATM. The user interacts with the machine by inserting a card, entering a PIN, and making a selection from the menu. The ATM responds to each user action (event) by displaying the appropriate screen, processing the transaction, or dispensing cash. The flow of the program is determined by the events the user generates, rather than by a predetermined sequence of steps.
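A minimal sketch of this style in plain Python: handlers are registered for named events, and the program reacts only when events arrive. The event names and the tiny dispatcher here are illustrative, not any specific framework.

```python
# Map each event name to the list of handlers registered for it.
handlers = {}

def on(event_name):
    """Decorator that registers a function as a handler for an event."""
    def register(func):
        handlers.setdefault(event_name, []).append(func)
        return func
    return register

def emit(event_name, payload=None):
    """Dispatch an event to every registered handler."""
    for handler in handlers.get(event_name, []):
        handler(payload)

@on("card_inserted")
def show_pin_screen(_):
    print("Please enter your PIN")

@on("pin_entered")
def show_menu(pin):
    print("Showing menu")

# The flow is driven by the order events arrive, not a fixed script:
emit("card_inserted")
emit("pin_entered", "1234")
```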
What is in-memory computing, how does it compare to other types of computing, and what is a real-world example?
In-memory computing is a type of computing where data is stored in random access memory (RAM) rather than on disk storage. This allows for faster access to data, since accessing data in RAM is much faster than accessing data on disk.
In-memory computing differs from other types of computing in that it exploits fast, direct access to data stored in memory to perform complex calculations, such as aggregations and heavy mathematical operations, in real time. This can greatly improve application performance compared to traditional disk-based computing, where data must first be retrieved from disk and then processed, a much slower path.
A real-world example of in-memory calculation is a stock trading platform. The platform needs to access and process large amounts of financial data in real-time to provide up-to-date information to traders. By using in-memory computing, the platform can quickly retrieve and process data from memory, allowing traders to make informed decisions in real-time.
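A rough, hedged micro-benchmark of the disk-versus-RAM difference described above; the exact numbers vary by machine, but the in-memory pass avoids file I/O and parsing entirely:

```python
import os
import time

# Write a million integers to disk to simulate disk-resident data.
path = "numbers.txt"
with open(path, "w") as f:
    f.writelines(f"{i}\n" for i in range(1_000_000))

# Disk-based: read and parse from the file on every computation.
start = time.perf_counter()
with open(path) as f:
    disk_total = sum(int(line) for line in f)
print(f"from disk:   {time.perf_counter() - start:.3f}s")

# In-memory: data already lives in RAM, so the sum runs directly on it.
numbers = list(range(1_000_000))
start = time.perf_counter()
mem_total = sum(numbers)
print(f"from memory: {time.perf_counter() - start:.3f}s")

os.remove(path)
```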
How does Spark use in-memory computing for real-time processing?
Apache Spark is a fast, in-memory data processing engine for big data processing. In-memory computing in Spark refers to the use of RAM for storing and processing data, instead of reading it from disk. This allows Spark to perform data processing tasks much faster than traditional disk-based systems.
In Spark, data is loaded into memory and organized into a data structure called a Resilient Distributed Dataset (RDD). The RDD is split into partitions, which can be processed in parallel on multiple nodes in a cluster. Spark’s in-memory computing model allows it to cache RDDs in memory and reuse them for multiple transformations, reducing the amount of data that needs to be read from disk.
The combination of Spark’s data parallelism and in-memory computing makes it an attractive solution for real-time big data processing. Spark can process large amounts of data in real time, providing low latency and high throughput. Additionally, Spark supports a variety of APIs, such as SQL and machine learning, making it a versatile solution for a wide range of big data processing tasks.
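A minimal PySpark sketch of the caching behavior described above; it assumes pyspark is installed with a local Spark runtime, and the data is made up for illustration:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "InMemoryDemo")

# Load data into an RDD split across 8 partitions for parallel processing.
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# cache() keeps the transformed RDD in memory, so later actions reuse it
# instead of recomputing (or rereading) the input.
squares = rdd.map(lambda x: x * x).cache()

# Both actions below read the cached partitions from RAM.
total = squares.sum()
even_count = squares.filter(lambda x: x % 2 == 0).count()
print(total, even_count)

sc.stop()
```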
What are the advantages of Kafka over Spark in stream processing?
Apache Kafka and Apache Spark are both widely used tools for stream processing. While they have similarities, there are also some key differences between them.
Advantages of Kafka for stream processing:
- Scalability: Kafka is designed to handle high throughput and large numbers of concurrent consumers, making it well-suited for large-scale streaming applications.
- Durability: Kafka stores streams of records in a distributed fashion, providing high durability and fault tolerance.
- Decoupling: Kafka provides a publish-subscribe model for communication between producers and consumers, allowing for greater decoupling of components in a streaming architecture.
Advantages of Spark for stream processing:
- In-Memory Processing: Spark processes data in-memory, allowing for fast and efficient processing of streaming data.
- High-level APIs: Spark provides high-level APIs in Scala, Java, Python, and R, making it easier to build complex stream processing applications.
- Integration: Spark integrates with other big data tools and technologies, allowing for a unified processing pipeline for batch and stream processing.
In conclusion, the choice between Kafka and Spark for stream processing depends on the specific requirements of the application. Kafka is well-suited for high-throughput, large-scale streaming applications, while Spark is a good choice for applications that require in-memory processing and integration with other big data tools.
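To make the publish-subscribe decoupling concrete, here is a hedged sketch using the third-party kafka-python package; the broker address, topic name, and consumer group are assumptions, and a broker must be running for it to execute:

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer publishes records to a topic without knowing who consumes them.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b"page_view:/home")
producer.flush()

# Consumers subscribe independently; consumers in the same group share
# partitions, while separate groups each receive every record.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.value)
    break  # stop after one message for this demo
```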
What are high-throughput, large-scale streaming applications, and what is a real-world example?
High-throughput, large-scale streaming applications are applications that process a large volume of data in real time. They must handle high volume, velocity, and variety of data while providing fast, efficient processing to support real-time use cases.
A real-world example of a high-throughput, large-scale streaming application is a real-time social media analytics platform. This platform processes a large volume of data from multiple social media sources in real-time, such as tweets, posts, and comments. The platform uses this data to perform real-time sentiment analysis, trend analysis, and influencer analysis to provide actionable insights to businesses and organizations. To handle the high volume and velocity of data, the platform must use efficient stream processing technologies and have the ability to scale to handle growth in data volume and complexity.
Surrogate Key and Natural Key
Knowledge notes
- CTE (common table expression): written as WITH table_name AS (...) (see the sketch after this list)
- Window functions
- Self joins: practice at PostgreSQL exercises (pgexercises.com)
- Normalization and denormalization
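A small runnable sketch of a CTE combined with window functions, using Python's built-in sqlite3 (which supports both in SQLite 3.25+); the sales table and figures are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, amount INTEGER);
INSERT INTO sales VALUES ('north', 100), ('north', 150),
                         ('south', 90),  ('south', 200);
""")

query = """
WITH regional AS (            -- CTE: names an intermediate result set
    SELECT region, amount FROM sales
)
SELECT region,
       amount,
       SUM(amount) OVER (PARTITION BY region) AS region_total,          -- window fn
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk    -- window fn
FROM regional;
"""
for row in conn.execute(query):
    print(row)
```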
A fact table and a dimension table are essential components of an analytical database. A fact table records facts (raw data, measures, or observations) about a specific event or transaction, and typically contains foreign keys that look up specific dimension tables. Dimension tables record descriptive attributes for each item in the fact table, such as date, geographic location, customer information, and product type.
For example: Imagine a retail chain that is analyzing how sales were affected by different weather conditions.
Fact Table: contains the raw data from all of the sales receipts, recording how much each store sold and when and where it was sold.
Dimension Table: contains information about the date, location, and customer demographics for each sale recorded in the Fact Table.
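A hedged sketch of that star-schema join, again via sqlite3; the table names, columns, and figures are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_weather (weather_id INTEGER PRIMARY KEY, condition TEXT);
-- Fact table: one row per sale, with foreign keys into the dimensions.
CREATE TABLE fact_sales  (store_id INTEGER, weather_id INTEGER, amount REAL);

INSERT INTO dim_store   VALUES (1, 'Hanoi'), (2, 'Da Nang');
INSERT INTO dim_weather VALUES (1, 'sunny'), (2, 'rainy');
INSERT INTO fact_sales  VALUES (1, 1, 120.0), (1, 2, 80.0), (2, 2, 60.0);
""")

# "How did weather affect sales per city?": aggregate the facts,
# describe the groups with the dimensions.
query = """
SELECT s.city, w.condition, SUM(f.amount) AS total_sales
FROM fact_sales f
JOIN dim_store   s ON f.store_id   = s.store_id
JOIN dim_weather w ON f.weather_id = w.weather_id
GROUP BY s.city, w.condition;
"""
for row in conn.execute(query):
    print(row)
```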
Airflow optimization
- Modular DAGs: break large workflows into smaller, modular DAGs that are easier to manage and troubleshoot; give each task a single responsibility (e.g., move crawling work out to Celery workers).
- Keep tasks atomic, and decide deliberately where data processing jobs should run.
- Task idempotency: ensure tasks are idempotent so retries and reruns don’t cause inconsistent states, e.g., upsert instead of insert (see the sketch after this list).
- Avoid top-level Python code, since the scheduler re-parses DAG files constantly.
- Use Jinja templates to access Airflow resources, as they are resolved at runtime.
- Avoid copying the same code across similar tasks.
- Use parallelism wisely: avoid multiprocessing for I/O-bound tasks, and check whether tasks can run in parallel so that short tasks don’t sit waiting behind long-running ones.
- Alert on task failure rather than retrying many times.
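A hedged sketch tying a few of these notes together (small single-purpose DAG, idempotent upsert-style task, no heavy top-level code, runtime-resolved context values); the dag_id and the metrics logic are illustrative assumptions, written against a recent Airflow 2.x:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def upsert_daily_metrics(ds, **_):
    # Idempotent by design: rerunning for the same logical date ("ds")
    # should overwrite that date's row (an upsert), never append a duplicate.
    print(f"upserting metrics for {ds}")


with DAG(
    dag_id="daily_metrics",            # small, single-purpose DAG
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # No heavy top-level code: the scheduler parses this file repeatedly,
    # so the real work stays inside the callable and runs only at execute time.
    PythonOperator(
        task_id="upsert_daily_metrics",
        python_callable=upsert_daily_metrics,  # "ds" is injected from runtime context
    )
```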