14 October 2017
Apache Spark provides a unified engine that natively supports both batch and streaming workloads. Spark Streaming’s execution model is advantageous over traditional streaming systems for its fast recovery from failures, dynamic load balancing, streaming and interactive analytics, and native integration.
Checkpointing is a mechanism available in Spark that saves an RDD to a reliable storage system like S3 or HDFS so that Spark can truncate the RDD’s lineage rather than keeping the entire chain. When an RDD’s lineage becomes too long, it is recommended to checkpoint it. This is similar to the way Hadoop stores intermediate results on disk so that it can recover from failures. Since a checkpoint lives in an external storage location, it can also be reused by other applications.
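To make the lineage idea concrete, here is a minimal pure-Python sketch (not Spark’s API — the class and method names are illustrative; in real Spark you would call sc.setCheckpointDir(...) and rdd.checkpoint()). It shows how each transformation extends the lineage chain and how checkpointing truncates it:

```python
# Illustrative sketch of RDD lineage and checkpointing; FakeRDD is a
# stand-in for a real RDD, not part of any Spark API.

class FakeRDD:
    """A dataset that remembers how it was derived (its lineage)."""

    def __init__(self, data, lineage=None):
        self.data = list(data)
        # Lineage: the chain of transformations that produced this dataset.
        self.lineage = lineage or ["source"]

    def map(self, fn):
        # Each transformation extends the lineage chain.
        return FakeRDD([fn(x) for x in self.data],
                       self.lineage + [f"map({fn.__name__})"])

    def checkpoint(self):
        # Persist the data to "reliable storage" and truncate the lineage:
        # recovery now starts from the checkpoint, not from the source.
        self.lineage = ["checkpoint"]

def double(x):
    return 2 * x

rdd = FakeRDD([1, 2, 3]).map(double).map(double)
print(rdd.lineage)   # ['source', 'map(double)', 'map(double)']
rdd.checkpoint()
print(rdd.lineage)   # ['checkpoint']
```

After checkpointing, a recovery does not need to replay the full transformation chain from the original source, which is exactly why long lineages should be checkpointed.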
A streaming application must operate 24/7 and hence must be resilient to failures unrelated to the application logic (e.g., system failures, JVM crashes, etc.). For this to be possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system so that it can recover from failures.
There are two types of checkpointing available: metadata checkpointing and data (RDD) checkpointing.
To summarize, metadata checkpointing is primarily needed for recovery from driver failures, whereas data (RDD) checkpointing is necessary even for basic functioning if stateful transformations are used.
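A stateful transformation such as updateStateByKey carries state (e.g., a running count) across micro-batches, which is why data checkpointing is mandatory for it. The sketch below mimics its semantics in plain Python (the helper names are illustrative, not Spark’s API): an update function receives the new values for a key from the current batch plus the previous state, and returns the new state.

```python
# Pure-Python sketch of updateStateByKey semantics: per key, Spark calls
# an update function with the current batch's new values and the prior
# state, keeping the returned value as the new state.

def update_count(new_values, running_count):
    # new_values: values for this key in the current micro-batch;
    # running_count: state from previous batches (None on first sight).
    return sum(new_values) + (running_count or 0)

def run_batch(state, batch):
    """Apply the update function to one micro-batch of (key, value) pairs."""
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    for key, values in grouped.items():
        state[key] = update_count(values, state.get(key))
    return state

state = {}
state = run_batch(state, [("spark", 1), ("kafka", 1), ("spark", 1)])
state = run_batch(state, [("spark", 1)])
print(state)  # {'spark': 3, 'kafka': 1}
```

Because the state depends on every batch seen so far, its lineage would grow without bound; periodically checkpointing the state RDD keeps recovery time bounded.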
In order to have an automatic recovery mechanism for driver failures, different cluster managers provide different techniques:
Spark Standalone: use the --supervise flag while deploying the Spark job so that the driver can be restarted upon failure.
YARN: spark.yarn.max.executor.failures [from references 2 & 3]
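For the Standalone case, the --supervise flag is passed to spark-submit in cluster deploy mode; a sketch (the master URL and application file are placeholders):

```shell
# Spark Standalone cluster mode: --supervise asks the master to restart
# the driver automatically if it exits with a non-zero exit code.
spark-submit \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --supervise \
  streaming_app.py
```

Note that --supervise only takes effect with --deploy-mode cluster, since in client mode the driver runs outside the cluster manager’s control.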
The term sliding window comes from the flow-control technique that governs transmission of data packets between computer networks; Spark Streaming applies the same idea to streams of data. The Spark Streaming library provides windowed computations, where transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within that window are combined and operated upon to produce new RDDs of the windowed DStream.
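The combining step can be sketched in plain Python (an illustration of the semantics, not Spark’s window() API; for simplicity this sketch emits only complete windows, whereas Spark also produces partial windows while the stream warms up). With a window length of 3 batches and a slide interval of 2 batches, each emitted window is the union of the batches that fall inside it:

```python
# Pure-Python sketch of a windowed computation over a stream of batches.

def windowed(batches, window_length, slide_interval):
    """Union the batches inside each sliding window (full windows only)."""
    windows = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = []
        # Combine (union) all batches that fall within this window.
        for batch in batches[end - window_length:end]:
            window.extend(batch)
        windows.append(window)
    return windows

batches = [[1], [2], [3], [4], [5]]   # one list per batch interval
print(windowed(batches, 3, 2))        # [[1, 2, 3], [3, 4, 5]]
```

Note how batch [3] appears in both windows: consecutive windows overlap whenever the slide interval is shorter than the window length.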
A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets (RDDs) that represents a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams support two kinds of operations:
• Transformations, which produce a new DStream.
• Output operations, which write data to an external system.
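The distinction between the two kinds of operations can be sketched in plain Python, modeling a DStream as a list of batches (illustration only; in Spark these would be dstream.map(...) and an output operation such as foreachRDD):

```python
# Pure-Python sketch of DStream operations: a transformation yields a new
# stream of batches; an output operation pushes data to an external system.

dstream = [[1, 2], [3], [4, 5]]      # a DStream as a sequence of RDDs

# Transformation: produces a new DStream, batch by batch.
doubled = [[2 * x for x in batch] for batch in dstream]

# Output operation: writes each batch to an "external system" (a list here).
sink = []
for batch in doubled:
    sink.extend(batch)

print(doubled)  # [[2, 4], [6], [8, 10]]
print(sink)     # [2, 4, 6, 8, 10]
```

As in Spark, the transformation is purely stream-to-stream, while only the output operation moves data out of the streaming pipeline.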