What is Apache Spark?
Apache Spark is a general-purpose, lightning-fast, open-source cluster computing system that provides high-level APIs in Java, Scala, Python and R. It can access data from HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source, and it can run under the Standalone, YARN and Mesos cluster managers. Spark provides three different units of abstraction to work with. In this post we will give you a basic understanding of each of them and also do some hands-on so that you get a better feel for how they work.
- RDD
- DataFrame
- Dataset
RDD (Resilient Distributed Dataset):
RDD stands for Resilient Distributed Dataset. What does that mean?
Resilient - Immutable and fault tolerant; once created, an RDD cannot be modified.
Distributed - An RDD is distributed over the cluster: it is divided into small chunks called partitions, which are spread across the nodes.
Dataset - It holds the data.
So an RDD is a resilient (immutable), partitioned, distributed collection of data.
It is the most basic data unit in Spark, upon which all operations are performed, and its intermediate results are kept in memory.
When to use RDD?
- If your data is unstructured.
- If you want to work with a functional programming approach.
- If you don't care about imposing a schema.
How to create an RDD?
Common ways to build an RDD:
- Using SparkContext.parallelize, which builds an RDD from an existing collection. Assuming sc is the SparkContext object:
val num = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
val rdd = sc.parallelize(num)
- By using external files or data sources (like a txt file, a csv file, or HDFS); see the sketch after this list.
- Using SparkContext.makeRDD, which works much like parallelize.
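The other two approaches look like this. A minimal sketch, assuming sc is an existing SparkContext; the file path below is only a placeholder:
// Read a text file into an RDD, one element per line (placeholder path)
val linesRdd = sc.textFile("/path/to/data.txt")
// makeRDD behaves much like parallelize for an in-memory collection
val lettersRdd = sc.makeRDD(Seq("a", "b", "c"))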
Operations (or, in simple words, operators) on RDDs
RDDs support two types of operations: transformations and actions.
Transformation – Transformations are functions that take an RDD as input and produce one or more RDDs as output. They do not change the input RDD, since, as we know, RDDs are immutable.
Note: Every transformation returns a new RDD that keeps a pointer to its parent RDD. The resulting RDD is always different from its parent and can be smaller (e.g. filter, distinct, sample), bigger (e.g. flatMap, union, cartesian) or the same size (e.g. map).
All transformations in Spark are lazy, i.e. Spark does not execute them immediately; instead it builds a lineage.
The lineage simply keeps track of all the transformations that have to be applied to that RDD.
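For example, in the sketch below (the path is only a placeholder) none of the lines does any actual work; Spark only records each step in the lineage until an action is called:
// No computation happens here: each call just adds a step to the lineage
val lines = sc.textFile("/path/to/data.txt")
val words = lines.flatMap(_.split(" "))
val longWords = words.filter(_.length > 3)
// Execution is triggered only later, when an action such as count is called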
There are two types of transformations.
- Narrow Transformations
- Wide Transformations
Narrow Transformation - RDD operations like map, union and filter can operate on a single partition and map the data of that partition to a single resulting partition. These kinds of operations are known as narrow operations. A narrow operation does not need to redistribute data across partitions.
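A minimal sketch of narrow transformations, reusing sc from above:
// map and filter are narrow: each output partition depends on exactly
// one input partition, so no data moves between partitions
val nums = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)
val evens = doubled.filter(_ % 4 == 0)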
Wide Transformation - RDD operations like groupByKey, distinct and join may need to map data across partitions in the new RDD. These kinds of operations, which map data from one partition to many partitions, are referred to as wide operations. Wide operations generally redistribute the data across partitions, so a wide transformation is more costly than a narrow one due to data shuffling.
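A minimal sketch of wide transformations; note that reduceByKey combines values within each partition before shuffling, which usually makes it cheaper than groupByKey:
// groupByKey and reduceByKey are wide: rows with the same key may sit in
// different partitions, so Spark shuffles data across the cluster
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val grouped = pairs.groupByKey()       // shuffles all values for each key
val summed = pairs.reduceByKey(_ + _)  // combines locally, then shuffles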
List of common transformations supported by Spark.
Action – A Spark action returns the final result to the driver program or saves it to an external location. Actions trigger the execution of RDD transformations since, as we know, transformations are lazy. Simply put, an action evaluates the lineage graph.
The toDebugString method gives information about an RDD's lineage.
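A minimal sketch putting these pieces together, again assuming sc is an existing SparkContext:
// Build a lineage of lazy transformations
val nums = sc.parallelize(1 to 100)
val squares = nums.map(n => n * n)
val small = squares.filter(_ < 1000)
// Inspect the lineage: prints the chain of parent RDDs
println(small.toDebugString)
// The action finally triggers evaluation of the lineage graph
val result = small.collect()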
List of common actions supported by Spark.