What is Apache Spark?
Apache Spark is a general-purpose, lightning-fast, open-source cluster computing system that provides high-level APIs in Java, Scala, Python and R. It can access data from HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source, and it can run under the Standalone, YARN and Mesos cluster managers. Spark provides three different units of abstraction to work with. In this post we will give you a basic understanding of each of them and also do some hands-on so that you get a better feel for how they work.
- RDD
- DataFrame
- Dataset
RDD (Resilient Distributed Dataset):
RDD stands for Resilient Distributed Dataset. What does that mean?
Resilient - Immutable and fault tolerant; once created, an RDD cannot be modified.
Distributed - An RDD is distributed over the cluster: it is divided into small chunks called partitions, which are spread across the nodes.
Dataset - It holds the data.
So an RDD is a resilient (immutable), partitioned, distributed collection of data.
It is the most basic data unit in Spark, upon which all operations are performed, and its intermediate results are kept in memory.
When to use RDD?
- If your data is unstructured.
- If you want to work with a functional programming approach.
- If you don't care about imposing a schema.
How to create an RDD?
Common ways to build an RDD:
- Using SparkContext.parallelize, which builds an RDD from an existing collection. Assuming sc is the SparkContext object:
val num = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
val rdd = sc.parallelize(num)
- By using external files or data sources (like a txt file, a csv file, or HDFS); see the sketch after this list.
- Using SparkContext.makeRDD, which works much like parallelize.
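The other two approaches look like this. A minimal sketch, assuming sc is an existing SparkContext; the file path below is only a placeholder:
// Read a text file into an RDD, one element per line (placeholder path)
val linesRdd = sc.textFile("/path/to/data.txt")
// makeRDD behaves much like parallelize for an in-memory collection
val lettersRdd = sc.makeRDD(Seq("a", "b", "c"))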
Operations (or, in simple words, operators) on RDDs
RDDs support two types of operations: transformations and actions.
Transformation – Transformations are functions that take an RDD as input and produce one or more RDDs as output. They do not change the input RDD, since, as we know, RDDs are immutable.
Note: Every transformation returns a new RDD that keeps a pointer to its parent RDD. The resulting RDD is always different from its parent and can be smaller (e.g. filter, distinct, sample), bigger (e.g. flatMap, union, cartesian) or the same size (e.g. map).
All transformations in Spark are lazy, i.e. Spark does not execute them immediately; instead it builds a lineage.
The lineage simply keeps track of all the transformations that have to be applied to that RDD.
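For example, in the sketch below (the path is only a placeholder) none of the lines does any actual work; Spark only records each step in the lineage until an action is called:
// No computation happens here: each call just adds a step to the lineage
val lines = sc.textFile("/path/to/data.txt")
val words = lines.flatMap(_.split(" "))
val longWords = words.filter(_.length > 3)
// Execution is triggered only later, when an action such as count is called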
There are two types of transformations.
- Narrow Transformations
- Wide Transformations
Narrow Transformation - RDD operations like map, union and filter can operate on a single partition and map the data of that partition to a single resulting partition. These kinds of operations are known as narrow operations. A narrow operation does not need to redistribute data across partitions.
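A minimal sketch of narrow transformations, reusing sc from above:
// map and filter are narrow: each output partition depends on exactly
// one input partition, so no data moves between partitions
val nums = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)
val evens = doubled.filter(_ % 4 == 0)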
Wide Transformation - RDD operations like groupByKey, distinct and join may need to map data across partitions in the new RDD. These kinds of operations, which map data from one partition to many partitions, are referred to as wide operations. Wide operations generally redistribute the data across partitions, so a wide transformation is more costly than a narrow one due to data shuffling.
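A minimal sketch of wide transformations; note that reduceByKey combines values within each partition before shuffling, which usually makes it cheaper than groupByKey:
// groupByKey and reduceByKey are wide: rows with the same key may sit in
// different partitions, so Spark shuffles data across the cluster
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val grouped = pairs.groupByKey()       // shuffles all values for each key
val summed = pairs.reduceByKey(_ + _)  // combines locally, then shuffles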
List of common transformations supported by Spark.
Action – A Spark action returns the final result to the driver program or saves it to an external location. Actions trigger the execution of RDD transformations since, as we know, transformations are lazy. Simply put, an action evaluates the lineage graph.
The toDebugString method gives information about an RDD's lineage.
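A minimal sketch putting these pieces together, again assuming sc is an existing SparkContext:
// Build a lineage of lazy transformations
val nums = sc.parallelize(1 to 100)
val squares = nums.map(n => n * n)
val small = squares.filter(_ < 1000)
// Inspect the lineage: prints the chain of parent RDDs
println(small.toDebugString)
// The action finally triggers evaluation of the lineage graph
val result = small.collect()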
List of common actions supported by Spark.