Demystifying Apache Spark: SparkSession vs SparkContext

Data Saint Consulting Inc
2 min read · Nov 13, 2023

SparkSession vs SparkContext is a common question among PySpark users. Both are entry points to Spark functionality, but they differ in scope and in the APIs they expose.

In earlier versions of Spark and PySpark (1.x), SparkContext was the main entry point: it connects to the Spark cluster and lets you create RDDs, accumulators, and broadcast variables, access Spark services, and run jobs. Through SparkContext you could also obtain a SQLContext or HiveContext, which added functionality for working with structured and semi-structured data¹.

SparkSession was introduced in Spark 2.0 and became the preferred entry point for programming with DataFrames and Datasets, which are higher-level abstractions than RDDs. SparkSession also provides a unified interface to access various data sources and formats, such as Parquet, ORC, JSON, CSV, JDBC, and Hive. SparkSession also integrates with various Spark libraries, such as Spark Streaming, MLlib, and GraphX².

SparkSession internally creates a SparkContext object, which can be accessed through its sparkContext attribute, so you can still use SparkContext methods and features through SparkSession. However, SparkSession also offers some advantages over SparkContext, such as:

- A single, unified entry point: no need to create separate SQLContext or HiveContext objects for SQL and Hive support.
- First-class support for DataFrames and Datasets alongside RDDs.
- A catalog interface (spark.catalog) for inspecting databases, tables, and functions.
- The ability to create multiple independent sessions that share one underlying SparkContext.


For consulting services regarding Data Engineering and Analytics: datasaintconsulting@gmail.com
