Demystifying Apache Spark: SparkSession vs SparkContext
SparkSession vs SparkContext is a common question among PySpark users. Both are entry points to Spark functionality, but they differ in scope and in the APIs they expose.
SparkContext was the main entry point for Spark programming in earlier versions of Spark and PySpark: it connects your application to the cluster and lets you create RDDs, accumulators, and broadcast variables, access Spark services, and run jobs. From a SparkContext you could also create a SQLContext or HiveContext, which provided additional functionality for working with structured and semi-structured data¹.
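Here is a minimal sketch of that classic SparkContext workflow, assuming a local cluster; the app name and sample data are illustrative:

```python
from pyspark import SparkConf, SparkContext

# Classic Spark 1.x-style entry point: build a config, then a SparkContext.
conf = SparkConf().setAppName("rdd-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4])   # RDD created from the SparkContext
factor = sc.broadcast(10)            # read-only value shipped to executors
counted = sc.accumulator(0)          # counter aggregated back on the driver

def scale(x):
    counted.add(1)
    return x * factor.value

print(rdd.map(scale).collect())      # [10, 20, 30, 40]
print(counted.value)                 # 4, visible on the driver after the action
sc.stop()
```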
SparkSession was introduced in Spark 2.0 and became the preferred entry point for programming with DataFrames and Datasets, which are higher-level abstractions than RDDs. It provides a unified interface to various data sources and formats, such as Parquet, ORC, JSON, CSV, JDBC, and Hive, and it integrates with Spark libraries such as Spark Streaming, MLlib, and GraphX².
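A minimal sketch of the SparkSession entry point and its unified readers; the file paths and the name column are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

# Spark 2.0+ entry point: the builder pattern replaces manual SparkConf wiring.
spark = (
    SparkSession.builder
    .appName("dataframe-demo")
    .master("local[*]")
    .getOrCreate()
)

# The same spark.read interface handles many formats.
df_json = spark.read.json("people.json")                      # hypothetical path
df_csv = spark.read.option("header", True).csv("people.csv")  # hypothetical path

df_json.select("name").show()   # assumes the JSON records carry a "name" field
spark.stop()
```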
SparkSession internally creates a SparkContext object, which can be accessed through its sparkContext attribute, so SparkContext methods and features remain fully available through a SparkSession, as the sketch below shows.
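A minimal sketch of dropping down to the RDD API from a SparkSession; the app name and sample data are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("context-demo").master("local[*]").getOrCreate()

sc = spark.sparkContext                 # the SparkContext the session created
rdd = sc.parallelize(["a", "b", "c"])   # RDD API still works alongside DataFrames
print(rdd.count())                      # 3
spark.stop()
```

Beyond that compatibility, SparkSession also offers some advantages over SparkContext, such as: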