We write our Spark program and execute it to get the desired output, but Spark does a lot of background work to process our data in a distributed manner.
Spark Query Plan
When an action is called on the data, the Spark engine performs the steps below.
Unresolved Logical Plan
In the first step, the query is parsed and a parsed logical plan is generated. This step only checks the query for syntax errors; if the syntax is incorrect, a ParseException is raised. If the syntax check passes, an unresolved logical plan is generated.
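As an illustration, this parsing step can be observed directly from a SparkSession. The sketch below is minimal and makes assumptions: the app name, the "employees" temp view, and the sample rows are all made up for the example.

```scala
import org.apache.spark.sql.SparkSession

// Local SparkSession; the names used here are illustrative only.
val spark = SparkSession.builder().appName("plan-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Register a tiny in-memory table so the later steps have something to resolve against.
Seq((1, "alice", 1000), (2, "bob", 2000)).toDF("id", "name", "salary")
  .createOrReplaceTempView("employees")

// A syntax error ("SELEC") is caught by the parser itself:
// spark.sql("SELEC id FROM employees")  // throws org.apache.spark.sql.catalyst.parser.ParseException

// A syntactically valid query produces a parsed (unresolved) logical plan,
// which can be inspected before resolution and optimization:
val df = spark.sql("SELECT id, salary * 2 AS doubled FROM employees WHERE salary > 1500")
println(df.queryExecution.logical)   // the parsed / unresolved logical plan
```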
Analyzed Logical Plan
After the unresolved plan is generated in step 1, the second step resolves all unresolved references by consulting an internal Spark component called the "Catalog".
"Catalog" is a repository of Spark table, DataFrame, and DataSet. The data from the meta-store is pulled into an internal storage component of Spark (also known as Catalog).
The analyzer rejects the unresolved logical plan when it cannot resolve a column name, table name, etc.; if it can resolve them, it creates a resolved logical plan.
Once everything is resolved successfully, the plan is marked as an "Analyzed Logical Plan".
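A rough sketch of how an analysis failure and the analyzed plan surface in code, continuing from the earlier example (the "employees" view and the df DataFrame are assumed from that sketch):

```scala
// A column that does not exist in the catalog is rejected by the analyzer:
// spark.sql("SELECT not_a_column FROM employees")  // throws org.apache.spark.sql.AnalysisException

// For the valid query, the resolved plan is available as the analyzed logical plan:
println(df.queryExecution.analyzed)
```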
Optimized Logical Plan using Catalyst Optimizer
In this step, the Catalyst optimizer applies a set of built-in optimization rules, e.g., filter pushdown, combining of filters, etc. We can also add our own rules to the Catalyst optimizer if needed.
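A minimal sketch of plugging a user-defined rule into Catalyst through the experimental extraOptimizations hook, continuing from the earlier example. The NoOpRule here is a placeholder that leaves the plan untouched; a real rule would pattern-match on the plan and return a rewritten one.

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A do-nothing rule, only to show where user-defined rules plug in.
object NoOpRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Register the extra rule with the Catalyst optimizer (experimental API).
spark.experimental.extraOptimizations = Seq(NoOpRule)

// Inspect the plan after Catalyst's rule-based optimizations (filter pushdown, etc.).
println(df.queryExecution.optimizedPlan)
```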
Physical Plan
In this step, multiple physical plans are generated, describing how each operation will actually be executed. For example, if an aggregation needs to be performed, one physical plan may use SortAggregate, another may use HashAggregate, etc.
From the multiple physical plans generated, the one with the minimum cost (i.e., the most optimized) is selected.
Based on the selected physical plan, the data is processed by converting the plan into low-level RDD operations.
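A small sketch of inspecting the chosen physical plan, again reusing the "employees" view from above; the aggregation query is illustrative:

```scala
// Aggregations show up as HashAggregate or SortAggregate nodes,
// depending on what the planner chose.
val agg = spark.sql("SELECT name, SUM(salary) AS total FROM employees GROUP BY name")
println(agg.queryExecution.sparkPlan)      // the selected physical plan
println(agg.queryExecution.executedPlan)   // after final preparations (e.g., whole-stage codegen)

// explain(true) prints all of the plans: parsed, analyzed, optimized, and physical.
agg.explain(true)
```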
We can write the application code in Python, Scala, SQL, or R; the Spark engine will convert all of them into low-level RDDs in the end.
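For illustration, the RDD that the selected physical plan compiles down to can be inspected through queryExecution.toRdd, which exposes Spark's internal RDD[InternalRow]. Continuing from the sketch above:

```scala
// The low-level RDD that will actually be executed for the aggregation query.
val lowLevelRdd = agg.queryExecution.toRdd
println(lowLevelRdd.toDebugString)   // shows the RDD lineage behind the physical plan
```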