How to efficiently re-partition Spark DataFrames
How to increase or decrease the number of partitions of a Spark DataFrame
Apache Spark is a framework that enables the processing of enormous amounts of data in a reasonable amount of time. The efficiency of this unified engine largely depends on its ability to distribute and parallelise the work performed over a collection of data.
In this article, we are going to introduce partitions in Spark and explain how to re-partition DataFrames. We will also discuss when it is worth increasing or decreasing the number of partitions of a Spark DataFrame in order to optimise the execution time as much as possible.
Spark Partitioning in a nutshell
In order to achieve high parallelism, Spark splits the data into smaller chunks called partitions, which are distributed across different nodes in the Spark Cluster. Every node can have more than one executor, each of which can execute a task.
Distributing the work across multiple executors requires the data to be partitioned and spread across those executors, so that the work can be done in parallel and the data processing for a specific job can be optimised.
How to get current number of partitions
Before jumping into re-partitioning, it is worth describing how to get the current number of partitions of a Spark DataFrame. As an example, let's assume that we have the following minimal Spark DataFrame.
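Such a DataFrame can be constructed in many ways; the snippet below is just one minimal sketch, where the column names and sample values are arbitrary placeholders.

from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession
spark = SparkSession.builder.appName("repartition-example").getOrCreate()

# A minimal DataFrame with a handful of rows and two arbitrary columns
df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
    ["id", "letter"],
)

df.show()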
In order to get the number of partitions of the above DataFrame, all we have to do is run the following:
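Assuming the df sketched above, the partition count can be read from the DataFrame's underlying RDD:

# A DataFrame exposes its partition count through its underlying RDD
df.rdd.getNumPartitions()
# e.g. 8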
Note that the output is dependent on your current setup and configuration, so you might see a different output.
How to increase the number of partitions
If you want to increase the number of partitions of your DataFrame, all you need to do is call the repartition() function.
Returns a new DataFrame partitioned by the given partitioning expressions. The resulting DataFrame is hash partitioned.
The code below will increase the number of partitions to 1000:
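A minimal sketch, assuming the df defined earlier:

# repartition() performs a full shuffle and hash-partitions the data
df = df.repartition(1000)

df.rdd.getNumPartitions()
# 1000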
How to decrease the number of partitions
Now if you want to repartition your Spark DataFrame so that it has fewer partitions, you can still use repartition(); however, there is a more efficient way to do so.
coalesce() results in a narrow dependency, which means that when it is used to reduce the number of partitions, no shuffle is performed. Shuffling is probably one of the most costly operations in Spark.
Returns a new DataFrame that has exactly N partitions.
In the example below we limit our partitions to 100. The Spark DataFrame, which originally has 1000 partitions, will be repartitioned to 100 partitions without shuffling. By no shuffling we mean that each of the 100 new partitions will be assigned 10 of the existing partitions, so the data does not need to move across the cluster. Therefore, it is way more efficient to call coalesce() when one wants to reduce the number of partitions of a Spark DataFrame.
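Continuing the sketch from above, where df now has 1000 partitions:

# coalesce() merges existing partitions without performing a full shuffle
df = df.coalesce(100)

df.rdd.getNumPartitions()
# 100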
Conclusion
In this article we discussed how data processing is optimised through partitions, which allow the work to be distributed across the executors of a Spark Cluster. We also explored the two ways one can increase or decrease the number of partitions of a DataFrame.
repartition() can be used to increase or decrease the number of partitions of a Spark DataFrame. However, repartition() involves shuffling, which is a costly operation.
On the other hand, coalesce() can be used when we want to reduce the number of partitions, and it is more efficient because it does not trigger a data shuffle across the nodes of the Spark Cluster.