How to efficiently re-partition Spark DataFrames

How to increase or decrease the number of partitions of a Spark DataFrame

Photo by Mae Mu on Unsplash

Apache Spark is a framework that enables the processing of enormous amounts of data in a reasonable amount of time. The efficiency of this unified engine is hugely dependent on its ability to distribute and parallelise the work performed over a collection of data.

In this article, we are going to introduce partitions in Spark and explain how to re-partition DataFrames. We will also discuss when it is worth increasing or decreasing the number of partitions of a Spark DataFrame in order to optimise the execution time as much as possible.

Spark Partitioning in a nutshell

In order to achieve high parallelism, Spark splits the data into smaller chunks called partitions, which are distributed across different nodes in the Spark Cluster. Every node can have more than one executor, each of which can execute tasks.

The distribution of the work into multiple executors requires data to be partitioned and distributed across the executors, so that the work can be done in parallel in order to optimise the data processing for a specific job.

How to get current number of partitions

Before jumping into re-partitioning, it is worth describing how to get the current number of partitions of a Spark DataFrame. As an example, let’s assume that we have a minimal Spark DataFrame called df that holds a handful of rows.

To get the number of partitions of this DataFrame, all we have to do is call df.rdd.getNumPartitions(), which returns the partition count of the underlying RDD.

Note that the output depends on your current setup and configuration, so you might see a different number.
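The steps above can be sketched in PySpark roughly as follows. The sample rows, the column names, the local[4] master and the helper names (make_example_df, count_partitions, run_example) are assumptions made for illustration and are not taken from the original example. The calls being demonstrated are createDataFrame() and rdd.getNumPartitions().

```python
def make_example_df(spark):
    """Build a tiny two-column DataFrame (the sample rows are made up)."""
    rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
    return spark.createDataFrame(rows, ["id", "label"])


def count_partitions(df):
    """Return how many partitions currently back the DataFrame."""
    return df.rdd.getNumPartitions()


def run_example():
    """Start a local session, build the DataFrame and print its partition count."""
    # Imported here so the two helpers above stay usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[4]")  # assumption: four local worker threads
        .appName("partition-count")
        .getOrCreate()
    )
    try:
        df = make_example_df(spark)
        # The printed value depends on the master setting and Spark configuration.
        print(count_partitions(df))
    finally:
        spark.stop()
```

Calling run_example() on a machine with PySpark installed prints the partition count for that particular setup.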

How to increase the number of partitions

If you want to increase the number of partitions of your DataFrame, all you need to do is call the repartition() method.

Returns a new DataFrame partitioned by the given partitioning expressions. The resulting DataFrame is hash partitioned.

For example, df = df.repartition(1000) returns a new DataFrame with 1000 partitions.
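A minimal PySpark sketch of that call might look like this. The helper names (increase_partitions, run_example), the local[4] master and the use of spark.range() to create some data are assumptions for illustration. The only call the text itself relies on is repartition().

```python
def increase_partitions(df, num_partitions=1000, *columns):
    """Return a new DataFrame with `num_partitions` partitions.

    Without columns, Spark spreads the rows evenly over the new partitions;
    with columns, rows are hash-partitioned on those columns. Both variants
    shuffle data between executors.
    """
    return df.repartition(num_partitions, *columns)


def run_example():
    """Repartition a generated DataFrame and print the new partition counts."""
    # Imported here so increase_partitions stays usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[4]")  # assumption: four local worker threads
        .appName("repartition-demo")
        .getOrCreate()
    )
    try:
        df = spark.range(1_000_000)  # single column named "id"
        by_count = increase_partitions(df)               # df.repartition(1000)
        by_column = increase_partitions(df, 1000, "id")  # df.repartition(1000, "id")
        print(by_count.rdd.getNumPartitions(), by_column.rdd.getNumPartitions())
    finally:
        spark.stop()
```

Because repartition() returns a new DataFrame rather than modifying the existing one, the result has to be assigned to a variable to be used.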

How to decrease the number of partitions

Now, if you want to repartition your Spark DataFrame so that it has fewer partitions, you can still use repartition(). However, there’s a more efficient way to do so: coalesce().

coalesce() results in a narrow dependency, which means that when it is used to reduce the number of partitions it does not trigger a shuffle, probably one of the most costly operations in Spark.

Returns a new DataFrame that has exactly N partitions.

As an example, df.coalesce(100) limits our partitions to 100. A Spark DataFrame that originally has 1000 partitions will be reduced to 100 partitions without shuffling. By no shuffling we mean that each of the 100 new partitions will be assigned 10 existing partitions. Therefore, it is far more efficient to call coalesce() when one wants to reduce the number of partitions of a Spark DataFrame.
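To make the difference concrete, here is a conceptual model in plain Python. It is not Spark's implementation: the function names, the contiguous grouping of partitions and the round-robin redistribution are simplifying assumptions. It only illustrates that coalescing merges whole partitions, while a full repartition moves rows individually.

```python
def coalesce_model(partitions, target):
    """Merge whole partitions into `target` groups without splitting any of them."""
    if target >= len(partitions):
        # Like coalesce(), this model never increases the partition count.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(target)]
    for index, partition in enumerate(partitions):
        # Contiguous grouping: partitions 0-9 go to group 0, 10-19 to group 1, ...
        merged[index * target // len(partitions)].extend(partition)
    return merged


def repartition_model(partitions, target):
    """Redistribute every row individually (round-robin), as a full shuffle would."""
    reshuffled = [[] for _ in range(target)]
    position = 0
    for partition in partitions:
        for row in partition:
            reshuffled[position % target].append(row)
            position += 1
    return reshuffled


def source_partitions(partition):
    """Return the set of original partition ids that contributed rows."""
    return {source for source, _ in partition}


# 1000 input partitions with 5 rows each; every row remembers where it came from.
original = [[(p, r) for r in range(5)] for p in range(1000)]
coalesced = coalesce_model(original, 100)
reshuffled = repartition_model(original, 100)

print(len(coalesced), max(len(source_partitions(p)) for p in coalesced))    # 100 10
print(len(reshuffled), max(len(source_partitions(p)) for p in reshuffled))  # 100 50
```

In this model each coalesced partition is built from exactly 10 original partitions, whereas each repartitioned one receives rows from 50 of them, which is the kind of all-to-all data movement a shuffle involves.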

Conclusion

In this article we discussed how data processing is optimised through partitions that allow the work to be distributed across the executors of a Spark Cluster. We also explored the two methods that can be used to increase or decrease the number of partitions of a DataFrame.

repartition() can be used for increasing or decreasing the number of partitions of a Spark DataFrame. However, repartition() involves shuffling which is a costly operation.

On the other hand, coalesce() can be used when we want to reduce the number of partitions. It is more efficient because it won’t trigger data shuffling across the nodes of the Spark Cluster.

My Favorite Mac Utilities

Five simple utilities that make my Mac that much better.

MacBook Air with my five favorite utilities.

I like to write about hardware a lot, like laptops, headphones, or even pens and notebooks. What I don’t talk about enough are applications or tools that I use on those pieces of hardware.

I have been using a Mac for almost ten years now, besides the times in-between where I used iPads full-time. Over that time, I have picked up on a few tools, utilities, and apps that I like to use regularly.

For this post, I would like to go over the utilities that I use. Ones that are immediately reinstalled on any new MacBook. Some of these might be considered applications, but I see them as tools that get me to where I want to go or something I set up that runs in the background without needing me.

I will continue with more posts to go over applications and other processes and tools that I use, but I wanted to start simple for now. So here are my favorite utilities that I constantly use on my Mac.

Bartender 4 ($15)

This utility has been around for ages. It was around way before I even got into using a Mac, my computer of choice. The utility is precisely what its name says; it is a “tender” of the menu “bar.”

It manages all of the menu bar icons for apps and other utilities on my Mac. What once looked like this:

Showing menu bar when Bartender is expanded showing all icons.

Now it looks like this:

Showing menu bar when Bartender is activated, hiding icons.

Like I said before, this has been around for a while. The simple feature is that it hides all of your menu items to give you a cleaner-looking menu bar. You also have a few options on what you want the main icon to be, like a pair of sunglasses, three dots, or a star.

Bartender doesn’t stop there, though; you are also given an abundance of options for how you want each menu icon to behave. You can hide specific menu icons inside Bartender, keep them from showing at all, or make them always visible.

Bartender Preferences.

Bartender 4 is available for macOS Big Sur and is currently in a Public Beta, so it is free whilst still in beta. Pretty soon, though, it will cost $15 once the beta is no longer active. For anyone using an older macOS version, Bartender 3 is still available to purchase.

It is a simple tool that provides a simple solution that keeps your menu bar clear of clutter. I know many have a ton of menu icons and find Bartender even more useful to manage and maintain.

Magnet ($7.99)

Window management on Mac is something you either love or hate. Many love iPadOS because window management is very structured and limited. On the Mac, you can overlap windows for days which can create chaos for some.

Magnet centering the Finder window.

Magnet tries to help with window management on the Mac by providing options to move and resize windows with a simple click. A few examples of how I like to use Magnet would be wanting to center something on my desktop or make the app go full-size without entering full-screen mode.

I like to maximize Ulysses instead of going full screen, so I have the menu bar at the top always visible for me. All it takes is for me to open the app then click the maximize option in Magnet. On a separate desktop, I like to have Twitter taking up one-third of the screen and email taking up the other two-thirds of the screen.

Using Magnet to Maximize Ulysses app.

Magnet is excellent if you like to have many windows and want a quick way of organizing or resizing without having to drag the windows around manually.

Spotlight (Quicksilver)

Triggering Spotlight on macOS Big Sur.

Ever since I memorized the Command+Space keyboard shortcut to bring up Spotlight, I have launched apps only this way and have set my Dock to auto-hide. I know Spotlight has been around for a while, but it wasn’t until a couple of years ago that I started using it more.

Typing Ulysses into Spotlight.

What is excellent about Spotlight is that you can do more than just search for apps and launch them. You can search for documents, emails, music, and even search the web for items straight from the text box.

For many years Spotlight was limited in the features it provided and could only search for local things. But over time, it has grown and now offers many things that Quicksilver has done for many years. Quicksilver also provides automation and other features, but for a simple search tool for everything on your Mac, Spotlight does the job well.

Typing Billions into Spotlight.

1Password ($6.99 monthly, $59.99 annually)

Passwords are a pain to keep track of, especially when you want to have safe passwords for each one of your online services. Apple’s iCloud Keychain is excellent, but I wanted something that I could share with my wife.

1Password application.

Not to go too dark, but this utility and the next are things that give me peace of mind not only in case of my devices breaking but if I were to disappear somehow. Death is never something many like to think about but, for me, knowing that I have certain things set up for my wife to handle a difficult situation with a bit more ease is worth it.

1Password available on the menu bar.

1Password offers a personal vault for me to keep track of all the logins I could ever think of; it even lets me set up two-factor authentication for those logins and will copy the one-time password when I use 1Password on my iPhone and iPad.

1Password Safari Extension auto-filling logins on Grammarly website.

The best part is shared vaults, though. Having all of our more crucial logins in a shared vault is so useful for my wife and me, not only if something happens to one of us and we need a login only one of us has, but also when I need to log in to something of hers and she can’t log in herself. I have access with a simple click.

Backblaze ($6 a month)

This utility is purely for peace of mind. I have had Backblaze on my MacBook and my wife’s ever since I learned about its existence years ago. It is a cheap offsite backup service that just works.

Backblaze Preferences window.

For many years it offered unlimited backups for a single computer and any external drives plugged in for $5 a month. It has recently upped that price to $6, but I still think it is a killer deal. I not only have my over 300 GB of data on my MacBook Air backed up, but I also have my 2 TB of data on my external hard drive backed up as well. All for only $6 for me.

Actually, it is possible to pay $6 for what I am doing, but I have also recently opted for another new feature that Backblaze offers for an additional $2. That extra feature is 1-Year Version History, which safeguards any deleted data for up to 1 year.

So if I were to accidentally delete a file today and realize in six months that I needed it, Backblaze would still have it available in my backup to recover. They also offer Forever Version History, meaning that you can recover any file you have ever deleted as long as you are still a customer with them.

The Forever Version History is also $2 a month and charges $0.005 per GB per month. If I were to opt into the Forever option for my wife and me, the total would be about $12 for the standard base cost plus $4 + $13.50 (2.7 TB, or 2,700 GB, multiplied by $0.005), for a total of about $30 a month.

Backblaze options available in menu bar.

$30 for 2.7 terabytes of data and Version History forever doesn’t sound so bad, but it is a little overboard for us. The $16 a month that we pay now for our laptops and 1-Year Version History is good enough.

It is an excellent option to be aware of, though, if you have a lot of data or have many family members that may benefit from long-term data recovery. For example, if you have highly sensitive data, like a manuscript for a book that has been worked on for many years, you will want to ensure you don’t lose anything.

I have many other utilities that I use, mostly ones that come with macOS, but these are the ones that I really could not live without. My heart isn’t dead set on these specific versions either, since my primary needs are what these utilities offer.

Keeping my menu bar organized and windows managed, the ability to search for anything on my Mac with a simple keyboard command, a password manager, and reliable offsite backup software are the things that make my Mac the best that it is.

There are other utilities out there that can do similar things, but these are the ones that have worked for me consistently for years. The ultimate goal of having utilities on your Mac is to make it better — for your computer to work for you and not the other way around.

So if you have wanted to make your Mac more useful, I would check out some of the utilities I listed above or any other tools that may fit your needs more. The most crucial part is that a utility does not get in your way but just works for you.

Multivariate Outlier Detection in Python

Multivariate Outliers and Mahalanobis Distance in Python

Mahalanobis Distance

Figure 1 — Euclidean distance vs Mahalanobis distance (Image by author)

Figure 2 — Euclidean distance vs Mahalanobis distance (Image by author)

Formula 1 — Mahalanobis distance between two points

Mahalanobis Distance with Python

Figure 3 — Outliers in Temp — Ozone variables (Image by author)

What is Next?

Interoperable Python and SQL in Jupyter Notebooks

Using SQL on top of Pandas, Spark, and Dask

First look at FugueSQL in Jupyter

Motivation

Enhancements Over ANSI SQL

Variable assignment with DataFrames

Comparison to ipython-sql

Fugue Logo

Distributed Compute with Spark and Dask

Simple median function with Pandas

Prepartition and Spark to get the medians

Conclusion and More Examples

Setup in Notebooks

pip install fugue
from fugue_notebook import setup
setup()

Contact Us