Implementing Rayon’s Parallel Iterators - A Tutorial

Posted 2022-10-30

Last Updated 2022-11-01

The rayon library is a brilliant piece of engineering that is so simple to use that one seldom has to descend into its depths. This is the story of how I learned to implement rayon’s ParallelIterator trait for my own type. There are tons of guides on how to use rayon’s parallel iterators and there are a few explanations on how they work under the hood. However, I found no tutorials on how to implement a parallel iterator from the ground up. This article aims to close that gap.

There is a fair bit of complexity around rayon’s parallel iterators and this tutorial cannot explain every nook and cranny. What I’d rather do is give a guide for a not-too-trivial example. It might or might not be enough for your use case, but you’ll have an understanding of the map of the territory either way.

Existing Literature

First, here is a collection of prior art on the subject of implementing parallel iterators. I’ve ordered this in ascending order of usefulness (as perceived by me). I recommend to read this guide first and then go back to the literature referenced in this section. Eventually, reading the source will prove invaluable, though it would not be my first port of call.

Inside the rayon repository, there is a plumbing/README.md. It was too terse as an introduction for me, but it does come in handy as a refresher or if you have prior knowledge. What I found very helpful in understanding how rayon thinks about parallel iterators is the three part blog series (Part 1 - Foundations, Part 2 - Producers, Part 3 - Consumers) by Niko Matsakis, rayon’s creator. It’s a brilliant introduction to this subject and I hope this guide will complement it nicely. We’re going to see the principles applied to an example.

Finally, it’s worth noting that often you don’t really have to implement your own parallel iterator from the ground up because you can use what’s already there in rayon. Here and here are examples of how the par_bridge and par_chunks functionality can be used as quick alternatives to implementing custom iterators. Here is an example of how to make use of rayon’s existing iterators to implement your own iterator with less overhead. But what if it turns out you do have to (or want to) implement a parallel iterator from the ground up? That is where this guide comes in.

Groundwork

First let’s get to know our example and draw a very rough map of the rayon territory.

Our Example

We’ll implement parallel iterators for a collection of some data where sequential iterators are already present. This is a common use case. Our example will be deliberately simple, which is why I use vectors and slices as the underlying ways of storing and accessing our data. Those already give us sequential iterators¹. Note that rayon already has parallel iterators for Vecs and slices, but we will not use them. So we learn how to implement parallel iterators from first principles. For convenience, we’ll use i32s as a stand-in for the data² inside our collection.

type Data = i32;

struct DataCollection {
  data : Vec<Data>,
}

We will make heavy use of the fact that we can split a vector into slices and that there are sequential iterators over slices. Again, we will not exploit rayon’s parallel iterators over slices or Vecs.

Rayon Tour de Force

I am interested in writing an iterator that implements both rayon’s ParallelIterator as well as IndexedParallelIterator, which makes this (in rayon’s terms) a “random access” iterator with an exactly known length. Some of what I am going to say will be true for other types of parallel iterators but some things won’t be, so keep that in mind.

We will start out by writing a structure for the parallel iterator over our data and we’ll see that we can implement all but one required method of ParallelIterator and IndexedParallelIterator pretty easily. For the final piece of the puzzle, we have to understand rayon’s concept of a Producer. It helps to think of rayon as a divide and conquer multithreading library. It wants to split the whole iteration into smaller and smaller chunks, distribute them across threads, then fall back to regular sequential iterators to perform the actual work within the threads. Producers are the glue that allows rayon to understand how to split your iteration into smaller chunks and how to iterate over those chunks sequentially. If that all seems a bit much now, bear with me.

The Implementation

Here we’ll see how to implement parallel iterators that iterate over (mutably or immutably) borrowed data.

(Indexed) Parallel Iterators

Since we are borrowing the data, the easiest way is to provide the iterator with a reference to a slice of the data. Let’s start with iterators over immutable references fist.

struct ParDataIter<'a> {
  data_slice : &'a [Data]
}

I already mentioned that I want to write a parallel iterator that has an exactly known size. So the two traits we have to implement are ParallelIterator and IndexedParallelIterator on our ParDataIter. Let’s start out by implementing all the required methods by just putting a todo!() into each body to appease the compiler. This looks something like this:

impl<'a> ParallelIterator for ParDataIter<'a> {
    type Item = &'a Data;

    fn drive_unindexed<C>(self, consumer: C) -> C::Result
    where C: UnindexedConsumer<Self::Item> {
        todo!()
    }
}

The ParallelIterator trait only has one required method, which seems not that bad, right? The associated type Item is clear, because we want to iterate over references to the data, so we make it &'a Data right away. Now let’s look at the second iterator trait before we go any further:

impl<'a> IndexedParallelIterator for ParDataIter<'a> {
    fn with_producer<CB: ProducerCallback<Self::Item>>(
        self,
        callback: CB,
    ) -> CB::Output {
        todo!()
    }

    fn drive<C: Consumer<Self::Item>>(self, consumer: C) -> C::Result {
        todo!()
    }

    fn len(&self) -> usize {
        todo!()
    }
}

This one has three methods we need to implement. The simplest one is len, which must return the number of elements that this parallel iterator produces. This is just self.data_slice.len() and we’re done. The next two methods we implement are drive_unindexed and drive of ParallelIterator and IndexedParallelIterator, respectively. The three part series by Niko Matsakis linked above gives a great explanation of what the logic behind these methods is. Here, we’ll take a pragmatic approach and look at how rayon goes about implementing these methods. In the parallel iterator Implementation for slices in lines 708 and 721, we can see that both methods are implemented by a simple call to bridge(self, consumer). Interesting! If we look at the documentation of rayon::iter::plumbing::bridge we find:

This helper function is used to “connect” a parallel iterator to a consumer. […] This is useful when you are implementing your own parallel iterators: it is often used as the definition of the drive_unindexed or drive methods.

The last sentence tells us that this is exactly what we need. It is worth noting that bridge requires the first argument (i.e. self) to implement IndexedParallelIterator, which is no problem for us, because that is what we are doing anyways. That lets us fill all but one method with the correct logic. Before we see how to implement with_producer, let’s throw in a low-hanging optimization.

ParallelIterator has a method opt_len(&self)->Option<usize> that returns the length of this iterator if it is known. We can just return Some(self.len()), which calls the len method of self as an implementor of IndexedParallelIterator. In summary, this leaves us with this code:

impl<'a> ParallelIterator for ParDataIter<'a> {
    type Item = &'a Data;

    fn drive_unindexed<C>(self, consumer: C) -> C::Result
    where
        C: UnindexedConsumer<Self::Item> {
        bridge(self,consumer)
    }

    fn opt_len(&self) -> Option<usize> {
      Some(self.len())
    }
}

impl<'a> IndexedParallelIterator for ParDataIter<'a> {
    fn with_producer<CB: ProducerCallback<Self::Item>>(
        self,
        callback: CB,
    ) -> CB::Output {
        todo!()
    }

    fn drive<C: Consumer<Self::Item>>(self, consumer: C) -> C::Result {
        bridge(self,consumer)
    }

    fn len(&self) -> usize {
        self.data_slice.len()
    }
}

So the only thing left to implement is with_producer and we’re done.

Producers

It’s pretty important to understand rayon’s concept of producers when we want to implement parallel iterators. They work in hand in hand with a concept called consumers. Put simply, producers produce elements and consumers consume them. I know: thanks, Captain Obvious! But bear with me. Functions like fold on parallel iterators work with a FoldConsumer under the hood that consumes the produced elements. That’s about all we need to know about consumers here, but we need to dive into producers a little bit more. Producers are described by the rayon documentation like this:

A Producer is effectively a “splittable IntoIterator”. That is, a producer is a value which can be converted into an iterator at any time: at that point, it simply produces items on demand, like any iterator. But what makes a Producer special is that, before we convert to an iterator, we can also split it at a particular point using the split_at method.

So a producer allows us to split the range over which we iterate and it can be made into a sequential iterator at any time. So let’s try and create a producer for the elements of our DataCollection. Let’s create a new structure for the producer ³:

struct DataProducer<'a> {
  data_slice : &'a [Data],
}

To implement the Producer trait for this structure we have to know essentially three things.

What is the sequential iterator into which this producer can be made?
What type of item does said iterator produce?
How can we split the producer into two at a given index?

Let’s start at the top. The sequential iterator should be the iterator over a borrowed slice of Data, i.e. std::slice::Iter<'a,Data>. This implies that the type of item returned from this iterator is &'a Data. It is worth noting that Producer requires the returned iterator to implement both DoubleEndedIterator as well as ExactSizeIterator. This is no problem for us because slice iterators implement both these traits⁴. Finally, we can split our producer by splitting the borrowed slice by using split_at. Our implementation can thus look like this:

impl<'a> Producer for DataProducer<'a> {
    type Item = &'a Data;
    type IntoIter = std::slice::Iter<'a, Data>;

    fn into_iter(self) -> Self::IntoIter {
        self.data_slice.iter()
    }

    fn split_at(self, index: usize) -> (Self, Self) {
        let (left, right) = self.data_slice.split_at(index);
        (
            DataProducer { data_slice: left },
            DataProducer { data_slice: right },
        )
    }
}

And just like that we have our producer. Let’s add some convenience functionality to go from a parallel iterator to a producer. This will come in handy momentarily.

impl<'a> From<ParDataIter<'a>> for DataProducer<'a> {
    fn from(iterator: ParDataIter<'a>) -> Self {
        Self {
            data_slice: iterator.data_slice,
        }
    }
}

Finally, we can revisit our implementation of IndexedParallelIterator for ParDataIter and fill in the missing piece.

impl<'iter> IndexedParallelIterator for ParDataIter<'iter> {
    fn with_producer<CB: ProducerCallback<Self::Item>>(
        self,
        callback: CB,
    ) -> CB::Output {
        let producer = DataProducer::from(self);
        callback.callback(producer)
    }

// --- other methods unchanged ---
// [...]
}

If you are wondering what on earth a producer callback is, I recommend to read the appropriately titled section “What on earth is ProducerCallback” in the rayon README. For us as implementors, we just need to remember to invoke that callback function on a producer that we create from our iterator. We do that by using the slightly awkward (but very cleverly designed) callback.callback(producer) syntax.

Usage and Ergonomics

Now we have sucessfully implemented a parallel iterator. There is one final thing we need to do before we can use it. We have to expose an interface on our data structure to get one. We can for example expose a member function

impl DataCollection {
    pub fn parallel_iterator(&self) -> ParDataIter {
    ParDataIter {
      data_slice : &self.data,
    }
  }
}

Now we can call this function on our collection to obtain a parallel iterator. That is a perfectly valid way to obtain parallel iterators. As a matter of fact, for structures that have more than one way to iterate over their data it is good practice to implement descriptive member functions that return different kinds of parallel iterators. Think e.g. of a matrix that can have element wise, column wise, and row wise parallel iterators.

However, if there is just one reasonable way of iterating over the elements in our data structure, rayon has a nice feature through blanket implementations. There is a trait IntoParallelRefIterator that exposes a function par_iter() that iterates over references of elements. We don’t implement this trait directly, but we get it for free when implementing IntoParallelIterator for &DataCollection. So let’s do that:

impl<'a> IntoParallelIterator for &'a DataCollection {
    type Iter = ParDataIter<'a>;
    type Item = &'a Data;

    fn into_par_iter(self) -> Self::Iter {
        ParDataIter { data_slice: &self.data }
    }
}

Example Code on the Rust Playground

You can find all the code plus more on the playground. The playground also code includes an implementation of iterators over mutable data. Putting everything together, allows us to do something like this:

fn main() {
    let mut data = DataCollection{
      data : vec![1, 2, 3, 4]
    };

    data
    .par_iter_mut()
    .for_each(|x| *x = -*x);

    println!("data = {:?}", data);

    let sum_of_squares: Data = data
      .par_iter()
      .map(|x| x * x)
      .sum();
      
    println!("sum = {}", sum_of_squares);
}

Parallel Iterators for Mutable Data

So far we have only seen an implementation to immutably iterate over our data. The good thing is that adding parallel iterators for mutable data is dead simple, because we can just replace all our &'a with &'a mut for mutable iteration. So what we do is create a second iterator for mutable iteration ParDataIterMut that references a mutable slice. We implement the two iterator traits just as above. That means we’ll have to create an analogous DataProducerMut, plug everything together again and voilà we’re done⁵. The playground link above has the code for mutable iterators as well.

Conclusion

Multithreading is hard and it is a testament to the genius design of the rayon library, that so much of the complexity is abstracted away from us. Ninetynine percent of the time we can just replace iter() for par_iter() and enjoy the magic. We usually don’t have to know how to implement our own parallel iterators, but if you find yourself wanting to, I hope this tutorial has given you an idea of how to go about it. Now is probably a good time to look at all the prior art that I mentioned at the beginning of this article, if you haven’t already.

Endnotes

As a matter of fact, those iterators implement ExactSizeIterator as well as DoubleEndedIterator, which will be important later. ↩
Note that i32 is Send, which is also important later. ↩
The astute reader will have noticed that the producer here looks just like the parallel iterator structure. I have seen this code duplication inside of the rayon codebase as well. There’s no reason why we could not implement the Producer trait on our parallel iterator type. This will not work for every use case, but it certainly would for our particular example. I have also written parallel iterators where I was able to modify the sequential iterators so that they implemented the Producer trait. This, of course, is only possible if you own the codebase that contains the sequential iterators. We’ll stick to the most general case here and won’t bother ourselves with reducing code duplication too much. ↩
If the sequential iterator for your use case does not implement these traits, this gets trickier. You can either try and implement them for the iterator (if you own the codebase) or create a new sequential iterator that implements them, possibly by wrapping the existing one. ↩
While it truly is that simple for our case, it does not have to be in all cases. The borrow checker might complain about certain code containing mutable references that it will accept for immutable references. ↩

rust rayon parallel concurrency multi-threading

Comments

You can comment on this post using your GitHub account.

Join the discussion for this article on this ticket. Comments appear on this page instantly.