Safe std::vector iterators
A while ago, I had the pleasure of participating in the #cppgameshow, where Peter Bindels won very convincingly in my opinion. You can see the recording here: https://lnkd.in/d4xySUHA.
The challenge that Amir Kirsh presented us was about invalidation of std::vector iterators.
Here is a simple code sample to better understand the issue:
std::vector<int> vec;
vec.push_back(0);
vec.push_back(1);
auto iter1 = vec.begin();
vec.push_back(2);
auto iter2 = vec.begin();
vec.push_back(3);
And the question is which of the two iterators `iter1` and `iter2` is valid? And the answer is that we can’t know based on the standard alone - let us dive a little deeper into that.
According to the standard, calling `push_back` invalidates the past-the-end iterator (which is irrelevant in this case) - but may also cause reserving additional capacity to the vector in which case all the iterators are invalidated. But since the standard doesn’t specify what the allocation strategy should be, it means that we can’t tell which of the iterators, if any, is still valid. Here are three possibilities:
First, if we use the “usual” strategy of doubling the vector capacity each time. Then `push_back(0)` will increase the vector capacity to1, `push_back(1)` will increase it to 2, `push_back(2)` will increase it to 4, but `push_back(3)` will have enough capacity remaining and won’t cause reallocation. So `iter1` is invalidated by `push_back(2)` but `iter2` remains valid after `push_back(3)`.
Second, if we use the minimalist strategy of only increasing the vector capacity by 1 each time (in order to make sure we are not wasting any memory), then each `push_back` has to cause reallocation - and both iterators are invalidated.
Third, if we use a strategy that reserves a small number, say eight, on the first try (in order to avoid many reallocations and copy\move operations that happen as small vectors grow). Then only `push_back(0)` will reserve memory and both iterators are valid.
So looking at the original code sample, we have no way of knowing which iterators are valid without looking at the internal implementation of std::vector - worst, we have no way of asking that in the code itself. But before going on to discuss how we can solve this problem, we should first ask how big of a problem is this anyway? After all, std::vector has been in use for a while now and people seem to manage. I like to think about it as a limitation we’ve learned to work around, and it is based on the observation that we usually don’t mix modifying the std::vector with iterating over it - and the invalidations are a big part of it. That is also why algorithms like `std::erase_if` are so important - they may appear to be simple, but it is all too easy to miss some delicate point. So as long as we keep the iterations, which in this context mean the iterator lifetime, and the modifications separate - this isn’t an issue.But it is a limitation that we may want to lift.
My original idea was to add a layer of abstraction that will keep the iterator valid even if a reallocation has occurred. After all, if we think of an iterator as indicating a position within a vector then reallocation, or changing the content stored within that location? shouldn’t invalidate the iterator. If you want to see a slightly fixed-up version of what I came up with during the game show (I did fix a bug that I had), you can see it here: https://godbolt.org/z/56z13azoc.
The question is, should we view iterators this way? And while my personal answer is yes, an iterator should represent a location and not an element, the standard clearly thinks otherwise - in the standard an iterator is a thinly disguised pointer, and pointers by their very nature are connected to specific memory addresses. Can I give a justification for my position? Well, I think that the fact that the past-the-end iterator is valid but non-dereferenceable would be my strongest argument - but ultimately it all boils down to the mental model that you have in your head as to what iterators are.
And so I came up with a sort-of solution to the problem which is based on changing the semantic of what iterators are. Which is basically dodging the real issue. I was asked to give one thing - an iterator to std::vector that is safe - and came up with something that behaves differently and called it an iterator. Could be a good thing to have, but not exactly what I was asked for.
So now it is time to try and come up with a solution that will actually preserve the semantics. Interestingly enough, I came up with two - so we have something to compare and contrast.
领英推荐
A few notes before we continue. I’ll provide code samples to both options, but they share certain intended limitations. The implementations are based on wrapping std::vector, the reason for that is that I believe this will allow highlighting the additions required to make such solutions work and their implications and avoid being bogged-down with a lot of otherwise necessary implementation details. Also, I’ve only implemented a small part of the interface of std::vector that I thought will suffice to demonstrate that solutions work - but that there’s a lot more to be done before these could be considered workable.
So the first solution is to follow Peter’s example and have a version that indicates when was the last time the element the iterator refers to was invalidated. However, since reallocation is not the only case in which iterators can be invalidated - in my example I use `erase` that invalidates all iterators from a certain location - we need to keep a separate version for each element. That means that in addition to the vector that holds the actual data, we need another vector to hold the versions of each element in the original vector (plus another one for the past-the-end iterator). In my example I hold the versions vector in a shared pointer - which allows my to check if a reallocation has occurred, or if the entire safe-vector has reached end-of-life.
You can see the code here: https://godbolt.org/z/os3cTjTnx.
The good news is that we are guaranteed that the run-time complexity is the same as in regular std::vector. The not-so-good news is that we are now requiring a significant amount of additional memory to manage the version information and that operations that change the vector, while still being of the same order of complexity, will still be more costly by some constant.
The second solution is based on the previous observation that for the most part, iterators’ lifetime doesn’t overlap changes in the vector itself. So we can hold a reference to all the iterators who are alive and invalidate them if we need to. The bug advantage here is that since for the most part we assume that there aren’t any such iterators the additional cost in terms of additional memory or complexity is minimal (a small constant in the “normal” case).
The not so good news is that maintaining that set of iterators is more difficult, especially as iterators are created, copied, moved, and reach end-of-life - all of which needs to be tracked. And that if the assumption the lifetime of iterators is limited and doesn’t overlap vector changes, in other words if we are not in what I called the normal situation, then the overhead in doing operations on the vector will grow as the number of iterators grows - so complexity is less certain.
Again, you can see the code here: https://godbolt.org/z/5183Gx9W8.
And so, we have two possible types of solutions. The first involves an investment in memory? proportionate to the capacity of the vector, and increasing complexity by a known factor. The second involves minimal increase in both memory and complexity in what I called the “normal” case - but at the risk of unbounded increase as the number of valid iterators increase.
So when working with vectors it seems that we are faced with the following options:
Personally, after spending some time going over the semantic of std::vector iterators I think I would like to change them regardless of the safety issue - but that is a topic for another post.
As usual, if you have any questions or comments - please let me know.
Principal SW Engineer @ e.solutions GmbH | Software Architecture, Infotainment
2 年One possible solution would be to utilize on polymorphic allocators, and write your own local allocator that will publish any time when memory ?allocation“ happens, to matching iterator as subscriber. It has nothing to do with particular container, but rather with memory managment https://godbolt.org/z/G41d8jvT1
Senior Software Engineer @ CyberArk | Former Team Leader @ CheckPoint
2 年Great read! Thanks! I wonder, why do you find changing the "semantics" to be problematic? At the end of the day, the iterator will do what it is supposed to do