-
-
Notifications
You must be signed in to change notification settings - Fork 32.9k
gh-139122: Reimplement base UUID type, uuid4(), and uuid7() in C #139123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
… in C The C implementation considerably boosts the performance of the key UUID operations: ------------------------------------ Operation Speedup ------------------------------------ uuid4() generation 15.01x uuid7() generation 29.64x UUID from string 6.76x UUID from bytes 5.16x str(uuid) conversion 6.66x ------------------------------------ Summary of changes: * The UUID type is reimplemented in C in its entirety. * The pure-Python is kept around and is used of the C implementation isn't available for some reason. * Both implementations are tested extensively; additional tests are added to ensure that the C implementation of the type follows the pure Python implementation fully. * The Python implementation stores UUID values as int objects. The C implementation stores them as `uint8_t[16]` array. * The C implementation supports unpickling of UUIDs created with Python 2 using protocols starting with 0. That necessitated a small fix to the `copyreg` module (the change is only affecting legacy pickle pathway). * The C implementation has faster hash() implementation but also caches the computed hash value to speedup cases when UUIDs are used as set/dict keys. * The C implementation has a freelist to make new UUID object instantiation as fast as possible. * uuid4() and uuid7() are now implmented in C. The most performance boost (10x) comes from overfetching entropy to decrease the number of _PyOS_URandom() calls. On its own it's a safe optimization with the edge case that Unix fork needs to be explicitly handled. We do that by comparing the current PID to the PID of when the random buffer was populated. * Portions of code are coming from my implementation of faster UUID in gel-python [1]. I did use AI during the development, but basically had to rewrite the code it generated to be more idiomatic and efficient. * The benchmark can be found here [2]. * This PR makes Python UUID operations as fast as they are in NodeJS and Bun runtimes. [1] https://github.com/MagicStack/py-pgproto/blob/b8109fb311a59f30f9947567a13508da9a776564/uuid.pyx [2] https://gist.github.com/1st1/f03e816f34a61e4d46c78ff98baf4818
2d0a381
to
c4363f8
Compare
fa12192
to
4a5c2a2
Compare
Note to reviewers:
|
|
||
*counter = (((uint64_t)(high & 0x1FF) << 32) | (low >> 32)) & 0x1FFFFFFFFFF; | ||
*tail = (uint32_t)low; | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't tested thoroughly (mostly looking at GodBolt output), but this ought to generate more efficient code (and there might be more ways to generate the right values directly, I just didn't go that far):
struct {
uint16_t high;
uint32_t low1;
uint32_t low2;
} rand_bytes;
if (gen_random(state, (uint8_t*)&rand_bytes, sizeof(rand_bytes)) < 0) {
return -1;
}
*counter = (((uint64_t)(rand_bytes.high & 0x1FF) << 32) | (rand_bytes.low1)) & 0x1FFFFFFFFFF;
*tail = rand_bytes.low2;
uint8_t bytes[16]; | ||
uint64_t timestamp_ms, counter; | ||
uint32_t tail; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ought to be able to do a similar struct/union trick here for faster conversion.
goto fail; | ||
} | ||
|
||
uuid_mod = PyImport_ImportModule("uuid"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we possibly simplify this whole concept (caching the values here) by pushing the queries down to a Python subclass of _uuid.UUID
? So that the core create/parse/etc. functionality is entirely native, and the more esoteric queries are written in Python as class uuid.UUID(_uuid.UUID)
? (Answer can be "no" if it's a bad idea for reasons I haven't considered)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Steve, I really tried hard to simplify it.
- [1] is my attempt, where the C implementation defines the fundamental ops and the rest is kept in Python.
- despite me pushing it really hard and even implementing freelist for Python-land classes (probably was first such perversive attempt in the stdlib), the performance of UUID instantiation takes a big hit. Benchmarks became 10-20% slower just because the UUID type is inherited from a Python base.
- Overall savings were about 300 lines of C. Which is considerable, but IMO making UUID as fast as we can is more important.
So let's keep this PR approach as is.
That said, good news! I figured out how to fully support pickle without my prior shenanigans with copyreg, so now the C implementation is a full drop-in replacement. 🚀
I would be happy to have a C implementation of UUID but for reviewing purposes, may I suggest that we first focus on implementing the wrapper in C in one PR and have different PRs for each UUID? it would be much easier to focus on the algorithm to review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the first round comments I have. I realy want multiple PRs because the module becomes really huge and there are parts that I don't think we need to reimplement in C.
@@ -219,13 +242,21 @@ def __init__(self, hex=None, bytes=None, bytes_le=None, fields=None, | |||
raise ValueError('badly formed hexadecimal UUID string') | |||
int = int_(hex, 16) | |||
elif bytes_le is not None: | |||
if not isinstance(bytes_le, bytes_): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will make a behavioural change so let's not change this here (previously there would be an assertion error). Let's do it in a separate PR (and remove the assert at the same time)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't view this as a behavioral change. It was an error condition before, it's now handled properly. I wouldn't touch this code if I didn't have to re-implement it in C; re-implementing "assert" statement behavior is just counterproductive. I'd do that, if it was really a behavioral change, but I strongly believe it is not.
Sorry, I'm not going to do multiple PRs, I've no time for that. But I will address your comments here for sure! |
…7() in C The C implementation considerably boosts the performance of the key UUID operations: ------------------------------------ Operation Speedup ------------------------------------ uuid4() generation 15.01x uuid7() generation 29.64x UUID from string 6.76x UUID from bytes 5.16x str(uuid) conversion 6.66x ------------------------------------ Summary of changes: * The UUID type is reimplemented in C in its entirety. * The pure-Python is kept around and is used of the C implementation isn't available for some reason. * Both implementations are tested extensively; additional tests are added to ensure that the C implementation of the type follows the pure Python implementation fully. * The Python implementation stores UUID values as int objects. The C implementation stores them as `uint8_t[16]` array. * The C implementation has faster hash() implementation but also caches the computed hash value to speedup cases when UUIDs are used as set/dict keys. * The C implementation has a freelist to make new UUID object instantiation as fast as possible. * uuid4() and uuid7() are now implmented in C. The most performance boost (10x) comes from overfetching entropy to decrease the number of _PyOS_URandom() calls. On its own it's a safe optimization with the edge case that Unix fork needs to be explicitly handled. We do that by comparing the current PID to the PID of when the random buffer was populated. * Portions of code are coming from my implementation of faster UUID in gel-python [1]. I did use AI during the development, but basically had to rewrite the code it generated to be more idiomatic and efficient. * The benchmark can be found here [2]. * This PR makes Python UUID operations as fast as they are in NodeJS and Bun runtimes. [1] https://github.com/MagicStack/py-pgproto/blob/b8109fb311a59f30f9947567a13508da9a776564/uuid.pyx [2] https://gist.github.com/1st1/f03e816f34a61e4d46c78ff98baf4818
I would REALLY appreciate smaller PRs. Usually this is our (current) workflow, namely incrementing changes. I think other core devs would agree with me here (@vstinner, @serhiy-storchaka and @encukou). Note that whatever we do, this is a feature that will only be included in 3.15 so I can also make the small PRs once I am back at home. Changing 3.14 is no more possible unless the RM gives their approval here (cc @hugovk, who could also whether he prefers one or multiple PR) as performance improvements are usually considered as new features (especially if we are adding a C implementation). |
The current PR layout:
I'm usually also for making smaller PRs, but objectively there's no point for that right here; I don't see how reviewing this would be fundamentally any different if instead of one PR with 1700 lines in uuidmodule.c, you'd have 3 PRs, one with 1600 lines and another with like 100 lines, and yet another with 30. Reviewing time would be the same, no? What's the point? If you absolutely insist I can do this, but tbqh I really don't understand why you're pushing for this so hard. It would create a lot of work for me and not necessarily make reviewers more happy.
I don't think this is 3.14 material and I wasn't pushing for that. 3.15 is fine (unless other people but me want this to be in 3.14, in which case I won't say no). That said I'd like to see my work through and would love to still merge this myself. |
uint64_t random_idx; | ||
uint64_t random_last_pid; | ||
|
||
// A freelist for uuid objects -- 15-20% performance boost. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a generic freelist implementation for python objects (see https://github.com/python/cpython/blob/7257b24140ac1b39fb8cfd4610134ec79575a396/Include/internal/pycore_freelist_state.h). Unless there are strong reasons not to, the uuid object should use the generic implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly I can't use that, as uuidobject.c is compiled as a shared lib.
The C implementation considerably boosts the performance of the key UUID operations:
Summary of changes:
The UUID type is reimplemented in C in its entirety.
The pure-Python is kept around and is used of the C implementation isn't available for some reason.
Both implementations are tested extensively; additional tests are added to ensure that the C implementation of the type follows the pure Python implementation fully.
The Python implementation stores UUID values as int objects. The C implementation stores them as
uint8_t[16]
array.The C implementation supports unpickling of UUIDs created with Python 2 using protocols starting with 0.
The C implementation has faster hash() implementation but also caches the computed hash value to speedup cases when UUIDs are used as set/dict keys.
The C implementation has a freelist to make new UUID object instantiation as fast as possible.
uuid4() and uuid7() are now implmented in C. The most performance boost (10x) comes from overfetching entropy to decrease the number of _PyOS_URandom() calls. On its own it's a safe optimization with the edge case that Unix fork needs to be explicitly handled. We do that by comparing the current PID to the PID of when the random buffer was populated.
Portions of code are coming from my implementation of faster UUID in gel-python [1]. I did use AI during the development, but basically had to rewrite the code it generated to be more idiomatic and efficient.
The benchmark can be found here [2].
This PR makes Python UUID operations as fast as they are in NodeJS and Bun runtimes.
[1] https://github.com/MagicStack/py-pgproto/blob/b8109fb311a59f30f9947567a13508da9a776564/uuid.pyx
[2] https://gist.github.com/1st1/f03e816f34a61e4d46c78ff98baf4818