Skip to content

Conversation

1st1
Copy link
Member

@1st1 1st1 commented Sep 18, 2025

The C implementation considerably boosts the performance of the key UUID operations:

------------------------------------
Operation                    Speedup
------------------------------------
uuid4() generation            15.01x
uuid7() generation            29.64x
UUID from string               6.76x
UUID from bytes                5.16x
str(uuid) conversion           6.66x
------------------------------------

Summary of changes:

  • The UUID type is reimplemented in C in its entirety.

  • The pure-Python is kept around and is used of the C implementation isn't available for some reason.

  • Both implementations are tested extensively; additional tests are added to ensure that the C implementation of the type follows the pure Python implementation fully.

  • The Python implementation stores UUID values as int objects. The C implementation stores them as uint8_t[16] array.

  • The C implementation supports unpickling of UUIDs created with Python 2 using protocols starting with 0.

  • The C implementation has faster hash() implementation but also caches the computed hash value to speedup cases when UUIDs are used as set/dict keys.

  • The C implementation has a freelist to make new UUID object instantiation as fast as possible.

  • uuid4() and uuid7() are now implmented in C. The most performance boost (10x) comes from overfetching entropy to decrease the number of _PyOS_URandom() calls. On its own it's a safe optimization with the edge case that Unix fork needs to be explicitly handled. We do that by comparing the current PID to the PID of when the random buffer was populated.

  • Portions of code are coming from my implementation of faster UUID in gel-python [1]. I did use AI during the development, but basically had to rewrite the code it generated to be more idiomatic and efficient.

  • The benchmark can be found here [2].

  • This PR makes Python UUID operations as fast as they are in NodeJS and Bun runtimes.

[1] https://github.com/MagicStack/py-pgproto/blob/b8109fb311a59f30f9947567a13508da9a776564/uuid.pyx

[2] https://gist.github.com/1st1/f03e816f34a61e4d46c78ff98baf4818

@python-cla-bot
Copy link

python-cla-bot bot commented Sep 18, 2025

All commit authors signed the Contributor License Agreement.

CLA signed

@1st1 1st1 changed the title The C implementation considerably boosts the performance of the key UUID Reimplement base UUID type, uuid4(), and uuid7() in C Sep 18, 2025
1st1 added a commit to 1st1/cpython that referenced this pull request Sep 18, 2025
… in C

The C implementation considerably boosts the performance of the key UUID
operations:

------------------------------------
Operation                    Speedup
------------------------------------
uuid4() generation            15.01x
uuid7() generation            29.64x
UUID from string               6.76x
UUID from bytes                5.16x
str(uuid) conversion           6.66x
------------------------------------

Summary of changes:

* The UUID type is reimplemented in C in its entirety.

* The pure-Python is kept around and is used of the C implementation
  isn't available for some reason.

* Both implementations are tested extensively; additional tests are
  added to ensure that the C implementation of the type follows the pure
  Python implementation fully.

* The Python implementation stores UUID values as int objects. The C
  implementation stores them as `uint8_t[16]` array.

* The C implementation supports unpickling of UUIDs created with Python
  2 using protocols starting with 0. That necessitated a small fix to
  the `copyreg` module (the change is only affecting legacy pickle
  pathway).

* The C implementation has faster hash() implementation but also caches
  the computed hash value to speedup cases when UUIDs are used as
  set/dict keys.

* The C implementation has a freelist to make new UUID object
  instantiation as fast as possible.

* uuid4() and uuid7() are now implmented in C. The most performance
  boost (10x) comes from overfetching entropy to decrease the number of
  _PyOS_URandom() calls. On its own it's a safe optimization with the
  edge case that Unix fork needs to be explicitly handled. We do that by
  comparing the current PID to the PID of when the random buffer was
  populated.

* Portions of code are coming from my implementation of faster UUID
  in gel-python [1]. I did use AI during the development, but basically
  had to rewrite the code it generated to be more idiomatic and
  efficient.

* The benchmark can be found here [2].

* This PR makes Python UUID operations as fast as they are in NodeJS and
  Bun runtimes.

[1] https://github.com/MagicStack/py-pgproto/blob/b8109fb311a59f30f9947567a13508da9a776564/uuid.pyx

[2] https://gist.github.com/1st1/f03e816f34a61e4d46c78ff98baf4818
@1st1 1st1 force-pushed the uuid_push branch 2 times, most recently from 2d0a381 to c4363f8 Compare September 18, 2025 13:22
@1st1 1st1 changed the title Reimplement base UUID type, uuid4(), and uuid7() in C #139122: Reimplement base UUID type, uuid4(), and uuid7() in C Sep 18, 2025
@1st1 1st1 changed the title #139122: Reimplement base UUID type, uuid4(), and uuid7() in C #gh-139122: Reimplement base UUID type, uuid4(), and uuid7() in C Sep 18, 2025
@1st1 1st1 force-pushed the uuid_push branch 2 times, most recently from fa12192 to 4a5c2a2 Compare September 18, 2025 13:24
@1st1
Copy link
Member Author

1st1 commented Sep 18, 2025

Note to reviewers:

  • Please compare uuid7 C and Python implementations side by side
  • Please review all of the bitshift/magic code closely (Python implementation uses int, i'm using char[16], so low-level operations are very different but have to be equivalent)


*counter = (((uint64_t)(high & 0x1FF) << 32) | (low >> 32)) & 0x1FFFFFFFFFF;
*tail = (uint32_t)low;
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested thoroughly (mostly looking at GodBolt output), but this ought to generate more efficient code (and there might be more ways to generate the right values directly, I just didn't go that far):

    struct {
        uint16_t high;
        uint32_t low1;
        uint32_t low2;
    } rand_bytes;
    if (gen_random(state, (uint8_t*)&rand_bytes, sizeof(rand_bytes)) < 0) {
        return -1;
    }

    *counter = (((uint64_t)(rand_bytes.high & 0x1FF) << 32) | (rand_bytes.low1)) & 0x1FFFFFFFFFF;
    *tail = rand_bytes.low2;

uint8_t bytes[16];
uint64_t timestamp_ms, counter;
uint32_t tail;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ought to be able to do a similar struct/union trick here for faster conversion.

goto fail;
}

uuid_mod = PyImport_ImportModule("uuid");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we possibly simplify this whole concept (caching the values here) by pushing the queries down to a Python subclass of _uuid.UUID? So that the core create/parse/etc. functionality is entirely native, and the more esoteric queries are written in Python as class uuid.UUID(_uuid.UUID)? (Answer can be "no" if it's a bad idea for reasons I haven't considered)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Steve, I really tried hard to simplify it.

  • [1] is my attempt, where the C implementation defines the fundamental ops and the rest is kept in Python.
  • despite me pushing it really hard and even implementing freelist for Python-land classes (probably was first such perversive attempt in the stdlib), the performance of UUID instantiation takes a big hit. Benchmarks became 10-20% slower just because the UUID type is inherited from a Python base.
  • Overall savings were about 300 lines of C. Which is considerable, but IMO making UUID as fast as we can is more important.

So let's keep this PR approach as is.

That said, good news! I figured out how to fully support pickle without my prior shenanigans with copyreg, so now the C implementation is a full drop-in replacement. 🚀

[1] https://github.com/1st1/cpython/blob/8a5877165e993afb2633cd48da5222326d3f6e0e/Modules/_uuidmodule.c#L4

@StanFromIreland StanFromIreland changed the title #gh-139122: Reimplement base UUID type, uuid4(), and uuid7() in C gh-139122: Reimplement base UUID type, uuid4(), and uuid7() in C Sep 18, 2025
@picnixz
Copy link
Member

picnixz commented Sep 18, 2025

I would be happy to have a C implementation of UUID but for reviewing purposes, may I suggest that we first focus on implementing the wrapper in C in one PR and have different PRs for each UUID? it would be much easier to focus on the algorithm to review.

Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the first round comments I have. I realy want multiple PRs because the module becomes really huge and there are parts that I don't think we need to reimplement in C.

@@ -219,13 +242,21 @@ def __init__(self, hex=None, bytes=None, bytes_le=None, fields=None,
raise ValueError('badly formed hexadecimal UUID string')
int = int_(hex, 16)
elif bytes_le is not None:
if not isinstance(bytes_le, bytes_):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will make a behavioural change so let's not change this here (previously there would be an assertion error). Let's do it in a separate PR (and remove the assert at the same time)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't view this as a behavioral change. It was an error condition before, it's now handled properly. I wouldn't touch this code if I didn't have to re-implement it in C; re-implementing "assert" statement behavior is just counterproductive. I'd do that, if it was really a behavioral change, but I strongly believe it is not.

@1st1
Copy link
Member Author

1st1 commented Sep 19, 2025

I realy want multiple PRs because the module becomes really huge and there are parts that I don't think we need to reimplement in C.

Sorry, I'm not going to do multiple PRs, I've no time for that. But I will address your comments here for sure!

…7() in C

The C implementation considerably boosts the performance of the key UUID
operations:

------------------------------------
Operation                    Speedup
------------------------------------
uuid4() generation            15.01x
uuid7() generation            29.64x
UUID from string               6.76x
UUID from bytes                5.16x
str(uuid) conversion           6.66x
------------------------------------

Summary of changes:

* The UUID type is reimplemented in C in its entirety.

* The pure-Python is kept around and is used of the C implementation
  isn't available for some reason.

* Both implementations are tested extensively; additional tests are
  added to ensure that the C implementation of the type follows the pure
  Python implementation fully.

* The Python implementation stores UUID values as int objects. The C
  implementation stores them as `uint8_t[16]` array.

* The C implementation has faster hash() implementation but also caches
  the computed hash value to speedup cases when UUIDs are used as
  set/dict keys.

* The C implementation has a freelist to make new UUID object
  instantiation as fast as possible.

* uuid4() and uuid7() are now implmented in C. The most performance
  boost (10x) comes from overfetching entropy to decrease the number of
  _PyOS_URandom() calls. On its own it's a safe optimization with the
  edge case that Unix fork needs to be explicitly handled. We do that by
  comparing the current PID to the PID of when the random buffer was
  populated.

* Portions of code are coming from my implementation of faster UUID
  in gel-python [1]. I did use AI during the development, but basically
  had to rewrite the code it generated to be more idiomatic and
  efficient.

* The benchmark can be found here [2].

* This PR makes Python UUID operations as fast as they are in NodeJS and
  Bun runtimes.

[1] https://github.com/MagicStack/py-pgproto/blob/b8109fb311a59f30f9947567a13508da9a776564/uuid.pyx

[2] https://gist.github.com/1st1/f03e816f34a61e4d46c78ff98baf4818
@picnixz
Copy link
Member

picnixz commented Sep 19, 2025

I would REALLY appreciate smaller PRs. Usually this is our (current) workflow, namely incrementing changes. I think other core devs would agree with me here (@vstinner, @serhiy-storchaka and @encukou).

Note that whatever we do, this is a feature that will only be included in 3.15 so I can also make the small PRs once I am back at home. Changing 3.14 is no more possible unless the RM gives their approval here (cc @hugovk, who could also whether he prefers one or multiple PR) as performance improvements are usually considered as new features (especially if we are adding a C implementation).

@1st1
Copy link
Member Author

1st1 commented Sep 19, 2025

I would REALLY appreciate smaller PRs. Usually this is our (current) workflow, namely incrementing changes. I think other core devs would agree with me here (@vstinner, @serhiy-storchaka and @encukou).

The current PR layout:

  1. 90% is reimplementing the UUID type
  2. 10% is uuid4 and uuid7 implementations which are isolated separate functions from (1); uuid4 is very tiny.

I'm usually also for making smaller PRs, but objectively there's no point for that right here; I don't see how reviewing this would be fundamentally any different if instead of one PR with 1700 lines in uuidmodule.c, you'd have 3 PRs, one with 1600 lines and another with like 100 lines, and yet another with 30. Reviewing time would be the same, no? What's the point?

If you absolutely insist I can do this, but tbqh I really don't understand why you're pushing for this so hard. It would create a lot of work for me and not necessarily make reviewers more happy.

Note that whatever we do, this is a feature that will only be included in 3.15 so I can also make the small PRs once I am back at home. Changing 3.14 is no more possible unless the RM gives their approval here (cc @hugovk, who could also whether he prefers one or multiple PR) as performance improvements are usually considered as new features (especially if we are adding a C implementation).

I don't think this is 3.14 material and I wasn't pushing for that. 3.15 is fine (unless other people but me want this to be in 3.14, in which case I won't say no).

That said I'd like to see my work through and would love to still merge this myself.

uint64_t random_idx;
uint64_t random_last_pid;

// A freelist for uuid objects -- 15-20% performance boost.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a generic freelist implementation for python objects (see https://github.com/python/cpython/blob/7257b24140ac1b39fb8cfd4610134ec79575a396/Include/internal/pycore_freelist_state.h). Unless there are strong reasons not to, the uuid object should use the generic implementation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly I can't use that, as uuidobject.c is compiled as a shared lib.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants