Ashp116
Contributor

@Ashp116 Ashp116 commented Jul 30, 2025

Description

This PR introduces a new Video API that streamlines video processing and rendering workflows. It addresses both issues #1923 and #1929 by enabling more flexible backend support and improved audio-video synchronization.

With this update, the video processing function now supports multiple backends, including PyAV and OpenCV. Notably, PyAV is the only backend currently supporting audio rendering, which significantly improves output quality.

This PR requires the optional dependency PyAV for the video rendering backend.

Tags:
Fixes #1923
Fixes #1929

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How has this change been tested, please provide a testcase or example of how you tested the change?

Please refer to #1923 and #1929

Any specific deployment considerations

Ensure that PyAV is installed in the environment to test the PyAV backend.
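
A minimal sketch of how a caller might check for the optional dependency before asking for the PyAV backend. PyAV is imported as `av`; the helper name `resolve_backend` and the fall-back-to-OpenCV policy are assumptions made for illustration and are not part of this PR.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be imported in this environment."""
    return importlib.util.find_spec(name) is not None


def resolve_backend(preferred: str = "pyav") -> str:
    """Hypothetical helper: fall back to "opencv" when PyAV is missing."""
    # PyAV's import name is `av`.
    if preferred == "pyav" and not module_available("av"):
        return "opencv"
    return preferred


# Prints "pyav" when PyAV is installed, otherwise "opencv".
print(resolve_backend("pyav"))
```

Using `find_spec` avoids importing the package just to learn whether it exists, so the check stays cheap even when PyAV is installed.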

Docs

  • Docs updated? What were the changes

@Ashp116 Ashp116 requested a review from SkalskiP as a code owner July 30, 2025 19:29
@Ashp116 Ashp116 changed the title from "ADD: Added audio stream for process_video" to "BUG: Added audio stream for process_video" Jul 30, 2025
@SkalskiP
Collaborator

Hi @Ashp116 👋🏻 Another great idea! Video processing is probably the oldest part of supervision, written over two years ago, and I’ve been wanting to update its API for a while. Would you be open to not only adding audio support but also helping me with the update?

@Ashp116
Contributor Author

Ashp116 commented Jul 31, 2025

Hi @SkalskiP, yea, I would like to help update the API. I was thinking of changing how videos are written in process_video. The original compression is lost when annotations are added and the file is written to a target_path. But yea, I would like to help out with the update.

@SkalskiP
Collaborator

SkalskiP commented Aug 1, 2025

Hi @Ashp116 I'm really glad you want to help me! Let's goooo! 🔥 🔥 🔥

I want the functionalities currently found in supervision.utils.video to be reorganized around a new Video class. Importantly, all features previously available in the old API must still be supported in the new one. Ideally, the new API should be more consistent and expressive.

  • get video info (works for files, RTSP, webcams)

    import supervision as sv
     
    # static video
    sv.Video("source.mp4").info
    
    # video stream
    sv.Video("rtsp://...").info
    
    # webcam
    sv.Video(0).info
  • simple frame iteration (object is iterable)

    import supervision as sv
    
    video = sv.Video("source.mp4")
    for frame in video:
        ...
  • advanced frame iteration (stride, sub-clip, on-the-fly resize)

    import supervision as sv
    
    for frame in sv.Video("source.mp4").frames(stride=5, start=100, end=500, resolution_wh=(1280, 720)):
        ...
  • process the video

    import cv2
    import supervision as sv
    
    def blur(frame, i):
        return cv2.GaussianBlur(frame, (11, 11), 0)
    
    sv.Video("source.mp4").save(
        "blurred.mp4",
        callback=blur,
        show_progress=True
    )
  • overwrite target video parameters

    import supervision as sv
    
    sv.Video("source.mp4").save(
        "timelapse.mp4",
        fps=60,
        callback=lambda f, i: f,
        show_progress=True
    )
  • complete manual control with explicit VideoInfo

    from supervision import Video, VideoInfo
    
    source = Video("source.mp4")
    target_info = VideoInfo(width=800, height=800, fps=24)
    
    with src.sink("square.mp4", info=target_info) as sink:
        for f in src.frames():
            f = cv2.resize(f, target_info.resolution_wh)
            sink.write(f)
  • multi-backend support decode/encode

    import supervision as sv
    
    video = sv.Video("source.mkv", backend="pyav")
    
    video = sv.Video("source.mkv", backend="opencv")

    Suggested minimal protocol:

    from __future__ import annotations  # lets Backend reference Writer before it is defined

    from typing import Any, Protocol

    import numpy as np
    from supervision import VideoInfo

    class Backend(Protocol):
        # source handling
        def open(self, path: str) -> Any: ...
        def info(self, handle: Any) -> VideoInfo: ...

        # decoding
        def read(self, handle: Any) -> tuple[bool, np.ndarray]: ...
        def grab(self, handle: Any) -> bool: ...
        def seek(self, handle: Any, frame_idx: int) -> None: ...

        # encoding
        def writer(self, path: str, info: VideoInfo, codec: str) -> Writer: ...

    class Writer(Protocol):
        def write(self, frame: np.ndarray) -> None: ...
        def close(self) -> None: ...
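
The `stride`, `start`, and `end` arguments in the advanced iteration example above can be read as plain index selection. The sketch below assumes `end` is exclusive and is clamped to the video length; `frame_indices` is a hypothetical helper, not a function of the proposed API.

```python
from typing import Iterator, Optional


def frame_indices(
    total_frames: int,
    stride: int = 1,
    start: int = 0,
    end: Optional[int] = None,
) -> Iterator[int]:
    """Yield the indices of the frames a strided sub-clip would visit."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    # Assumption: `end` is exclusive and never exceeds the video length.
    stop = total_frames if end is None else min(end, total_frames)
    yield from range(max(start, 0), stop, stride)


indices = list(frame_indices(total_frames=1000, stride=5, start=100, end=500))
print(len(indices), indices[0], indices[-1])  # 80 100 495
```

A backend could use such indices to decide when to `grab` (skip) and when to `read` (decode) a frame.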

@Ashp116
Contributor Author

Ashp116 commented Aug 2, 2025

Hi @SkalskiP,

I’ve addressed most of the features you mentioned, but I have some thoughts on a few aspects of the implementation:

  • .save Functionality
    How would you handle .save for a video feed coming from a webcam or an RTSP stream? In my current implementation, only video files can be saved.

  • Writer and Backend Classes
    This is just my personal opinion, but should these classes be moved to separate scripts/modules? If we add more writers and backends in the future, keeping everything inside the main video script might become cluttered.

  • “Complete manual control with explicit VideoInfo” Functionality

    from supervision import Video, VideoInfo
    
    source = Video("source.mp4")
    target_info = VideoInfo(width=800, height=800, fps=24)
    
    with src.sink("square.mp4", info=target_info) as sink:
        for f in src.frames():
            f = cv2.resize(f, target_info.resolution_wh)
            sink.write(f)

    I’m not fully clear on what this feature is intended to do. In this snippet, the Video instance source is created but never used afterward. Is src supposed to be source? Also, is the goal to create sinks for each backend? Could you please clarify the purpose and expected usage here?
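
One possible reading of the snippet, assuming `src` was meant to be `source`: the sink is a writer opened as a context manager, so the target is finalized when the `with` block exits, whatever backend produced it. The sketch below shows only that pattern with a hypothetical in-memory `ListSink`; it writes no video and is not the PR's implementation.

```python
from typing import List


class ListSink:
    """Hypothetical in-memory sink exposing write() and close()."""

    def __init__(self) -> None:
        self.frames: List[object] = []
        self.closed = False

    def write(self, frame: object) -> None:
        if self.closed:
            raise RuntimeError("sink is closed")
        self.frames.append(frame)

    def close(self) -> None:
        self.closed = True

    def __enter__(self) -> "ListSink":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Always release the writer, even if the loop body raised.
        self.close()


with ListSink() as sink:
    for frame in ("f0", "f1", "f2"):  # stand-ins for decoded frames
        sink.write(frame)

print(len(sink.frames), sink.closed)
```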

@Ashp116 Ashp116 changed the title from "BUG: Added audio stream for process_video" to "FEATURE: Versatile Video class" Aug 2, 2025
@Ashp116
Contributor Author

Ashp116 commented Aug 12, 2025

Hi @SkalskiP,

Thank you for reviewing the PR. I have addressed all the comments from the review. Could you please take a look at the following points?

  • render_audio parameter:
    You mentioned the need for this parameter. I agree it’s necessary. Here is my response:

    I think we do need it. PyAV’s default compression codec is h264, which produces much better quality than OpenCV’s mp4v. If a user wants to render only the video frames without audio using PyAV, this parameter allows that. I also suggest setting render_audio to None with the default as True.
    Overall, FFmpeg’s default compression leads to better video outputs.

  • .show() feature:
    I think this is a useful addition, and I have implemented it. However, there is an issue: cv2.imshow currently renders each frame with a wait time of 1 ms, so playback does not match the FPS of the video source. There are ways to address this, but they would require adding a dependency. Could you share your thoughts on this implementation approach?
    EDIT: I added support for headless and notebook environments. Solid points were raised here.

  • PyAV webcam bug:
    In my previous review, I completely missed webcam support for the PyAV backend. I’ve added this in the current PR. Using a webcam with PyAV requires a different code path. Could you help test this feature on other devices? I’ve verified it works on a Windows machine.
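
For the .show() pacing issue, one dependency-free approximation is to derive the wait time from the source FPS and the time already spent on the frame, instead of a fixed 1 ms. This is only a sketch: `wait_ms` is a hypothetical helper, and the wait passed to `cv2.waitKey` is not a precise timer, so playback would still drift somewhat.

```python
import time


def wait_ms(fps: float, elapsed_s: float) -> int:
    """Milliseconds to wait so one frame takes roughly 1 / fps seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    remaining_s = 1.0 / fps - elapsed_s
    # cv2.waitKey(0) blocks until a key press, so never return less than 1.
    return max(1, round(remaining_s * 1000))


frame_start = time.perf_counter()
# ... decode, annotate, and cv2.imshow the frame here ...
delay = wait_ms(fps=25.0, elapsed_s=time.perf_counter() - frame_start)
print(delay)  # value to pass to cv2.waitKey(delay)
```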

Please let me know if you have any feedback or suggestions. Thanks

@Ashp116 Ashp116 requested a review from SkalskiP August 12, 2025 02:45
@Ashp116
Contributor Author

Ashp116 commented Sep 1, 2025

Hi @SkalskiP,

It’s been a while! I’ve added better audio support. Previously, I manually manipulated audio packets, along with DTS and PTS values, to synchronize them with the video. Now, I’m using the atempo filter on audio streams, which matches the video much more cleanly.
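
As a rough illustration of the atempo approach: when the output FPS differs from the source FPS and every frame is kept, the audio has to play faster or slower by the same ratio. The sketch below builds an FFmpeg filter string for that ratio, chaining several `atempo` instances so each stays within 0.5 to 2.0. The helper names and the ratio `target_fps / source_fps` are assumptions, not code from this PR.

```python
from typing import List


def atempo_chain(tempo: float) -> List[float]:
    """Split a tempo factor into steps that each lie within [0.5, 2.0]."""
    if tempo <= 0:
        raise ValueError("tempo must be positive")
    factors: List[float] = []
    while tempo > 2.0:
        factors.append(2.0)
        tempo /= 2.0
    while tempo < 0.5:
        factors.append(0.5)
        tempo /= 0.5
    factors.append(tempo)
    return factors


def atempo_filter(source_fps: float, target_fps: float) -> str:
    """Build a comma-separated atempo filter chain for an FPS change."""
    tempo = target_fps / source_fps  # assumption: all frames are kept
    return ",".join(f"atempo={factor:g}" for factor in atempo_chain(tempo))


print(atempo_filter(source_fps=15, target_fps=60))  # atempo=2,atempo=2
```

Chaining keeps each step in the range where the filter blends samples rather than skipping them.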

I’ve included my Colab notebook showcasing the new .show() function. Next, I’ll be working on the documentation and unit tests for audio.

I’d love to hear your thoughts and get your feedback on my current implementation.

Thank you!


@ryashry ryashry left a comment

Improve high quality video web

Development

Successfully merging this pull request may close these issues.

  • Reimplement video utils
  • BUG: Audio stream not captured in process_video
3 participants