Speaker thumbnail selection and speaker visualization in diarized transcripts for text-based video

    Publication No.: US12300272B2

    Publication date: 2025-05-13

    Application No.: US17967697

    Filing date: 2022-10-17

    Applicant: Adobe Inc.

    Abstract: Embodiments of the present invention provide systems, methods, and computer storage media for selection of the best image of a particular speaker's face in a video, and visualization in a diarized transcript. In an example embodiment, candidate images of a face of a detected speaker are extracted from frames of a video identified by a detected face track for the face, and a representative image of the detected speaker's face is selected from the candidate images based on image quality, facial emotion (e.g., using an emotion classifier that generates a happiness score), a size factor (e.g., favoring larger images), and/or penalizing images that appear towards the beginning or end of a face track. As such, each segment of the transcript is presented with the representative image of the speaker who spoke that segment and/or input is accepted changing the representative image associated with each speaker.
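
The selection criteria named in the abstract above (image quality, a happiness score, a size factor, and a penalty near the beginning or end of a face track) can be pictured as a weighted score over candidate images. The following is a minimal sketch under stated assumptions, not the patented method: the `Candidate` fields, the linear `edge_penalty`, the 10% margin, and the weights are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Candidate:
    """One cropped face image taken from a face track (hypothetical structure)."""
    frame_index: int   # position within the face track, 0-based
    quality: float     # image quality score, assumed to lie in [0, 1]
    happiness: float   # happiness score from an emotion classifier, assumed in [0, 1]
    width: int         # crop width in pixels
    height: int        # crop height in pixels


def edge_penalty(frame_index: int, track_length: int, margin: float = 0.1) -> float:
    """Return 1.0 at either end of the track, falling linearly to 0.0 at `margin` inward.

    The linear shape and the default margin are assumptions.
    """
    if track_length <= 1:
        return 0.0
    pos = frame_index / (track_length - 1)   # normalized position in [0, 1]
    dist = min(pos, 1.0 - pos)               # distance to the nearest end of the track
    return max(0.0, 1.0 - dist / margin)


def select_representative(
    candidates: Sequence[Candidate],
    track_length: int,
    weights: Tuple[float, float, float, float] = (1.0, 1.0, 0.5, 1.0),
) -> Candidate:
    """Pick the candidate with the highest combined score (assumed weighting)."""
    if not candidates:
        raise ValueError("at least one candidate image is required")
    w_quality, w_happy, w_size, w_edge = weights
    max_area = max(c.width * c.height for c in candidates)

    def score(c: Candidate) -> float:
        size_factor = (c.width * c.height) / max_area   # favors larger images
        return (
            w_quality * c.quality
            + w_happy * c.happiness
            + w_size * size_factor
            - w_edge * edge_penalty(c.frame_index, track_length)
        )

    return max(candidates, key=score)
```

With these assumed weights, a sharp, smiling image from the first frame of a track can still lose to a slightly weaker image from the middle, because the edge penalty offsets its higher quality and happiness scores.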

    Face-aware speaker diarization for transcripts and text-based video editing

    Publication No.: US12125501B2

    Publication date: 2024-10-22

    Application No.: US17967399

    Filing date: 2022-10-17

    Applicant: Adobe Inc.

    CPC classification: G11B27/031, G06V20/41

    Abstract: Embodiments of the present invention provide systems, methods, and computer storage media for face-aware speaker diarization. In an example embodiment, an audio-only speaker diarization technique is applied to generate an audio-only speaker diarization of a video, an audio-visual speaker diarization technique is applied to generate a face-aware speaker diarization of the video, and the audio-only speaker diarization is refined using the face-aware speaker diarization to generate a hybrid speaker diarization that links detected faces to detected voices. In some embodiments, to accommodate videos with small faces that appear pixelated, a cropped image of any given face is extracted from each frame of the video, and the size of the cropped image is used to select a corresponding active speaker detection model to predict an active speaker score for the face in the cropped image.
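
The last step of the abstract above, using the size of a cropped face image to choose which active speaker detection model scores it, can be sketched as a simple lookup by crop size. This is an illustrative sketch only: the `select_asd_model` function, the use of the shorter side as the size measure, and the 64-pixel threshold in the usage note are assumptions, and the models are stand-in callables rather than real detectors.

```python
from typing import Callable, Sequence, Tuple

# An active speaker detection (ASD) model is represented here as any callable
# that maps a cropped face image to an active speaker score in [0, 1].
AsdModel = Callable[[object], float]


def select_asd_model(
    crop_size: Tuple[int, int],
    models: Sequence[Tuple[int, AsdModel]],
) -> AsdModel:
    """Choose an ASD model based on the size of the cropped face image.

    `models` holds (min_side_px, model) pairs. The model with the largest
    `min_side_px` that does not exceed the crop's shorter side is returned.
    Measuring size by the shorter side is an assumption of this sketch.
    """
    shorter_side = min(crop_size)
    eligible = [(min_side, model) for min_side, model in models if min_side <= shorter_side]
    if not eligible:
        raise ValueError(f"no model registered for crops of size {crop_size}")
    return max(eligible, key=lambda pair: pair[0])[1]
```

A caller might register one model for small, pixelated faces with a minimum side of 0 and another for larger faces with a minimum side of 64 pixels, then call `select_asd_model((width, height), models)(crop)` for each cropped face in each frame to obtain its active speaker score.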
