Merge pull request #21 from LemurPwned/feat/video-summary
Feat/video summary
LemurPwned committed Apr 14, 2024
2 parents 1259f4a + bf53240 commit 8f7b54f
Showing 8 changed files with 300 additions and 9 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,10 @@

Changelog for the `video-sampler`.

### 0.11.0

- added summary creation from sampled frames

### 0.9.0

- keyword yt-dlp extraction
27 changes: 26 additions & 1 deletion README.md
@@ -11,7 +11,7 @@
[![License](https://img.shields.io/github/license/LemurPwned/video-sampler)](https://github.com/LemurPwned/video-sampler/blob/main/LICENSE)
[![Downloads](https://img.shields.io/pypi/dm/video-sampler.svg)](https://img.shields.io/pypi/dm/video-sampler.svg)

Video sampler allows you to efficiently sample video frames.
Video sampler allows you to efficiently sample video frames and summarise the videos.
Currently, it uses keyframe decoding, frame interval gating and perceptual hashing to reduce duplicated samples.

**Use case:** for sampling videos for later annotations used in machine learning.
@@ -28,6 +28,7 @@ Currently, it uses keyframe decoding, frame interval gating and perceptual hashi
- [Basic usage](#basic-usage)
- [YT-DLP integration plugin](#yt-dlp-integration-plugin)
- [Extra YT-DLP options](#extra-yt-dlp-options)
- [OpenAI summary](#openai-summary)
- [API examples](#api-examples)
- [Advanced usage](#advanced-usage)
- [Gating](#gating)
@@ -62,6 +63,7 @@ Documentation is available at [https://lemurpwned.github.io/video-sampler/](http
- [x] Integrations
- [x] YTDLP integration -- streams directly from [yt-dlp](http://github.com//yt-dlp/yt-dlp) queries,
playlists or single videos
- [x] OpenAI multimodal models integration for video summaries

## Installation and Usage

@@ -145,6 +147,29 @@ or this will skip all shorts:
... --ytdlp --yt-extra-args '--match-filter "original_url!*=/shorts/ & url!*=/shorts/"'
```
#### OpenAI summary
To use the OpenAI multimodal models integration, you need to install `openai` first: `pip install openai`.
Then, you simply add `--summary-url` and `--summary-interval` to the command.
In the example below, I'm using a [llamafile](https://github.com/Mozilla-Ocho/llamafile) LLaVA model to summarise the video roughly every 50 seconds. If you want to use the OpenAI multimodal models, you need to export `OPENAI_API_KEY=your_api_key` first.

To replicate, run a LLaVA model locally and set `--summary-url` to the address of the model. Set `--summary-interval` to the minimum interval in seconds between frames that are to be summarised/described.

```bash
video_sampler hash ./videos/FatCat.mp4 ./output-frames/ --hash-size 3 --buffer-size 20 --summary-url "http://localhost:8080" --summary-interval 50
```

Based on the specified interval, some of the frames will be summarised by the model and the results will be saved to the `./output-frames/summaries.jsonl` file. Summarisation happens after the sampling and gating stages, so only frames that pass both are candidates for summarisation.

```jsonl
summaries.jsonl
---
{"time": 56.087, "summary": "A cat is walking through a field of tall grass, with its head down and ears back. The cat appears to be looking for something in the grass, possibly a mouse or another small creature. The field is covered in snow, adding a wintry atmosphere to the scene."}
{"time": 110.087, "summary": "A dog is walking in the snow, with its head down, possibly sniffing the ground. The dog is the main focus of the image, and it appears to be a small animal. The snowy landscape is visible in the background, creating a serene and cold atmosphere."}
{"time": 171.127, "summary": "The image features a group of animals, including a dog and a cat, standing on a beach near the ocean. The dog is positioned closer to the left side of the image, while the cat is located more towards the center. The scene is set against a beautiful backdrop of a blue sky and a vibrant green ocean. The animals appear to be enjoying their time on the beach, possibly taking a break from their daily activities."}
```
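
For downstream processing, the records can be read line by line; a minimal sketch in Python, assuming the output directory from the command above:

```python
import json

# Each line of summaries.jsonl is a standalone JSON record with the frame
# timestamp (in seconds) and the model's description of that frame.
with open("./output-frames/summaries.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(f'{record["time"]:8.3f}s  {record["summary"]}')
```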

#### API examples

See examples in [./scripts](./scripts/run_benchmarks.py).
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -2,7 +2,7 @@
name = "video_sampler"
description = "Video Sampler -- sample frames from a video file"
url = "https://github.com/LemurPwned/video-sampler"
version = "0.9.0"
version = "0.10.0"
authors = [
{ name = "LemurPwned", email = "[email protected]" }
]
16 changes: 16 additions & 0 deletions video_sampler/__main__.py
@@ -117,6 +117,12 @@ def main(
keywords: str = typer.Option(
None, help="Comma separated positive keywords for text extraction."
),
summary_url: str = typer.Option(
None, help="URL to summarise the video using LLaMA."
),
summary_interval: int = typer.Option(
-1, help="Interval in seconds to summarise the video."
),
) -> None:
"""Default buffer is the perceptual hash buffer"""
extractor_cfg = {}
@@ -129,6 +135,15 @@
extractor_cfg = {"type": "keyword", "args": {"keywords": keywords_}}
sampler_cls = SegmentSampler
subs_enable = True
summary_config = {}
if summary_interval > 0:
summary_config = {"url": summary_url, "min_sum_interval": summary_interval}
elif summary_url is not None:
console.print(
"Set summary interval to be greater than 0 to enable summary feature.",
style=f"bold {Color.red.value}",
)
raise typer.Exit(code=-1)
cfg = SamplerConfig(
min_frame_interval_sec=min_frame_interval_sec,
keyframes_only=keyframes_only,
@@ -141,6 +156,7 @@
"debug": debug,
"hash_size": hash_size,
},
summary_config=summary_config,
gate_config=(
{
"type": "blur",
11 changes: 7 additions & 4 deletions video_sampler/buffer.py
@@ -35,7 +35,8 @@ class SamplerConfig:
the frame gate. Defaults to {"type": "pass"}.
extractor_config (dict[str, Any], optional): Configuration options for
the extractor (keyword, audio). Defaults to None.
summary_config (dict[str, Any], optional): Configuration options for
the summary generator. Defaults to None.
Methods:
__str__() -> str:
Returns a string representation of the configuration.
@@ -61,6 +62,7 @@ class SamplerConfig:
}
)
extractor_config: dict[str, Any] = field(default_factory=dict)
summary_config: dict[str, Any] = field(default_factory=dict)

def __str__(self) -> str:
return str(asdict(self))
@@ -218,7 +220,7 @@ def __init__(
self.max_hits = max_hits
self.mosaic_buffer = {}

def __get_grid_hash(self, item: Image.Image) -> str:
def __get_grid_hash(self, item: Image.Image) -> Iterable[str]:
"""Compute grid hashes for a given image"""
for x in range(self.grid_x):
for y in range(self.grid_y):
@@ -418,8 +420,9 @@ def get_buffer_state(self) -> list[str]:
return self.sliding_top_k_buffer.get_buffer_state()

def add(self, item: Image.Image, metadata: dict[str, Any]):
entropy = item.entropy()
return self.sliding_top_k_buffer.add(item, {**metadata, "index": -entropy})
return self.sliding_top_k_buffer.add(
item, {**metadata, "index": -item.entropy()}
)

def final_flush(self) -> Iterable[tuple[Image.Image | None, dict]]:
return self.sliding_top_k_buffer.final_flush()
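
The new `summary_config` field can also be set when building a `SamplerConfig` directly from Python. A minimal sketch, assuming the class is imported from `video_sampler.buffer` as defined in this diff and using the same keys the CLI assembles (`url`, `min_sum_interval`):

```python
from video_sampler.buffer import SamplerConfig

# Sketch only: summary_config mirrors what the CLI builds from
# --summary-url and --summary-interval; the remaining fields keep their defaults.
cfg = SamplerConfig(
    min_frame_interval_sec=1.0,
    keyframes_only=True,
    summary_config={"url": "http://localhost:8080", "min_sum_interval": 50},
)
print(cfg)  # __str__ renders the dataclass as a dict string
```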
3 changes: 2 additions & 1 deletion video_sampler/integrations/__init__.py
@@ -1,3 +1,4 @@
from .llava_chat import ImageDescription, VideoSummary
from .yt_dlp_plugin import YTDLPPlugin

__all__ = ["YTDLPPlugin"]
__all__ = ["YTDLPPlugin", "ImageDescription", "VideoSummary"]
179 changes: 179 additions & 0 deletions video_sampler/integrations/llava_chat.py
@@ -0,0 +1,179 @@
try:
from openai import OpenAI
except ImportError:
print(
"openai not installed, please install it using `pip install openai` to use this plugin"
)
import base64
import io
import os

import requests
from PIL import Image


def resize_image(image: Image, max_side: int = 512):
"""
Resize the image to max_side if any of the sides is greater than max_side
"""
# get the image shape
width, height = image.size
if max(width, height) > max_side:
# resize the image to max_side
# keeping the aspect ratio
if width > height:
new_width = max_side
new_height = int(height * max_side / width)
else:
new_height = max_side
new_width = int(width * max_side / height)
return image.resize((new_width, new_height))
return image


def encode_image(image: Image):
"""
Convert the image to base64
"""
# create a buffer to store the image
buffer = io.BytesIO()
# save the image to the buffer
image.save(buffer, format="JPEG")
# convert the image to base64
return base64.b64encode(buffer.getvalue()).decode("utf-8")


class PromptClient:
def __init__(self, url: str) -> None:
self.client = OpenAI(
base_url=url,
api_key=os.getenv("OPENAI_API_KEY", "sk-no-key-required"),
)
self.base_settings = {"cache_prompt": True, "temperature": 0.01}
self.headers = {
"accept-language": "en-US,en",
"content-type": "application/json",
}

def get_prompt(self):
raise NotImplementedError


class ImageDescription:
"""A client to interact with the LLaMA image description API.
The API is used to generate short phrases that describe an image.
Methods:
summarise_image(image: Image) -> str:
Summarise the image using the LLaMA API.
"""

def __init__(self, url: str = "http://localhost:8080"):
"""Initialise the client with the base URL of the LLaMA API.
Args:
url (str): The base URL of the LLaMA API.
"""
"""TODO: migrate to OpenAI API when available"""
if url is None:
url = "http://localhost:8080/"
self.url = url
self.headers = {
"accept-language": "en-GB,en",
"content-type": "application/json",
}
if api_key := os.getenv("OPENAI_API_KEY"):
self.headers["Authorization"] = f"Bearer {api_key}"
self.session = requests.Session()

def get_prompt(self):
return """You're an AI assistant that describes images using short phrases.
The image is shown below.
\nIMAGE:[img-10]
\nASSISTANT:"""

def summarise_image(self, image: Image):
"""Summarise the image using the LLaMA API.
Args:
image (Image): The image to summarise.
Returns:
str: The description of the image.
"""
b64image = encode_image(resize_image(image))

json_body = {
"stream": False,
"n_predict": 300,
"temperature": 0.1,
"repeat_last_n": 78,
"image_data": [{"data": b64image, "id": 10}],
"cache_prompt": True,
"top_k": 40,
"top_p": 1,
"min_p": 0.05,
"tfs_z": 1,
"typical_p": 1,
"presence_penalty": 0,
"frequency_penalty": 0,
"mirostat": 0,
"mirostat_tau": 5,
"mirostat_eta": 0.1,
"grammar": "",
"n_probs": 0,
"min_keep": 0,
"api_key": "",
"slot_id": 0,
"stop": ["</s>", "Llama:", "User:"],
"prompt": self.get_prompt(),
}

response = self.session.post(
f"{self.url}/completion",
json=json_body,
headers=self.headers,
stream=False,
)
if response.status_code != 200:
print(f"Failed to summarise image: {response.content}")
return None
return response.json()["content"].strip()


class VideoSummary(PromptClient):
"""A client to interact with the LLaMA video summarisation API.
The API is used to generate a summary of a video based on image descriptions.
Methods:
summarise_video(image_descriptions: list[str]) -> str:
Summarise the video using the LLaMA API.
"""

def __init__(self, url: str = "http://localhost:8080/v1"):
"""Initialise the client with the base URL of the LLaMA API.
Args:
url (str): The base URL of the LLaMA API."""
if url is None:
url = "http://localhost:8080/v1"
super().__init__(url)

def get_prompt(self):
return """You're an AI assistant that summarises videos based on image descriptions.
Combine image descriptions into a coherent summary of the video."""

def summarise_video(self, image_descriptions: list[str]):
"""Summarise the video using the LLaMA API.
Args:
image_descriptions (list[str]): The descriptions of the images in the video.
Returns:
str: The summary of the video.
"""
return self.client.chat.completions.create(
model="LLaMA_CPP",
messages=[
{
"role": "system",
"content": self.get_prompt(),
},
{"role": "user", "content": "\n".join(image_descriptions)},
],
max_tokens=300,
)
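
For orientation, the two new classes can also be used on their own. A minimal sketch, assuming a local llamafile/LLaVA server on `localhost:8080` and a couple of already-sampled frames on disk (the frame paths are hypothetical, not part of this commit):

```python
from PIL import Image

from video_sampler.integrations import ImageDescription, VideoSummary

describer = ImageDescription(url="http://localhost:8080")
summariser = VideoSummary(url="http://localhost:8080/v1")

descriptions = []
for path in ["./output-frames/frame_0001.jpg", "./output-frames/frame_0002.jpg"]:
    # summarise_image returns None when the HTTP request fails
    if (desc := describer.summarise_image(Image.open(path))) is not None:
        descriptions.append(desc)

# summarise_video returns an OpenAI chat completion object
response = summariser.summarise_video(descriptions)
print(response.choices[0].message.content)
```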