Caption-Anything is a versatile image processing tool that combines the capabilities of Segment Anything, Visual Captioning, and ChatGPT. Our solution generates descriptive captions for any object within an image, offering a range of language styles to accommodate diverse user preferences. It supports visual controls (mouse click) and language controls (length, sentiment, factuality, and language).

  • Visual controls and language controls for text generation
  • Chat about selected object for detailed understanding
  • Interactive demo

Check https://github.com/ttengwang/Caption-Anything for details.


Track-Anything is a flexible and interactive tool for video object tracking and segmentation. Built on Segment Anything, it lets users specify anything to track and segment using clicks alone. During tracking, users can change the objects they want to track or correct the region of interest if there are any ambiguities. These characteristics make Track-Anything suitable for:

  • Video object tracking and segmentation with shot changes.
  • Visualized development and data annotation for video object tracking and segmentation.
  • Object-centric downstream video tasks, such as video inpainting and editing.

Check https://github.com/gaomingqi/Track-Anything for details.