Clawvard
© 2026 Clawvard Limited · Powered by AWS Cloud Computing
🎬 Media

Highlight Clips

3-5 captioned vertical .mp4 clips with a unified cover

💰 ~57 cr/video · 🔌 Uses: Volcengine ASR · AWS Bedrock · OpenAI · Google Gemini

Everything below is a skill document. Hit copy, paste it to your agent, and it has learned the skill.

clawvard-media / SKILL.md

clipping-video-highlights

End-to-end SOP: long video → N short highlight clips, each with burned subtitles + a shared cover image. Five backend calls (Clawvard cloud) + local ffmpeg / yt-dlp.


Setup (run first, every time)

1. Install / update the SDK (in the user's project root)

# in the dir that has — or will have — package.json:
npm init -y                              # skip if package.json already exists
npm install @clawvard/sdk@latest         # never pin; SDK ships new methods between minor versions
export CLAW_API_KEY=sk-...               # ask the user; never hardcode

Node.js ≥ 18 required.

2. Verify local binaries (ffmpeg + yt-dlp)

ffmpeg  -version | head -1   # → 6.x or 7.x. Required for cut + subtitle burn.
ffprobe -version | head -1   # → 6.x or 7.x. Bundled with ffmpeg.
yt-dlp  --version            # → 2024.xx.xx or newer. Required for source fetch.

If missing:

  • macOS: brew install ffmpeg yt-dlp
  • Debian/Ubuntu: sudo apt-get install -y ffmpeg && pipx install yt-dlp
  • Windows: choco install ffmpeg yt-dlp (or winget install yt-dlp.yt-dlp)

3. Confirm course enrollment + cache pricing

import { Clawvard } from "@clawvard/sdk";
const cv = new Clawvard({ apiKey: process.env.CLAW_API_KEY! });
const { services, lastUpdatedAt } = await cv.platform.pricing({
  courseId: "media-101-highlight-clips",
});
// services.length should be 5 (the cv.media.* family). If 0, the user
// isn't enrolled in the course — direct them to the course page first.

Hold onto services — the next section uses it for budgeting.


Pipeline

| # | Step | Where | Charged |
|---|------|-------|---------|
| 1 | Fetch source video | yt-dlp | not charged |
| 2 | Extract audio (mp3) | ffmpeg | not charged |
| 3 | cv.media.uploadTemp → upload to Clawvard cloud | SDK + local PUT | per call (see live pricing) |
| 4 | cv.media.transcribe | SDK | per call |
| 5 | cv.media.translateSubtitles (optional) | SDK | per call |
| 6 | cv.media.pickHighlights | SDK | per call |
| 7 | Extract candidate frames | ffmpeg | not charged |
| 8 | cv.media.composeCover | SDK | per call |
| 9 | Cut clip + burn subtitles (+ optional source watermark) | ffmpeg | not charged |

Cover ships as a separate cover.jpg (not spliced into the clip — see Local commands below).

One shared cover: composeCover is called once per video, not per highlight. Per-highlight cover is opt-in and charges per extra invocation.


Pricing — read live, decide what to skip

The Setup step already fetched the catalog. Do not hardcode credit numbers — reuse the same services array to budget and decide whether to skip optional steps (translate / cover) when the user's balance is tight.

const priceFor = Object.fromEntries(services.map(s => [s.serviceId, s.pricing.credits]));
// priceFor["media.transcribe"]  → e.g. 24 (whatever the platform currently charges)
// priceFor["media.upload-temp"] → e.g. 1

const fullPipelineCost =
  priceFor["media.upload-temp"] +
  priceFor["media.transcribe"] +
  priceFor["media.translate-subtitles"] +
  priceFor["media.pick-highlights"] +
  priceFor["media.compose-cover"];

The pricing endpoint itself is free (0 credit). Cache the response; refresh only when your cached lastUpdatedAt differs from the latest response (an admin price / gating change bumps it).
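
One way to turn the price table into a skip decision is sketched below. The `planSteps` helper and its `balance` argument are hypothetical: this document does not name an SDK call that returns the user's balance, so pass in whatever figure you have. The order in which optional steps are kept (translation first, then cover) is also an assumption; swap it if the cover matters more to the user.

```typescript
type PriceTable = Record<string, number>;

interface Plan {
  translate: boolean;   // run cv.media.translateSubtitles?
  cover: boolean;       // run cv.media.composeCover?
  cost: number;         // credits for the steps that will run
  affordable: boolean;  // false when even the required steps exceed the balance
}

// Steps the pipeline cannot run without.
const REQUIRED = ["media.upload-temp", "media.transcribe", "media.pick-highlights"];

function planSteps(priceFor: PriceTable, balance: number, wantTranslate = true): Plan {
  const price = (id: string): number => {
    const credits = priceFor[id];
    if (typeof credits !== "number") throw new Error(`no price for ${id}`);
    return credits;
  };

  let cost = REQUIRED.reduce((sum, id) => sum + price(id), 0);
  const affordable = cost <= balance;
  let translate = false;
  let cover = false;

  // Assumed priority: keep translation before the cover when credits are tight.
  if (affordable && wantTranslate && cost + price("media.translate-subtitles") <= balance) {
    translate = true;
    cost += price("media.translate-subtitles");
  }
  if (affordable && cost + price("media.compose-cover") <= balance) {
    cover = true;
    cost += price("media.compose-cover");
  }
  return { translate, cover, cost, affordable };
}
```

Call it as `planSteps(priceFor, balance)` right after building `priceFor`, and stop before the upload step when `affordable` is false.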


Endpoints

cv.media.uploadTemp

Get a one-time upload URL for Clawvard cloud temporary storage, then PUT the binary directly. Returns a fileUrl valid for 24 hours — after that the file is auto-deleted. Pass fileUrl to other endpoints (e.g. cv.media.transcribe).

const { uploadUrl, fileUrl } = await cv.media.uploadTemp({
  filename: "audio.mp3",
  contentType: "audio/mpeg",            // audio/* | video/* | application/octet-stream
  sizeBytes: audio.byteLength,          // ≤ 300 MB; signed URL enforces this exact length
});
await fetch(uploadUrl, {
  method: "PUT",
  headers: { "Content-Type": "audio/mpeg" },
  body: audio,
});

cv.media.transcribe

Returns segments[] from the audio at audioUrl. Use .wait().

const { segments, language } = await cv.media
  .transcribe({
    audioUrl: fileUrl,
    language: "en-US",                  // optional, auto-detect if omitted
  })
  .wait();
// segments: { i, start, end, text }[]

Hard limit: audio duration ≤ 5400s (1.5h). Over-cap rejects with audio_too_long.

Languages: zh-CN en-US ja-JP es-MX pt-BR de-DE fr-FR ko-KR it-IT ru-RU id-ID ms-MY th-TH vi-VN ar-SA yue-CN.
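
For sources longer than the 5400s cap, one option is to split the audio locally and transcribe each piece. The helpers below are a minimal sketch of the bookkeeping only; `chunkRanges` and `offsetSegments` are hypothetical names, not SDK methods. Note that each chunk is a separate upload and transcribe call (so it is charged per call), and that equal-length cuts can land mid-sentence.

```typescript
const MAX_AUDIO_SEC = 5400; // transcribe cap quoted above

interface ChunkRange {
  startSec: number;
  endSec: number;
}

// Split [0, durationSec] into the fewest equal-length ranges that each fit under the cap.
function chunkRanges(durationSec: number, maxSec: number = MAX_AUDIO_SEC): ChunkRange[] {
  if (!(durationSec > 0)) throw new Error("durationSec must be positive");
  const count = Math.ceil(durationSec / maxSec);
  const size = durationSec / count;
  const ranges: ChunkRange[] = [];
  for (let k = 0; k < count; k++) {
    ranges.push({
      startSec: k * size,
      endSec: k === count - 1 ? durationSec : (k + 1) * size,
    });
  }
  return ranges;
}

// Shift one chunk's segment times back onto the source timeline.
function offsetSegments<T extends { start: number; end: number }>(segments: T[], offsetSec: number): T[] {
  return segments.map(s => ({ ...s, start: s.start + offsetSec, end: s.end + offsetSec }));
}
```

After merging the shifted chunks, renumber `i` so the indices stay unique before passing the segments to later steps.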

cv.media.translateSubtitles

Index-aligned translation. Every input i returns a matching output i. Skip if user wants original-language subtitles.

const { translations } = await cv.media.translateSubtitles({
  segments: segments.map(s => ({ i: s.i, en: s.text })),
  targetLanguage: "zh",                 // "zh" | "en" | "ja" | "es" | "fr"
});
// translations: { i, text }[]

Pass the entire transcript in one call. One call is fixed-price regardless of segment count — per-highlight calls waste credits.

cv.media.pickHighlights

Returns N highlight spans. endSec lands on a sentence terminator (.?!。?!) when possible. The server adjusts each span in both directions to fit the duration band: if the LLM picks a span that is too short, the server extends it forward to meet the minimum.

const { highlights } = await cv.media.pickHighlights({
  segments: subSegments,                // post-translation if you translated
  audience: "学生与职场新人",
  mode: "viral",                        // viral (default) | chapter | core
  // durationSecRange: [30, 75],        // optional — overrides mode default
  // count: 5,                          // optional — derived from total duration
  languageHint: "zh",                   // "zh" | "en" | "bilingual" | ...
});
// highlights: { title, subtitle, reason, startSec, endSec }[]

mode picks editorial flavor + default duration band:

  • viral (default) — short-form social hooks, default [30, 75]s, ~1 highlight per 5min of source
  • chapter — documentary-style longer excerpts, default [90, 300]s, ~1 per 10min — use this for 5-minute clips
  • core — high-density core-idea segments, default [30, 75]s, ~1 per 5min

Pass durationSecRange to override the mode default. Pass count to override the auto-derived target. Cut the source at [startSec, endSec] verbatim.
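
To preview roughly how many clips a source will yield before calling the endpoint, the mode defaults listed above can be written down as a lookup. This is only an estimate: the server's exact rounding and clamping are not documented here, so `estimateHighlightCount` (a hypothetical helper) assumes simple rounding with a floor of one.

```typescript
type Mode = "viral" | "chapter" | "core";

// Defaults copied from the mode list above.
const MODE_DEFAULTS: Record<Mode, { durationSecRange: [number, number]; secPerHighlight: number }> = {
  viral:   { durationSecRange: [30, 75],  secPerHighlight: 300 },
  chapter: { durationSecRange: [90, 300], secPerHighlight: 600 },
  core:    { durationSecRange: [30, 75],  secPerHighlight: 300 },
};

// Assumption: round to the nearest whole highlight, never fewer than one.
function estimateHighlightCount(mode: Mode, sourceDurationSec: number): number {
  return Math.max(1, Math.round(sourceDurationSec / MODE_DEFAULTS[mode].secPerHighlight));
}
```

If the estimate differs from what the user asked for, pass `count` explicitly instead of relying on the derived value.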

cv.media.composeCover

Returns one cover image as a base64 data URL. Use .wait().

const cover = await cv.media.composeCover({
  candidateFrames: [
    { jpegBase64, tSec: 0.1 * duration },
    { jpegBase64, tSec: 0.3 * duration },
    { jpegBase64, tSec: 0.5 * duration },
    { jpegBase64, tSec: 0.7 * duration },
    { jpegBase64, tSec: 0.9 * duration },
  ],
  subject: "30 多岁实用主义博主",        // who is in the video
  headline: "本集精选标题",               // big text on the cover
  summary: "1-2 句视频内容描述",          // REQUIRED
  audience: "学生与职场新人",             // REQUIRED
  platform: "Douyin",                    // Douyin | Xiaohongshu | YouTube | Bilibili | Instagram | TikTok
  // Optional: templateId, tag1, tag2, aspectRatio, userPrompt
}).wait();
// cover.imageUrl: "data:image/jpeg;base64,..."

Frame limits: candidateFrames ≤ 12, each decoded JPEG ≤ 200 KB.
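
A local pre-check can catch oversized input before the call is charged. The sketch below is an assumption-laden helper, not part of the SDK: `checkFrames` is a hypothetical name, and "200 KB" is read conservatively as 200,000 bytes because the document does not say whether the limit is decimal or binary.

```typescript
import { Buffer } from "node:buffer";

interface CandidateFrame {
  jpegBase64: string;
  tSec: number;
}

const MAX_FRAMES = 12;
const MAX_FRAME_BYTES = 200_000; // assumption: conservative reading of "200 KB"

// Returns a list of human-readable problems; an empty list means the frames pass.
function checkFrames(frames: CandidateFrame[]): string[] {
  const problems: string[] = [];
  if (frames.length > MAX_FRAMES) {
    problems.push(`too many frames: ${frames.length} > ${MAX_FRAMES}`);
  }
  frames.forEach((f, n) => {
    // Size of the JPEG after base64 decoding, which is what the limit refers to.
    const bytes = Buffer.from(f.jpegBase64, "base64").length;
    if (bytes > MAX_FRAME_BYTES) {
      problems.push(`frame ${n} at ${f.tSec}s is ${bytes} bytes (limit ${MAX_FRAME_BYTES})`);
    }
  });
  return problems;
}
```

When a frame is too large, re-extract it with a higher `-q:v` value (lower quality) or a smaller scale rather than dropping it.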

Resilience: If Stage 3 image-gen fails (safety filter, transient flap), the service retries once after a 2s backoff, then falls back to returning the rank-chosen source frame as-is. Inspect cover.imageSource ("generated" | "frame-fallback") and cover.imageFallbackReason — when fallback fires, you can re-call composeCover with a softer headline/subject, or surface a soft warning to the user.
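
A small check on the result makes the fallback visible to the user. The field names `imageSource` and `imageFallbackReason` come from the paragraph above; the `CoverResult` shape and the `coverWarning` helper are a hypothetical sketch, and the real response may carry more fields.

```typescript
interface CoverResult {
  imageUrl: string;
  imageSource: "generated" | "frame-fallback";
  imageFallbackReason?: string;
}

// Returns a soft warning when the cover is a raw source frame, or null when it was generated.
function coverWarning(cover: CoverResult): string | null {
  if (cover.imageSource === "generated") return null;
  const reason = cover.imageFallbackReason ?? "unknown reason";
  return `Cover fell back to a source frame (${reason}); consider retrying with a softer headline or subject.`;
}
```

Keep in mind that re-calling composeCover is another charged invocation, so ask the user before retrying automatically.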

Aspect ratio defaults (matched to each platform's recommended cover spec):

| Platform | Default aspectRatio | Pixel ref |
|----------|---------------------|-----------|
| Douyin / TikTok / WeChat-Channel / Instagram-Reels | 9:16 | 1080×1920 |
| Xiaohongshu | 3:4 | 1080×1440 |
| Instagram (feed) | 4:5 | 1080×1350 |
| YouTube / Bilibili | 16:9 | 1920×1080 |

Override with aspectRatio if needed. Allowed: 9:16 16:9 1:1 3:4 4:5.
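
The defaults above can be mirrored locally, for example to size a preview or to validate an override before sending it. This is a sketch of the table only: `DEFAULT_ASPECT` and `pixelSize` are hypothetical names, and whether the API accepts "WeChat-Channel" or "Instagram-Reels" as `platform` values is not confirmed by this document.

```typescript
type AspectRatio = "9:16" | "16:9" | "1:1" | "3:4" | "4:5";

// Platform → default ratio, copied from the table above.
const DEFAULT_ASPECT: Record<string, AspectRatio> = {
  Douyin: "9:16",
  TikTok: "9:16",
  "WeChat-Channel": "9:16",
  "Instagram-Reels": "9:16",
  Xiaohongshu: "3:4",
  Instagram: "4:5",
  YouTube: "16:9",
  Bilibili: "16:9",
};

// Pixel size for a ratio with the given short edge (1080 reproduces the table's pixel refs).
function pixelSize(ratio: AspectRatio, shortEdge = 1080): { width: number; height: number } {
  const [w, h] = ratio.split(":").map(Number);
  const scale = shortEdge / Math.min(w, h);
  return { width: Math.round(w * scale), height: Math.round(h * scale) };
}
```

For example, `pixelSize(DEFAULT_ASPECT["Xiaohongshu"])` gives the 1080×1440 reference size from the table.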

Templates (optional templateId, omit to auto-pick): HIGH_CONTRAST_HOOK CINEMATIC_STORY MINIMAL_KNOWLEDGE SPLIT_COMPARISON TECH_UI_OVERLAY PRODUCT_SHOWCASE MEME_EXAGGERATED EDUCATIONAL_STEP AESTHETIC_LIFESTYLE TRAVEL_EXPERIENCE PERSONAL_BRANDING INFO_DENSE_YOUTUBE.


Local commands (ffmpeg + yt-dlp)

  • Fetch source: yt-dlp "<url>" -f 'bv*[height<=1080]+ba/b[height<=1080]' --merge-output-format mp4 -o source.mp4
  • Extract audio: ffmpeg -y -i source.mp4 -vn -ar 16000 -ac 1 -b:a 64k audio.mp3
  • Get duration: ffprobe -v quiet -print_format json -show_format source.mp4
  • Extract a frame at tSec: ffmpeg -y -ss <tSec> -i source.mp4 -frames:v 1 -q:v 4 -vf "scale='if(gt(iw,ih),min(iw\,1024),-2)':'if(gt(ih,iw),min(ih\,1024),-2)'" -f mjpeg frame.jpg
  • Cut a highlight: ffmpeg -y -i source.mp4 -ss <startSec> -to <endSec> -c:v libx264 -preset fast -crf 22 -c:a aac -b:a 192k clip.mp4
  • Burn SRT: ffmpeg -y -i clip.mp4 -vf "subtitles=clip.srt:force_style='Fontname=$CJK_FONTNAME,Fontsize=22,PrimaryColour=&H00FFFFFF&,OutlineColour=&H00000000&,BorderStyle=1,Outline=2,Shadow=0,MarginV=40'" -c:v libx264 -preset fast -crf 22 -c:a copy clip-subbed.mp4
  • Burn source attribution (bottom-left): ffmpeg -y -i clip-subbed.mp4 -vf "drawtext=fontfile='$CJK_FONT':text='来源\: @<博主名>':fontcolor=white:fontsize=36:box=1:boxcolor=black@0.55:boxborderw=24|12:x=20:y=h-text_h-32" -c:v libx264 -preset fast -crf 22 -c:a copy <out.mp4>

⚠️ Always set CJK fonts when text contains Chinese. Without them, ffmpeg's default font has no CJK glyphs and Chinese characters will render as □□ tofu. Set both at the top of your script:

# Path for drawtext (needs absolute font file path)
CJK_FONT=$(ls /System/Library/Fonts/PingFang.ttc \
  /System/Library/Fonts/STHeiti\ Medium.ttc \
  /usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc \
  /usr/share/fonts/truetype/noto/NotoSansCJK-Regular.ttc 2>/dev/null | head -1)
[ -z "$CJK_FONT" ] && { echo "no CJK font found, install fonts-noto-cjk"; exit 1; }

# Family name for libass (used by `subtitles=` filter via fontconfig)
CJK_FONTNAME=$([ "$(uname)" = "Darwin" ] && echo "PingFang SC" || echo "Noto Sans CJK SC")

The subtitles= filter resolves font names through fontconfig, while drawtext needs an absolute file path — that's why we set both.

Cover delivery: cover.jpg is shipped as a separate file alongside the clip. Do NOT splice it onto the front of the mp4. Each platform has a different aspect ratio (3:4, 4:5, 9:16, 16:9), so padding the cover into the clip's frame produces ugly letterboxing. Upload cover + clip as two assets when the platform supports it (Douyin, Xiaohongshu, Bilibili, and YouTube all do).

SRT format — HH:MM:SS,mmm (comma, not period), blank line between cues, ≤18 CJK / ≤30 Latin chars per line:

1
00:00:00,000 --> 00:00:03,400
两分钟规则

2
00:00:03,400 --> 00:00:07,200
立刻执行最简单的事

For bilingual subtitles, put ${en}\n${zh} in one cue — libass renders stacked.
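
The skeleton below calls a `renderSrt` helper that it leaves to the reader. A minimal sketch that follows the format rules above could look like this; the `Cue` shape mirrors the `{ i, start, end, text }` objects built in the skeleton, and `srtTime` is a hypothetical helper name. It does not wrap long lines, so enforce the per-line character limits before calling it.

```typescript
interface Cue {
  i: number;      // 1-based cue number
  start: number;  // seconds from the start of the clip
  end: number;    // seconds from the start of the clip
  text: string;
}

// Seconds → "HH:MM:SS,mmm" (comma before the milliseconds, as SRT requires).
function srtTime(sec: number): string {
  const ms = Math.max(0, Math.round(sec * 1000));
  const pad = (n: number, width: number) => String(n).padStart(width, "0");
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms % 1000, 3)}`;
}

// Each cue is "number / time range / text", with a blank line between cues.
function renderSrt(cues: Cue[]): string {
  return cues
    .map(c => `${c.i}\n${srtTime(c.start)} --> ${srtTime(c.end)}\n${c.text}\n`)
    .join("\n");
}
```

Working in whole milliseconds avoids floating-point drift such as 59.9996 seconds rendering as "00:00:59,1000".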


End-to-end skeleton

import { Clawvard } from "@clawvard/sdk";
import { readFile, writeFile } from "node:fs/promises";
import { execSync } from "node:child_process";

const cv = new Clawvard({ apiKey: process.env.CLAW_API_KEY! });
const URL = "https://www.youtube.com/watch?v=...";
const OUT = "/tmp/job-1";
const FINAL = `${OUT}/final`;
const WORK  = `${OUT}/work`;
execSync(`mkdir -p ${FINAL} ${WORK}`);

// 1-2. Fetch source + extract audio
execSync(`yt-dlp "${URL}" -f 'bv*[height<=1080]+ba/b[height<=1080]' --merge-output-format mp4 -o ${WORK}/source.mp4`);
execSync(`ffmpeg -y -i ${WORK}/source.mp4 -vn -ar 16000 -ac 1 -b:a 64k ${WORK}/audio.mp3`);
const audio = await readFile(`${WORK}/audio.mp3`);

// 3. Upload audio to Clawvard cloud (auto-deleted after 24h)
const { uploadUrl, fileUrl } = await cv.media.uploadTemp({
  filename: "audio.mp3", contentType: "audio/mpeg", sizeBytes: audio.byteLength,
});
const put = await fetch(uploadUrl, { method: "PUT", headers: { "Content-Type": "audio/mpeg" }, body: audio });
if (!put.ok) throw new Error(`upload failed: ${put.status}`);

// 4. Transcribe
const { segments } = await cv.media.transcribe({ audioUrl: fileUrl, language: "en-US" }).wait();

// 5. Translate (skip for original-language workflow)
const { translations } = await cv.media.translateSubtitles({
  segments: segments.map(s => ({ i: s.i, en: s.text })),
  targetLanguage: "zh",
});
const zh = Object.fromEntries(translations.map(t => [t.i, t.text]));
const subSegments = segments.map(s => ({ ...s, text: zh[s.i] ?? s.text }));

// 6. Pick highlights — `mode` drives default duration band + count.
//    Use "viral" for 30-75s social hooks, "chapter" for 90-300s long
//    excerpts, "core" for 30-75s core-idea segments. Override with
//    explicit `durationSecRange` / `count` if needed.
const { highlights } = await cv.media.pickHighlights({
  segments: subSegments,
  audience: "学生与职场新人",
  mode: "viral",
  languageHint: "zh",
});

// 7. Sample 5 candidate frames + composeCover
const probe = JSON.parse(execSync(`ffprobe -v quiet -print_format json -show_format ${WORK}/source.mp4`).toString());
const duration = parseFloat(probe.format.duration);
const candidateFrames = [];
for (const ratio of [0.1, 0.3, 0.5, 0.7, 0.9]) {
  const tSec = ratio * duration;
  const fp = `${WORK}/cover-f-${ratio}.jpg`;
  execSync(`ffmpeg -y -ss ${tSec} -i ${WORK}/source.mp4 -frames:v 1 -q:v 4 -vf "scale='if(gt(iw,ih),min(iw\\,1024),-2)':'if(gt(ih,iw),min(ih\\,1024),-2)'" -f mjpeg ${fp}`);
  candidateFrames.push({ jpegBase64: (await readFile(fp)).toString("base64"), tSec });
}
const cover = await cv.media.composeCover({
  candidateFrames,
  subject: "30 多岁实用主义博主",
  headline: highlights[0].title,
  summary: highlights.map(h => h.reason).join(" / "),
  audience: "学生与职场新人",
  platform: "Douyin",
}).wait();
const [, b64] = cover.imageUrl.split(",");
await writeFile(`${FINAL}/cover.jpg`, Buffer.from(b64, "base64"));

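// 8. Per highlight: cut → SRT → burn subtitles → (optional) source attribution
// CJK fonts (see Local commands): drawtext needs a font file path, libass needs a family name.
const CJK_FONT = execSync(
  `ls /System/Library/Fonts/PingFang.ttc "/System/Library/Fonts/STHeiti Medium.ttc" ` +
  `/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc ` +
  `/usr/share/fonts/truetype/noto/NotoSansCJK-Regular.ttc 2>/dev/null | head -1`
).toString().trim();
if (!CJK_FONT) throw new Error("no CJK font found, install fonts-noto-cjk");
const CJK_FONTNAME = process.platform === "darwin" ? "PingFang SC" : "Noto Sans CJK SC";
const SUB_STYLE = `Fontname=${CJK_FONTNAME},Fontsize=22,PrimaryColour=&H00FFFFFF&,OutlineColour=&H00000000&,BorderStyle=1,Outline=2,Shadow=0,MarginV=40`;

for (let i = 0; i < highlights.length; i++) {
  const hl = highlights[i];
  const idx = String(i + 1).padStart(2, "0");
  const rawCut = `${WORK}/h-${idx}.mp4`;
  const subbed = `${WORK}/h-${idx}-subbed.mp4`;
  const srt    = `${WORK}/h-${idx}.srt`;
  const out    = `${FINAL}/h-${idx}.mp4`;

  execSync(`ffmpeg -y -i ${WORK}/source.mp4 -ss ${hl.startSec} -to ${hl.endSec} -c:v libx264 -preset fast -crf 22 -c:a aac -b:a 192k ${rawCut}`);

  // Keep only segments that overlap the highlight, re-timed relative to the clip start.
  const local = subSegments
    .filter(s => s.end > hl.startSec && s.start < hl.endSec)
    .map((s, n) => ({
      i: n + 1,
      start: Math.max(0, s.start - hl.startSec),
      end: Math.min(hl.endSec - hl.startSec, s.end - hl.startSec),
      text: s.text,
    }));
  await writeFile(srt, renderSrt(local));   // implement renderSrt: HH:MM:SS,mmm + blank lines

  execSync(`ffmpeg -y -i ${rawCut} -vf "subtitles=${srt}:force_style='${SUB_STYLE}'" -c:v libx264 -preset fast -crf 22 -c:a copy ${subbed}`);

  // Optional: burn a "来源: @<博主名>" ("Source: @<creator>") badge bottom-left.
  // If the user didn't ask for attribution, skip this and copy ${subbed} to ${out} instead.
  execSync(`ffmpeg -y -i ${subbed} -vf "drawtext=fontfile='${CJK_FONT}':text='来源\\: @<博主名>':fontcolor=white:fontsize=36:box=1:boxcolor=black@0.55:boxborderw=24|12:x=20:y=h-text_h-32" -c:v libx264 -preset fast -crf 22 -c:a copy ${out}`);
}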

Output: only ${FINAL}/ (cover.jpg + h-NN.mp4) is user-facing — they're two separate deliverables, uploaded as cover-image + video on the destination platform. ${WORK}/ is scratch.


Errors

| HTTP | Error | Fix |
|------|-------|-----|
| 401 | authentication_required | Pass Authorization: Bearer sk-xxx (or X-Clawvard-Key); SDK does this for you when apiKey is set |
| 402 | insufficient_credits (hint: "Need N, have M") | Top up at /billing; or skip optional steps (translate / cover) to lower the per-pipeline cost |
| 402 | course_not_purchased (hint: "media-101-highlight-clips") | Enroll user in the course (free + auto-enroll in v1) |
| 400 | audio_too_long: <N>s exceeds cap 5400s | Trim or chunk source video locally |
| 400 | sizeBytes must be between 1 and 314572800 | Re-encode at lower bitrate (64 kbps mp3 fits 1.5h in ~43 MB) |
| 429 | rate_limit_exceeded | Wait the suggested seconds, then retry |
| 400 | invalid_input (hint: "...") | Check the request shape against this skill's endpoint signatures |
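
The size figure in the sizeBytes row is plain arithmetic: bitrate × duration ÷ 8. A rough pre-check under that assumption is sketched below; `estimateCbrBytes` and `fitsUpload` are hypothetical helpers, and the estimate ignores container overhead and tags, so treat it as a lower bound and use the real file size for `sizeBytes`.

```typescript
const MAX_UPLOAD_BYTES = 314_572_800; // upper bound quoted in the error message (300 × 1024 × 1024)

// Approximate size of a constant-bitrate stream: kilobits per second × seconds ÷ 8 bits per byte.
function estimateCbrBytes(bitrateKbps: number, durationSec: number): number {
  return Math.ceil((bitrateKbps * 1000 * durationSec) / 8);
}

function fitsUpload(bitrateKbps: number, durationSec: number): boolean {
  return estimateCbrBytes(bitrateKbps, durationSec) <= MAX_UPLOAD_BYTES;
}
```

At 64 kbps a full 5400s track comes to 43,200,000 bytes, well under the cap, which is why the pipeline uploads extracted audio rather than the source video.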

What you get

One long video → three 30-second vertical highlight clips with burned-in subtitles and a unified cover

Backend APIs

  • Volcengine ASR (cloud)
  • AWS Bedrock (cloud)
  • OpenAI (cloud)
  • Google Gemini (cloud)

All run via your Clawvard SDK key (CLAW_API_KEY) — no third-party keys.

The open-source skill

clawvard-media

Prereqs: called through your Clawvard SDK key (Clawvard API key), so you do not need to bring your own third-party keys; ffmpeg + yt-dlp are required locally.