clipping-video-highlights
End-to-end SOP: long video → N short highlight clips, each with burned subtitles + a shared cover image. Five backend calls (Clawvard cloud) + local ffmpeg / yt-dlp.
Setup (run first, every time)
1. Install / update the SDK (in the user's project root)
# in the dir that has — or will have — package.json:
npm init -y # skip if package.json already exists
npm install @clawvard/sdk@latest # never pin; SDK ships new methods between minor versions
export CLAW_API_KEY=sk-... # ask the user; never hardcode
Node.js ≥ 18 required.
2. Verify local binaries (ffmpeg + yt-dlp)
ffmpeg -version | head -1 # → 6.x or 7.x. Required for cut + subtitle burn.
ffprobe -version | head -1 # → 6.x or 7.x. Bundled with ffmpeg.
yt-dlp --version # → 2024.xx.xx or newer. Required for source fetch.
If missing:
- macOS: brew install ffmpeg yt-dlp
- Debian/Ubuntu: sudo apt-get install -y ffmpeg && pipx install yt-dlp
- Windows: choco install ffmpeg yt-dlp (or winget install yt-dlp.yt-dlp)
3. Confirm course enrollment + cache pricing
import { Clawvard } from "@clawvard/sdk";
const cv = new Clawvard({ apiKey: process.env.CLAW_API_KEY! });
const { services, lastUpdatedAt } = await cv.platform.pricing({
courseId: "media-101-highlight-clips",
});
// services.length should be 5 (the cv.media.* family). If 0, the user
// isn't enrolled in the course — direct them to the course page first.
Hold onto services — the next section uses it for budgeting.
Pipeline
| # | Step | Where | Charged |
|---|---|---|---|
| 1 | Fetch source video | yt-dlp | no |
| 2 | Extract audio (mp3) | ffmpeg | no |
| 3 | cv.media.uploadTemp → upload to Clawvard cloud | SDK + local PUT | per call (see live pricing) |
| 4 | cv.media.transcribe | SDK | per call |
| 5 | cv.media.translateSubtitles (optional) | SDK | per call |
| 6 | cv.media.pickHighlights | SDK | per call |
| 7 | Extract candidate frames | ffmpeg | no |
| 8 | cv.media.composeCover | SDK | per call |
| 9 | Cut clip + burn subtitles (+ optional source watermark) | ffmpeg | no |
Cover ships as a separate cover.jpg (not spliced into the clip — see Local commands below).
One shared cover: composeCover is called once per video, not per highlight. Per-highlight cover is opt-in and charges per extra invocation.
Pricing — read live, decide what to skip
The Setup step already fetched the catalog. Do not hardcode credit numbers — reuse the same services array to budget and decide whether to skip optional steps (translate / cover) when the user's balance is tight.
const priceFor = Object.fromEntries(services.map(s => [s.serviceId, s.pricing.credits]));
// priceFor["media.transcribe"] → e.g. 24 (whatever the platform currently charges)
// priceFor["media.upload-temp"] → e.g. 1
const fullPipelineCost =
priceFor["media.upload-temp"] +
priceFor["media.transcribe"] +
priceFor["media.translate-subtitles"] +
priceFor["media.pick-highlights"] +
priceFor["media.compose-cover"];
The pricing endpoint itself is free (0 credits). Cache the response and refresh it only when the lastUpdatedAt in the latest response differs from your cached value (an admin price or gating change bumps it).
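The budgeting decision described above can be sketched as a pure function over the priceFor map. This is a minimal sketch under stated assumptions, not SDK behavior: planSteps is a hypothetical helper, the balance argument is an assumed input (the source does not say how the balance is read), and keeping the cover ahead of translation when credits are tight is an arbitrary choice.

```typescript
// Hypothetical helper: decide which optional steps fit the user's balance.
// Service IDs follow the priceFor examples above; `balance` is an assumed input.
type PriceMap = Record<string, number>;

const REQUIRED = ["media.upload-temp", "media.transcribe", "media.pick-highlights"];
// Assumed priority for optional steps: try to keep the cover before translation.
const OPTIONAL = ["media.compose-cover", "media.translate-subtitles"];

function planSteps(priceFor: PriceMap, balance: number) {
  const price = (id: string): number => {
    const p = priceFor[id];
    if (p === undefined) throw new Error(`no live price for ${id}`);
    return p;
  };
  const required = REQUIRED.reduce((sum, id) => sum + price(id), 0);
  if (required > balance) {
    // Even the mandatory steps do not fit; nothing should run.
    return { affordable: false, run: [] as string[], skipped: [...OPTIONAL], total: required };
  }
  const run = [...REQUIRED];
  let total = required;
  // Add optional steps in priority order while they still fit.
  for (const id of OPTIONAL) {
    if (total + price(id) <= balance) {
      run.push(id);
      total += price(id);
    }
  }
  const skipped = OPTIONAL.filter(id => !run.includes(id));
  return { affordable: true, run, skipped, total };
}
```

When affordable is false, surface the shortfall to the user before making any charged call.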
Endpoints
cv.media.uploadTemp
Get a one-time upload URL for Clawvard cloud temporary storage, then PUT the binary directly. Returns a fileUrl valid for 24 hours — after that the file is auto-deleted. Pass fileUrl to other endpoints (e.g. cv.media.transcribe).
const { uploadUrl, fileUrl } = await cv.media.uploadTemp({
filename: "audio.mp3",
contentType: "audio/mpeg", // audio/* | video/* | application/octet-stream
sizeBytes: audio.byteLength, // ≤ 300 MB; signed URL enforces this exact length
});
await fetch(uploadUrl, {
method: "PUT",
headers: { "Content-Type": "audio/mpeg" },
body: audio,
});
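Because uploadTemp is charged per call, it may be worth validating the request locally first. The sketch below only encodes the limits stated in the comments above (content type families and the 300 MB ceiling); checkUpload is a hypothetical helper, not part of the SDK.

```typescript
// Limits taken from the comments above: sizeBytes between 1 and 300 MB,
// contentType one of audio/* | video/* | application/octet-stream.
const MAX_UPLOAD_BYTES = 300 * 1024 * 1024; // 314572800

// Returns null when the request looks valid, otherwise a human-readable problem.
function checkUpload(contentType: string, sizeBytes: number): string | null {
  const typeOk =
    contentType.startsWith("audio/") ||
    contentType.startsWith("video/") ||
    contentType === "application/octet-stream";
  if (!typeOk) return `unsupported contentType: ${contentType}`;
  if (!Number.isInteger(sizeBytes) || sizeBytes < 1 || sizeBytes > MAX_UPLOAD_BYTES) {
    return `sizeBytes must be between 1 and ${MAX_UPLOAD_BYTES}`;
  }
  return null;
}
```

Call it with the same contentType and audio.byteLength you are about to send, and re-encode at a lower bitrate if it reports a size problem.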
cv.media.transcribe
Returns segments[] from the audio at audioUrl. Use .wait().
const { segments, language } = await cv.media
.transcribe({
audioUrl: fileUrl,
language: "en-US", // optional, auto-detect if omitted
})
.wait();
// segments: { i, start, end, text }[]
Hard limit: audio duration ≤ 5400s (1.5h). Over-cap rejects with audio_too_long.
Languages: zh-CN en-US ja-JP es-MX pt-BR de-DE fr-FR ko-KR it-IT ru-RU id-ID ms-MY th-TH vi-VN ar-SA yue-CN.
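For sources longer than the cap, one option is to split the audio locally and transcribe each piece. The following is a rough sketch, assuming equal-length chunks are acceptable; planChunks and offsetSegments are hypothetical helpers. Each chunk is a separate upload and transcribe call (so it is charged separately), a cut can land mid-sentence, and segment indices i would still need renumbering after merging.

```typescript
const MAX_AUDIO_SEC = 5400; // cap stated above (1.5h)

// Split [0, durationSec] into equal-length ranges that each fit under the cap.
function planChunks(durationSec: number, capSec: number = MAX_AUDIO_SEC) {
  if (!(durationSec > 0)) throw new Error("durationSec must be positive");
  const count = Math.ceil(durationSec / capSec);
  const size = durationSec / count;
  return Array.from({ length: count }, (_, k) => ({
    startSec: k * size,
    // Pin the last chunk to the true end to avoid floating-point drift.
    endSec: k === count - 1 ? durationSec : (k + 1) * size,
  }));
}

// Shift chunk-local timestamps back onto the source timeline.
function offsetSegments<T extends { start: number; end: number }>(
  segments: T[],
  offsetSec: number,
): T[] {
  return segments.map(s => ({ ...s, start: s.start + offsetSec, end: s.end + offsetSec }));
}
```

Cut each range with ffmpeg (-ss startSec -to endSec) before uploading, then pass each chunk's startSec to offsetSegments when merging the results.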
cv.media.translateSubtitles
Index-aligned translation: every input i returns a matching output i. Skip this step if the user wants original-language subtitles.
const { translations } = await cv.media.translateSubtitles({
segments: segments.map(s => ({ i: s.i, en: s.text })),
targetLanguage: "zh", // "zh" | "en" | "ja" | "es" | "fr"
});
// translations: { i, text }[]
Pass the entire transcript in one call. One call is fixed-price regardless of segment count — per-highlight calls waste credits.
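The index alignment makes merging straightforward, but a defensive check costs little. This sketch assumes the segment and translation shapes shown in the comments above; mergeTranslations is a hypothetical helper that falls back to the original text and reports any index without a usable translation.

```typescript
type Segment = { i: number; start: number; end: number; text: string };
type Translation = { i: number; text: string };

// Merge index-aligned translations back into segments, keeping the original
// text (and recording the index) wherever a translation is missing or empty.
function mergeTranslations(segments: Segment[], translations: Translation[]) {
  const byIndex = new Map(translations.map(t => [t.i, t.text] as const));
  const missing: number[] = [];
  const merged = segments.map(s => {
    const text = byIndex.get(s.i);
    if (text === undefined || text.trim() === "") {
      missing.push(s.i);
      return s;
    }
    return { ...s, text };
  });
  return { merged, missing };
}
```

A non-empty missing list is a signal to warn the user rather than to re-call the endpoint per segment, since each call is charged at the fixed price.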
cv.media.pickHighlights
Returns N highlight spans. endSec lands on a sentence terminator (. ? ! 。 ？ ！) when possible. The server adjusts each span in both directions to fit the duration band: for example, a span the LLM picked too short is extended forward until it meets the minimum.
const { highlights } = await cv.media.pickHighlights({
segments: subSegments, // post-translation if you translated
audience: "学生与职场新人",
mode: "viral", // viral (default) | chapter | core
// durationSecRange: [30, 75], // optional — overrides mode default
// count: 5, // optional — derived from total duration
languageHint: "zh", // "zh" | "en" | "bilingual" | ...
});
// highlights: { title, subtitle, reason, startSec, endSec }[]
mode picks editorial flavor + default duration band:
- viral (default): short-form social hooks, default [30, 75]s, ~1 highlight per 5 min of source
- chapter: documentary-style longer excerpts, default [90, 300]s, ~1 per 10 min; use this for 5-minute clips
- core: high-density core-idea segments, default [30, 75]s, ~1 per 5 min
Pass durationSecRange to override the mode default. Pass count to override the auto-derived target. Cut the source at [startSec, endSec] verbatim.
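To sanity-check a response locally, the mode defaults above can be written down as data. This is only a sketch: the server derives count itself and its exact rounding is not documented here, so estimateCount (rounding, minimum 1) is an assumption, and both helper names are hypothetical.

```typescript
type Mode = "viral" | "chapter" | "core";

// Defaults copied from the mode list above.
const MODE_DEFAULTS: Record<Mode, { range: [number, number]; perMinutes: number }> = {
  viral: { range: [30, 75], perMinutes: 5 },
  chapter: { range: [90, 300], perMinutes: 10 },
  core: { range: [30, 75], perMinutes: 5 },
};

// Rough local estimate of how many highlights to expect (assumed rounding, minimum 1).
function estimateCount(mode: Mode, sourceDurationSec: number): number {
  return Math.max(1, Math.round(sourceDurationSec / 60 / MODE_DEFAULTS[mode].perMinutes));
}

// Report highlights whose length falls outside the requested band.
function outOfBand(
  highlights: { startSec: number; endSec: number }[],
  range: [number, number],
) {
  return highlights.filter(h => {
    const len = h.endSec - h.startSec;
    return len < range[0] || len > range[1];
  });
}
```

An out-of-band span is worth logging, not trimming, since the spans are meant to be cut verbatim.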
cv.media.composeCover
Returns one cover image as a base64 data URL. Use .wait().
const cover = await cv.media.composeCover({
candidateFrames: [
{ jpegBase64, tSec: 0.1 * duration },
{ jpegBase64, tSec: 0.3 * duration },
{ jpegBase64, tSec: 0.5 * duration },
{ jpegBase64, tSec: 0.7 * duration },
{ jpegBase64, tSec: 0.9 * duration },
],
subject: "30 多岁实用主义博主", // who is in the video
headline: "本集精选标题", // big text on the cover
summary: "1-2 句视频内容描述", // REQUIRED
audience: "学生与职场新人", // REQUIRED
platform: "Douyin", // Douyin | Xiaohongshu | YouTube | Bilibili | Instagram | TikTok
// Optional: templateId, tag1, tag2, aspectRatio, userPrompt
}).wait();
// cover.imageUrl: "data:image/jpeg;base64,..."
Frame limits: candidateFrames ≤ 12, each decoded JPEG ≤ 200 KB.
Resilience: If Stage 3 image-gen fails (safety filter, transient flap), the service retries once after a 2s backoff, then falls back to returning the rank-chosen source frame as-is. Inspect cover.imageSource ("generated" | "frame-fallback") and cover.imageFallbackReason — when fallback fires, you can re-call composeCover with a softer headline/subject, or surface a soft warning to the user.
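The frame limits can be checked before the charged call. A minimal sketch, assuming "200 KB" means 200 × 1024 bytes and that jpegBase64 is standard padded base64 (as produced by Node's Buffer.toString("base64")); checkFrames and decodedBytes are hypothetical helpers.

```typescript
const MAX_FRAMES = 12;
const MAX_FRAME_BYTES = 200 * 1024; // assumes "200 KB" means 204800 bytes

// Decoded size of a padded base64 string, computed without allocating a buffer.
function decodedBytes(b64: string): number {
  const padding = b64.endsWith("==") ? 2 : b64.endsWith("=") ? 1 : 0;
  return (b64.length / 4) * 3 - padding;
}

// Returns a list of problems; an empty list means the frames fit the limits.
function checkFrames(frames: { jpegBase64: string; tSec: number }[]): string[] {
  const problems: string[] = [];
  if (frames.length > MAX_FRAMES) {
    problems.push(`too many frames: ${frames.length} > ${MAX_FRAMES}`);
  }
  frames.forEach((f, n) => {
    const bytes = decodedBytes(f.jpegBase64);
    if (bytes > MAX_FRAME_BYTES) {
      problems.push(`frame ${n} is ${bytes} bytes, over ${MAX_FRAME_BYTES}`);
    }
  });
  return problems;
}
```

If a frame is too large, re-extract it with a higher -q:v value or a smaller scale instead of dropping it.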
Aspect ratio defaults (matched to each platform's recommended cover spec):
| Platform | Default aspectRatio | Pixel ref |
|---|---|---|
| Douyin / TikTok / WeChat-Channel / Instagram-Reels | 9:16 | 1080×1920 |
| Xiaohongshu | 3:4 | 1080×1440 |
| Instagram (feed) | 4:5 | 1080×1350 |
| YouTube / Bilibili | 16:9 | 1920×1080 |
Override with aspectRatio if needed. Allowed: 9:16 16:9 1:1 3:4 4:5.
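If the script needs to know the ratio locally (for example to name output files), the table above can be mirrored as a lookup. This is a sketch, not the server's logic: resolveAspectRatio is a hypothetical helper, and mapping the platform value "Instagram" to the feed default (4:5) is an assumption.

```typescript
type AspectRatio = "9:16" | "16:9" | "1:1" | "3:4" | "4:5";

// Defaults copied from the table above, keyed by the composeCover `platform` values.
// Assumption: "Instagram" means the feed format.
const DEFAULT_ASPECT: Record<string, AspectRatio> = {
  Douyin: "9:16",
  TikTok: "9:16",
  Xiaohongshu: "3:4",
  Instagram: "4:5",
  YouTube: "16:9",
  Bilibili: "16:9",
};

const ALLOWED: AspectRatio[] = ["9:16", "16:9", "1:1", "3:4", "4:5"];

// An explicit override wins, but only if it is one of the allowed ratios.
function resolveAspectRatio(platform: string, override?: string): AspectRatio {
  if (override !== undefined) {
    if (!ALLOWED.includes(override as AspectRatio)) {
      throw new Error(`aspectRatio not allowed: ${override}`);
    }
    return override as AspectRatio;
  }
  const ratio = DEFAULT_ASPECT[platform];
  if (!ratio) throw new Error(`no default aspect ratio for platform: ${platform}`);
  return ratio;
}
```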
Templates (optional templateId, omit to auto-pick): HIGH_CONTRAST_HOOK CINEMATIC_STORY MINIMAL_KNOWLEDGE SPLIT_COMPARISON TECH_UI_OVERLAY PRODUCT_SHOWCASE MEME_EXAGGERATED EDUCATIONAL_STEP AESTHETIC_LIFESTYLE TRAVEL_EXPERIENCE PERSONAL_BRANDING INFO_DENSE_YOUTUBE.
Local commands (ffmpeg + yt-dlp)
| When | Command |
|---|---|
| Fetch source | yt-dlp "<url>" -f 'bv*[height<=1080]+ba/b[height<=1080]' --merge-output-format mp4 -o source.mp4 |
| Extract audio | ffmpeg -y -i source.mp4 -vn -ar 16000 -ac 1 -b:a 64k audio.mp3 |
| Get duration | ffprobe -v quiet -print_format json -show_format source.mp4 |
| Extract a frame at tSec | ffmpeg -y -ss <tSec> -i source.mp4 -frames:v 1 -q:v 4 -vf "scale='if(gt(iw,ih),min(iw\,1024),-2)':'if(gt(ih,iw),min(ih\,1024),-2)'" -f mjpeg frame.jpg |
| Cut a highlight | ffmpeg -y -i source.mp4 -ss <startSec> -to <endSec> -c:v libx264 -preset fast -crf 22 -c:a aac -b:a 192k clip.mp4 |
| Burn SRT | ffmpeg -y -i clip.mp4 -vf "subtitles=clip.srt:force_style='Fontname=$CJK_FONTNAME,Fontsize=22,PrimaryColour=&H00FFFFFF&,OutlineColour=&H00000000&,BorderStyle=1,Outline=2,Shadow=0,MarginV=40'" -c:v libx264 -preset fast -crf 22 -c:a copy clip-subbed.mp4 |
| Burn source attribution (bottom-left) | `ffmpeg -y -i clip-subbed.mp4 -vf "drawtext=fontfile='$CJK_FONT':text='来源: @<博主名>':fontcolor=white:fontsize=36:box=1:boxcolor=black@0.55:boxborderw=24 |
⚠️ Always set CJK fonts when the text contains Chinese characters. Without them, ffmpeg's default font has no CJK glyphs and the characters render as □□ ("tofu"). Set both variables at the top of your script:
# Path for drawtext (needs absolute font file path)
CJK_FONT=$(ls /System/Library/Fonts/PingFang.ttc \
  /System/Library/Fonts/STHeiti\ Medium.ttc \
  /usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc \
  /usr/share/fonts/truetype/noto/NotoSansCJK-Regular.ttc 2>/dev/null | head -1)
[ -z "$CJK_FONT" ] && { echo "no CJK font found, install fonts-noto-cjk"; exit 1; }
# Family name for libass (used by `subtitles=` filter via fontconfig)
CJK_FONTNAME=$([ "$(uname)" = "Darwin" ] && echo "PingFang SC" || echo "Noto Sans CJK SC")
The subtitles= filter resolves font names through fontconfig, while drawtext needs an absolute file path, which is why both are set.
Cover delivery: cover.jpg ships as a separate file alongside the clip; do NOT splice it onto the front of the mp4. Platforms expect different cover aspect ratios (3:4, 4:5, 9:16, 16:9), and padding the cover into the clip's frame produces ugly letterboxing. Upload cover + clip as two assets when the platform supports it (Douyin, Xiaohongshu, Bilibili, and YouTube all do).
SRT format — HH:MM:SS,mmm (comma, not period), blank line between cues, ≤18 CJK / ≤30 Latin chars per line:
1
00:00:00,000 --> 00:00:03,400
两分钟规则

2
00:00:03,400 --> 00:00:07,200
立刻执行最简单的事
For bilingual subtitles, put ${en}\n${zh} in one cue — libass renders stacked.
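The End-to-end skeleton calls a renderSrt helper that it leaves unimplemented. One possible implementation of the format rules above (comma before the milliseconds, blank line between cues) is sketched here. It assumes cues carry second-based start and end values, and it does not wrap long lines to the ≤18 CJK / ≤30 Latin limit.

```typescript
type Cue = { i: number; start: number; end: number; text: string };

// Format seconds as HH:MM:SS,mmm (comma, not period, before the milliseconds).
function srtTime(sec: number): string {
  const totalMs = Math.max(0, Math.round(sec * 1000));
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  const pad = (n: number, width: number) => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Render cues as: index line, timing line, text, then a blank line between cues.
function renderSrt(cues: Cue[]): string {
  return (
    cues
      .map(c => `${c.i}\n${srtTime(c.start)} --> ${srtTime(c.end)}\n${c.text}`)
      .join("\n\n") + "\n"
  );
}
```

For bilingual output, build each cue's text as the two languages joined by "\n" before passing it in.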
End-to-end skeleton
import { Clawvard } from "@clawvard/sdk";
import { readFile, writeFile } from "node:fs/promises";
import { execSync } from "node:child_process";
const cv = new Clawvard({ apiKey: process.env.CLAW_API_KEY! });
const URL = "https://www.youtube.com/watch?v=...";
const OUT = "/tmp/job-1";
const FINAL = `${OUT}/final`;
const WORK = `${OUT}/work`;
execSync(`mkdir -p ${FINAL} ${WORK}`);
// 1-2. Fetch source + extract audio
execSync(`yt-dlp "${URL}" -f 'bv*[height<=1080]+ba/b[height<=1080]' --merge-output-format mp4 -o ${WORK}/source.mp4`);
execSync(`ffmpeg -y -i ${WORK}/source.mp4 -vn -ar 16000 -ac 1 -b:a 64k ${WORK}/audio.mp3`);
const audio = await readFile(`${WORK}/audio.mp3`);
// 3. Upload audio to Clawvard cloud (auto-deleted after 24h)
const { uploadUrl, fileUrl } = await cv.media.uploadTemp({
filename: "audio.mp3", contentType: "audio/mpeg", sizeBytes: audio.byteLength,
});
const put = await fetch(uploadUrl, { method: "PUT", headers: { "Content-Type": "audio/mpeg" }, body: audio });
if (!put.ok) throw new Error(`upload failed: ${put.status}`);
// 4. Transcribe
const { segments } = await cv.media.transcribe({ audioUrl: fileUrl, language: "en-US" }).wait();
// 5. Translate (skip for original-language workflow)
const { translations } = await cv.media.translateSubtitles({
segments: segments.map(s => ({ i: s.i, en: s.text })),
targetLanguage: "zh",
});
const zh = Object.fromEntries(translations.map(t => [t.i, t.text]));
const subSegments = segments.map(s => ({ ...s, text: zh[s.i] ?? s.text }));
// 6. Pick highlights — `mode` drives default duration band + count.
// Use "viral" for 30-75s social hooks, "chapter" for 90-300s long
// excerpts, "core" for 30-75s core-idea segments. Override with
// explicit `durationSecRange` / `count` if needed.
const { highlights } = await cv.media.pickHighlights({
segments: subSegments,
audience: "学生与职场新人",
mode: "viral",
languageHint: "zh",
});
// 7-8. Sample 5 candidate frames + composeCover
const probe = JSON.parse(execSync(`ffprobe -v quiet -print_format json -show_format ${WORK}/source.mp4`).toString());
const duration = parseFloat(probe.format.duration);
const candidateFrames = [];
for (const ratio of [0.1, 0.3, 0.5, 0.7, 0.9]) {
const tSec = ratio * duration;
const fp = `${WORK}/cover-f-${ratio}.jpg`;
execSync(`ffmpeg -y -ss ${tSec} -i ${WORK}/source.mp4 -frames:v 1 -q:v 4 -vf "scale='if(gt(iw,ih),min(iw\\,1024),-2)':'if(gt(ih,iw),min(ih\\,1024),-2)'" -f mjpeg ${fp}`);
candidateFrames.push({ jpegBase64: (await readFile(fp)).toString("base64"), tSec });
}
const cover = await cv.media.composeCover({
candidateFrames,
subject: "30 多岁实用主义博主",
headline: highlights[0].title,
summary: highlights.map(h => h.reason).join(" / "),
audience: "学生与职场新人",
platform: "Douyin",
}).wait();
const [, b64] = cover.imageUrl.split(",");
await writeFile(`${FINAL}/cover.jpg`, Buffer.from(b64, "base64"));
// 9. Per highlight: cut → SRT → burn subtitles → (optional) source attribution
for (let i = 0; i < highlights.length; i++) {
const hl = highlights[i];
const idx = String(i + 1).padStart(2, "0");
const rawCut = `${WORK}/h-${idx}.mp4`;
const subbed = `${WORK}/h-${idx}-subbed.mp4`;
const srt = `${WORK}/h-${idx}.srt`;
const out = `${FINAL}/h-${idx}.mp4`;
execSync(`ffmpeg -y -i ${WORK}/source.mp4 -ss ${hl.startSec} -to ${hl.endSec} -c:v libx264 -preset fast -crf 22 -c:a aac -b:a 192k ${rawCut}`);
const local = subSegments
.filter(s => s.end > hl.startSec && s.start < hl.endSec)
.map((s, n) => ({
i: n + 1,
start: Math.max(0, s.start - hl.startSec),
end: Math.min(hl.endSec - hl.startSec, s.end - hl.startSec),
text: s.text,
}));
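await writeFile(srt, renderSrt(local)); // implement renderSrt: HH:MM:SS,mmm + blank line between cues
// CJK_FONTNAME and CJK_FONT must be exported in the environment (see Local commands); the shell expands them.
execSync(`ffmpeg -y -i ${rawCut} -vf "subtitles=${srt}:force_style='Fontname=$CJK_FONTNAME,Fontsize=22,PrimaryColour=&H00FFFFFF&,OutlineColour=&H00000000&,BorderStyle=1,Outline=2,Shadow=0,MarginV=40'" -c:v libx264 -preset fast -crf 22 -c:a copy ${subbed}`);
// Optional: burn a "来源: @<博主名>" ("Source: @<creator>") badge bottom-left. If the user didn't ask for attribution, copy the subbed file to the final path instead of running this step.
execSync(`ffmpeg -y -i ${subbed} -vf "drawtext=fontfile='$CJK_FONT':text='来源\\: @<博主名>':fontcolor=white:fontsize=36:box=1:boxcolor=black@0.55:boxborderw=24|12:x=20:y=h-text_h-32" -c:v libx264 -preset fast -crf 22 -c:a copy ${out}`);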
}
Output: only ${FINAL}/ (cover.jpg + h-NN.mp4) is user-facing — they're two separate deliverables, uploaded as cover-image + video on the destination platform. ${WORK}/ is scratch.
Errors
| HTTP | Error | Fix |
|---|---|---|
| 401 | authentication_required | Pass Authorization: Bearer sk-xxx (or X-Clawvard-Key); SDK does this for you when apiKey is set |
| 402 | insufficient_credits (hint: "Need N, have M") | Top up at /billing; or skip optional steps (translate / cover) to lower the per-pipeline cost |
| 402 | course_not_purchased (hint: "media-101-highlight-clips") | Enroll user in the course (free + auto-enroll in v1) |
| 400 | audio_too_long: <N>s exceeds cap 5400s | Trim or chunk source video locally |
| 400 | sizeBytes must be between 1 and 314572800 | Re-encode at lower bitrate (64 kbps mp3 fits 1.5h in ~40 MB) |
| 429 | rate_limit_exceeded | Wait the suggested seconds, then retry |
| 400 | invalid_input (hint: "...") | Check the request shape against this skill's endpoint signatures |
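The table above can be turned into a small dispatcher so a script reacts consistently. This is a sketch only: how the SDK surfaces the HTTP status and error string is not shown in this document, so the (status, code) inputs, the nextAction helper, and the Action names are all hypothetical.

```typescript
type Action =
  | "fix-auth"
  | "top-up-or-skip-optional"
  | "enroll"
  | "shorten-audio"
  | "shrink-upload"
  | "retry-later"
  | "fix-request";

// Map the HTTP status + error string pairs from the table to a next step.
function nextAction(status: number, code: string): Action {
  if (status === 401) return "fix-auth";
  if (status === 402) {
    return code.startsWith("course_not_purchased") ? "enroll" : "top-up-or-skip-optional";
  }
  if (status === 429) return "retry-later";
  if (status === 400 && code.startsWith("audio_too_long")) return "shorten-audio";
  if (status === 400 && code.startsWith("sizeBytes")) return "shrink-upload";
  // invalid_input and anything unrecognized: re-check the request shape.
  return "fix-request";
}
```

Only "retry-later" is safe to retry automatically; the other actions need a change to the request or a decision from the user.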