fix(listener): block sticker/GIF aggregator sites and bare media files so Discord stickers don't slip into the share library #2
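Incident: user yhn posted a Discord sticker (a klipy GIF) in the share channel. message.content contained only the bare https://klipy.com/gifs/... URL, so the listener treated it as a normal share, ran it through OG fetch and classification, and it was marked APPROVED and published as #18. The original _SKIP_HOSTS blocked only Discord's own domains (discord.com, cdn.discordapp.com, etc.) and did not account for the sticker panel defaulting to tenor / klipy / giphy. A related problem: bare image links such as mmbiz.qpic.cn (#5) should not be ingested either. The fix has two layers: (1) add the full set of tenor / klipy / giphy domains to _SKIP_HOSTS; (2) as a fallback, match media extensions (.gif/.png/.jpg/.mp4/...) against the path, because hosts can never be enumerated exhaustively. Matching looks only at the path; a .jpg that appears in the query does not count (to avoid false positives on normal API links such as ?file=foo.jpg). Covered by 19 new test cases.
/share is the single-page submission entry point (prefilled via ?url=..., for the bookmarklet); /feed is the wall that displays approved items. In three places, listener.py (the first reply and the APPROVED final-state reply) and commands.py (the success receipt for the /share slash command), the bot pointed the '点此查看 / 已收录到内卷地狱分享库' ("Click to view / Added to the Involution Hell share library") link at /share. As a result, users who clicked through saw an empty submission form rather than the content they had just shared.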
Pull request overview
This PR hardens the Discord share listener’s URL filtering so sticker/GIF aggregator links and direct (bare) media-file URLs don’t get treated as “share submissions” and ingested into the backend.
Changes:
- Expand `_SKIP_HOSTS` to include tenor/klipy/giphy domains commonly emitted by Discord sticker/GIF features.
- Add a path-based media extension fallback (`.gif/.png/.jpg/.../.mp3/...`) to skip bare media links regardless of host, while ignoring query-only matches.
- Add unit tests covering aggregator domains, bare media links, and non-skipped “normal article” URLs.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/chat_bot/cogs/listener.py | Extends skip logic to include sticker/GIF aggregators and path-based media extension filtering. |
| tests/test_listener_skip.py | Adds test cases validating the new skip behavior and non-regressions. |
    # The path is matched in lowercase and is decoupled from the query: ?foo=bar.jpg will not match by mistake
    return parsed.path.lower().endswith(_MEDIA_EXTENSIONS)
Media-extension skipping is based on `parsed.path.endswith(...)`, but the listener’s URL regex can still capture trailing punctuation such as `,` / `.` / `,` / `。` after a URL in chat. In that case the path becomes `/file.jpg,` and won’t match, defeating the new safeguard. Consider normalizing before the check (e.g., stripping common trailing punctuation from the URL/path) and adding a regression test for it.
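One way to read the suggestion above is to strip trailing punctuation before parsing. The following is only a minimal sketch: `strip_trailing_punct`, `is_bare_media`, and the `_TRAILING_PUNCT` set are hypothetical names, and the extension tuple is an illustrative subset rather than the real `_MEDIA_EXTENSIONS` from `listener.py`.

```python
from urllib.parse import urlparse

# Assumed punctuation set (ASCII plus common full-width marks); not from the PR.
_TRAILING_PUNCT = ",.;:!?,。;:!?、"
# Illustrative subset; the real tuple lives in listener.py and is not shown here.
_MEDIA_EXTENSIONS = (".gif", ".png", ".jpg", ".mp4", ".mp3")


def strip_trailing_punct(url: str) -> str:
    """Drop punctuation that a chat URL regex may have captured after the URL."""
    return url.rstrip(_TRAILING_PUNCT)


def is_bare_media(url: str) -> bool:
    """Return True when the path (not the query) ends with a media extension."""
    parsed = urlparse(strip_trailing_punct(url))
    return parsed.path.lower().endswith(_MEDIA_EXTENSIONS)
```

Parentheses and brackets are deliberately left out of the stripped set in this sketch, since some legitimate URLs end with `)`; whether that trade-off suits the listener is a judgment call for the author.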
    @pytest.mark.parametrize(
        "url",
        [
            # Bare images (the WeChat image host, or a direct image link on any host)
            "https://mmbiz.qpic.cn/mmbiz_jpg/abc/640.jpg",
            "https://example.com/path/photo.PNG",
            "https://i.example.com/cat.gif",
            "https://example.com/foo.webp",
            # Direct video/audio links
            "https://example.com/clip.mp4",
            "https://example.com/audio.mp3",
            # SVG (blocked even when the host is not on the blocklist, in line with the server-side SVG upload blocklist)
            "https://example.com/icon.svg",
        ],
Consider adding at least one regression case where a bare media URL is followed by trailing punctuation (e.g. https://example.com/a.jpg, or Chinese punctuation) to reflect how URLs appear in real Discord messages; otherwise the new path-based extension filter can be bypassed if the extracted URL includes that punctuation.
    host = parsed.netloc.lower().split(":")[0]
    if host in _SKIP_HOSTS:
host = parsed.netloc.lower().split(":")[0] is a bit fragile (doesn’t handle IPv6 literals like [::1]:443 and can be confused by userinfo in the URL). Prefer parsed.hostname (already lowercased by urlparse) and then compare against _SKIP_HOSTS.
    - host = parsed.netloc.lower().split(":")[0]
    - if host in _SKIP_HOSTS:
    + host = parsed.hostname
    + if host is not None and host in _SKIP_HOSTS:
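To make the difference concrete, here is a small comparison of the two host extractions using only the standard library. The helper names `host_via_split` and `host_via_hostname` are hypothetical and exist only for this illustration.

```python
from urllib.parse import urlparse


def host_via_split(url: str) -> str:
    """The original approach: take everything before the first colon in netloc."""
    return urlparse(url).netloc.lower().split(":")[0]


def host_via_hostname(url: str):
    """The suggested approach: let urlparse strip userinfo, port and brackets."""
    return urlparse(url).hostname


# IPv6 literal: the split stops at the first colon inside the brackets.
print(host_via_split("https://[::1]:443/x"))       # [
print(host_via_hostname("https://[::1]:443/x"))    # ::1

# Userinfo: the split returns the username instead of the host.
print(host_via_split("https://user:pw@klipy.com/gifs/x"))     # user
print(host_via_hostname("https://user:pw@klipy.com/gifs/x"))  # klipy.com
```

Note that `hostname` is `None` for strings without a network location, which is why the suggested change guards with `host is not None`.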
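A user posted an announcement of their own PR (#2) in the share channel, and the bot ingested it as a "community share", publishing it as #19. The same will happen with other dev sub-paths such as issue/commit/compare/actions/releases/discussions/blob/tree. Strategy: skip when the path has at least 3 segments (/<org>/<repo>/<sub>) and org=involutionhell; repository home pages and third-party repositories all pass through. This targets only the dev self-referential noise and does not affect legitimate shares. 11 new test cases.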
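The rule described above could look roughly like the sketch below. It assumes the URLs in question are on `github.com` (the text names the org but not the host), and `is_own_org_dev_url` is a hypothetical name, not the function in `listener.py`.

```python
from urllib.parse import urlparse

_OWN_ORG = "involutionhell"  # org named in the commit message
# Assumption: the dev links are GitHub URLs; the host is not stated in the source.
_GITHUB_HOSTS = frozenset({"github.com", "www.github.com"})


def is_own_org_dev_url(url: str) -> bool:
    """Skip /<org>/<repo>/<sub>... links that belong to our own org."""
    parsed = urlparse(url)
    if parsed.hostname not in _GITHUB_HOSTS:
        return False
    # Drop empty segments so leading/trailing slashes do not inflate the count.
    segments = [seg for seg in parsed.path.split("/") if seg]
    return len(segments) >= 3 and segments[0].lower() == _OWN_ORG
```

With this shape, `https://github.com/involutionhell/<repo>/pull/2` is skipped, while the repository home page and any third-party repository URL still go through the normal share flow.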
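Incident
A user posted a Discord sticker (a klipy GIF) in the share channel, and `message.content` contained only the bare `https://klipy.com/gifs/...` URL. The listener treated it as a normal share, ran it through OG fetch and classification, and it was marked APPROVED and published as #18.
A similar case is #5 (`mmbiz.qpic.cn/...640.jpg`, a direct WeChat image link).
Root cause
The original `_SKIP_HOSTS` blocked only Discord's own domains such as `discord.com` / `cdn.discordapp.com`. It did not account for the sticker panel defaulting to tenor / klipy / giphy, nor for bare image links on arbitrary hosts.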
Fix
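The two-layer check described in this PR (host blocklist first, then a path-only media extension fallback) can be sketched as follows. This is not the code from `listener.py`: `should_skip` is a hypothetical name, and the set and tuple contents are illustrative subsets built from the domains and extensions mentioned in this PR.

```python
from urllib.parse import urlparse

# Illustrative entries only; the full real lists are not shown in this PR text.
_SKIP_HOSTS = frozenset({
    "discord.com", "cdn.discordapp.com",   # Discord's own domains
    "tenor.com", "klipy.com", "giphy.com",  # sticker/GIF aggregators
})
_MEDIA_EXTENSIONS = (".gif", ".png", ".jpg", ".webp", ".svg", ".mp4", ".mp3")


def should_skip(url: str) -> bool:
    """Return True when the URL should not be treated as a share."""
    parsed = urlparse(url)
    # Layer 1: known hosts that never carry a real share.
    host = parsed.hostname
    if host is not None and host in _SKIP_HOSTS:
        return True
    # Layer 2: fallback on the path only, so ?file=foo.jpg in the query is ignored.
    return parsed.path.lower().endswith(_MEDIA_EXTENSIONS)
```

The fallback exists because hosts cannot be enumerated exhaustively, while restricting the match to `parsed.path` keeps ordinary API links with a media filename in the query from being dropped.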
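DB cleanup
`#5` and `#18` have already been set to REJECTED directly in the prod DB with `UPDATE shared_links SET status = 'REJECTED' WHERE id IN (5, 18)`; the frontend no longer displays them.
Deployment
The service for this repo has already been restarted through systemd (`restart chat-bot`) to load the new code (systemd reads from disk).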
Test
🤖 Generated with Claude Code