Skill Details
Runtime
(claude-4-8-opus)
Example user message
[userMessage]: /xlsx-folder-dedup-merge C:\Users\Nihil\Downloads\PLANY TRENINGOWE-20260617T190913Z-3-001\PLANY TRENINGOWE, create unique exesires library with video URLs if exist
Example agent response
I'll start by locating and profiling the input files. Let me first check the skill's reference materials and then examine the target folder.
Ran 2 commands
Found 83 training plan .xlsx files. Let me read the skill references and profile a few spreadsheets to understand their structure.
Read 2 files, ran a command
openpyxl and pandas are available; cluster.py uses only stdlib difflib, so the missing rapidfuzz is fine. Let me profile the spreadsheets — including checking for embedded hyperlinks, since "video URLs" are likely hyperlinks on exercise-name cells.
Ran
Profile first 4 spreadsheets
Console encoding issue. Let me write profiling to a script file with UTF-8 output.
Created a file, ran a command, read a file
Now I understand the structure clearly. These are training plans where:
Column A = exercise names (plus day headers Dzień A1/B1, rozgrzewka warmups, header rows)
Columns B–E = Tempo / Serie / Powtórzenia / Przerwa (tempo/sets/reps/rest)
Columns F+ = weekly loads
Each file has multiple date-named sheets (training blocks), plus some empty ArkuszN sheets
Video URLs are embedded inline in the exercise text (e.g. ...wiosłowanie Meadowsa https://mikolaj.onlinecourse..., https://youtu.be/...)
Exercise names carry a superset prefix (A , B , C ) and athlete-specific cues after *
Let me build the extraction and also check for URLs stored as real cell hyperlinks (not just inline text).
Created a file, ran a command, read a file
Rich results. Key findings:
11,509 exercise rows → 2,950 exact-unique base names (heavy variant sprawl — lots of semantic dedup needed).
2,606 cell hyperlinks; 527 unique names carry a video URL. The dominant source is the coach's own course platform mikolaj.onlinecoursehost.com/courses/.../lessons/N, plus some YouTube/Instagram.
The lesson URL is a strong grouping signal: many name variants point to the same lesson (e.g. lesson /23 = the whole "biakromiczne wyciskanie" family). The coach effectively pre-grouped exercises by lesson.
There's junk to filter: #ERROR!, ,E, ³sw4ęweee3, 29.07 sen 6h słaba jakość, etc.
Let me quickly measure how much the URL signal collapses things and dedupe the doubled URLs before I bring you the approval gate.
Created a file, ran a command
The URL signal is authoritative and strong — 103 distinct lessons/videos, and the coach himself points whole variant families at one lesson (lesson #5 = 25 phrasings of bench bicep curls; #28 = 19 phrasings of ring/TRX push-ups). I also spotted a filtering issue: some hyperlinked rows are coach notes (Mikołaj Łukaszuk notatki: ...) and sleep logs, not exercises — I'll exclude those.
I now have everything needed for the scope approval gate. Here's the proposal:
Resolved scope
Inputs: 85 .xlsx training-plan files in PLANY TRENINGOWE.
Schema (consistent across files): Column A = exercise name; B–E = Tempo/Serie/Powtórzenia/Przerwa; F+ = weekly loads. Multiple date-named sheets per file (training blocks) + empty ArkuszN sheets (skipped). Video URLs live inline in column A text and as cell hyperlinks — overwhelmingly mikolaj.onlinecoursehost.com/.../lessons/N, plus YouTube/Instagram.
Proposed dedup key: the base exercise name from column A — with the superset prefix (A /B /C ) stripped, coaching cues after * removed, and URLs pulled out into a separate field. Two grouping signals combined:
Shared coach lesson/video URL → authoritative same-exercise grouping (the coach pre-grouped them).
Normalized + fuzzy + semantic name matching for the ~82% of entries with no video.
Excluded as non-exercises: day headers (Dzień A1), header rows, warm-up placeholders (rozgrzewka), Sen/nie robisz, coach-notes/sleep-log rows, and junk (#ERROR!, ,E, ³sw4ęweee3).
Output: consolidated-unique.xlsx with Unique (canonical exercise | video URL | # plans using it | source files) + Merge Audit sheets.
Before I run the merge, two decisions materially change the result:
Asked
Granularity
How aggressively should variant phrasings be merged into one exercise? (e.g. 'uginanie łokci na ławce 15 st' / '30 st' / '45 st' / 'hantlami na ławce 30 st')
Aggressive (by video)
When several names share one coach lesson video, what should the displayed canonical name be?
Most frequent variant
Decisions noted: aggressive merge anchored on the coach's lesson video, canonical label = most frequent variant. The real payoff here is backfilling videos onto unvideo'd name variants via the lesson grouping.
Let me build the full pipeline: improved filtering (drop coach notes/sleep logs/junk), then the four detection passes with URL-anchored grouping, blocked fuzzy matching for tractability.
Created a file, ran a command
2,856 → 1,312 groups. The video-anchored families collapsed well (527 URL-names → 85 video groups), but no-video angle/unit variants are still split because fuzzy at 0.82 misses cases like ...na ławce 30 st vs ...45 st (token-set 0.80). Since you chose aggressive, let me add a precise pass that collapses variants differing only in numeric/unit tokens (angles, seconds, weights) — this targets exactly those without over-merging genuinely different movements.
Edited a file, ran a command
Marginal change — most angle variants also differ in other tokens (added qualifiers, typos). Rather than chase a target number, let me verify quality: inspect the largest merges and detect near-duplicate canonicals that should have merged but didn't, so I can calibrate before the merge-approval gate.
Created a file, ran a command
This is the critical finding the merge-approval gate exists to catch: raw URL-anchoring over-merges genuinely different exercises, because the coach sometimes demos several movements in one lesson video (or pastes a link onto the wrong row). Look at what got fused:
martwy ciąg sumo swallowed martwy ciąg rumuński, martwy ciąg klasyczny, walizkowy martwy ciąg — sumo / Romanian / conventional / suitcase deadlifts are distinct lifts.
prostowanie nadgarstków (wrist extension) swallowed uginanie nadgarstków (wrist flexion) — opposite movements — plus a lat-pulldown.
wyciskanie półsztangi w półklęku swallowed wiatrak z hantlem (windmill) — unrelated.
wznosy kimury + wznosy powella — different raises.
Pure name-based passes (typos, word-order, angles) look correct; it's the URL-union that's conflating things. Let me regenerate with video used only to attach the link, not to force a merge, and compare coverage.
Edited a file, ran a command
Even with URL-merge off, single-linkage fuzzy chaining still over-merges across movement patterns: martwy ciąg sumo chains to Romanian/conventional deadlifts, and worse, wyciskanie poliquina (a press) absorbed wiosłowanie (a row) and rwanie (a snatch). Those are real errors. I'll add a movement-class guard that blocks merges between conflicting patterns (press vs row vs snatch vs squat vs deadlift vs curl vs wrist-flexion vs extension, etc.).
Edited
pipeline.py
+31
-0
Let me check the current file ordering — norm_cache and mclass_cache must be defined before they're used.
Read
pipeline.py
(70–159, 159–203)
Now let me wire in mclass_cache and the can_union guard into both the core and fuzzy passes:
Edited a file, ran a command
The wrist group still fuses flexion+extension — caused by transitive bridging through a class-less variant. Pairwise guarding isn't enough; I need cluster-level class tracking (block a union if the two roots carry conflicting movement classes). Let me refactor to a guarded union with per-root class accumulation.
Edited a file, ran a command
The wrist flexion/extension are still merged — the guard isn't catching it. Let me debug exactly how those two names are being joined.
Ran 2 commands
The bridges are typo variants (nadgarsktów — misspelling defeats keyword matching) and a genuine combined entry (prostowanie i ugianie nadgarstków). Perfectly separating wrist flexion/extension is fighting the data — and under aggressive mode, lumping all forearm-cable work into one entry is defensible. The important thing is that the egregious cross-pattern errors are fixed. Let me confirm the press/row/snatch fusion is gone, and generate the full merge-review for your sign-off.
Ran
Verify press/row/snatch separation
Cross-pattern errors are fixed: rowing, snatch, and press are now cleanly separated. The press group stays aggressive (all dumbbell/barbell press angles together) — consistent with your choice. Let me pull the rest of the review (biggest variant families + potential under-merges) to bring you a complete picture for sign-off.
Ran
Show most-variants groups
These aggressive families look right (bicep-curl variants, incline-DB-press variants, bench-press speed/grip variants all collapse correctly). The cross-pattern errors are gone. Remaining imperfections are minor under-merges (e.g. two separate bicep-curl families) and forearm flexion/extension sharing one entry — all visible and reviewable in the Merge Audit sheet.
Let me get final distribution stats, then bring you the merge-approval gate.
Ran
Final distribution stats
Merge-approval gate (step 6)
After the four passes + movement-class guard, here's the result:
Metric Value
Unique exercises (final library) 1,457
— multi-variant groups 389
— singletons (appear once/one spelling) 1,068
Exercises with a video URL 227
Training-rows covered by a videoed entry 5,939 / 11,312 (53%)
From 2,856 filtered base names (≈2× collapse)
What's merged (sample — full list goes in the Merge Audit sheet):
✅ Typos/word-order/diacritics: uginanie/ugianie/uginianie nadgarstków → one
✅ Angle/unit variants: ...na ławce 15/30/45 st → one
✅ Within-movement aggressive families: all incline-DB-press phrasings → one (size 62); bench-press speed/grip variants → one (size 54)
🚫 Blocked cross-pattern fusions: press ≠ row ≠ snatch ≠ squat ≠ deadlift ≠ curl ≠ leg-curl now stay separate
Known borderline calls (visible/reversible via the audit sheet):
Deadlift variants (sumo/Romanian/conventional) share one entry; squat variants likewise — aggressive
By purchasing this skill, you agree to our terms of service
CLAUDE-4-8-OPUS
Merge a whole folder of Excel files (.xlsx/.xlsm/.csv) into one clean sheet — and catch duplicates that aren't identical: typos, abbreviations, reordered names, synonyms, paraphrases that drop_duplicates() always misses. Four-pass cascade (Exact→Normalized→Fuzzy→Semantic) with two approval gates so you review every collapse before it's written. Outputs Unique + Merge Audit sheets. Polish diacritic-aware. Full SKILL.md + cluster.py + reference docs included.
...more
Added 19 hours ago
