
LLMs Will Replace 8-Track Duplication Engineers

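Summary

The 8-track tape format requires an album's songs to be divided into four programs of as nearly equal duration as possible to reduce wasted tape, which is mathematically an NP-hard balanced partitioning problem. The author analyzed 6,463 samples drawn from Discogs and MusicBrainz and found that the engineers of the era performed remarkably well: the median error was only 5 seconds, and 27.5% of cases reached the theoretical optimum. The results show that these engineers, before the Karmarkar-Karp algorithm was published in 1982, were already achieving very high optimization efficiency in real production through non-textbook means.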

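Why read this

The 4-way balanced partitioning problem can be recast directly as a bin-packing optimization model with time constraints and solved to global optimality in one pass with a current open-source solver or constraint solver (Python / PuLP / OR-Tools), automatically handling track interleaving and silence loss.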

Original article

Human Performance on an NP-Hard Partitioning Problem at Industrial Scale

Or, LLMs Will Replace 8-Track Duplication Engineers

Columbia Records introduced the LP format in 1948, giving listeners twenty-five minutes of music per side of a vinyl record. For fifty years, every album, through vinyl and cassette, was built around this fifty-minute, two-sided limitation. In 1965, the 8-track cartridge changed this format. 8-track cartridges are a single, continuous loop of tape with four programs. To create the 8-track version of an LP, someone at the record company or tape duplication facility had to take the track listing of the LP and partition it so that the four programs were as close to equal length as possible. The longest program determines the length of tape in the cartridge. A cartridge with an 11-minute-long Program 1 and only 8 minutes on programs 2, 3, and 4 will have nine total minutes of silence.

This is a classic NP-hard problem, the easiest hard problem, and a problem that was solved thousands of times by unknown audio engineers all without the aid of a computer. Deep in the bowels of the Discogs and MusicBrainz APIs are data indexing human performance on NP-hard problems done by experts in their field.

Frontier labs are still paying Mechanical Turk workers pennies per hour, when they could be nerding out on a music format that has been dead for forty years. I pity them.

In case you’re wondering, I’m putting an 8-track player in my car and I wanted a copy of Frank Ocean’s Blonde. Then I started thinking. Then I realized an 8-track of Girls’ Broken Dreams Club is impossible. And now we’re here.

The problem

An 8-track cartridge is one continuous loop of tape with four parallel “programs” recorded side by side. The player works through them in sequence, physically shifting the head at the end of each pass, and because all four programs share the same loop, all four are exactly the same length. The longest program sets the physical tape length; every shorter program plays out its remainder as dead air. Wasted tape costs the duplicator money. Splitting a song across programs interrupts its flow. Check out the Harvest release of Dark Side of the Moon; “Money” is split across programs 2 and 3. “Us and Them” is split across programs 3 and 4. It retains the track order of the LP but pays for it by killing the vibe of the best songs on the album.
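The cost of an unbalanced split can be put in numbers. The sketch below is a minimal illustration, not the author's code; `tape_stats` is a made-up name. It takes the four program lengths in seconds and returns the tape length (the longest program) and the total dead air, using the 11/8/8/8-minute cartridge from the introduction as the example.

```python
def tape_stats(program_seconds):
    """Return (tape_length, dead_air) in seconds for one cartridge."""
    tape_length = max(program_seconds)            # the longest program sets the loop length
    dead_air = sum(tape_length - p for p in program_seconds)
    return tape_length, dead_air


# One 11-minute program and three 8-minute programs.
tape_length, dead_air = tape_stats([11 * 60, 8 * 60, 8 * 60, 8 * 60])
print(tape_length // 60, dead_air // 60)  # 11 9
```

Minimizing the first number (the makespan) is the optimization target used throughout the rest of the article.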

For every 8-track release, someone at a duplication plant had to divide the songs into four groups with the most balanced possible total durations. This is balanced 4-way number partitioning, an NP-hard optimization problem solved by hand with a stopwatch or a paper and pen. Here’s the job done well. Rumours — Fleetwood Mac, WEA, 1977 — eleven songs, shipped like this:

Program 1	10:07

Second Hand News	2:54
Oh Daddy	3:58
I Don't Want To Know	3:15

Program 2	10:06

Dreams	4:18
Never Going Back Again	2:15
You Make Loving Fun	3:33

Program 3	10:17

Don't Stop	3:14
Go Your Own Way	3:40
Songbird	3:23

Program 4	9:33

The Chain	4:31
Gold Dust Woman	5:02

This is not the LP’s running order. It’s nowhere near the LP’s running order. The engineer responsible for this 8-track got really close to the optimum, but because we now have computers that can search every permutation, I can tell you the optimum 8-track ordering of Rumours:

Program 1	10:13

Dreams	4:18
Never Going Back Again	2:15
Go Your Own Way	3:40

Program 2	10:10

Don't Stop	3:14
Songbird	3:23
You Make Loving Fun	3:33

Program 3	10:07

Second Hand News	2:54
I Don't Want To Know	3:15
Oh Daddy	3:58

Program 4	9:33

The Chain	4:31
Gold Dust Woman	5:02

The only difference between the ideal ordering and the release ordering is that “Go Your Own Way” is swapped with “You Make Loving Fun”. This saves four seconds of tape. 8-tracks run at 3.75 ips, so assuming 100,000 copies of Rumours were made on 8-track, this single change would have saved about 23.7 miles of tape.

The Economics of Optimizing 8-Tracks

This is probably the best I can do for historical prices of 1/4” lubricated tape, either Scotch 158 or Ampex 675. The cost is about $0.003 per foot. The naive solution (LP track order) for putting Rumours on an 8-track versus the almost-optimum that actually shipped saves 62 seconds per cartridge, or $5,812.50 for a run of 100,000 cartridges. The engineer who put Rumours into an 8-track saved their employer a few months of their salary with a few hours’ work.

Optimizing Rumours further, by swapping “Go Your Own Way” with “You Make Loving Fun”, would save another four seconds. The cost of this tape across 100,000 copies would be about $375, or a little more than what the duplication engineer would make in a week.
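The tape arithmetic here is easy to check. A minimal sketch, assuming the figures quoted in the text (3.75 ips, $0.003 per foot) and the same assumed run of 100,000 cartridges; the function names are illustrative only.

```python
IPS = 3.75                # tape speed, inches per second
DOLLARS_PER_FOOT = 0.003  # the article's estimate for lubricated 1/4" tape
COPIES = 100_000          # assumed production run


def feet_saved(seconds, copies=COPIES):
    """Feet of tape saved across the run by trimming `seconds` per cartridge."""
    return seconds * IPS / 12 * copies


def miles_saved(seconds, copies=COPIES):
    return feet_saved(seconds, copies) / 5280


def dollars_saved(seconds, copies=COPIES):
    return feet_saved(seconds, copies) * DOLLARS_PER_FOOT


print(f"{miles_saved(4):.1f} miles")   # 23.7 miles
print(f"${dollars_saved(62):,.2f}")    # $5,812.50
print(f"${dollars_saved(4):,.2f}")     # $375.00
```

The 62-second figure is the gap between the LP running order and the shipped layout; the 4-second figure is the gap between the shipped layout and the optimum.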

All of this was with pencil and paper. Maybe a calculator. I expect my next project will be writing the Apple II Basic program that will search every permutation for the optimum solution to every possible LP to 8-track. On an Apple II, this task of testing every permutation would take several days.

The Corpus

Every bit of data in this exploration comes from two sources. Discogs is familiar to all overpaid millennial ‘computer workers’ and needs no introduction. MusicBrainz is the encyclopedia for all metadata for every release. For each 8-track I found through Discogs, I pulled the track arrangement, then pulled each song’s length from MusicBrainz, along with the LP’s track arrangement. Building this was a gigantic funnel:

51,197 Discogs 8-track releases processed, basically every 8-track on Discogs.
13,045 (25%) joined cleanly to per-song durations of LP/CD releases on MusicBrainz.
8,266 of those had an unambiguous four-program structure and a scoreable human makespan.
Deduplicated to 6,463 unique solves — the same master, duplicator, and program layout counts once, but cross-shop variants of an album are kept.
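The final deduplication step might look something like the sketch below. The field names (`master_id`, `duplicator`, `programs`) are hypothetical, since the article does not describe the actual schema, and treating two layouts as equal regardless of program order or song order is an assumption of this sketch.

```python
def layout_key(programs):
    """Order-insensitive fingerprint of a program layout."""
    return tuple(sorted(tuple(sorted(p)) for p in programs))


def dedupe(solves):
    """Keep the first solve seen for each (master, duplicator, layout)."""
    unique = {}
    for solve in solves:
        key = (solve["master_id"], solve["duplicator"], layout_key(solve["programs"]))
        unique.setdefault(key, solve)
    return list(unique.values())


# Made-up records: the second repeats the first layout, the third is a cross-shop variant.
solves = [
    {"master_id": 1, "duplicator": "Shop A", "programs": [["a", "b"], ["c"], ["d"], ["e"]]},
    {"master_id": 1, "duplicator": "Shop A", "programs": [["c"], ["b", "a"], ["d"], ["e"]]},
    {"master_id": 1, "duplicator": "Shop B", "programs": [["a", "b"], ["c"], ["d"], ["e"]]},
]
print(len(dedupe(solves)))  # 2
```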

Yes, this is a study of just 12% of all 8-track cartridges I could find, but it’s completely honest. There are no 8-track exclusive compilation albums from Columbia House in this dataset, and I don’t really care about 8-tracks of religious sermons that were never committed to vinyl. This is a dataset of humans solving the 8-track partitioning problem, not a catalog of everything that has ever happened to quarter-inch tape.

You can get the full dataset here in the project’s repo

The Results

Before going over any results, I’d like to give an example of how humans solved this problem. The Rolling Stones Sticky Fingers has ten distinct 8-track versions:

The Rolling Stones — Sticky Fingers (1971)

Duplicator(s)	Programs (tracks in tape order)	Makespan
Atlantic Recording, Kinney Music, Melody Recordings, Rolling Stones	P1 (11:39) — Brown Sugar, Sway, I Got The Blues P2 (11:39) — Wild Horses, Moonlight Mile P3 (11:21) — Can't You Hear Me Knocking, Dead Flowers P4 (11:46) — You Gotta Move, Bitch, Sister Morphine	11:46
Quala Sonic	P1 (11:46) — You Gotta Move, Bitch, Sister Morphine P2 (11:39) — Moonlight Mile, Wild Horses P3 (11:39) — Brown Sugar, Sway, I Got The Blues P4 (11:21) — Dead Flowers, Can't You Hear Me Knocking	11:46
WEA/Warner	P1 (11:32) — Brown Sugar, Sway, Wild Horses (Début / Begin) P2 (11:33) — Wild Horses (Fin / End), Can't You Hear Me Knocking, You Gotta Move P3 (11:39) — Bitch, I Got The Blues, Sister Morphine (Début / Begin) P4 (11:38) — Sister Morphine (Fin / End), Dead Flowers, Moonlight Mile	11:39
Sound Values Marketing	P1 (11:55) — Wild Horses, You Gotta Move, Bitch P2 (11:39) — Brown Sugar, Sway, I Got The Blues P3 (11:30) — Sister Morphine, Moonlight Mile P4 (11:21) — Dead Flowers, Can't You Hear Me Knocking	11:55
Sicamericana Sacifi	P1 (11:18) — No Puedes Escucharme Golpear = Can't You Hear Me Knocking, Flores Muertas = Dead Flowers P2 (11:40) — Caballos Salvajes = Wild Horses, Una Milla A La Luz De La Luna = Moonlight Mile P3 (11:34) — Tengo Blues = I Got The Blues, Azúcar Marrón = Brown Sugar, Balanceo = Sway P4 (12:01) — Debes Correrte = You Gotta Move, Perra = Bitch, Hermana Morfina = Sister Morphine	12:01
Rolling Stones	P1 (11:21) — Can't You Hear Me Knocking, Dead Flowers P2 (11:30) — Moonlight Mile, Sister Morphine P3 (12:10) — Wild Horses, Sway, You Gotta Move P4 (11:24) — Brown Sugar, Bitch, I Got The Blues	12:10
Not On Label The Rolling Stones	P1 (12:19) — Brown Sugar, You Gotta Move, Moonlight Mile P2 (11:19) — Wild Horses, Sister Morphine P3 (11:26) — Sway, Bitch, I Got The Blues P4 (11:21) — Can't You Hear Me Knocking, Dead Flowers	12:19
Sound Ventures 2	P1 (11:22) — Brown Sugar, Sway, Bitch P2 (11:19) — Wild Horses, Sister Morphine P3 (11:11) — Can't You Hear Me Knocking, I Got The Blues P4 (12:33) — You Gotta Move, Dead Flowers, Moonlight Mile	12:33
Br 8	P1 (13:28) — Bitch, I Got The Blues, Moonlight Mile P2 (9:40) — Sister Morphine, Dead Flowers P3 (9:49) — Can't You Hear Me Knocking, You Gotta Move P4 (13:28) — Brown Sugar, Sway, Wild Horses	13:28
Pieces O Eight	P1 (19:23) — Wild Horses, Moonlight Mile, Brown Sugar, Sway P2 (19:02) — Sister Morphine, Bitch, You Gotta Move, Can't You Hear Me Knocking P3 (3:55) — I Got The Blues P4 (4:05) — Dead Flowers	19:23

At the top, four separate shops - Atlantic, Kinney, Melody, and the Stones’ own label - independently landed on the provably optimal split. WEA/Warner saved seven seconds of tape by respecting the original track order of the LP and cutting “Wild Horses” and “Sister Morphine” in half. Other duplicators had a good showing, except for Pieces O Eight, a bootleg label.

House Styles

There are differences in how production houses approached this problem. When no perfect partition could be found, there are basically two options: splitting a track across programs, or repeating a short song to fill up tape. Here’s a table of what different duplicators did. The last column (†) is restricted to contested albums, meaning the same LP duplicated by 2+ shops:

Duplicator	n	Split %	Repeat %	Split % on contested albums†
Hardman Industries	43	53.5	9.3	—
Precision	170	50.6	7.6	70%
Capitol	349	50.4	0.6	58%
Polydor	51	49.0	0.0	—
Apple	31	48.4	0.0	—
Solo Products	76	46.1	6.6	—
GRT	335	44.8	3.3	44%
Columbia/CBS	641	42.3	6.7	38%
Atlantic Recording	76	42.1	1.3	—
PolyGram	55	36.4	5.5	—
Club	39	33.3	12.8	—
Audio Devices	70	32.9	8.6	—
RCA	438	28.8	7.8	71%
MCA	53	26.4	37.7	—
WEA/Warner	239	24.7	5.0	40%
Lear Jet	51	23.5	5.9	—
A&M Limited	73	21.9	41.1	—
Ampex	474	15.2	23.8	31%
Quality Limited	51	3.9	15.7	—
ITCC	55	0.0	27.3	—

Capitol split half its tapes and repeated almost none (50.4% vs. 0.6%); Apple and Polydor never repeated a single song. At the other end, ITCC split zero of 55 tapes and repeated 27% of them, and Ampex and A&M Limited lean the same way — when the math didn’t work, they padded with a repeated short song rather than take a knife to a long one. The same problem, opposite tools.

An Engineer’s Style

Above is a great demonstration of what this data can tell us about house styles for the partition problem, but that doesn’t answer the question: how good were humans at solving this problem? For this, we compare the engineer’s partition to several baselines. First, keeping the album order. Second, LPT (longest processing time first), the standard greedy algorithm, published in the 1960s. Third, Karmarkar–Karp differencing, also known as the KK algorithm or LDM (largest differencing method), first published in 1982. Finally, because I have a computer made in the last 20 years, I can simply iterate through every possible solution to find the optimum.

Take the Rumours example again. It shipped on an LP, and we can determine what length of tape is required for different algorithms:

Approach	Programs (tracks)	Makespan
Keep the album order (best cuts)	P1 (9:27) — Second Hand News, Dreams, Never Going Back Again P2 (10:17) — Don't Stop, Go Your Own Way, Songbird P3 (11:19) — The Chain, You Make Loving Fun, I Don't Want To Know P4 (9:00) — Oh Daddy, Gold Dust Woman	11:19
LPT (greedy)	P1 (10:52) — Don't Stop, Go Your Own Way, Oh Daddy P2 (10:45) — Second Hand News, Dreams, You Make Loving Fun P3 (10:09) — Never Going Back Again, Songbird, The Chain P4 (8:17) — I Don't Want To Know, Gold Dust Woman	10:52
Karmarkar–Karp (1982)	P1 (10:35) — Second Hand News, Dreams, Songbird P2 (10:27) — Don't Stop, I Don't Want To Know, Oh Daddy P3 (10:19) — Never Going Back Again, The Chain, You Make Loving Fun P4 (8:42) — Go Your Own Way, Gold Dust Woman	10:35
WEA's engineer (1977)	P1 (10:17) — Don't Stop, Go Your Own Way, Songbird P2 (10:07) — Second Hand News, I Don't Want To Know, Oh Daddy P3 (10:06) — Dreams, Never Going Back Again, You Make Loving Fun P4 (9:33) — The Chain, Gold Dust Woman	10:17
Provable optimum (free shuffle)	P1 (10:13) — Dreams, Never Going Back Again, Go Your Own Way P2 (10:10) — Don't Stop, Songbird, You Make Loving Fun P3 (10:07) — Second Hand News, I Don't Want To Know, Oh Daddy P4 (9:33) — The Chain, Gold Dust Woman	10:13

One caveat for this table, which I should have made explicit before: the programs are interchangeable (you can swap program 1 with program 3 and it doesn’t change the result), and song order within a program is interchangeable. It doesn’t matter if program 4 goes The Chain to Gold Dust Woman or Gold Dust Woman to The Chain.
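That interchangeability is also what keeps an exact search cheap. The sketch below is one possible way to find the provable optimum, not necessarily how the author did it: a depth-first search that tries each song in each program, skips programs whose current load duplicates one already tried, and abandons any branch that cannot beat the best split found so far. Durations are the Rumours track times listed above, converted to seconds; `optimal_partition` is an illustrative name.

```python
RUMOURS = {  # seconds, from the track times listed in the article
    "Second Hand News": 174, "Dreams": 258, "Never Going Back Again": 135,
    "Don't Stop": 194, "Go Your Own Way": 220, "Songbird": 203,
    "The Chain": 271, "You Make Loving Fun": 213,
    "I Don't Want To Know": 195, "Oh Daddy": 238, "Gold Dust Woman": 302,
}


def optimal_partition(durations, k=4):
    """Exact minimum-makespan split of `durations` into k programs."""
    items = sorted(durations.items(), key=lambda kv: -kv[1])  # longest first
    loads = [0] * k
    groups = [[] for _ in range(k)]
    best = {"makespan": float("inf"), "programs": None}

    def place(i):
        if i == len(items):
            if max(loads) < best["makespan"]:
                best["makespan"] = max(loads)
                best["programs"] = [list(g) for g in groups]
            return
        name, secs = items[i]
        tried = set()
        for p in range(k):
            if loads[p] in tried:  # equally loaded programs are interchangeable
                continue
            tried.add(loads[p])
            if loads[p] + secs >= best["makespan"]:  # cannot beat the incumbent
                continue
            loads[p] += secs
            groups[p].append(name)
            place(i + 1)
            groups[p].pop()
            loads[p] -= secs

    place(0)
    return best["makespan"], best["programs"]


makespan, programs = optimal_partition(RUMOURS)
print(f"{makespan // 60}:{makespan % 60:02d}")  # 10:13
```

The particular optimal layout returned may differ from the one in the table if several splits share the same makespan; only the makespan itself is guaranteed.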

Given the same track listing, each of these algorithms produces a different result. LPT doesn’t perform that much better than the naive ‘copy the LP track order’ solution. Both LPT and KK leave one short program, the typical failure mode for those algorithms. The human engineer was one swap away from the optimum.
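The three non-exhaustive baselines fit in a few dozen lines. What follows is a sketch of the textbook versions as I understand them, assuming the Rumours track times listed earlier (in seconds, in LP running order); the author's implementations may differ in tie-breaking or other details. `album_order` tries every way to cut the running order into four contiguous chunks, `lpt` drops each song, longest first, into the currently shortest program, and `karmarkar_karp` repeatedly merges the two most lopsided partial solutions, pairing the heaviest program of one with the lightest of the other.

```python
import heapq
from itertools import combinations, count

RUMOURS = {  # seconds, in LP running order
    "Second Hand News": 174, "Dreams": 258, "Never Going Back Again": 135,
    "Don't Stop": 194, "Go Your Own Way": 220, "Songbird": 203,
    "The Chain": 271, "You Make Loving Fun": 213,
    "I Don't Want To Know": 195, "Oh Daddy": 238, "Gold Dust Woman": 302,
}


def makespan(programs, durations):
    return max(sum(durations[t] for t in p) for p in programs)


def album_order(durations, k=4):
    """Best split of the running order into k contiguous chunks."""
    names = list(durations)
    best = None
    for cuts in combinations(range(1, len(names)), k - 1):
        bounds = (0, *cuts, len(names))
        programs = [names[a:b] for a, b in zip(bounds, bounds[1:])]
        if best is None or makespan(programs, durations) < makespan(best, durations):
            best = programs
    return best


def lpt(durations, k=4):
    """Greedy: longest song first, always into the shortest program."""
    programs = [[] for _ in range(k)]
    heap = [(0, i) for i in range(k)]  # (current load, program index)
    for name in sorted(durations, key=durations.get, reverse=True):
        load, i = heapq.heappop(heap)
        programs[i].append(name)
        heapq.heappush(heap, (load + durations[name], i))
    return programs


def karmarkar_karp(durations, k=4):
    """Multi-way differencing: merge the two partial solutions with the largest spread."""
    tie = count()  # unique tiebreaker so the heap never compares lists
    heap = []
    for name, secs in durations.items():
        parts = [(secs, [name])] + [(0, []) for _ in range(k - 1)]
        heapq.heappush(heap, (-secs, next(tie), parts))
    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)
        _, _, b = heapq.heappop(heap)
        # Both are sorted heaviest first, so pair a's heaviest with b's lightest.
        merged = [(la + lb, ga + gb) for (la, ga), (lb, gb) in zip(a, reversed(b))]
        merged.sort(key=lambda part: -part[0])
        spread = merged[0][0] - merged[-1][0]
        heapq.heappush(heap, (-spread, next(tie), merged))
    return [group for _, group in heap[0][2]]


for solve in (album_order, lpt, karmarkar_karp):
    m = makespan(solve(RUMOURS), RUMOURS)
    print(f"{solve.__name__}: {m // 60}:{m % 60:02d}")
# album_order: 11:19
# lpt: 10:52
# karmarkar_karp: 10:35
```

These reproduce the Makespan column of the table above for the first three rows, against 10:17 for the shipped layout.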

In any case, we can see that for Rumours, humans were good. Really good. Better than an algorithm that was published five years after this order was laid out. Does this observation hold for the entire corpus of 8-track cartridges? Surprisingly, yes.

Human Results on Computer Problems

Across all 6,463 solves, the median engineer landed just 5 seconds over the provable optimum — on an album running about forty minutes, a miss of barely 0.2%. And the kicker on that number: the durations I’m scoring against are themselves only good to about three or four seconds, since LP and CD editions drift from the 8-track masters by a few seconds (I measured it). So the typical engineer wasn’t 5 seconds from perfect so much as sitting at the floor of what this data can even resolve. More than a quarter of all tapes — 27.5% — hit the exact optimum: the single best arrangement out of thousands, found by hand.

Humans vs LPT vs Karmarkar–Karp

Human vs LPT: wins 3192 (58%), ties 714 (13%), losses 1569 (29%) (n=5475)
Human vs KK: wins 2541 (46%), ties 861 (16%), losses 2073 (38%) (n=5475)

They beat the textbook, too. Re-scored on the identical durations, the human beat greedy LPT on 58% of tapes and lost on 29% — better than two to one. Against Karmarkar–Karp, the state-of-the-art differencing method that wasn’t even published until 1982, the humans still came out ahead on balance: 46% wins to 38% losses. A workforce of anonymous duplication-plant employees, working in the early seventies with a stopwatch and the LP’s label copy, was beating an algorithm from a decade in their future.
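The head-to-head tally is just a comparison of makespans per tape. A minimal sketch, with illustrative function names; the three-tape example is made up except for the first pair, which is the Rumours shipped layout (617s) against LPT (652s). The percentage check uses the counts reported above.

```python
def tally(human, algo):
    """Wins/ties/losses for the human; inputs are parallel lists of makespans in seconds."""
    wins = sum(h < a for h, a in zip(human, algo))
    ties = sum(h == a for h, a in zip(human, algo))
    losses = sum(h > a for h, a in zip(human, algo))
    return wins, ties, losses


def percentages(wins, ties, losses):
    n = wins + ties + losses
    return tuple(round(100 * x / n) for x in (wins, ties, losses))


print(tally([617, 700, 650], [652, 700, 640]))  # (1, 1, 1)
print(percentages(3192, 714, 1569))             # (58, 13, 29)
print(percentages(2541, 861, 2073))             # (46, 16, 38)
```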

The obvious suspicion is that they were secretly running one of these methods without knowing its name. They weren’t. Only 6.8% of human partitions exactly match the arrangement LPT would have produced, and 7.6% match Karmarkar–Karp — meaning more than nine times in ten, the engineer shipped something neither textbook algorithm would have chosen. They matched the textbook’s results without using its methods. Whatever those shops were doing with a stopwatch and a razor blade, it isn’t in the literature.
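Checking whether a human layout "exactly matches" an algorithm's output only makes sense if program order and song order are ignored. One way to do that, sketched here under the assumption that no song is repeated and no two programs are identical (the article does not say how its comparison was implemented), is to reduce each layout to a set of sets.

```python
def canonical(programs):
    """Order-insensitive form: program order and song order are both ignored."""
    return frozenset(frozenset(p) for p in programs)


shipped = [
    ["Second Hand News", "Oh Daddy", "I Don't Want To Know"],
    ["Dreams", "Never Going Back Again", "You Make Loving Fun"],
    ["Don't Stop", "Go Your Own Way", "Songbird"],
    ["The Chain", "Gold Dust Woman"],
]
# Same layout with the programs reversed and each program's songs reversed.
reordered = [list(reversed(p)) for p in reversed(shipped)]
optimum = [
    ["Dreams", "Never Going Back Again", "Go Your Own Way"],
    ["Don't Stop", "Songbird", "You Make Loving Fun"],
    ["Second Hand News", "I Don't Want To Know", "Oh Daddy"],
    ["The Chain", "Gold Dust Woman"],
]

print(canonical(shipped) == canonical(reordered))  # True
print(canonical(shipped) == canonical(optimum))    # False
```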

This is the median, and there is a long tail of shops doing terrible jobs. One of the worst in the Discogs dataset is for Sublime’s 40oz to Freedom, an obvious one-off in the vein of my version of Frank Ocean’s Blonde. The Sublime 8-track is basically just some redditor four years ago, so there’s that, or I just care too much about the problem (I do, because I wrote this).

Again, all the documentation is available here on GitHub. Go look, have fun.

AI is going to replace engineers partitioning LPs for 8-track duplicators

So we have a huge dataset of humans solving an NP-hard problem that was done without the aid of computers or even the algorithms computers would now use. We can use this for benchmarking AI models. This is a great test for LLM performance, because we can test against the provable optimum, classical algorithms, and human expert performance. Does the LLM beat the human? Not always.

For this test, the model receives only anonymized track lengths in mm:ss. No titles, no artist, nothing that would be found in training data. The prompt for each LLM is as follows:

You are mastering an album for 8-track cartridge. The tape has four programs, and all four play for exactly the same length of tape: the longest program determines the tape length, and every shorter program wastes the difference as silence.

Here are the track lengths:

 1. 3:38
 2. 6:19
 [... the rest of the tracks ...]

Assign every track to exactly one of the four programs so that the four program lengths are as equal as possible — specifically, minimize the length of the longest program. Tracks may go in any order and any program. Use every track exactly once.

Respond with JSON: {"programs": [[...], [...], [...], [...]]} where each inner list contains the track numbers (1-based) assigned to that program.

The response is something like:

{"programs": [[2, 1], [8, 7], [3, 6], [5, 4]]}
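Scoring a response like this takes two steps: check that it is a valid partition, then compute its makespan. The sketch below is one way to do that, not the author's harness; `score_response` is an illustrative name, and the eight track lengths are made up for the example (the prompt above elides the real ones). A response that drops or repeats a track is treated as malformed and returns `None`.

```python
import json


def parse_mmss(text):
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def score_response(raw, track_seconds, k=4):
    """Return the makespan in seconds, or None if the response is malformed."""
    try:
        programs = json.loads(raw)["programs"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    if not isinstance(programs, list) or len(programs) != k:
        return None
    if not all(isinstance(p, list) for p in programs):
        return None
    used = [t for p in programs for t in p]
    if not all(isinstance(t, int) and not isinstance(t, bool) for t in used):
        return None
    # Every track number 1..n must appear exactly once.
    if sorted(used) != list(range(1, len(track_seconds) + 1)):
        return None
    return max(sum(track_seconds[t - 1] for t in p) for p in programs)


# Hypothetical track lengths, for illustration only.
tracks = [parse_mmss(t) for t in ["3:00", "4:00", "5:00", "2:00", "6:00", "1:00", "3:30", "3:30"]]

print(score_response('{"programs": [[2, 1], [8, 7], [3, 6], [5, 4]]}', tracks))  # 480
print(score_response('{"programs": [[2, 1], [8, 7], [3, 6], [5]]}', tracks))     # None
```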

Each model was tested against different strata:

Stratum	n	Selection
perfect	100	the human provably hit the optimum — model can only tie or lose
near-miss	50	the human was 1–30s off — model can beat the actual human
hard	50	≤5 optimal partitions, no perfect split, 12+ songs

From this, we get some interesting results:

Model	perfect	near-miss	hard	vs human (all)
Fable 5, max effort	100/100	50/50	50/50	95W / 105T / 0L
Fable 5, default	95/96	45/45	48/48	88W / 100T / 12L¹
GPT-5.2, xhigh	87/88	34/34	30/30	61W / 90T / 1L on completed²
Haiku 4.5 + thinking	42/100	—	—	median residual 4.5s
GPT-5-mini @ high	28/29²	—	—	—
GPT-5-nano @ high	23/23²	—	—	—
GPT-5.2, no reasoning	2/93	—	—	median residual 106s
GPT-5-mini, minimal	0/93	—	—	median residual 163s
Haiku 4.5, no thinking	1/98	—	—	median residual 125s

¹ Losses are formatting failures, not worse partitions.

² I ran out of OpenAI credits; completion rate, not quality — GPT models with reasoning were at 97–100% of valid answers throughout.

These are interesting findings!

The best models rule everything, but need reasoning.

Both Fable 5 and GPT-5.2 xhigh nail this task. Fable 5 at max effort could not have performed better, with zero losses to a human. Models without reasoning performed terribly, with the human beating them nearly every time. If there’s one takeaway, it’s that any comparison with expert human performance has to name a specific model and reasoning budget. Haiku 4.5 with an 8k thinking budget performed about as well as a human in the 1970s: it lands at a 4.5s median residual and 42% optima, roughly a typical duplication-plant engineer (human study: 5s median, 27.5% optima).

Better models use fewer tokens

Fable needs ~6k tokens, GPT-5.2 ~14k, GPT-5-mini ~30k, GPT-5-nano ~45k tokens. This is surprising, although I don’t know what ‘LLM thinking’ actually is, and I don’t think anyone does. It’s more context, but hey, let’s give the non-deterministic god machine some credit here.

That said, the failures of LLMs, here expressed as malformed output (a common failure was not having all the songs on the 8-track), would not have happened with a human. Because they would be fired.

I actually did this

Before running all of these tests, I actually did this the old-fashioned way. With pencil and paper and thinking. This entire project was inspired by putting Blonde on an 8-track and from experience, I can tell you this is a hard problem. The trouble is, I can’t tell you how I did it. There’s some human heuristic I used, definitely not an algorithm, and I can’t write it down. This seems to be what humans in 1977 who gave a damn did too. This is not what the dude making the Sublime 8-track did.

So I can’t tell you how to do this without testing all possible permutations, but human intuition can get pretty close. This sort of thing has shown up in other fields, like Foldit, an online ‘let humans perfect protein folding’ game. Classical computer algorithms can only get so close, and humans watching these classical algorithms got frustrated when they saw a solution the computer didn’t. Humans can see stuff that classical algorithms can’t. And now there’s a dozen Nature publications to prove it.

But now we have LLMs. They’re also a black box, and if you throw enough tokens and context at them, they’ll out-perform humans. They won’t be able to tell you how they did it, either.

This isn’t a victory for humans over algorithms or LLMs over humans, or anything like that. It’s just a fact that a dead and derided music format left behind a benchmark where human intuition beat classical methods that wouldn’t be in a textbook for a decade after the work was done. And half a lifetime later, LLMs would outperform humans for reasons we can’t really inspect.

So that’s something.
