Yup! That stays one chunk.
chunker.push("Last week she said, “Hi there. How are you?”");
chunker.flush()
Emitting "Last week she said, “Hi there. How are you?”"
The only exception is newlines: I wanted it to emit whenever a newline is encountered.
chunker.push("Last week she said,\n“Hi there. How are you?”");
chunker.flush()
Emitting "Last week she said,"
Emitting "“Hi there. How are you?”"
If you want to disable this behavior, pass {emitParagraphs: false} to the constructor, i.e.:
const chunker = new SentenceChunker({emitParagraphs: false});
There's also chunkLength, which sets the maximum character length of a chunk (128 by default), and emitTrimmed, which controls whether each emitted chunk has its leading/trailing whitespace trimmed (default true). One last thing: if your input is always growing, like when you're streaming one response and concatenating it into one big string, you can use GrowingSentenceChunker instead (in the same file). Example:
const chunker = new GrowingSentenceChunker();
chunker.onChunk((chunk) => { console.log(`Emitting "${chunk}"`); });
chunker.push("Last week");
chunker.push("Last week she said");
chunker.push("Last week she said, “Hi there. How are you?”");
chunker.flush()
Emitting "Last week she said, “Hi there. How are you?”"
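To make the "growing input" idea concrete, here is a rough sketch of how such a wrapper could work: remember the string seen so far and forward only the new suffix. GrowingAdapter and its fields are hypothetical names made up for this illustration; this is not the code in sentence.js, and the real GrowingSentenceChunker may handle things differently.

```javascript
// Hypothetical adapter (not from sentence.js): converts "whole string so far"
// updates into deltas for any target object that has push(text) and flush().
class GrowingAdapter {
  constructor(target) {
    this.target = target;
    this.seen = ""; // the full text received on the previous push
  }

  push(fullText) {
    if (fullText.startsWith(this.seen)) {
      // The input extends what we already have: forward only the new part.
      const delta = fullText.slice(this.seen.length);
      if (delta.length > 0) this.target.push(delta);
    } else {
      // The input is not a continuation (assumption: treat it as a new
      // response), so emit what is buffered and start over.
      this.target.flush();
      this.target.push(fullText);
    }
    this.seen = fullText;
  }

  flush() {
    this.target.flush();
    this.seen = "";
  }
}
```

With this, pushing "Last week" and then "Last week she said" forwards "Last week" followed by " she said" to the wrapped object, so nothing gets processed twice.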
And just in case it's not obvious, the .flush() call will just emit anything left in the buffer, even if it's shorter than the maximum length. If you don't call .flush(), it will wait for another input that pushes it over the chunk limit before emitting again.
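If the buffer-and-flush behavior is easier to follow in code, here is a deliberately simplified model of it. TinyChunker is a hypothetical class written only for this illustration, and its splitting rule (last sentence punctuation inside the limit, then last space, then a hard cut) is an assumption; the real SentenceChunker does more, such as emitting on newlines, and its exact splitting logic lives in sentence.js.

```javascript
// Hypothetical, simplified model of a buffering chunker; NOT the code in sentence.js.
class TinyChunker {
  constructor({ chunkLength = 128 } = {}) {
    this.chunkLength = chunkLength;
    this.buffer = "";
    this.callbacks = [];
  }

  onChunk(callback) {
    this.callbacks.push(callback);
  }

  emit(text) {
    const trimmed = text.trim();
    // Skip chunks that contain only whitespace.
    if (trimmed.length > 0) this.callbacks.forEach((cb) => cb(trimmed));
  }

  push(text) {
    this.buffer += text;
    // Nothing is emitted until the buffer grows past the limit.
    while (this.buffer.length > this.chunkLength) {
      const head = this.buffer.slice(0, this.chunkLength);
      // Prefer the last sentence-ending punctuation inside the limit...
      let cut = Math.max(head.lastIndexOf("."), head.lastIndexOf("!"), head.lastIndexOf("?")) + 1;
      // ...then the last space, and finally a hard cut at the limit.
      if (cut <= 0) cut = head.lastIndexOf(" ") + 1;
      if (cut <= 0) cut = this.chunkLength;
      this.emit(this.buffer.slice(0, cut));
      this.buffer = this.buffer.slice(cut);
    }
  }

  flush() {
    // Emit whatever is left, even if it is shorter than the limit.
    this.emit(this.buffer);
    this.buffer = "";
  }
}
```

In this model, text shorter than chunkLength just sits in the buffer until either a later push takes it over the limit or flush() is called, which is the same waiting behavior described above.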
I spent a bit of time working on a JavaScript sentence splitter - it might work right out of the box for this purpose! It tries to split on punctuation when possible for smooth flow, but has a max length option to ensure run-on sentences still get split, too. It also maintains a buffer so you can just keep pushing streaming text into it and it will emit when it has a full chunk.
https://raw.githubusercontent.com/painebenjamin/anachrovox/refs/heads/main/www/sentence.js
Example:
const chunker = new SentenceChunker();
chunker.onChunk((sentenceChunk) => { console.log(`Emitting "${sentenceChunk}"`); });
chunker.push("The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.");
chunker.flush()
Output:
Emitting "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration."
Emitting "The best performing models also connect the encoder and decoder through an attention mechanism."
Emitting "We propose a new simple network architecture, the Transformer, based solely on attention mechanisms,"
Emitting "dispensing with recurrence and convolutions entirely."
Emitting "Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train."