IN - The type of the tokens in the sentences
public class WordToSentenceProcessor<IN> extends java.lang.Object implements ListProcessor<IN,java.util.List<IN>>
The main behavior is to look for sentence-ending tokens like "." or "?!?", and to split after them and any following sentence closers like ")". Overlaid on this is an overall choice of state: the WordToSentenceProcessor can be a non-splitter, which always returns one sentence. Otherwise, the WordToSentenceProcessor will also split based on paragraphs, using one of these three states: (1) ignore line breaks in splitting sentences, (2) treat each line as a separate paragraph, or (3) treat two consecutive line breaks as marking the end of a paragraph. The details of sentence breaking within paragraphs are controlled by the following three variables:
<p>' tag.  If two of these follow each other, they are
 coalesced: no empty Sentence is output.  The end-of-file is not
 represented in this Set, but the code behaves as if it were a member.
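
The interaction described above (split after a boundary token and any closers that follow it, drop discard tokens, and never emit an empty sentence) can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: the class `SentenceSplitSketch`, the pattern `BOUNDARY`, and the sets `FOLLOWERS` and `DISCARD` are hypothetical stand-ins chosen for illustration, and the tokens are plain Strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class SentenceSplitSketch {
    // Assumed stand-ins for the three variables; not the library's constants.
    static final Pattern BOUNDARY = Pattern.compile("\\.|[!?]+");
    static final Set<String> FOLLOWERS = Set.of(")", "]", "\"", "'");
    static final Set<String> DISCARD = Set.of("<p>", "\n");

    static List<List<String>> split(List<String> tokens) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean afterBoundary = false; // current sentence already contains its boundary token
        for (String tok : tokens) {
            if (DISCARD.contains(tok)) {
                // Discard tokens end the sentence and are dropped. Consecutive
                // discard tokens coalesce because an empty sentence is never emitted.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                afterBoundary = false;
                continue;
            }
            if (afterBoundary && !FOLLOWERS.contains(tok)) {
                // Anything other than a follower starts a new sentence.
                sentences.add(current);
                current = new ArrayList<>();
                afterBoundary = false;
            }
            current.add(tok);
            if (BOUNDARY.matcher(tok).matches()) {
                afterBoundary = true;
            }
        }
        // End of input behaves like a discard token.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens =
            List.of("(", "Hello", "!", ")", "Bye", ".", "<p>", "<p>", "Ok", "?");
        System.out.println(split(tokens));
        // prints [[(, Hello, !, )], [Bye, .], [Ok, ?]]
    }
}
```

In the sketch, the closing ")" stays with the first sentence, and the two adjacent "<p>" tokens produce only one break.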
 | Modifier and Type | Class and Description | 
|---|---|
| static class  | WordToSentenceProcessor.NewlineIsSentenceBreak | 
| Modifier and Type | Field and Description | 
|---|---|
| static java.util.Set<java.lang.String> | DEFAULT_BOUNDARY_FOLLOWERS | 
| static java.lang.String | DEFAULT_BOUNDARY_REGEX | 
| static java.util.Set<java.lang.String> | DEFAULT_SENTENCE_BOUNDARIES_TO_DISCARD | 
| Constructor | Description |
|---|---|
| WordToSentenceProcessor() | Create a WordToSentenceProcessor using a sensible default list of tokens for sentence ending for English/Latin writing systems. |
| WordToSentenceProcessor(boolean isOneSentence) | Create a WordToSentenceProcessor which never breaks the input into multiple sentences. |
| WordToSentenceProcessor(java.util.Set<java.lang.String> boundaryToDiscard) | Set the set of Strings that will mark the end of a sentence, and which will be discarded after doing so. |
| WordToSentenceProcessor(java.lang.String boundaryTokenRegex, java.util.Set<java.lang.String> boundaryFollowers, java.util.Set<java.lang.String> boundariesToDiscard, java.util.Set<java.lang.String> xmlBreakElementsToDiscard, java.lang.String regionElementRegex, WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak, SequencePattern<? super IN> sentenceBoundaryMultiTokenPattern, java.util.Set<java.lang.String> tokenRegexesToDiscard, boolean isOneSentence, boolean allowEmptySentences) | Flexibly set parameters for converting a list of tokens into sentences. |
| WordToSentenceProcessor(java.lang.String boundaryTokenRegex, java.util.Set<java.lang.String> boundaryToDiscard, java.util.Set<java.lang.String> xmlBreakElementsToDiscard, WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak) | Flexibly set the set of acceptable sentence boundary tokens, but with a default set of allowed boundary following tokens. |
| WordToSentenceProcessor(java.lang.String boundaryTokenRegex, java.util.Set<java.lang.String> boundaryToDiscard, java.util.Set<java.lang.String> xmlBreakElementsToDiscard, WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak, SequencePattern<? super IN> sentenceBoundaryMultiTokenPattern, java.util.Set<java.lang.String> tokenRegexesToDiscard) | Flexibly set the set of acceptable sentence boundary tokens, but with a default set of allowed boundary following tokens. |
| WordToSentenceProcessor(WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak) | Create a WordToSentenceProcessor using a sensible default list of tokens for sentence ending for English/Latin writing systems. |
| Modifier and Type | Method and Description | 
|---|---|
| java.util.List<java.util.List<IN>> | process(java.util.List<? extends IN> words) Take a List (including a Sentence) of input, and return a List that has been processed in some way. |
| <L,F> Document<L,F,java.util.List<IN>> | processDocument(Document<L,F,IN> in) |
| static WordToSentenceProcessor.NewlineIsSentenceBreak | stringToNewlineIsSentenceBreak(java.lang.String name) |
| java.util.List<java.util.List<IN>> | wordsToSentences(java.util.List<? extends IN> words) Returns a List of Lists where each element is built from a run of Words in the input Document. |
public static final java.lang.String DEFAULT_BOUNDARY_REGEX
public static final java.util.Set<java.lang.String> DEFAULT_BOUNDARY_FOLLOWERS
public static final java.util.Set<java.lang.String> DEFAULT_SENTENCE_BOUNDARIES_TO_DISCARD
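
The constructor documentation describes the default sentence-ending tokens as ".", "?", "!" and any combination of ! or ?. The actual value of DEFAULT_BOUNDARY_REGEX is not shown in this document, so the pattern below is an assumption that merely reproduces that described behavior; the class `BoundaryRegexSketch` and the method `isBoundary` are hypothetical names used for illustration.

```java
import java.util.regex.Pattern;

public class BoundaryRegexSketch {
    // Assumed pattern: a single period, or one or more of '!' and '?'.
    // This is a guess at the described behavior, not the library constant.
    static final Pattern ASSUMED_BOUNDARY = Pattern.compile("\\.|[!?]+");

    // The whole token must match, so "..." or "word." do not count as boundaries.
    static boolean isBoundary(String token) {
        return ASSUMED_BOUNDARY.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] samples = {".", "?", "!", "?!?", "!!!?!?!?!!!?!!?!!!", "...", "word", ")"};
        for (String tok : samples) {
            System.out.println(tok + " -> " + isBoundary(tok));
        }
    }
}
```

Because the match is against the whole token, this only works when the tokenizer emits sentence-ending punctuation as separate tokens.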
public WordToSentenceProcessor()
Create a WordToSentenceProcessor using a sensible default list of tokens for sentence ending for English/Latin writing systems. The default set is {".", "?", "!"} plus any combination of ! or ?, as in !!!?!?!?!!!?!!?!!!. A sequence of two or more consecutive line breaks is taken as a paragraph break, which also splits sentences. This is the usual constructor for sentence-breaking reasonable text that uses hard line breaks, so that a blank line (two consecutive line breaks) indicates a paragraph break.
public WordToSentenceProcessor(WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak)
Create a WordToSentenceProcessor using a sensible default list of tokens for sentence ending for English/Latin writing systems. The default set is {".", "?", "!"} plus any combination of ! or ?, as in !!!?!?!?!!!?!!?!!!. You can specify how newlines are treated as sentence breaks: newlines are ignored, every newline is a sentence break, or only two or more consecutive newlines are a sentence break.
newlineIsSentenceBreak - Strategy for treating newlines as paragraph breaks.
public WordToSentenceProcessor(boolean isOneSentence)
Create a WordToSentenceProcessor which never breaks the input into multiple sentences. If the argument is true, the input stream is always output as one sentence. (If it is false, this is equivalent to the no-argument constructor, so why use this?)
isOneSentence - Marker argument: true means to treat input as one sentence
public WordToSentenceProcessor(java.lang.String boundaryTokenRegex,
                               java.util.Set<java.lang.String> boundaryToDiscard,
                               java.util.Set<java.lang.String> xmlBreakElementsToDiscard,
                               WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak)
boundaryTokenRegex - The set of boundary tokens. If null, use default.
boundaryToDiscard - The set of regex for sentence boundary tokens that should be discarded. If null, use default.
xmlBreakElementsToDiscard - xml element names like "p", which will be recognized, treated as sentence ends, and discarded. If null, use none.
newlineIsSentenceBreak - Strategy for counting line ends (boundaryToDiscard) as sentence ends.
public WordToSentenceProcessor(java.lang.String boundaryTokenRegex,
                               java.util.Set<java.lang.String> boundaryToDiscard,
                               java.util.Set<java.lang.String> xmlBreakElementsToDiscard,
                               WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak,
                               SequencePattern<? super IN> sentenceBoundaryMultiTokenPattern,
                               java.util.Set<java.lang.String> tokenRegexesToDiscard)
boundaryTokenRegex - The set of boundary tokens. If null, use default.
boundaryToDiscard - The set of regex for sentence boundary tokens that should be discarded. If null, use default.
xmlBreakElementsToDiscard - xml element names like "p", which will be recognized, treated as sentence ends, and discarded. If null, use none.
newlineIsSentenceBreak - Strategy for counting line ends (boundaryToDiscard) as sentence ends.
public WordToSentenceProcessor(java.util.Set<java.lang.String> boundaryToDiscard)
boundaryToDiscard - A Set of String that will be matched with .equals() and will mark an end of sentence and be discarded.
public WordToSentenceProcessor(java.lang.String boundaryTokenRegex,
                               java.util.Set<java.lang.String> boundaryFollowers,
                               java.util.Set<java.lang.String> boundariesToDiscard,
                               java.util.Set<java.lang.String> xmlBreakElementsToDiscard,
                               java.lang.String regionElementRegex,
                               WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak,
                               SequencePattern<? super IN> sentenceBoundaryMultiTokenPattern,
                               java.util.Set<java.lang.String> tokenRegexesToDiscard,
                               boolean isOneSentence,
                               boolean allowEmptySentences)
boundaryTokenRegex - Tokens that match this regex will end a sentence, but are retained at the end of the sentence.
boundaryFollowers - This is a Set of String that are matched with .equals() which are allowed to be tacked onto the end of a sentence after a sentence boundary token, for example ")".
boundariesToDiscard - This is normally used for newline tokens if they are included in the tokenization. They may end the sentence (depending on the setting of newlineIsSentenceBreak), but at any rate are deleted from sentences in the output.
xmlBreakElementsToDiscard - These are elements like "p" or "sent", which will be wrapped into regex for approximate XML matching. They will be deleted in the output, and will always trigger a sentence boundary.
public static WordToSentenceProcessor.NewlineIsSentenceBreak stringToNewlineIsSentenceBreak(java.lang.String name)
public java.util.List<java.util.List<IN>> process(java.util.List<? extends IN> words)
Specified by: process in interface ListProcessor<IN,java.util.List<IN>>
public java.util.List<java.util.List<IN>> wordsToSentences(java.util.List<? extends IN> words)
PTBTokenizer).
words - A list of already tokenized words (must implement HasWord or be a String).
See also: WordToSentenceProcessor(String, Set, Set, Set, String, NewlineIsSentenceBreak, SequencePattern, Set, boolean, boolean)
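
The three newline strategies named in the class overview (ignore line breaks, break at every line, break at two consecutive line breaks) can be sketched independently of the library. The enum `NewlineMode` and its constant names are assumptions standing in for WordToSentenceProcessor.NewlineIsSentenceBreak, whose constants are not listed in this document, and the sketch assumes the tokenizer emits each line break as a "\n" token.

```java
import java.util.ArrayList;
import java.util.List;

public class NewlineBreakSketch {
    // Hypothetical stand-in for NewlineIsSentenceBreak; constant names are assumed.
    enum NewlineMode { NEVER, ALWAYS, TWO_CONSECUTIVE }

    static List<List<String>> splitOnNewlines(List<String> tokens, NewlineMode mode) {
        List<List<String>> paragraphs = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int newlineRun = 0; // number of newline tokens seen in a row
        for (String tok : tokens) {
            if (tok.equals("\n")) {
                newlineRun++;
                boolean breakHere = mode == NewlineMode.ALWAYS
                    || (mode == NewlineMode.TWO_CONSECUTIVE && newlineRun >= 2);
                if (breakHere && !current.isEmpty()) {
                    paragraphs.add(current);
                    current = new ArrayList<>();
                }
                continue; // newline tokens are dropped from the output in every mode
            }
            newlineRun = 0;
            current.add(tok);
        }
        if (!current.isEmpty()) {
            paragraphs.add(current);
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("a", "\n", "b", "\n", "\n", "c");
        System.out.println(splitOnNewlines(tokens, NewlineMode.NEVER));           // [[a, b, c]]
        System.out.println(splitOnNewlines(tokens, NewlineMode.ALWAYS));          // [[a], [b], [c]]
        System.out.println(splitOnNewlines(tokens, NewlineMode.TWO_CONSECUTIVE)); // [[a, b], [c]]
    }
}
```

The sketch covers only the paragraph-level split; within each resulting paragraph, sentence boundary tokens would still be applied as described in the constructor parameters.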