public class ArabicTokenizer<T extends HasWord> extends AbstractTokenizer<T>
A single instance of an ArabicTokenizer is not thread safe, because it uses a non-thread-safe JFlex object to do the processing. Multiple instances can be created safely, though. A single instance of an ArabicTokenizerFactory is also not thread safe, because it keeps its options in a local variable.
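
The one-instance-per-thread pattern described above can be sketched with standard-library Java only. `WhitespaceScanner` is a hypothetical stand-in for any stateful, non-thread-safe tokenizer; it is not part of this API, and every class and method name below is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class PerThreadTokenizerSketch {

    // Hypothetical stand-in for a non-thread-safe tokenizer:
    // it keeps a mutable read position, so it must not be shared.
    static final class WhitespaceScanner {
        private final String text;
        private int pos = 0;

        WhitespaceScanner(String text) {
            this.text = text;
        }

        // Returns the next whitespace-delimited token, or null at end of input.
        String next() {
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            if (pos >= text.length()) {
                return null;
            }
            int start = pos;
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            return text.substring(start, pos);
        }
    }

    // Creates a fresh scanner on every call, so no scanner state
    // is ever visible to more than one thread.
    static List<String> tokenize(String text) {
        WhitespaceScanner scanner = new WhitespaceScanner(text);
        List<String> tokens = new ArrayList<>();
        for (String t = scanner.next(); t != null; t = scanner.next()) {
            tokens.add(t);
        }
        return tokens;
    }

    // Tokenizes documents in parallel; each task builds its own scanner.
    static List<List<String>> tokenizeAll(List<String> docs) {
        return docs.parallelStream()
                .map(PerThreadTokenizerSketch::tokenize)
                .collect(Collectors.toList());
    }
}
```

The same shape applies to any worker that needs a tokenizer: construct the instance inside the task instead of sharing one across threads.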
TODO(spenceg): Merge in rules from ibm tokenizer (v5).

TODO(spenceg): Add XML escaping.

TODO(spenceg): When running from the command line, the tokenizer does not produce the correct number of newline-delimited lines for the ATB data sets.

| Modifier and Type | Class and Description |
|---|---|
| static class | ArabicTokenizer.ArabicTokenizerFactory<T extends HasWord> |

Fields inherited from AbstractTokenizer: nextToken

| Constructor and Description |
|---|
| ArabicTokenizer(java.io.Reader r, LexedTokenFactory<T> tf, java.util.Properties lexerProperties) |

| Modifier and Type | Method and Description |
|---|---|
| static TokenizerFactory<CoreLabel> | atbFactory() |
| static TokenizerFactory<CoreLabel> | factory() |
| protected T | getNext(): Internally fetches the next token. |
| static void | main(java.lang.String[] args): A fast, rule-based tokenizer for Modern Standard Arabic (UTF-8 encoding). |
| static ArabicTokenizer<CoreLabel> | newArabicTokenizer(java.io.Reader r, java.util.Properties lexerProperties) |

Methods inherited from AbstractTokenizer: hasNext, next, peek, remove, tokenize

public ArabicTokenizer(java.io.Reader r, LexedTokenFactory<T> tf, java.util.Properties lexerProperties)
public static ArabicTokenizer<CoreLabel> newArabicTokenizer(java.io.Reader r, java.util.Properties lexerProperties)
protected T getNext()
Specified by: getNext in class AbstractTokenizer<T extends HasWord>

public static TokenizerFactory<CoreLabel> factory()
public static TokenizerFactory<CoreLabel> atbFactory()
public static void main(java.lang.String[] args)
Currently, this tokenizer does not do line splitting. It normalizes non-printing line separators across platforms and prints the system default line splitter to the output.
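
As a rough illustration of what normalizing line separators can look like, the following sketch rewrites every line break to the system default separator. It is not the tokenizer's own code; the class and method names are hypothetical, and the regex `\R` linebreak matcher assumes Java 8 or later.

```java
import java.util.regex.Matcher;

final class LineSeparatorSketch {

    // Replaces every Unicode line break (CRLF, LF, CR, NEL, LS, PS, ...)
    // with the platform's default line separator.
    static String normalizeLineSeparators(String s) {
        String separator = Matcher.quoteReplacement(System.lineSeparator());
        return s.replaceAll("\\R", separator);
    }
}
```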
The following normalization options are provided:

- useUTF8Ellipsis : Replaces sequences of three or more full stops with …
- normArDigits : Convert Arabic digits to ASCII equivalents
- normArPunc : Convert Arabic punctuation to ASCII equivalents
- normAlif : Change all alif forms to bare alif
- normYa : Map ya to alif maqsura
- removeDiacritics : Strip all diacritics
- removeTatweel : Strip tatweel elongation character
- removeQuranChars : Remove diacritics that appear in the Quran
- removeProMarker : Remove the ATB null pronoun marker
- removeSegMarker : Remove the ATB clitic segmentation marker
- removeMorphMarker : Remove the ATB morpheme boundary markers
- removeLengthening : Replace all sequences of three or more identical (non-period) characters with one copy
- atbEscaping : Replace left/right parentheses with ATB escape characters
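
To make a few of these options concrete, here is a minimal standard-library sketch of what the corresponding string rewrites might look like. It is not the tokenizer's implementation, which is a JFlex lexer. The class name, the method names (borrowed from the option names for readability), and the exact code point ranges (U+0660 to U+0669 for digits, U+064B to U+0652 for diacritics, U+0622/U+0623/U+0625 for alif forms) are assumptions and may differ from the real rules.

```java
import java.util.regex.Pattern;

final class ArabicNormalizationSketch {

    private static final Pattern ELLIPSIS = Pattern.compile("\\.{3,}");
    private static final Pattern TATWEEL = Pattern.compile("\\u0640");
    // Assumed diacritic range: fathatan (U+064B) through sukun (U+0652).
    private static final Pattern DIACRITICS = Pattern.compile("[\\u064B-\\u0652]");
    // Assumed alif forms: madda above, hamza above, hamza below.
    private static final Pattern ALIF_FORMS = Pattern.compile("[\\u0622\\u0623\\u0625]");
    // Three or more copies of the same non-period character.
    private static final Pattern LENGTHENING = Pattern.compile("([^.])\\1{2,}");

    // Three or more full stops become a single ellipsis character (U+2026).
    static String useUTF8Ellipsis(String s) {
        return ELLIPSIS.matcher(s).replaceAll("\u2026");
    }

    // Arabic-Indic digits (U+0660 to U+0669) become ASCII 0 to 9.
    static String normArDigits(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(c >= '\u0660' && c <= '\u0669' ? (char) ('0' + (c - '\u0660')) : c);
        }
        return sb.toString();
    }

    // Alif with madda or hamza becomes bare alif (U+0627).
    static String normAlif(String s) {
        return ALIF_FORMS.matcher(s).replaceAll("\u0627");
    }

    static String removeDiacritics(String s) {
        return DIACRITICS.matcher(s).replaceAll("");
    }

    static String removeTatweel(String s) {
        return TATWEEL.matcher(s).replaceAll("");
    }

    // Runs of three or more identical non-period characters keep one copy.
    static String removeLengthening(String s) {
        return LENGTHENING.matcher(s).replaceAll("$1");
    }
}
```

Each method is independent, mirroring the way the options can be switched on individually; a caller would chain only the ones it needs.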