edu.berkeley.nlp.lm
Interface WordIndexer<W>

Type Parameters:
W - A type representing words in the language. Can be a String, or something more complex if needed
All Superinterfaces:
Serializable
All Known Implementing Classes:
StringWordIndexer

public interface WordIndexer<W>
extends Serializable

Enumerates words in the vocabulary of a language model. Stores a two-way mapping between integers and words.

Author:
adampauls
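
The two-way mapping described above can be pictured with a minimal sketch. The class TwoWayIndexSketch below is hypothetical and is not the library's StringWordIndexer. It assumes String words and assigns indices in insertion order starting from 0, which the interface itself does not promise.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a two-way word/index mapping.
// Not part of edu.berkeley.nlp.lm; names and index order are assumptions.
class TwoWayIndexSketch {
    // word -> index
    private final Map<String, Integer> wordToIndex = new HashMap<>();
    // index -> word (the list position is the index)
    private final List<String> indexToWord = new ArrayList<>();

    // Returns the existing index for the word, or assigns the next free one.
    int getOrAddIndex(String word) {
        Integer existing = wordToIndex.get(word);
        if (existing != null) {
            return existing;
        }
        int index = indexToWord.size();
        wordToIndex.put(word, index);
        indexToWord.add(word);
        return index;
    }

    // Reverse direction: index back to word.
    String getWord(int index) {
        return indexToWord.get(index);
    }

    // Number of distinct words added so far.
    int numWords() {
        return indexToWord.size();
    }

    public static void main(String[] args) {
        TwoWayIndexSketch indexer = new TwoWayIndexSketch();
        int the = indexer.getOrAddIndex("the");
        int cat = indexer.getOrAddIndex("cat");
        // Asking again for a known word returns the same index.
        System.out.println(the + " " + cat + " " + indexer.getOrAddIndex("the")); // prints "0 1 0"
        System.out.println(indexer.getWord(cat) + " " + indexer.numWords());      // prints "cat 2"
    }
}
```

Keeping both a map and a list makes each direction of the lookup a constant-time operation, at the cost of storing every word reference twice.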

Nested Class Summary
static class WordIndexer.StaticMethods
           
 
Method Summary
 W getEndSymbol()
          Returns the end symbol (usually something like </s>).
 int getIndexPossiblyUnk(W word)
          Should never add to vocabulary, and should return the index of getUnkSymbol() if the word is not in the vocabulary.
 int getOrAddIndex(W word)
          Gets the index for a word, adding if necessary.
 int getOrAddIndexFromString(String word)
           
 W getStartSymbol()
          Returns the start symbol (usually something like <s>).
 W getUnkSymbol()
          Returns the unk symbol (usually something like <unk>).
 W getWord(int index)
          Gets the word object for an index.
 int numWords()
          Returns the number of words that have been added so far.
 void setEndSymbol(W sym)
           
 void setStartSymbol(W sym)
           
 void setUnkSymbol(W sym)
           
 void trimAndLock()
          Informs the implementation that no more words can be added to the vocabulary.
 

Method Detail

getOrAddIndex

int getOrAddIndex(W word)
Gets the index for a word, adding if necessary.

Parameters:
word - the word to look up, and to add if it is not yet in the vocabulary
Returns:
the index of the word

getOrAddIndexFromString

int getOrAddIndexFromString(String word)

getIndexPossiblyUnk

int getIndexPossiblyUnk(W word)
Should never add to vocabulary, and should return the index of getUnkSymbol() if the word is not in the vocabulary.

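Parameters:
word - the word to look up
Returns:
the index of the word, or the index of the unk symbol if the word is not in the vocabulary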
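
The read-only lookup contract can be sketched as follows. UnkLookupSketch is a hypothetical class, not the library's implementation. It assumes String words and assumes the unk symbol "<unk>" is registered at index 0 when the object is constructed, so that a fallback index always exists.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a lookup that never grows the vocabulary.
// Not part of edu.berkeley.nlp.lm; the unk handling shown here is an assumption.
class UnkLookupSketch {
    private static final String UNK = "<unk>";
    private final Map<String, Integer> wordToIndex = new HashMap<>();

    UnkLookupSketch() {
        // Assumption: the unk symbol is always present, at index 0.
        wordToIndex.put(UNK, 0);
    }

    // Adds the word if it is new and returns its index.
    int getOrAddIndex(String word) {
        Integer existing = wordToIndex.get(word);
        if (existing != null) {
            return existing;
        }
        int index = wordToIndex.size();
        wordToIndex.put(word, index);
        return index;
    }

    // Never adds: unknown words map to the index of the unk symbol.
    int getIndexPossiblyUnk(String word) {
        Integer existing = wordToIndex.get(word);
        return existing != null ? existing : wordToIndex.get(UNK);
    }

    int numWords() {
        return wordToIndex.size();
    }

    public static void main(String[] args) {
        UnkLookupSketch indexer = new UnkLookupSketch();
        indexer.getOrAddIndex("cat");
        System.out.println(indexer.getIndexPossiblyUnk("cat")); // prints 1
        System.out.println(indexer.getIndexPossiblyUnk("dog")); // prints 0 (the unk index)
        System.out.println(indexer.numWords());                 // prints 2, "dog" was not added
    }
}
```

A lookup of this kind is the natural choice at query time, when unseen words should be scored as unknown rather than silently enlarging the vocabulary.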

getWord

W getWord(int index)
Gets the word object for an index.

Parameters:
index - the index of the word
Returns:
the word object for that index

numWords

int numWords()
Returns the number of words that have been added so far.

Returns:
the number of words added so far

getStartSymbol

W getStartSymbol()
Returns the start symbol (usually something like <s>).

Returns:
the start symbol

setStartSymbol

void setStartSymbol(W sym)

getEndSymbol

W getEndSymbol()
Returns the end symbol (usually something like </s>).

Returns:
the end symbol

setEndSymbol

void setEndSymbol(W sym)

getUnkSymbol

W getUnkSymbol()
Returns the unk symbol (usually something like <unk>).

Returns:
the unk symbol

setUnkSymbol

void setUnkSymbol(W sym)

trimAndLock

void trimAndLock()
Informs the implementation that no more words can be added to the vocabulary. Implementations may perform some space optimization, and should trigger an error if an attempt is made to add a word after this point.
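
One way an implementation might honor this contract is sketched below. LockableIndexSketch is hypothetical and is not the library's code. It assumes String words, uses ArrayList.trimToSize() as its space optimization, and throws IllegalStateException as its error. The interface does not specify the exception type, nor whether already-known words may still be looked up through getOrAddIndex after locking; both are assumptions here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of trimAndLock() semantics.
// Not part of edu.berkeley.nlp.lm; exception type and lookup behavior are assumptions.
class LockableIndexSketch {
    private final Map<String, Integer> wordToIndex = new HashMap<>();
    private final ArrayList<String> indexToWord = new ArrayList<>();
    private boolean locked = false;

    int getOrAddIndex(String word) {
        Integer existing = wordToIndex.get(word);
        if (existing != null) {
            // Known words are still resolvable after locking (assumption).
            return existing;
        }
        if (locked) {
            throw new IllegalStateException("Vocabulary is locked; cannot add: " + word);
        }
        int index = indexToWord.size();
        wordToIndex.put(word, index);
        indexToWord.add(word);
        return index;
    }

    void trimAndLock() {
        // Release unused backing-array capacity, then refuse further additions.
        indexToWord.trimToSize();
        locked = true;
    }

    public static void main(String[] args) {
        LockableIndexSketch indexer = new LockableIndexSketch();
        indexer.getOrAddIndex("cat");
        indexer.trimAndLock();
        System.out.println(indexer.getOrAddIndex("cat")); // prints 0
        try {
            indexer.getOrAddIndex("dog");
        } catch (IllegalStateException e) {
            System.out.println("rejected"); // prints "rejected"
        }
    }
}
```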