edu.berkeley.nlp.lm.io
Class KneserNeyLmReaderCallback<W>

java.lang.Object
  extended by edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback<W>
Type Parameters:
W - the word type, which the WordIndexer maps to integer IDs
All Implemented Interfaces:
ArrayEncodedNgramLanguageModel<W>, LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>>, LmReaderCallback<LongRef>, NgramOrderedLmReaderCallback<LongRef>, NgramLanguageModel<W>, Serializable

public class KneserNeyLmReaderCallback<W>
extends Object
implements NgramOrderedLmReaderCallback<LongRef>, LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>>, ArrayEncodedNgramLanguageModel<W>, Serializable

Class for producing a Kneser-Ney language model in ARPA format from raw text. Confusingly, this class is both an LmReaderCallback (called from TextReader, which reads plain text) and an LmReader, which "reads" counts, produces Kneser-Ney probabilities and backoffs, and passes them on to an ArpaLmReaderCallback.

Author:
adampauls
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from interface edu.berkeley.nlp.lm.ArrayEncodedNgramLanguageModel
ArrayEncodedNgramLanguageModel.DefaultImplementations
 
Nested classes/interfaces inherited from interface edu.berkeley.nlp.lm.NgramLanguageModel
NgramLanguageModel.StaticMethods
 
Field Summary
protected static float DEFAULT_DISCOUNT
           
protected  int lmOrder
           
protected  HashNgramMap<KneserNeyCountValueContainer.KneserNeyCounts> ngrams
           
protected  ConfigOptions opts
           
protected static long serialVersionUID
           
protected  int startIndex
           
protected  WordIndexer<W> wordIndexer
          This array represents the discount used for each ngram order.
 
Constructor Summary
KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder)
           
KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder, ConfigOptions opts)
           
 
Method Summary
 void addNgram(int[] ngram, int startPos, int endPos, LongRef value, String words, boolean justLastWord, long[][] scratch)
           
 void call(int[] ngram, int startPos, int endPos, LongRef value, String words)
          Called for each n-gram
 void call(W[] ngram, LongRef value)
           
 void callJustLast(W[] ngram, LongRef value, long[][] scratch)
           
 void cleanup()
          Called once all reading is done.
static double[] defaultDiscounts()
           
static double[] defaultMinCounts()
           
protected  float getDiscountForOrder(int ngramOrder)
           
protected  float getHighestOrderProb(int[] ngram, int startPos, int endPos)
           
 int getLmOrder()
          Maximum size of n-grams stored by the model.
 float getLogProb(int[] ngram)
          Equivalent to getLogProb(ngram, 0, ngram.length)
 float getLogProb(int[] ngram, int startPos, int endPos)
          Calculate language model score of an n-gram.
 float getLogProb(List<W> ngram)
          Scores an n-gram.
protected  float getLowerOrderBackoff(int[] ngram, int startPos, int endPos)
           
protected  float getLowerOrderProb(int[] ngram, int startPos, int endPos)
           
 long getTotalSize()
           
 WordIndexer<W> getWordIndexer()
          Each LM must have a WordIndexer which assigns integer IDs to each word W in the language.
 void handleNgramOrderFinished(int order)
          Called when all n-grams of a given order are finished
 void handleNgramOrderStarted(int order)
          Called when n-grams of a given order are started
protected  float interpolateProb(int[] ngram, int startPos, int endPos)
           
 void parse(ArpaLmReaderCallback<ProbBackoffPair> callback)
           
 float scoreSentence(List<W> sentence)
          Scores a complete sentence, taking appropriate care with the start- and end-of-sentence symbols.
 void setOovWordLogProb(float logProb)
          Sets the (log) probability for an OOV word.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

serialVersionUID

protected static final long serialVersionUID
See Also:
Constant Field Values

DEFAULT_DISCOUNT

protected static final float DEFAULT_DISCOUNT
See Also:
Constant Field Values

lmOrder

protected final int lmOrder

wordIndexer

protected final WordIndexer<W> wordIndexer
This array represents the discount used for each n-gram order. The original Kneser-Ney discounting (-ukndiscount) uses one discounting constant for each n-gram order. These constants are estimated as D = n1 / (n1 + 2*n2), where n1 and n2 are the total numbers of n-grams with a count of exactly one and exactly two, respectively. For simplicity, our code uses a constant discount of 0.75 for every order by default. However, other discounts can be specified.
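The estimate above can be worked through in a few lines. The following is a minimal sketch, not code from this library: the class name KneserNeyDiscountSketch, both method names, and the fallback to 0.75 when n1 and n2 are both zero are assumptions made for illustration.

```java
// Illustrative sketch only: none of these names belong to berkeleylm.
final class KneserNeyDiscountSketch {

    /** The constant per-order discount mentioned in the description above. */
    static final double DEFAULT_DISCOUNT = 0.75;

    /**
     * Estimates D = n1 / (n1 + 2 * n2), where n1 and n2 are the numbers of
     * n-grams of one order that were seen exactly once and exactly twice.
     * Assumption: fall back to DEFAULT_DISCOUNT when both counts are zero.
     */
    static double estimateDiscount(long n1, long n2) {
        if (n1 < 0 || n2 < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        long denominator = n1 + 2 * n2;
        if (denominator == 0) {
            return DEFAULT_DISCOUNT;
        }
        return (double) n1 / denominator;
    }

    /** Subtracts the discount from a raw count, never going below zero. */
    static double discountedCount(long count, double discount) {
        return Math.max(count - discount, 0.0);
    }

    public static void main(String[] args) {
        // Hypothetical count-of-counts for a single n-gram order.
        double d = estimateDiscount(1000, 400);
        System.out.println("estimated D = " + d);
        System.out.println("count 3 discounted by the default D = "
                + discountedCount(3, DEFAULT_DISCOUNT));
    }
}
```

The estimate always lies between 0 and 1, so a discounted count is never larger than the raw count.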


ngrams

protected final HashNgramMap<KneserNeyCountValueContainer.KneserNeyCounts> ngrams

opts

protected final ConfigOptions opts

startIndex

protected final int startIndex
Constructor Detail

KneserNeyLmReaderCallback

public KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer,
                                 int maxOrder)
Parameters:
wordIndexer - the WordIndexer that assigns integer IDs to words
maxOrder - the maximum order of n-grams stored by the model
inputIsSentences - If true, input n-grams are assumed to be sentences, and all sub-ngrams of up to order maxOrder are added. If false, input n-grams are assumed to be atomic.

KneserNeyLmReaderCallback

public KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer,
                                 int maxOrder,
                                 ConfigOptions opts)
Method Detail

call

public void call(W[] ngram,
                 LongRef value)

callJustLast

public void callJustLast(W[] ngram,
                         LongRef value,
                         long[][] scratch)

call

public void call(int[] ngram,
                 int startPos,
                 int endPos,
                 LongRef value,
                 String words)
Description copied from interface: LmReaderCallback
Called for each n-gram

Specified by:
call in interface LmReaderCallback<LongRef>
Parameters:
ngram - The integer representation of the words as given by the provided WordIndexer
value - The value of the n-gram
words - The string representation of the n-gram (space separated)

addNgram

public void addNgram(int[] ngram,
                     int startPos,
                     int endPos,
                     LongRef value,
                     String words,
                     boolean justLastWord,
                     long[][] scratch)
Parameters:
ngram -
startPos -
endPos -
value -
words -

interpolateProb

protected float interpolateProb(int[] ngram,
                                int startPos,
                                int endPos)

getHighestOrderProb

protected float getHighestOrderProb(int[] ngram,
                                    int startPos,
                                    int endPos)

getLowerOrderProb

protected float getLowerOrderProb(int[] ngram,
                                  int startPos,
                                  int endPos)

getLowerOrderBackoff

protected float getLowerOrderBackoff(int[] ngram,
                                     int startPos,
                                     int endPos)

getDiscountForOrder

protected float getDiscountForOrder(int ngramOrder)

cleanup

public void cleanup()
Description copied from interface: LmReaderCallback
Called once all reading is done.

Specified by:
cleanup in interface LmReaderCallback<LongRef>

defaultDiscounts

public static double[] defaultDiscounts()

defaultMinCounts

public static double[] defaultMinCounts()

parse

public void parse(ArpaLmReaderCallback<ProbBackoffPair> callback)
Specified by:
parse in interface LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>>

getWordIndexer

public WordIndexer<W> getWordIndexer()
Description copied from interface: NgramLanguageModel
Each LM must have a WordIndexer which assigns integer IDs to each word W in the language.

Specified by:
getWordIndexer in interface NgramLanguageModel<W>
Returns:
the WordIndexer used by this language model

handleNgramOrderFinished

public void handleNgramOrderFinished(int order)
Description copied from interface: NgramOrderedLmReaderCallback
Called when all n-grams of a given order are finished

Specified by:
handleNgramOrderFinished in interface NgramOrderedLmReaderCallback<LongRef>

handleNgramOrderStarted

public void handleNgramOrderStarted(int order)
Description copied from interface: NgramOrderedLmReaderCallback
Called when n-grams of a given order are started

Specified by:
handleNgramOrderStarted in interface NgramOrderedLmReaderCallback<LongRef>

getLmOrder

public int getLmOrder()
Description copied from interface: NgramLanguageModel
Maximum size of n-grams stored by the model.

Specified by:
getLmOrder in interface NgramLanguageModel<W>
Returns:
the maximum order of n-grams stored by the model

scoreSentence

public float scoreSentence(List<W> sentence)
Description copied from interface: NgramLanguageModel
Scores a complete sentence, taking appropriate care with the start- and end-of-sentence symbols. This is a convenience method and will generally be inefficient.
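As a rough illustration of what "appropriate care with the start- and end-of-sentence symbols" can mean, the sketch below pads a sentence with boundary symbols and sums one n-gram score per predicted word. It does not reproduce this library's implementation: the class SentenceScoringSketch, the symbols "<s>" and "</s>", and the pluggable scoring function are all assumptions, and the toy scorer in main exists only to make the example runnable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Illustrative sketch only: none of these names belong to berkeleylm.
final class SentenceScoringSketch {

    static final String START = "<s>";  // assumed start-of-sentence symbol
    static final String END = "</s>";   // assumed end-of-sentence symbol

    /**
     * Pads the sentence with START and END, then sums the score of the
     * n-gram ending at every position after START, using at most
     * lmOrder - 1 words of left context.
     */
    static double scoreSentence(List<String> sentence, int lmOrder,
                                ToDoubleFunction<List<String>> ngramLogProb) {
        if (lmOrder < 1) {
            throw new IllegalArgumentException("lmOrder must be at least 1");
        }
        List<String> padded = new ArrayList<>(sentence.size() + 2);
        padded.add(START);
        padded.addAll(sentence);
        padded.add(END);

        double total = 0.0;
        // Position 0 holds START, which serves only as context and is never predicted.
        for (int i = 1; i < padded.size(); i++) {
            int from = Math.max(0, i + 1 - lmOrder);
            total += ngramLogProb.applyAsDouble(padded.subList(from, i + 1));
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy scorer (an assumption): longer n-grams get a lower score.
        ToDoubleFunction<List<String>> toy = ngram -> -1.0 * ngram.size();
        System.out.println(scoreSentence(Arrays.asList("the", "cat"), 3, toy));
    }
}
```

Each word, plus the end symbol, is scored exactly once, so a sentence of k words contributes k + 1 n-gram scores.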

Specified by:
scoreSentence in interface NgramLanguageModel<W>
Returns:
the language model score of the sentence

getLogProb

public float getLogProb(List<W> ngram)
Description copied from interface: NgramLanguageModel
Scores an n-gram. This is a convenience method and will generally be relatively inefficient. More efficient versions are available in ArrayEncodedNgramLanguageModel.getLogProb(int[], int, int) and ContextEncodedNgramLanguageModel.getLogProb(long, int, int, edu.berkeley.nlp.lm.ContextEncodedNgramLanguageModel.LmContextInfo).

Specified by:
getLogProb in interface NgramLanguageModel<W>

getLogProb

public float getLogProb(int[] ngram,
                        int startPos,
                        int endPos)
Description copied from interface: ArrayEncodedNgramLanguageModel
Calculate language model score of an n-gram. Warning: if you pass in an n-gram of length greater than getLmOrder(), this call will silently ignore the extra words of context. In other words, if you pass in a 5-gram (endPos-startPos == 5) to a 3-gram model, it will only score the words from startPos + 2 to endPos.
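The truncation described in this warning comes down to simple index arithmetic. The sketch below is illustrative only and is not taken from this library: the class NgramTruncationSketch and its methods are hypothetical, and endPos is treated as exclusive, as the expression endPos - startPos == 5 implies.

```java
import java.util.Arrays;

// Illustrative sketch only: none of these names belong to berkeleylm.
final class NgramTruncationSketch {

    /**
     * Returns the first position that is actually scored when the span
     * [startPos, endPos) may be longer than the model order.
     */
    static int effectiveStart(int startPos, int endPos, int lmOrder) {
        if (lmOrder < 1 || startPos < 0 || endPos < startPos) {
            throw new IllegalArgumentException("invalid span or order");
        }
        // Keep only the last lmOrder words; shorter spans are left alone.
        return Math.max(startPos, endPos - lmOrder);
    }

    /** Copies out the words that remain after the extra context is dropped. */
    static int[] scoredWords(int[] ngram, int startPos, int endPos, int lmOrder) {
        return Arrays.copyOfRange(ngram, effectiveStart(startPos, endPos, lmOrder), endPos);
    }

    public static void main(String[] args) {
        int[] fiveGram = {10, 11, 12, 13, 14}; // hypothetical word IDs
        // A 5-gram passed to a 3-gram model: only the last three words are scored.
        System.out.println(Arrays.toString(scoredWords(fiveGram, 0, 5, 3)));
    }
}
```

Callers that need the full context to matter should therefore check the span length against getLmOrder() before scoring.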

Specified by:
getLogProb in interface ArrayEncodedNgramLanguageModel<W>
Parameters:
ngram - array of words in integer representation
startPos - start of the portion of the array to be read
endPos - end of the portion of the array to be read.
Returns:
the language model score (log probability) of the n-gram

getLogProb

public float getLogProb(int[] ngram)
Description copied from interface: ArrayEncodedNgramLanguageModel
Equivalent to getLogProb(ngram, 0, ngram.length)

Specified by:
getLogProb in interface ArrayEncodedNgramLanguageModel<W>
See Also:
ArrayEncodedNgramLanguageModel.getLogProb(int[], int, int)

getTotalSize

public long getTotalSize()

setOovWordLogProb

public void setOovWordLogProb(float logProb)
Description copied from interface: NgramLanguageModel
Sets the (log) probability for an OOV word. Note that this is in general different from the log probability of the unk tag.

Specified by:
setOovWordLogProb in interface NgramLanguageModel<W>