java.lang.Object
  edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback<W>
Type Parameters:
W - the word type
public class KneserNeyLmReaderCallback<W>
Class for producing a Kneser-Ney language model in ARPA format from raw text. Confusingly, this class is both an LmReaderCallback (called from TextReader, which reads plain text) and an LmReader, which "reads" counts, produces Kneser-Ney probabilities and backoffs, and passes them on to an ArpaLmReaderCallback.
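The dual role described above can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: the interfaces CountCallback, ProbCallback and ProbReader are simplified stand-ins invented for illustration, not the library's LmReaderCallback, ArpaLmReaderCallback or LmReader, and the probabilities are plain relative frequencies rather than Kneser-Ney estimates. It only shows the shape of the pattern: one object first consumes counts as a callback, then acts as a reader that pushes derived values into a downstream callback.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DualRoleSketch {

    /** Hypothetical stand-in for a callback that receives raw counts. */
    interface CountCallback {
        void call(String word, long count);
    }

    /** Hypothetical stand-in for a callback that receives finished log probabilities. */
    interface ProbCallback {
        void call(String word, double logProb);
    }

    /** Hypothetical stand-in for a reader that pushes values into a callback. */
    interface ProbReader {
        void parse(ProbCallback callback);
    }

    /** Collects counts (callback role), then emits log10 probabilities (reader role). */
    static class CountsToProbs implements CountCallback, ProbReader {
        private final Map<String, Long> counts = new LinkedHashMap<>();
        private long total = 0;

        @Override
        public void call(String word, long count) {
            counts.merge(word, count, Long::sum);
            total += count;
        }

        @Override
        public void parse(ProbCallback callback) {
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                // Assumption: plain relative frequency, NOT Kneser-Ney smoothing.
                callback.call(e.getKey(), Math.log10((double) e.getValue() / total));
            }
        }
    }

    /** Runs both phases over a tiny corpus and returns the emitted log10 probabilities. */
    static Map<String, Double> run(String[] words) {
        CountsToProbs bridge = new CountsToProbs();
        for (String w : words) {
            bridge.call(w, 1L);              // phase 1: act as a callback
        }
        Map<String, Double> out = new LinkedHashMap<>();
        bridge.parse((w, lp) -> out.put(w, lp)); // phase 2: act as a reader
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(new String[] {"a", "b", "a", "a"}));
    }
}
```

In this sketch the second phase can only start once all counts have been seen, which is presumably why the real class separates the counting callbacks from parse.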
Nested Class Summary |
---|
Nested classes/interfaces inherited from interface edu.berkeley.nlp.lm.ArrayEncodedNgramLanguageModel |
---|
ArrayEncodedNgramLanguageModel.DefaultImplementations |
Nested classes/interfaces inherited from interface edu.berkeley.nlp.lm.NgramLanguageModel |
---|
NgramLanguageModel.StaticMethods |
Field Summary | |
---|---|
protected static float | DEFAULT_DISCOUNT |
protected int | lmOrder |
protected HashNgramMap<KneserNeyCountValueContainer.KneserNeyCounts> | ngrams |
protected ConfigOptions | opts |
protected static long | serialVersionUID |
protected int | startIndex |
protected WordIndexer<W> | wordIndexer
This array represents the discount used for each ngram order. |
Constructor Summary | |
---|---|
KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder) |
KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder, ConfigOptions opts) |
Method Summary | |
---|---|
void | addNgram(int[] ngram, int startPos, int endPos, LongRef value, String words, boolean justLastWord, long[][] scratch) |
void | call(int[] ngram, int startPos, int endPos, LongRef value, String words)
Called for each n-gram |
void | call(W[] ngram, LongRef value) |
void | callJustLast(W[] ngram, LongRef value, long[][] scratch) |
void | cleanup()
Called once all reading is done. |
static double[] | defaultDiscounts() |
static double[] | defaultMinCounts() |
protected float | getDiscountForOrder(int ngramOrder) |
protected float | getHighestOrderProb(int[] ngram, int startPos, int endPos) |
int | getLmOrder()
Maximum size of n-grams stored by the model. |
float | getLogProb(int[] ngram)
Equivalent to getLogProb(ngram, 0, ngram.length) |
float | getLogProb(int[] ngram, int startPos, int endPos)
Calculate language model score of an n-gram. |
float | getLogProb(List<W> ngram)
Scores an n-gram. |
protected float | getLowerOrderBackoff(int[] ngram, int startPos, int endPos) |
protected float | getLowerOrderProb(int[] ngram, int startPos, int endPos) |
long | getTotalSize() |
WordIndexer<W> | getWordIndexer()
Each LM must have a WordIndexer which assigns integer IDs to each word W in the language. |
void | handleNgramOrderFinished(int order)
Called when all n-grams of a given order are finished |
void | handleNgramOrderStarted(int order)
Called when n-grams of a given order are started |
protected float | interpolateProb(int[] ngram, int startPos, int endPos) |
void | parse(ArpaLmReaderCallback<ProbBackoffPair> callback) |
float | scoreSentence(List<W> sentence)
Scores a complete sentence, taking appropriate care with the start- and end-of-sentence symbols. |
void | setOovWordLogProb(float logProb)
Sets the (log) probability for an OOV word. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected static final long serialVersionUID
protected static final float DEFAULT_DISCOUNT
protected final int lmOrder
protected final WordIndexer<W> wordIndexer
protected final HashNgramMap<KneserNeyCountValueContainer.KneserNeyCounts> ngrams
protected final ConfigOptions opts
protected final int startIndex
Constructor Detail |
---|
public KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder)
Parameters:
wordIndexer -
maxOrder -
inputIsSentences - If true, input n-grams are assumed to be sentences, and all sub-ngrams of up to order maxOrder are added. If false, input n-grams are assumed to be atomic.
public KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder, ConfigOptions opts)
Method Detail |
---|
public void call(W[] ngram, LongRef value)
public void callJustLast(W[] ngram, LongRef value, long[][] scratch)
public void call(int[] ngram, int startPos, int endPos, LongRef value, String words)
Description copied from interface: LmReaderCallback
Called for each n-gram
Specified by: call in interface LmReaderCallback<LongRef>
Parameters:
ngram - The integer representation of the words as given by the provided WordIndexer
value - The value of the n-gram
words - The string representation of the n-gram (space separated)
public void addNgram(int[] ngram, int startPos, int endPos, LongRef value, String words, boolean justLastWord, long[][] scratch)
Parameters:
ngram -
startPos -
endPos -
value -
words -
protected float interpolateProb(int[] ngram, int startPos, int endPos)
protected float getHighestOrderProb(int[] ngram, int startPos, int endPos)
protected float getLowerOrderProb(int[] ngram, int startPos, int endPos)
protected float getLowerOrderBackoff(int[] ngram, int startPos, int endPos)
protected float getDiscountForOrder(int ngramOrder)
public void cleanup()
Description copied from interface: LmReaderCallback
Called once all reading is done.
Specified by: cleanup in interface LmReaderCallback<LongRef>
public static double[] defaultDiscounts()
public static double[] defaultMinCounts()
public void parse(ArpaLmReaderCallback<ProbBackoffPair> callback)
Specified by: parse in interface LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>>
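public WordIndexer<W> getWordIndexer()
Description copied from interface: NgramLanguageModel
Each LM must have a WordIndexer which assigns integer IDs to each word W in the language.
Specified by: getWordIndexer in interface NgramLanguageModel<W>
public void handleNgramOrderFinished(int order)
Description copied from interface: NgramOrderedLmReaderCallback
Called when all n-grams of a given order are finished
Specified by: handleNgramOrderFinished in interface NgramOrderedLmReaderCallback<LongRef>
public void handleNgramOrderStarted(int order)
Description copied from interface: NgramOrderedLmReaderCallback
Called when n-grams of a given order are started
Specified by: handleNgramOrderStarted in interface NgramOrderedLmReaderCallback<LongRef>
public int getLmOrder()
Description copied from interface: NgramLanguageModel
Maximum size of n-grams stored by the model.
Specified by: getLmOrder in interface NgramLanguageModel<W>
public float scoreSentence(List<W> sentence)
Description copied from interface: NgramLanguageModel
Scores a complete sentence, taking appropriate care with the start- and end-of-sentence symbols.
Specified by: scoreSentence in interface NgramLanguageModel<W>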
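One common way to handle the start- and end-of-sentence symbols when scoring a sentence is sketched below. This is an illustration under stated assumptions, not this class's implementation: the symbols "<s>" and "</s>", the helper names windows and scoreSentence, and the pluggable n-gram scorer are all hypothetical. The sentence is padded with the two boundary symbols, and every word after the start symbol is scored with up to lmOrder - 1 words of preceding context.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;

class SentenceScoringSketch {

    /** Lists the n-gram windows that get scored for a boundary-padded sentence. */
    static List<List<String>> windows(List<String> sentence, int lmOrder) {
        List<String> padded = new ArrayList<>();
        padded.add("<s>");       // assumed start-of-sentence symbol
        padded.addAll(sentence);
        padded.add("</s>");      // assumed end-of-sentence symbol

        List<List<String>> result = new ArrayList<>();
        // 'end' is exclusive and starts at 2, so the first window predicts the
        // first real word; the start symbol serves only as context.
        for (int end = 2; end <= padded.size(); end++) {
            int start = Math.max(0, end - lmOrder);
            result.add(new ArrayList<>(padded.subList(start, end)));
        }
        return result;
    }

    /** Sums the log probabilities of all windows using a caller-supplied scorer. */
    static double scoreSentence(List<String> sentence, int lmOrder,
                                ToDoubleFunction<List<String>> ngramLogProb) {
        double total = 0.0;
        for (List<String> window : windows(sentence, lmOrder)) {
            total += ngramLogProb.applyAsDouble(window);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(windows(Arrays.asList("a", "b"), 2));
    }
}
```

Separating the window enumeration from the scorer keeps the boundary handling easy to inspect on its own.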
public float getLogProb(List<W> ngram)
Description copied from interface: NgramLanguageModel
Scores an n-gram. See also ArrayEncodedNgramLanguageModel.getLogProb(int[], int, int) and ContextEncodedNgramLanguageModel.getLogProb(long, int, int, edu.berkeley.nlp.lm.ContextEncodedNgramLanguageModel.LmContextInfo).
Specified by: getLogProb in interface NgramLanguageModel<W>
public float getLogProb(int[] ngram, int startPos, int endPos)
Description copied from interface: ArrayEncodedNgramLanguageModel
Calculate language model score of an n-gram. If the n-gram is longer than getLmOrder(), this call will silently ignore the extra words of context. In other words, if you pass in a 5-gram (endPos-startPos == 5) to a 3-gram model, it will only score the words from startPos + 2 to endPos.
Specified by: getLogProb in interface ArrayEncodedNgramLanguageModel<W>
Parameters:
ngram - array of words in integer representation
startPos - start of the portion of the array to be read
endPos - end of the portion of the array to be read.
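The truncation rule for over-long n-grams can be made concrete with a small sketch. It is not taken from the library: the helper names effectiveStart and scoredWords are hypothetical, and endPos is assumed to be exclusive, which is what endPos-startPos == 5 for a 5-gram suggests. The sketch only computes which words remain after the extra context is dropped; it does no scoring.

```java
import java.util.Arrays;

class ContextTruncationSketch {

    /**
     * Returns the first index that a model of order lmOrder would still look at,
     * given the half-open range [startPos, endPos).
     */
    static int effectiveStart(int startPos, int endPos, int lmOrder) {
        int length = endPos - startPos;
        // Only the last lmOrder words are kept when the range is too long.
        return length > lmOrder ? endPos - lmOrder : startPos;
    }

    /** Copies the words that remain after the extra context is dropped. */
    static int[] scoredWords(int[] ngram, int startPos, int endPos, int lmOrder) {
        return Arrays.copyOfRange(ngram, effectiveStart(startPos, endPos, lmOrder), endPos);
    }

    public static void main(String[] args) {
        int[] fiveGram = {10, 11, 12, 13, 14};
        // A 5-gram passed to a 3-gram model: words from startPos + 2 to endPos remain.
        System.out.println(Arrays.toString(scoredWords(fiveGram, 0, 5, 3))); // prints [12, 13, 14]
    }
}
```

Callers that need the full context to matter may want to check endPos - startPos against getLmOrder() before calling, since the truncation is silent.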
public float getLogProb(int[] ngram)
Description copied from interface: ArrayEncodedNgramLanguageModel
Equivalent to getLogProb(ngram, 0, ngram.length)
Specified by: getLogProb in interface ArrayEncodedNgramLanguageModel<W>
See Also: ArrayEncodedNgramLanguageModel.getLogProb(int[], int, int)
public long getTotalSize()
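public void setOovWordLogProb(float logProb)
Description copied from interface: NgramLanguageModel
Sets the (log) probability for an OOV word. Note that this is in general different from the log prob of the unk tag probability.
Specified by: setOovWordLogProb in interface NgramLanguageModel<W>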