Berkeley LM Binaries
These binary files can be loaded by the Berkeley LM toolkit. They contain all n-gram counts for the Web 1T corpora provided by Google for English, Chinese, and 10 EU languages. Due to licensing restrictions, the binaries do not contain the vocabularies, so the corpora cannot be reproduced from them unless you have independent access to the original data. For every corpus except Chinese, the vocabulary is contained in a file called vocab_cs.gz. For Chinese, you must build the file manually with the following command:
zcat ngrams-00000-of-00394.gz | sort -rgk2 | gzip > vocab_cs.gz
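If the shell tools are not available, the same step can be approximated in Java. This is only a sketch, assuming each line of the input holds a token and its count separated by whitespace (as the sort key -k2 suggests). The class name VocabSorterSketch is hypothetical and not part of Berkeley LM. Entries with equal counts keep their input order here, which may differ from how sort orders ties.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Berkeley LM.
class VocabSorterSketch {

    // Reads the count from the second whitespace-separated field (assumed format).
    static long countOf(String line) {
        String[] fields = line.trim().split("\\s+");
        return Long.parseLong(fields[1]);
    }

    // Returns a copy of the lines ordered by count, largest first.
    // List.sort is stable, so lines with equal counts keep their input order.
    static List<String> sortByCountDescending(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort((a, b) -> Long.compare(countOf(b), countOf(a)));
        return sorted;
    }

    public static void main(String[] args) throws IOException {
        String in = args.length > 0 ? args[0] : "ngrams-00000-of-00394.gz";
        String out = args.length > 1 ? args[1] : "vocab_cs.gz";

        // Load the whole unigram file into memory, skipping blank lines.
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(Paths.get(in))),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    lines.add(line);
                }
            }
        }

        // Write the sorted lines back out as a gzip-compressed file.
        try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(Paths.get(out))),
                StandardCharsets.UTF_8))) {
            for (String line : sortByCountDescending(lines)) {
                writer.write(line);
                writer.write('\n');
            }
        }
    }
}
```

The sketch loads every line into memory before sorting, so the shell pipeline remains the simpler choice when it is available.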
These files can be loaded programmatically by calling the method
edu.berkeley.nlp.lm.io.LmReaders.readGoogleLmBinary
or
edu.berkeley.nlp.lm.io.LmReaders.readNgramMapFromBinary
The former reads an n-gram language model estimated using stupid backoff, and the latter gives access to a data structure that implements Java's Map interface, allowing queries of raw counts for n-grams.
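To illustrate how the two views relate, here is a minimal sketch of stupid backoff scoring computed from a plain java.util.Map of raw n-gram counts. It does not use any Berkeley LM classes and is not the toolkit's implementation. The class name StupidBackoffSketch and the toy counts are hypothetical, and the backoff factor 0.4 is an assumption taken from the value suggested by Brants et al. (2007).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not part of Berkeley LM.
class StupidBackoffSketch {

    // Backoff factor (assumed; 0.4 is the value suggested by Brants et al., 2007).
    static final double ALPHA = 0.4;

    private final Map<List<String>, Long> counts;
    private final long totalUnigrams;

    StupidBackoffSketch(Map<List<String>, Long> counts) {
        this.counts = counts;
        long total = 0;
        // The unigram denominator is the sum of all unigram counts.
        for (Map.Entry<List<String>, Long> entry : counts.entrySet()) {
            if (entry.getKey().size() == 1) {
                total += entry.getValue();
            }
        }
        this.totalUnigrams = total;
    }

    // Scores the last word of the n-gram given the words before it.
    // The result is a relative score, not a normalized probability.
    double score(List<String> ngram) {
        if (ngram.size() == 1) {
            return (double) counts.getOrDefault(ngram, 0L) / totalUnigrams;
        }
        List<String> context = ngram.subList(0, ngram.size() - 1);
        long ngramCount = counts.getOrDefault(ngram, 0L);
        long contextCount = counts.getOrDefault(context, 0L);
        if (ngramCount > 0 && contextCount > 0) {
            // Seen n-gram: plain relative frequency, no discounting.
            return (double) ngramCount / contextCount;
        }
        // Unseen n-gram: drop the first word and scale by the backoff factor.
        return ALPHA * score(ngram.subList(1, ngram.size()));
    }

    public static void main(String[] args) {
        Map<List<String>, Long> counts = new HashMap<>();
        counts.put(Arrays.asList("the"), 3L);
        counts.put(Arrays.asList("cat"), 1L);
        counts.put(Arrays.asList("the", "cat"), 1L);

        StupidBackoffSketch lm = new StupidBackoffSketch(counts);
        System.out.println(lm.score(Arrays.asList("the", "cat"))); // seen bigram
        System.out.println(lm.score(Arrays.asList("cat", "the"))); // backs off to unigram
    }
}
```

Because the scores are unnormalized relative frequencies, they are only meaningful for comparing candidates against each other, not as probabilities.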
The files can be downloaded here:
English
Chinese
Czech
Dutch
French
German
Italian
Polish
Portuguese
Romanian
Spanish
Swedish