Using wordpiece

This package applies WordPiece tokenization to input text, given an appropriate WordPiece vocabulary. The BERT tokenization conventions are used. The basic tokenization algorithm is:

  • Put spaces around punctuation.
  • For each resulting word, if the word is found in the WordPiece vocabulary, keep it as-is. If not, starting from the beginning, split off the longest piece that is in the vocabulary, then prefix “##” to what remains and look for the longest matching piece again. Repeat until the entire word is represented by pieces from the vocabulary, if possible.
  • If the word can’t be represented by vocabulary pieces, or if it exceeds a certain length, replace it with a specified “unknown” token.

Ideally, a WordPiece vocabulary will be complete enough to represent any word, but this is not required.
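
The steps above can be sketched in a few lines of base R. This is only an illustration of the greedy longest-match idea, not the package's implementation: the helpers split_words and greedy_wordpiece, the max_chars cutoff of 100, and the toy vocabulary are all assumptions made for this example.

```r
# Hypothetical helper: put spaces around punctuation, then split on whitespace.
split_words <- function(text) {
  spaced <- gsub("([[:punct:]])", " \\1 ", text)
  strsplit(trimws(spaced), "[[:space:]]+")[[1]]
}

# Hypothetical helper: greedy longest-match-first split of a single word.
# `tokens` is a character vector of vocabulary tokens.
greedy_wordpiece <- function(word, tokens, unk_token = "[UNK]",
                             max_chars = 100) {
  n <- nchar(word)
  # Words over the (assumed) length limit are replaced outright.
  if (n > max_chars) return(unk_token)
  pieces <- character(0)
  start <- 1
  while (start <= n) {
    end <- n
    found <- NA_character_
    # Try the longest remaining substring first, then shrink from the right.
    while (end >= start) {
      candidate <- substr(word, start, end)
      # Pieces that do not begin the word carry the "##" prefix.
      if (start > 1) candidate <- paste0("##", candidate)
      if (candidate %in% tokens) {
        found <- candidate
        break
      }
      end <- end - 1
    }
    # No piece matched: the whole word becomes the unknown token.
    if (is.na(found)) return(unk_token)
    pieces <- c(pieces, found)
    start <- end + 1
  }
  pieces
}

demo_tokens <- c("[UNK]", "app", "##les", "t", "##e", "##a", "tacos")

split_words("I like tacos!")             # "I" "like" "tacos" "!"
greedy_wordpiece("apples", demo_tokens)  # "app" "##les"
greedy_wordpiece("tea", demo_tokens)     # "t" "##e" "##a"
greedy_wordpiece("salsa", demo_tokens)   # "[UNK]"
```

Note that “salsa” becomes the unknown token here only because the toy vocabulary has no piece starting with “s”; a fuller vocabulary would split it into pieces instead.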

Provided Vocabularies

Two vocabularies are provided via the {wordpiece.data} package. These are the WordPiece vocabularies used in Google Research’s BERT models (and most models based on BERT).

library(wordpiece)

# The default vocabulary is uncased.
wordpiece_tokenize(
  "I like tacos!"
)
#> [[1]]
#>     i  like    ta ##cos     ! 
#>  1045  2066 11937 13186   999

# A cased vocabulary is also provided.
wordpiece_tokenize(
  "I like tacos!",
  vocab = wordpiece_vocab(cased = TRUE)
)
#> [[1]]
#>     I  like    ta ##cos     ! 
#>   146  1176 27629 13538   106

Loading a Vocabulary

For the rest of this vignette, we use a tiny vocabulary for illustrative purposes. You should not use this vocabulary for actual tokenization.

The vocabulary is represented by the package as a character vector of tokens, with a logical attribute is_cased to indicate whether the vocabulary is case sensitive. The index of a token is its position in the vector, counting from zero. These integer indices would be the input to a BERT model, for example.

A vocabulary can be read from a text file containing a single token per line. The token index is taken to be the line number, starting from zero. These conventions are adopted for compatibility with the vocabulary and file format used in the pretrained BERT checkpoints released by Google Research. The casedness of the vocabulary is inferred from the content of the vocabulary.

# Get path to sample vocabulary included with package.
vocab_path <- system.file("extdata", "tiny_vocab.txt", package = "wordpiece")

# Load the vocabulary.
vocab <- load_vocab(vocab_path)

# Take a peek at the vocabulary.
head(vocab)
#> [1] "[PAD]" "[CLS]" "[SEP]" "!"     "."     ","
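
The file convention described above (one token per line, indices starting from zero) can be illustrated with base R alone. This is a sketch of the convention, not of how load_vocab works internally; the names toy_path, file_tokens, and file_index and the five-token file are made up for the example.

```r
# Write a tiny hypothetical vocabulary file: one token per line.
toy_path <- tempfile(fileext = ".txt")
writeLines(c("[PAD]", "[UNK]", "i", "love", "tacos"), toy_path)

# Read the tokens back. Each token's index is its line number, from zero.
file_tokens <- readLines(toy_path)
file_index <- setNames(seq_along(file_tokens) - 1L, file_tokens)

file_index[["[PAD]"]]  # 0
file_index[["tacos"]]  # 4
```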

When a text vocabulary is loaded with load_or_retrieve_vocabulary in an interactive R session, the option is given to cache the vocabulary as an RDS file for faster future loading.

Tokenizing Text

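Tokenize text by calling wordpiece_tokenize on the text, passing the vocabulary as the vocab parameter. The output of wordpiece_tokenize is a list with one element per input string. Each element is a named integer vector: the names are the tokens, and the values are the token indices.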

# Now tokenize some text!
wordpiece_tokenize(text = "I love tacos, apples, and tea!", vocab = vocab)
#> [[1]]
#>     i  love tacos     ,   app ##les     ,   and     t   ##e   ##a     ! 
#>     6     7     8     5    10    11     5     9    30    41    37     3

Vocabulary Case

The above vocabulary contained no tokens starting with an uppercase letter, so it was assumed to be uncased. When tokenizing text with an uncased vocabulary, the input is converted to lowercase before any other processing is applied. If the vocabulary contains at least one capitalized token, it will be taken as case-sensitive, and the case of the input text is preserved. Note that in a cased vocabulary, capitalized and uncapitalized versions of the same word are different tokens, and must both be included in the vocabulary to be recognized.

# The above vocabulary was uncased.
attr(vocab, "is_cased")
#> [1] FALSE

# Here is the same vocabulary, but containing the capitalized token "Hi".
vocab_path2 <- system.file("extdata", "tiny_vocab_cased.txt", 
                           package = "wordpiece")
vocab_cased <- load_vocab(vocab_path2)
head(vocab_cased)
#> [1] "[PAD]" "[CLS]" "[SEP]" "!"     "."     ","

# vocab_cased is inferred to be case-sensitive...
attr(vocab_cased, "is_cased")
#> [1] TRUE

# ... so the tokenization will *not* convert strings to lowercase, and so the
# words "I" and "And" are not found in the vocabulary (though "and" still is).
wordpiece_tokenize(text = "And I love tacos and salsa!", vocab = vocab_cased)
#> [[1]]
#> [UNK] [UNK]  love tacos   and     s   ##a   ##l   ##s   ##a     ! 
#>    64    64     8     9    10    30    38    49    56    38     3
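
The case rule described in this section can be approximated in base R as follows. This is a sketch under the assumption that “capitalized” means “starts with an uppercase letter”; looks_cased and prepare_text are hypothetical helpers, not functions from the package.

```r
# Assumed rule: a vocabulary is treated as cased if any token starts with
# an uppercase letter. Special tokens such as "[PAD]" start with "[".
looks_cased <- function(tokens) {
  any(grepl("^[[:upper:]]", tokens))
}

looks_cased(c("[PAD]", "[UNK]", "and", "love"))        # FALSE
looks_cased(c("[PAD]", "[UNK]", "Hi", "and", "love"))  # TRUE

# With an uncased vocabulary, the input is lowercased before tokenizing;
# with a cased vocabulary, the input is left unchanged.
prepare_text <- function(text, tokens) {
  if (looks_cased(tokens)) text else tolower(text)
}

prepare_text("And I love tacos", c("and", "love"))  # "and i love tacos"
prepare_text("And I love tacos", c("Hi", "and"))    # "And I love tacos"
```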

Representing “Unknown” Tokens

Note that the default value for the unk_token argument, “[UNK]”, is present in the above vocabularies, so it had an integer index in the tokenization. If that token were not in the vocabulary, its index would be coded as NA.

wordpiece_tokenize(text = "I love tacos!", 
                   vocab = vocab_cased, 
                   unk_token = "[missing]")
#> [[1]]
#> [missing]      love     tacos         ! 
#>        NA         8         9         3
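
The NA index above is what a plain lookup produces when a token is absent from the vocabulary. The sketch below shows this with base R's match; index_of and demo_vocab are hypothetical names for this example, and the package may perform the lookup differently.

```r
demo_vocab <- c("[PAD]", "[UNK]", "love", "tacos", "!")

# Zero-based index of a token, or NA if the token is not in the vocabulary.
index_of <- function(token, tokens) match(token, tokens) - 1L

index_of("[UNK]", demo_vocab)      # 1
index_of("[missing]", demo_vocab)  # NA
```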

The package defaults are set to be compatible with BERT tokenization. If you have a different use case, be sure to check all parameter values.