Title: R Implementation of Wordpiece Tokenization
Description: Apply 'Wordpiece' (<arXiv:1609.08144>) tokenization to input text, given an appropriate vocabulary. The 'BERT' (<arXiv:1810.04805>) tokenization conventions are used by default.
Authors: Jonathan Bratt [aut, cre], Jon Harmon [aut], Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [cph]
Maintainer: Jonathan Bratt <[email protected]>
License: Apache License (>= 2)
Version: 2.1.3
Built: 2024-10-27 03:49:29 UTC
Source: https://github.com/macmillancontentscience/wordpiece
Load a vocabulary file, or retrieve from cache
load_or_retrieve_vocab(vocab_file)
vocab_file: Path to vocabulary file. File is assumed to be a text file, with one token per line, with the line number corresponding to the index of that token in the vocabulary.
The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.
Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing (the order of the tokens), it would break any pre-trained models.
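For example, a minimal sketch using the sample vocabulary file shipped with the package (the same file used in the load_vocab example below):

# Get path to sample vocabulary included with package.
vocab_path <- system.file("extdata", "tiny_vocab.txt", package = "wordpiece")

# Load the vocabulary, caching it so later calls can retrieve it quickly.
vocab <- load_or_retrieve_vocab(vocab_file = vocab_path)
attr(vocab, "is_cased")  # casedness inferred from the tokens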
Load a vocabulary file
load_vocab(vocab_file)
vocab_file: Path to vocabulary file. File is assumed to be a text file, with one token per line, with the line number corresponding to the index of that token in the vocabulary.
The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.
Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing (the order of the tokens), it would break any pre-trained models.
# Get path to sample vocabulary included with package.
vocab_path <- system.file("extdata", "tiny_vocab.txt", package = "wordpiece")
vocab <- load_vocab(vocab_file = vocab_path)
We use a special named integer vector with class wordpiece_vocabulary to provide information about tokens used in wordpiece_tokenize. This function takes a character vector of tokens and puts it into that format.
prepare_vocab(token_list)
token_list: A character vector of tokens.
The vocab as a named integer vector with class wordpiece_vocabulary, with tokens as names. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.
Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing (the order of the tokens), it would break any pre-trained models.
my_vocab <- prepare_vocab(c("some", "example", "tokens"))
class(my_vocab)
attr(my_vocab, "is_cased")
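Assuming the wordpiece_vocabulary format described above (a named integer vector whose names are the tokens and whose values are zero-based indices), a token's index can be looked up by name; a small sketch:

# Look up zero-based token indices by name (assumes the format above).
my_vocab <- prepare_vocab(c("some", "example", "tokens"))
my_vocab[["some"]]    # expected: 0, the first position
my_vocab[["tokens"]]  # expected: 2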
Use this function to override the cache path used by wordpiece for the current session. Set the WORDPIECE_CACHE_DIR environment variable for a more permanent change.
set_wordpiece_cache_dir(cache_dir = NULL)
cache_dir: Character scalar; a path to a cache directory.
A normalized path to a cache directory. The directory is created if the user has write access and the directory does not exist.
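A minimal sketch of a session-level override (the directory path here is a hypothetical example, not a package default):

# Point the wordpiece cache at a custom directory for this session.
# "~/wordpiece-cache" is a hypothetical path; any writable directory works.
set_wordpiece_cache_dir("~/wordpiece-cache")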
The wordpiece cache directory is a platform- and user-specific path where wordpiece saves caches (such as a downloaded vocabulary). You can override the default location in a few ways:
Option: wordpiece.dir. Use set_wordpiece_cache_dir to set a specific cache directory for this session.
Environment: WORDPIECE_CACHE_DIR. Set this environment variable to specify a wordpiece cache directory for all sessions.
Environment: R_USER_CACHE_DIR. Set this environment variable to specify a cache directory root for all packages that use the caching system.
wordpiece_cache_dir()
A character vector with the normalized path to the cache.
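The overrides described above can be exercised as in this sketch (the directory paths are hypothetical examples):

# Session-level override via the wordpiece.dir option.
options(wordpiece.dir = "~/wordpiece-cache")

# Cross-session override via the environment variable.
Sys.setenv(WORDPIECE_CACHE_DIR = "~/wordpiece-cache")

# Return the normalized path to the cache directory currently in effect.
wordpiece_cache_dir()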
Given a sequence of text and a wordpiece vocabulary, tokenizes the text.
wordpiece_tokenize(
  text,
  vocab = wordpiece_vocab(),
  unk_token = "[UNK]",
  max_chars = 100
)
text: Character; text to tokenize.
vocab: Character vector of vocabulary tokens. The tokens are assumed to be in order of index, with the first index taken as zero to be compatible with Python implementations.
unk_token: Token to represent unknown words.
max_chars: Maximum length of word recognized.
A list of named integer vectors, giving the tokenization of the input sequences. The integer values are the token ids, and the names are the tokens.
tokens <- wordpiece_tokenize(
  text = c(
    "I love tacos!",
    "I also kinda like apples."
  )
)
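Since the return value is a list of named integer vectors, each element exposes both the token strings and their ids; a short sketch of inspecting the result from the example above:

# Inspect the tokenization of the first input string.
names(tokens[[1]])   # the wordpiece tokens as character strings
unname(tokens[[1]])  # the corresponding integer token ids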