Title: Data for Wordpiece-Style Tokenization
Description: Provides data to be used by the wordpiece algorithm in order to tokenize text into somewhat meaningful chunks. Included vocabularies were retrieved from <https://huggingface.co/bert-base-cased/resolve/main/vocab.txt> and <https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt> and parsed into an R-friendly format.
Authors: Jonathan Bratt [aut], Jon Harmon [aut, cre], Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [cph], Google, Inc [cph] (original BERT vocabularies)
Maintainer: Jon Harmon <[email protected]>
License: Apache License (>= 2)
Version: 2.0.0
Built: 2024-10-29 06:02:16 UTC
Source: https://github.com/macmillancontentscience/wordpiece.data
A wordpiece vocabulary is a named integer vector with class "wordpiece_vocabulary". The names of the vector are the tokens, and the values are the integer identifiers of those tokens. The vocabulary is 0-indexed for compatibility with Python implementations.
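As a quick illustration of the structure described above, here is a minimal sketch (assuming the package is installed) that inspects the first few tokens and their identifiers:

library(wordpiece.data)

# Load the default (uncased) vocabulary.
vocab <- wordpiece_vocab()

class(vocab)             # includes "wordpiece_vocabulary"
head(names(vocab))       # the tokens (names of the vector)
head(as.integer(vocab))  # their integer identifiers, starting at 0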
Usage:
wordpiece_vocab(cased = FALSE)
Arguments:
cased: Logical; if TRUE, load the cased vocabulary; if FALSE (the default), load the uncased vocabulary.
Value:
A wordpiece_vocabulary.
Examples:
head(wordpiece_vocab())
head(wordpiece_vocab(cased = TRUE))
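The vocabulary is meant to be consumed by a wordpiece tokenizer rather than inspected directly. The sketch below assumes the companion wordpiece package (by the same authors), whose wordpiece_tokenize() function accepts a vocab argument; adjust the call if your tokenizer's interface differs.

library(wordpiece.data)
library(wordpiece)  # companion tokenizer package; assumed installed, not part of wordpiece.data

# Tokenize a short string against the uncased vocabulary provided by this package.
wordpiece_tokenize(
  "Tokenization is useful.",
  vocab = wordpiece_vocab(cased = FALSE)
)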