Package 'wordpiece.data'

Title: Data for Wordpiece-Style Tokenization
Description: Provides data to be used by the wordpiece algorithm in order to tokenize text into somewhat meaningful chunks. Included vocabularies were retrieved from <https://huggingface.co/bert-base-cased/resolve/main/vocab.txt> and <https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt> and parsed into an R-friendly format.
Authors: Jonathan Bratt [aut], Jon Harmon [aut, cre], Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [cph], Google, Inc [cph] (original BERT vocabularies)
Maintainer: Jon Harmon <[email protected]>
License: Apache License (>= 2)
Version: 2.0.0
Built: 2024-10-29 06:02:16 UTC
Source: https://github.com/macmillancontentscience/wordpiece.data

Help Index


Load a wordpiece Vocabulary

Description

A wordpiece vocabulary is a named integer vector with class "wordpiece_vocabulary". The names of the vector are the tokens, and the values are the integer identifiers of those tokens. The identifiers are 0-indexed (the first token has identifier 0, not 1) for compatibility with Python implementations.

Usage

wordpiece_vocab(cased = FALSE)

Arguments

cased

Logical; if TRUE, load the cased vocabulary; if FALSE (the default), load the uncased vocabulary.

Value

A wordpiece_vocabulary: a named integer vector mapping each token to its 0-indexed integer identifier.

Examples

head(wordpiece_vocab())
head(wordpiece_vocab(cased = TRUE))
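
Because the identifiers start at 0 while R vector positions start at 1, an identifier cannot be used directly as a position. The following is a minimal sketch of looking up values in both directions. It uses a hand-built stand-in rather than the packaged data: `toy_vocab` and its four tokens are hypothetical, and only the structure (a named integer vector with class "wordpiece_vocabulary") is taken from the Description above.

```r
# Hypothetical stand-in for a wordpiece vocabulary: names are tokens,
# values are 0-indexed integer identifiers. The tokens are made up.
tokens <- c("[PAD]", "[UNK]", "hello", "##ing")
toy_vocab <- structure(
  setNames(seq_along(tokens) - 1L, tokens),
  class = "wordpiece_vocabulary"
)

# Token -> identifier: index by name, then drop the names.
ids <- unname(unclass(toy_vocab)[c("hello", "[UNK]")])

# Identifier -> token: match against the values instead of using the
# identifiers as positions, since positions are 1-based and identifiers
# are 0-based.
toks <- names(toy_vocab)[match(c(0L, 3L), unclass(toy_vocab))]

print(ids)
print(toks)
```

The same indexing should carry over to the object returned by wordpiece_vocab(), assuming it has the structure given in the Description.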