Package 'morphemepiece' reference manual

Title:	Morpheme Tokenization
Description:	Tokenize text into morphemes. The morphemepiece algorithm uses a lookup table to determine the morpheme breakdown of words, and falls back on a modified wordpiece tokenization algorithm for words not found in the lookup table.
Authors:	Jonathan Bratt [aut, cre], Jon Harmon [aut], Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [cph]
Maintainer:	Jonathan Bratt
License:	Apache License (>= 2)
Version:	1.2.3
Built:	2024-11-19 03:55:21 UTC
Source:	https://github.com/macmillancontentscience/morphemepiece

morphemepiece: Morpheme Tokenization

Description

Tokenize words into morphemes (the smallest unit of meaning).
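The lookup-then-fallback idea behind the package can be sketched in base R. This is only an illustration under stated assumptions: the lookup and vocabulary below are hypothetical, the helper names greedy_wordpiece() and tokenize_word() are not part of the package, and the fallback is a plain greedy longest-match wordpiece using the "##" continuation convention, not the modified variant that morphemepiece actually implements.

```r
# Hypothetical lookup and vocabulary, for illustration only.
lookup <- list(unhappy = c("un", "happy"))
vocab <- c("un", "happy", "walk", "##ing", "[UNK]")

# Plain greedy longest-match-first wordpiece (NOT the package's modified variant).
greedy_wordpiece <- function(word, vocab, unk_token = "[UNK]") {
  pieces <- character(0)
  start <- 1L
  n <- nchar(word)
  while (start <= n) {
    end <- n
    found <- NA_character_
    # Try the longest remaining substring first, then shrink from the right.
    while (end >= start) {
      candidate <- substr(word, start, end)
      if (start > 1L) candidate <- paste0("##", candidate)
      if (candidate %in% vocab) {
        found <- candidate
        break
      }
      end <- end - 1L
    }
    # If no piece matches at this position, the whole word is unknown.
    if (is.na(found)) return(unk_token)
    pieces <- c(pieces, found)
    start <- end + 1L
  }
  pieces
}

# Consult the lookup first; fall back to greedy matching otherwise.
tokenize_word <- function(word, lookup, vocab) {
  if (!is.null(lookup[[word]])) return(lookup[[word]])
  greedy_wordpiece(word, vocab)
}

tokenize_word("unhappy", lookup, vocab)  # found in the lookup
tokenize_word("walking", lookup, vocab)  # falls back to greedy matching
tokenize_word("xyz", lookup, vocab)      # no matching pieces
```

With this toy data, the first call returns the stored breakdown, the second is assembled from vocabulary pieces, and the third yields the unknown token.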

Load a morphemepiece lookup file

Description

Usually you will want to use the included lookup that can be accessed via morphemepiece_lookup(). This function can be used to load a different lookup from a file.

Usage

load_lookup(lookup_file)

Arguments

lookup_file

path to lookup file. File is assumed to be a text file, with one word per line. The lookup value, if different from the word, follows the word on the same line, after a space.

Value

The lookup as a named list. Names are words in lookup.
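To make the file format concrete, here is a minimal base R sketch that writes a small lookup file and parses it into a named list. It is not the package's implementation: the file contents and the breakdown notation are hypothetical, and lookup_sketch is an illustrative name.

```r
# Hypothetical lookup contents: one word per line, optional value after a space.
lookup_file <- tempfile(fileext = ".txt")
writeLines(c("cat", "cats cat s", "unhappy un happy"), lookup_file)

lines <- readLines(lookup_file)

# Position of the first space on each line (-1 when the line has no space).
first_space <- as.integer(regexpr(" ", lines, fixed = TRUE))
has_value <- first_space > 0

# The word is everything before the first space; the value is everything after.
# When no value is given, the word stands for itself.
words <- ifelse(has_value, substr(lines, 1, first_space - 1), lines)
values <- ifelse(has_value, substr(lines, first_space + 1, nchar(lines)), lines)

lookup_sketch <- as.list(stats::setNames(values, words))
names(lookup_sketch)
```

In practice, load_lookup(lookup_file) would be the call to use on such a file; the sketch only shows how the described layout maps onto a named list.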

Load a lookup file, or retrieve from cache

Description

Usually you will want to use the included lookup that can be accessed via morphemepiece_lookup(). This function can be used to load (and cache) a different lookup from a file.

Usage

load_or_retrieve_lookup(lookup_file)

Arguments

lookup_file

path to lookup file. File is assumed to be a text file, with one word per line. The lookup value, if different from the word, follows the word on the same line, after a space.

Value

The lookup table as a named character vector.

Load a vocabulary file, or retrieve from cache

Description

Usually you will want to use the included vocabulary that can be accessed via morphemepiece_vocab(). This function can be used to load (and cache) a different vocabulary from a file.

Usage

load_or_retrieve_vocab(vocab_file)

Arguments

vocab_file

path to vocabulary file. File is assumed to be a text file, with one token per line, with the line number (starting at zero) corresponding to the index of that token in the vocabulary.

Value

The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.

Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing, it would break any pre-trained models using that vocabulary.
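Because R vectors are one-based while the vocabulary indices start at zero, the offset is easy to get wrong. The following base R sketch, using a hypothetical five-token vocabulary file, shows one way to derive the zero-based index of each token from its line position. It does not reproduce the package's loader, and token_index is an illustrative name.

```r
# Hypothetical vocabulary file: one token per line.
vocab_file <- tempfile(fileext = ".txt")
writeLines(c("[PAD]", "[UNK]", "some", "example", "tokens"), vocab_file)

tokens <- readLines(vocab_file)

# R positions start at one, so subtract one to get the zero-based index.
token_index <- stats::setNames(seq_along(tokens) - 1L, tokens)

token_index[["[PAD]"]]    # first line of the file
token_index[["example"]]  # fourth line of the file
```

Here the token on the first line maps to index 0 and the token on the fourth line maps to index 3, which is the mapping a pre-trained model would expect to stay fixed.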

Load a vocabulary file

Description

Usually you will want to use the included vocabulary that can be accessed via morphemepiece_vocab(). This function can be used to load a different vocabulary from a file.

Usage

load_vocab(vocab_file)

Arguments

vocab_file

path to vocabulary file. File is assumed to be a text file, with one token per line, with the line number (starting at zero) corresponding to the index of that token in the vocabulary.

Value

Retrieve Directory for Morphemepiece Cache

Description

The morphemepiece cache directory is a platform- and user-specific path where morphemepiece saves caches (such as a downloaded lookup). You can override the default location in a few ways:

Option: morphemepiece.dir	Use set_morphemepiece_cache_dir to set a specific cache directory for this session.
Environment: MORPHEMEPIECE_CACHE_DIR	Set this environment variable to specify a morphemepiece cache directory for all sessions.
Environment: R_USER_CACHE_DIR	Set this environment variable to specify a cache directory root for all packages that use the caching system.

Usage

morphemepiece_cache_dir()

Value

A character vector with the normalized path to the cache.
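As a rough illustration of the environment-variable override, the sketch below points MORPHEMEPIECE_CACHE_DIR at a directory under the session's temporary directory. The path is hypothetical, and the call to morphemepiece_cache_dir() runs only if the package is installed; the reported path may differ in form from cache_path because it is normalized.

```r
# Hypothetical cache location under the session's temporary directory.
cache_path <- file.path(tempdir(), "morphemepiece-cache")
dir.create(cache_path, showWarnings = FALSE)

# Set the environment variable for the current R process.
Sys.setenv(MORPHEMEPIECE_CACHE_DIR = cache_path)

# Report the cache directory only when the package is available.
if (requireNamespace("morphemepiece", quietly = TRUE)) {
  print(morphemepiece::morphemepiece_cache_dir())
}
```

Sys.setenv() affects only the running process. For the setting to apply to all sessions, the variable would typically be defined in a startup file such as .Renviron.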

Tokenize Sequence with Morpheme Pieces

Description

Given a single sequence of text and a morphemepiece vocabulary, tokenizes the text.

Usage

morphemepiece_tokenize(
  text,
  vocab = morphemepiece_vocab(),
  lookup = morphemepiece_lookup(),
  unk_token = "[UNK]",
  max_chars = 100
)

Arguments

`text`	Character scalar; text to tokenize.
`vocab`	A morphemepiece vocabulary.
`lookup`	A morphemepiece lookup table.
`unk_token`	Token to represent unknown words.
`max_chars`	Maximum length of word recognized.

Value

A character vector of tokenized text. (In a later version, this is intended to be a named integer vector, as in the wordpiece package.)

Format a Token List as a Vocabulary

Description

We use a character vector with class morphemepiece_vocabulary to provide information about tokens used in morphemepiece_tokenize. This function takes a character vector of tokens and puts it into that format.

Usage

prepare_vocab(token_list)

Arguments

token_list

A character vector of tokens.

Value

Examples

my_vocab <- prepare_vocab(c("some", "example", "tokens"))
class(my_vocab)
attr(my_vocab, "is_cased")

Set a Cache Directory for Morphemepiece

Description

Use this function to override the cache path used by morphemepiece for the current session. Set the MORPHEMEPIECE_CACHE_DIR environment variable for a more permanent change.

Usage

set_morphemepiece_cache_dir(cache_dir = NULL)

Arguments

cache_dir

Character scalar; a path to a cache directory.

Value

A normalized path to a cache directory. The directory is created if the user has write access and the directory does not exist.
