Title: | Tools for Preparing Text for Tokenizers |
---|---|
Description: | Tokenizers break text into pieces that are more usable by machine learning models. Many tokenizers share some preparation steps. This package provides those shared steps, along with a simple tokenizer. |
Authors: | Jon Harmon [aut, cre] , Jonathan Bratt [aut] , Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [cph] |
Maintainer: | Jon Harmon <[email protected]> |
License: | Apache License (>= 2) |
Version: | 1.0.2.9000 |
Built: | 2024-11-07 02:40:30 UTC |
Source: | https://github.com/macmillancontentscience/piecemaker |
This is an extremely simple tokenizer that splits text on spaces. It can also optionally apply the cleaning processes from prepare_text().
prepare_and_tokenize(text, prepare = TRUE, ...)
text |
A character vector to clean. |
prepare |
Logical; should the text be passed through prepare_text()? |
... |
Arguments passed on to prepare_text. |
The text as a list of character vectors. Each element of each vector is roughly equivalent to a word.
prepare_and_tokenize("This is some text.")
prepare_and_tokenize("This is some text.", space_punctuation = FALSE)
This function combines the other functions in this package to prepare text for tokenization. The text gets converted to valid UTF-8 (if possible), and then various cleaning functions are applied.
prepare_text(
  text,
  squish_whitespace = TRUE,
  remove_terminal_hyphens = TRUE,
  remove_control_characters = TRUE,
  remove_replacement_characters = TRUE,
  remove_diacritics = TRUE,
  space_cjk = TRUE,
  space_punctuation = TRUE,
  space_hyphens = TRUE,
  space_abbreviations = TRUE
)
text |
A character vector to clean. |
squish_whitespace |
Logical scalar; squish whitespace characters (using str_squish)? |
remove_terminal_hyphens |
Logical; should hyphens at the end of lines after a word be removed? For example, "un-\nbroken" would become "unbroken". |
remove_control_characters |
Logical scalar; remove control characters? |
remove_replacement_characters |
Logical scalar; remove the "replacement character", U+FFFD? |
remove_diacritics |
Logical scalar; remove diacritical marks (accents, etc.) from characters? |
space_cjk |
Logical scalar; add spaces around Chinese/Japanese/Korean ideographs? |
space_punctuation |
Logical scalar; add spaces around punctuation (to make it easier to keep punctuation during tokenization)? |
space_hyphens |
Logical; treat hyphens between letters and at the start/end of words as punctuation? Other hyphens are always treated as punctuation. |
space_abbreviations |
Logical; treat apostrophes between letters as punctuation? Other apostrophes are always treated as punctuation. |
The character vector, cleaned as specified.
piece1 <- " This is a \n\nfa\xE7ile\n\n example.\n"
# Specify encoding so this example behaves the same on all systems.
Encoding(piece1) <- "latin1"
example_text <- paste(
  piece1,
  "It has the bell character, \a, and the replacement character,",
  intToUtf8(65533)
)
prepare_text(example_text)
prepare_text(example_text, squish_whitespace = FALSE)
prepare_text(example_text, remove_control_characters = FALSE)
prepare_text(example_text, remove_replacement_characters = FALSE)
prepare_text(example_text, remove_diacritics = FALSE)
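The examples above do not exercise space_hyphens or space_abbreviations. A small additional sketch (assuming piecemaker is attached; the behavior shown follows the argument descriptions above):

```r
library(piecemaker)

contraction <- "It isn't a well-known fact."

# Defaults: hyphens and apostrophes between letters are treated as
# punctuation and spaced out.
prepare_text(contraction)

# Keep intra-word hyphens attached (e.g. "well-known" stays together).
prepare_text(contraction, space_hyphens = FALSE)

# Keep intra-word apostrophes attached (e.g. "isn't" stays together).
prepare_text(contraction, space_abbreviations = FALSE)
```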
Unicode includes several control codes, such as U+0000 (NULL, used in null-terminated strings) and U+000D (carriage return). This function removes all such characters from text.
remove_control_characters(text)
text |
A character vector to clean. |
Note: We highly recommend that you first condense all space-like characters (including newlines) before removing control codes. You can easily do so with str_squish. We also recommend validating text at the start of any cleaning process using validate_utf8.
The character vector without control characters.
remove_control_characters("Line 1\nLine2")
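The recommended order in the note above (validate, squish whitespace, then strip control codes) can be sketched end to end; this is a minimal illustration, assuming piecemaker is attached:

```r
library(piecemaker)

# A string with a newline (space-like) and a bell character (control code).
raw_text <- "Line 1\nLine 2\a"

# Validate encoding first, condense space-like characters (including the
# newline), then remove the remaining control codes.
cleaned <- remove_control_characters(
  squish_whitespace(
    validate_utf8(raw_text)
  )
)
cleaned
```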
Accent characters and other diacritical marks are often difficult to type, and thus can be missing from text. To normalize the various ways a user might spell a word that should have a diacritical mark, you can convert all such characters to their simpler equivalent character.
remove_diacritics(text)
text |
A character vector to clean. |
The character vector with simpler character representations.
# This text can appear differently between machines if we aren't careful, so
# we explicitly encode the desired characters.
sample_text <- "fa\u00e7ile r\u00e9sum\u00e9"
sample_text
remove_diacritics(sample_text)
The replacement character, U+FFFD, is used to mark characters that could not be loaded. These characters can be a sign of encoding issues, so it is advisable to investigate and try to eliminate any cases in your text; in any event, they will almost certainly confuse downstream processes.
remove_replacement_characters(text)
text |
A character vector to clean. |
The character vector with replacement characters removed.
remove_replacement_characters(
  paste(
    "The replacement character:",
    intToUtf8(65533)
  )
)
To tokenize Chinese, Japanese, and Korean (CJK) characters, it's convenient to add spaces around the characters.
space_cjk(text)
text |
A character vector to clean. |
A character vector the same length as the input text, with spaces added between ideographs.
to_space <- intToUtf8(13312:13320)
to_space
space_cjk(to_space)
To keep punctuation during tokenization, it's convenient to add spacing around punctuation. This function does that, with options to keep certain types of punctuation together as part of the word.
space_punctuation(text, space_hyphens = TRUE, space_abbreviations = TRUE)
text |
A character vector to clean. |
space_hyphens |
Logical; treat hyphens between letters and at the start/end of words as punctuation? Other hyphens are always treated as punctuation. |
space_abbreviations |
Logical; treat apostrophes between letters as punctuation? Other apostrophes are always treated as punctuation. |
A character vector the same length as the input text, with spaces added around punctuation characters.
to_space <- "This is some 'gosh-darn' $5 text. Isn't it lovely?"
to_space
space_punctuation(to_space)
space_punctuation(to_space, space_hyphens = FALSE)
space_punctuation(to_space, space_abbreviations = FALSE)
This function is mostly a wrapper around str_squish, with the additional option to remove hyphens at the ends of lines.
squish_whitespace(text, remove_terminal_hyphens = TRUE)
text |
A character vector to clean. |
remove_terminal_hyphens |
Logical; should hyphens at the end of lines after a word be removed? For example, "un-\nbroken" would become "unbroken". |
The character vector with spacing at the start and end removed, and with internal spacing reduced to a single space character each.
sample_text <- "This had many space char-\nacters."
squish_whitespace(sample_text)
This is an extremely simple tokenizer, breaking only and exactly on the space character. It is intended to work in tandem with prepare_text, so that spaces are cleaned up and inserted as necessary before the tokenizer runs. This function and prepare_text are combined in prepare_and_tokenize.
tokenize_space(text)
text |
A character vector to clean. |
The text as a list of character vectors (one vector per element of text). Each element of each vector is roughly equivalent to a word.
tokenize_space("This is some text.")
Text cleaning works best if the encoding is known. This function attempts to convert text to UTF-8 encoding, and provides an informative error if that is not possible.
validate_utf8(text)
text |
A character vector to clean. |
The text with formal UTF-8 encoding, if possible.
text <- "fa\xE7ile"
# Specify the encoding so the example is the same on all systems.
Encoding(text) <- "latin1"
validate_utf8(text)
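The failure path is not shown above. A hedged sketch, assuming validate_utf8() signals an error when bytes cannot be converted to valid UTF-8 (the byte values here are illustrative):

```r
library(piecemaker)

# Build a string containing a byte sequence that is not valid UTF-8.
bad_text <- rawToChar(as.raw(c(0x61, 0xff, 0xfe, 0x62)))
Encoding(bad_text) <- "UTF-8"  # claim UTF-8 so conversion cannot silently repair it

# Catch the informative error rather than stopping the script.
tryCatch(
  validate_utf8(bad_text),
  error = function(e) conditionMessage(e)
)
```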