utils
Functions for preprocessing texts. Not language-specific.
cltk_normalize
Normalize text to Unicode NFC or NFKC, defaulting to the compatibility (NFKC) form.
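A minimal sketch of such normalization using Python's standard unicodedata module (an illustration of the described behavior, with the signature assumed from the description above, not cltk's actual implementation):

```python
import unicodedata


def cltk_normalize(text: str, compatibility: bool = True) -> str:
    """Return text in NFKC (compatibility) form by default, else NFC."""
    form = "NFKC" if compatibility else "NFC"
    return unicodedata.normalize(form, text)


# A base letter plus a combining accent is composed into one code point:
decomposed = "ε\u0301"  # epsilon + U+0301 COMBINING ACUTE ACCENT
assert len(cltk_normalize(decomposed)) == 1

# NFKC additionally folds compatibility characters, e.g. the "fi" ligature:
assert cltk_normalize("\ufb01") == "fi"
assert cltk_normalize("\ufb01", compatibility=False) == "\ufb01"
```

Normalization matters for polytonic Greek, where the same accented letter can be encoded either precomposed or as a base letter plus combining marks.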
remove_non_ascii
Remove non-ASCII characters.
Source: http://stackoverflow.com/a/1342373
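The linked answer filters characters by code point; a self-contained sketch of that approach (assumed to match the behavior described, not necessarily cltk's exact code):

```python
def remove_non_ascii(input_string: str) -> str:
    """Keep only characters whose code points fall in the ASCII range (< 128)."""
    return "".join(char for char in input_string if ord(char) < 128)


# Greek letters are dropped; ASCII letters, digits, and spaces survive:
# remove_non_ascii("οἶδα abc 123") -> " abc 123"
```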
remove_non_latin
Remove non-Latin characters.
also_keep should be a list of additional characters
(e.g., punctuation) that will not be filtered.
Source code in cltk/text/utils.py
split_trailing_punct
Split trailing punctuation from words.
Some tokenizers, including that in Stanza, do not always
handle punctuation properly. For example, a trailing colon ("οἶδα:")
is not split into an extra punctuation token. This function
does such splitting on raw text before being sent to such
a tokenizer.
Parameters:
- text (str): Input text string.
- punctuation (Optional[list[str]], default: None): List of punctuation that should be split when trailing a word.

Returns:
- str: Text string with trailing punctuation separated by a whitespace character.
Examples:
raw_text = "κατηγόρων’, οὐκ οἶδα: ἐγὼ δ᾽ οὖν"
split_trailing_punct(text=raw_text)
# 'κατηγόρων ’, οὐκ οἶδα : ἐγὼ δ᾽ οὖν'
split_leading_punct
Split leading punctuation from words.
Some tokenizers, including that in Stanza, do not always
handle punctuation properly. For example, an open curly
quote ("‘κατηγόρων’") is not split into an extra punctuation
token. This function does such splitting on raw text before
being sent to such a tokenizer.
Parameters:
- text (str): Input text string.
- punctuation (Optional[list[str]], default: None): List of punctuation that should be split when preceding a word.

Returns:
- str: Text string with leading punctuation separated by a whitespace character.
Examples:
raw_text = "‘κατηγόρων’, οὐκ οἶδα: ἐγὼ δ᾽ οὖν"
split_leading_punct(text=raw_text)
# '‘ κατηγόρων’, οὐκ οἶδα: ἐγὼ δ᾽ οὖν'
remove_odd_punct
Remove certain characters that downstream processes do not handle well.
It would be better to use split_leading_punct()
and split_trailing_punct(); however, the default models
shipped with Stanza make very strange mistakes when, e.g., "‘"
is made its own token.
What to do about the apostrophe following an elision (e.g.,
"δ᾽")?
Examples:
raw_text = "‘κατηγόρων’, οὐκ οἶδα: ἐγὼ δ᾽ οὖν"
remove_odd_punct(raw_text)
# 'κατηγόρων, οὐκ οἶδα ἐγὼ δ᾽ οὖν'
strip_section_numbers
Remove section numbers like '1.2.2', '[55]', '[55A]', '1.2.2A', '55', '55B', '2:4b' from the text.
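One way such stripping could work, with a single regular expression covering the listed patterns (a hypothetical sketch, not cltk's actual regex):

```python
import re

# Matches bare or bracketed section numbers: 1.2.2, [55], [55A], 1.2.2A, 55B, 2:4b
_SECTION_NUMBER = re.compile(r"\[?\b\d+(?:[.:]\d+)*[A-Za-z]?\]?")


def strip_section_numbers(text: str) -> str:
    """Remove section-number tokens, then collapse any doubled whitespace."""
    stripped = _SECTION_NUMBER.sub("", text)
    return re.sub(r"\s{2,}", " ", stripped).strip()
```

Note that a pattern this broad also removes bare numerals like "55" that may not be section numbers, which matches the examples given above.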