Source: texcla/preprocessing/char_tokenizer.py#L0
CharTokenizer
CharTokenizer.has_vocab
Whether a vocabulary has been built (via build_vocab).
CharTokenizer.num_texts
The number of texts used to build the vocabulary.
CharTokenizer.num_tokens
Number of unique tokens for use in encoding/decoding.
This can change with calls to apply_encoding_options.
CharTokenizer.token_counts
Dictionary of token -> count values for the text corpus used by build_vocab.
CharTokenizer.token_index
Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.
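A minimal sketch of how these properties fit together. The build_vocab and apply_encoding_options calls, and the min_token_count keyword in particular, are assumptions based on the texcla Tokenizer API rather than anything confirmed on this page:

```python
# Minimal sketch; method and keyword names are assumptions, not confirmed API.
from texcla.preprocessing.char_tokenizer import CharTokenizer

tokenizer = CharTokenizer(lang='en', lower=True)
tokenizer.build_vocab(['hello world', 'hello again'])

print(tokenizer.has_vocab)     # True once a vocabulary has been built
print(tokenizer.num_texts)     # 2: number of texts used to build the vocabulary
print(tokenizer.token_counts)  # character -> count over the corpus

# Restricting the vocabulary changes num_tokens and token_index.
tokenizer.apply_encoding_options(min_token_count=2)
print(tokenizer.num_tokens)
print(tokenizer.token_index)   # character -> idx after the restriction
```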
CharTokenizer.__init__
__init__(self, lang="en", lower=True, charset=None)
Encodes text into (samples, characters)
Args:
- lang: The spaCy language to use. (Default value: 'en')
- lower: Lowercases the tokens if True. (Default value: True)
- charset: The character set to use. For example, charset = 'abc123'. If None, all characters will be used. (Default value: None)
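A short usage sketch for CharTokenizer; build_vocab and encode_texts are assumptions based on the texcla Tokenizer API:

```python
from texcla.preprocessing.char_tokenizer import CharTokenizer

# charset restricts encoding to the given characters; None keeps all of them.
tokenizer = CharTokenizer(lang='en', lower=True, charset=None)
texts = ['The quick brown fox.', 'Hello world!']

tokenizer.build_vocab(texts)
encoded = tokenizer.encode_texts(texts)  # nested as (samples, characters)
```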
SentenceCharTokenizer
SentenceCharTokenizer.has_vocab
Whether a vocabulary has been built (via build_vocab).
SentenceCharTokenizer.num_texts
The number of texts used to build the vocabulary.
SentenceCharTokenizer.num_tokens
Number of unique tokens for use in encoding/decoding.
This can change with calls to apply_encoding_options.
SentenceCharTokenizer.token_counts
Dictionary of token -> count values for the text corpus used by build_vocab.
SentenceCharTokenizer.token_index
Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.
SentenceCharTokenizer.__init__
__init__(self, lang="en", lower=True, charset=None)
Encodes text into (samples, sentences, characters)
Args:
- lang: The spaCy language to use. (Default value: 'en')
- lower: Lowercases the tokens if True. (Default value: True)
- charset: The character set to use. For example, charset = 'abc123'. If None, all characters will be used. (Default value: None)
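A matching sketch for SentenceCharTokenizer, which adds a sentence level to the output; as above, build_vocab and encode_texts are assumed from the texcla Tokenizer API:

```python
from texcla.preprocessing.char_tokenizer import SentenceCharTokenizer

tokenizer = SentenceCharTokenizer(lang='en', lower=True)
texts = ['First sentence. Second sentence.', 'Another document.']

tokenizer.build_vocab(texts)
encoded = tokenizer.encode_texts(texts)  # nested as (samples, sentences, characters)
```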