Source: texcla/preprocessing/tokenizer.py#L0
Tokenizer
Tokenizer.has_vocab
Whether a vocabulary has been built for this tokenizer.
Tokenizer.num_texts
The number of texts used to build the vocabulary.
Tokenizer.num_tokens
Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.
Tokenizer.token_counts
Dictionary of token -> count values for the text corpus used to build_vocab.
Tokenizer.token_index
Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.
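The properties above map onto a simple workflow: build a vocabulary from a corpus, then inspect the resulting counts and index mappings. The following is a minimal sketch of that workflow; the concrete subclass name SpacyTokenizer, its import path, and the exact build_vocab call signature are assumptions, since this section only documents the base class.

```python
# Minimal sketch: building a vocabulary and inspecting the tokenizer properties.
# SpacyTokenizer and its import path are assumptions; only build_vocab,
# apply_encoding_options and the properties documented above are taken from this page.
from texcla.preprocessing import SpacyTokenizer  # assumed import path

texts = [
    "The cat sat on the mat.",
    "Dogs and cats are friendly animals.",
]

tokenizer = SpacyTokenizer(lang="en", lower=True)
print(tokenizer.has_vocab)     # False before a vocabulary is built

tokenizer.build_vocab(texts)   # build token -> count and token -> idx mappings

print(tokenizer.has_vocab)     # True
print(tokenizer.num_texts)     # 2: number of texts used to build the vocabulary
print(tokenizer.num_tokens)    # number of unique tokens for encoding/decoding
print(tokenizer.token_counts)  # dict of token -> count over the corpus
print(tokenizer.token_index)   # dict of token -> integer index
```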
Tokenizer.__init__
__init__(self, lang="en", lower=True, special_token=['<PAD>', '<UNK>'])
Encodes text into (samples, aux_indices..., token), where each token is mapped to a unique index starting from i, and i is the number of special tokens.
Args:
- lang: The spacy language to use. (Default value: 'en')
- lower: Lower cases the tokens if True. (Default value: True)
- special_token: The tokens that are reserved. Default: ['<PAD>', '<UNK>'], for the padding token and for unknown words (see the sketch below).
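As a rough illustration of the index offset described above (again assuming a concrete SpacyTokenizer subclass and that the special tokens occupy the lowest indices), two special tokens mean regular tokens receive indices starting at 2:

```python
# Sketch of the special-token index offset; SpacyTokenizer and the exact
# ordering of special-token indices are assumptions based on the docstring.
from texcla.preprocessing import SpacyTokenizer  # assumed import path

tokenizer = SpacyTokenizer(
    lang="en",                         # spacy language to use
    lower=True,                        # lower-case tokens before indexing
    special_token=["<PAD>", "<UNK>"],  # reserved tokens
)
tokenizer.build_vocab(["hello world", "hello again"])

# With two special tokens, regular tokens start at index i = 2.
for token, idx in sorted(tokenizer.token_index.items(), key=lambda kv: kv[1]):
    print(idx, token)

# token_index and num_tokens can later change via apply_encoding_options,
# e.g. when rare tokens are dropped from the vocabulary.
```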