Source: texcla/preprocessing/tokenizer.py#L0


Tokenizer

Tokenizer.has_vocab

Whether a vocabulary has been built (i.e. build_vocab has been called).

Tokenizer.num_texts

The number of texts used to build the vocabulary.

Tokenizer.num_tokens

Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.

Tokenizer.token_counts

Dictionary of token -> count values for the text corpus passed to build_vocab.

Tokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.
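
A minimal usage sketch tying these properties together. It assumes build_vocab accepts a list of raw text strings and that Tokenizer (or a concrete subclass that implements the actual tokenization) can be used this way; the corpus and printed values are hypothetical.

```python
from texcla.preprocessing.tokenizer import Tokenizer

# Hypothetical two-document corpus; build_vocab populates the properties above.
texts = ["The cat sat on the mat.", "The dog barked."]

tokenizer = Tokenizer(lang="en", lower=True)
tokenizer.build_vocab(texts)

print(tokenizer.has_vocab)     # True once a vocabulary has been built
print(tokenizer.num_texts)     # 2 -- the number of texts passed to build_vocab
print(tokenizer.num_tokens)    # unique tokens, including the reserved special tokens
print(tokenizer.token_counts)  # e.g. {'the': 3, 'cat': 1, ...}
print(tokenizer.token_index)   # e.g. {'<PAD>': 0, '<UNK>': 1, 'the': 2, ...}
```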


Tokenizer.__init__

__init__(self, lang="en", lower=True, special_token=['<PAD>', '<UNK>'])

Encodes text into (samples, aux_indices..., token), where each token is mapped to a unique index starting from i, the number of special tokens.

Args:

  • lang: The spaCy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)
  • special_token: The tokens that are reserved. (Default value: ['<PAD>', '<UNK>'], the padding token and the token for unknown words, respectively.)
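
To illustrate the indexing described above: with the default two special tokens, the reserved tokens occupy the lowest indices and ordinary vocabulary tokens are assigned indices starting at i = 2. The sketch below uses the same assumptions as the earlier example (hypothetical corpus; the exact indices assigned to the reserved tokens are an assumption).

```python
from texcla.preprocessing.tokenizer import Tokenizer

tokenizer = Tokenizer(lang="en", lower=True, special_token=['<PAD>', '<UNK>'])
tokenizer.build_vocab(["a tiny example text"])  # hypothetical one-document corpus

special = {'<PAD>', '<UNK>'}
# Ordinary tokens are indexed starting at i = len(special_token) = 2.
assert all(idx >= len(special)
           for token, idx in tokenizer.token_index.items()
           if token not in special)
```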