Source: texcla/preprocessing/tokenizer.py#L0


Tokenizer

Tokenizer.has_vocab

Whether a vocabulary has been built (i.e. build_vocab has been called).

Tokenizer.num_texts

The number of texts used to build the vocabulary.

Tokenizer.num_tokens

Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.

Tokenizer.token_counts

Dictionary of token -> count values for the text corpus passed to build_vocab.

Tokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.
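
A minimal usage sketch tying these properties together. It assumes build_vocab accepts a list of raw text strings and that Tokenizer (or a concrete subclass that implements the actual tokenization) can be used this way; the corpus and printed values are hypothetical.

```python
from texcla.preprocessing.tokenizer import Tokenizer

# Hypothetical two-document corpus; build_vocab populates the properties above.
texts = ["The cat sat on the mat.", "The dog barked."]

tokenizer = Tokenizer(lang="en", lower=True)
tokenizer.build_vocab(texts)

print(tokenizer.has_vocab)     # True once a vocabulary has been built
print(tokenizer.num_texts)     # 2 -- the number of texts passed to build_vocab
print(tokenizer.num_tokens)    # unique tokens, including the reserved special tokens
print(tokenizer.token_counts)  # e.g. {'the': 3, 'cat': 1, ...}
print(tokenizer.token_index)   # e.g. {'<PAD>': 0, '<UNK>': 1, 'the': 2, ...}
```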


Tokenizer.__init__

__init__(self, lang="en", lower=True, special_token=['<PAD>', '<UNK>'])

Encodes text into (samples, aux_indices..., token), where each token is mapped to a unique index starting from i, the number of special tokens.

Args:

  • lang: The spaCy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)
  • special_token: The tokens that are reserved. (Default value: ['<PAD>', '<UNK>'], the padding token and the token for unknown words, respectively.)
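
To illustrate the indexing described above: with the default two special tokens, the reserved tokens occupy the lowest indices and ordinary vocabulary tokens are assigned indices starting at i = 2. The sketch below uses the same assumptions as the earlier example (hypothetical corpus; the exact indices assigned to the reserved tokens are an assumption).

```python
from texcla.preprocessing.tokenizer import Tokenizer

tokenizer = Tokenizer(lang="en", lower=True, special_token=['<PAD>', '<UNK>'])
tokenizer.build_vocab(["a tiny example text"])  # hypothetical one-document corpus

special = {'<PAD>', '<UNK>'}
# Ordinary tokens are indexed starting at i = len(special_token) = 2.
assert all(idx >= len(special)
           for token, idx in tokenizer.token_index.items()
           if token not in special)
```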