Source: texcla/preprocessing/char_tokenizer.py#L0


CharTokenizer

CharTokenizer.has_vocab

Whether the vocabulary has been built (via build_vocab).

CharTokenizer.num_texts

The number of texts used to build the vocabulary.

CharTokenizer.num_tokens

Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.

CharTokenizer.token_counts

Dictionary of token -> count values for the text corpus passed to build_vocab.

CharTokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.


CharTokenizer.__init__

__init__(self, lang="en", lower=True, charset=None)

Encodes text into (samples, characters).

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)
  • charset: The character set to use. For example, charset='abc123'. If None, all characters are used. (Default value: None)
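
A minimal usage sketch follows. build_vocab and apply_encoding_options are the methods referenced on this page; the import path, encode_texts, and the min_token_count option follow the library's usual tokenizer API and are assumptions rather than confirmed signatures.

```python
# Minimal sketch, assuming the usual texcla tokenizer API.
from texcla.preprocessing import CharTokenizer

texts = [
    'The quick brown fox jumped.',
    'The lazy dog slept.',
]

tokenizer = CharTokenizer(lang='en', lower=True)
tokenizer.build_vocab(texts)            # build the character vocabulary

print(tokenizer.num_texts)              # 2: texts used to build the vocabulary
print(tokenizer.num_tokens)             # number of unique characters seen
print(tokenizer.token_counts['e'])      # corpus count for the character 'e'

# Pruning changes num_tokens and token_index; min_token_count is an
# assumed option name for apply_encoding_options.
tokenizer.apply_encoding_options(min_token_count=2)

encoded = tokenizer.encode_texts(texts)  # nested (samples, characters) ids
```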

SentenceCharTokenizer

SentenceCharTokenizer.has_vocab

Whether the vocabulary has been built (via build_vocab).

SentenceCharTokenizer.num_texts

The number of texts used to build the vocabulary.

SentenceCharTokenizer.num_tokens

Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.

SentenceCharTokenizer.token_counts

Dictionary of token -> count values for the text corpus passed to build_vocab.

SentenceCharTokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.


SentenceCharTokenizer.__init__

__init__(self, lang="en", lower=True, charset=None)

Encodes text into (samples, sentences, characters).

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)
  • charset: The character set to use. For example, charset='abc123'. If None, all characters are used. (Default value: None)
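
The flow is the same as for CharTokenizer, with one extra nesting level for sentences. This sketch reuses the assumed API from the example above (encode_texts and the import path are not confirmed by this page) and shows charset restricting which characters enter the vocabulary.

```python
# Minimal sketch; same assumed API as the CharTokenizer example above.
# Sentence splitting relies on the spacy model for the chosen lang.
from texcla.preprocessing import SentenceCharTokenizer

texts = ['First sentence here. Second sentence here.']

# Restrict the vocabulary to lowercase letters, space, and the period.
tokenizer = SentenceCharTokenizer(lang='en', lower=True,
                                  charset='abcdefghijklmnopqrstuvwxyz. ')
tokenizer.build_vocab(texts)

# Output is nested one level deeper than CharTokenizer:
# (samples, sentences, characters) ids.
encoded = tokenizer.encode_texts(texts)
```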