Source: texcla/preprocessing/word_tokenizer.py#L0
SpacyTokenizer
SpacyTokenizer.has_vocab
SpacyTokenizer.num_texts
The number of texts used to build the vocabulary.
SpacyTokenizer.num_tokens
Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.
SpacyTokenizer.token_counts
Dictionary of token -> count values for the text corpus used to build_vocab.
SpacyTokenizer.token_index
Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.
SpacyTokenizer.__init__
__init__(self, lang="en", lower=True, lemmatize=False, remove_punct=True, remove_digits=True, remove_stop_words=False, exclude_oov=False, exclude_pos_tags=None, exclude_entities=['PERSON'])
Encodes text into (samples, words).
Args:
- lang: The spacy language to use. (Default value: 'en')
- lower: Lower cases the tokens if True. (Default value: True)
- lemmatize: Lemmatizes words when set to True. This also makes the word lower case irrespective of the lower setting. (Default value: False)
- remove_punct: Removes punctuation tokens if True. (Default value: True)
- remove_digits: Removes digit words if True. (Default value: True)
- remove_stop_words: Removes stop words if True. (Default value: False)
- exclude_oov: Excludes words that are out of the spacy embedding's vocabulary. By default, spacy's GloVe vectors (1 million word vectors, 300 dimensions) are used. You can override the spacy vocabulary with a custom embedding to change this. (Default value: False)
- exclude_pos_tags: A list of part-of-speech tags to exclude. Can be any of spacy.parts_of_speech.IDS. (Default value: None)
- exclude_entities: A list of entity types to be excluded. Supported entity types can be found here: https://spacy.io/docs/usage/entity-recognition#entity-types (Default value: ['PERSON'])
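A minimal usage sketch for SpacyTokenizer. The constructor options and the build_vocab, num_texts, and token_counts members come from this page; the encode_texts call is an assumption about the typical encode step and may be named differently in your version.

```python
from texcla.preprocessing.word_tokenizer import SpacyTokenizer

texts = [
    "Alice met Bob in Paris on Friday.",
    "The quick brown fox jumps over the lazy dog.",
]

# Lemmatize tokens and drop PERSON entities (the default exclusion).
tokenizer = SpacyTokenizer(lang="en", lemmatize=True, exclude_entities=["PERSON"])

# Build the vocabulary from the corpus; token_counts and token_index are populated here.
tokenizer.build_vocab(texts)
print(tokenizer.num_texts)     # number of texts used to build the vocabulary
print(tokenizer.token_counts)  # token -> count mapping

# Assumed encode step: map each text to a sequence of token indices.
encoded = tokenizer.encode_texts(texts)
```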
TwokenizeTokenizer
TwokenizeTokenizer.has_vocab
TwokenizeTokenizer.num_texts
The number of texts used to build the vocabulary.
TwokenizeTokenizer.num_tokens
Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.
TwokenizeTokenizer.token_counts
Dictionary of token -> count values for the text corpus used to build_vocab.
TwokenizeTokenizer.token_index
Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.
TwokenizeTokenizer.__init__
__init__(self, lang="en", lower=True)
Encodes text into (samples, aux_indices..., token) where each token is mapped to a unique index starting from i, where i is the number of special tokens.
Args:
- lang: The spacy language to use. (Default value: 'en')
- lower: Lower cases the tokens if True. (Default value: True)
- special_token: The tokens that are reserved. Default: ['<UNK>', '<PAD>'], for unknown words and for the padding token.
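A short sketch of TwokenizeTokenizer on tweet-like text. Only members documented on this page (build_vocab, has_vocab, token_counts) are used; the sample tweets and the assumption that build_vocab accepts a list of strings are illustrative.

```python
from texcla.preprocessing.word_tokenizer import TwokenizeTokenizer

tweets = [
    "@user loving this #nlp library :-)",
    "check it out -> https://example.com #ml",
]

tokenizer = TwokenizeTokenizer(lang="en", lower=True)
tokenizer.build_vocab(tweets)

print(tokenizer.has_vocab)     # True once a vocabulary has been built
print(tokenizer.token_counts)  # hashtags, mentions and emoticons survive as tokens
```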
SimpleTokenizer
SimpleTokenizer.has_vocab
SimpleTokenizer.num_texts
The number of texts used to build the vocabulary.
SimpleTokenizer.num_tokens
Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.
SimpleTokenizer.token_counts
Dictionary of token -> count values for the text corpus used to build_vocab.
SimpleTokenizer.token_index
Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.
SimpleTokenizer.__init__
__init__(self, lang="en", lower=True)
Encodes text into (samples, aux_indices..., token) where each token is mapped to a unique index starting from i, where i is the number of special tokens.
Args:
- lang: The spacy language to use. (Default value: 'en')
- lower: Lower cases the tokens if True. (Default value: True)
- special_token: The tokens that are reserved. Default: ['<UNK>', '<PAD>'], for unknown words and for the padding token.
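A sketch of how apply_encoding_options can change num_tokens and token_index after the vocabulary is built, as the property descriptions above note. The min_token_count keyword is an assumed option name; the rest uses members listed on this page.

```python
from texcla.preprocessing.word_tokenizer import SimpleTokenizer

texts = ["a b b c c c", "c d"]

tokenizer = SimpleTokenizer(lang="en", lower=True)
tokenizer.build_vocab(texts)
print(tokenizer.num_tokens)   # unique tokens before any filtering

# Assumed option: drop tokens occurring fewer than 2 times in the corpus.
tokenizer.apply_encoding_options(min_token_count=2)
print(tokenizer.num_tokens)   # may shrink after applying encoding options
print(tokenizer.token_index)  # token -> idx mapping is rebuilt accordingly
```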
FastTextWikiTokenizer
FastTextWikiTokenizer.has_vocab
FastTextWikiTokenizer.num_texts
The number of texts used to build the vocabulary.
FastTextWikiTokenizer.num_tokens
Number of unique tokens for use in encoding/decoding. This can change with calls to apply_encoding_options.
FastTextWikiTokenizer.token_counts
Dictionary of token -> count values for the text corpus used to build_vocab.
FastTextWikiTokenizer.token_index
Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.
FastTextWikiTokenizer.__init__
__init__(self, lang="en")
Encodes text into (samples, aux_indices..., token) where each token is mapped to a unique index starting from i, where i is the number of special tokens.
Args:
- lang: The spacy language to use. (Default value: 'en')
- lower: Lower cases the tokens if True. (Default value: True)
- special_token: The tokens that are reserved. Default: ['<UNK>', '<PAD>'], for unknown words and for the padding token.
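A sketch of FastTextWikiTokenizer, whose constructor only takes lang. Only members documented on this page are used; the assumption that build_vocab accepts a list of strings mirrors the other tokenizers.

```python
from texcla.preprocessing.word_tokenizer import FastTextWikiTokenizer

texts = ["Machine learning articles from a Wikipedia dump."]

tokenizer = FastTextWikiTokenizer(lang="en")
tokenizer.build_vocab(texts)

print(tokenizer.num_texts)    # 1, the number of texts used to build the vocabulary
print(tokenizer.token_index)  # token -> idx mapping built from the corpus
```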