Source: texcla/preprocessing/sentence_tokenizer.py#L0


SpacySentenceTokenizer

SpacySentenceTokenizer.has_vocab

Returns True if a vocabulary has been built.

SpacySentenceTokenizer.num_texts

The number of texts used to build the vocabulary.

SpacySentenceTokenizer.num_tokens

Number of unique tokens available for encoding/decoding. This can change with calls to apply_encoding_options.

SpacySentenceTokenizer.token_counts

Dictionary of token -> count values for the text corpus passed to build_vocab.

SpacySentenceTokenizer.token_index

Dictionary of token -> idx mappings. This can change with calls to apply_encoding_options.
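
The build-then-inspect flow below is a minimal sketch of how these properties fit together. The import path is assumed from the Source path at the top of this page, and the min_token_count keyword passed to apply_encoding_options is an illustrative assumption, not a documented signature.

```python
# Minimal sketch (import path assumed from the Source path above).
from texcla.preprocessing import SpacySentenceTokenizer

texts = [
    'Alice went to the market. She bought two apples.',
    'The market was closed. She went home.',
]

tokenizer = SpacySentenceTokenizer()
tokenizer.build_vocab(texts)       # populates token_counts and token_index

print(tokenizer.has_vocab)         # True once build_vocab has run
print(tokenizer.num_texts)         # 2 -- texts used to build the vocabulary
print(tokenizer.num_tokens)        # number of unique tokens for encoding

# Restricting the vocabulary can change num_tokens and token_index.
# The keyword name here is an assumption for illustration.
tokenizer.apply_encoding_options(min_token_count=2)
```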


SpacySentenceTokenizer.__init__

__init__(self, lang="en", lower=True, lemmatize=False, remove_punct=True, remove_digits=True, \
    remove_stop_words=False, exclude_oov=False, exclude_pos_tags=None, \
    exclude_entities=['PERSON'])

Encodes text into (samples, sentences, words). A usage sketch follows the argument list below.

Args:

  • lang: The spacy language to use. (Default value: 'en')
  • lower: Lower cases the tokens if True. (Default value: True)
  • lemmatize: Lemmatizes words when set to True. This also lower-cases the word, irrespective of the lower setting. (Default value: False)
  • remove_punct: Removes punctuation tokens if True. (Default value: True)
  • remove_digits: Removes digit tokens if True. (Default value: True)
  • remove_stop_words: Removes stop words if True. (Default value: False)
  • exclude_oov: Excludes words that are out of the spacy embedding's vocabulary if True. By default, spacy's GloVe vectors (1 million word vocabulary, 300 dimensions) are used; you can override the spacy vocabulary with a custom embedding to change this. (Default value: False)
  • exclude_pos_tags: A list of part-of-speech tags to exclude. Can be any of spacy.parts_of_speech.IDS. (Default value: None)
  • exclude_entities: A list of entity types to be excluded. Supported entity types can be found here: https://spacy.io/docs/usage/entity-recognition#entity-types (Default value: ['PERSON'])
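
As a hedged illustration of these options, the sketch below constructs a tokenizer with non-default settings and encodes texts into the (samples, sentences, words) shape described above. The encode_texts method name is an assumption about the library's tokenizer interface; only the constructor keywords are documented on this page.

```python
# Hedged sketch: the constructor options in use. The import path and the
# encode_texts call are assumptions; the keyword arguments are documented above.
from texcla.preprocessing import SpacySentenceTokenizer

tokenizer = SpacySentenceTokenizer(
    lang='en',
    lower=True,
    lemmatize=True,               # also lower-cases, irrespective of `lower`
    remove_punct=True,
    remove_digits=True,
    remove_stop_words=True,
    exclude_oov=False,
    exclude_pos_tags=None,
    exclude_entities=['PERSON'],  # PERSON entities are dropped by default
)

texts = ['Dr. Smith visited Paris in 2019. It rained all week.']
tokenizer.build_vocab(texts)

# Each text becomes a list of sentences; each sentence a list of word indices.
encoded = tokenizer.encode_texts(texts)
```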