Source: texcla/embeddings.py#L0


build_embedding_weights

build_embedding_weights(word_index, embeddings_index)

Builds an embedding matrix for all words in vocab using embeddings_index
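The logic behind such a builder can be sketched as follows. This is a hypothetical illustration, not the library's implementation: the `embedding_dims` parameter and the zero-row fallback for out-of-vocabulary words are assumptions.

```python
import numpy as np

def build_embedding_weights(word_index, embeddings_index, embedding_dims=300):
    """Illustrative sketch: build an embedding matrix whose rows follow
    the integer indices in `word_index`.

    `word_index` maps token -> integer index; `embeddings_index` maps
    token -> vector. Words missing from `embeddings_index` keep an
    all-zero row (an assumed convention).
    """
    # +1 because index 0 is conventionally reserved for padding.
    weights = np.zeros((len(word_index) + 1, embedding_dims))
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            weights[i] = vector
    return weights
```

The resulting matrix can be passed directly as the initial weights of an embedding layer, with row `i` holding the vector for the word whose index is `i`.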


build_fasttext_wiki_embedding_obj

build_fasttext_wiki_embedding_obj(embedding_type)

FastText pre-trained word vectors for 294 languages, with 300 dimensions, trained on Wikipedia. It's recommended to use the same tokenizer for your data that was used to construct the embeddings; for these vectors it is implemented as FasttextWikiTokenizer. More information: https://fasttext.cc/docs/en/pretrained-vectors.html.

Args:

  • embedding_type: A string in the format fasttext.wiki.$LANG_CODE, e.g. fasttext.wiki.de or fasttext.wiki.es.

Returns:

Object with the URL and filename used later on for downloading the file.
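A minimal sketch of how such a descriptor might be constructed. The dictionary keys and the download URL pattern are assumptions inferred from the fastText download page, not the library's actual return value:

```python
def build_fasttext_wiki_embedding_obj(embedding_type):
    """Hypothetical sketch: turn 'fasttext.wiki.$LANG_CODE' into a
    descriptor holding the URL and filename used for downloading."""
    lang = embedding_type.split(".")[-1]  # e.g. 'de' from 'fasttext.wiki.de'
    filename = "wiki.{}.vec".format(lang)
    return {
        "file": filename,
        # URL pattern is an assumption based on the fastText download page.
        "url": "https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/" + filename,
    }
```

A downloader can then fetch `obj["url"]` and cache it locally under `obj["file"]`.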


build_fasttext_cc_embedding_obj

build_fasttext_cc_embedding_obj(embedding_type)

FastText pre-trained word vectors for 157 languages, with 300 dimensions, trained on Common Crawl and Wikipedia. Released in 2018, they succeeded the 2017 FastText Wikipedia embeddings. It's recommended to use the same tokenizer for your data that was used to construct the embeddings. This information and more can be found on their website: https://fasttext.cc/docs/en/crawl-vectors.html.

Args:

  • embedding_type: A string in the format fasttext.cc.$LANG_CODE, e.g. fasttext.cc.de or fasttext.cc.es.

Returns:

Object with the URL and filename used later on for downloading the file.


get_embedding_type

get_embedding_type(embedding_type)

get_embeddings_index

get_embeddings_index(embedding_type="glove.42B.300d", embedding_path=None, \
    embedding_dims=None)

Retrieves embeddings index from embedding name or path. Will automatically download and cache as needed.

Args:

  • embedding_type: The embedding type to load.
  • embedding_path: Path to a local embedding to use instead of the embedding type. If specified, embedding_type is ignored.
  • embedding_dims: The dimensionality of the embedding vectors.

Returns:

The embeddings indexed by word.
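The shape of the returned index can be illustrated with a sketch of how a local GloVe/fastText-style text file (one token per line followed by its vector components) might be parsed into a word-to-vector mapping. The function name and the header-skipping behavior are assumptions for illustration, not the library's implementation:

```python
import io
import numpy as np

def load_embeddings_index(embedding_path, embedding_dims=None):
    """Hypothetical sketch: parse a plain-text embedding file into a
    dict mapping each word to its vector."""
    index = {}
    with io.open(embedding_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            # fastText .vec files start with a "count dims" header line.
            if len(parts) == 2:
                continue
            word = parts[0]
            vector = np.asarray(parts[1:], dtype="float32")
            if embedding_dims is not None and vector.size != embedding_dims:
                continue  # skip malformed rows
            index[word] = vector
    return index
```

The resulting dict is the kind of "embeddings indexed by word" object that `build_embedding_weights` consumes.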