sourced.ml.core.algorithms.swivel
Submatrix-wise Vector Embedding Learner.
Implementation of the Swivel algorithm described at: http://arxiv.org/abs/1602.02215
This program expects an input directory that contains the following files.
row_vocab.txt, col_vocab.txt
The row and column vocabulary files. Each file should contain one token per line; these will be used to generate a tab-separated file containing the trained embeddings.
row_sums.txt, col_sums.txt
The matrix row and column marginal sums. Each file should contain one decimal floating-point number per line, which corresponds to the marginal count of the matrix for that row or column.
shards.recs
A file containing the sub-matrix shards, stored as TFRecords. Each shard is expected to be a serialized tf.Example protocol buffer with the following properties:
global_row: the global row indices contained in the shard
global_col: the global column indices contained in the shard
sparse_local_row, sparse_local_col, sparse_value: three parallel arrays that are a sparse representation of the submatrix counts.
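To make the shard fields concrete, the sketch below derives them from a small dense count matrix using plain Python lists. The helper name `dense_to_shard_fields` and the example counts are hypothetical, and the sketch does not serialize anything to a tf.Example; it only shows how the three parallel arrays relate to the submatrix.

```python
def dense_to_shard_fields(counts, global_row, global_col):
    """Build shard-style fields from a dense submatrix (illustrative only).

    counts[i][j] is the count for global row global_row[i] and
    global column global_col[j]. Zero entries are left out of the
    three parallel sparse arrays.
    """
    if len(counts) != len(global_row):
        raise ValueError("one global row index is needed per submatrix row")

    sparse_local_row, sparse_local_col, sparse_value = [], [], []
    for i, row in enumerate(counts):
        if len(row) != len(global_col):
            raise ValueError("one global column index is needed per column")
        for j, value in enumerate(row):
            if value != 0:
                # Local indices address positions inside this submatrix.
                sparse_local_row.append(i)
                sparse_local_col.append(j)
                sparse_value.append(float(value))

    return {
        "global_row": list(global_row),
        "global_col": list(global_col),
        "sparse_local_row": sparse_local_row,
        "sparse_local_col": sparse_local_col,
        "sparse_value": sparse_value,
    }


# Hypothetical 2x2 submatrix covering global rows 3, 7 and columns 1, 4.
fields = dense_to_shard_fields(
    counts=[[0.0, 2.0], [5.0, 0.0]],
    global_row=[3, 7],
    global_col=[1, 4],
)
```

Here entry k of the three sparse arrays describes a single non-zero count, and the global index of its row is `global_row[sparse_local_row[k]]`.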
The program trains embeddings from the data in the input directory for the specified number of epochs. When training completes, it writes the trained vectors to tab-separated files that contain one line per embedding. Row and column embeddings are stored in separate files.
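A minimal sketch of reading such an output file back, assuming each line holds the token first, followed by the vector components, all separated by tabs. That column layout and the function name `parse_embedding_lines` are assumptions for illustration, not something this page specifies.

```python
import io


def parse_embedding_lines(lines):
    """Parse tab-separated embedding lines into a {token: vector} dict.

    Assumes the layout "<token>\t<x1>\t<x2>...", one embedding per line.
    """
    embeddings = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # Tolerate blank lines.
        token, *components = line.split("\t")
        embeddings[token] = [float(x) for x in components]
    return embeddings


# Hypothetical file content with two 2-dimensional embeddings.
sample = io.StringIO("foo\t0.5\t-1.0\nbar\t0.25\t2.0\n")
parsed = parse_embedding_lines(sample)
```

Because row and column embeddings live in separate files, the same parser would be run once per file.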
Module Contents
- sourced.ml.core.algorithms.swivel.embeddings_with_init(vocab_size, embedding_dim, name)
  Creates and initializes the embedding tensors.
- sourced.ml.core.algorithms.swivel.count_matrix_input(filenames, submatrix_rows, submatrix_cols)
  Reads submatrix shards from disk.
- sourced.ml.core.algorithms.swivel.read_marginals_file(filename)
  Reads a text file with one number per line into an array.
- sourced.ml.core.algorithms.swivel.write_embedding_tensor_to_disk(vocab_path, output_path, sess, embedding)
  Writes the tensor to output_path as TSV.
- sourced.ml.core.algorithms.swivel.write_embeddings_to_disk(config, model, sess)
  Writes row and column embeddings to disk.
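As a rough illustration of the two file-oriented helpers listed above, the sketch below reads a marginals file into a list of floats and writes token/vector pairs as tab-separated lines. It uses only the standard library. The names `read_numbers_per_line` and `write_embedding_rows`, and the token-first line layout, are assumptions for illustration; this is not the module's implementation, whose functions take a session and an embedding tensor.

```python
import io
import os
import tempfile


def read_numbers_per_line(filename):
    """Read a text file with one number per line into a list of floats."""
    with open(filename, encoding="utf-8") as handle:
        return [float(line) for line in handle if line.strip()]


def write_embedding_rows(tokens, vectors, out):
    """Write one "<token>\t<x1>\t<x2>..." line per embedding to `out`."""
    if len(tokens) != len(vectors):
        raise ValueError("each token needs exactly one vector")
    for token, vector in zip(tokens, vectors):
        fields = [token.strip()] + [str(x) for x in vector]
        out.write("\t".join(fields) + "\n")


# Round-trip a small, made-up marginals file through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "row_sums.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("12.0\n3.5\n0\n")
    marginals = read_numbers_per_line(path)

# Write two made-up embeddings to an in-memory buffer.
buffer = io.StringIO()
write_embedding_rows(["foo", "bar"], [[0.5, -1.0], [0.25, 2.0]], buffer)
tsv_text = buffer.getvalue()
```

Writing to a file-like object rather than a path keeps the sketch easy to test; a real path can be supplied by passing an opened file instead of the buffer.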