sourced.ml.core.models

Package Contents

class sourced.ml.core.models.BOW[source]

Bases:modelforge.Model

Weighted bag-of-words model. Every word corresponds to an index and to a column of the matrix. A bag is a set of words from a repository, a file, or any other unit. A word is a source code identifier or a part of one. This model depends on sourced.ml.models.DocumentFrequencies.

NAME = bow
VENDOR = source{d}
DESCRIPTION = Model that contains source code as weighted bag of words.
LICENSE
matrix

Returns the bags as a sparse matrix. Rows are documents and columns are token weights.

documents

The list of documents in the model.

tokens

The list of tokens in the model.
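
As a rough illustration of the layout described above (documents as rows, tokens as columns, weights as values), the following sketch builds a tiny weighted bag of words in plain Python. It does not use sourced.ml: the helper name build_bow and the dictionary-of-keys storage are assumptions made for illustration, whereas the model itself takes a sparse.spmatrix.

```python
from typing import Dict, List, Tuple

SparseMatrix = Dict[Tuple[int, int], float]


def build_bow(bags: Dict[str, Dict[str, float]]) -> Tuple[List[str], List[str], SparseMatrix]:
    """Hypothetical helper: turn {document: {token: weight}} into (documents, tokens, matrix)."""
    documents = sorted(bags)
    tokens = sorted({tok for bag in bags.values() for tok in bag})
    # Every token corresponds to one column index.
    column = {tok: j for j, tok in enumerate(tokens)}
    matrix: SparseMatrix = {}
    for i, doc in enumerate(documents):
        for tok, weight in bags[doc].items():
            # Only non-zero weights are stored, keyed by (row, column).
            matrix[(i, column[tok])] = weight
    return documents, tokens, matrix


documents, tokens, matrix = build_bow({
    "repo_a": {"parse": 2.0, "token": 1.0},
    "repo_b": {"token": 3.0},
})
```

Here matrix[(1, 1)] is the weight of "token" in "repo_b"; absent keys stand for zero weights.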

construct(self, documents: List[str], tokens: List[str], matrix: sparse.spmatrix)
dump(self)
save(self, output: str, series: str, deps: Iterable = tuple(), create_missing_dirs: bool = True)
convert_bow_to_vw(self, output: str)
documents_index(self)
class sourced.ml.core.models.Cooccurrences[source]

Bases:modelforge.model.Model

Co-occurrence matrix.

NAME = co-occurrences
VENDOR = source{d}
DESCRIPTION = Model that contains the sparse co-occurrence matrix of source code identifiers.
LICENSE
tokens

Returns the tokens in the order that corresponds to the matrix’s rows and columns.

matrix

Returns the sparse co-occurrence matrix.
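
To make the relation between tokens and matrix concrete, here is a minimal sketch in plain Python. It is not the library's code: the function name cooccurrences, the rule that two identifiers co-occur when they share a bag, and the dictionary-of-keys storage are all assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def cooccurrences(bags: List[List[str]]) -> Tuple[List[str], Dict[Tuple[int, int], int]]:
    """Hypothetical helper: count how often two tokens appear in the same bag."""
    tokens = sorted({tok for bag in bags for tok in bag})
    # Row i and column i of the matrix both refer to tokens[i].
    index = {tok: i for i, tok in enumerate(tokens)}
    matrix: Dict[Tuple[int, int], int] = {}
    for bag in bags:
        for a, b in combinations(sorted(set(bag)), 2):
            i, j = index[a], index[b]
            # Keep the matrix symmetric by updating both (i, j) and (j, i).
            matrix[(i, j)] = matrix.get((i, j), 0) + 1
            matrix[(j, i)] = matrix.get((j, i), 0) + 1
    return tokens, matrix


tokens, matrix = cooccurrences([["foo", "bar"], ["foo", "bar", "baz"]])
```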

construct(self, tokens, matrix)
dump(self)
matrix_to_rdd(self, spark_context: 'pyspark.SparkContext')
class sourced.ml.core.models.DocumentFrequencies[source]

Bases:modelforge.Model

Document frequencies: the number of different repositories in which a source code identifier appeared. Each repository counts only once.

NAME = docfreq
VENDOR = source{d}
DESCRIPTION = Model that contains document frequencies of features extracted from code.
LICENSE
docs

Returns the number of documents.

construct(self, docs: int, tokfreqs: Union[Iterable[Dict[str, int]], Dict[str, int]])

Initializes this model.

Parameters:
  • docs – The number of documents.
  • tokfreqs – The dictionary of token -> frequency or the iterable collection of such dictionaries.
Returns:

self
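
The two accepted forms of tokfreqs can be illustrated with a small stand-alone sketch. The helper normalize_tokfreqs is hypothetical and not part of sourced.ml; it also assumes that, when several dictionaries are passed, they hold disjoint sets of tokens, so that a plain union is sufficient.

```python
from typing import Dict, Iterable, Union

TokFreqs = Union[Iterable[Dict[str, int]], Dict[str, int]]


def normalize_tokfreqs(tokfreqs: TokFreqs) -> Dict[str, int]:
    """Hypothetical helper: collapse either accepted input form into one dict."""
    if isinstance(tokfreqs, dict):
        return dict(tokfreqs)
    merged: Dict[str, int] = {}
    for part in tokfreqs:
        # Assumption: the parts hold disjoint tokens, so a plain update suffices.
        merged.update(part)
    return merged


single = normalize_tokfreqs({"foo": 3, "bar": 1})
combined = normalize_tokfreqs([{"foo": 3}, {"bar": 1}])
```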

dump(self)
prune(self, threshold: int)

Removes tokens which occur fewer than threshold times. The operation is not in-place: a new model is returned.

Parameters:
  • threshold – Minimum number of occurrences.
Returns:

The new model if the current one had to be changed, otherwise self.

greatest(self, max_size: int)

Truncates the model to the max_size most frequent tokens. The operation is not in-place: a new model is returned.

Parameters:
  • max_size – The maximum vocabulary size.
Returns:

The new model if the current one had to be changed, otherwise self.
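
The behaviour of prune and greatest can be sketched on a plain dictionary of token -> frequency. This is an illustrative approximation rather than the library code: the two free functions below are hypothetical stand-ins for the methods, and the alphabetical tie-break in greatest is an assumption.

```python
from typing import Dict


def prune(df: Dict[str, int], threshold: int) -> Dict[str, int]:
    """Keep tokens whose frequency is at least threshold."""
    kept = {tok: n for tok, n in df.items() if n >= threshold}
    # Return the very same object when nothing had to be removed.
    return df if len(kept) == len(df) else kept


def greatest(df: Dict[str, int], max_size: int) -> Dict[str, int]:
    """Keep at most max_size tokens with the highest frequencies."""
    if len(df) <= max_size:
        return df
    # Ties are broken by token name here; that tie-break is an assumption.
    top = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))[:max_size]
    return dict(top)


df = {"foo": 5, "bar": 1, "baz": 3}
pruned = prune(df, 2)    # drops "bar"
top1 = greatest(df, 1)   # keeps only "foo"
```

In both cases the input dictionary is left untouched, mirroring the "not in-place" contract described above.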

get(self, item, default=None)

Return the document frequency for a given token.

Parameters:
  • item – The token to query.
  • default – Returned value in case the token is missing.
Returns:

int or default

tokens(self)

Returns the list of tokens.

class sourced.ml.core.models.OrderedDocumentFrequencies[source]

Bases:sourced.ml.core.models.DocumentFrequencies

Compatible with the original DocumentFrequencies. This model maintains a deterministic sequence of the tokens.
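
One way to obtain a deterministic token sequence is to sort the tokens and remember each token's position. The sketch below shows that idea on a plain dictionary; the helper names are hypothetical, and sorting alphabetically is an assumption about how the order could be derived, not a statement about the actual implementation.

```python
from typing import Dict, List


def build_order(df: Dict[str, int]) -> Dict[str, int]:
    """Hypothetical helper: map every token to its position in sorted order."""
    return {tok: i for i, tok in enumerate(sorted(df))}


def ordered_tokens(order: Dict[str, int]) -> List[str]:
    """Return the tokens sorted by their stored position."""
    return sorted(order, key=order.get)


# The insertion order of the input does not affect the result.
order_a = build_order({"foo": 5, "bar": 1, "baz": 3})
order_b = build_order({"baz": 3, "foo": 5, "bar": 1})
```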

order
construct(self, docs: int, tokfreqs: Iterable[Dict[str, int]])
tokens(self)
prune(self, threshold: int)
greatest(self, max_size: int)
class sourced.ml.core.models.Id2Vec[source]

Bases:modelforge.Model

id2vec model - source code identifier embeddings.

NAME = id2vec
VENDOR = source{d}
DESCRIPTION = Model that contains information on source code as identifier embeddings.
LICENSE
embeddings

numpy.ndarray with the embeddings of shape (N tokens x embedding dims).

tokens

List with the processed source code identifiers.

construct(self, embeddings, tokens)
dump(self)
items(self)

Returns the (token, index) tuples of the token -> index mapping.
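
The token -> index mapping ties each identifier to a row of the embeddings. A minimal sketch with plain lists follows; the variable names and values are made up for illustration, and the real embeddings attribute is a numpy.ndarray rather than a list of lists.

```python
# Toy data: two identifiers with three-dimensional embeddings.
tokens = ["foo", "bar"]
embeddings = [
    [0.1, 0.2, 0.3],  # row 0 belongs to "foo"
    [0.4, 0.5, 0.6],  # row 1 belongs to "bar"
]

# The position of a token in the list is the row of its embedding.
index = {tok: i for i, tok in enumerate(tokens)}
items = list(index.items())

# Look up the vector of an identifier through the mapping.
vector = embeddings[index["bar"]]
```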

class sourced.ml.core.models.TensorFlowModel[source]

Bases:modelforge.Model

TensorFlow Protobuf model exported in the Modelforge format with GraphDef inside.

NAME = tensorflow-model
VENDOR = source{d}
DESCRIPTION = TensorFlow Protobuf model that contains a GraphDef instance.
LICENSE
graphdef

Returns the wrapped TensorFlow GraphDef.

construct(self, graphdef: 'tensorflow.GraphDef' = None, session: 'tensorflow.Session' = None, outputs: List[str] = None)
class sourced.ml.core.models.Topics[source]

Bases:modelforge.Model

NAME = topics
VENDOR = source{d}
DESCRIPTION = Model that is used to identify topics of source code repositories.
LICENSE
tokens
topics

May be None if no topics are labeled.

matrix

Rows: tokens. Columns: topics.

construct(self, tokens: list, topics: Union[list, None], matrix)
dump(self)
label_topics(self, labels)
class sourced.ml.core.models.QuantizationLevels[source]

Bases:modelforge.Model

This model contains quantization levels for multiple schemes (feature types). Every feature “class” (type, possible distinct value) corresponds to a numpy array with integer level borders. The size of each numpy array is (the number of levels + 1).
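
The relation between borders and levels can be sketched with the standard library. With n levels there are n + 1 borders, and a value is assigned to the level whose pair of borders encloses it. The function quantize is hypothetical; treating the intervals as half-open and clamping out-of-range values are assumptions, since the text above does not specify how the borders are applied.

```python
from bisect import bisect_right
from typing import Sequence


def quantize(value: int, borders: Sequence[int]) -> int:
    """Hypothetical helper: return the level index of value for sorted borders."""
    n_levels = len(borders) - 1
    # Level i covers borders[i] <= value < borders[i + 1] (assumed convention).
    level = bisect_right(borders, value) - 1
    # Clamp values that fall outside the outermost borders.
    return min(max(level, 0), n_levels - 1)


borders = [0, 2, 5, 10]  # four borders describe three levels
levels = [quantize(v, borders) for v in (0, 1, 2, 7, 10, 42)]
```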

NAME = quant
VENDOR = source{d}
DESCRIPTION = Model that contains quantization levels for multiple schemes (feature types).
LICENSE
levels
construct(self, levels: Dict[str, Dict[str, numpy.ndarray]])
dump(self)
apply_quantization(self, extractors)
class sourced.ml.core.models.MergeDocFreq(min_docfreq: int, vocabulary_size: int, ordered: bool = False, *args, **kwargs)[source]

Bases:sourced.ml.core.models.model_converters.base.Model2Base

Merges several DocumentFrequencies models together.
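
Conceptually, merging document frequencies means adding up the document counts and the per-token frequencies, then applying the min_docfreq and vocabulary_size limits. The sketch below works on plain (docs, frequencies) pairs instead of model files; the function merge_docfreq, the order of the two limits, and the tie-break are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def merge_docfreq(
    models: Iterable[Tuple[int, Dict[str, int]]],
    min_docfreq: int,
    vocabulary_size: int,
) -> Tuple[int, Dict[str, int]]:
    """Hypothetical helper: merge (docs, token -> frequency) pairs."""
    total_docs = 0
    merged: Counter = Counter()
    for docs, df in models:
        total_docs += docs
        merged.update(df)  # Counter.update adds the counts together
    # Drop rare tokens first, then keep the most frequent ones (assumed order).
    kept = {tok: n for tok, n in merged.items() if n >= min_docfreq}
    top = sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))[:vocabulary_size]
    return total_docs, dict(top)


docs, df = merge_docfreq(
    [(10, {"foo": 4, "bar": 1}), (5, {"foo": 2, "baz": 3})],
    min_docfreq=2,
    vocabulary_size=2,
)
```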

MODEL_FROM_CLASS
MODEL_TO_CLASS
convert_model(self, model: DocumentFrequencies)
finalize(self, index: int, destdir: str)
class sourced.ml.core.models.MergeBOW(features=None, *args, **kwargs)[source]

Bases:sourced.ml.core.models.model_converters.base.Model2Base

Merges several BOW models together.

MODEL_FROM_CLASS
MODEL_TO_CLASS
convert_model(self, model: BOW)
finalize(self, index: int, destdir: str)