`sourced.ml.core.models`¶

Subpackages¶

sourced.ml.core.models.model_converters

Submodules¶

Package Contents¶

class sourced.ml.core.models.BOW[source]¶

Bases:modelforge.Model

Weighted bag of words model. Every word corresponds to an index, which is also its column in the matrix. A bag is the set of words from a repository, a file or anything else. A word is a source code identifier or a part of one. This model depends on sourced.ml.models.DocumentFrequencies.

NAME = bow¶

VENDOR = source{d}¶

DESCRIPTION = Model that contains source code as weighted bag of words.¶

LICENSE¶

matrix¶: Returns the bags as a sparse matrix. Rows are documents, columns are tokens, and the values are token weights.

documents¶: The list of documents in the model.

tokens¶: The list of tokens in the model.

construct(self, documents: List[str], tokens: List[str], matrix: sparse.spmatrix)¶

dump(self)¶

save(self, output: str, series: str, deps: Iterable = tuple(), create_missing_dirs: bool = True)¶

convert_bow_to_vw(self, output: str)¶

documents_index(self)¶
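The layout described above (rows are documents, columns are tokens) can be sketched without the library. This is a minimal illustration, not the BOW implementation: `build_bow` is a hypothetical helper, the sparse matrix is stood in for by a plain dictionary keyed by (row, column), and raw counts are used as weights where the real model would presumably use weights derived from DocumentFrequencies.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_bow(
    bags: Dict[str, List[str]],
) -> Tuple[List[str], List[str], Dict[Tuple[int, int], float]]:
    """Build a toy bag of words: rows are documents, columns are tokens.

    Hypothetical helper for illustration only; it is not part of sourced.ml.
    """
    documents = sorted(bags)
    tokens = sorted({tok for words in bags.values() for tok in words})
    # Every token corresponds to an index, which is its column in the matrix.
    column = {tok: j for j, tok in enumerate(tokens)}
    matrix: Dict[Tuple[int, int], float] = {}
    for i, doc in enumerate(documents):
        for tok, count in Counter(bags[doc]).items():
            # Assumption: the weight is the raw count of the token in the bag.
            matrix[(i, column[tok])] = float(count)
    return documents, tokens, matrix


documents, tokens, matrix = build_bow({
    "repo_a": ["read", "file", "read"],
    "repo_b": ["file", "path"],
})
```

Only non-zero entries are stored, which mirrors why a sparse matrix is a natural fit: most documents contain a small fraction of the vocabulary.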

class sourced.ml.core.models.Cooccurrences[source]¶

Bases:modelforge.model.Model

Co-occurrence matrix.

NAME = co-occurrences¶

VENDOR = source{d}¶

DESCRIPTION = Model that contains the sparse co-occurrence matrix of source code identifiers.¶

LICENSE¶

tokens¶: Returns the tokens in the order that corresponds to the matrix’s rows and columns.

matrix¶: Returns the sparse co-occurrence matrix.

construct(self, tokens, matrix)¶

dump(self)¶

matrix_to_rdd(self, spark_context: 'pyspark.SparkContext')¶

class sourced.ml.core.models.DocumentFrequencies[source]¶

Bases:modelforge.Model

Document frequencies: the number of distinct repositories in which a source code identifier appears. Each repository counts only once.

NAME = docfreq¶

VENDOR = source{d}¶

DESCRIPTION = Model that contains document frequencies of features extracted from code.¶

LICENSE¶

docs¶: Returns the number of documents.

construct(self, docs: int, tokfreqs: Union[Iterable[Dict[str, int]], Dict[str, int]])¶

Initializes this model.

Parameters:	docs – The number of documents.
	tokfreqs – The dictionary of token -> frequency or the iterable collection of such dictionaries.
Returns:	self

dump(self)¶

prune(self, threshold: int)¶

Removes tokens which occur fewer than threshold times. The operation is not in-place: if any token is removed, a new model is returned.

Parameters:	threshold – Minimum number of occurrences.
Returns:	The new model if the current one had to be changed, otherwise self.

greatest(self, max_size: int)¶

Truncates the model to the max_size most frequent tokens. The operation is not in-place: if any token is dropped, a new model is returned.

Parameters:	max_size – The maximum vocabulary size.
Returns:	The new model if the current one had to be changed, otherwise self.

get(self, item, default=None)¶

Return the document frequency for a given token.

Parameters:	item – The token to query.
	default – Returned value in case the token is missing.
Returns:	int or default

tokens(self)¶: Returns the list of tokens.
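The contract of prune and greatest (not in-place, returning the original object when nothing changes) can be sketched on a plain token -> frequency dictionary. This is a hedged illustration rather than the library code: the two functions below are hypothetical stand-ins, and breaking frequency ties by token name is an assumption made here to keep the result reproducible.

```python
from typing import Dict


def prune(tokfreqs: Dict[str, int], threshold: int) -> Dict[str, int]:
    """Drop tokens that occur fewer than `threshold` times.

    Returns the same object when nothing had to be removed.
    """
    kept = {tok: freq for tok, freq in tokfreqs.items() if freq >= threshold}
    return tokfreqs if len(kept) == len(tokfreqs) else kept


def greatest(tokfreqs: Dict[str, int], max_size: int) -> Dict[str, int]:
    """Keep the `max_size` most frequent tokens.

    Returns the same object when it already fits.
    """
    if len(tokfreqs) <= max_size:
        return tokfreqs
    # Assumption: ties are broken alphabetically by token.
    ranked = sorted(tokfreqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:max_size])


freqs = {"i": 10, "tmp": 3, "foo": 1}
pruned = prune(freqs, 2)       # "foo" is removed, a new dict is returned
unchanged = prune(freqs, 1)    # nothing to remove, the same dict comes back
top_one = greatest(freqs, 1)   # only the most frequent token survives
```

Returning the original object in the unchanged case lets callers detect a no-op with an identity check and avoids an unnecessary copy.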

class sourced.ml.core.models.OrderedDocumentFrequencies[source]¶

Bases:sourced.ml.core.models.DocumentFrequencies

Compatible with the original DocumentFrequencies. This model maintains a deterministic sequence of the tokens.

order¶

construct(self, docs: int, tokfreqs: Iterable[Dict[str, int]])¶

tokens(self)¶

prune(self, threshold: int)¶

greatest(self, max_size: int)¶
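One way to obtain a deterministic token sequence is to fix a token -> position mapping independently of insertion order. The sketch below assumes lexicographic sorting, which this page does not confirm; `build_order` and `ordered_tokens` are hypothetical names used only for illustration.

```python
from typing import Dict, List


def build_order(tokfreqs: Dict[str, int]) -> Dict[str, int]:
    """Map every token to a position that does not depend on insertion order.

    Assumption: positions follow the lexicographic order of the tokens.
    """
    return {tok: i for i, tok in enumerate(sorted(tokfreqs))}


def ordered_tokens(order: Dict[str, int]) -> List[str]:
    """List the tokens by their assigned position."""
    return sorted(order, key=order.get)


# The same frequencies inserted in a different order yield the same sequence.
first = build_order({"tmp": 3, "i": 10, "foo": 1})
second = build_order({"foo": 1, "tmp": 3, "i": 10})
```

With a stable order, the position of a token can be used as a column index that stays the same across runs.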

class sourced.ml.core.models.Id2Vec[source]¶

Bases:modelforge.Model

id2vec model - source code identifier embeddings.

NAME = id2vec¶

VENDOR = source{d}¶

DESCRIPTION = Model that contains information on source code as identifier embeddings.¶

LICENSE¶

embeddings¶: numpy.ndarray with the embeddings of shape (N tokens x embedding dims).

tokens¶: List with the processed source code identifiers.

construct(self, embeddings, tokens)¶

dump(self)¶

items(self)¶: Returns the (token, index) tuples of the token -> index mapping.

class sourced.ml.core.models.TensorFlowModel[source]¶

Bases:modelforge.Model

TensorFlow Protobuf model exported in the Modelforge format with GraphDef inside.

NAME = tensorflow-model¶

VENDOR = source{d}¶

DESCRIPTION = TensorFlow Protobuf model that contains a GraphDef instance.¶

LICENSE¶

graphdef¶: Returns the wrapped TensorFlow GraphDef.

construct(self, graphdef: 'tensorflow.GraphDef' = None, session: 'tensorflow.Session' = None, outputs: List[str] = None)¶

class sourced.ml.core.models.Topics[source]¶

Bases:modelforge.Model

NAME = topics¶

VENDOR = source{d}¶

DESCRIPTION = Model that is used to identify topics of source code repositories.¶

LICENSE¶

tokens¶

topics¶: May be None if no topics are labeled.

matrix¶: Rows are tokens, columns are topics.

construct(self, tokens: list, topics: Union[list, None], matrix)¶

dump(self)¶

label_topics(self, labels)¶

class sourced.ml.core.models.QuantizationLevels[source]¶

Bases:modelforge.Model

This model contains quantization levels for multiple schemes (feature types). Every feature “class” (type, possible distinct value) corresponds to a numpy array with integer level borders. The size of each numpy array is (the number of levels + 1).

NAME = quant¶

VENDOR = source{d}¶

DESCRIPTION = Model that contains quantization levels for multiple schemes (feature types).¶

LICENSE¶

levels¶

construct(self, levels: Dict[str, Dict[str, numpy.ndarray]])¶

dump(self)¶

apply_quantization(self, extractors)¶
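The relation between level borders and levels (an array of size number of levels + 1) can be illustrated with the standard library. This is a sketch under stated assumptions, not the behavior of apply_quantization: `quantize` is a hypothetical helper, a plain list stands in for the numpy array, levels are treated as half-open intervals [border[k], border[k + 1]), and out-of-range values are clamped to the first or last level.

```python
from bisect import bisect_right
from typing import List


def quantize(value: int, borders: List[int]) -> int:
    """Map a value to a level index in the range 0 .. len(borders) - 2.

    Hypothetical helper; interval convention and clamping are assumptions.
    """
    n_levels = len(borders) - 1
    # Index of the last border that is <= value.
    idx = bisect_right(borders, value) - 1
    # Clamp values below the first border or at/above the last one.
    return min(max(idx, 0), n_levels - 1)


borders = [0, 2, 5, 10]  # 4 borders describe 3 levels
levels = [quantize(v, borders) for v in (0, 1, 2, 4, 5, 10, 100)]
```

Because the borders are sorted, a binary search finds the level in logarithmic time regardless of how many levels a scheme defines.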

class sourced.ml.core.models.MergeDocFreq(min_docfreq: int, vocabulary_size: int, ordered: bool = False, *args, **kwargs)[source]¶

Bases:sourced.ml.core.models.model_converters.base.Model2Base

Merges several DocumentFrequencies models together.

MODEL_FROM_CLASS¶

MODEL_TO_CLASS¶

convert_model(self, model: DocumentFrequencies)¶

finalize(self, index: int, destdir: str)¶
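How the constructor parameters min_docfreq and vocabulary_size might interact with merging can be sketched on plain data. This page does not describe the actual merging logic, so everything below is an assumption: `merge_docfreqs` is a hypothetical function, each model is represented as a (docs, tokfreqs) pair, document counts and frequencies are summed, and filtering happens after the sum.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def merge_docfreqs(
    models: Iterable[Tuple[int, Dict[str, int]]],
    min_docfreq: int,
    vocabulary_size: int,
) -> Tuple[int, Dict[str, int]]:
    """Sum several (docs, tokfreqs) pairs, then filter and truncate.

    Hypothetical sketch; not the MergeDocFreq implementation.
    """
    total_docs = 0
    merged: Counter = Counter()
    for docs, tokfreqs in models:
        total_docs += docs
        merged.update(tokfreqs)  # adds the frequencies token by token
    # Assumption: thresholds apply to the merged frequencies.
    kept = {tok: freq for tok, freq in merged.items() if freq >= min_docfreq}
    ranked = sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))
    return total_docs, dict(ranked[:vocabulary_size])


total_docs, merged_freqs = merge_docfreqs(
    [(10, {"i": 5, "foo": 1}), (20, {"i": 7, "bar": 3})],
    min_docfreq=2,
    vocabulary_size=10,
)
```

Filtering after the sum keeps tokens that are rare in each input but frequent overall, which is usually the intent when combining partial models.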

class sourced.ml.core.models.MergeBOW(features=None, *args, **kwargs)[source]¶

Bases:sourced.ml.core.models.model_converters.base.Model2Base

Merges several BOW models together.

MODEL_FROM_CLASS¶

MODEL_TO_CLASS¶

convert_model(self, model: BOW)¶

finalize(self, index: int, destdir: str)¶
