sourced.ml.core.models¶
Subpackages¶
Submodules¶
- sourced.ml.core.models.bow
- sourced.ml.core.models.coocc
- sourced.ml.core.models.df
- sourced.ml.core.models.id2vec
- sourced.ml.core.models.id_splitter
- sourced.ml.core.models.license
- sourced.ml.core.models.ordered_df
- sourced.ml.core.models.quant
- sourced.ml.core.models.tensorflow
- sourced.ml.core.models.topics
Package Contents¶
-
class
sourced.ml.core.models.BOW[source]¶ Bases:
modelforge.Model
Weighted bag of words model. Every word corresponds to an index and to its matrix column. A bag is a set of words from a repository, a file, or anything else. A word is a source code identifier or a part of one. This model depends on sourced.ml.models.DocumentFrequencies.
-
NAME= bow¶
-
VENDOR= source{d}¶
-
DESCRIPTION= Model that contains source code as weighted bag of words.¶
-
LICENSE¶
-
matrix¶ Returns the bags as a sparse matrix. Rows are documents, columns are tokens, and the values are the token weights.
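The row and column layout described above can be illustrated with a small standalone sketch. It does not use the BOW class: `build_bow` is a hypothetical helper, the rows are plain dictionaries rather than a SciPy sparse matrix, and the log-TF times IDF weighting is an assumption chosen for illustration, since this page does not state how the weights are computed.

```python
import math
from collections import Counter


def build_bow(bags, docfreq, n_docs):
    """Return (documents, tokens, rows); rows[i] maps a column index to a weight."""
    documents = sorted(bags)
    tokens = sorted({tok for bag in bags.values() for tok in bag})
    column = {tok: j for j, tok in enumerate(tokens)}
    rows = []
    for doc in documents:
        row = {}
        for tok, tf in Counter(bags[doc]).items():
            # Assumed weighting (not taken from the library): log TF times IDF.
            weight = math.log(1 + tf) * math.log(n_docs / docfreq[tok])
            if weight != 0.0:  # store only non-zero cells to keep the row sparse
                row[column[tok]] = weight
        rows.append(row)
    return documents, tokens, rows


bags = {"repo_a": ["read", "file", "read"], "repo_b": ["file", "write"]}
docfreq = {"read": 1, "file": 2, "write": 1}
documents, tokens, rows = build_bow(bags, docfreq, n_docs=2)
```

Here `documents[i]` names row `i` and `tokens[j]` names column `j`, which mirrors how the documents, tokens and matrix attributes relate to each other.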
-
documents¶ The list of documents in the model.
-
tokens¶ The list of tokens in the model.
-
construct(self, documents: List[str], tokens: List[str], matrix: sparse.spmatrix)¶
-
dump(self)¶
-
save(self, output: str, series: str, deps: Iterable = tuple(), create_missing_dirs: bool = True)¶
-
convert_bow_to_vw(self, output: str)¶
-
documents_index(self)¶
-
-
class
sourced.ml.core.models.Cooccurrences[source]¶ Bases:
modelforge.model.Model
Co-occurrence matrix.
-
NAME= co-occurrences¶
-
VENDOR= source{d}¶
-
DESCRIPTION= Model that contains the sparse co-occurrence matrix of source code identifiers.¶
-
LICENSE¶
-
tokens¶ Returns the tokens in the order that corresponds to the matrix’s rows and columns.
-
matrix¶ Returns the sparse co-occurrence matrix.
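As a rough illustration of how a token list and a co-occurrence matrix line up, the sketch below builds both from toy data. The `cooccurrences` helper is hypothetical, the matrix is a dictionary of non-zero cells rather than a SciPy sparse matrix, and treating a whole document as the co-occurrence context is an assumption; this page does not say which context the library uses.

```python
from itertools import combinations


def cooccurrences(documents):
    """Return (tokens, matrix); matrix maps (row, col) to a co-occurrence count."""
    tokens = sorted({tok for doc in documents for tok in doc})
    index = {tok: i for i, tok in enumerate(tokens)}
    matrix = {}
    for doc in documents:
        # Assumed context: two identifiers co-occur if they share a document.
        for a, b in combinations(sorted(set(doc)), 2):
            i, j = index[a], index[b]
            matrix[i, j] = matrix.get((i, j), 0) + 1
            matrix[j, i] = matrix.get((j, i), 0) + 1  # keep the matrix symmetric
    return tokens, matrix


tokens, matrix = cooccurrences([["open", "file", "path"], ["file", "path"]])
```

Row `i` and column `i` both refer to `tokens[i]`, so the same list indexes both axes.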
-
construct(self, tokens, matrix)¶
-
dump(self)¶
-
matrix_to_rdd(self, spark_context: 'pyspark.SparkContext')¶
-
-
class
sourced.ml.core.models.DocumentFrequencies[source]¶ Bases:
modelforge.Model
Document frequencies: the number of times a source code identifier appeared in different repositories. Each repository counts only once.
-
NAME= docfreq¶
-
VENDOR= source{d}¶
-
DESCRIPTION= Model that contains document frequencies of features extracted from code.¶
-
LICENSE¶
-
docs¶ Returns the number of documents.
-
construct(self, docs: int, tokfreqs: Union[Iterable[Dict[str, int]], Dict[str, int]])¶ Initializes this model.
Parameters: - docs – The number of documents.
- tokfreqs – The dictionary of token -> frequency or the iterable collection of such dictionaries.
Returns: self
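The two input shapes accepted for tokfreqs, and the rule that each repository counts only once, can be sketched without the library. Both helpers below are hypothetical; in particular, letting later dictionaries overwrite earlier ones on a key collision is an assumption, not documented behaviour.

```python
from collections import Counter


def document_frequencies(repositories):
    """Count in how many repositories each identifier appears."""
    docfreq = Counter()
    for identifiers in repositories:
        docfreq.update(set(identifiers))  # set(): a repository counts only once
    return len(repositories), dict(docfreq)


def merge_tokfreqs(tokfreqs):
    """Accept one token -> frequency dict or an iterable of such dicts."""
    if isinstance(tokfreqs, dict):
        return dict(tokfreqs)
    merged = {}
    for part in tokfreqs:
        merged.update(part)  # assumption: later parts win on collisions
    return merged


docs, tokfreqs = document_frequencies([["foo", "bar", "foo"], ["foo"]])
```

The resulting `docs` and `tokfreqs` have the shapes that the construct signature above expects.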
-
dump(self)¶
-
prune(self, threshold: int)¶ Removes tokens which occur fewer than threshold times. The operation is not in-place: a new model is returned.
Parameters: - threshold – Minimum number of occurrences.
Returns: The new model if the current one had to be changed, otherwise self.
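The "new model or self" contract can be sketched on a plain dictionary. The `prune` function below is a hypothetical stand-in for the method, not the library code.

```python
def prune(docfreq, threshold):
    """Drop tokens seen fewer than `threshold` times, without mutating the input."""
    kept = {tok: freq for tok, freq in docfreq.items() if freq >= threshold}
    # Return the original object when nothing had to be removed.
    return docfreq if len(kept) == len(docfreq) else kept


original = {"a": 3, "b": 1}
pruned = prune(original, 2)
untouched = prune(original, 1)
```

Callers can compare the result with the input by identity to find out whether anything was removed.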
-
greatest(self, max_size: int)¶ Truncates the model to the max_size most frequent tokens. The operation is not in-place: a new model is returned.
Parameters: - max_size – The maximum vocabulary size.
Returns: The new model if the current one had to be changed, otherwise self.
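A minimal sketch of the truncation on a plain dictionary follows. The `greatest` function is a hypothetical stand-in for the method, and breaking frequency ties by token name is an assumption made only to keep the example deterministic.

```python
import heapq


def greatest(docfreq, max_size):
    """Keep the `max_size` most frequent tokens, without mutating the input."""
    if max_size >= len(docfreq):
        return docfreq  # nothing to truncate, so the original is returned
    # Assumed tie-break: equal frequencies are ordered by token name.
    top = heapq.nlargest(max_size, docfreq.items(), key=lambda kv: (kv[1], kv[0]))
    return dict(top)


original = {"a": 5, "b": 2, "c": 9}
truncated = greatest(original, 2)
unchanged = greatest(original, 3)
```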
-
get(self, item, default=None)¶ Return the document frequency for a given token.
Parameters: - item – The token to query.
- default – Returned value in case the token is missing.
Returns: int or default
-
tokens(self)¶ Returns the list of tokens.
-
-
class
sourced.ml.core.models.OrderedDocumentFrequencies[source]¶ Bases:
sourced.ml.core.models.DocumentFrequencies
Compatible with the original DocumentFrequencies. This model maintains a deterministic sequence of the tokens.
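One way to obtain a deterministic token sequence is to sort the tokens and store each token's position. The sketch below assumes lexicographic sorting, which this page does not confirm, and `token_order` is a hypothetical helper.

```python
def token_order(docfreq):
    """Map each token to its position in the sorted token sequence."""
    return {tok: i for i, tok in enumerate(sorted(docfreq))}


# The same tokens inserted in a different order yield the same mapping.
first = token_order({"b": 1, "a": 2, "c": 7})
second = token_order({"c": 7, "a": 2, "b": 1})
```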
-
order¶
-
construct(self, docs: int, tokfreqs: Iterable[Dict[str, int]])¶
-
tokens(self)¶
-
prune(self, threshold: int)¶
-
greatest(self, max_size: int)¶
-
-
class
sourced.ml.core.models.Id2Vec[source]¶ Bases:
modelforge.Model
id2vec model - source code identifier embeddings.
-
NAME= id2vec¶
-
VENDOR= source{d}¶
-
DESCRIPTION= Model that contains information on source code as identifier embeddings.¶
-
LICENSE¶
-
embeddings¶ numpy.ndarray with the embeddings of shape (N tokens x embedding dims).
-
tokens¶ List with the processed source code identifiers.
-
construct(self, embeddings, tokens)¶
-
dump(self)¶
-
items(self)¶ Returns the tuples belonging to token -> index mapping.
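The token -> index mapping is what ties a token to its row in the embeddings array. The sketch below shows that lookup on toy data with plain lists instead of a numpy.ndarray; the vectors, the `cosine` and `similarity` helpers and the use of cosine similarity are all illustrative assumptions.

```python
import math

tokens = ["file", "open", "path"]
embeddings = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]  # shape: 3 tokens x 2 dims
index = {tok: i for i, tok in enumerate(tokens)}   # token -> row index


def cosine(a, b):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity(first, second):
    """Look up both tokens by index and compare their embedding rows."""
    return cosine(embeddings[index[first]], embeddings[index[second]])
```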
-
-
class
sourced.ml.core.models.TensorFlowModel[source]¶ Bases:
modelforge.Model
TensorFlow Protobuf model exported in the Modelforge format with GraphDef inside.
-
NAME= tensorflow-model¶
-
VENDOR= source{d}¶
-
DESCRIPTION= TensorFlow Protobuf model that contains a GraphDef instance.¶
-
LICENSE¶
-
graphdef¶ Returns the wrapped TensorFlow GraphDef.
-
construct(self, graphdef: 'tensorflow.GraphDef' = None, session: 'tensorflow.Session' = None, outputs: List[str] = None)¶
-
-
class
sourced.ml.core.models.Topics[source]¶ Bases:
modelforge.Model
-
NAME= topics¶
-
VENDOR= source{d}¶
-
DESCRIPTION= Model that is used to identify topics of source code repositories.¶
-
LICENSE¶
-
tokens¶
-
topics¶ May be None if no topics are labeled.
-
matrix¶ Rows: tokens. Columns: topics.
-
construct(self, tokens: list, topics: Union[list, None], matrix)¶
-
dump(self)¶
-
label_topics(self, labels)¶
-
-
class
sourced.ml.core.models.QuantizationLevels[source]¶ Bases:
modelforge.Model
This model contains quantization levels for multiple schemes (feature types). Every feature “class” (type, possible distinct value) corresponds to the numpy array with integer level borders. The size of each numpy array is (the number of levels + 1).
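To make the "number of levels + 1" relationship concrete, the sketch below maps a raw value to a level index using a sorted list of borders. It uses a plain list and the standard bisect module instead of numpy; the `quantize` helper, the half-open intervals and the clamping of out-of-range values are assumptions, not documented behaviour.

```python
from bisect import bisect_right


def quantize(value, borders):
    """Return the index of the level whose borders enclose `value`."""
    n_levels = len(borders) - 1           # borders hold one more entry than levels
    level = bisect_right(borders, value) - 1
    # Assumed policy: values outside the borders fall into the nearest level.
    return min(max(level, 0), n_levels - 1)


borders = [0, 2, 5, 10]                   # 4 borders describe 3 levels
levels = [quantize(v, borders) for v in (-1, 0, 2, 4, 7, 10)]
```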
-
NAME= quant¶
-
VENDOR= source{d}¶
-
DESCRIPTION= Model that contains quantization levels for multiple schemes (feature types).¶
-
LICENSE¶
-
levels¶
-
construct(self, levels: Dict[str, Dict[str, numpy.ndarray]])¶
-
dump(self)¶
-
apply_quantization(self, extractors)¶
-
-
class
sourced.ml.core.models.MergeDocFreq(min_docfreq: int, vocabulary_size: int, ordered: bool = False, *args, **kwargs)[source]¶ Bases:
sourced.ml.core.models.model_converters.base.Model2Base
Merges several DocumentFrequencies models together.
-
MODEL_FROM_CLASS¶
-
MODEL_TO_CLASS¶
-
convert_model(self, model: DocumentFrequencies)¶
-
finalize(self, index: int, destdir: str)¶
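The overall effect of merging several document frequency models can be sketched on plain data. The function below is a hypothetical illustration, not the converter itself: it assumes that document counts and token frequencies are summed across models, and that min_docfreq and vocabulary_size are then applied as a prune step followed by a truncation step.

```python
from collections import Counter


def merge_docfreq(models, min_docfreq, vocabulary_size):
    """Merge (docs, token -> frequency) pairs into one pair."""
    total_docs = 0
    merged = Counter()
    for docs, docfreq in models:
        total_docs += docs
        merged.update(docfreq)  # Counter.update adds the frequencies
    # Assumed post-processing: drop rare tokens, then keep the most frequent.
    kept = {tok: freq for tok, freq in merged.items() if freq >= min_docfreq}
    top = sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))[:vocabulary_size]
    return total_docs, dict(top)


models = [(2, {"a": 2, "b": 1}), (3, {"a": 1, "c": 3})]
docs, docfreq = merge_docfreq(models, min_docfreq=2, vocabulary_size=10)
```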
-