:mod:`sourced.ml.core.models`
=============================

.. py:module:: sourced.ml.core.models


Subpackages
-----------
.. toctree::
   :titlesonly:
   :maxdepth: 3

   model_converters/index.rst


Submodules
----------
.. toctree::
   :titlesonly:
   :maxdepth: 1

   bow/index.rst
   coocc/index.rst
   df/index.rst
   id2vec/index.rst
   id_splitter/index.rst
   license/index.rst
   ordered_df/index.rst
   quant/index.rst
   tensorflow/index.rst
   topics/index.rst


Package Contents
----------------

.. py:class:: BOW

   Bases: :class:`modelforge.Model`

   Weighted bag of words model. Every word corresponds to an index and to its matrix column.
   A bag is the set of words taken from a repository, a file or anything else.
   A word is a source code identifier or a part of one.
   This model depends on :class:`sourced.ml.models.DocumentFrequencies`.

   .. attribute:: NAME
      :annotation: = bow

   .. attribute:: VENDOR
      :annotation: = source{d}

   .. attribute:: DESCRIPTION
      :annotation: = Model that contains source code as weighted bag of words.

   .. attribute:: LICENSE

   .. attribute:: matrix

      Returns the bags as a sparse matrix. Rows are documents and columns are token weights.

   .. attribute:: documents

      The list of documents in the model.

   .. attribute:: tokens

      The list of tokens in the model.

   .. method:: construct(self, documents: List[str], tokens: List[str], matrix: sparse.spmatrix)

   .. method:: dump(self)

   .. method:: save(self, output: str, series: str, deps: Iterable = tuple(), create_missing_dirs: bool = True)

   .. method:: convert_bow_to_vw(self, output: str)

   .. method:: documents_index(self)


.. py:class:: Cooccurrences

   Bases: :class:`modelforge.model.Model`

   Co-occurrence matrix.

   .. attribute:: NAME
      :annotation: = co-occurrences

   .. attribute:: VENDOR
      :annotation: = source{d}

   .. attribute:: DESCRIPTION
      :annotation: = Model that contains the sparse co-occurrence matrix of source code identifiers.

   .. attribute:: LICENSE

   .. attribute:: tokens

      Returns the tokens in the order that corresponds to the matrix's rows and columns.

   .. attribute:: matrix

      Returns the sparse co-occurrence matrix.

   .. method:: construct(self, tokens, matrix)

   .. method:: dump(self)

   .. method:: matrix_to_rdd(self, spark_context: 'pyspark.SparkContext')


.. py:class:: DocumentFrequencies

   Bases: :class:`modelforge.Model`

   Document frequencies: the number of times a source code identifier appeared
   in different repositories. Each repository counts only once.

   .. attribute:: NAME
      :annotation: = docfreq

   .. attribute:: VENDOR
      :annotation: = source{d}

   .. attribute:: DESCRIPTION
      :annotation: = Model that contains document frequencies of features extracted from code.

   .. attribute:: LICENSE

   .. attribute:: docs

      Returns the number of documents.

   .. method:: construct(self, docs: int, tokfreqs: Union[Iterable[Dict[str, int]], Dict[str, int]])

      Initializes this model.

      :param docs: The number of documents.
      :param tokfreqs: The dictionary of token -> frequency or the iterable collection of such dictionaries.
      :return: self

   .. method:: dump(self)

   .. method:: prune(self, threshold: int)

      Removes tokens which occur fewer than `threshold` times.
      The operation is *not* in-place: a new model is returned.

      :param threshold: Minimum number of occurrences.
      :return: The new model if the current one had to be changed, otherwise self.

   .. method:: greatest(self, max_size: int)

      Truncates the model to the `max_size` most frequent tokens.
      The operation is *not* in-place: a new model is returned.

      :param max_size: The maximum vocabulary size.
      :return: The new model if the current one had to be changed, otherwise self.

   .. method:: get(self, item, default=None)

      Returns the document frequency for a given token.

      :param item: The token to query.
      :param default: Returned value in case the token is missing.
      :return: int or `default`

   .. method:: tokens(self)

      Returns the list of tokens.


.. py:class:: OrderedDocumentFrequencies

   Bases: :class:`sourced.ml.core.models.DocumentFrequencies`

   Compatible with the original DocumentFrequencies. This model maintains the deterministic
   sequence of the tokens.

   .. attribute:: order

   .. method:: construct(self, docs: int, tokfreqs: Iterable[Dict[str, int]])

   .. method:: tokens(self)

   .. method:: prune(self, threshold: int)

   .. method:: greatest(self, max_size: int)


.. py:class:: Id2Vec

   Bases: :class:`modelforge.Model`

   id2vec model: source code identifier embeddings.

   .. attribute:: NAME
      :annotation: = id2vec

   .. attribute:: VENDOR
      :annotation: = source{d}

   .. attribute:: DESCRIPTION
      :annotation: = Model that contains information on source code as identifier embeddings.

   .. attribute:: LICENSE

   .. attribute:: embeddings

      :class:`numpy.ndarray` with the embeddings of shape (N tokens x embedding dims).

   .. attribute:: tokens

      List with the processed source code identifiers.

   .. method:: construct(self, embeddings, tokens)

   .. method:: dump(self)

   .. method:: items(self)

      Returns the tuples belonging to the token -> index mapping.


.. py:class:: TensorFlowModel

   Bases: :class:`modelforge.Model`

   TensorFlow Protobuf model exported in the Modelforge format with GraphDef inside.

   .. attribute:: NAME
      :annotation: = tensorflow-model

   .. attribute:: VENDOR
      :annotation: = source{d}

   .. attribute:: DESCRIPTION
      :annotation: = TensorFlow Protobuf model that contains a GraphDef instance.

   .. attribute:: LICENSE

   .. attribute:: graphdef

      Returns the wrapped TensorFlow GraphDef.

   .. method:: construct(self, graphdef: 'tensorflow.GraphDef' = None, session: 'tensorflow.Session' = None, outputs: List[str] = None)


.. py:class:: Topics

   Bases: :class:`modelforge.Model`

   .. attribute:: NAME
      :annotation: = topics

   .. attribute:: VENDOR
      :annotation: = source{d}

   .. attribute:: DESCRIPTION
      :annotation: = Model that is used to identify topics of source code repositories.

   .. attribute:: LICENSE

   .. attribute:: tokens

   .. attribute:: topics

      May be None if no topics are labeled.

   .. attribute:: matrix

      Rows are tokens, columns are topics.

   .. method:: construct(self, tokens: list, topics: Union[list, None], matrix)

   .. method:: dump(self)

   .. method:: label_topics(self, labels)


.. py:class:: QuantizationLevels

   Bases: :class:`modelforge.Model`

   This model contains quantization levels for multiple schemes (feature types).
   Every feature "class" (type, possible distinct value) corresponds to a numpy array
   with integer level borders. The size of each numpy array is (the number of levels + 1).

   .. attribute:: NAME
      :annotation: = quant

   .. attribute:: VENDOR
      :annotation: = source{d}

   .. attribute:: DESCRIPTION
      :annotation: = Model that contains quantization levels for multiple schemes (feature types).

   .. attribute:: LICENSE

   .. attribute:: levels

   .. method:: construct(self, levels: Dict[str, Dict[str, numpy.ndarray]])

   .. method:: dump(self)

   .. method:: apply_quantization(self, extractors)


.. py:class:: MergeDocFreq(min_docfreq: int, vocabulary_size: int, ordered: bool = False, *args, **kwargs)

   Bases: :class:`sourced.ml.core.models.model_converters.base.Model2Base`

   Merges several :class:`DocumentFrequencies` models together.

   .. attribute:: MODEL_FROM_CLASS

   .. attribute:: MODEL_TO_CLASS

   .. method:: convert_model(self, model: DocumentFrequencies)

   .. method:: finalize(self, index: int, destdir: str)


.. py:class:: MergeBOW(features=None, *args, **kwargs)

   Bases: :class:`sourced.ml.core.models.model_converters.base.Model2Base`

   Merges several :class:`BOW` models together.

   .. attribute:: MODEL_FROM_CLASS

   .. attribute:: MODEL_TO_CLASS

   .. method:: convert_model(self, model: BOW)

   .. method:: finalize(self, index: int, destdir: str)
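
The "not in-place" contract documented for ``DocumentFrequencies.prune`` and ``DocumentFrequencies.greatest`` can be sketched with a toy class. ``ToyDocFreq`` below is a hypothetical stand-in written only for illustration; it is not the ``sourced.ml`` implementation. The inclusive threshold and the alphabetical tie-breaking rule are assumptions.

```python
from typing import Dict, List, Optional


class ToyDocFreq:
    """Hypothetical stand-in for a document-frequency model (not the sourced.ml class)."""

    def __init__(self, docs: int, tokfreqs: Dict[str, int]):
        self.docs = docs
        self._df = dict(tokfreqs)

    def prune(self, threshold: int) -> "ToyDocFreq":
        # Keep tokens seen at least `threshold` times (inclusive bound is an assumption).
        kept = {tok: freq for tok, freq in self._df.items() if freq >= threshold}
        if len(kept) == len(self._df):
            return self  # nothing was removed, so the same object is returned
        return ToyDocFreq(self.docs, kept)

    def greatest(self, max_size: int) -> "ToyDocFreq":
        if len(self._df) <= max_size:
            return self  # already small enough
        # Most frequent first; ties broken alphabetically (an assumption).
        top = sorted(self._df.items(), key=lambda kv: (-kv[1], kv[0]))[:max_size]
        return ToyDocFreq(self.docs, dict(top))

    def get(self, item: str, default: Optional[int] = None) -> Optional[int]:
        return self._df.get(item, default)

    def tokens(self) -> List[str]:
        return list(self._df)


df = ToyDocFreq(docs=10, tokfreqs={"foo": 7, "bar": 2, "baz": 1})
pruned = df.prune(2)    # drops "baz", returns a new object
top1 = df.greatest(1)   # keeps only "foo", returns a new object
same = df.prune(1)      # nothing to drop, returns df itself
```

In this sketch the original object is never mutated, so callers can keep using ``df`` after pruning and can test ``result is df`` to see whether anything changed.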
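
The ``QuantizationLevels`` description says each array of level borders has (number of levels + 1) entries. A minimal sketch of that relationship follows, using plain lists and the standard library instead of numpy. The ``quantize`` helper and its clamping behaviour are assumptions made for illustration and are not taken from ``apply_quantization``.

```python
from bisect import bisect_right
from typing import List


def quantize(value: int, borders: List[int]) -> int:
    """Map a value to a level index, given sorted borders of length (levels + 1)."""
    n_levels = len(borders) - 1
    # bisect_right counts the borders <= value; subtracting 1 gives the bin index.
    level = bisect_right(borders, value) - 1
    # Clamp out-of-range values into the first or last level (an assumption).
    return min(max(level, 0), n_levels - 1)


borders = [0, 2, 5, 10]  # 4 borders describe 3 levels: [0, 2), [2, 5), [5, 10]
levels = [quantize(v, borders) for v in (0, 1, 2, 7, 10, 42)]
```

Here four borders delimit three levels, which matches the "levels + 1" sizing stated in the class description.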