:mod:`sourced.ml.core.models`
=============================

.. py:module:: sourced.ml.core.models


Subpackages
-----------
.. toctree::
   :titlesonly:
   :maxdepth: 3

   model_converters/index.rst
   

Submodules
----------
.. toctree::
   :titlesonly:
   :maxdepth: 1

   bow/index.rst
   
   coocc/index.rst
   
   df/index.rst
   
   id2vec/index.rst
   
   id_splitter/index.rst
   
   license/index.rst
   
   ordered_df/index.rst
   
   quant/index.rst
   
   tensorflow/index.rst
   
   topics/index.rst
   

Package Contents
----------------


.. py:class:: BOW

   Bases: :class:`modelforge.Model`

   
   Weighted bag of words model. Every word corresponds to an index and to a matrix column.
   A bag is a set of words from a repository, a file, or anything else.
   A word is a source code identifier or a part of one.
   This model depends on :class:`sourced.ml.core.models.DocumentFrequencies`.
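
   As a rough illustration of the layout described above, the following sketch builds a
   small weighted bag of words in plain Python. It does not use the ``BOW`` class: the
   helper ``build_bags`` and the sample data are hypothetical, and a dense list of lists
   stands in for the sparse matrix that the model stores.

   .. code-block:: python

      from typing import Dict, List, Tuple

      Bags = Dict[str, Dict[str, float]]


      def build_bags(docs: Bags) -> Tuple[List[str], List[str], List[List[float]]]:
          """Turn {document: {token: weight}} into (documents, tokens, matrix)."""
          documents = sorted(docs)
          tokens = sorted({tok for bag in docs.values() for tok in bag})
          column = {tok: i for i, tok in enumerate(tokens)}  # token -> column index
          matrix = [[0.0] * len(tokens) for _ in documents]
          for row, name in enumerate(documents):
              for tok, weight in docs[name].items():
                  matrix[row][column[tok]] = weight
          return documents, tokens, matrix


      # Hypothetical input: two "repositories" with weighted identifiers.
      documents, tokens, matrix = build_bags({
          "repo_a": {"read": 2.0, "file": 1.0},
          "repo_b": {"file": 3.0, "size": 0.5},
      })

   Each row of ``matrix`` is one document and each column is one token, so
   ``matrix[0][tokens.index("read")]`` is the weight of ``read`` in ``repo_a``.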


   .. attribute:: NAME
      :annotation: = bow 

      
   .. attribute:: VENDOR
      :annotation: = source{d} 

      
   .. attribute:: DESCRIPTION
      :annotation: = Model that contains source code as weighted bag of words. 

      
   .. attribute:: LICENSE
      

   .. attribute:: matrix
      

      Returns the bags as a sparse matrix. Rows are documents, columns are tokens, and the values are weights.


   .. attribute:: documents
      

      The list of documents in the model.


   .. attribute:: tokens
      

      The list of tokens in the model.


   .. method:: construct(self, documents: List[str], tokens: List[str], matrix: sparse.spmatrix)

      
   .. method:: dump(self)

      
   .. method:: save(self, output: str, series: str, deps: Iterable = tuple(), create_missing_dirs: bool = True)

      
   .. method:: convert_bow_to_vw(self, output: str)

      
   .. method:: documents_index(self)

      
.. py:class:: Cooccurrences

   Bases: :class:`modelforge.model.Model`

   
   Co-occurrence matrix.
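
   To make the idea concrete, here is a minimal sketch of counting co-occurrences in plain
   Python. It is not the extraction procedure used by this package: the helper
   ``count_cooccurrences`` is hypothetical, and it assumes that two identifiers co-occur
   whenever they appear in the same group (for example, the same function).

   .. code-block:: python

      from collections import Counter
      from itertools import combinations
      from typing import Dict, Iterable, List, Tuple

      Cells = Dict[Tuple[int, int], int]


      def count_cooccurrences(groups: Iterable[List[str]]) -> Tuple[List[str], Cells]:
          """Return (tokens, {(row, col): count}) as a symmetric sparse mapping."""
          unique_groups = [sorted(set(group)) for group in groups]
          tokens = sorted({tok for group in unique_groups for tok in group})
          index = {tok: i for i, tok in enumerate(tokens)}
          counts: Counter = Counter()
          for group in unique_groups:
              for a, b in combinations(group, 2):
                  i, j = index[a], index[b]
                  counts[(i, j)] += 1
                  counts[(j, i)] += 1  # mirror the cell to keep the matrix symmetric
          return tokens, dict(counts)


      tokens, cells = count_cooccurrences([["i", "size", "buf"], ["buf", "size"]])

   The same ``tokens`` list indexes both the rows and the columns, which mirrors the
   ``tokens`` attribute documented below.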


   .. attribute:: NAME
      :annotation: = co-occurrences 

      
   .. attribute:: VENDOR
      :annotation: = source{d} 

      
   .. attribute:: DESCRIPTION
      :annotation: = Model that contains the sparse co-occurrence matrix of source code identifiers. 

      
   .. attribute:: LICENSE
      

   .. attribute:: tokens
      

      Returns the tokens in the order that corresponds to the matrix's rows and columns.


   .. attribute:: matrix
      

      Returns the sparse co-occurrence matrix.


   .. method:: construct(self, tokens, matrix)

      
   .. method:: dump(self)

      
   .. method:: matrix_to_rdd(self, spark_context: 'pyspark.SparkContext')

      
.. py:class:: DocumentFrequencies

   Bases: :class:`modelforge.Model`

   
   Document frequencies: the number of times a source code identifier appeared
   in different repositories. Each repository counts only once.


   .. attribute:: NAME
      :annotation: = docfreq 

      
   .. attribute:: VENDOR
      :annotation: = source{d} 

      
   .. attribute:: DESCRIPTION
      :annotation: = Model that contains document frequencies of features extracted from code. 

      
   .. attribute:: LICENSE
      

   .. attribute:: docs
      

      Returns the number of documents.


   .. method:: construct(self, docs: int, tokfreqs: Union[Iterable[Dict[str, int]], Dict[str, int]])

      
      Initializes this model.

      :param docs: The number of documents.
      :param tokfreqs: The dictionary of token -> frequency or the iterable collection of such
          dictionaries.
      :return: self
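
      The two accepted shapes of ``tokfreqs`` can be pictured with the sketch below. The
      helper ``normalize_tokfreqs`` is hypothetical and is not part of this package. It
      assumes that the dictionaries in an iterable have disjoint token sets, because the
      text above does not say how collisions are resolved.

      .. code-block:: python

         from typing import Dict, Iterable, Union

         TokFreqs = Union[Iterable[Dict[str, int]], Dict[str, int]]


         def normalize_tokfreqs(tokfreqs: TokFreqs) -> Dict[str, int]:
             """Collapse either accepted input shape into one token -> frequency dict."""
             if isinstance(tokfreqs, dict):
                 return dict(tokfreqs)
             merged: Dict[str, int] = {}
             for part in tokfreqs:
                 overlap = merged.keys() & part.keys()
                 if overlap:
                     # Assumption of this sketch: collisions are treated as an error.
                     raise ValueError("duplicate tokens: %s" % sorted(overlap))
                 merged.update(part)
             return merged


         single = normalize_tokfreqs({"file": 7, "size": 3})
         chunked = normalize_tokfreqs([{"file": 7}, {"size": 3}])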

      
   .. method:: dump(self)

      
   .. method:: prune(self, threshold: int)

      
      Removes tokens which occur fewer than `threshold` times.
      The operation is *not* in-place: a new model is returned.

      :param threshold: Minimum number of occurrences.
      :return: The new model if the current one had to be changed, otherwise self.
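
      The return contract can be sketched with a small stand-in class. ``TinyDocFreq`` is
      hypothetical and only mimics the behavior described above; it is not the
      implementation of :class:`DocumentFrequencies`.

      .. code-block:: python

         from typing import Dict


         class TinyDocFreq:
             """Hypothetical stand-in holding a document count and token frequencies."""

             def __init__(self, docs: int, df: Dict[str, int]):
                 self.docs = docs
                 self.df = df

             def prune(self, threshold: int) -> "TinyDocFreq":
                 kept = {tok: n for tok, n in self.df.items() if n >= threshold}
                 if len(kept) == len(self.df):
                     return self  # nothing was removed, so no copy is needed
                 return TinyDocFreq(self.docs, kept)


         model = TinyDocFreq(10, {"file": 7, "tmp": 1, "size": 3})
         pruned = model.prune(2)     # a new object without "tmp"
         unchanged = model.prune(1)  # the same object, since nothing is dropped

      Because the original object is left untouched, callers should use the returned value
      rather than rely on side effects.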

      
   .. method:: greatest(self, max_size: int)

      
      Truncates the model to the `max_size` most frequent tokens.
      The operation is *not* in-place: a new model is returned.

      :param max_size: The maximum vocabulary size.
      :return: The new model if the current one had to be changed, otherwise self.
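
      A minimal sketch of this kind of truncation on a plain dictionary follows. The
      function ``keep_greatest`` is hypothetical, and breaking ties alphabetically is an
      assumption of the sketch, not documented behavior of this method.

      .. code-block:: python

         from typing import Dict


         def keep_greatest(df: Dict[str, int], max_size: int) -> Dict[str, int]:
             """Keep the max_size most frequent tokens; return df itself if it fits."""
             if len(df) <= max_size:
                 return df
             # Sort by descending frequency, then by token to break ties deterministically.
             ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
             return dict(ranked[:max_size])


         df = {"file": 7, "tmp": 1, "size": 3, "buf": 3}
         top2 = keep_greatest(df, 2)
         same = keep_greatest(df, 10)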

      
   .. method:: get(self, item, default=None)

      
      Return the document frequency for a given token.

      :param item: The token to query.
      :param default: Returned value in case the token is missing.
      :return: int or `default`

      
   .. method:: tokens(self)

      
      Returns the list of tokens.

      
.. py:class:: OrderedDocumentFrequencies

   Bases: :class:`sourced.ml.core.models.DocumentFrequencies`

   
   Compatible with the original DocumentFrequencies. This model maintains a deterministic
   sequence of the tokens.
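
   One simple way to obtain a deterministic sequence is to sort the tokens, as in the
   sketch below. Whether this class orders tokens in exactly this way is an assumption;
   the helper ``deterministic_order`` is hypothetical.

   .. code-block:: python

      from typing import Dict


      def deterministic_order(df: Dict[str, int]) -> Dict[str, int]:
          """Map every token to its position in the lexicographically sorted list."""
          return {tok: i for i, tok in enumerate(sorted(df))}


      # The insertion order of the input does not affect the result.
      first = deterministic_order({"size": 3, "file": 7})
      second = deterministic_order({"file": 7, "size": 3})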


   .. attribute:: order
      

   .. method:: construct(self, docs: int, tokfreqs: Iterable[Dict[str, int]])

      
   .. method:: tokens(self)

      
   .. method:: prune(self, threshold: int)

      
   .. method:: greatest(self, max_size: int)

      
.. py:class:: Id2Vec

   Bases: :class:`modelforge.Model`

   
   id2vec model - source code identifier embeddings.


   .. attribute:: NAME
      :annotation: = id2vec 

      
   .. attribute:: VENDOR
      :annotation: = source{d} 

      
   .. attribute:: DESCRIPTION
      :annotation: = Model that contains information on source code as identifier embeddings. 

      
   .. attribute:: LICENSE
      

   .. attribute:: embeddings
      

      :class:`numpy.ndarray` with the embeddings of shape
      (N tokens x embedding dims).


   .. attribute:: tokens
      

      List with the processed source code identifiers.


   .. method:: construct(self, embeddings, tokens)

      
   .. method:: dump(self)

      
   .. method:: items(self)

      
      Returns the tuples of the token -> index mapping.
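
      The relationship between tokens, indexes, and embedding rows can be sketched without
      the model itself. The data below is made up, and the sketch assumes that the token
      at position ``i`` owns row ``i`` of the embeddings.

      .. code-block:: python

         from typing import List

         # Hypothetical data: three identifiers with two-dimensional embeddings.
         tokens = ["file", "size", "buf"]
         embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

         # token -> index mapping; its items are (token, index) tuples.
         token2index = {tok: i for i, tok in enumerate(tokens)}
         pairs = list(token2index.items())


         def lookup(token: str) -> List[float]:
             """Return the embedding row that belongs to the given token."""
             return embeddings[token2index[token]]


         vector = lookup("size")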

      
.. py:class:: TensorFlowModel

   Bases: :class:`modelforge.Model`

   
   TensorFlow Protobuf model exported in the Modelforge format with GraphDef inside.


   .. attribute:: NAME
      :annotation: = tensorflow-model 

      
   .. attribute:: VENDOR
      :annotation: = source{d} 

      
   .. attribute:: DESCRIPTION
      :annotation: = TensorFlow Protobuf model that contains a GraphDef instance. 

      
   .. attribute:: LICENSE
      

   .. attribute:: graphdef
      

      Returns the wrapped TensorFlow GraphDef.


   .. method:: construct(self, graphdef: 'tensorflow.GraphDef' = None, session: 'tensorflow.Session' = None, outputs: List[str] = None)

      
.. py:class:: Topics

   Bases: :class:`modelforge.Model`

   
   .. attribute:: NAME
      :annotation: = topics 

      
   .. attribute:: VENDOR
      :annotation: = source{d} 

      
   .. attribute:: DESCRIPTION
      :annotation: = Model that is used to identify topics of source code repositories. 

      
   .. attribute:: LICENSE
      

   .. attribute:: tokens
      

   .. attribute:: topics
      

      May be None if no topics are labeled.


   .. attribute:: matrix
      

      | Rows: tokens
      | Columns: topics


   .. method:: construct(self, tokens: list, topics: Union[list, None], matrix)

      
   .. method:: dump(self)

      
   .. method:: label_topics(self, labels)

      
.. py:class:: QuantizationLevels

   Bases: :class:`modelforge.Model`

   
   This model contains quantization levels for multiple schemes (feature types).
   Every feature "class" (type, possible distinct value) corresponds to a numpy array
   with integer level borders. The size of each numpy array is (the number of levels + 1).
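
   To show how an array of ``levels + 1`` borders defines ``levels`` buckets, the sketch
   below maps a value to its level with the standard ``bisect`` module. The function
   ``quantize`` is hypothetical, and clamping out-of-range values is an assumption; it is
   not necessarily what :meth:`apply_quantization` does.

   .. code-block:: python

      from bisect import bisect_right
      from typing import Sequence


      def quantize(value: int, borders: Sequence[int]) -> int:
          """Return the level i with borders[i] <= value < borders[i + 1].

          Values below the first border or at/above the last one are clamped
          to the first or last level respectively.
          """
          n_levels = len(borders) - 1
          level = bisect_right(borders, value) - 1
          return min(max(level, 0), n_levels - 1)


      borders = [0, 2, 5, 10]  # four borders describe three levels
      levels = [quantize(v, borders) for v in (0, 1, 2, 4, 5, 9, 10, 100, -1)]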


   .. attribute:: NAME
      :annotation: = quant 

      
   .. attribute:: VENDOR
      :annotation: = source{d} 

      
   .. attribute:: DESCRIPTION
      :annotation: = Model that contains quantization levels for multiple schemes (feature types). 

      
   .. attribute:: LICENSE
      

   .. attribute:: levels
      

   .. method:: construct(self, levels: Dict[str, Dict[str, numpy.ndarray]])

      
   .. method:: dump(self)

      
   .. method:: apply_quantization(self, extractors)

      
.. py:class:: MergeDocFreq(min_docfreq: int, vocabulary_size: int, ordered: bool = False, *args, **kwargs)

   Bases: :class:`sourced.ml.core.models.model_converters.base.Model2Base`

   
   Merges several :class:`DocumentFrequencies` models together.
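
   The sketch below shows one plausible reading of such a merge on plain data. It is not
   this converter's implementation: the function ``merge_docfreqs`` is hypothetical, and it
   assumes that the inputs cover disjoint sets of documents (so counts can be summed), that
   ``min_docfreq`` drops rare tokens, and that ``vocabulary_size`` caps the result.

   .. code-block:: python

      from collections import Counter
      from typing import Dict, Iterable, Tuple

      DocFreq = Tuple[int, Dict[str, int]]  # (number of documents, token -> frequency)


      def merge_docfreqs(models: Iterable[DocFreq], min_docfreq: int,
                         vocabulary_size: int) -> DocFreq:
          """Sum document counts and frequencies, then filter and truncate."""
          total_docs = 0
          merged: Counter = Counter()
          for docs, df in models:
              total_docs += docs
              merged.update(df)  # Counter.update adds the counts together
          kept = [(tok, n) for tok, n in merged.items() if n >= min_docfreq]
          kept.sort(key=lambda item: (-item[1], item[0]))  # most frequent first
          return total_docs, dict(kept[:vocabulary_size])


      parts = [(3, {"file": 2, "tmp": 1}), (2, {"file": 1, "size": 2})]
      docs, df = merge_docfreqs(parts, min_docfreq=2, vocabulary_size=10)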


   .. attribute:: MODEL_FROM_CLASS
      

   .. attribute:: MODEL_TO_CLASS
      

   .. method:: convert_model(self, model: DocumentFrequencies)

      
   .. method:: finalize(self, index: int, destdir: str)

      
.. py:class:: MergeBOW(features=None, *args, **kwargs)

   Bases: :class:`sourced.ml.core.models.model_converters.base.Model2Base`

   
   Merges several :class:`BOW` models together.


   .. attribute:: MODEL_FROM_CLASS
      

   .. attribute:: MODEL_TO_CLASS
      

   .. method:: convert_model(self, model: BOW)

      
   .. method:: finalize(self, index: int, destdir: str)