:mod:`sourced.ml.core.models.df`
================================

.. py:module:: sourced.ml.core.models.df


Module Contents
---------------

.. py:class:: DocumentFrequencies

   Bases: :class:`modelforge.Model`

   Document frequencies: the number of different repositories in which each
   source code identifier appeared. Each repository counts only once.

   .. attribute:: NAME
      :annotation: = docfreq

   .. attribute:: VENDOR
      :annotation: = source{d}

   .. attribute:: DESCRIPTION
      :annotation: = Model that contains document frequencies of features extracted from code.

   .. attribute:: LICENSE

   .. attribute:: docs

      Returns the number of documents.

   .. method:: construct(self, docs: int, tokfreqs: Union[Iterable[Dict[str, int]], Dict[str, int]])

      Initializes this model.

      :param docs: The number of documents.
      :param tokfreqs: The dictionary of token -> frequency, or an iterable collection of such dictionaries.
      :return: self

   .. method:: dump(self)

   .. method:: prune(self, threshold: int)

      Removes tokens which occur fewer than `threshold` times. The operation is *not* in-place: a new model is returned.

      :param threshold: Minimum number of occurrences.
      :return: The new model if the current one had to be changed, otherwise self.

   .. method:: greatest(self, max_size: int)

      Truncates the model to the `max_size` most frequent tokens. The operation is *not* in-place: a new model is returned.

      :param max_size: The maximum vocabulary size.
      :return: The new model if the current one had to be changed, otherwise self.

   .. method:: get(self, item, default=None)

      Returns the document frequency of a given token.

      :param item: The token to query.
      :param default: The value returned if the token is missing.
      :return: int or `default`

   .. method:: tokens(self)

      Returns the list of tokens.
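
The filtering semantics documented for ``prune`` and ``greatest`` can be illustrated with a minimal plain-Python sketch. It does not import or reproduce ``DocumentFrequencies``: the functions below are hypothetical stand-ins that operate on an ordinary ``dict`` of token -> frequency. They assume that a token is kept when its frequency is at least ``threshold``, and that ties in ``greatest`` are broken alphabetically. The tie-breaking rule is an assumption, since the documentation does not specify one.

```python
from typing import Dict


def prune(df: Dict[str, int], threshold: int) -> Dict[str, int]:
    """Hypothetical stand-in: drop tokens seen fewer than `threshold` times."""
    kept = {tok: freq for tok, freq in df.items() if freq >= threshold}
    # Mirror the documented contract: return the original object when
    # nothing had to be removed, otherwise a new mapping.
    return df if len(kept) == len(df) else kept


def greatest(df: Dict[str, int], max_size: int) -> Dict[str, int]:
    """Hypothetical stand-in: keep only the `max_size` most frequent tokens."""
    if len(df) <= max_size:
        return df
    # Sort by descending frequency; alphabetical tie-breaking is an assumption.
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:max_size])


# Toy data: token -> number of repositories containing it (made up).
df = {"i": 50, "index": 12, "tmp": 3, "foo_bar": 1}

pruned = prune(df, 3)      # {"i": 50, "index": 12, "tmp": 3}
top2 = greatest(df, 2)     # {"i": 50, "index": 12}
unchanged = prune(df, 1)   # same object as df, nothing was removed

print(pruned)
print(top2)
print(unchanged is df)     # True
print(df.get("missing", 0))  # 0, analogous to get(item, default)
```

Neither function mutates its input, which matches the "not in-place" behaviour described for both methods: callers should use the returned value rather than rely on the original being modified.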