sourced.ml.core.models.df

Module Contents

class sourced.ml.core.models.df.DocumentFrequencies[source]

Bases: modelforge.Model

Document frequencies: the number of different repositories in which a source code identifier appears. Each repository counts only once per identifier, no matter how many times the identifier occurs inside it.
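The counting rule can be illustrated with a minimal standalone sketch. It does not use sourced.ml; the function name document_frequencies and the sample repositories are hypothetical and exist only to show that repeated occurrences inside one repository contribute a single count.

```python
from collections import Counter
from typing import Dict, Iterable


def document_frequencies(repos: Dict[str, Iterable[str]]) -> Counter:
    """Count in how many repositories each identifier appears (hypothetical helper)."""
    df = Counter()
    for identifiers in repos.values():
        # set() collapses duplicates, so one repository adds at most 1 per identifier
        df.update(set(identifiers))
    return df


# Hypothetical input: repository name -> identifiers extracted from its code
repos = {
    "repo_a": ["foo", "bar", "foo"],
    "repo_b": ["foo", "baz"],
}
df = document_frequencies(repos)
print(len(repos))          # number of documents: 2
print(sorted(df.items()))  # [('bar', 1), ('baz', 1), ('foo', 2)]
```

Here "foo" occurs twice in repo_a but still contributes only one count from that repository.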

NAME = docfreq[source]
VENDOR = source{d}[source]
DESCRIPTION = Model that contains document frequencies of features extracted from code.[source]
LICENSE[source]
docs[source]

Returns the number of documents.

construct(self, docs: int, tokfreqs: Union[Iterable[Dict[str, int]], Dict[str, int]])[source]

Initializes this model.

Parameters:
  • docs – The number of documents.
  • tokfreqs – The dictionary of token -> frequency or the iterable collection of such dictionaries.
Returns:

self

dump(self)[source]
prune(self, threshold: int)[source]

Removes tokens that occur fewer than threshold times. The operation is not in-place: if any token is removed, a new model is returned.

Parameters:
  • threshold – Minimum number of occurrences.
Returns:

The new model if the current one had to be changed, otherwise self.
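The documented contract of prune can be sketched on a plain dictionary. This is an illustration under assumptions, not the library's implementation: prune_sketch is a hypothetical name, and the token -> frequency mapping stands in for the model.

```python
from typing import Dict


def prune_sketch(df: Dict[str, int], threshold: int) -> Dict[str, int]:
    """Keep only the tokens whose frequency is at least `threshold` (hypothetical helper)."""
    kept = {token: freq for token, freq in df.items() if freq >= threshold}
    # Mirror the documented behavior: return the original object when nothing was removed
    return df if len(kept) == len(df) else kept


df = {"foo": 5, "bar": 1}
pruned = prune_sketch(df, 2)
print(pruned)                    # {'foo': 5}
print(df)                        # the input is untouched: {'foo': 5, 'bar': 1}
print(prune_sketch(df, 1) is df) # True, because no token falls below the threshold
```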

greatest(self, max_size: int)[source]

Truncates the model to the max_size most frequent tokens. The operation is not in-place: if any token is dropped, a new model is returned.

Parameters:
  • max_size – The maximum vocabulary size.
Returns:

The new model if the current one had to be changed, otherwise self.
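A comparable sketch for greatest, again on a plain dictionary rather than the real model. The name greatest_sketch is hypothetical, and the way ties between equally frequent tokens are resolved here is an assumption, since this page does not specify it.

```python
import heapq
from typing import Dict


def greatest_sketch(df: Dict[str, int], max_size: int) -> Dict[str, int]:
    """Keep the `max_size` most frequent tokens (hypothetical helper)."""
    if len(df) <= max_size:
        # Nothing to truncate: return the original object, as the documentation describes
        return df
    # Tie-breaking between equal frequencies is an assumption of this sketch
    top = heapq.nlargest(max_size, df.items(), key=lambda item: item[1])
    return dict(top)


df = {"foo": 5, "bar": 3, "baz": 1}
print(greatest_sketch(df, 2))        # {'foo': 5, 'bar': 3}
print(greatest_sketch(df, 10) is df) # True, the vocabulary already fits
```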

get(self, item, default=None)[source]

Returns the document frequency of the given token.

Parameters:
  • item – The token to query.
  • default – Returned value in case the token is missing.
Returns:

int or default
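The signature follows the same pattern as the built-in dict.get, so the lookup semantics can be sketched with a plain dictionary standing in for the model. The variable names are hypothetical, and treating a missing token as frequency 0 is only one possible choice of default.

```python
# A plain dict stands in for the model's token -> frequency mapping (assumption)
frequencies = {"foo": 5, "bar": 1}

known = frequencies.get("foo")            # token present: its frequency
missing = frequencies.get("qux")          # token absent, no default given: None
fallback = frequencies.get("qux", 0)      # token absent, explicit default

print(known, missing, fallback)  # 5 None 0
```

Passing an explicit default avoids a separate None check when the caller wants to treat unseen tokens as having a fixed frequency.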

tokens(self)[source]

Returns the list of tokens.