ML-Core documentation¶
sourced.ml.core
¶
MLonCode research playground.
Subpackages¶
sourced.ml.core.algorithms
¶
Subpackages¶
sourced.ml.core.algorithms.id_splitter
¶
sourced.ml.core.algorithms.id_splitter.features
¶
-
sourced.ml.core.algorithms.id_splitter.features.
read_identifiers
(csv_path: str, use_header: bool, max_identifier_len: int, identifier_col: int, split_identifier_col: int, shuffle: bool = True)[source]¶ Reads identifiers from the CSV file and filters out those that are too long.
Parameters: - csv_path – path to the CSV file.
- use_header – treat the first line as a normal data row (True) or skip it as a header line with column names (False).
- max_identifier_len – maximum length of raw identifiers. Skip identifiers that are longer.
- identifier_col – index of the column in the CSV file with the raw identifier.
- split_identifier_col – index of the column in the CSV file with the split identifier, in lowercase.
- shuffle – indicates whether to reorder the list of identifiers at random after reading it.
Returns: list of split identifiers.
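Below is a minimal standalone sketch of the documented behavior, written against the contract above rather than copied from the library; the stdlib csv module stands in for whatever reader the real implementation uses.

    import csv
    import random

    def read_identifiers(csv_path, use_header, max_identifier_len,
                         identifier_col, split_identifier_col, shuffle=True):
        identifiers = []
        with open(csv_path, newline="") as fin:
            reader = csv.reader(fin)
            if not use_header:
                next(reader)  # skip the header line with column names
            for row in reader:
                # filter out raw identifiers that are too long
                if len(row[identifier_col]) <= max_identifier_len:
                    identifiers.append(row[split_identifier_col])
        if shuffle:
            random.shuffle(identifiers)
        return identifiers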
-
sourced.ml.core.algorithms.id_splitter.features.
prepare_features
(csv_path: str, use_header: bool, max_identifier_len: int, identifier_col: int, split_identifier_col: int, test_ratio: float, padding: str, shuffle: bool = True)[source]¶ Prepares the features for the identifier splitting task.
Parameters: - csv_path – path to the CSV file.
- use_header – treat the first line as a normal data row (True) or skip it as a header line with column names (False).
- max_identifier_len – maximum length of raw identifiers. Skip identifiers that are longer.
- identifier_col – index of the column in the CSV file with the raw identifier.
- split_identifier_col – index of the column in the CSV file with the split identifier.
- shuffle – indicates whether to reorder the list of identifiers at random after reading it.
- test_ratio – proportion of test samples used for evaluation.
- padding – position where to add padding values: after the input sequence if “post”, before it if “pre”.
Returns: training and testing features to train the neural net for the splitting task.
sourced.ml.core.algorithms.id_splitter.nn_model
¶
-
sourced.ml.core.algorithms.id_splitter.nn_model.
register_metric
(metric: Union[str, Callable])[source]¶ Decorator that registers a metric in the METRICS constant.
Parameters: metric – name of the TensorFlow metric or a custom metric function. Returns: the metric.
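A hedged sketch of how such a registration decorator can work; the shape of the METRICS container and the registration details are assumptions, not the library code.

    METRICS = []  # assumed module-level registry

    def register_metric(metric):
        """Register a TensorFlow metric name or a custom metric function."""
        assert isinstance(metric, str) or callable(metric)
        METRICS.append(metric)
        return metric  # returned unchanged, so it can be used as a decorator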
-
sourced.ml.core.algorithms.id_splitter.nn_model.
prepare_devices
(devices: str)[source]¶ Extracts the devices from the arguments.
Parameters: devices – the devices to use, passed as one string argument. Returns: the list of devices.
-
sourced.ml.core.algorithms.id_splitter.nn_model.
prepare_input_emb
(maxlen: int)[source]¶ Builds character embeddings, a dense representation of characters to feed the RNN with.
Parameters: maxlen – maximum length of the input sequence. Returns: input and one-hot character embedding layer.
-
sourced.ml.core.algorithms.id_splitter.nn_model.
add_output_layer
(hidden_layer: tf.Tensor)[source]¶ Applies a Dense layer to each timestep of the hidden layer, independently. The output layer has one sigmoid per character, which predicts whether there is a space before that character.
Parameters: hidden_layer – hidden layer before the output layer. Returns: output layer.
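A minimal Keras sketch of such a layer (assuming the TF1-era keras API): a Dense layer with one sigmoid unit, wrapped in TimeDistributed so that it is applied to every timestep independently.

    from keras.layers import Dense, TimeDistributed

    def add_output_layer(hidden_layer):
        # one sigmoid per timestep/character, predicting a split point
        return TimeDistributed(Dense(1, activation="sigmoid"))(hidden_layer)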
-
sourced.ml.core.algorithms.id_splitter.nn_model.
add_rnn
(X: tf.Tensor, units: int, rnn_layer: str, dev0: str = '/gpu:0', dev1: str = '/gpu:1')[source]¶ Adds a bidirectional RNN layer with the specified parameters.
Parameters: - X – input layer.
- units – number of neurons in the output layer.
- rnn_layer – type of cell in the RNN.
- dev0 – device that will run the forward pass of the RNN and the concatenation.
- dev1 – device that will run the backward pass.
Returns: output bidirectional RNN layer.
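A hedged sketch of the device placement described above, assuming Keras layers on a TF1 backend. Note that keras.layers.Bidirectional also reverses the backward output in time, which this simplified version glosses over.

    import tensorflow as tf
    from keras.layers import LSTM, concatenate

    def add_rnn(X, units, dev0="/gpu:0", dev1="/gpu:1"):
        with tf.device(dev1):
            # backward pass of the RNN
            backward = LSTM(units, return_sequences=True, go_backwards=True)(X)
        with tf.device(dev0):
            # forward pass and the concatenation of both directions
            forward = LSTM(units, return_sequences=True)(X)
            return concatenate([forward, backward])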
-
sourced.ml.core.algorithms.id_splitter.nn_model.
build_rnn
(maxlen: int, units: int, stack: int, optimizer: str, dev0: str, dev1: str, rnn_layer: str)[source]¶ Builds an RNN model with the parameters specified as arguments.
Parameters: - maxlen – maximum length of the input sequence.
- units – number of neurons (the dimensionality of the RNN output).
- stack – number of RNN layers to stack.
- optimizer – algorithm to use as an optimizer for the RNN.
- rnn_layer – recurrent layer type to use.
- dev0 – first device to use when running specific operations.
- dev1 – second device to use when running specific operations.
Returns: compiled RNN model.
-
sourced.ml.core.algorithms.id_splitter.nn_model.
add_conv
(X: tf.Tensor, filters: List[int], kernel_sizes: List[int], output_n_filters: int)[source]¶ Builds a single convolutional layer.
Parameters: - X – input layer.
- filters – number of output filters in the convolution.
- kernel_sizes – list of lengths of the 1D convolution window.
- output_n_filters – number of 1D output filters.
Returns: output layer.
-
sourced.ml.core.algorithms.id_splitter.nn_model.
build_cnn
(maxlen: int, filters: List[int], output_n_filters: int, stack: int, kernel_sizes: List[int], optimizer: str, device: str)[source]¶ Builds a CNN model with the parameters specified as arguments.
Parameters: - maxlen – maximum length of the input sequence.
- filters – number of output filters in the convolution.
- output_n_filters – number of 1D output filters.
- stack – number of CNN layers to stack.
- kernel_sizes – list of lengths of the 1D convolution window.
- optimizer – algorithm to use as an optimizer for the CNN.
- device – device to use when running specific operations.
Returns: compiled CNN model.
-
sourced.ml.core.algorithms.id_splitter.nn_model.
precision
(y_true: tf.Tensor, y_pred: tf.Tensor)[source]¶ Computes the precision, a metric for multi-label classification of how many selected items are relevant.
Parameters: - y_true – tensor of true labels.
- y_pred – tensor of predicted labels.
Returns: the batch-wise average of precision, as a tensor.
-
sourced.ml.core.algorithms.id_splitter.nn_model.
recall
(y_true: tf.Tensor, y_pred: tf.Tensor)[source]¶ Computes the recall, a metric for multi-label classification of how many relevant items are selected.
Parameters: - y_true – tensor of true labels.
- y_pred – tensor of predicted labels.
Returns: the batch-wise average of recall, as a tensor.
-
sourced.ml.core.algorithms.id_splitter.nn_model.
f1score
(y_true: tf.Tensor, y_pred: tf.Tensor)[source]¶ Computes the F1 score, the harmonic average of precision and recall.
Parameters: - y_true – tensor of true labels.
- y_pred – tensor of predicted labels.
Returns: the batch-wise average of the F1 score, as a tensor.
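All three metrics can be sketched with Keras backend ops in the classic pre-tf.keras style; this illustrates the math and is not necessarily the exact library code.

    from keras import backend as K

    def precision(y_true, y_pred):
        tp = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        return tp / (predicted_positives + K.epsilon())

    def recall(y_true, y_pred):
        tp = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        return tp / (possible_positives + K.epsilon())

    def f1score(y_true, y_pred):
        p, r = precision(y_true, y_pred), recall(y_true, y_pred)
        return 2 * p * r / (p + r + K.epsilon())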
sourced.ml.core.algorithms.id_splitter.pipeline
¶
-
sourced.ml.core.algorithms.id_splitter.pipeline.
set_random_seed
(seed: int)[source]¶ Fixes a random seed for reproducibility.
Parameters: seed – seed value.
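A minimal sketch, assuming the usual TF1-era RNGs are the ones being seeded:

    import random
    import numpy
    import tensorflow as tf

    def set_random_seed(seed):
        random.seed(seed)
        numpy.random.seed(seed)
        tf.set_random_seed(seed)  # TF1 API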
-
sourced.ml.core.algorithms.id_splitter.pipeline.
binarize
(matrix: numpy.array, threshold: float, inplace: bool = True)[source]¶ Helper function to binarize a matrix.
Parameters: - matrix – matrix as a numpy.array.
- threshold – if value >= threshold then the value will be 1, else 0.
- inplace – whether to modify the matrix in place.
Returns: the binarized matrix.
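A straightforward numpy sketch of the described behavior:

    import numpy

    def binarize(matrix, threshold, inplace=True):
        if not inplace:
            matrix = matrix.copy()
        mask = matrix >= threshold
        matrix[mask] = 1
        matrix[~mask] = 0
        return matrix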
-
sourced.ml.core.algorithms.id_splitter.pipeline.
str2ints
(params: str)[source]¶ Converts a string with integer parameters to a list of integers.
Parameters: params – string that contains integer parameters separated by commas. Returns: list of integers.
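Equivalent to a one-line split-and-convert:

    def str2ints(params):
        return [int(p) for p in params.split(",")]

    assert str2ints("64,32,16") == [64, 32, 16]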
-
sourced.ml.core.algorithms.id_splitter.pipeline.
precision_np
(y_true: numpy.array, y_pred: numpy.array, epsilon: float = EPSILON)[source]¶ Computes precision, a metric for multi-label classification of how many selected items are relevant.
Parameters: - y_true – ground truth labels - expect binary values.
- y_pred – predicted labels - expect binary values.
- epsilon – added to the denominator to avoid any division by zero.
Returns: precision metric.
-
sourced.ml.core.algorithms.id_splitter.pipeline.
recall_np
(y_true: numpy.array, y_pred: numpy.array, epsilon: float = EPSILON)[source]¶ Computes recall, a metric for multi-label classification of how many relevant items are selected.
Parameters: - y_true – matrix with ground truth labels - expect binary values.
- y_pred – matrix with predicted labels - expect binary values.
- epsilon – added to the denominator to avoid any division by zero.
Returns: recall metric.
-
sourced.ml.core.algorithms.id_splitter.pipeline.
report
(model: keras.engine.training.Model, X: numpy.array, y: numpy.array, batch_size: int, threshold: float = DEFAULT_THRESHOLD, epsilon: float = EPSILON)[source]¶ Prints a metric report of the model on the data X & y. The metrics printed are precision, recall, and F1 score.
Parameters: - model – model considered.
- X – features.
- y – labels (expected binary labels).
- batch_size – batch size that will be used for prediction.
- threshold – threshold to binarize the predictions.
- epsilon – added to the denominator to avoid any division by zero.
-
sourced.ml.core.algorithms.id_splitter.pipeline.
config_keras
()[source]¶ Initializes the Keras backend session.
-
sourced.ml.core.algorithms.id_splitter.pipeline.
build_train_generator
(X: numpy.array, y: numpy.array, batch_size: int = 500)[source]¶ Builds the generator that yields features and their labels.
Parameters: - X – features.
- y – binary labels.
- batch_size – higher values better utilize GPUs.
Returns: generator of features and their labels.
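A hedged sketch of such an endless batch generator; the reshuffling on every pass is an assumption.

    import numpy

    def build_train_generator(X, y, batch_size=500):
        assert X.shape[0] == y.shape[0]
        while True:  # Keras fit_generator expects an endless generator
            order = numpy.random.permutation(X.shape[0])
            for start in range(0, X.shape[0], batch_size):
                batch = order[start:start + batch_size]
                yield X[batch], y[batch]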
-
sourced.ml.core.algorithms.id_splitter.pipeline.
build_schedule
(lr: float, final_lr: float, n_epochs: int)[source]¶ Builds the schedule by which the learning rate decreases; the decrease is linear.
Parameters: - lr – initial learning rate.
- final_lr – final learning rate.
- n_epochs – number of training epochs.
Returns: the schedule of the learning rate.
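A minimal sketch of a linear schedule with these parameters:

    def build_schedule(lr, final_lr, n_epochs):
        delta = (lr - final_lr) / n_epochs

        def schedule(epoch):
            # 0-based epoch index: the rate drops by `delta` every epoch
            return lr - delta * epoch

        return schedule

    schedule = build_schedule(0.01, 0.001, 10)
    assert abs(schedule(0) - 0.01) < 1e-9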
-
sourced.ml.core.algorithms.id_splitter.pipeline.
make_lr_scheduler
(lr: float, final_lr: float, n_epochs: int, verbose: int = 1)[source]¶ Prepares the scheduler to decrease the learning rate while training.
Parameters: - lr – initial learning rate.
- final_lr – final learning rate.
- n_epochs – number of training epochs.
- verbose – level of verbosity.
Returns: LearningRateScheduler with linear schedule of the learning rate.
-
sourced.ml.core.algorithms.id_splitter.pipeline.
prepare_callbacks
(output_dir: str)[source]¶ Prepares the logging, TensorBoard, and model checkpoint callbacks and stores the outputs in output_dir.
Parameters: output_dir – path to the results. Returns: list of callbacks.
-
sourced.ml.core.algorithms.id_splitter.pipeline.
create_generator_params
(batch_size: int, samples_per_epoch: int, n_samples: int, epochs: int)[source]¶ Helper function to split a huge dataset into smaller ones to enable more frequent reports.
Parameters: - batch_size – batch size.
- samples_per_epoch – number of samples per mini-epoch or before each report.
- n_samples – total number of samples.
- epochs – number of epochs over the full dataset.
Returns: the number of steps per epoch (to be used with the generator) and the number of sub-epochs, during each of which only samples_per_epoch samples are generated.
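A sketch of the arithmetic, under the assumption that a sub-epoch covers samples_per_epoch samples:

    def create_generator_params(batch_size, samples_per_epoch, n_samples, epochs):
        steps_per_epoch = samples_per_epoch // batch_size
        n_sub_epochs = epochs * (n_samples // samples_per_epoch)
        return steps_per_epoch, n_sub_epochs

    # e.g. 1M samples, a report every 100k samples, 10 full passes:
    # 200 steps per sub-epoch and 100 sub-epochs overall
    assert create_generator_params(500, 100000, 1000000, 10) == (200, 100)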
Submodules¶
sourced.ml.core.algorithms.swivel
¶
Submatrix-wise Vector Embedding Learner.
Implementation of the Swivel algorithm described at http://arxiv.org/abs/1602.02215
This program expects an input directory that contains the following files:
row_vocab.txt, col_vocab.txt
The row and column vocabulary files. Each file should contain one token per line; these will be used to generate a tab-separated file containing the trained embeddings.
row_sums.txt, col_sums.txt
The matrix row and column marginal sums. Each file should contain one decimal floating point number per line which corresponds to the marginal count of the matrix for that row or column.
shards.recs
A file containing the sub-matrix shards, stored as TFRecords. Each shard is expected to be a serialized tf.Example protocol buffer with the following properties:
global_row: the global row indices contained in the shard
global_col: the global column indices contained in the shard
sparse_local_row, sparse_local_col, sparse_value: three parallel arrays that are a sparse representation of the submatrix counts.
It will generate embeddings, training from the input directory for the specified number of epochs. When complete, it will output the trained vectors to a tab-separated file that contains one line per embedding. Row and column embeddings are stored in separate files.
-
sourced.ml.core.algorithms.swivel.
embeddings_with_init
(vocab_size, embedding_dim, name)[source]¶ Creates and initializes the embedding tensors.
-
sourced.ml.core.algorithms.swivel.
count_matrix_input
(filenames, submatrix_rows, submatrix_cols)[source]¶ Reads submatrix shards from disk.
-
sourced.ml.core.algorithms.swivel.
read_marginals_file
(filename)[source]¶ Reads text file with one number per line to an array.
-
sourced.ml.core.algorithms.swivel.
write_embedding_tensor_to_disk
(vocab_path, output_path, sess, embedding)[source]¶ Writes the tensor to output_path as a TSV file.
-
sourced.ml.core.algorithms.swivel.
write_embeddings_to_disk
(config, model, sess)[source]¶ Writes row and column embeddings to disk.
sourced.ml.core.algorithms.token_parser
¶
-
class
sourced.ml.core.algorithms.token_parser.
TokenStyle
[source]¶ Bases:
enum.Enum
Metadata that allows reconstructing the initial identifier from a list of tokens.
-
class
sourced.ml.core.algorithms.token_parser.
TokenParser
(stem_threshold=STEM_THRESHOLD, max_token_length=MAX_TOKEN_LENGTH, min_split_length=MIN_SPLIT_LENGTH, single_shot=False, save_token_style=False, attach_upper=True, use_nn=False, nn_model=None)[source]¶ Common utilities for splitting and stemming tokens.
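A hedged usage sketch; the exact output depends on the stemming and splitting options.

    from sourced.ml.core.algorithms.token_parser import TokenParser

    # a high stem_threshold effectively disables stemming for short tokens
    parser = TokenParser(stem_threshold=100, single_shot=True)
    print(list(parser.split("FooBarBaz")))  # roughly: ["foo", "bar", "baz"]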
sourced.ml.core.algorithms.uast_id_distance
¶
-
class
sourced.ml.core.algorithms.uast_id_distance.
Uast2IdDistance
(token2index=None, token_parser=None, max_distance=DEFAULT_MAX_DISTANCE)[source]¶ Bases:
sourced.ml.core.algorithms.uast_ids_to_bag.UastIds2Bag
Converts a UAST to a list of identifier pairs and the UAST distance between them. The distance metric must be defined in the inheritors.
__call__ is overridden here and returns a list instead of a bag-of-words (dist).
-
class
sourced.ml.core.algorithms.uast_id_distance.
Uast2IdTreeDistance
[source]¶ Bases:
sourced.ml.core.algorithms.uast_id_distance.Uast2IdDistance
Converts a UAST to a list of identifier pairs and the UAST tree distance between them.
__call__ is overridden here and returns a list instead of a bag-of-words (dist).
-
class
sourced.ml.core.algorithms.uast_id_distance.
Uast2IdLineDistance
[source]¶ Bases:
sourced.ml.core.algorithms.uast_id_distance.Uast2IdDistance
Converts a UAST to a list of identifier pairs and the code line distance between them, where applicable.
__call__ is overridden here and returns a list instead of a bag-of-words (dist).
sourced.ml.core.algorithms.uast_ids_to_bag
¶
-
class
sourced.ml.core.algorithms.uast_ids_to_bag.
UastTokens2Bag
(token2index=None, token_parser=None)[source]¶ Bases:
sourced.ml.core.algorithms.uast_to_bag.Uast2BagBase
Converts a UAST to a weighted bag of tokens via XPath.
-
class
sourced.ml.core.algorithms.uast_ids_to_bag.
UastIds2Bag
(token2index=None, token_parser=None)[source]¶ Bases:
sourced.ml.core.algorithms.uast_ids_to_bag.UastTokens2Bag
Converts a UAST to a bag-of-identifiers.
sourced.ml.core.algorithms.uast_inttypes_to_graphlets
¶
-
class
sourced.ml.core.algorithms.uast_inttypes_to_graphlets.
Uast2GraphletBag
[source]¶ Bases:
sourced.ml.core.algorithms.uast_ids_to_bag.Uast2BagBase
Converts a UAST to a bag of graphlets. The graphlet of a UAST node is composed of the node itself, its parent, and its children. Each node is represented by the internal role string.
-
uast2graphlets
(self, uast)[source]¶ Parameters: uast – the UAST root node. Generates: the nodes which compose the UAST. The Node class is used to access the nodes of the graphlets.
-
node2key
(self, node)[source]¶ Builds the string joining the internal types of all the nodes in the node’s graphlet in the following order: parent_node_child1_child2_child3. The children are sorted in alphabetical order. The str format is required by BagsExtractor.
Parameters: node – a node of the UAST. Returns: the string key of the node.
-
sourced.ml.core.algorithms.uast_inttypes_to_nodes
¶
-
class
sourced.ml.core.algorithms.uast_inttypes_to_nodes.
Uast2QuantizedChildren
(npartitions: int = 20)[source]¶ Bases:
sourced.ml.core.algorithms.uast_to_bag.Uast2BagThroughSingleScan
Converts a UAST to a bag of children counts.
-
node2key
(self, node: bblfsh.Node)[source]¶ Return the key for a given Node.
Parameters: node – a node in UAST. Returns: The string which consists of the internal type of the node and its number of children.
-
quantize_unwrapped
(self, children_freq: Iterable[Tuple[int, int]])[source]¶ Builds the quantization partition P, a vector of length npartitions whose entries are in strictly ascending order. The quantization of x is defined as:
    0 if x <= P[0]
    m if P[m-1] < x <= P[m]
    n if P[n] <= x
Parameters: children_freq – the distribution of the number of children. Returns: the array with quantization levels.
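A worked illustration of this definition with a hypothetical partition P:

    import numpy

    P = numpy.array([1, 2, 4, 8])  # hypothetical partition, strictly ascending

    def quantize(x, P):
        # index of the first border >= x, clamped to the last level
        return min(int(numpy.searchsorted(P, x)), len(P) - 1)

    for x in (0, 1, 3, 8, 100):
        print(x, "->", quantize(x, P))
    # 0 -> 0, 1 -> 0, 3 -> 2, 8 -> 3, 100 -> 3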
-
sourced.ml.core.algorithms.uast_struct_to_bag
¶
-
class
sourced.ml.core.algorithms.uast_struct_to_bag.
Uast2StructBagBase
(stride, seq_len, node2index=None)[source]¶ Bases:
sourced.ml.core.algorithms.uast_ids_to_bag.Uast2BagBase
-
class
sourced.ml.core.algorithms.uast_struct_to_bag.
UastSeq2Bag
(stride=1, seq_len=(3, 4), node2index=None)[source]¶ Bases:
sourced.ml.core.algorithms.uast_struct_to_bag.Uast2StructBagBase
DFS traversal that preserves the order of node children.
-
class
sourced.ml.core.algorithms.uast_struct_to_bag.
Uast2RandomWalks
(p_explore_neighborhood, q_leave_neighborhood, n_walks, n_steps, node2index=None, seed=None)[source]¶ Generates random walks over a UAST.
-
alias_sample
(self, walk)[source]¶ Compared to node2vec, this sampling is a bit simpler because a tree has no loops, so there are only two options, with unnormalized probabilities 1/p and 1/q. Related article: https://arxiv.org/abs/1607.00653
Parameters: walk – the list of visited nodes. Returns: the next node to visit.
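A hypothetical sketch of the two-option weighting; the real method additionally needs the tree structure to know which neighbor counts as the “return” direction.

    import random

    def pick_return(p_explore_neighborhood, q_leave_neighborhood):
        """True -> stay near the previous node (weight 1/p), False -> leave (1/q)."""
        w_return = 1.0 / p_explore_neighborhood
        w_leave = 1.0 / q_leave_neighborhood
        return random.random() < w_return / (w_return + w_leave)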
-
-
class
sourced.ml.core.algorithms.uast_struct_to_bag.
UastRandomWalk2Bag
(p_explore_neighborhood=0.79, q_leave_neighborhood=0.82, n_walks=2, n_steps=10, stride=1, seq_len=(2, 3), seed=42)[source]¶ Bases:
sourced.ml.core.algorithms.uast_struct_to_bag.Uast2StructBagBase
sourced.ml.core.algorithms.uast_to_bag
¶
-
class
sourced.ml.core.algorithms.uast_to_bag.
Uast2BagBase
[source]¶ Base class to convert UAST to a bag of anything.
-
class
sourced.ml.core.algorithms.uast_to_bag.
Uast2BagThroughSingleScan
[source]¶ Bases:
sourced.ml.core.algorithms.uast_to_bag.Uast2BagBase
Constructs the bag by doing a single tree traversal and turning every node into a string.
sourced.ml.core.algorithms.uast_to_id_sequence
¶
-
class
sourced.ml.core.algorithms.uast_to_id_sequence.
Uast2IdSequence
[source]¶ Bases:
sourced.ml.core.algorithms.uast_id_distance.Uast2IdLineDistance
Converts a UAST to a sorted sequence of identifiers. Identifiers are sorted by position in code. We do not change the order if positions are not present.
__call__ is overridden here and returns a list instead of a bag-of-words (dist).
sourced.ml.core.algorithms.uast_to_role_id_pairs
¶
-
class
sourced.ml.core.algorithms.uast_to_role_id_pairs.
Uast2RoleIdPairs
(token2index=None, token_parser=None)[source]¶ Bases:
sourced.ml.core.algorithms.uast_ids_to_bag.UastIds2Bag
Converts a UAST to a list of pairs. Each pair is an identifier and a role, where the role is the Node role in which the identifier was found.
__call__ is overridden here and returns a list instead of a bag-of-words (dist).
Package Contents¶
-
class
sourced.ml.core.algorithms.
UastIds2Bag
(token2index=None, token_parser=None)[source]¶ Bases:
sourced.ml.core.algorithms.uast_ids_to_bag.UastTokens2Bag
Converts a UAST to a bag-of-identifiers.
-
XPATH
= //*[@roleIdentifier]¶
-
-
class
sourced.ml.core.algorithms.
UastRandomWalk2Bag
(p_explore_neighborhood=0.79, q_leave_neighborhood=0.82, n_walks=2, n_steps=10, stride=1, seq_len=(2, 3), seed=42)[source]¶ Bases:
sourced.ml.core.algorithms.uast_struct_to_bag.Uast2StructBagBase
-
class
sourced.ml.core.algorithms.
UastSeq2Bag
(stride=1, seq_len=(3, 4), node2index=None)[source]¶ Bases:
sourced.ml.core.algorithms.uast_struct_to_bag.Uast2StructBagBase
DFS traversal that preserves the order of node children.
-
class
sourced.ml.core.algorithms.
Uast2QuantizedChildren
(npartitions: int = 20)[source]¶ Bases:
sourced.ml.core.algorithms.uast_to_bag.Uast2BagThroughSingleScan
Converts a UAST to a bag of children counts.
-
node2key
(self, node: bblfsh.Node)¶ Return the key for a given Node.
Parameters: node – a node in UAST. Returns: The string which consists of the internal type of the node and its number of children.
-
quantize
(self, frequencies: Iterable[Tuple[str, Iterable[Tuple[int, int]]]])¶
-
quantize_unwrapped
(self, children_freq: Iterable[Tuple[int, int]])¶ Builds the quantization partition P, a vector of length npartitions whose entries are in strictly ascending order. The quantization of x is defined as:
    0 if x <= P[0]
    m if P[m-1] < x <= P[m]
    n if P[n] <= x
Parameters: children_freq – the distribution of the number of children. Returns: the array with quantization levels.
-
-
class
sourced.ml.core.algorithms.
Uast2GraphletBag
[source]¶ Bases:
sourced.ml.core.algorithms.uast_ids_to_bag.Uast2BagBase
Converts a UAST to a bag of graphlets. The graphlet of a UAST node is composed of the node itself, its parent, and its children. Each node is represented by the internal role string.
-
uast2graphlets
(self, uast)¶ Parameters: uast – the UAST root node. Generates: the nodes which compose the UAST. The Node class is used to access the nodes of the graphlets.
-
node2key
(self, node)¶ Builds the string joining the internal types of all the nodes in the node’s graphlet in the following order: parent_node_child1_child2_child3. The children are sorted in alphabetical order. The str format is required by BagsExtractor.
Parameters: node – a node of the UAST. Returns: the string key of the node.
-
-
class
sourced.ml.core.algorithms.
Uast2RoleIdPairs
(token2index=None, token_parser=None)[source]¶ Bases:
sourced.ml.core.algorithms.uast_ids_to_bag.UastIds2Bag
Converts a UAST to a list of pairs. Each pair is an identifier and a role, where the role is the Node role in which the identifier was found.
__call__ is overridden here and returns a list instead of a bag-of-words (dist).
-
static
merge_roles
(roles: Iterable[int])¶
-
-
class
sourced.ml.core.algorithms.
Uast2IdLineDistance
[source]¶ Bases:
sourced.ml.core.algorithms.uast_id_distance.Uast2IdDistance
Converts a UAST to a list of identifier pairs and the code line distance between them, where applicable.
__call__ is overridden here and returns a list instead of a bag-of-words (dist).
-
distance
(self, point1, point2)¶
-
-
class
sourced.ml.core.algorithms.
Uast2IdTreeDistance
[source]¶ Bases:
sourced.ml.core.algorithms.uast_id_distance.Uast2IdDistance
Converts a UAST to a list of identifier pairs and the UAST tree distance between them.
__call__ is overridden here and returns a list instead of a bag-of-words (dist).
-
distance
(self, point1, point2)¶
-
static
calc_tree_distance
(last_common_level, level1, level2)¶
-
-
class
sourced.ml.core.algorithms.
Uast2IdSequence
[source]¶ Bases:
sourced.ml.core.algorithms.uast_id_distance.Uast2IdLineDistance
Converts a UAST to a sorted sequence of identifiers. Identifiers are sorted by position in code. We do not change the order if positions are not present.
__call__ is overridden here and returns a list instead of a bag-of-words (dist).
-
static
concat
(id_sequence: Iterable)¶
-
sourced.ml.core.extractors
¶
Submodules¶
sourced.ml.core.extractors.bags_extractor
¶
-
class
sourced.ml.core.extractors.bags_extractor.
Extractor
[source]¶ Bases:
sourced.ml.core.utils.pickleable_logger.PickleableLogger
Converts a single UAST via an algorithm to anything you need. It is a wrapper for use in the Uast2Features Transformer in a pipeline.
-
class
sourced.ml.core.extractors.bags_extractor.
BagsExtractor
(docfreq_threshold=None, weight=None, **kwargs)[source]¶ Bases:
sourced.ml.core.extractors.bags_extractor.Extractor
Converts a single UAST into a weighted set (dictionary) where the elements are strings and the values are floats. The derived classes must implement uast_to_bag().
sourced.ml.core.extractors.children
¶
-
class
sourced.ml.core.extractors.children.
ChildrenBagExtractor
(docfreq_threshold=None, **kwargs)[source]¶ Bases:
sourced.ml.core.extractors.bags_extractor.BagsExtractor
Converts a UAST to the bag of pairs (internal type, quantized number of children).
sourced.ml.core.extractors.helpers
¶
sourced.ml.core.extractors.id_sequence
¶
-
class
sourced.ml.core.extractors.id_sequence.
IdSequenceExtractor
(split_stem=False, **kwargs)[source]¶ Bases:
sourced.ml.core.extractors.bags_extractor.BagsExtractor
Extractor wrapper for the Uast2RoleIdPairs algorithm. Note that this is an unusual BagsExtractor since it returns an iterable instead of a bag.
The class is not wrapped with @register_extractor because it does not produce bags as the others do, so no outside code will see it or use it directly. For the same reason we are free to override the NAMESPACE, NAME, and OPTS fields with any values we want.
TODO(zurk): Split BagsExtractor into two classes: Extractor and BagsExtractor(Extractor), re-inherit this class from Extractor, delete explanations from docstring.
sourced.ml.core.extractors.identifier_distance
¶
-
class
sourced.ml.core.extractors.identifier_distance.
IdentifierDistance
(split_stem=False, type='tree', max_distance=DEFAULT_MAX_DISTANCE, **kwargs)[source]¶ Bases:
sourced.ml.core.extractors.bags_extractor.BagsExtractor
Extractor wrapper for the Uast2IdTreeDistance and Uast2IdLineDistance algorithms. Note that this is an unusual BagsExtractor since it returns an iterable instead of a bag.
The class is not wrapped with @register_extractor because it does not produce bags as the others do, so no outside code will see it or use it directly. For the same reason we are free to override the NAMESPACE, NAME, and OPTS fields with any values we want.
TODO(zurk): Split BagsExtractor into two classes: Extractor and BagsExtractor(Extractor), re-inherit this class from Extractor, delete explanations from docstring.
sourced.ml.core.extractors.literals
¶
Package Contents¶
-
class
sourced.ml.core.extractors.
Extractor
[source]¶ Bases:
sourced.ml.core.utils.pickleable_logger.PickleableLogger
Converts a single UAST via an algorithm to anything you need. It is a wrapper for use in the Uast2Features Transformer in a pipeline.
-
NAME
¶
-
ALGORITHM
¶
-
OPTS
¶
-
classmethod
get_kwargs_fromcmdline
(cls, args)¶
-
extract
(self, uast: bblfsh.Node)¶
-
-
class
sourced.ml.core.extractors.
BagsExtractor
(docfreq_threshold=None, weight=None, **kwargs)[source]¶ Bases:
sourced.ml.core.extractors.bags_extractor.Extractor
Converts a single UAST into a weighted set (dictionary) where the elements are strings and the values are floats. The derived classes must implement uast_to_bag().
-
DEFAULT_DOCFREQ_THRESHOLD
= 5¶
-
NAMESPACE
¶
-
OPTS
¶
-
docfreq_threhold
¶
-
ndocs
¶
-
extract
(self, uast)¶
-
uast_to_bag
(self, uast)¶
-
-
class
sourced.ml.core.extractors.
RoleIdsExtractor
[source]¶ Bases:
sourced.ml.core.extractors.bags_extractor.Extractor
-
NAME
= roleids¶
-
ALGORITHM
¶
-
-
class
sourced.ml.core.extractors.
IdentifiersBagExtractor
(docfreq_threshold=None, split_stem=True, **kwargs)[source]¶ Bases:
sourced.ml.core.extractors.bags_extractor.BagsExtractor
-
NAME
= id¶
-
NAMESPACE
= i.¶
-
OPTS
¶
-
uast_to_bag
(self, uast)¶
-
-
class
sourced.ml.core.extractors.
LiteralsBagExtractor
(docfreq_threshold=None, **kwargs)[source]¶ Bases:
sourced.ml.core.extractors.bags_extractor.BagsExtractor
-
NAME
= lit¶
-
NAMESPACE
= l.¶
-
OPTS
¶
-
uast_to_bag
(self, uast)¶
-
-
class
sourced.ml.core.extractors.
UastRandomWalkBagExtractor
(docfreq_threshold=None, **kwargs)[source]¶ Bases:
sourced.ml.core.extractors.helpers.BagsExtractor
-
NAME
= node2vec¶
-
NAMESPACE
= r.¶
-
OPTS
¶
-
uast_to_bag
(self, uast)¶
-
-
class
sourced.ml.core.extractors.
UastSeqBagExtractor
(docfreq_threshold=None, **kwargs)[source]¶ Bases:
sourced.ml.core.extractors.helpers.BagsExtractor
-
NAME
= uast2seq¶
-
NAMESPACE
= s.¶
-
OPTS
¶
-
uast_to_bag
(self, uast)¶
-
-
class
sourced.ml.core.extractors.
ChildrenBagExtractor
(docfreq_threshold=None, **kwargs)[source]¶ Bases:
sourced.ml.core.extractors.bags_extractor.BagsExtractor
Converts a UAST to the bag of pairs (internal type, quantized number of children).
-
NAME
= children¶
-
NAMESPACE
= c.¶
-
OPTS
¶
-
npartitions
¶
-
levels
¶
-
extract
(self, uast)¶
-
quantize
(self, frequencies: Iterable[Tuple[str, Iterable[Tuple[int, int]]]])¶
-
-
class
sourced.ml.core.extractors.
GraphletBagExtractor
(docfreq_threshold=None, **kwargs)[source]¶ Bases:
sourced.ml.core.extractors.bags_extractor.BagsExtractor
-
NAME
= graphlet¶
-
NAMESPACE
= g.¶
-
OPTS
¶
-
uast_to_bag
(self, uast)¶
-
-
class
sourced.ml.core.extractors.
IdentifierDistance
(split_stem=False, type='tree', max_distance=DEFAULT_MAX_DISTANCE, **kwargs)[source]¶ Bases:
sourced.ml.core.extractors.bags_extractor.BagsExtractor
Extractor wrapper for the Uast2IdTreeDistance and Uast2IdLineDistance algorithms. Note that this is an unusual BagsExtractor since it returns an iterable instead of a bag.
The class is not wrapped with @register_extractor because it does not produce bags as the others do, so no outside code will see it or use it directly. For the same reason we are free to override the NAMESPACE, NAME, and OPTS fields with any values we want.
TODO(zurk): Split BagsExtractor into two classes: Extractor and BagsExtractor(Extractor), re-inherit this class from Extractor, delete explanations from docstring.
-
NAMESPACE
=¶
-
NAME
= Identifier distance¶
-
OPTS
¶
-
DEFAULT_MAX_DISTANCE
¶
-
extract
(self, uast: bblfsh.Node)¶
-
-
class
sourced.ml.core.extractors.
IdSequenceExtractor
(split_stem=False, **kwargs)[source]¶ Bases:
sourced.ml.core.extractors.bags_extractor.BagsExtractor
Extractor wrapper for the Uast2RoleIdPairs algorithm. Note that this is an unusual BagsExtractor since it returns an iterable instead of a bag.
The class is not wrapped with @register_extractor because it does not produce bags as the others do, so no outside code will see it or use it directly. For the same reason we are free to override the NAMESPACE, NAME, and OPTS fields with any values we want.
TODO(zurk): Split BagsExtractor into two classes: Extractor and BagsExtractor(Extractor), re-inherit this class from Extractor, delete explanations from docstring.
-
NAMESPACE
=¶
-
NAME
= id sequence¶
-
OPTS
¶
-
extract
(self, uast: bblfsh.Node)¶
-
sourced.ml.core.models
¶
Subpackages¶
sourced.ml.core.models.model_converters
¶
sourced.ml.core.models.model_converters.base
¶
-
class
sourced.ml.core.models.model_converters.base.
Model2Base
(num_processes: int = 0, log_level: int = logging.DEBUG, overwrite_existing: bool = True)[source]¶ Bases:
sourced.ml.core.utils.pickleable_logger.PickleableLogger
Base class for model -> model conversions.
-
convert
(self, models_path: List[str], destdir: str)[source]¶ Performs the model -> model conversion. Runs the conversions in a pool of processes.
Parameters: - models_path – list of paths to the models.
- destdir – The directory where to store the models. The directory structure is preserved.
Returns: The number of converted files.
-
sourced.ml.core.models.model_converters.merge_bow
¶
-
class
sourced.ml.core.models.model_converters.merge_bow.
MergeBOW
(features=None, *args, **kwargs)[source]¶ Bases:
sourced.ml.core.models.model_converters.base.Model2Base
Merges several
BOW
models together.
sourced.ml.core.models.model_converters.merge_df
¶
-
class
sourced.ml.core.models.model_converters.merge_df.
MergeDocFreq
(min_docfreq: int, vocabulary_size: int, ordered: bool = False, *args, **kwargs)[source]¶ Bases:
sourced.ml.core.models.model_converters.base.Model2Base
Merges several
DocumentFrequencies
models together.
Submodules¶
sourced.ml.core.models.bow
¶
-
class
sourced.ml.core.models.bow.
BOW
[source]¶ Bases:
modelforge.Model
Weighted bag-of-words model. Every word corresponds to an index and its matrix column. A bag is a set of words from a repository, a file, or anything else. A word is a source code identifier or its part. This model depends on
sourced.ml.models.DocumentFrequencies
.
-
matrix
[source]¶ Returns the bags as a sparse matrix. Rows are documents and columns are token weights.
-
sourced.ml.core.models.coocc
¶
sourced.ml.core.models.df
¶
-
class
sourced.ml.core.models.df.
DocumentFrequencies
[source]¶ Bases:
modelforge.Model
Document frequencies - number of times a source code identifier appeared in different repositories. Each repository counts only once.
-
construct
(self, docs: int, tokfreqs: Union[Iterable[Dict[str, int]], Dict[str, int]])[source]¶ Initializes this model.
Parameters: - docs – The number of documents.
- tokfreqs – The dictionary of token -> frequency or the iterable collection of such dictionaries.
Returns: self
-
prune
(self, threshold: int)[source]¶ Removes tokens which occur fewer than threshold times. The operation is not in-place: a new model is returned. :param threshold: Minimum number of occurrences. :return: The new model if the current one had to be changed, otherwise self.
-
greatest
(self, max_size: int)[source]¶ Truncates the model to the max_size most frequent tokens. The operation is not in-place: a new model is returned. :param max_size: The maximum vocabulary size. :return: The new model if the current one had to be changed, otherwise self.
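A hedged usage sketch of these operations:

    from sourced.ml.core.models import DocumentFrequencies

    df = DocumentFrequencies().construct(docs=3, tokfreqs={"foo": 5, "bar": 1})
    pruned = df.prune(2)           # a new model without "bar"
    small = pruned.greatest(1000)  # at most the 1000 most frequent tokens
    print(small.get("foo", 0))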
-
sourced.ml.core.models.id2vec
¶
sourced.ml.core.models.id_splitter
¶
-
class
sourced.ml.core.models.id_splitter.
IdentifierSplitterBiLSTM
(**kwargs)[source]¶ Bases:
modelforge.Model
Bidirectional LSTM model. Splits identifiers without the need for a conventional pattern. Reference: https://arxiv.org/abs/1805.11651
-
construct
(self, model: keras.models.Model, maxlen: int = DEFAULT_MAXLEN, padding: str = DEFAULT_PADDING, mapping: Dict[str, int] = DEFAULT_MAPPING, batch_size: int = DEFAULT_BATCH_SIZE)[source]¶ Construct IdentifierSplitterBiLSTM model.
Parameters: - model – keras model used for identifier splitting.
- maxlen – maximum length of input identifiers.
- padding – Where to pad the identifiers of length < maxlen. Can be “left” or “right”.
- mapping – Mapping of characters to integers.
- batch_size – Batch size of input data fed to the model.
Returns: BiLSTM based source code identifier splitter.
-
prepare_input
(self, identifiers: Sequence[str])[source]¶ Prepares the input by converting a sequence of identifiers to the corresponding ASCII code 2D array and the list of lowercase cleaned identifiers.
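A hedged usage sketch; the model file name is hypothetical and split() is assumed to be the inference entry point.

    from sourced.ml.core.models.id_splitter import IdentifierSplitterBiLSTM

    model = IdentifierSplitterBiLSTM().load("id_splitter.asdf")  # hypothetical path
    print(model.split(["fooBarBaz", "readfile"]))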
-
sourced.ml.core.models.license
¶
Default license used for the models.
sourced.ml.core.models.ordered_df
¶
-
class
sourced.ml.core.models.ordered_df.
OrderedDocumentFrequencies
[source]¶ Bases:
sourced.ml.core.models.DocumentFrequencies
Compatible with the original DocumentFrequencies. This model maintains the deterministic order of the tokens.
sourced.ml.core.models.quant
¶
-
class
sourced.ml.core.models.quant.
QuantizationLevels
[source]¶ Bases:
modelforge.Model
This model contains quantization levels for multiple schemes (feature types). Every feature “class” (type, possible distinct value) corresponds to the numpy array with integer level borders. The size of each numpy array is (the number of levels + 1).
sourced.ml.core.models.tensorflow
¶
sourced.ml.core.models.topics
¶
Package Contents¶
-
class
sourced.ml.core.models.
BOW
[source]¶ Bases:
modelforge.Model
Weighted bag-of-words model. Every word corresponds to an index and its matrix column. A bag is a set of words from a repository, a file, or anything else. A word is a source code identifier or its part. This model depends on
sourced.ml.models.DocumentFrequencies
.
-
NAME
= bow¶
-
VENDOR
= source{d}¶
-
DESCRIPTION
= Model that contains source code as weighted bag of words.¶
-
LICENSE
¶
-
matrix
¶ Returns the bags as a sparse matrix. Rows are documents and columns are token weights.
-
documents
¶ The list of documents in the model.
-
tokens
¶ The list of tokens in the model.
-
construct
(self, documents: List[str], tokens: List[str], matrix: sparse.spmatrix)¶
-
dump
(self)¶
-
save
(self, output: str, series: str, deps: Iterable = tuple(), create_missing_dirs: bool = True)¶
-
convert_bow_to_vw
(self, output: str)¶
-
documents_index
(self)¶
-
-
class
sourced.ml.core.models.
Cooccurrences
[source]¶ Bases:
modelforge.model.Model
Co-occurrence matrix.
-
NAME
= co-occurrences¶
-
VENDOR
= source{d}¶
-
DESCRIPTION
= Model that contains the sparse co-occurrence matrix of source code identifiers.¶
-
LICENSE
¶
-
tokens
¶ Returns the tokens in the order which corresponds to the matrix’s rows and cols.
-
matrix
¶ Returns the sparse co-occurrence matrix.
-
construct
(self, tokens, matrix)¶
-
dump
(self)¶
-
matrix_to_rdd
(self, spark_context: 'pyspark.SparkContext')¶
-
-
class
sourced.ml.core.models.
DocumentFrequencies
[source]¶ Bases:
modelforge.Model
Document frequencies - number of times a source code identifier appeared in different repositories. Each repository counts only once.
-
NAME
= docfreq¶
-
VENDOR
= source{d}¶
-
DESCRIPTION
= Model that contains document frequencies of features extracted from code.¶
-
LICENSE
¶
-
docs
¶ Returns the number of documents.
-
construct
(self, docs: int, tokfreqs: Union[Iterable[Dict[str, int]], Dict[str, int]])¶ Initializes this model.
Parameters: - docs – The number of documents.
- tokfreqs – The dictionary of token -> frequency or the iterable collection of such dictionaries.
Returns: self
-
dump
(self)¶
-
prune
(self, threshold: int)¶ Removes tokens which occur fewer than threshold times. The operation is not in-place: a new model is returned. :param threshold: Minimum number of occurrences. :return: The new model if the current one had to be changed, otherwise self.
-
greatest
(self, max_size: int)¶ Truncates the model to the max_size most frequent tokens. The operation is not in-place: a new model is returned. :param max_size: The maximum vocabulary size. :return: The new model if the current one had to be changed, otherwise self.
-
get
(self, item, default=None)¶ Return the document frequency for a given token.
Parameters: - item – The token to query.
- default – Returned value in case the token is missing.
Returns: int or default
-
tokens
(self)¶ Returns the list of tokens.
-
-
class
sourced.ml.core.models.
OrderedDocumentFrequencies
[source]¶ Bases:
sourced.ml.core.models.DocumentFrequencies
Compatible with the original DocumentFrequencies. This model maintains the deterministic order of the tokens.
-
order
¶
-
construct
(self, docs: int, tokfreqs: Iterable[Dict[str, int]])¶
-
tokens
(self)¶
-
prune
(self, threshold: int)¶
-
greatest
(self, max_size: int)¶
-
-
class
sourced.ml.core.models.
Id2Vec
[source]¶ Bases:
modelforge.Model
id2vec model - source code identifier embeddings.
-
NAME
= id2vec¶
-
VENDOR
= source{d}¶
-
DESCRIPTION
= Model that contains information on source code as identifier embeddings.¶
-
LICENSE
¶
-
embeddings
¶ numpy.ndarray
with the embeddings of shape (N tokens x embedding dims).
-
tokens
¶ List with the processed source code identifiers.
-
construct
(self, embeddings, tokens)¶
-
dump
(self)¶
-
items
(self)¶ Returns the tuples of the token -> index mapping.
-
-
class
sourced.ml.core.models.
TensorFlowModel
[source]¶ Bases:
modelforge.Model
TensorFlow Protobuf model exported in the Modelforge format with GraphDef inside.
-
NAME
= tensorflow-model¶
-
VENDOR
= source{d}¶
-
DESCRIPTION
= TensorFlow Protobuf model that contains a GraphDef instance.¶
-
LICENSE
¶
-
graphdef
¶ Returns the wrapped TensorFlow GraphDef.
-
construct
(self, graphdef: 'tensorflow.GraphDef' = None, session: 'tensorflow.Session' = None, outputs: List[str] = None)¶
-
-
class
sourced.ml.core.models.
Topics
[source]¶ Bases:
modelforge.Model
-
NAME
= topics¶
-
VENDOR
= source{d}¶
-
DESCRIPTION
= Model that is used to identify topics of source code repositories.¶
-
LICENSE
¶
-
tokens
¶
-
topics
¶ May be None if no topics are labeled.
-
matrix
¶ Rows: tokens. Columns: topics.
-
construct
(self, tokens: list, topics: Union[list, None], matrix)¶
-
dump
(self)¶
-
label_topics
(self, labels)¶
-
-
class
sourced.ml.core.models.
QuantizationLevels
[source]¶ Bases:
modelforge.Model
This model contains quantization levels for multiple schemes (feature types). Every feature “class” (type, possible distinct value) corresponds to the numpy array with integer level borders. The size of each numpy array is (the number of levels + 1).
-
NAME
= quant¶
-
VENDOR
= source{d}¶
-
DESCRIPTION
= Model that contains quantization levels for multiple schemes (feature types).¶
-
LICENSE
¶
-
levels
¶
-
construct
(self, levels: Dict[str, Dict[str, numpy.ndarray]])¶
-
dump
(self)¶
-
apply_quantization
(self, extractors)¶
-
-
class
sourced.ml.core.models.
MergeDocFreq
(min_docfreq: int, vocabulary_size: int, ordered: bool = False, *args, **kwargs)[source]¶ Bases:
sourced.ml.core.models.model_converters.base.Model2Base
Merges several
DocumentFrequencies
models together.
-
MODEL_FROM_CLASS
¶
-
MODEL_TO_CLASS
¶
-
convert_model
(self, model: DocumentFrequencies)¶
-
finalize
(self, index: int, destdir: str)¶
-
sourced.ml.core.utils
¶
Submodules¶
sourced.ml.core.utils.bblfsh
¶
sourced.ml.core.utils.bblfsh_roles
¶
sourced.ml.core.utils.bigartm
¶
-
sourced.ml.core.utils.bigartm.
install_bigartm
(args=None, target='./bigartm', tempdir=None)[source]¶ Deploys bigartm/bigartm at the specified path.
Parameters: - args – argparse.Namespace with “output” and “tmpdir”. “output” sets the target directory, “tmpdir” sets the temporary directory which is used to clone bigartm/bigartm and build it.
- target – The path to the built executable. If args is not None, it is overridden.
- tempdir – The temporary directory where to clone and build bigartm/bigartm. If args is not None, it is overridden.
Returns: None if successful; otherwise, the error code (can be 0!).
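A minimal usage sketch:

    from sourced.ml.core.utils.bigartm import install_bigartm

    error = install_bigartm(target="./bigartm")
    if error is not None:
        raise RuntimeError("bigartm build failed with code %s" % error)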
Package Contents¶
-
sourced.ml.core.utils.
install_bigartm
(args=None, target='./bigartm', tempdir=None)[source]¶ Deploys bigartm/bigartm at the specified path.
Parameters: - args – argparse.Namespace with “output” and “tmpdir”. “output” sets the target directory, “tmpdir” sets the temporary directory which is used to clone bigartm/bigartm and build it.
- target – The path to the built executable. If args is not None, it is overridden.
- tempdir – The temporary directory where to clone and build bigartm/bigartm. If args is not None, it is overridden.
Returns: None if successful; otherwise, the error code (can be 0!).