sourced.ml.core.algorithms.token_parser

Module Contents

class sourced.ml.core.algorithms.token_parser.TokenStyle

Bases: enum.Enum

Metadata that allows the original identifier to be reconstructed from a list of tokens.

DELIMITER = 1
TOKEN_UPPER = 2
TOKEN_LOWER = 3
TOKEN_CAPITALIZED = 4
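
To illustrate how style metadata of this kind can make reconstruction possible, here is a minimal sketch. It does not use the library: `Style` is a hypothetical stand-in that only mirrors the member names above, and `rebuild` and the `(text, style)` pair layout are assumptions made for the example, not the format the library stores.

```python
from enum import Enum


class Style(Enum):
    # Hypothetical stand-in that mirrors the TokenStyle member names.
    DELIMITER = 1
    TOKEN_UPPER = 2
    TOKEN_LOWER = 3
    TOKEN_CAPITALIZED = 4


def rebuild(tagged_tokens):
    """Join (text, style) pairs back into a single identifier string."""
    parts = []
    for text, style in tagged_tokens:
        if style is Style.DELIMITER:
            parts.append(text)  # delimiters are kept verbatim
        elif style is Style.TOKEN_UPPER:
            parts.append(text.upper())
        elif style is Style.TOKEN_LOWER:
            parts.append(text.lower())
        else:
            parts.append(text.capitalize())
    return "".join(parts)


example = [
    ("get", Style.TOKEN_LOWER),
    ("http", Style.TOKEN_UPPER),
    ("_", Style.DELIMITER),
    ("response", Style.TOKEN_CAPITALIZED),
]
print(rebuild(example))  # getHTTP_Response
```

The point of the sketch is only that lowercased tokens lose casing and delimiter information, so some per-token tag is needed to restore it.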

class sourced.ml.core.algorithms.token_parser.TokenParser(stem_threshold=STEM_THRESHOLD, max_token_length=MAX_TOKEN_LENGTH, min_split_length=MIN_SPLIT_LENGTH, single_shot=False, save_token_style=False, attach_upper=True, use_nn=False, nn_model=None)

Common utilities for splitting and stemming tokens.

NAME_BREAKUP_RE
NAME_BREAKUP_KEEP_DELIMITERS_RE
STEM_THRESHOLD = 6
MAX_TOKEN_LENGTH = 256
MIN_SPLIT_LENGTH = 3
use_nn
stem_threshold
max_token_length
min_split_length
process_token(self, token)
stem(self, word)
split(self, token: str)

Splits a single identifier.
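
As a rough picture of what splitting an identifier involves, the sketch below breaks a name on non-letter characters and on camelCase boundaries. It is a simplified, standalone illustration and not the implementation behind `split`: the helper `split_identifier`, both regular expressions, and the choice to drop pieces shorter than `min_length` are assumptions made for the example, and the class may treat short pieces differently.

```python
import re

# Assumed patterns for this sketch only.
NON_LETTERS = re.compile(r"[^a-zA-Z]+")
CAMEL_PARTS = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def split_identifier(identifier, min_length=3):
    """Yield lowercased sub-tokens of an identifier.

    Pieces shorter than min_length are dropped in this sketch.
    """
    for chunk in NON_LETTERS.split(identifier):
        for part in CAMEL_PARTS.findall(chunk):
            if len(part) >= min_length:
                yield part.lower()


print(list(split_identifier("getHTTPResponse_v2")))
# ['get', 'http', 'response']
```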

split_batch(self, tokens: [str])

Splits a batch of identifiers.

static reconstruct(tokens)

class sourced.ml.core.algorithms.token_parser.NoopTokenParser

Use this class if you do not want to do any parsing.

process_token(self, token)
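
The value of a no-op parser is that calling code can accept any object with a `process_token` method and stay unchanged when parsing is switched off. The sketch below shows that pattern with a hypothetical stand-in class. `PassThroughParser` and `extract` are not part of the library, and the assumption that `process_token` returns an iterable of tokens is made for the example rather than taken from this page.

```python
class PassThroughParser:
    # Hypothetical stand-in for a parser that does no parsing.
    def process_token(self, token):
        # Assumption: parsers yield tokens, so the no-op yields its input once.
        yield token


def extract(identifiers, parser):
    """Collect every token the given parser produces for each identifier."""
    tokens = []
    for identifier in identifiers:
        tokens.extend(parser.process_token(identifier))
    return tokens


print(extract(["getValue", "x_1"], PassThroughParser()))
# ['getValue', 'x_1']
```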