python - Finding all repeated substrings in a string and how often they appear -

problem:

i need sequences of characters meet following:

sequence of characters must present more once ((le, 1) invalid).
sequence of characters must longer 1 character ((m, 2) invalid).
sequence of characters must not part of longer existing sequence present same number of times ((li, 2) invalid if (lio, 2) present).

so, if input string was: kakamneneneliolelionem$ output be:

(ka, 2) (ne, 4) (lio, 2)

it needs fast, should able solve 1000 character long string in reasonable amount of time.

what have tried:

getting amount of branches suffix tree:

editing this suffix tree -creating librabry(python-suffix-tree), made program gives erroneus results.

i added function suffixtree class in suffix_tree.py:

def get_repeated_substrings(self):     curr_index = self.n     values = self.edges.values()     values = sorted(values, key=lambda x: x.dest_node_index)     data = []  # index = edge.dest_node_index - 1     edge in values:         if edge.source_node_index == -1:             continue         top = min(curr_index, edge.last_char_index)         data.append([edge.source_node_index,                 self.string[edge.first_char_index:top+1]])     repeated_substrings = {}     source_node_indexes = [i[0] in data]     nodes_pointed_to = set(source_node_indexes)     nodes_pointed_to.remove(0)     node_pointed_to in nodes_pointed_to:         presence = source_node_indexes.count(node_pointed_to)         key = data[node_pointed_to-1][1]         if key not in repeated_substrings:             repeated_substrings[key] = 0         repeated_substrings[key] += presence     key in repeated_substrings:         if len(key) > 1:             print(key, repeated_substrings[key])

and used , rest of library this:

from lib.suffix_tree import suffixtree  st = suffixtree("kakaneneneliolelione$") print(st.get_repeated_substrings())

output:

ka 2 ne 7 lio 2 io 4

get_repeated_substrings() goes through connections between nodes (called edges in library) , saves how many more connections node points towards has (it saves repeated_substrings), , prints saved values more 1 character long.

it appends number of connections amount had sequence, works of time, can see in output above, resulted in incorrect value 'ne' (7, should have been 4). after solving problem, realised method won't detect patterns made out of same character (aa, bb), , other malfunctions. conclusion: either there no way solve suffix trees, or doing wrong.

other methods:

i have tried more straightforward ways, consisting of looping through stuff, hasn't been success either:

import copy  string = 'kakabaliosie'  main_char in set(string):     indices = []     char_i, char in enumerate(string):         if main_char == char:             indices.append(char_i)     relative = 1     while true         index in indices:             other_indices = copy.deepcopy(indices)             other_indices.remove(index)             other_index in other_indices:

(can't manage finish it)

question:

how can make program want?

your suffix tree approach right way.

getting set of matches , number of occurrences

basically need traverse tree in bfs fashion. starting root's children recursively count number of leaves reachable each node. lead method node call on root. here possible implementation:

def count_leaves(self, stree):     leaves_count = 0     child in [stree.nodes[x.dest_node_index] x in self.edges.values()]:         child_leaves_count = child.count_leaves(stree)         if 0 == child_leaves_count:             # child node leaf...             leaves_count = leaves_count + 1         else:             # child node internal node, add number of leaves can reach             leaves_count = leaves_count + child_leaves_count     self.leaves_count = leaves_count     return leaves_count

now, each node labelled number of leaves can reach.

then, interesting properties of suffix tree filter out automatically substrings don't match of requirements:

if string present once, end in transition leaf node (said transition has @ least 2 characters because don't count end token $). means isn't explicit state , don't consider it.
if have (li, 2) , (lio, 2), (li, 2) implicit state of suffix tree, means in middle of edge. since consider substrings explicit state (i.e. end in node), won't ever find (li, 2) begin with.
internal nodes have @ least 2 children, otherwise leaf , have none.

now iterating on internal nodes both list of substrings , number of occurences in input string (you need filter out nodes representing 1 character substring).

you find below string representation of suffix tree input string. visualize substrings matches.

- o - n e m $ - ##       - l e l o n e m $ - ##   - o - n e m $ - ##         - l e l o n e m $ - ##   - $ - ##   - e - m $ - ##         l o - n e m $ - ##               l e l o n e m $ - ##       n e - l o l e l o n e m $ - ##             n e l o l e l o n e m $ - ## - k - m n e n e n e l o l e l o n e m $ - ##         k m n e n e n e l o l e l o n e m $ - ## - l - e l o n e m $ - ##       o - n e m $ - ##             l e l o n e m $ - ## - - m n e n e n e l o l e l o n e m $ - ##       k m n e n e n e l o l e l o n e m $ - ## - m - $ - ##       n e n e n e l o l e l o n e m $ - ## - n e - m $ - ##         l o l e l o n e m $ - ##         n e - l o l e l o n e m $ - ##               n e l o l e l o n e m $ - ##

this leads following output:

(io, 2)
(elio, 2)
(ene, 2)
(ka, 2)
(lio, 2)
(ne, 4)
(nene, 2)

excluding redundant matches fixed number of occurrences (n)

we assume instance lio, , io should filtered out because, elio, have 2 matches. such substrings of larger match called "redundant matches". following puzzle remains unsolved: given set of matches occur n times a.k.a n-matches (where n fixed integer), how can filter out "redundant" ones?

we start making priority queue set of n-matches ordered decreasing length. build iteratively generalized suffix tree (gst) of these matches identify redundant matches. algorithm following:

for each element in heap (taken @ top), test if element substring of 1 of elements registered in gst
- if not: insert gst , append list of "good matches".
- else: skip since larger match registered ... , try next element
once heap empty, list of matches contains non-redundant n-matches.

this leads following pseudo python code:

match_heap = heapify(set_of_matches) good_matches = [] match_gst = generalized_suffix_tree() while (not match_heap.empty()):     top_match = match_heap.top()     if (not match_gst.is_substring(top_match.string)):         gst_match.insert(top_match.string)         good_matches.append(top_match)     else:         # given match substring of registered, bigger match         # skip return good_matches

filtering redundant matches

now can filter redundant matches n-matches, easy filter of them out of our global set of matches. gather matches in buckets using number of occurrences, apply algorithm of previous section on each bucket.

notes

to implement algorithm above, need have generalized suffix tree implementation, bit different regular suffix tree. if cannot find python implementation, adapt 1 got. see this question hints on how so.

Search This Blog

Today's Best Video