I have several sequences of data and I'd like to do statistical analysis of trends within these sequences.
For simplicity's sake I'll illustrate this using letters:
Sequence 1: A B L A G F C
Sequence 2: X B D K W L H K
Sequence 3: A B L H K W E N
Sequence 4: X B L A G C H K
Sequence 5: J L H B K W E N
etc...
The actual sequences are non-random and show several very clear patterns and tendencies.
I'd like to do some number crunching to determine things like:
1.) What are the most common sequences of letters (pairs, trios, etc.)? (For example, in the sequences above, "LAG" occurs in both sequence 1 and sequence 4.)
2.) What are the most common opening patterns (as in, the first 2-4 letters)? (For example, sequences 1 and 3 both begin with ABL, and sequences 2 and 4 both begin with XB.)
3.) What are the most common closing patterns (the last 3-4 letters)? (For example, WEN is the most common ending, and HK is the second most common.)
4.) When the letter L occurs, what are the most common letters preceding / succeeding it? (E.g., L is usually followed by H, and sometimes by A; W is always preceded by K.)
5.) When the letters D and F occur in succession, what are the most common letters preceding / succeeding this pair?
6.) In all the sequences, what are the most/least common letters (in order)?
etc.
...And many other similar questions...
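To make the questions above concrete, here is a minimal sketch of the kind of counting I have in mind, assuming Python and its standard collections.Counter. The function names (ngram_counts, opening_counts, closing_counts, neighbour_counts) are made up for illustration, and the data is just the five example sequences, not my real data.

```python
from collections import Counter

# The five example sequences from above, split into lists of letters.
sequences = [s.split() for s in [
    "A B L A G F C",
    "X B D K W L H K",
    "A B L H K W E N",
    "X B L A G C H K",
    "J L H B K W E N",
]]

def ngram_counts(seqs, n):
    """Count every run of n consecutive letters across all sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts

def opening_counts(seqs, n):
    """Count the first n letters of each sequence that is long enough."""
    return Counter(tuple(seq[:n]) for seq in seqs if len(seq) >= n)

def closing_counts(seqs, n):
    """Count the last n letters of each sequence that is long enough."""
    return Counter(tuple(seq[-n:]) for seq in seqs if len(seq) >= n)

def neighbour_counts(seqs, pattern):
    """Count the letters immediately before and after each occurrence of pattern."""
    pattern = tuple(pattern)
    n = len(pattern)
    before, after = Counter(), Counter()
    for seq in seqs:
        for i in range(len(seq) - n + 1):
            if tuple(seq[i:i + n]) == pattern:
                if i > 0:
                    before[seq[i - 1]] += 1
                if i + n < len(seq):
                    after[seq[i + n]] += 1
    return before, after

# Overall letter frequencies (question 6).
letter_counts = Counter(letter for seq in sequences for letter in seq)
```

For instance, ngram_counts(sequences, 3).most_common(10) would list the most frequent trios, neighbour_counts(sequences, ["L"]) returns the counts of letters before and after L, and neighbour_counts(sequences, ["D", "F"]) does the same for the pair D F. letter_counts.most_common() gives the letters from most to least frequent.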
The ultimate goal: given a new sequence with blanks in it, I would like to make an intelligent guess as to the most likely letters to fill those blanks.
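As a rough illustration of the blank-filling idea, here is a sketch of one simple heuristic (an assumption on my part, not a settled method): count adjacent letter pairs, then score each candidate letter by how often it follows the letter to the left of the blank and how often it precedes the letter to the right. The names guess_blank and smoothing, and the "^" / "$" start and end markers, are hypothetical choices for this example.

```python
from collections import Counter

# The five example sequences from above, split into lists of letters.
sequences = [s.split() for s in [
    "A B L A G F C",
    "X B D K W L H K",
    "A B L H K W E N",
    "X B L A G C H K",
    "J L H B K W E N",
]]

# Hypothetical markers for the start and end of a sequence.
START, END = "^", "$"

# Count every adjacent pair of letters, including start/end markers.
pair_counts = Counter()
alphabet = set()
for seq in sequences:
    padded = [START] + seq + [END]
    alphabet.update(seq)
    pair_counts.update(zip(padded, padded[1:]))

def guess_blank(seq, index, smoothing=1.0):
    """Rank candidate letters for the blank at seq[index] (blanks are None).

    score = (count(left, cand) + smoothing) * (count(cand, right) + smoothing)
    A neighbour that is itself blank is ignored in the score.
    Ties are broken alphabetically so the ranking is deterministic.
    """
    padded = [START] + list(seq) + [END]
    left, right = padded[index], padded[index + 2]
    scores = {}
    for cand in alphabet:
        score = 1.0
        if left is not None:
            score *= pair_counts[(left, cand)] + smoothing
        if right is not None:
            score *= pair_counts[(cand, right)] + smoothing
        scores[cand] = score
    return sorted(scores, key=lambda c: (-scores[c], c))

# Example: guess the blank in "A B _ A G".
ranking = guess_blank(["A", "B", None, "A", "G"], 2)
```

With the five example sequences, the top-ranked guess for "A B _ A G" is L, because B is most often followed by L and A is only ever preceded by L. This only looks at immediate neighbours; longer contexts (trios and so on) would need a richer model.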
Thanks