Statistical Analysis of DNA Sequences Using Overlapping Windows
Authors
Hauth, Amy
Clayton, Murray K.
Type
Technical Report
Publisher
University of Wisconsin-Madison Department of Computer Sciences
Abstract
Motivation: Our analysis of DNA sequences uses a sliding window of length k and considers all overlapping windows along the sequence. The k consecutive nucleotides in a window are called a word, or k-word. Statistical analyses of this collection of words often assume that the words are independent. Because words can overlap, strict independence is not a valid assumption. We derive a statistic that incorporates both the independent and dependent components of overlapping words of length k.
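As an illustration of the overlapping-window idea described above, here is a minimal Python sketch. The function names `overlapping_words` and `count_words` are hypothetical and do not come from the report; the sketch only shows how a window of length k, advanced one position at a time, produces words that share nucleotides.

```python
from collections import Counter


def overlapping_words(sequence, k):
    """Yield every k-word seen by a window that advances one position at a time."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


def count_words(sequence, k):
    """Count occurrences of each k-word, including overlapping occurrences."""
    return Counter(overlapping_words(sequence, k))


# Illustrative input (not data from the report): "AAAC" has 3 windows of length 2.
counts = count_words("AAAC", 2)
```

In this toy input the word AA is counted twice, at positions 1 and 2, and those two occurrences share a nucleotide. That sharing is why the word counts cannot be treated as strictly independent.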
Results: The expected number of occurrences of a k-word in a sequence of length N is easily calculated from the probabilities of the nucleotides within the word. The variance, however, is not straightforward, because overlapping occurrences are not independent. We present a derivation of the variance for sequence analysis that uses overlapping windows of length k. The variance can be determined for a word over the entire sequence or at a single position in the sequence. Our analysis assumes that nucleotides are independent of one another; it does not assume a specific probability of occurrence for each nucleotide.
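The quantities named in the abstract can be sketched numerically. The code below is not the report's derivation, which is not reproduced on this page. It is the standard textbook calculation for independent nucleotides with arbitrary probabilities, written with hypothetical function names. It assumes the expected count is (N - k + 1) times the word probability, and that the variance adds a covariance term for every pair of windows close enough to overlap.

```python
from math import prod


def word_probability(word, probs):
    """Probability of a word when nucleotides are independent."""
    return prod(probs[base] for base in word)


def expected_count(word, seq_len, probs):
    """Expected number of (overlapping) occurrences in a sequence of length seq_len."""
    n_windows = seq_len - len(word) + 1
    return max(n_windows, 0) * word_probability(word, probs)


def count_variance(word, seq_len, probs):
    """Variance of the overlapping occurrence count (assumed textbook formula)."""
    k = len(word)
    n_windows = seq_len - k + 1
    if n_windows <= 0:
        return 0.0
    p_word = word_probability(word, probs)
    # Single-position term: each window is a Bernoulli(p_word) indicator.
    variance = n_windows * p_word * (1.0 - p_word)
    # Windows that start fewer than k positions apart share nucleotides.
    for lag in range(1, min(k, n_windows)):
        if word[lag:] == word[:k - lag]:
            # The word can overlap itself at this lag; the second occurrence
            # only adds its last `lag` letters to the joint string.
            joint = p_word * word_probability(word[k - lag:], probs)
        else:
            joint = 0.0
        variance += 2.0 * (n_windows - lag) * (joint - p_word * p_word)
    return variance


# Illustrative probabilities (an assumption, not values from the report).
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
mean_aa = expected_count("AA", 3, uniform)
var_aa = count_variance("AA", 3, uniform)
```

For the word AA in a sequence of length 3 with uniform probabilities, this gives an expected count of 0.125 and a variance of 0.140625. Treating the two windows as independent would give 2 × (1/16) × (15/16) ≈ 0.117, so ignoring the overlap understates the variance for this self-overlapping word.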
Citation
TR1474