Distribution of nucleotides in genetic sequences - It is not random but what is the rule? (Feb/07/2008)
Hi!
The distribution of nucleotides along DNA or RNA cannot be modelled by a simple randomisation procedure. My evidence:
1. I read it somewhere but do not remember the source.
2. I tested it with computer modelling.
The question is: what are the rules? There should be some principles and/or formulas that statistically describe the distribution of nucleotides in sequences.
Could someone kindly provide me with hints? Thank you.
i don't see how the distribution of nucleotides could be reduced to a statistical model. actually i agree with a colleague who always says we humans will never be able to understand the whole complexity of biological processes. one hint you could use: i would say you need to work out the composition of the genome you are interested in part by part.
There are dinucleotide and higher-order statistics which are quite predictive. CG dinucleotides, for example, are strongly disfavored in many genomes (vertebrates are the classic case). In E. coli, CTAG sequences are very rare. The statistics are organism dependent. There is also significant bias in the statistics depending on the direction of DNA replication and on the direction of gene transcription (which are often, but not always, aligned).
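A minimal Python sketch of how such a dinucleotide statistic can be computed: the observed frequency of each pair divided by the frequency expected if neighbouring bases were independent. The function name `dinucleotide_odds` and the toy sequence are made up for illustration; on a real genome you would read the sequence from a file instead. Ratios well below 1 mark disfavored pairs.

```python
from collections import Counter

def dinucleotide_odds(seq):
    """Return the observed/expected ratio for each dinucleotide found in seq.

    The expected frequency assumes neighbouring positions are independent,
    i.e. f(xy) = f(x) * f(y), using the sequence's own base frequencies.
    """
    seq = seq.upper()
    n = len(seq)
    base_freq = {base: count / n for base, count in Counter(seq).items()}
    # Overlapping pairs: positions (0,1), (1,2), ..., (n-2, n-1).
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    total_pairs = n - 1
    return {
        pair: (count / total_pairs) / (base_freq[pair[0]] * base_freq[pair[1]])
        for pair, count in pairs.items()
    }

# Toy sequence, invented for illustration only (not real genomic data).
toy = "ATGCATTAGCAATTCCGATTAACATGTT"
ratios = dinucleotide_odds(toy)
for pair in sorted(ratios):
    print(pair, round(ratios[pair], 2))
```

The same idea extends to longer words such as CTAG, although the expected frequency then needs a more careful model than simple independence.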
I reject out of hand the idea that biology is too complex to ever understand. It's chemistry, albeit complex chemistry. And evolution needs modular reliable design as much as we need those tools to understand it.
ok phage, i agree with your first paragraph: there are rare sequences in each organism that can be easily detected and analyzed, and CpG islands are another example. but every time people think they have reached something big, something bigger comes along. think of the human genome: the next thing they asked themselves after sequencing it was "now what?" now there are all these epigenetic phenomena as well, and i believe it just doesn't stop. of course lots of things are understood by now (compared to the last decade or earlier), but you can't just reduce all the "albeit complex chemistry" to statistical models. it reminds me of the movie "contact", which you have probably seen, but that is only science fiction, saying that everything can be reduced to maths... from my point of view, not impossible but improbable.
Well, was my question too general? I will try to reformulate it.
I have 4 nucleotides {a, c, g, t}. I believe you have only 4 as well. Under a model based on a uniform distribution, there is an equal probability (P=0.25) of finding any nucleotide at any position of any genetic sequence. As discussed above, this is not the case. The reasons are biological, but they are outside my scope for now.
The reformulated question could be: what is the difference, from a mathematical point of view, between a purely random distribution of nucleotides in sequences and the real world?
I do not believe that among 6+ billion people no one has proposed some formulas.
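Two standard ways to put a number on "how far from P=0.25" a sequence is are Pearson's chi-square statistic against the uniform model and the Shannon entropy of the base composition (2 bits per base for a perfectly uniform sequence, less for a biased one). The sketch below is only an illustration of these two formulas; the function names and the simulated sequences are assumptions, not anyone's published method, and both measures look at single-base composition only, not at correlations between neighbours.

```python
import math
import random
from collections import Counter

def chi_square_uniform(seq, alphabet="ACGT"):
    """Pearson chi-square statistic of base counts vs. P = 0.25 for each base."""
    counts = Counter(seq.upper())
    n = sum(counts[b] for b in alphabet)
    expected = n / len(alphabet)
    return sum((counts[b] - expected) ** 2 / expected for b in alphabet)

def shannon_entropy(seq, alphabet="ACGT"):
    """Entropy of the base composition in bits per base (maximum is 2.0)."""
    counts = Counter(seq.upper())
    n = sum(counts[b] for b in alphabet)
    h = 0.0
    for b in alphabet:
        if counts[b]:
            p = counts[b] / n
            h -= p * math.log2(p)
    return h

# Simulated sequences, for illustration only (not real genomic data).
rng = random.Random(0)
uniform_seq = "".join(rng.choices("ACGT", k=10_000))
# Hypothetical AT-rich composition: A and T three times as likely as C and G.
biased_seq = "".join(rng.choices("ACGT", weights=[3, 1, 1, 3], k=10_000))

for name, s in [("uniform", uniform_seq), ("AT-rich", biased_seq)]:
    print(name, round(chi_square_uniform(s), 1), round(shannon_entropy(s), 3))
```

With three degrees of freedom, a chi-square value above roughly 7.81 rejects the uniform model at the 5% level. Capturing neighbour effects needs a higher-order model, for example a Markov chain whose transition probabilities are estimated from dinucleotide counts.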
one line of reasoning i can think of right now is codon usage. you know that you need three nucleotides to make a codon, which codes for an amino acid, and a series of amino acids to make a protein. if nucleotides were randomly distributed with all four at the same probability (0.25), then each of the 64 codons would turn up at about the same frequency (1/64). in real genes they don't: synonymous codons are used unequally, and the preferences differ between organisms. nb: this is valid for coding regions; other regions i don't know.
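A small Python sketch of how codon usage could be tallied for a coding sequence and compared with the 1/64 you would expect from uniformly random nucleotides. The helper names and the toy sequence are invented for illustration, and the code assumes the input starts in frame (at the first base of a codon).

```python
from collections import Counter

def codon_counts(cds):
    """Count non-overlapping codons, reading in frame from position 0.

    Any trailing bases that do not fill a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

def relative_codon_freq(cds):
    """Fraction of all codons in cds accounted for by each codon."""
    counts = codon_counts(cds)
    total = sum(counts.values())
    return {codon: count / total for codon, count in counts.items()}

# Toy coding sequence, invented for illustration (not a real gene).
toy_cds = "ATGGCTGCTGCAAAAGAAGCTCTGCTGTAA"
uniform_expectation = 1 / 64
for codon, freq in sorted(relative_codon_freq(toy_cds).items()):
    print(codon, round(freq, 3), round(freq / uniform_expectation, 1))
```

A sequence this short cannot say anything statistically; the comparison only becomes meaningful when codons are pooled over many genes from the same organism.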
even though it doesn't sound statistically likely given the size of the human population nowadays, there are SOOO many things that we have not even imagined... how many of those 6+ billion are dedicated to science or making new discoveries? just take a look around and you will notice how many people prefer others making decisions for them instead of deciding for themselves; then i think your numbers will be significantly reduced.
haha, with so many soccer teams and players, soap operas, tv actresses, wrestling and reality shows...hmmm...i don't think there's time left for thinking about A T C and G