sequence errors created during cloning - (Jul/28/2010)
Hi all,
I am doing relatively simple PCR cloning, using Invitrogen's TOPO TA cloning kit with Mach1 cells. I followed the standard protocol: picking white colonies, growing them up in selective media, miniprepping, and digesting with EcoRI to double-check insert size on an agarose gel. However, when I sent my beautiful, perfectly sized bands to be sequenced, >90% came back with stop codons caused by randomly distributed insertions and/or deletions.
The cDNA message I'm looking at is a member of the cytochrome P450 complex and I am trying to sequence allelic variants. I sequenced the initial PCR products, which had ambiguities (due to different alleles) but no insertions or deletions, so I'm fairly confident this occurred during cloning.
I PCR'd with a Pfu polymerase, so I had to add the 3' A overhang before cloning.
There is no reason to think that this cyp cDNA is harmful to the cells. Even if the protein was expressed, it needs a chaperone to function.
I'm at a loss (technically and financially, now).
Thanks for your help.
lepida on Wed Jul 28 17:26:56 2010 said:
I sequenced the initial PCR products, which had ambiguities (due to different alleles) but no insertions or deletions, so I'm fairly confident this occurred during cloning.
When you sequence a PCR product directly, you're sequencing a population. Since incorporation errors (or single basepair indels) are relatively rare, the sequence generated from the population of amplicons will be a consensus sequence -- you don't get any information about the fidelity from this.
For example, say you had a 1,000-bp sequence that you amplified. If you had 1,000 amplicons, each with an error of incorporation in a different place, they're all individually wrong. But, if you sequence this population, you'll get back perfect sequence -- the error on any individual amplicon at any given location is vastly outnumbered by the correct base in that position on all the other amplicons.
When you clone a PCR product, however, you're capturing one amplicon. Thus, in the experiment above (1,000 amplicons of a 1,000-bp sequence, each with an incorporation error in a different spot), it would be impossible to clone an insert that doesn't have an error, even though the sequence generated from the population is perfect.
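The thought experiment above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the template is random sequence, every amplicon carries exactly one substitution at a unique position, and "sequencing the population" is modelled as a per-position majority vote. The function names are made up for this example.

```python
import random
from collections import Counter

def make_amplicons(template, n, rng):
    """Return n copies of template, each with one substitution at a distinct position."""
    positions = rng.sample(range(len(template)), n)  # n different positions
    amplicons = []
    for pos in positions:
        wrong = rng.choice([b for b in "ACGT" if b != template[pos]])
        amplicons.append(template[:pos] + wrong + template[pos + 1:])
    return amplicons

def consensus(seqs):
    """Majority base at each position: a crude stand-in for direct sequencing of the PCR product."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))

rng = random.Random(0)
template = "".join(rng.choice("ACGT") for _ in range(1000))
amplicons = make_amplicons(template, 1000, rng)

population_read = consensus(amplicons)
clones_with_errors = sum(a != template for a in amplicons)

print("population read matches template:", population_read == template)
print("clones carrying an error:", clones_with_errors, "of", len(amplicons))
```

At every position 999 of the 1,000 amplicons carry the correct base, so the population read is clean, yet every single clone picked from that pool has an error somewhere.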
The other thing to consider is the quality and reliability of the sequence data you generated from your clones. Did you sequence the inserts in both directions? Was the sequence data of good quality (few Ns, good signal strength, etc.)?
Thanks for your response, HomeBrew.
The sequence data was high quality, with very reliable peaks, and I sequenced from both directions (although I don't have complete overlap). Another reason that I think these stop codons were introduced during cloning is that the initial sample was mRNA (reverse-transcribed to cDNA). The chance of all those stop codons being transcribed from DNA to full-length mRNAs seems unlikely to me, especially since the indels are in different locations for each sequenced clone. Have you heard of a case wherein a whole population of gene copies/alleles are pseudogenes? I would appreciate knowing how someone else explained such a situation (especially if they got their dataset published!).
Thanks, again, for any additional info/suggestions.
All stop codons are transcribed into mRNA -- that's how they're there for the ribosome to know when to terminate translation.
It's very unlikely that these stop codons were introduced during cloning, unless the protein encoded is toxic to your cloning host. Even if they were introduced during PCR amplification, or were a consequence of toxicity, why stop codons? In the case of PCR amplification, the errors would be random, and mostly errors of incorporation (wrong base), not indels (missing or extra base). They would not preferentially create stop codons. If the gene were toxic, the same thing holds -- you would only recover genes that had mutated during plasmid replication such that the encoded protein was rendered non-toxic or inactive. There would be no preference for the introduction of stop codons into the reading frame; encoding an erroneous proline or something would serve just as well.
So, if all your clones have unexpected stop codons in them, you need to think about other reasons for this -- the distribution of errors (all stop codons) does not match the mechanisms of error introduction either as a consequence of PCR amplification or of plasmid replication.
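As a rough check on the point that random wrong-base errors rarely create stop codons, one can count how many single-base substitutions turn a sense codon into a stop. The sketch below assumes the standard genetic code and treats every codon and every substitution as equally likely, which real genes and real polymerases do not do, so take the fraction as a ballpark only.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def count_stop_creating_substitutions():
    """Count single-base substitutions of sense codons, and how many yield a stop codon."""
    total = 0
    to_stop = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # only start from sense codons
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                total += 1
                if CODON_TABLE[mutant] == "*":
                    to_stop += 1
    return to_stop, total

to_stop, total = count_stop_creating_substitutions()
print(f"{to_stop} of {total} substitutions create a stop ({to_stop / total:.1%})")
```

Under these assumptions only a few percent of substitutions produce a stop, so a set of clones in which nearly all carry stops is hard to explain by misincorporation alone.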
The reason I think that these are errors in the process and not real stop codons is that they appear to be a result of single nucleotide insertions or deletions in the sequence. About a year ago I successfully cloned this gene from similar samples (same sp. of wild rodent) and got exclusively full-length, non-stop-codoned sequences (about 70 of them) that, when translated, closely aligned with amino acid sequences in GenBank from lab rats. So, I have an idea of how the sequences "should" look and how they should be translated. When I align my old sequences w/ my new, I can pick out the single nucleotide indels that alter the reading frame, ruining the amino acid similarity and resulting in randomly distributed stop codons. These indels are not consistently placed, so most of the 35 sequences have stops in different places. There is no evidence that this gene has so many copies or alleles that there could be 35 different full-length pseudogenes. I apologize for not being entirely clear about this in my previous posts. The only thing I changed in my process was the polymerase...from a non-Pfu (all perfect sequences) to a Pfu polymerase (indels and stops, ~35 seq). Maybe that's the answer.
Thanks for your time, HomeBrew; I really appreciate you thinking about this problem with me (for me?).
If you get that high a frequency of stop codons (as you said, >90%), I would assume your protein is toxic when expressed from this high-copy-number plasmid.
Maybe you can try a low-copy-number backbone (I found a paper where they expressed cytochrome P450 from a pQE vector from Qiagen, see here).
I think that E. coli is not really happy with your sequence: it creates a heavy burden on the cell, and therefore there is strong selection for mutations that introduce stop codons.
Regards,
p
lepida on Wed Aug 4 18:00:10 2010 said:
The reason I think that these are errors in the process and not real stop codons is that they appear to be a result of single nucleotide insertions or deletions in the sequence. About a year ago I successfully cloned this gene from similar samples (same sp. of wild rodent) and got exclusively full-length, non-stop-codoned sequences (about 70 of them) that, when translated, closely aligned with amino acid sequences in GenBank from lab rats. So, I have an idea of how the sequences "should" look and how they should be translated. When I align my old sequences w/ my new, I can pick out the single nucleotide indels that alter the reading frame, ruining the amino acid similarity and resulting in randomly distributed stop codons. These indels are not consistently placed, so most of the 35 sequences have stops in different places. There is no evidence that this gene has so many copies or alleles that there could be 35 different full-length pseudogenes. I apologize for not being entirely clear about this in my previous posts. The only thing I changed in my process was the polymerase...from a non-Pfu (all perfect sequences) to a Pfu polymerase (indels and stops, ~35 seq). Maybe that's the answer.
Thanks for your time, HomeBrew; I really appreciate you thinking about this problem with me (for me?).
Wait -- that changes things. As originally described, it seemed you were saying the mutations *created* stop codons (as in UGG changing to UGA). Now, you're describing frame-shift mutations such that somewhere *downstream* of the mutation, there happens to be a stop codon in the new (wrong) frame. This is entirely different...
This is likely caused by the polymerase.
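To make the frame-shift distinction concrete, here is a small Python sketch. The 24-nt sequence is invented for illustration (it is not from the gene in this thread), and translation uses the standard genetic code. Deleting one base does not convert any codon directly into a stop; the stop appears because the downstream bases are read in a new frame.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate from the first base, stopping at the first stop codon ('*')."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)

# Hypothetical open reading frame: ATG GCC TGG ATA GCT GAA CTG TAA
original = "ATGGCCTGGATAGCTGAACTGTAA"

# Single-base deletion (the C at index 4), as a polymerase or cloning error might produce.
mutant = original[:4] + original[5:]

print(translate(original))  # MAWIAEL*
print(translate(mutant))    # MAG*
```

In the mutant, the bases that used to read ATA G.. are now read as TAG, a premature stop a couple of codons downstream of the deletion. With indels at different positions in each clone, the resulting stops land in different places, which matches the pattern described above.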