ATCG content calculation - (Jul/06/2009 )
Hello everyone,
Is there a program can calculate the content of a,t,c and g of each gene in E. coli ?
I want to know the frequency of each nucleotide.
Thank you very much.
Find the sequence from NCBI nucleotide database and paste it here http://www.protocol-online.org/tools/sms2/dna_stats.html
Perl can do it fairly easily.  Here's a quick-and-dirty script:
#!/usr/bin/perl -w
use strict;
my $input_file = "e.coli.nuc";
my $output_file = "e.coli.freq.txt";
my $count = 0;
open (IN, $input_file) or die "Can't find a file called $input_file! $!\n";
open (OUT, ">$output_file") or die "Can't open $output_file! $!\n";
$/ = ">";
while (<IN>) {
	my ($gene, $description, $seq) = ($_ =~ /(.+?)\s+(.+?)\n([aAtTgGcCnN\n]+)/);
	next unless ($gene);
	$seq =~ s/\n//g;
	my $A = ($seq =~ tr/Aa/Aa/);
	my $G = ($seq =~ tr/Gg/Gg/);
	my $C = ($seq =~ tr/Cc/Cc/);
	my $T = ($seq =~ tr/Tt/Tt/);
	my $N = ($seq =~ tr/Nn/Nn/);
	my $bp = length($seq);
	my $gc = sprintf("%0.1f", 100 * (($G + $C)/$bp));
	print OUT ">$gene $description\n\tA = $A, G = $G, C = $C, T = $T, N = $N, total bp = $bp, %GC = $gc\n";
	$count++;
}
print "Finished $count genes from $input_file\n";
The script takes as input a Fasta formatted file of DNA sequences, in this case "e.coli.nuc", which is the Escherichia coli K-12 MG1655 DNA ORFs dowloaded from ftp://ftp.genome.jp/pub/kegg/genes/organisms/, in the 'eco' directory.
There are actually 24 E. coli genomes available there, identifiable by their three-letter code:
T00007(1997)	eco	E.coli	Escherichia coli K-12 MG1655
T00068(2001)	ecj	E.coli_J	Escherichia coli K-12 W3110
T00666(2008)	ecd	E.coli_DH10B	Escherichia coli K-12 DH10B
T00913(2009)	ebw	E.coli_BW2952	Escherichia coli BW2952
T00044(2001)	ece	E.coli_O157	Escherichia coli O157 EDL933 (EHEC)
T00048(2001)	ecs	E.coli_O157J	Escherichia coli O157 Sakai (EHEC)
T00778(2008)	ecf	E.coli_O157_EC4115	Escherichia coli O157 EC4115 (EHEC)
T00106(2002)	ecc	E.coli_CFT073	Escherichia coli CFT073 (UPEC)
T00338(2006)	eci	E.coli_UTI89	Escherichia coli UTI89 (UPEC)
T00373(2006)	ecp	E.coli_536	Escherichia coli 536 (UPEC)
T00425(2006)	ecv	E.coli_APEC	Escherichia coli APEC O1
T00590(2007)	ecw	E.coli_E24377A	Escherichia coli E24377A
T00591(2007)	ecx	E.coli_HS	Escherichia coli HS
T00672(2008)	ecm	E.coli_SECEC	Escherichia coli SECEC
T00784(2008)	ecy	E.coli_SE11	Escherichia coli SE11
T00697(2008)	ecl	E.coli_ATCC8739	Escherichia coli ATCC 8739
T00796(2008)	ecg	E.coli_0127	Escherichia coli O127
T00826(2009)	eck	E.coli_55989	Escherichia coli 55989
T00827(2009)	ecq	E.coli_ED1a	Escherichia coli ED1a
T00828(2009)	ecr	E.coli_IAI1	Escherichia coli IAI1
T00829(2009)	ect	E.coli_IAI39	Escherichia coli IAI39
T00830(2009)	ecz	E.coli_S88	Escherichia coli S88
T00831(2009)	eum	E.coli_UMN026	Escherichia coli UMN026
T00854(2009)	elf	E.coli_LF82	Escherichia coli LF82
You can format the print out anyway you like by changing the "print" line.  As written, the script outputs the data to a file called "e.coli.freq.txt", but you're free to change the output or input files as you like.
On my machine, it takes less than a second to give you the base frequences, total base count, and %GC for all 4.493 genes from this E. coli genome.
Oh, that's great!
It really helps.
Thank you, HomeBrew and Bioforum.
Dear all
I want to use perl to calculate the percentage of ATCG content of dna, can you help me?
I want to do this kind of research in the future
=?% 
Thank you very much.