
Computing GC Content

The GC-content of a DNA string is given by the percentage of symbols in the string that are ‘C’ or ‘G’. For example, the GC-content of “AGCTATAG” is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
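
As a quick check of the definition above, here is a minimal Java sketch. The class and method names (`GcContentSketch`, `gcContent`, `reverseComplement`) are illustrative assumptions, not part of the Rosalind statement. It computes the GC-content of the example string and of its reverse complement.

```java
// Hypothetical helper class; names are illustrative, not from Rosalind.
class GcContentSketch {

    // Percentage of symbols in dna that are 'G' or 'C' (0.0 for an empty string).
    static double gcContent(String dna) {
        if (dna.isEmpty()) {
            return 0.0;
        }
        int gc = 0;
        for (int i = 0; i < dna.length(); i++) {
            char c = dna.charAt(i);
            if (c == 'G' || c == 'C') {
                gc++;
            }
        }
        return 100.0 * gc / dna.length();
    }

    // Reverse complement: read the string backwards and swap A<->T and C<->G.
    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            switch (dna.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'T': sb.append('A'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                default:  sb.append(dna.charAt(i)); // leave unknown symbols untouched
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String dna = "AGCTATAG";
        System.out.println(gcContent(dna));                    // 37.5
        System.out.println(gcContent(reverseComplement(dna))); // 37.5
    }
}
```

Complementing swaps every G with a C and vice versa, and reversing only changes the order, so the combined count of G and C stays the same.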

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with ‘>’, followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with ‘>’ indicates the label of the next string.

In Rosalind’s implementation, a string in FASTA format will be labeled by the ID “Rosalind_xxxx”, where “xxxx” denotes a four-digit code between 0000 and 9999.
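
The FASTA layout described above can be parsed with a small loop. The following is a minimal sketch, assuming the input is already available as a list of lines. The class name `FastaSketch`, the method `parse`, and the two records in `main` are made up for illustration and are not Rosalind data.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; names and sample records are illustrative only.
class FastaSketch {

    // Maps each label (without the leading '>') to its sequence,
    // joining sequence lines that were wrapped across several lines.
    static Map<String, String> parse(List<String> lines) {
        Map<String, StringBuilder> records = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (line.startsWith(">")) {
                // A '>' line starts a new record.
                current = new StringBuilder();
                records.put(line.substring(1).trim(), current);
            } else if (current != null) {
                // Any other line continues the current record's sequence.
                current.append(line.trim());
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : records.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            ">Rosalind_0001", "ACGT", "TTGA", ">Rosalind_0002", "GGCC");
        System.out.println(parse(lines)); // {Rosalind_0001=ACGTTTGA, Rosalind_0002=GGCC}
    }
}
```

A `LinkedHashMap` keeps the records in file order, which makes the output easier to compare against the input.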

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Sample input


Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
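
To illustrate the error tolerance, here is a minimal sketch of how an answer could be formatted and compared against an expected value. `AnswerFormatSketch`, `withinTolerance`, and `format` are hypothetical names, and the ID and numbers in `main` are made up. Rosalind's own checker is not shown here.

```java
import java.util.Locale;

// Hypothetical helper class; not part of Rosalind.
class AnswerFormatSketch {

    // True when the absolute difference between the two values is at most tol.
    static boolean withinTolerance(double expected, double actual, double tol) {
        return Math.abs(expected - actual) <= tol;
    }

    // Two-line answer: the ID, then the GC-content with six decimal places.
    static String format(String id, double gcPercent) {
        return id + "\n" + String.format(Locale.ROOT, "%.6f", gcPercent);
    }

    public static void main(String[] args) {
        System.out.println(format("Rosalind_0001", 37.5));
        System.out.println(withinTolerance(37.5, 37.5004, 0.001)); // true
        System.out.println(withinTolerance(37.5, 37.502, 0.001));  // false
    }
}
```

Printing six decimal places is more than the 0.001 tolerance requires, so rounding in the last digits does not affect whether the answer is accepted.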
Sample output


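GC content is the proportion of guanine and cytosine among all bases in the genome under study, and it affects the stability of DNA. This problem asks us to read a FASTA file containing multiple sequences, compute the GC content of each one, and output the header of the sequence with the highest GC content together with its GC content value. The approach is as follows: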

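  • 1. Read the FASTA file line by line, removing the line breaks inside each sequence.
  • 2. Compute the GC content of each sequence.
  • 3. Compare the GC content of the sequences and find the maximum.
  • 4. Output the label of the sequence with the highest GC content, together with its GC content.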


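  • Sub-method 1, BufferedReader, reads the FASTA file and removes the line breaks within each sequence. For details, see: "Reading a FASTA file with line indentation line by line in Java".
  • Sub-method 2, FindMaxCount, returns the highest GC content value.
  • Sub-method 3, FindIndex, returns the index of the sequence with the highest GC content, which is used to print the matching header line. The full code follows.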
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

public class Computing_GC_Content {
    public static void main(String[] args) {
        // 1. Read the FASTA file line by line.
        ArrayList<String> fasta = BufferedReader("C:/Users/Administrator/Desktop/rosalind_gc.txt", "fasta"); // base sequences
        ArrayList<String> tag = BufferedReader("C:/Users/Administrator/Desktop/rosalind_gc.txt", "tag"); // label lines
        // 2. Compute the GC content of each sequence.
        float[] GCratio = new float[fasta.size()]; // GC ratio of each sequence
        int[] GCcount = new int[fasta.size()]; // number of G and C bases in each sequence
        for (int i = 0; i < fasta.size(); i++) {
            for (int j = 0; j < fasta.get(i).length(); j++) {
                // Java does not support the shorthand fasta.get(i).charAt(j) == 'G' || 'C'.
                if (fasta.get(i).charAt(j) == 'G' || fasta.get(i).charAt(j) == 'C') {
                    GCcount[i]++;
                }
            }
            GCratio[i] = (float) GCcount[i] / fasta.get(i).length();
        }
        // 3. Compare the sequences and find the maximum GC content,
        //    converted from a ratio to a percentage rounded to six decimal places.
        float maxcount = (float) (Math.round(FindMaxCount(GCratio) * 100000000)) / 1000000;
        int maxIndex = FindIndex(GCratio);
        // 4. Output the label of the sequence with the highest GC content and its value.
        System.out.println(tag.get(maxIndex).substring(1)); // drop the leading '>' so only the ID is printed
        System.out.println(maxcount);
    }

    // Sub-method 1: read the FASTA file into a list of sequences and a list of labels.
    // Returns the labels when choose is "tag", otherwise the sequences.
    public static ArrayList<String> BufferedReader(String path, String choose) {
        ArrayList<String> tag = new ArrayList<String>();
        ArrayList<String> fasta = new ArrayList<String>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line = reader.readLine();
            StringBuilder sb = new StringBuilder();
            while (line != null) {
                // A line starting with '>' is a label line (replaces the original regex match).
                if (line.startsWith(">")) {
                    tag.add(line);
                    // The StringBuilder is empty at the first label, so skip that special case.
                    if (sb.length() != 0) {
                        String seq = sb.toString(); // sequence with its line breaks removed
                        fasta.add(seq);
                        sb.delete(0, sb.length()); // clear the StringBuilder
                    }
                } else {
                    sb.append(line); // keep accumulating the current sequence
                }
                line = reader.readLine(); // read next line
            }
            // Add the last sequence after the loop ends, otherwise it would be lost.
            String seq = sb.toString();
            fasta.add(seq);
        } catch (IOException e) {
            e.printStackTrace();
        }
        if (choose.equals("tag")) {
            return tag;
        }
        return fasta;
    }

    // Sub-method 2: return the largest GC content value.
    public static float FindMaxCount(float[] arr) {
        float max = arr[0];
        for (int x = 1; x < arr.length; x++) {
            if (max < arr[x]) {
                max = arr[x];
            }
        }
        return max;
    }

    // Sub-method 3: return the index of the sequence with the largest GC content.
    public static int FindIndex(float[] arr) {
        float max = arr[0];
        int maxValIndex = 0;
        for (int x = 1; x < arr.length; x++) {
            if (max < arr[x]) {
                max = arr[x];
                maxValIndex = x;
            }
        }
        return maxValIndex;
    }
}



