Predicting Ethnicity from DNA Sequences with Python [Orange]

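The Coursera course Web Intelligence and Big Data has finally assigned HW7. This time the task is to make predictions on a set of DNA sequences. The full description is as follows: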

Data Analytics Assignment (for HW7)

Predict the Ethnicity of Individuals from their Genes

============================================

It is now possible to get the DNA sequence of an individual at a reasonable cost. An individual's genetic make-up determines a number of characteristics - eye colour, propensity for certain diseases, response to treatment and so on. In this problem, you are given a subset of genetic information for several individuals. For some of the individuals you are also told their ethnicity.

Your task is to figure out the ethnicity of the other individuals.

The information provided is as follows:

1. For each individual the presence (1) or absence (0) of a genetic variation at a particular position on chromosome 6 is provided. In some cases, information for an individual at a particular position is not available and this is represented as ? (missing).

2. Information is provided for approximately 204000 positions. These are your features.

3. The training set has data for 139 individuals along with their ethnicity.

4. The test (prediction) set has data for 11 individuals.

You have to predict the ethnicity for these individuals and enter your answers via HW7.

Data Sets

-----------

The training set is available here: genestrain.tab.zip (6.2 Mb)

The test set is available here: genesblind.tab.zip (1.2 Mb)

File Format

-----------

(Note: Data sets are .tab files in the tab-separated format that can be read into Orange):

Both the training and test data files have a header line which is a tab-separated line of column/feature names. For example, '6_10000005' indicates that the column describes the presence or absence of variations at position 10000005 on chromosome #6.

Entries in the second header line indicate the type of column (in this case all features are 'discrete').

Entries in the third header line indicate the nature of each column: a ' ' for most columns that contain a feature, and 'class' for the first column as it contains the actual class labels (i.e., ethnicities of the individuals in each row).

These header lines are followed by lines containing feature values (0, 1, or ?) for each genetic feature of an individual.

In the training set file the first column, which denotes the class label, is a three-letter code with one of the following values:

o CEU is Northern and Western European

o GIH is Gujarati Indian from Houston

o JPT is Japanese in Tokyo

o ASW is Americans of African Ancestry

o YRI is Yoruba in Ibadan, Nigeria

In the test file the ethnicity column also exists but is blank.
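The three-header layout described above can be read without Orange. The following is a minimal sketch using only the Python standard library; the function name `read_orange_tab`, the column name `ethnicity`, and the sample values are hypothetical and only illustrate the layout. It assumes a blank or `?` label means "unknown".

```python
import csv
import io

def read_orange_tab(handle):
    """Parse an Orange-style .tab file: three header lines, then data rows."""
    reader = csv.reader(handle, delimiter="\t")
    names = next(reader)                        # line 1: column names, e.g. '6_10000005'
    types = next(reader)                        # line 2: column types ('discrete' here)
    roles = [r.strip() for r in next(reader)]   # line 3: 'class' marks the label column
    class_idx = roles.index("class")
    labels, rows = [], []
    for record in reader:
        if not record:
            continue
        label = record[class_idx]
        # Assumption: a blank or '?' label means the ethnicity is unknown.
        labels.append(None if label in ("", "?") else label)
        # Map '0'/'1' to ints and '?' (missing) to None.
        rows.append([None if v in ("", "?") else int(v)
                     for i, v in enumerate(record) if i != class_idx])
    feature_names = [n for i, n in enumerate(names) if i != class_idx]
    return feature_names, labels, rows

# Tiny made-up sample in the same layout (values are illustrative only).
sample = (
    "ethnicity\t6_10000005\t6_10000006\n"
    "discrete\tdiscrete\tdiscrete\n"
    "class\t\t\n"
    "CEU\t0\t1\n"
    "YRI\t1\t?\n"
)
feature_names, labels, rows = read_orange_tab(io.StringIO(sample))
print(feature_names, labels, rows)
```

With the real files, the same function could be called on an opened `genestrain.tab`, though about 204000 columns per row will make a pure-Python parse slow.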

=========================

For the purposes of your HW answer alone, each three letter code is to be marked with a NUMERIC VALUE as indicated in the table below:

o CEU is Northern and Western European - 0

o GIH is Gujarati Indian from Houston - 1

o JPT is Japanese in Tokyo - 2

o ASW is Americans of African Ancestry - 3

o YRI is Yoruba in Ibadan, Nigeria - 4

YOU MUST USE THE ABOVE NUMERIC VALUES TO ENCODE YOUR ANSWER. Note: This numeric value does not appear in the test or training data.

Task: For each of the individuals in the test file, predict their ethnicity as CEU, GIH, JPT, ASW or YRI and enter your answers in HW7 in exactly the order that the 11 individuals appear in the test file.

So, for example, if your prediction is CEU, GIH, JPT, ASW, YRI, CEU, GIH, JPT, ASW, YRI, CEU, you should enter your answer as 0 1 2 3 4 0 1 2 3 4 0 (i.e. numbers separated by a space - no commas, tabs or anything else, just a space between single-digit numbers).
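The answer encoding is easy to get wrong by hand, so here is a small sketch that turns a list of predicted codes into the required answer string. The names `CODES` and `encode_answer` are hypothetical; the mapping itself is the one given in the table above.

```python
# Numeric value for each three-letter code, as given in the assignment table.
CODES = {"CEU": 0, "GIH": 1, "JPT": 2, "ASW": 3, "YRI": 4}

def encode_answer(predictions):
    """Join the numeric codes with single spaces, in the order given."""
    return " ".join(str(CODES[p]) for p in predictions)

# The example prediction sequence from the assignment text.
example = ["CEU", "GIH", "JPT", "ASW", "YRI",
           "CEU", "GIH", "JPT", "ASW", "YRI", "CEU"]
answer = encode_answer(example)
print(answer)  # 0 1 2 3 4 0 1 2 3 4 0
```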

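However, many people complained in the discussion forum that the instructor had not explained the problem clearly (mainly, he did not say how to approach it) and had not provided a walkthrough video either. Fortunately, once the data is downloaded it turns out to contain one training set and one prediction set, so the only sensible approach is to train first and then predict.

The training set is a .tab file.

The class column gives the ethnicity (there are 139 rows, one per training example), and the remaining columns are the DNA positions (about 200,000 of them).

The prediction set has the same layout.

Its first column, filled with question marks, is what has to be predicted: the ethnicity of 11 individuals in total.

A naive Bayes classifier is enough for this problem, and the code is short: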

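# Description: Read data, build a naive Bayesian classifier, and classify the test instances
# Category: modelling
# Uses: genestrain.tab
# Predict: genesblind.tab
# Referenced: c_basics.htm

import orange  # legacy Orange 2.x module

# The .tab extension is omitted, as in the original script.
data = orange.ExampleTable("genestrain")    # training set with class labels
data2 = orange.ExampleTable("genesblind")   # prediction set with unknown labels

# Calling the learner with the training data returns a trained classifier.
classifier = orange.BayesLearner(data)

# Print the predicted class of each individual, in file order.
for i, item in enumerate(data2):
    c = classifier(item)
    print("%d: %s " % (i, c))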

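As you can see, the training data is first used to train a classifier, which is then applied to each row of the prediction data to print a result. The idea is straightforward. The only drawback is that run time and resource usage grow quickly once the data gets a little larger: this task needed about 1 GB of memory and about 10 minutes to run.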

The final output gives the predicted ethnicity for each of the 11 individuals. Fill in the answers and the homework is done.
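To make the train-then-predict idea concrete, here is a minimal pure-Python sketch of a naive Bayes classifier for binary features with missing values. It is not Orange's implementation: the function names, the Laplace smoothing parameter `alpha`, and the choice to skip missing values are all assumptions made for illustration, and the toy data is made up.

```python
import math
from collections import Counter

def train_naive_bayes(rows, labels, alpha=1.0):
    """Count, per class and feature, how often each value (0 or 1) occurs."""
    class_counts = Counter(labels)
    n_features = len(rows[0])
    # counts[c][j][v] = number of class-c rows whose feature j equals v
    counts = {c: [Counter() for _ in range(n_features)] for c in class_counts}
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            if v is not None:          # assumption: missing values are skipped
                counts[c][j][v] += 1
    return class_counts, counts, alpha

def predict(model, row):
    """Return the class with the highest log posterior for one row."""
    class_counts, counts, alpha = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)  # log prior
        for j, v in enumerate(row):
            if v is None:
                continue
            seen = sum(counts[c][j].values())
            # Laplace-smoothed likelihood; features are binary, hence 2 * alpha.
            score += math.log((counts[c][j][v] + alpha) / (seen + 2 * alpha))
        if score > best_score:
            best, best_score = c, score
    return best

# Made-up toy data: two classes, three binary features, one missing value.
rows = [[1, 0, 1], [1, 0, None], [0, 1, 0], [0, 1, 1]]
labels = ["A", "A", "B", "B"]
model = train_naive_bayes(rows, labels)
print(predict(model, [1, 0, None]))
print(predict(model, [0, 1, 1]))
```

Working in log space avoids underflow, which matters when roughly 200,000 per-feature probabilities are multiplied together.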

This is probably the last programming assignment of the course. Only an online final exam remains, so let's get the course finished.
