STATS 4014
Advanced Data Science
Assignment 3
Jono Tuke
Semester 1 2019
CHECKLIST
- Have you shown all of your working, including probability notation where necessary?
- Have you given all numbers to 3 decimal places unless otherwise stated?
- Have you included all R output and plots to support your answers where necessary?
- Have you included all of your R code?
- Have you made sure that all plots and tables each have a caption?
- If before the deadline, have you submitted your assignment via the online submission on MyUni?
- Is your submission a single pdf file - correctly orientated, easy to read? If not, penalties apply.
- Penalties for more than one document - 10% of final mark for each extra document. Note that you may resubmit and your final version is marked, but the final document should be a single file.
- Penalties for late submission - within 24 hours 40% of final mark. After 24 hours, assignment is not marked and you get zero.
- Assignments emailed instead of submitted by the online submission on MyUni will not be marked and will receive zero.

- Have you checked that the assignment submitted is the correct one, as we cannot accept other submissions after the due date?
Due date: Friday 3rd May 2019 (Week 7), 5pm.
Q1. Bayesian connection to lasso and ridge regression
a. Suppose that
$$Y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \epsilon_i,$$
where $\epsilon_i \sim$ iid $N(0, \sigma^2)$.
Write the likelihood for the data.
b. Let $\beta_j$, $j = 1, \ldots, p$, have priors that are iid with density
$$p(\beta_j) = \frac{1}{2b} \exp\left(-\frac{|\beta_j|}{b}\right),$$
i.e., they are iid with a double-exponential (Laplace) distribution with mean 0 and common scale parameter $b$.
Write out the posterior for β given the likelihood in Part a. Show that the lasso estimate is the mode
of the posterior.
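As a rough guide to the shape of the argument (a sketch only, assuming $\sigma^2$ is known, a flat prior on $\beta_0$, and writing $\lambda$ for the lasso penalty, none of which is stated in the question), multiplying the normal likelihood by the double-exponential prior and taking the negative log gives:

```latex
\begin{aligned}
p(\beta \mid y) &\propto
  \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}
    \Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2\right)
  \prod_{j=1}^{p}\frac{1}{2b}\exp\!\left(-\frac{|\beta_j|}{b}\right), \\
-\log p(\beta \mid y) &=
  \frac{1}{2\sigma^2}\left[\sum_{i=1}^{n}
    \Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2
    + \lambda\sum_{j=1}^{p}|\beta_j|\right] + \text{const},
  \qquad \lambda = \frac{2\sigma^2}{b}.
\end{aligned}
```

Maximising the posterior is the same as minimising the bracketed term, which has the form of the lasso objective. A full answer still needs to justify each step.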
c. Let $\beta_j$, $j = 1, \ldots, p$, have priors that are iid normal with mean zero and variance $c$.
Write the posterior of $\beta_j$, $j = 1, \ldots, p$. Hence show that the ridge regression estimate is both the mean and the
mode of the posterior.
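The claim in Part c can be sanity-checked numerically before attempting the proof. The sketch below is illustrative only: it uses simulated, centred data (so the intercept drops out), treats $\sigma^2$ as known, and picks an arbitrary prior variance; the names `sigma2`, `c_var` and `neg_log_post` are assumptions made for this example. It compares the closed-form ridge estimate with penalty $\lambda = \sigma^2 / c$ against a numerical minimiser of the negative log posterior.

```r
set.seed(1)
n <- 50
p <- 3

# Simulated, centred predictors and response (made-up data)
X <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)
beta_true <- c(2, -1, 0.5)
y <- drop(X %*% beta_true + rnorm(n))
y <- y - mean(y)

sigma2 <- 1     # noise variance, treated as known (assumption)
c_var  <- 0.5   # prior variance c (arbitrary choice)
lambda <- sigma2 / c_var

# Closed-form ridge estimate: (X'X + lambda I)^{-1} X'y
beta_ridge <- drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))

# Negative log posterior, up to an additive constant
neg_log_post <- function(beta) {
  sum((y - X %*% beta)^2) / (2 * sigma2) + sum(beta^2) / (2 * c_var)
}

# Posterior mode found numerically
beta_map <- optim(rep(0, p), neg_log_post, method = "BFGS")$par

# Largest absolute difference between the two estimates
max(abs(beta_ridge - beta_map))
```

The two estimates should agree up to optimiser tolerance. Because the exponent is quadratic in $\beta$, the posterior is normal, which is why the mode is also the mean.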
Q2. Using data.table
In the following, you are advised to use data.table. Trying to use standard data manipulation may crash
your computer or take too long. The data in DNA_combined.csv is real data on DNA methylation in modern
and ancient DNA samples.
Each row in the dataset is a segment of DNA for which we have the following information:
chr: the chromosome the segment is from,
pos: the starting position of the segment on the chromosome,
N: the length of the segment in number of bases,
X: the number of the bases that are methylated,
type: whether the DNA is modern or ancient, and
ID: the ID of the individual that the DNA is from.
Also we have a spreadsheet of metadata given in Data_Info.xlsx. Each row is an individual and we have
the following information:
Filename: the filename of the compressed file that had the data. I used this to get the samples for you,
SampleID: the ID of each individual,
Sex: the gender of the individual,
Tissue: the area of the body that the DNA was extracted from,
Type: whether the DNA is modern or ancient, and
Age_kyr: the age of the individual in thousands of years.
Our goal is to find, for each tissue / type combination, the proportion of samples whose proportion
of methylation is higher than the mean for that tissue / type combination.
Perform the following steps:
a. Read in both datasets.
b. Rename the SampleID column to ID in the metadata data.table.
c. Find which sample IDs are repeated in the metadata.
d. Remove from the metadata any samples that are not Hairpin.
e. What is the total number of samples? What is the total number of modern samples and the total
number of ancient samples?
f. Calculate the proportion of methylation for each sample.
g. What is the total number of samples for each combination of tissue and type?
h. Calculate the mean proportion of methylation for each combination of tissue and type.
i. What proportion of samples have a methylation proportion greater than the mean proportion of
methylation for each tissue / type combination?
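The data.table idioms that several of these steps rely on can be sketched on toy data. Everything below is illustrative: the values are made up, only a subset of the real columns is used, and "proportion of methylation for each sample" is read as total methylated bases over total bases, which is one possible interpretation. With the real files, `fread("DNA_combined.csv")` would replace the hand-built table.

```r
library(data.table)

# Toy stand-ins for the two datasets (made-up values)
dna <- data.table(
  ID   = c("s1", "s1", "s2", "s2", "s3", "s3"),
  N    = c(10, 30, 20, 20, 25, 15),
  X    = c(2, 6, 10, 10, 5, 3),
  type = c("modern", "modern", "modern", "modern", "ancient", "ancient")
)
meta <- data.table(
  SampleID = c("s1", "s2", "s3", "s3"),
  Tissue   = c("bone", "bone", "hair", "hair")
)

# Rename a column by reference (no copy of the table is made)
setnames(meta, "SampleID", "ID")

# IDs that appear more than once in the metadata
repeated_ids <- meta[, .N, by = ID][N > 1, ID]

# Per-sample methylation proportion, aggregated by group
samples <- dna[, .(p_meth = sum(X) / sum(N)), by = .(ID, type)]

# Attach tissue, dropping duplicated metadata rows first
samples <- merge(samples, unique(meta[, .(ID, Tissue)]), by = "ID")

# Group mean added by reference, then the share of samples above it
samples[, group_mean := mean(p_meth), by = .(Tissue, type)]
above <- samples[, .(n = .N, prop_above = mean(p_meth > group_mean)),
                 by = .(Tissue, type)]
above
```

Grouped aggregation with `by` and in-place updates with `:=` avoid copying the full table, which is the main reason data.table copes with data of this size.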
Q3. Webscraping
In this question, we are going to webscrape data from the Internet Movie Database (IMDb). As before, there are marks
for webscraping and cleaning the dataset, but if you prefer not to do this, the cleaned dataset is provided.
a. Webscraping the data. The main package for webscraping is rvest:
https://rvest.tidyverse.org/
Also, the Chrome extension SelectorGadget is really useful for identifying the parts of the webpage that contain
the information:
https://selectorgadget.com/
https://rvest.tidyverse.org/articles/selectorgadget.html
I have written a template function to start you off based on the following tutorial:
https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/
The function is
## Get libs ----
pacman::p_load(
  rvest, tidyverse, glue
)

#' get top 100
#'
#' Take a year and get information from imdb on the top 100 movies for that year.
#'
#' @param year year to get the data from
#'
#' @return data frame with the information
#'
#' @author Jono Tuke
#'
#' Wednesday 27 Mar 2019
get_top_100 <- function(year){
  # Create url for the given year split to make easier reading
  url <- glue("https://www.imdb.com/search/title?",
              "count=100&release_date={year},{year}",
              "&title_type=feature")

  # Read in the webpage
  html <- try(read_html(url))
  if('try-error' %in% class(html)){
    cat("Cannot load webpage", url, "\n")
    return(NA)
  }

  # Get title of movies
  titles <- html %>%
    html_nodes(".lister-item-header a") %>%
    html_text()

  # Ratings
  ratings <-
    html %>%
    html_nodes(".ratings-imdb-rating strong") %>%
    html_text()

  ## Put together
  info <- tibble(
    year = year,
    title = titles,
    rating = ratings
  )
  return(info)
}
At present, it gets only year, title and ratings for the top 100 movies for a given year.
Write a function that will get the following
## # A tibble: 6 x 11
## year title description runtimes genre rating vote director actors
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1980 The ~ "\n A f~ 146 min "\nD~ 8.4 768,~ Stanley~ Jack ~
## 2 1980 Star~ "\n Aft~ 124 min "\nA~ 8.8 1,03~ Irvin K~ Mark ~
## 3 1980 The ~ "\n Jak~ 133 min "\nA~ 7.9 165,~ John La~ John ~
## 4 1980 Flyi~ "\n A m~ 88 min "\nC~ 7.8 188,~ Jim Abr~ Rober~
## 5 1980 Flas~ "\n A f~ 111 min "\nA~ 6.5 45,2~ Mike Ho~ Sam J~
## 6 1980 The ~ "\n In ~ 104 min "\nA~ 5.7 57,7~ Randal ~ Brook~
## # ... with 2 more variables: metascore <chr>, gross <chr>
Then webscrape the data for 1980 to 2018 inclusive.
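One way to run a per-year scraper over a range of years is sketched below. It is an illustration, not the expected solution: `scrape_years()` and its `pause` argument are hypothetical, and `fake_top_100()` is an offline stand-in that mimics the template's behaviour of returning a data frame on success and `NA` on failure, so the pattern can be tried without hitting the website.

```r
# Hypothetical offline stand-in for get_top_100(): returns a small data
# frame, or NA for one year to mimic the template's failure path.
fake_top_100 <- function(year) {
  if (year == 1982) return(NA)
  data.frame(
    year   = year,
    title  = paste("Movie", 1:2),
    rating = c("7.1", "6.5"),
    stringsAsFactors = FALSE
  )
}

# Apply a scraper to each year, drop failed years, and stack the rest.
scrape_years <- function(years, scraper, pause = 0) {
  results <- lapply(years, function(y) {
    Sys.sleep(pause)  # wait between requests to be polite to the server
    scraper(y)
  })
  ok <- vapply(results, is.data.frame, logical(1))
  do.call(rbind, results[ok])
}

movies <- scrape_years(1980:1983, fake_top_100)
movies
```

With the real function the call would look like `scrape_years(1980:2018, get_top_100, pause = 1)`. A tibble is also a data frame, so the `is.data.frame()` check still applies.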
b. Cleaning the data. I will leave the decisions on the cleaning to you, but so that you know - I kept the
top 20 most prolific directors and top 20 most prolific actors - the rest became Other. Also I created a
boolean column for each genre.
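The kinds of cleaning mentioned here can be sketched in base R on a few made-up rows shaped like the scraped output. The values, the helper name `lump_top()`, and the choice of `k = 1` are assumptions for illustration; the assignment's own cleaning kept the top 20.

```r
# Made-up rows in the same shape as the scraped columns
raw <- data.frame(
  runtimes = c("146 min", "124 min", "88 min"),
  vote     = c("768,000", "1,030,000", "188,000"),
  genre    = c("\nDrama, Horror      ", "\nAction, Adventure      ", "\nComedy      "),
  director = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

clean <- raw

# "146 min" -> 146 and "1,030,000" -> 1030000
clean$runtimes <- as.numeric(sub(" min$", "", raw$runtimes))
clean$vote     <- as.numeric(gsub(",", "", raw$vote))

# One logical column per genre
genres <- strsplit(trimws(raw$genre), ",\\s*")
for (g in sort(unique(unlist(genres)))) {
  clean[[g]] <- vapply(genres, function(x) g %in% x, logical(1))
}

# Keep the k most frequent levels and lump the rest into "Other"
lump_top <- function(x, k) {
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(k, length(counts)))]
  ifelse(x %in% keep, x, "Other")
}
clean$director <- lump_top(raw$director, k = 1)

clean
```

The same steps can be written with tidyverse verbs; base R is used here only to keep the sketch free of extra dependencies.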
c. Which movies are the most highly rated and the most lowly rated?
d. Which director has the highest mean rating?
e. Fit a lasso regression to predict rating with the following predictors:
year,
runtimes,
vote,
metascore,
gross, and
Animation [1].
What is the best model? What is the first coefficient to be shrunk to zero as λ increases, and what is
the last coefficient?
[1] Just because I am obsessed with animation.
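The mechanics of Part e can be sketched with the glmnet package on simulated data. The predictors `x1` to `x3` and their effect sizes are made up, and choosing `lambda.min` as "best" is one common convention rather than the required answer (`lambda.1se` is another). For each coefficient, the sketch records the largest λ at which it is still nonzero along the path: the smallest such value belongs to the first coefficient to reach zero as λ increases, and the largest to the last.

```r
library(glmnet)

set.seed(42)
n <- 200
x <- matrix(rnorm(n * 3), n, 3,
            dimnames = list(NULL, c("x1", "x2", "x3")))
# x1 has a strong effect, x2 a weak one, x3 none (made-up data)
y <- 3 * x[, "x1"] + 0.5 * x[, "x2"] + rnorm(n)

# Cross-validation to choose lambda; alpha = 1 gives the lasso
cv_fit <- cv.glmnet(x, y, alpha = 1)
coef(cv_fit, s = "lambda.min")

# Full regularisation path
path <- glmnet(x, y, alpha = 1)
nonzero <- as.matrix(path$beta) != 0

# Largest lambda at which each coefficient is still nonzero
# (NA if it never enters the model along the fitted path)
largest_lambda <- apply(nonzero, 1, function(nz) {
  if (any(nz)) max(path$lambda[nz]) else NA_real_
})
sort(largest_lambda)
```

`plot(path, xvar = "lambda", label = TRUE)` shows the same ordering graphically. glmnet needs a numeric matrix, so the real predictors must be converted to numeric first.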
Mark scheme

Part   Marks  Difficulty  Area         Type            Comments
Q1
1a     4      0.00        Lasso/ridge  proof           4 for derivation
1b     7      0.29        Lasso/ridge  proof           5 for derivation; 2 for justification
1c     7      0.29        Lasso/ridge  proof           5 for derivation; 2 for justification
Total  18
Q2
2a     1      0.00        data.table   analysis        1 for coding
2b     1      1.00        data.table   analysis        1 for coding
2c     2      0.50        data.table   analysis        2 for coding
2d     1      0.00        data.table   analysis        1 for coding
2e     2      0.00        data.table   analysis        2 for coding
2f     1      0.00        data.table   analysis        1 for coding
2g     4      0.50        data.table   analysis        4 for coding
2h     2      0.00        data.table   analysis        2 for coding
2i     5      0.60        data.table   analysis        2 for coding; 3 for overall presentation of code in this Q
Total  19
Q3
3a     7      0.29        Lasso/ridge  coding          5 for coding; 2 for quality of code
3b     10     0.20        Lasso/ridge  analysis        5 for code; 5 for explanation of code
3c     2      0.00        Lasso/ridge  analysis        2 for code
3d     2      0.00        Lasso/ridge  analysis        2 for code
3e     8      0.38        Lasso/ridge  interpretation  4 for coding; 4 for interpretation of results
Total  29

Assignment total: 66
