Processing Spatiotemporal Traffic Big Data with R

Import public NYC taxi and Uber trip data into PostgreSQL / PostGIS database, analyze with R

Link: https://github.com/toddwschneider/nyc-taxi-data

Unified New York City Taxi and Uber data

Code in support of this post: Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance

This repo provides scripts to download, process, and analyze data for over 1.1 billion taxi and Uber trips originating in New York City. The data is stored in a PostgreSQL database, and uses PostGIS for spatial calculations, in particular mapping latitude/longitude coordinates to census tracts.

The yellow and green taxi data comes from the NYC Taxi & Limousine Commission, and Uber data comes via FiveThirtyEight, who obtained it via a FOIL request.

Instructions

Your mileage may vary, but on my MacBook Air, this process took about 3 days to complete. The unindexed database takes up 267 GB on disk. Adding indexes for improved query performance increases total disk usage to 375 GB.

1. Install PostgreSQL and PostGIS

Both are available via Homebrew on Mac OS X

2. Download raw taxi data

./download_raw_data.sh

3. Initialize database and set up schema

./initialize_database.sh

4. Import taxi data into database and map to census tracts

./import_trip_data.sh

5. Optional: download and import Uber data from FiveThirtyEight's GitHub repository

./download_raw_uber_data.sh
./import_uber_trip_data.sh

6. Analysis

Additional Postgres and R scripts for analysis are in the analysis/ folder, or you can do your own!

Schema

  • trips table contains all yellow and green taxi trips, plus Uber pickups from April 2014 through September 2014. Each trip has a cab_type_id, which references the cab_types table and refers to one of yellow, green, or uber. Each trip maps to a census tract for pickup and dropoff
  • nyct2010 table contains NYC census tracts, plus a fake census tract for the Newark Airport. It also maps census tracts to NYC's official neighborhood tabulation areas
  • uber_trips_2015 table contains Uber pickups from January 2015 through June 2015. These are kept in a separate table because they don't have specific latitude/longitude coordinates, only location IDs. The location IDs are stored in the uber_taxi_zone_lookups table, which also maps them (approximately) to neighborhood tabulation areas
  • central_park_weather_observations has summary weather data by date
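
Mapping each trip to a census tract is, conceptually, a point-in-polygon test: given a pickup or dropoff coordinate, find the tract polygon that contains it. The repo does this inside PostGIS, which provides spatial predicates such as ST_Within for the purpose. The sketch below is not the repo's code. It is a plain Python ray-casting illustration of the same idea, and the function names, tract IDs, and rectangular tract shapes are all hypothetical.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    polygon is a list of (lon, lat) vertices; the last vertex connects
    back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast from the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings that lie to the east of the point.
            if lon < x_cross:
                inside = not inside
    return inside


def find_tract(lon, lat, tracts):
    """Return the id of the first tract containing the point, else None."""
    for tract_id, polygon in tracts.items():
        if point_in_polygon(lon, lat, polygon):
            return tract_id
    return None


# Hypothetical rectangular "tracts"; real tracts are irregular polygons
# loaded from the census tract shapefile.
tracts = {
    "tract_a": [(-74.00, 40.70), (-73.98, 40.70), (-73.98, 40.72), (-74.00, 40.72)],
    "tract_b": [(-73.98, 40.70), (-73.96, 40.70), (-73.96, 40.72), (-73.98, 40.72)],
}

pickup_tract = find_tract(-73.99, 40.71, tracts)   # inside tract_a
dropoff_tract = find_tract(-73.97, 40.71, tracts)  # inside tract_b
unmapped = find_tract(-74.20, 40.69, tracts)       # outside every tract
```

A linear scan over every tract per point would be far too slow for a billion trips. A spatial database avoids this with spatial indexes, which is one reason to do the mapping in PostGIS rather than in application code.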

Other data sources

These are bundled with the repository, so there is no need to download them separately, but for reference:

  • Shapefile for NYC census tracts and neighborhood tabulation areas comes from Bytes of the Big Apple
  • Central Park weather data comes from the National Climatic Data Center

Data issues encountered

  • Remove carriage returns and empty lines from TLC data before passing to Postgres COPY command
  • Green taxi raw data files have extra columns with empty data; I had to create dummy columns junk1 and junk2 to absorb them
  • Two of the yellow taxi raw data files had a small number of rows containing extra columns. I discarded these rows
  • The official NYC neighborhood tabulation areas (NTAs) included in the shapefile are not exactly what I would have expected. Some of them are bizarrely large and contain more than one neighborhood, e.g. "Hudson Yards-Chelsea-Flat Iron-Union Square", while others are confusingly named, e.g. "North Side-South Side" for what I'd call "Williamsburg", and "Williamsburg" for what I'd call "South Williamsburg". In a few instances I modified NTA names, but I kept the NTA geographic definitions
  • The shapefile includes only NYC census tracts. Trips to New Jersey, Long Island, Westchester, and Connecticut are not mapped to census tracts, with the exception of the Newark Airport, for which I manually added a fake census tract
  • The Uber 2015 data uses location IDs instead of latitude/longitude. The location IDs do not exactly overlap with the NYC neighborhood tabulation areas (NTAs) or census tracts, but I did my best to map Uber location IDs to NYC NTAs
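
Two of the cleanup steps above, removing carriage returns and empty lines and discarding rows with unexpected extra columns, can be sketched in a few lines of Python. This is only an illustration under assumed inputs, not the repo's own preprocessing, which runs in its shell scripts. The function names, column names, and sample rows are hypothetical.

```python
import csv


def clean_lines(raw_text):
    """Strip carriage returns and drop empty lines from raw CSV text."""
    cleaned = []
    for line in raw_text.split("\n"):
        line = line.replace("\r", "")
        if line.strip():
            cleaned.append(line)
    return cleaned


def drop_malformed_rows(lines, expected_columns):
    """Keep rows with exactly expected_columns fields; count the rest."""
    kept, discarded = [], 0
    for row in csv.reader(lines):
        if len(row) == expected_columns:
            kept.append(row)
        else:
            discarded += 1
    return kept, discarded


# Hypothetical sample: CRLF line endings, blank lines, and one row
# carrying two extra columns.
sample = (
    "vendor_id,pickup_datetime,fare\r\n"
    "\r\n"
    "CMT,2014-01-01 00:00:00,7.5\r\n"
    "\n"
    "VTS,2014-01-01 00:05:00,9.0,extra,extra\r\n"
)

lines = clean_lines(sample)
rows, discarded = drop_malformed_rows(lines, expected_columns=3)
```

Counting the discarded rows, rather than dropping them silently, makes it easy to confirm that only a small number of rows were lost, as was the case for the yellow taxi files.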

Why not use BigQuery or Redshift?

Google BigQuery and Amazon Redshift would probably provide significant performance improvements over PostgreSQL. A lot of the data is already available on BigQuery, but in scattered tables, and each trip has only latitude and longitude coordinates, not census tracts and neighborhoods. PostGIS seemed like the easiest way to map coordinates to census tracts. Once the mapping is complete, it might make sense to load the data back into BigQuery or Redshift to make the analysis faster. Note that BigQuery and Redshift cost some amount of money, while PostgreSQL and PostGIS are free.

Questions/issues/contact

todd@toddwschneider.com, or open a GitHub issue

Reposted from: https://my.oschina.net/u/2338162/blog/618970
