Current State

Globally, the use of commercial geographic data for site selection and consumer geo-segmentation is already widespread in developed economies. To serve China's continually upgrading consumers more precisely, IKEA, McDonald's, Starbucks, and others have set up dedicated commercial-geography analytics teams to guide store siting in China. McKinsey's "Understanding China" commercial-geography analytics team has likewise felt increasingly strong demand from clients.

Government agencies analyze the distribution of water and land resources in a region to better optimize territorial development patterns and infrastructure layout, to promote resource conservation, protection, utilization, and management across the board, and ultimately to advance the building of an ecological civilization.

We are witnessing enormous changes in the business world. In the big-data era, the transformation of retail site selection has quietly begun. As the old saying goes, "one step makes the difference of three markets" (一步差三市): a sensible location is critical to how a business performs. Using commercial geographic data for site selection and consumer geo-segmentation has become the direction of intelligent commerce.

Technology Trends and Pain Points

Products and tools popular in the global market include ArcGIS, GeoTools, GDAL, GEOS, and JTS. Against the backdrop of China's drive for domestically developed, independently controllable software, the Kingbase (人大金仓) and PostgreSQL databases are also a direction for future development.

Geospatial data-integration products such as FME are seeing ever wider use on the data-application side, for example for distributing and sharing processed data.

The common problems companies face when applying geospatial data are: how to make geospatial data respond to business needs faster, approaching real-time analysis; and, in the smart-city context, how to derive broader applications from it, such as spotting undervalued areas ("value depressions") and life-circle analysis.

Sharing

Next, I would like to share IBM's geospatial data products and demos:

Db2 Warehouse is an analytics data warehouse with in-memory data processing and in-database analytics.

It is client-managed and optimized for fast, flexible deployment, with automated scaling to support analytics workloads. Based on the number of worker nodes you select, Cloud Pak for Data automatically creates the appropriate data warehouse environment. With a single node, the warehouse uses a symmetric multiprocessing (SMP) architecture for cost efficiency. With two or more nodes, the warehouse is deployed with a massively parallel processing (MPP) architecture for high availability and improved performance.

Advantages

  • Lifecycle management: as with a cloud service, Db2 Warehouse is easy to install, upgrade, and manage; a Db2 Warehouse database can be deployed in minutes.
  • Rich ecosystem: data management console, REST, and graph support.
  • Extended availability: Db2 Warehouse offers multi-tier recovery strategies.
  • Support for software-defined storage, such as OCS and IBM Spectrum Scale CSI.

DB2WH (Cloud Pak): Geospatial Data Support and Extensions

The supported geospatial data types are described under "Key concepts" below. The following products and development languages are supported:

1. Esri ArcGIS
You can use Esri ArcGIS for Desktop version 10.3.1 together with your warehouse to analyze and visualize geospatial data.
2. Python
The ibmdbPy package provides methods to read data from, write data to, and sample data from a Db2 database. It also provides access methods for in-database analytic and geospatial functions (see the sketch after this list).
3. R
Use the RStudio® development environment that is provided by IBM® Watson™ Studio.
Use ODBC to connect your own, locally installed R development environment to your database.
4. SQL/Procedures
5. The in-database SpatialData modules
Db2 Spatial Extender / Db2 Spatial Analytics
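For the Python option above (item 2), a minimal connection sketch with ibmdbPy might look like the following; the DSN, user, and password values are placeholders to substitute with your own.

from ibmdbpy import IdaDataBase, IdaDataFrame

# Connect to Db2 Warehouse; the DSN and credentials below are placeholders
idadb = IdaDataBase(dsn="BLUDB", uid="db2user", pwd="password")

# Open a table as a lazily evaluated data frame and pull a small sample
customers = IdaDataFrame(idadb, "CUSTOMERS")
print(customers.head())

idadb.close()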

Esri ArcGIS - Creating an ArcGIS Enterprise Geodatabase

  • Install and configure DB2WH (Cloud Pak).
  • On the DB2WH (Cloud Pak) server, create an operating-system login account named sde.
  • You will connect to the database through the sde login account to create the geodatabase.
  • Create a DB2WH (Cloud Pak) database and register it with the Spatial Extender module.
  • Grant the sde user DBADM authority in the database (see the scripted sketch after these steps).
  • Configure the client.
  • On a 64-bit operating system, install the Db2 client by running the 64-bit executable; it installs both the 32-bit and 64-bit files, so you can connect from both 32-bit and 64-bit ArcGIS clients. (IBM dataserver64-v11.5.6_ntx64_rtcl.exe)
  • Create the geodatabase:
  1. Connect to the Db2 database through the sde login account.
  2. Make sure the sde user's password is saved in the database connection dialog.
  3. Right-click the database connection, then click Enable Geodatabase.
  4. The Enable Enterprise Geodatabase tool opens.
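The DBADM grant in the steps above can also be scripted. A minimal sketch with the ibm_db Python driver, using a placeholder connection string:

import ibm_db

# Placeholder connection string; substitute your host, port, and credentials
conn = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=example.com;PORT=50000;PROTOCOL=TCPIP;UID=db2inst1;PWD=secret;", "", "")

# Grant the sde user DBADM authority so it can create the geodatabase
ibm_db.exec_immediate(conn, "GRANT DBADM ON DATABASE TO USER sde")
ibm_db.close(conn)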

The SpatialData Module

1. Key concepts
Geometry types: Points, LineStrings, Polygons, and so on.
Coordinate system: A geographic coordinate system uses a three-dimensional spherical surface to determine locations on the earth.
Data types: ST_Point, ST_LineString, ST_Polygon, ST_MultiPoint, ST_MultiLineString, ST_MultiPolygon, and ST_Geometry when you are not sure which of the other data types to use.
2. Performance optimization (see the sketch after this list)
Specifying inline lengths for geospatial columns.
Registering spatial columns: call st_register_spatial_column().
Filtering using a bounding box.
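As a sketch of the bounding-box optimization, assuming a CUSTOMERS table with a spatial LOCATION column (the table and column names, coordinates, and connection values are illustrative), db2gse.EnvelopesIntersect can pre-filter rows cheaply before more expensive predicates run:

import ibm_db

conn = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=example.com;PORT=50000;PROTOCOL=TCPIP;UID=db2inst1;PWD=secret;", "", "")

# EnvelopesIntersect(geometry, xmin, ymin, xmax, ymax, srs_id) = 1 keeps only
# rows whose geometry envelope overlaps the given box -- a cheap first filter
sql = """
SELECT ID, NAME
FROM CUSTOMERS
WHERE db2gse.EnvelopesIntersect(LOCATION, -74.1, 40.6, -73.7, 41.1, 1) = 1
"""
stmt = ibm_db.exec_immediate(conn, sql)
row = ibm_db.fetch_assoc(stmt)
while row:
    print(row)
    row = ibm_db.fetch_assoc(stmt)
ibm_db.close(conn)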

SpatialData: Main Modules

1. Db2 Spatial Extender / Db2 Spatial Analytics (its successor)

Functions provided by the Db2 Spatial Extender component can be used to analyze data stored in row-organized/column-organized tables. Spatial Extender stores geospatial data in special data types, each of which can hold up to 4 MB.

2. Enabling Db2 Spatial Analytics
CALL SYSPROC.SYSINSTALLOBJECTS('GEO', 'C', CAST (NULL AS VARCHAR(128)), CAST (NULL AS VARCHAR(128)))
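The same call can be issued from a Python client; a minimal sketch with ibm_db (placeholder credentials):

import ibm_db

conn = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=example.com;PORT=50000;PROTOCOL=TCPIP;UID=db2inst1;PWD=secret;", "", "")
# 'GEO', 'C' asks SYSINSTALLOBJECTS to create the spatial analytics objects
ibm_db.exec_immediate(conn,
    "CALL SYSPROC.SYSINSTALLOBJECTS('GEO', 'C', "
    "CAST(NULL AS VARCHAR(128)), CAST(NULL AS VARCHAR(128)))")
ibm_db.close(conn)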

3. Db2 Spatial Extender/Analytics interfaces

Db2 Spatial has a wide variety of interfaces to help you set up and create projects that use spatial data:
Db2 Spatial Extender stored procedures called from application programs.
SQL queries that you submit from application programs.
Open source projects that support Spatial Extender such as:

  • GeoTools is a Java™ library for building spatial applications. For more information, see http://www.geotools.org/.
  • GeoServer is a Web map server and Web feature server. For more information, see http://geoserver.org/.
  • uDIG is a desktop spatial data visualization and analysis application. For more information, see http://udig.refractions.net/.

SQL Functions Supported by the SpatialData Module
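The module exposes spatial SQL functions in the db2gse schema: constructors such as ST_Point, relational predicates such as ST_Within and ST_Intersects, and measures such as ST_Distance and ST_Area. A hedged sketch of calling one of them, assuming the CUSTOMERS table from the use case below (connection values are placeholders):

import ibm_db

conn = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=example.com;PORT=50000;PROTOCOL=TCPIP;UID=db2inst1;PWD=secret;", "", "")
# Distance (in the units of the spatial reference system) from each customer
# to a fixed point, nearest first
sql = """
SELECT C.ID,
       db2gse.ST_Distance(C.LOCATION,
                          db2gse.ST_Point(-73.97, 40.78, 1)) AS DIST
FROM CUSTOMERS C
ORDER BY DIST
FETCH FIRST 10 ROWS ONLY
"""
stmt = ibm_db.exec_immediate(conn, sql)
row = ibm_db.fetch_tuple(stmt)
while row:
    print(row)
    row = ibm_db.fetch_tuple(stmt)
ibm_db.close(conn)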

Use Case

The Safe Harbor Real Estate insurance company sets up a project to use geospatial data for BI decision-making.

1. Define the goals:

•Where is it best to open new branch offices?
•How to adjust insurance premiums based on customer location attributes (areas with high rates of traffic accidents, areas with high rates of crime, flood zones, earthquake faults, and so on)?

2. Choose the spatial reference system

•Based on the geographic locations involved, decide to use a spatial reference system called NAD83_SRS_1, which is designed to be used with GCS_NORTH_AMERICAN_1983.

3. Create the tables (a creation sketch follows this list)

•Customer table: CUSTOMERS, with LATITUDE and LONGITUDE columns
•Branch office table: OFFICE_LOCATIONS
•Branch office sales table: OFFICE_SALES
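A hedged sketch of creating the CUSTOMERS table from Python (connection values are placeholders; columns reduced to the essentials):

import ibm_db

conn = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=example.com;PORT=50000;PROTOCOL=TCPIP;UID=db2inst1;PWD=secret;", "", "")
# LOCATION is the spatial column; LATITUDE/LONGITUDE hold the raw coordinates
ibm_db.exec_immediate(conn, """
CREATE TABLE CUSTOMERS (
    ID        INTEGER NOT NULL PRIMARY KEY,
    NAME      VARCHAR(100),
    LATITUDE  DOUBLE,
    LONGITUDE DOUBLE,
    LOCATION  db2gse.ST_Point
)""")
ibm_db.close(conn)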

4. Register the three tables

5. Populate the LOCATION column from the latitude/longitude data

•UPDATE CUSTOMERS SET LOCATION = db2gse.ST_Point(LONGITUDE, LATITUDE, 1) to populate the LOCATION value from LATITUDE and LONGITUDE.
•Create a HAZARD_ZONES table and import a shapefile into it.

6. Optimize

•Create a view that joins columns from the CUSTOMERS and HAZARD_ZONES tables, and register the spatial columns in this view.

7. Analyze

•The spatial analysis team runs queries to obtain information (a query sketch follows).
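A sketch of one such query, assuming HAZARD_ZONES has a geometry column ZONE and an attribute column ZONE_TYPE (both names are illustrative); db2gse.ST_Within returns 1 when a customer's location lies inside a zone:

import ibm_db

conn = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=example.com;PORT=50000;PROTOCOL=TCPIP;UID=db2inst1;PWD=secret;", "", "")
# Customers whose location falls inside any hazard zone
sql = """
SELECT C.ID, C.NAME, H.ZONE_TYPE
FROM CUSTOMERS C, HAZARD_ZONES H
WHERE db2gse.ST_Within(C.LOCATION, H.ZONE) = 1
"""
stmt = ibm_db.exec_immediate(conn, sql)
row = ibm_db.fetch_tuple(stmt)
while row:
    print(row)
    row = ibm_db.fetch_tuple(stmt)
ibm_db.close(conn)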

CP4D - Fast Analytics on Geospatial Data

1. Ad hoc queries

The BI front end needs no knowledge of the underlying physical design: users compose self-service, drag-and-drop custom queries in the browser, interact through APIs with any decoupled back-end data source, and see the data in near real time.

2. CP4D - fast geospatial analysis

Automatically spin up lightweight, dedicated Apache Spark clusters to run a wide range of workloads.

Included with Cloud Pak for Data: the geospatio-temporal library

Geospatio-temporal library - How to Use It

You can use the geospatio-temporal library to expand your data science analysis to include location analytics by gathering, manipulating and displaying imagery, GPS, satellite photography and historical data. You can use the geospatio-temporal library in Cloud Pak for Data to:

1.  Run Spark jobs on your Cloud Pak for Data cluster by using the Spark jobs REST APIs of Analytics Engine powered by Apache Spark. pyst supports most of the common geospatial formats, including geoJSON and WKT.

from pyst import STContext

stc = STContext(spark.sparkContext._gateway)

To work with files in GeoJSON format, first create a GeoJSON reader and writer:

geojson_reader = stc.geojson_reader()
geojson_writer = stc.geojson_writer()
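By analogy with the wkt_reader.read() call shown later, a hypothetical GeoJSON round trip might look like the following; the exact read/write signatures are an assumption here and should be checked against the pyst documentation.

# Hypothetical usage; signatures assumed by analogy with wkt_reader.read()
feature = geojson_reader.read('{"type": "Point", "coordinates": [-73.97, 40.78]}')
geojson_writer.write(feature)  # serialize the geometry back to GeoJSON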

Direct input:

Point: point(lat, lon)
LineSegment: line_segment(start_point, end_point)
LineString: line_string([point_1, point_2, …]) or line_string([line_segment_1, line_segment_2, …])
Ring: ring([point_1, point_2, …]) or ring([line_segment_1, line_segment_2, …])
Polygon: polygon(exterior_ring, [interior_ring_1, interior_ring_2, …])
MultiGeometry: multi_geometry(geom_1, geom_2, …)
MultiPoint: multi_point(point_1, point_2, …)
MultiLineString: multi_line_string(line_string_1, line_string_2, …)
MultiPolygon: multi_polygon(polygon_1, polygon_2, …)
Null Geometry: null_geometry()
FullEarth: full_earth()
BoundingBox: bounding_box(lower_lat, lower_lon, upper_lat, upper_lon)
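As a quick sketch of a few of these constructors, called on an STContext initialized as shown earlier (the coordinates are arbitrary):

# Build simple geometries with the direct-input constructors (note lat, lon order)
p1 = stc.point(41.034, -73.763)
p2 = stc.point(41.041, -73.775)
seg = stc.line_segment(p1, p2)             # a single segment between two points
path = stc.line_string([p1, p2])           # a polyline from a list of points
bbox = stc.bounding_box(40.9, -74.0, 41.1, -73.7)  # lower_lat, lower_lon, upper_lat, upper_lon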

2. Run notebooks in Spark environments in Watson Studio.

You can use the geospatio-temporal library to expand your data science analysis in Python notebooks to include location analytics by gathering, manipulating and displaying imagery, GPS, satellite photography and historical data.

The spatio-temporal library is available in all IBM Watson Studio Spark runtime environments and when you run your notebooks in IBM Analytics Engine.

Key aspects of the library include:

  • All calculated geometries are accurate without the need for projections.
  • The geospatial functions take advantage of the distributed processing capabilities provided by Spark.
  • The library includes native geohashing support for geometries used in simple aggregations and in indexing, thereby improving storage retrieval considerably.
  • The library supports extensions of Spark distributed joins.
  • The library supports the SQL/MM extensions to Spark SQL.

Reference

https://www.ibm.com/docs/en/cloud-paks/cp-data/3.5.0?topic=scripts-geospatio-temporal-library

https://www.ibm.com/docs/en/cloud-paks/cp-data/3.5.0?topic=libraries-geospatio-temporal-library#getting-started-with-the-library

Geospatio-temporal library - Functions

Topological functions

With the spatio-temporal library, you can use topological relations to confine the returned results of your location data analysis.

Get the aggregated bounding box for a list of geometries.

westchester_WKT = 'POLYGON((-73.984 41.325,...,-74.017 40.698,-74.019 40.698,-74.023 40.703,-74.023 40.709))'

wkt_reader = stc.wkt_reader()
westchester = wkt_reader.read(westchester_WKT)
white_plains = wkt_reader.read(white_plains_WKT)
manhattan = wkt_reader.read(manhattan_WKT)

white_plains_bbox = white_plains.get_bounding_box()
westchester_bbox = westchester.get_bounding_box()
manhattan_bbox = manhattan.get_bounding_box()

aggregated_bbox = white_plains_bbox.get_containing_bb(westchester_bbox).get_containing_bb(manhattan_bbox)
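Beyond bounding boxes, the topological relations themselves can confine results. A short sketch, assuming the geometry objects read above expose the library's topological-relation methods (contains, intersects):

# Topological checks on the geometries read above; method names assumed
# from the library's topological-relations API
>>> westchester.contains(white_plains)   # True when one geometry lies fully inside the other
>>> manhattan.intersects(westchester)    # True when the two geometries share any point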

Geohashing functions

The spatio-temporal library includes geohashing functions for proximity search (encoding latitude and longitude and grouping nearby points) in location data analysis.

Geohash coverage

test_wkt = 'POLYGON((-73.76223024988917 41.04173285255264,-73.7749331917837 41.04121496082817,-73.78197130823878 41.02748934524744,-73.76476225519923 41.023733725449326,-73.75218805933741 41.031633228865495,-73.7558787789419 41.03752486433286,-73.76223024988917 41.04173285255264))'
poly = wkt_reader.read(test_wkt)
cover = stc.geohash.geohash_cover_at_bit_depth(poly, 36)

Geospatial indexing functions

With the spatio-temporal library, you can use functions to index points within a region, regions containing a point, and points within a radius, enabling fast queries on this data during location analysis.

>>> tile_size = 100000
>>> si = stc.tessellation_index(tile_size=tile_size)  # leave bbox as None to use the full earth as the bounding box
>>> si.from_df(county_df, 'NAME', 'geometry', verbosity='error')
3221 entries processed, 3221 entries successfully added

Which counties are within 20 km of White Plains Hospital? The results are sorted by distance.

>>> counties = si.within_distance_with_info(white_plains_hospital, 20000)
>>> counties.sort(key=lambda tup: tup[2])
>>> for county in counties:
...     print(county[0], county[2])
Westchester 0.0
Fairfield 7320.602641166855
Rockland 10132.182241119823
Bergen 10934.1691335908
Bronx 15683.400292349625
Nassau 17994.425235412604

Ellipsoidal metrics

You can use ellipsoidal metrics to calculate the distance between points.

Compute the radians between two points using azimuth:

>>> p1 = stc.point(47.1, -73.5)
>>> p2 = stc.point(47.6, -72.9)
>>> stc.eg_metric.azimuth(p1, p2)
0.6802979449118038
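Alongside azimuth, point-to-point distance would naturally live on the same metric object; the call below is an assumption (the method name is not confirmed by this document), so verify it against the library reference before relying on it.

# Hypothetical: ellipsoidal distance between the two points defined above;
# the distance() method name is an assumption, not confirmed by this document
>>> stc.eg_metric.distance(p1, p2)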

Routing functions

The spatio-temporal library includes routing functions that list the edges that yield a path from one node to another node.

Find the best route with the minimal distance cost (the shortest route by distance):

# Check the distance cost, in meters
>>> best_distance_route.cost
2042.4082601271236

# Check the route path (only the first three points shown), a list of 3-tuples (osm_point_id, lat, lon)
>>> best_distance_route.path[:3]
[(2036943312, 33.7631862, -84.3939405),
 (3523447568, 33.7632666, -84.3939315),
 (2036943524, 33.7633273, -84.3939155)]

Spark Engine - Analytics Engine

You can use Analytics Engine powered by Apache Spark as a compute engine to run analytical and machine learning jobs.
The Analytics Engine powered by Apache Spark service is not available by default; an administrator must install this service on the IBM Cloud Pak for Data platform. If you have the Watson Studio service installed, the Analytics Engine powered by Apache Spark service automatically adds a set of default Spark environment definitions to analytics projects. You can also create custom Spark environment definitions in a project.

You can submit jobs to Spark clusters in two ways:
1. Specifying a Spark environment definition for a job in an analytics project
2. Running Spark job APIs

Each time you submit a job, a dedicated Spark cluster is created for the job. You can specify the size of the Spark driver, the size of the executor, and the number of executors for the job. This enables you to achieve predictable and consistent performance.

When a job completes, the cluster is automatically cleaned up so that the resources are available for other jobs. The service also includes interfaces that enable you to analyze the performance of your Spark applications and debug problems.

Spark APIs
You can run these types of workloads with Spark jobs APIs:

  • Spark applications that run Spark SQL
  • Data transformation jobs
  • Data science jobs
  • Machine learning jobs

Using Spark in Watson Studio

For Python:

At the beginning of the cell, add %%writefile myfile.py to save the code as a Python file to your working directory. Notebooks that use the same runtime can also import this file. The advantage of this method is that the code is available in your notebook, and you can edit and save it as a new Python script at any time. Use pyst, which supports most of the common geospatial formats, including shapefile, GeoJSON, and Well-Known Text (WKT).
from pyst import STContext
# Register STContext, which is the main entry point
stc = STContext(spark.sparkContext._gateway)

For R:

If you want to save code in a notebook as an R script to the working directory, you can use the writeLines(myfile.R) function.
RStudio uses the sparklyr package to connect to Spark from R. The sparklyr package includes a dplyr interface to Spark data frames as well as an R interface to Spark's distributed machine learning pipelines. There are two methods of connecting to Spark from RStudio.

Spark Jobs APIs

In IBM Cloud Pak for Data, you can run Spark jobs or applications on your IBM Cloud Pak for Data cluster without installing Watson Studio by using the Spark jobs REST APIs of Analytics Engine powered by Apache Spark.

1. Submitting Spark jobs

curl -k -X POST <JOB_API_ENDPOINT> -H "Authorization: Bearer <ACCESS_TOKEN>" -d '{"engine":{"type":"spark"},"application_arguments":["/opt/ibm/spark/examples/src/main/resources/people.txt"],"application": "/opt/ibm/spark/examples/src/main/python/wordcount.py"}'

2. Viewing Spark job status:

curl -k -X GET <JOB_API_ENDPOINT> -H "Authorization: Bearer <ACCESS_TOKEN>"

3. Deleting Spark jobs:

curl -k -X DELETE <JOB_API_ENDPOINT>/<job-id> -H "Authorization: Bearer <ACCESS_TOKEN>"

A Real-World Case

Define the goals:

•Determine the flood impact extent; analyze the imagery and load the result into the database in WKT format.
•Based on the WKT impact extent, count the affected policyholders, and use machine learning to propose an insurance plan.

Reference code is linked below; first, a hedged sketch of the counting step.
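The sketch assumes an initialized STContext, a WKT polygon for the flood extent, and an in-memory list of policyholder coordinates; the contains() method is assumed from the library's topological API, and all names and coordinates are illustrative.

# Flood extent arrives as WKT; policyholder coordinates are illustrative
flood_wkt = 'POLYGON((-84.40 33.75,-84.38 33.75,-84.38 33.77,-84.40 33.77,-84.40 33.75))'
wkt_reader = stc.wkt_reader()
flood_zone = wkt_reader.read(flood_wkt)

policyholders = [(33.760, -84.392), (33.780, -84.385), (33.755, -84.390)]
affected = [
    (lat, lon) for (lat, lon) in policyholders
    if flood_zone.contains(stc.point(lat, lon))  # topological containment test
]
print(len(affected), "policyholders inside the flood extent")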

Reference:

•Insurance Loss Estimation using Remote Sensing

https://dataplatform.cloud.ibm.com/exchange/public/entry/view/14ea8dfab582137c695a6630e90cdc32?context=cpdaas

There is a lot of material here; if you have any questions, please contact the blogger.
