UCI Dataset Summary and Descriptions

1. Abalone: Predict the age of abalone from physical measurements

2. Abscisic Acid Signaling Network: The objective is to determine the set of boolean rules that describe the interactions of the nodes within this plant signaling network. The dataset includes 300 separate boolean pseudodynamic simulations using an asynchronous update scheme.

3. Acute Inflammations: The data was created by a medical expert to test an expert system that performs the presumptive diagnosis of two diseases of the urinary system.

4. Adult: Predict whether income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset.

5. Annealing: Steel annealing data

6. Anonymous Microsoft Web Data: Log of anonymous users of www.microsoft.com; predict areas of the web site a user visited based on data on other areas the user visited.

7. Arcene: ARCENE's task is to distinguish cancer versus normal patterns from mass-spectrometric data. This is a two-class classification problem with continuous input variables. This dataset is one of 5 datasets of the NIPS 2003 feature selection challenge.

8. Arrhythmia: Distinguish between the presence and absence of cardiac arrhythmia and classify it into one of 16 groups.

9. Artificial Characters: Dataset artificially generated using a first-order theory that describes the structure of ten capital letters of the English alphabet

10. Audiology (Original): Nominal audiology dataset from Baylor

11. Audiology (Standardized): Standardized version of the original audiology database

12. Australian Sign Language signs: This data consists of samples of Auslan (Australian Sign Language) signs. Examples of 95 signs were collected from five signers, with a total of 6650 sign samples.

13. Australian Sign Language signs (High Quality): This data consists of samples of Auslan (Australian Sign Language) signs. 27 examples of each of 95 Auslan signs were captured from a native signer using high-quality position trackers

14. Auto MPG: Revised from the CMU StatLib library; the data concerns city-cycle fuel consumption

15. Automobile: From the 1985 Ward's Automotive Yearbook

16. AutoUniv: AutoUniv is an advanced data generator for classification tasks. The aim is to reflect the nuances and heterogeneity of real data. Data can be generated in .csv, ARFF or C4.5 formats.

17. Bach Chorales: Time-series data based on chorales; the challenge is to learn a generative grammar; data in Lisp

18. Badges: Badges labeled with a "+" or "-" as a function of a person's name

19. Bag of Words: This data set contains five text collections in the form of bags-of-words.

20. Balance Scale: Balance scale weight & distance database

21. Balloons: Data previously used in a cognitive psychology experiment; 4 data sets represent different conditions of an experiment

22. Blood Transfusion Service Center: Data taken from the Blood Transfusion Service Center in Hsin-Chu City in Taiwan -- this is a classification problem.

23. Breast Cancer: Breast Cancer Data (Restricted Access)

24. Breast Cancer Wisconsin (Diagnostic): Diagnostic Wisconsin Breast Cancer Database

25. Breast Cancer Wisconsin (Original): Original Wisconsin Breast Cancer Database

26. Breast Cancer Wisconsin (Prognostic): Prognostic Wisconsin Breast Cancer Database

27. Breast Tissue: Dataset with electrical impedance measurements of freshly excised tissue samples from the breast.

28. CalIt2 Building People Counts: This data comes from the main door of the CalIt2 building at UCI.

29. Car Evaluation: Derived from a simple hierarchical decision model, this database may be useful for testing constructive induction and structure discovery methods.

30. Cardiotocography: The dataset consists of measurements of fetal heart rate (FHR) and uterine contraction (UC) features on cardiotocograms classified by expert obstetricians.

31. Census Income: Predict whether income exceeds $50K/yr based on census data. Also known as the "Adult" dataset.

32. Census-Income (KDD): This data set contains weighted census data extracted from the 1994 and 1995 current population surveys conducted by the U.S. Census Bureau.

33. Challenger USA Space Shuttle O-Ring: Task: predict the number of O-rings that experience thermal distress on a flight at 31 degrees F, given data on the previous 23 shuttle flights

34. Character Trajectories: Multiple, labelled samples of pen tip trajectories recorded whilst writing individual characters. All samples are from the same writer, for the purposes of primitive extraction. Only characters with a single pen-down segment were considered.

35. Chess (Domain Theories): 6 different domain theories for generating legal moves of chess

36. Chess (King-Rook vs. King): Chess Endgame Database for White King and Rook against Black King (KRK).

37. Chess (King-Rook vs. King-Knight): Knight Pin Chess End-Game Database Creator

38. Chess (King-Rook vs. King-Pawn): King+Rook versus King+Pawn on a7 (usually abbreviated KRKPA7).

39. Cloud: Little Documentation

40. CMU Face Images: This data consists of 640 black and white face images of people taken with varying pose (straight, left, right, up), expression (neutral, happy, sad, angry), eyes (wearing sunglasses or not), and size

41. Coil 1999 Competition Data: This data set is from the 1999 Computational Intelligence and Learning (COIL) competition. The data contains measurements of river chemical concentrations and algae densities.

42. Communities and Crime: Communities within the United States. The data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR.

43. Communities and Crime Unnormalized: Communities in the US. Data combines socio-economic data from the '90 Census, law enforcement data from the 1990 Law Enforcement Management and Admin Stats survey, and crime data from the 1995 FBI UCR

44. Computer Hardware: Relative CPU Performance Data, described in terms of its cycle time, memory size, etc.

45. Concrete Compressive Strength: Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients.

46. Concrete Slump Test: Concrete is a highly complex material. The slump flow of concrete is determined not only by the water content but also by other concrete ingredients.

47. Congressional Voting Records: 1984 United States Congressional Voting Records; Classify as Republican or Democrat

48. Connect-4: Contains connect-4 positions

49. Connectionist Bench (Nettalk Corpus): The file "nettalk.data" contains a list of 20,008 English words, along with a phonetic transcription for each word. The task is to train a network to produce the proper phonemes

50. Connectionist Bench (Sonar, Mines vs. Rocks): The task is to train a network to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock.

51. Connectionist Bench (Vowel Recognition - Deterding Data): Speaker-independent recognition of the eleven steady-state vowels of British English using a specified training set of LPC-derived log area ratios.

52. Contraceptive Method Choice: Dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey.

53. Corel Image Features: This dataset contains image features extracted from a Corel image collection. Four sets of features are available based on the color histogram, color histogram layout, color moments, and co-occurrence

54. Covertype: Forest CoverType dataset

55. Credit Approval: This data concerns credit card applications; good mix of attributes

56. Cylinder Bands: Used in decision tree induction for mitigating process delays known as "cylinder bands" in rotogravure printing

57. Demospongiae: Marine sponges of the Demospongiae class classification domain.

58. Dermatology: Aim for this dataset is to determine the type of Erythemato-Squamous Disease.

59. Dexter: DEXTER is a text classification problem in a bag-of-words representation. This is a two-class classification problem with sparse continuous input variables. This dataset is one of five datasets of the NIPS 2003 feature selection challenge.

60. DGP2 - The Second Data Generation Program: Generates application domains based on specific parameters, number of features, and proportion of positive to negative examples

61. Diabetes: This diabetes dataset is from AIM '94

62. Document Understanding: Five concepts, expressed as predicates, to be learned

63. Dodgers Loop Sensor: Loop sensor data was collected for the Glendale on-ramp for the 101 North freeway in Los Angeles

64. Dorothea: DOROTHEA is a drug discovery dataset. Chemical compounds represented by structural molecular features must be classified as active (binding to thrombin) or inactive. This is one of 5 datasets of the NIPS 2003 feature selection challenge.

65. E. Coli Genes: Data giving characteristics of each ORF (potential gene) in the E. coli genome. Sequence, homology (similarity to other genes) and structural information, and function (if known) are provided.

66. EBL Domain Theories: Assorted small-scale domain theories

67. Echocardiogram: Data for classifying whether patients will survive for at least one year after a heart attack

68. Ecoli: This data contains protein localization sites

69. Economic Sanctions: Domain Theory on Economic Sanctions; Undocumented

70. EEG Database: This data arises from a large study to examine EEG correlates of genetic predisposition to alcoholism. It contains measurements from 64 electrodes placed on the scalp sampled at 256 Hz

71. El Nino: The data set contains oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific.

72. Entree Chicago Recommendation Data: This data contains a record of user interactions with the Entree Chicago restaurant recommendation system.

73. Flags: From Collins Gem Guide to Flags, 1986

74. Forest Fires: This is a difficult regression task, where the aim is to predict the burned area of forest fires, in the northeast region of Portugal, by using meteorological and other data (see details at: http://www.dsi.uminho.pt/~pcortez/forestfires).

75. Function Finding: Cases collected mostly from investigations in physical science; intention is to evaluate function-finding algorithms

76. Gisette: GISETTE is a handwritten digit recognition problem. The problem is to separate the highly confusible digits '4' and '9'. This dataset is one of five datasets of the NIPS 2003 feature selection challenge.

77. Glass Identification: From USA Forensic Science Service; 6 types of glass; defined in terms of their oxide content (i.e. Na, Fe, K, etc.)

78. Haberman's Survival: Dataset contains cases from a study conducted on the survival of patients who had undergone surgery for breast cancer

79. Hayes-Roth: Topic: human subjects study

80. Heart Disease: 4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach

81. Hepatitis: From G.Gong: CMU; Mostly Boolean or numeric-valued attribute types; Includes cost data (donated by Peter Turney)

82. Hill-Valley: Each record represents 100 points on a two-dimensional graph. When plotted in order (from 1 through 100) as the Y co-ordinate, the points will create either a Hill (a "bump" in the terrain) or a Valley (a "dip" in the terrain).

83. Horse Colic: Well documented attributes; 368 instances with 28 attributes (continuous, discrete, and nominal); 30% missing values

84. Housing: Taken from StatLib library

85. ICU: Data set prepared for the use of participants for the 1994 AAAI Spring Symposium on Artificial Intelligence in Medicine.

86. Image Segmentation: Image data described by high-level numeric-valued attributes, 7 classes

87. Insurance Company Benchmark (COIL 2000): This data set, used in the CoIL 2000 Challenge, contains information on customers of an insurance company. The data consists of 86 variables and includes product usage data and socio-demographic data

88. Internet Advertisements: This dataset represents a set of possible advertisements on Internet pages.

89. Internet Usage Data: This data contains general demographic information on internet users in 1997.

90. Ionosphere: Classification of radar returns from the ionosphere

91. IPUMS Census Database: This data set contains unweighted PUMS census data from the Los Angeles and Long Beach areas for the years 1970, 1980, and 1990.

92. Iris: Famous database; from Fisher, 1936

93. ISOLET: Goal: Predict which letter-name was spoken--a simple classification task.

94. Japanese Credit Screening: Includes domain theory (generated by talking to Japanese domain experts); data in Lisp

95. Japanese Vowels: This dataset records 640 time series of 12 LPC cepstrum coefficients taken from nine male speakers.

96. KDD Cup 1998 Data: This is the data set used for The Second International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-98

97. KDD Cup 1999 Data: This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99

98. Kinship: Relational dataset

99. Labor Relations: From Collective Bargaining Review

100. LED Display Domain: From Classification and Regression Trees book; We provide here 2 C programs for generating sample databases

101. Lenses: Database for fitting contact lenses

102. Letter Recognition: Database of character image features; try to identify the letter

103. Libras Movement: The data set contains 15 classes of 24 instances each. Each class refers to a hand movement type in LIBRAS (Portuguese name 'Língua BRAsileira de Sinais', the official Brazilian sign language).

104. Liver Disorders: BUPA Medical Research Ltd. database donated by Richard S. Forsyth

105. Localization Data for Person Activity: Data contains recordings of five people performing different activities. Each person wore four sensors (tags) while performing the same scenario five times.

106. Logic Theorist: All code for Logic Theorist

107. Low Resolution Spectrometer: From IRAS data -- NASA Ames Research Center

108. Lung Cancer: Lung cancer data; no attribute definitions

109. Lymphography: This lymphography domain was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. (Restricted access)

110. M. Tuberculosis Genes: Data giving characteristics of each ORF (potential gene) in the M. tuberculosis bacterium. Sequence, homology (similarity to other genes) and structural information, and function (if known) are provided

111. Madelon: MADELON is an artificial dataset, which was part of the NIPS 2003 feature selection challenge. This is a two-class classification problem with continuous input variables. The difficulty is that the problem is multivariate and highly non-linear.

112. MAGIC Gamma Telescope: Data are MC generated to simulate registration of high energy gamma particles in an atmospheric Cherenkov telescope

113. Mammographic Mass: Discrimination of benign and malignant mammographic masses based on BI-RADS attributes and the patient's age.

114. Mechanical Analysis: Fault diagnosis problem of electromechanical devices; the PUMPS DATA SET is a newer version with domain theory and results

115. Meta-data: Meta-Data was used in order to give advice about which classification method is appropriate for a particular dataset (taken from results of Statlog project).

116. MiniBooNE particle identification: This dataset is taken from the MiniBooNE experiment and is used to distinguish electron neutrinos (signal) from muon neutrinos (background).

117. Mobile Robots: Learning concepts from sensor data of a mobile robot; set of data sets

118. Molecular Biology (Promoter Gene Sequences): E. Coli promoter gene sequences (DNA) with partial domain theory

119. Molecular Biology (Protein Secondary Structure): From CMU connectionist bench repository; Classifies secondary structure of certain globular proteins

120. Molecular Biology (Splice-junction Gene Sequences): Primate splice-junction gene sequences (DNA) with associated imperfect domain theory

121. MONK's Problems: A set of three artificial domains over the same attribute space; Used to test a wide range of induction algorithms

122. Moral Reasoner: Horn-clause model that qualitatively simulates moral reasoning; Theory includes negated literals

123. Movie: This data set contains a list of over 10000 films including many older, odd, and cult films. There is information on actors, casts, directors, producers, studios, etc.

124. MSNBC.com Anonymous Web Data: This data describes the page visits of users who visited msnbc.com on September 28, 1999. Visits are recorded at the level of URL category (see description) and are recorded in time order.

125. Multiple Features: This dataset consists of features of handwritten numerals ('0'-'9') extracted from a collection of Dutch utility maps

126. Mushroom: From Audubon Society Field Guide; mushrooms described in terms of physical characteristics; classification: poisonous or edible

127. Musk (Version 1): The goal is to learn to predict whether new molecules will be musks or non-musks

128. Musk (Version 2): The goal is to learn to predict whether new molecules will be musks or non-musks

129. NSF Research Award Abstracts 1990-2003: This data set consists of (a) 129,000 abstracts describing NSF awards for basic research, (b) bag-of-word data files extracted from the abstracts, (c) a list of words used for indexing the bag-of-word

130. Nursery: Nursery Database was derived from a hierarchical decision model originally developed to rank applications for nursery schools.

131. Online Handwritten Assamese Characters Dataset: This is a dataset of 8235 online handwritten Assamese characters. The "online" process involves capturing data as text is written on a digitizing tablet with an electronic pen.

132. Opinosis Opinion / Review: This dataset contains sentences extracted from user reviews on a given topic. Example topics are "performance of Toyota Camry" and "sound quality of iPod nano".

133. OpinRank Review Dataset: This data set contains user reviews of cars and hotels collected from Tripadvisor (~259,000 reviews) and Edmunds (~42,230 reviews).

134. Optical Recognition of Handwritten Digits: Two versions of this database available; see folder

135. Othello Domain Theory: Used in research to generate features for an inductive learning system

136. Ozone Level Detection: Two ground ozone level data sets are included in this collection. One is the eight hour peak set (eighthr.data), the other is the one hour peak set (onehr.data). The data were collected from 1998 to 2004 in the Houston, Galveston and Brazoria area.

137. p53 Mutants: The goal is to model mutant p53 transcriptional activity (active vs inactive) based on data extracted from biophysical simulations.

138. Page Blocks Classification: The problem consists of classifying all the blocks of the page layout of a document that has been detected by a segmentation process.

139. Parkinsons: Oxford Parkinson's Disease Detection Dataset

140. Parkinsons Telemonitoring: Oxford Parkinson's Disease Telemonitoring Dataset

141. PEMS-SF: 15 months' worth of daily data (440 daily records) that describes the occupancy rate, between 0 and 1, of different car lanes of the San Francisco bay area freeways across time.

142. Pen-Based Recognition of Handwritten Digits: Digit database of 250 samples from 44 writers

143. Pima Indians Diabetes: From National Institute of Diabetes and Digestive and Kidney Diseases; Includes cost data (donated by Peter Turney)

144. Pioneer-1 Mobile Robot Data: This dataset contains time series sensor readings of the Pioneer-1 mobile robot. The data is broken into "experiences" in which the robot takes action for some period of time and experiences a control

145. Pittsburgh Bridges: Bridges database that has original and numeric-discretized datasets

146. Plants: Data has been extracted from the USDA plants database. It contains all plants (species and genera) in the database and the states of USA and Canada where they occur.

147. Poker Hand: Purpose is to predict poker hands

148. Post-Operative Patient: Dataset of patient features

149. Primary Tumor: From Ljubljana Oncology Institute

150. Prodigy: Assorted domains like blocksworld, eightpuzzle, and schedworld.

151. Protein Data: Undocumented

152. Pseudo Periodic Synthetic Time Series: This data set is designed for testing indexing schemes in time series databases. The data appears highly periodic, but never exactly repeats itself.

153. PubChem Bioassay Data: These highly imbalanced bioassay datasets are from the differing types of screening that can be performed using HTS technology. 21 datasets were created from 12 bioassays.

154. Quadruped Mammals: The file animals.c is a data generator of structured instances representing quadruped animals

155. Qualitative Structure Activity Relationships: Two sets of datasets are given: pyrimidines and triazines

156. Record Linkage Comparison Patterns: Element-wise comparison of records with personal data from a record linkage setting. The task is to decide from a comparison pattern whether the underlying records belong to one person.

157. Relative location of CT slices on axial axis: The dataset consists of 384 features extracted from CT images. The class variable is numeric and denotes the relative location of the CT slice on the axial axis of the human body.

158. Reuters Transcribed Subset: This dataset is created by reading out 200 files from the 10 largest Reuters classes and using an Automatic Speech Recognition system to create corresponding transcriptions.

159. Reuters-21578 Text Categorization Collection: This is a collection of documents that appeared on Reuters newswire in 1987. The documents were assembled and indexed with categories.

160. Robot Execution Failures: This dataset contains force and torque measurements on a robot after failure detection. Each failure is characterized by 15 force/torque samples collected at regular time intervals

161. SECOM: Data from a semi-conductor manufacturing process

162. Semeion Handwritten Digit: 1593 handwritten digits from around 80 persons were scanned and stretched in a rectangular box 16x16 in a gray scale of 256 values.

163. Servo: Data was from a simulation of a servo system

164. Shuttle Landing Control: Tiny database; all nominal values

165. Solar Flare: Each class attribute counts the number of solar flares of a certain class that occur in a 24 hour period

166. Soybean (Large): Michalski's famous soybean disease database

167. Soybean (Small): Michalski's famous soybean disease database

168. Spambase: Classifying Email as Spam or Non-Spam

169. SPECT Heart: Data on cardiac Single Photon Emission Computed Tomography (SPECT) images. Each patient is classified into one of two categories: normal and abnormal.

170. SPECTF Heart: Data on cardiac Single Photon Emission Computed Tomography (SPECT) images. Each patient is classified into one of two categories: normal and abnormal.

171. Spoken Arabic Digit: This dataset contains timeseries of mel-frequency cepstrum coefficients (MFCCs) corresponding to spoken Arabic digits. Includes data from 44 male and 44 female native Arabic speakers.

口语阿拉伯语位：该数据集包含MEL频率倒谱系数（MFCCs）讲阿拉伯语数字对应的时间序列。包括44男44女的母语讲阿拉伯语的数据。

172. Sponge: Data on sponges; Attributes in Spanish

海绵：海绵上的数据，在西班牙语中的属性

173. Statlog (Australian Credit Approval): This file concerns credit card applications. This database exists elsewhere in the repository (Credit Screening Database) in a slightly different form

Statlog（澳大利亚授信审批）：这个文件是关于信用卡申请。该数据库存在于其他地方略有不同形式的资源库（授信数据库）

174. Statlog (German Credit Data): This dataset classifies people described by a set of attributes as good or bad credit risks. Comes in two formats (one all numeric). Also comes with a cost matrix

Statlog（德国信用数据）：这个数据集划分好坏信贷风险的属性所描述的人。来自于两种格式（所有数字）。还带有一个成本矩阵

175. Statlog (Heart): This dataset is a heart disease database similar to a database already present in the repository (Heart Disease databases) but in a slightly different form

Statlog（心脏）：该数据集是一个心脏病数据库，与资源库中已有的数据库（Heart Disease数据库）类似，但形式略有不同

176. Statlog (Image Segmentation): This dataset is an image segmentation database similar to a database already present in the repository (Image segmentation database) but in a slightly different form.

Statlog（图像分割）：该数据集是一个图像分割数据库，与资源库中已有的数据库（Image Segmentation数据库）类似，但形式略有不同。

177. Statlog (Landsat Satellite): Multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood

Statlog（Landsat卫星）：卫星图像中3x3邻域内像素的多光谱值，以及与每个邻域中心像素相关联的分类
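The description above amounts to a feature-construction recipe: take the band values of every pixel in a 3x3 block and pair them with the class of the centre pixel. The following sketch shows that idea on a toy image. The function name `neighbourhood_features`, the nested-list image layout, the two-band toy data, and the row-by-row ordering of the output are assumptions for illustration, not the layout of the actual dataset files.

```python
def neighbourhood_features(image, row, col):
    """Flatten the band values of the 3x3 block centred at (row, col).

    `image[r][c]` is assumed to be a tuple of spectral band values.
    Pixels are visited row by row, top-left to bottom-right.
    """
    n_rows, n_cols = len(image), len(image[0])
    if not (1 <= row < n_rows - 1 and 1 <= col < n_cols - 1):
        raise ValueError("centre pixel must not lie on the image border")
    features = []
    for r in range(row - 1, row + 2):
        for c in range(col - 1, col + 2):
            features.extend(image[r][c])
    return features


# Toy 3x3 image with two bands per pixel (values are made up).
image = [[(r * 10 + c, 100 + r * 10 + c) for c in range(3)] for r in range(3)]
# Toy class labels, one per pixel.
labels = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]

x = neighbourhood_features(image, 1, 1)  # 9 pixels x 2 bands = 18 values
y = labels[1][1]                         # class of the centre pixel
print(len(x), y)
```

With more bands per pixel the feature vector grows proportionally, since every pixel in the block contributes one value per band.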

178. Statlog (Shuttle): The shuttle dataset contains 9 attributes all of which are numerical. Approximately 80% of the data belongs to class 1

Statlog（航天飞机）：该数据集包含9个属性，全部为数值型。约80%的数据属于类别1

179. Statlog (Vehicle Silhouettes): Classify 3D objects within a 2D image by applying an ensemble of shape feature extractors to the 2D silhouettes of the objects.

Statlog（车辆轮廓）：对物体的二维轮廓应用一组形状特征提取器，从而对二维图像中的三维物体进行分类。

180. Statlog Project: Various Databases: Vehicle Silhouettes, Landsat Satellite, Shuttle, Australian Credit Approval, Heart Disease, Image Segmentation, German Credit

Statlog项目：多个数据库：车辆轮廓、Landsat卫星、航天飞机、澳大利亚信贷审批、心脏病、图像分割、德国信用

181. Steel Plates Faults: A dataset of steel plates’ faults, classified into 7 different types. The goal was to train machine learning for automatic pattern recognition.

钢板缺陷：钢板缺陷数据集，分为7种不同类型。目标是训练机器学习模型，实现自动模式识别。

182. Student Loan Relational: Student Loan Relational Domain

助学贷款关系：助学贷款关系域

183. Synthetic Control Chart Time Series: This data consists of synthetically generated control charts.

合成控制图时间序列：该数据由合成生成的控制图组成。

184. Syskill and Webert Web Page Ratings: This database contains the HTML source of web pages plus the ratings of a single user on these web pages. Web pages are on four separate subjects (Bands - recording artists; Goats; Sheep; and BioMedical)

Syskill和Webert网页评分：该数据库包含网页的HTML源代码，以及单个用户对这些网页的评分。网页涉及四个不同主题（乐队，即录音艺人；山羊；绵羊；生物医学）

185. Teaching Assistant Evaluation: The data consist of evaluations of teaching performance; scores are "low", "medium", or "high"

助教评价：数据包含对教学表现的评价；得分为“低”、“中”或“高”

186. Thyroid Disease: 10 separate databases from Garavan Institute

甲状腺疾病：来自Garavan研究所的10个独立数据库

187. Tic-Tac-Toe Endgame: Binary classification task on possible configurations of tic-tac-toe game

井字棋残局：对井字棋可能出现的棋局进行二元分类的任务
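To make the binary classification task concrete, the sketch below labels a finished board by whether "x" has three in a row. This is an assumption about what the target could look like, not a statement of how the dataset defines its classes. The 9-character board encoding (`x`, `o`, and `b` for blank, read row by row) and the function name `x_wins` are likewise chosen only for this example.

```python
# Index triples for the 3 rows, 3 columns and 2 diagonals of a 3x3 board.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),
    (0, 3, 6), (1, 4, 7), (2, 5, 8),
    (0, 4, 8), (2, 4, 6),
]


def x_wins(board):
    """Return True if 'x' occupies all three squares of any line.

    `board` is a 9-character string read row by row, using
    'x', 'o' and 'b' (blank).
    """
    if len(board) != 9:
        raise ValueError("board must have exactly 9 squares")
    return any(all(board[i] == "x" for i in line) for line in LINES)


print(x_wins("xxxoobbbb"))  # top row is all 'x'
print(x_wins("oxoxoxxox"))  # full board with no line for 'x'
```

A learned classifier for this task has to rediscover the eight winning lines from examples, which is what makes such a small dataset a useful benchmark.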

188. Trains: 2 data formats (structured, one-instance-per-line)

火车：2种数据格式（结构化、每行一个实例）

189. Twenty Newsgroups: This data set consists of 20000 messages taken from 20 newsgroups.

二十个新闻组：该数据集由取自20个新闻组的20000条消息组成。

190. UJI Pen Characters: Data consists of written characters in a UNIPEN-like format

UJI手写笔字符：数据由类UNIPEN格式的手写字符组成

191. UJI Pen Characters (Version 2): A pen-based database with more than 11k isolated handwritten characters

UJI手写笔字符（第2版）：基于手写笔的数据库，包含超过11000个孤立的手写字符

192. Undocumented: Various datasets without documentation (feel free to explore!)

无文档：没有文档说明的各种数据集（欢迎自由探索！）

193. University: Data in original (LISP-readable) form

大学：原始（LISP可读）形式的数据

194. UNIX User Data: This file contains 9 sets of sanitized user data drawn from the command histories of 8 UNIX computer users at Purdue over the course of up to 2 years.

UNIX用户数据：该文件包含9组经过脱敏处理的用户数据，取自普渡大学8位UNIX计算机用户在最长2年时间内的命令历史记录。

195. URL Reputation: Anonymized 120-day subset of the ICML-09 URL data containing 2.4 million examples and 3.2 million features.

URL信誉：ICML-09 URL数据的匿名化120天子集，包含240万个样本和320万个特征。

196. US Census Data (1990): The USCensus1990raw data set contains a one percent sample of the Public Use Microdata Samples (PUMS) person records drawn from the full 1990 census sample.

美国人口普查数据（1990年）：USCensus1990raw数据集包含从1990年人口普查完整样本中抽取的公共使用微观数据样本（PUMS）个人记录的百分之一样本。

197. Volcanoes on Venus - JARtool experiment: The JARtool project was a pioneering effort to develop an automatic system for cataloging small volcanoes in the large set of Venus images returned by the Magellan spacecraft.

金星上的火山 - JARtool实验：JARtool项目是一项开创性的工作，旨在开发一个自动系统，为麦哲伦号探测器传回的大量金星图像中的小型火山编目。

198. Wall-Following Robot Navigation Data: The data were collected as the SCITOS G5 robot navigates through the room following the wall in a clockwise direction, for 4 rounds, using 24 ultrasound sensors arranged circularly around its 'waist'.

沿墙行走机器人导航数据：数据采集自SCITOS G5机器人沿墙壁按顺时针方向在房间内行走4圈的过程，使用了环绕其“腰部”呈圆形排列的24个超声波传感器。

199. Water Treatment Plant: Multiple classes predict plant state

水处理厂：多个类别用于预测处理厂的运行状态

200. Waveform Database Generator (Version 1): CART book's waveform domains

波形数据库生成器（第1版）：CART一书中的波形域

201. Waveform Database Generator (Version 2): CART book's waveform domains

波形数据库生成器（第2版）：CART一书中的波形域

202. Wine: Using chemical analysis determine the origin of wines

葡萄酒：通过化学分析判定葡萄酒的产地。

203. Wine Quality: Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009], http://www3.dsi.uminho.pt/pcortez/wine/).

葡萄酒质量：包含两个数据集，分别对应葡萄牙北部的红、白vinho verde葡萄酒样本。目标是基于理化检验对葡萄酒质量建模（参见[Cortez et al., 2009]，http://www3.dsi.uminho.pt/pcortez/wine/）。

204. YearPredictionMSD: Prediction of the release year of a song from audio features. Songs are mostly western, commercial tracks ranging from 1922 to 2011, with a peak in the 2000s.

年份预测MSD：根据音频特征预测歌曲的发行年份。歌曲大多为西方商业音轨，发行时间从1922年到2011年，并在2000年代达到峰值。

205. Yeast: Predicting the Cellular Localization Sites of Proteins

酵母DataSet：预测蛋白质的细胞定位点。

206. Zoo: Artificial, 7 classes of animals

动物园DataSet：人工数据集，包含7类动物。

创作不易，转载请注明出处：https://blog.csdn.net/mago2015
