The Latest American English Word Frequency List
2010-04-10 13:04

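(Added April 13: Over the past two days I checked this word list's coverage against some articles from the web and a GMAT document. The 20,000-word list covers nearly everything; only one or two rare words were missing. I estimate that the word families of the first 5,000 words add up to more than 10,000 words, and anyone who can use them fluently already has a very good command of English.)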
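
A coverage check like the one described above can be approximated in a few lines of Python. This is only a minimal sketch: the function name `coverage`, the tokenizing regex, and the toy word list and sentence are all assumptions made for illustration. The real list contains lemmas, so a serious check would lemmatize the text first, whereas the toy list here simply includes the inflected forms.

```python
import re

def coverage(text, word_list):
    """Return the share of word tokens in `text` that appear in `word_list`,
    together with the sorted set of tokens that were not found."""
    known = {w.lower() for w in word_list}
    # Simple tokenizer: runs of letters, optionally with one apostrophe part.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    missing = sorted({t for t in tokens if t not in known})
    covered = sum(1 for t in tokens if t in known)
    ratio = covered / len(tokens) if tokens else 0.0
    return ratio, missing

# Hypothetical toy data; a real check would load the downloaded list
# and a full article instead.
sample_list = ["the", "market", "rise", "rises", "when", "demand", "grow", "grows"]
sample_text = "The market rises when demand grows exponentially."

ratio, missing = coverage(sample_text, sample_list)
print(f"coverage: {ratio:.0%}, not in list: {missing}")
```

Counting tokens rather than distinct words weights frequent words more heavily, which is usually what a learner cares about when asking how much of a text the list covers.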

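I am preparing for an MBA program that starts in August, so lately I have been looking online for word lists to strengthen my vocabulary. The GMAT and GRE word lists contain many obscure words that appear only in specialized writing and are rarely used in everyday study and life, so studying them is not very efficient. Then I found a 6,138-word frequency list that is very popular online, but I gave up before finishing it: it is derived from British English, and its spellings are old-fashioned, to the point of including words like "whilst". In contemporary American usage, "whilst" would surely fall outside the top 20,000 words, which shows how dated that list is. Persistence paid off, though, and I finally found an up-to-date word list based on COCA, the Corpus of Contemporary American English.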

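COCA, the Corpus of Contemporary American English, is the largest American linguistics research project of this century so far, with a standing comparable to that of the influential British National Corpus (BNC). Most of the English frequency lists in use today are derived from the BNC; in other words, they reflect British English, and largely the British English of the 1980s and early 1990s at that.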

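COCA is still growing and currently contains about 400 million words of text. The material covers the 20 years from 1990 to 2009 and includes widely read fiction and magazines (TIME and The New Yorker are among the sources), film and television programs, a large number of recorded telephone and face-to-face conversations, and even the 9/11 report. The texts are classified and counted statistically by period and by type of source, which in effect amounts to compiling a new dictionary of American English usage with frequencies and common usage patterns.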

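Building on the current state of COCA, Brigham Young University has used computational methods to extract the 20,000 most frequent words in American English, together with their collocates.
The list file for the top 5,000 words can already be downloaded:
go to http://www.wordfrequency.info/?freeList=y
and click "download the list" at the bottom of the page.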

In addition, samples of the 5,000-word and 20,000-word e-books (together containing about 5,000 sample words) can be downloaded for free at http://www.wordfrequency.info/files/entries.pdf

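The best thing about this word list is that each word comes not only with its frequency and synonyms but also with its collocates. Collocates are the words that occur most often near a given word and are most closely associated with it. With them, we can see the few dozen most common uses and contexts in which Americans use the word. For example, the top four collocates of "break" are law, heart, news, and rule, so we can guess that its most frequent uses are "break the law", "break someone's heart", "breaking news", and "break the rules". For developing a feel for the language, this is far more useful than the example sentences in a dictionary.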
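
To make the idea of collocates concrete, here is a minimal Python sketch that counts the words appearing within a few tokens of a node word. It is not the method used to build the actual list: the window size, the tiny stopword set, and the toy text are all assumptions, and real collocate lists are typically ranked with an association measure rather than raw counts. The window here also ignores sentence boundaries.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption, not the list the site uses).
STOPWORDS = {"the", "a", "an", "it", "will", "her", "they", "do", "not", "again"}

def collocates(text, node, window=3, top=5):
    """Count words within `window` tokens of each occurrence of `node`,
    skipping stopwords and the node word itself."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            neighbor = tokens[j]
            if j != i and neighbor != node and neighbor not in STOPWORDS:
                counts[neighbor] += 1
    return counts.most_common(top)

# Hypothetical toy text; a real run would use a large corpus.
toy_text = "Do not break the law. It will break her heart. They break the law again."
top_collocates = collocates(toy_text, "break")
print(top_collocates)
```

On a corpus of realistic size, the same counting would surface pairs such as break/law and break/heart for the reason described above: those words simply co-occur with "break" far more often than others do.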

Below is the site's own English description of the list's features; you can also read it directly at http://www.wordfrequency.info.

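Also, if you help promote the list by posting about it on a large forum for English learners (one post is enough) and then email them the link, you can get the e-book with the 5,000-word frequency list and collocates for free. The print edition of the book is also available on Amazon.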

So far, this is the best word list I have seen.

COMPARE (to data from the British National Corpus / American National Corpus)
There are many English word lists and frequency lists out on the Web. Some are good, some are very bad. Not all frequency lists are created equal.
One should be very, very suspicious of word lists that are taken from small samples of web data, outdated texts, or corpora that are too small to effectively model what is happening in the real world. Or worse, word lists that don't give you any idea what they are based on. As the saying goes: "garbage in (bad texts), garbage out (frequency lists)". 
Rather than focusing too much on a comparison with specific wordlists that are out there on the Web, here are some questions you might ask yourself as you consider downloading or purchasing a word list:
Depth and accuracy. Why do so many wordlists on the web contain just the top 1000-3000 words of English? Why not the top 10,000 or 20,000? It's because even a bad corpus (the collection of texts that the word lists are based on) can produce a moderately accurate list for the very most frequent words. But because the corpus is neither deep nor balanced enough, you start getting messy data for medium and lower frequency words. Ask to see samples of the top 10,000 or 20,000 words (e.g. every 7th or 10th word). If they don't have it, then you should be very, very suspicious of that word list.
Genres. Does the corpus contain texts from a wide variety of genres -- spoken, fiction, popular magazines, newspapers, and academic journals? Frequency lists that are based on just one of these may only contain 40-50% of the words from a more balanced corpus. Our frequency list is based on the Corpus of Contemporary American English (COCA), which is almost perfectly balanced across genres.
Size. COCA contains more than 400 million words, and each of the top 20,000 words occurs at least 300 times. In a small 10-20 million word corpus, some of these words would occur just 7-8 times. At that point, the lower frequency words might make it into the list "by chance", whereas others are left out. No such problem with COCA.
How recent is it? Language change happens. If the word list is based on 15-20 year-old texts (or much worse, 100 year old public domain novels), then it will be missing many of the words from the modern language. COCA is based on texts from 1990-2009 (20 million words each year), or in other words, virtually right up to the current time.
Is it just a bare wordlist? Word lists are nice, but to be really useful (especially for language learning) there ought to be some indication of what these words mean and how they are used. Most of our frequency lists contain the top 20-30 collocates (nearby words) for each word in the list, which creates a great "sketch" of each word.
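
The arithmetic behind the "Size" point above can be reproduced with simple proportional scaling. The sketch below assumes that a word's count scales linearly with corpus size, which is a simplification (real word distributions are burstier); the function name and constants are illustrative, with the figures taken from the text.

```python
def expected_count(count_in_big, big_size, small_size):
    """Scale a raw count linearly from one corpus size to another."""
    return count_in_big * small_size / big_size

COCA_SIZE = 400_000_000  # "more than 400 million words", per the text
MIN_COUNT = 300          # minimum count of a top-20,000 word, per the text

for small_size in (10_000_000, 20_000_000):
    estimate = expected_count(MIN_COUNT, COCA_SIZE, small_size)
    print(f"{small_size:,} words: about {estimate} occurrences")
```

Under this assumption, a word at the 300-occurrence threshold would appear about 7.5 times in a 10 million word corpus, which matches the "7-8 times" figure, and about 15 times in a 20 million word corpus. Counts that low are easily distorted by a single text, which is the point the passage makes.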
--------------------------------------------------------------------------------
Summary. There are many word frequency lists out on the web. Some are just OK, and some are truly bad. The frequency lists that we have created are the only ones that are based on a large, recent, and balanced corpus of English, and which provide indications of the meaning and use of each word.

Word frequency lists and dictionary 
from the Corpus of Contemporary American English


This site contains what we believe is the most accurate frequency data of English, and it comes in a number of different formats (see the table below).

Any frequency list is only as good as the corpus (collection of texts) that it is based on. Our data is based on the only large, genre-balanced, up-to-date corpus of American English -- the 450 million word Corpus of Contemporary American English. You can be sure that the data that you find here represents what you would encounter in the real world.

If you are a language learner, you can use the frequency lists to maximize your study of vocabulary in a way that is not possible with any other resource.  If you are a (computational) linguist, you will have access to highly accurate, robust and useful data for research and for Natural Language Processing. (More information on how to use this data.)

The English frequency data comes in a number of different formats, shown below. You can also get frequency data for Spanish and Portuguese or Academic English.

Basic word lists

Top 5,000-60,000 words (lemmas)

Genre frequency

See the frequency of each of the top 60,000 lemmas -- in spoken, fiction, popular magazine, newspapers, and academic, as well as more than 40 sub-genres like NEWS-Financial or ACAD-Medicine. You can then use this data to create your own customized lists for particular genres and sub-genres.

Collocates

Collocates = "nearby words", and they provide great insight into the meaning and use of words -- more than any other lists. See (a maximum of) 200-300 collocates for each of the 60,000 words, giving nearly 4,800,000 node word / collocate pairs.

N-grams

Up to 155 million unique 2- to 5-grams (sequences of 2-5 words), with frequencies for each string. Allows you to search for the patterns in which a word occurs.

eBook

The 20,000 most frequent words (lemmas) in American English, along with the 20-30 most frequent collocates and the synonyms for each word

Printed book

(From Routledge). The top 5,000 words (including collocates) and thematic lists

Free word list

Basic list of the top 5,000 lemmas
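
As an illustration of the n-grams entry in the list above, the following Python sketch counts every 2- to 5-word sequence in a text. It is only a toy version under stated assumptions: the function name, the tokenizer, and the sample sentence are made up for the example, and it says nothing about how the site's own n-gram files were produced.

```python
import re
from collections import Counter

def ngram_counts(text, n_min=2, n_max=5):
    """Count all word sequences of length n_min..n_max in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Hypothetical toy sentence; a real run would use a large corpus.
toy = "break the law and break the rules"
counts = ngram_counts(toy)
print(counts["break the"], counts["break the law"])
```

Looking up the n-grams that start with a given word is one simple way to see the patterns in which that word occurs.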
