My Odyssey: Finding the Most Popular Python Function

Introduction

The other day, while I was running some zipped lists through a map(), I couldn't help noticing how much my Python style has changed over the years.

We have all asked ourselves this question before: what do other people do with this beautiful language? Which functions do they use?

As a data scientist, I aimed for something slightly more measurable: what is the most mentioned Python functionality in GitHub commits?

In what follows, I will:

  1. Discuss the limitations of such a question and the many ways in which I failed to find the answer
  2. Show how I collected the data from GitHub
  3. And most importantly, teach you how to lure Medium readers to your article with cool racing bars

Limitations

Initially, I started this project to figure out how often Python functions are called. I quickly noticed that on GitHub you can look this up in no time: just use the search function!

Number of print() calls found on GitHub. Image by Author.

Problem Solved!

Well, not quite…

The issue is that these results are volatile. Running the same search several times can return very different counts. Here is what we get when running it again:

Number of print() calls found on GitHub when running the search again. Image by Author.

We get a very different result…

GitHub API

GitHub has a fantastic search API!

Problem Solved!

Well, not quite…

The issue here is that the API only offers the first 34k results or so for code searches. After trying for quite some time to get something useful out of it, I had to accept that GitHub won't let me do it this way. Sadly, our question can't be answered the easy way.

GitHub Search Function via Commits

After quite some time, I discovered that one can search commits by language (here Python) and by date!

Problem Solved!

Well, not quite…

While this way of searching seems to be quite reliable, it produces a lot of false positives. For example, it will show commits to repositories that contain only a little bit of Python. Such a commit may then merely include the words or functions in some sense.

While this is not ideal, I decided to take this route since it allows for a comparison over time. I also tried every other way I could think of; if you find a better one, please let me know in the comments. Generally, this data has to be taken with a lot of skepticism, but I hope it teaches us some valuable lessons. Most certainly, it creates a killer plot ;)

Data Collection

We now have an approximate way of finding the answer. All we have to do now is call the GitHub API!

Problem Solved!

Well, not quite…

The issue seems to be that this API is meant more for actual searches inside your own repositories. GitHub seems to have a hard limit on the number of links it returns. It appears to search for X seconds, then stop and return whatever it has found so far. This makes a lot of sense, since dealing with such vast amounts of data is very expensive. Sadly, it also makes our journey to an answer so much harder.

Since we refuse to give up, we decide to call their website and parse the answer from the returned HTML! While this is neither elegant nor simple, we ain’t no quitters.

Let's build our link. An example link might look like this:

https://github.com/search?q={function}%28+language%3A{Language}+type%3Acommits+committer-date%3A%3C{before_year}-01-01&type=commits
Example link, Image by Author

As we can see, the link has basically three parameters:

  1. function: What function do we want to know about? e.g. len()
  2. language: What programming language? e.g. Python
  3. before_year: Before what year? e.g. 2000

When we feed these parameters to GitHub, it tells us how many commits mentioning that function were made before that date!
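As a minimal sketch, the same link could also be assembled with the standard library's urllib.parse.urlencode, which percent-encodes the parenthesis, the colons and the "<" sign for us. The helper name build_search_url is hypothetical and not part of the original code; the query layout is simply copied from the example link.

```python
from urllib.parse import urlencode


def build_search_url(function, language, before_year):
    """Build a GitHub commit-search URL (hypothetical helper, not from the article)."""
    # Same query as in the example link, written in plain text first
    query = f"{function}( language:{language} type:commits committer-date:<{before_year}-01-01"
    # urlencode percent-encodes "(", ":" and "<" and turns spaces into "+"
    return "https://github.com/search?" + urlencode({"q": query, "type": "commits"})


print(build_search_url("len", "Python", 2000))
```

For len, Python and 2000 this reproduces the example link character for character, so nothing has to be escaped by hand.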

Calling this link returns an HTML page that we can filter to get our answer. The code for doing this could look like this:

import urllib.request

search_term = 'len'
language = 'Python'
before_year = 2000

# create the url using a search term, a language and a year
url_base = (
    f"https://github.com/search?q={search_term}%28+language%3A{language}"
    f"+type%3Acommits+committer-date%3A%3C{before_year}-01-01&type=commits"
)

# fetch the search page and decode the returned bytes into a string
fp = urllib.request.urlopen(url_base)
byte_html = fp.read()
org_html = byte_html.decode("utf8")
fp.close()

To filter the resulting HTML, we can then, for example, use regex. We could also use BeautifulSoup or some other lovely HTML-parsing library, but using regex keeps this article quite a bit more readable. In this specific case, we only care about one number, which makes it faster to simply look for that single number.

import re

find_count = re.compile(r'([0-9,]+) (available|commit)')

The above regex ‘find_count’ finds the string “44,363 commits”. Using the matching group (everything inside the “()”), we can then extract the number “44,363” from that string.
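To make the matching group concrete, here is a small sketch that applies the article's regex to a made-up snippet of text and converts the captured group into an integer. The function name parse_count and the sample strings are assumptions for illustration only; the real GitHub page layout may differ.

```python
import re

# The regex from the article: a run of digits/commas followed by "available" or "commit"
find_count = re.compile(r'([0-9,]+) (available|commit)')


def parse_count(text):
    """Return the first commit count found in text as an int, or None if there is none."""
    match = find_count.search(text)
    if match is None:
        return None
    # group(1) is the part inside the first "()", e.g. "44,363"
    digits = match.group(1).replace(',', '')
    # Guard against a lone comma matching the character class
    return int(digits) if digits else None


print(parse_count("Showing 44,363 commits"))
print(parse_count("nothing to see here"))
```

Returning None instead of raising keeps the calling loop free to decide whether a missing number means throttling or an empty result.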

The full code to do this is:

def search_git_get_count(terms, file_name='freq.csv', language="Python"):
    """
    Collects, for every year from START_YEAR to END_YEAR, how many commits mention each term
    :param terms: list of terms that we want to aggregate e.g. ["print", "len"]
    :param file_name: name of the CSV file the results are written to
    :param language: language we want to search for e.g. "Python"
    :return: file name of the CSV including the results
    """
    function_calls_by_date = []
    print(f"Starting to gather the data this will take approx. "
          f"{(len(terms) * (END_YEAR - START_YEAR + 1)) // 10} minutes")
    for before_year in [str(i) for i in range(START_YEAR, END_YEAR + 1)]:
        # init a dict to store the data
        year_overview = {'date': f'{before_year}-01-01'}
        for search_term in terms:
            year_overview[search_term + "()"] = 0
        for search_term in terms:
            while True:
                try:
                    # example: https://github.com/search?q=len%28+language%3APython+type%3Acommits+committer-date%3A%3C2000-01-01&type=commits
                    url_base = (
                        f"https://github.com/search?q={search_term}%28+language%3A{language}"
                        f"+type%3Acommits+committer-date%3A%3C{before_year}-01-01&type=commits"
                    )
                    fp = urllib.request.urlopen(url_base)
                    byte_html = fp.read()
                    org_html = byte_html.decode("utf8")
                    fp.close()
                    if "You have triggered an abuse detection mechanism." in org_html:
                        # in this case we really should wait 10min
                        print("GitHub thinks we are malicious, we will wait for 10 minutes")
                        sleep(60 * 10)
                        continue
                    # the count is expected in this slice of the page
                    string_html = org_html[55000:56000].replace('\n', '')
                    ret = next(re.finditer(find_count, string_html)).groups()[0]
                    ret = int(ret.replace(',', ''))
                    year_overview[search_term + "()"] = ret
                    break
                except Exception as e:
                    # We can call api 30 times per minute
                    print("We had an error with the API throttling us...", e)
                    print("With request:", url_base)
                    print("We will sleep for 60 seconds")
                    sleep(60)
                    continue
        function_calls_by_date.append(year_overview)
        print(f"Aggregated year: {before_year}")
    function_calls_by_date = pd.DataFrame(function_calls_by_date)
    function_calls_by_date.to_csv(file_name, index=False)
    print(f"Dumped data to:{file_name}")
    return file_name

As we can see, we iterate over all terms and years to collect one data point per function and year. Then we parse the result from the HTML and store it. The rest of the code is there to ensure that we comply with GitHub's rate limiting and do not get banned while accumulating our data!
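The wait-and-retry part of that loop can be looked at in isolation. The sketch below is not the article's code: call_with_retries, its parameters and the fake fetch function are hypothetical, and it adds an upper bound on attempts that the original loop does not have. Passing the sleep function in as an argument makes the behaviour easy to try out without actually waiting.

```python
import time


def call_with_retries(fetch, wait_seconds=60, max_attempts=5, sleep=time.sleep):
    """Call fetch() until it succeeds, sleeping between failed attempts (hypothetical helper)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as error:  # broad on purpose, mirroring the article's loop
            last_error = error
            if attempt < max_attempts:
                sleep(wait_seconds)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# Demo: a fake fetch that is "throttled" twice, and a fake sleep that only records the waits
attempts = []
waits = []


def flaky_fetch():
    attempts.append(1)
    if len(attempts) < 3:
        raise IOError("throttled")
    return "<html>44,363 commits</html>"


page = call_with_retries(flaky_fetch, sleep=waits.append)
print(page, waits)
```

Bounding the number of attempts is a design choice of this sketch; the loop in the article instead keeps retrying until a request succeeds.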

GitHub doesn't seem to enjoy us calling their relatively expensive functions all the time ;) I ran this for 20 years and 20 functions, and it took over 80 minutes, which I found quite surprising.

Finally, we have collected the data we desired and can now show off with some cool plots!

Visualization

We now have a data frame that looks roughly like this:

date,print(),len(),join()
2000-01-01,677545,44165,23534
2001-01-01,859815,66593,40032
2002-01-01,1091170,93604,59618
2003-01-01,1391283,117548,80327
2004-01-01,1755368,152962,125238
2005-01-01,2049569,185497,173200

For each year, it holds the number of commits mentioning each function. This kind of data is especially nice to visualize.
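Because every query counts commits made before a given date, each row is a running total rather than a per-year figure. As a rough sketch using only the standard library, the yearly increase can be derived by subtracting consecutive rows. The function name yearly_increase is hypothetical; the sample rows are the first three rows of the data above.

```python
import csv
import io

SAMPLE_CSV = """date,print(),len(),join()
2000-01-01,677545,44165,23534
2001-01-01,859815,66593,40032
2002-01-01,1091170,93604,59618
"""


def yearly_increase(csv_text):
    """Return, per date, how much each column grew since the previous row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    increases = []
    for previous, current in zip(rows, rows[1:]):
        delta = {"date": current["date"]}
        for column in current:
            if column != "date":
                delta[column] = int(current[column]) - int(previous[column])
        increases.append(delta)
    return increases


for row in yearly_increase(SAMPLE_CSV):
    print(row)
```

For the racing bars the cumulative form is what we want, so this is only a way to sanity-check how fast each function grows from one year to the next.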

To visualize data over time, I think racing bars are the coolest. While they may not be the most informative ones, they look incredible!

What we need is a CSV that has, for each date, several categories. Once we have such a CSV, we can easily use the fantastic bar_chart_race library.

Note: The version installed via pip seems not to be entirely up to date, so install the library from GitHub instead:

python -m pip install git+https://github.com/dexplo/bar_chart_race

Now, all that's left to do is pass our CSV to the function, creating a beautiful GIF.

def plot_search_term_data(file):
    """
    This function plots our df
    :param file: file name of the csv, expects a "date" column
    """
    df = pd.read_csv(file).set_index('date')
    bcr.bar_chart_race(
        df=df,
        filename=file.replace('.csv', '.gif'),
        orientation='h',
        sort='desc',
        n_bars=len(df.columns),
        fixed_order=False,
        fixed_max=True,
        steps_per_period=10,
        period_length=700,
        interpolate_period=False,
        period_label={'x': .98, 'y': .3, 'ha': 'right', 'va': 'center'},
        # summary text showing the total over all functions for the current period
        period_summary_func=lambda v, r: {'x': .98, 'y': .17,
                                          's': f'Calls: {v.sum():,.0f}',
                                          'ha': 'right', 'size': 11},
        perpendicular_bar_func='median',
        title='Most Mentioned Python Functionality Over Time',
        bar_size=.95,
        shared_fontdict=None,
        scale='linear',
        fig=None,
        writer=None,
        bar_kwargs={'alpha': .7},
        filter_column_colors=False)

The most mentioned Python functions inside Python repositories, calculated via GitHub commits. Image by Author.

Conclusion

We have seen how we can gather data directly from HTML using regex instead of the usual bs4. While this approach should not be used for larger projects, for simple quests such as this one it is a must. We have also seen that the most obvious data source may not always work.

Finally, we discovered a lovely new library and a way to create beautiful racing bars that will capture your viewers' interest!

If you enjoyed this article, feel free to connect on Twitter or LinkedIn.

Entire Code

import re
import urllib.request
from time import sleep

import bar_chart_race as bcr
import pandas as pd

# matches e.g. "44,363 commits" and captures the number
find_count = re.compile(r'([0-9,]+) (available|commit)')
START_YEAR = 2000
END_YEAR = 2020


def search_git_get_count(terms, file_name='freq.csv', language="Python"):
    """
    Collects, for every year from START_YEAR to END_YEAR, how many commits mention each term
    :param terms: list of terms that we want to aggregate e.g. ["print", "len"]
    :param file_name: name of the CSV file the results are written to
    :param language: language we want to search for e.g. "Python"
    :return: file name of the CSV including the results
    """
    function_calls_by_date = []
    print(f"Starting to gather the data this will take approx. "
          f"{(len(terms) * (END_YEAR - START_YEAR + 1)) // 10} minutes")
    for before_year in [str(i) for i in range(START_YEAR, END_YEAR + 1)]:
        # init a dict to store the data
        year_overview = {'date': f'{before_year}-01-01'}
        for search_term in terms:
            year_overview[search_term + "()"] = 0
        for search_term in terms:
            while True:
                try:
                    # example: https://github.com/search?q=len%28+language%3APython+type%3Acommits+committer-date%3A%3C2000-01-01&type=commits
                    url_base = (
                        f"https://github.com/search?q={search_term}%28+language%3A{language}"
                        f"+type%3Acommits+committer-date%3A%3C{before_year}-01-01&type=commits"
                    )
                    fp = urllib.request.urlopen(url_base)
                    byte_html = fp.read()
                    org_html = byte_html.decode("utf8")
                    fp.close()
                    if "You have triggered an abuse detection mechanism." in org_html:
                        # in this case we really should wait 10min
                        print("GitHub thinks we are malicious, we will wait for 10 minutes")
                        sleep(60 * 10)
                        continue
                    # the count is expected in this slice of the page
                    string_html = org_html[55000:56000].replace('\n', '')
                    ret = next(re.finditer(find_count, string_html)).groups()[0]
                    ret = int(ret.replace(',', ''))
                    year_overview[search_term + "()"] = ret
                    break
                except Exception as e:
                    # We can call api 30 times per minute
                    print("We had an error with the API throttling us...", e)
                    print("With request:", url_base)
                    print("We will sleep for 60 seconds")
                    sleep(60)
                    continue
        function_calls_by_date.append(year_overview)
        print(f"Aggregated year: {before_year}")
    function_calls_by_date = pd.DataFrame(function_calls_by_date)
    function_calls_by_date.to_csv(file_name, index=False)
    print(f"Dumped data to:{file_name}")
    return file_name


def plot_search_term_data(file):
    """
    This function plots the aggregated function calls from git as a gif
    :param file: file name of the csv, expects a "date" column
    """
    df = pd.read_csv(file).set_index('date')
    print(df)
    bcr.bar_chart_race(
        df=df,
        filename=file.replace('.csv', '.gif'),
        orientation='h',
        sort='desc',
        n_bars=len(df.columns),
        fixed_order=False,
        fixed_max=True,
        steps_per_period=10,
        period_length=700,
        interpolate_period=False,
        period_label={'x': .98, 'y': .3, 'ha': 'right', 'va': 'center'},
        period_summary_func=lambda v, r: {'x': .98, 'y': .17,
                                          's': f'Calls: {v.sum():,.0f} by Sandro Luck \nTwitter:@san_sluck',
                                          'ha': 'right', 'size': 11},
        perpendicular_bar_func='median',
        title='Most Mentioned Python Functionality Over Time',
        bar_size=.95,
        shared_fontdict=None,
        scale='linear',
        fig=None,
        writer=None,
        bar_kwargs={'alpha': .7},
        filter_column_colors=False)


if __name__ == '__main__':
    terms = ["print", "len", "join", "split", "sorted", "range", "strip", "str", "type", "map", "enumerate", "count",
             "index", "zip", "replace", "iter", "int", "append", "find", "super"]
    file_name = search_git_get_count(terms)
    # note: this overrides the file name returned above, so the plot is drawn from 'fixed.csv'
    file_name = 'fixed.csv'
    plot_search_term_data(file_name)

Translated from: https://towardsdatascience.com/my-odyssey-finding-the-most-popular-python-function-6aa216db047c
