

The other day, while I was running some zip() with some lists through a map(), I couldn’t stop noticing how much my Python style has changed over the years.

We have all asked ourselves this question before: what do other people do with this beautiful language? What functions do they use?

As a data scientist, I aimed at something slightly more measurable: what is the most mentioned Python functionality in GitHub commits?

In this article, I will

  1. Discuss the limitations of such a question and the many ways in which I failed to find the answer
  2. Show how I collected the data from GitHub
  3. And most importantly, teach you how to lure Medium readers to your article with cool racing bars


Initially, I started this project to figure out how often Python functions are called. I quickly noticed that on GitHub you can look this up in no time. Use the search function!

Amount of print() functions on GitHub, Image by Author

Problem Solved!


Well not quite…


The issue is that these results are volatile. Calling this search several times can return a different number of results each time! For example, when calling it again:

Amount of print() functions on GitHub when calling it again, Image by Author.

We get a very different result…


GitHub API

GitHub has a fantastic search API!


Problem Solved!


Well not quite…


The issue here is that the API only offers roughly the first 34k results for code searches. After trying for quite some time to get something useful out of it, I had to accept that GitHub won’t let me do it this way. Sadly, our question can’t be answered the easy way.

GitHub Search Function via Commits

After quite some time, I discovered that one can search commits in the Python language by date!


Problem Solved!


Well not quite…


While this way of searching seems to be quite reliable, it produces a lot of false positives. For example, it will show commits to repositories that contain only a little bit of Python. Such a commit may then include the words or functions only in some loose sense.

While this is not ideal, I decided to take this route since it allows for a comparison over time. I also tried every other way I could think of; if you have found a better one, please let me know in the comments. Generally, this data has to be taken with a lot of skepticism, but I hope it teaches us some valuable lessons. Most certainly, it creates a killer plot ;)

Data Collection

We now have an approximate way to find the answer. All we have to do is call the GitHub API!

Problem Solved!


Well not quite…


The issue seems to be that this API is meant more for actual searches inside your own repositories. GitHub seems to have a hard limit on the number of links it returns. It seems to search for X seconds, then stop and return whatever it has found so far. This makes a lot of sense, since dealing with such vast amounts of data is very expensive. Sadly, it also makes our journey to an answer so much harder.

Since we refuse to give up, we decide to call their website and parse the answer from the returned HTML! While this is neither elegant nor simple, we ain’t no quitters.


Let’s build our link. An example link might look like this:

Example link, Image by Author

As we can see, we are basically looking for three things:

function: What function do we want to know about? e.g. len()
language: What programming language? e.g. Python
before_year: Before what year? e.g. 2000

When we feed these parameters to GitHub, it tells us how many commits mentioning the function were made before that date!
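The link can also be assembled programmatically. The following is a minimal sketch, assuming the query layout of the example link; `build_search_url` is a hypothetical helper name introduced here for illustration, and GitHub may change its search URL format at any time.

```python
from urllib.parse import urlencode


def build_search_url(function, language, before_year):
    """Build a GitHub commit-search URL (hypothetical helper for illustration)."""
    # Raw query, e.g.: len( language:Python type:commits committer-date:<2000-01-01
    query = (f"{function}( language:{language} type:commits "
             f"committer-date:<{before_year}-01-01")
    # urlencode percent-encodes "(", ":" and "<" and turns spaces into "+"
    return "https://github.com/search?" + urlencode({"q": query, "type": "commits"})


print(build_search_url("len", "Python", 2000))
```

Letting `urlencode` do the escaping avoids hand-writing sequences such as `%28` and `%3C`, which are easy to get wrong.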


After calling this link, GitHub returns an HTML file that we can filter to get our answer. The code for this could look like:

import urllib.request

search_term = 'len'
language = 'Python'
befor_year = 2000

# create the url using a search term, a language and a year ("<" is encoded as %3C)
url_base = f"https://github.com/search?l=Python&q={search_term}%28+language%3A{language}+type%3Acommits+committer-date%3A%3C{befor_year}-01-01&type=commits"
fp = urllib.request.urlopen(url_base)
byte_html = fp.read()
org_html = byte_html.decode("utf8")
fp.close()

To filter the resulting HTML, we can then, for example, use regex. We could also use BeautifulSoup or some other lovely HTML-parsing library, but using regex keeps this article quite a bit more readable. In this specific case, we only care about one number, which makes it faster to simply look for that single number.

import re

find_count = re.compile(r'([0-9,]+) (available|commit)')

The above regex find_count finds the string “44,363 commits”. Using the matching group (everything inside the “()”), we can then extract the number “44,363” from that string.
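As a small illustration of the matching group, here is a sketch that runs the same pattern on a made-up snippet (not real GitHub HTML). The helper name `parse_count` is an assumption introduced for this example.

```python
import re

find_count = re.compile(r'([0-9,]+) (available|commit)')


def parse_count(html):
    """Return the first count found in html as an int, or None if there is none."""
    match = find_count.search(html)
    if match is None:
        return None
    # group(1) is the text captured by the first "()", e.g. "44,363"
    return int(match.group(1).replace(',', ''))


sample = '<div>44,363 commit results</div>'  # made-up snippet for illustration
print(parse_count(sample))  # 44363
```

Returning None instead of raising makes it easy to tell "no count on the page" apart from a network error.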

The full code to do this for all terms and years is:


from time import sleep
import pandas as pd

# range of years to query (the docstring below states 2000 to 2020)
START_YEAR = 2000
END_YEAR = 2020


def search_git_get_count(terms, file_name='freq.csv', language="Python"):
    """
    Collects the number of commits mentioning each function in terms, for every year from 2000 to 2020
    :param terms: array of terms that we want to aggregate e.g. ["print", "len"]
    :param file_name: name of the CSV file the results are written to
    :param language: Language we want to search for e.g. "Python"
    :return: Filename of the CSV holding the results
    """
    function_calls_by_date = []
    print(f"Starting to gather the data this will take approx. {(len(terms) * (END_YEAR - START_YEAR + 1)) // 10} minutes")
    for befor_year in [str(i) for i in range(START_YEAR, END_YEAR + 1)]:
        # init a dict to store the data
        year_overview = {'date': f'{befor_year}-01-01', }
        for search_term in terms:
            year_overview[search_term + "()"] = 0
        for search_term in terms:
            while True:
                try:
                    # example: https://github.com/search?q=len%28+language%3APython+type%3Acommits+committer-date%3A%3C2000-01-01&type=commits
                    url_base = f"https://github.com/search?q={search_term}%28+language%3A{language}+type%3Acommits+committer-date%3A%3C{befor_year}-01-01&type=commits"
                    fp = urllib.request.urlopen(url_base)
                    byte_html = fp.read()
                    org_html = byte_html.decode("utf8")
                    fp.close()
                    if "You have triggered an abuse detection mechanism." in org_html:
                        # in this case we really should wait 10min
                        print("GitHub thinks we are malicious, we will wait for 10 minutes")
                        sleep(60 * 10)
                        continue
                    # the count sits in this slice of the page; searching only here is faster
                    string_html = org_html[55000:56000].replace('\n', '')
                    ret = next(re.finditer(find_count, string_html)).groups()[0]
                    ret = int(ret.replace(',', ''))
                    year_overview[search_term + "()"] = ret
                    break
                except Exception as e:
                    # We can call api 30 times per minute
                    print("We had an error with the API throttling us...", e)
                    print("With request:", url_base)
                    print("We will sleep for 60 seconds")
                    sleep(60)
                    continue
        function_calls_by_date.append(year_overview)
        print(f"Aggregated year: {befor_year}")
    function_calls_by_date = pd.DataFrame(function_calls_by_date)
    function_calls_by_date.to_csv(file_name, index=False)
    print(f"Dumped data to:{file_name}")
    return file_name

As we can see, we iterate over all terms and years to collect one data point for each of the functions. Then we parse the result from the HTML and store it. The rest of the process is there to ensure that we comply with GitHub’s rate limiting and do not get banned while accumulating our data!

GitHub seems to not enjoy us calling their relatively expensive functions all the time ;) I ran this for 20 years and 20 functions, and it took over 80 minutes, which I found quite surprising.
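The waiting logic in the function above retries inside `while True`, so a request that can never succeed (for example because the page layout changed) would loop forever. A bounded variant could look like the sketch below. `call_with_retry`, `max_tries` and `wait_seconds` are hypothetical names introduced here for illustration and are not part of the original script.

```python
import time


def call_with_retry(fetch, max_tries=5, wait_seconds=60, sleep=time.sleep):
    """Call fetch() until it succeeds, pausing between failed attempts."""
    for attempt in range(1, max_tries + 1):
        try:
            return fetch()
        except Exception as error:
            if attempt == max_tries:
                raise  # give up instead of looping forever
            print(f"Attempt {attempt} failed ({error}); sleeping {wait_seconds}s")
            sleep(wait_seconds)
```

Passing `sleep` as a parameter is a design choice that lets the helper be tried out without actually waiting.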


Finally, we have collected the data we desired and can now show off with some cool plots!


Visualization

We now have a data frame that looks roughly like this:



For each year, it holds the number of commits mentioning each function. This kind of data is especially nice to visualize.
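To make the layout concrete, here is a small sketch with made-up numbers (not real GitHub counts): one row per date and one column per function, which is the wide format the collection code writes to CSV.

```python
import pandas as pd

# Made-up counts purely to illustrate the layout; not real GitHub data.
rows = [
    {"date": "2018-01-01", "print()": 100, "len()": 40},
    {"date": "2019-01-01", "print()": 150, "len()": 70},
]
df = pd.DataFrame(rows).set_index("date")
print(df)
```

Each dictionary corresponds to one `year_overview` entry from the collection step.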

To visualize data over time, I think racing bars are the coolest. While they may not be the most informative ones, they look incredible!


What we need is a CSV that has, for each date, several categories. Once we have such a CSV, we can easily use the fantastic bar_chart_race library.


Note: The version installed via pip seems not to be entirely up to date, so install it directly from GitHub:


python -m pip install git+https://github.com/dexplo/bar_chart_race

Now, all that’s left to do is pass our CSV to the function, which creates a beautiful GIF.


import pandas as pd
import bar_chart_race as bcr


def plot_search_term_data(file):
    """
    This function plots our df
    :param file: file name of the csv, expects a "date" column
    """
    df = pd.read_csv(file).set_index('date')
    bcr.bar_chart_race(
        df=df,
        filename=file.replace('.csv', '.gif'),
        orientation='h',
        sort='desc',
        n_bars=len(df.columns),
        fixed_order=False,
        fixed_max=True,
        steps_per_period=10,
        period_length=700,
        interpolate_period=False,
        period_label={'x': .98, 'y': .3, 'ha': 'right', 'va': 'center'},
        period_summary_func=lambda v, r: {'x': .98, 'y': .17,
                                          's': f'Calls: {v.sum():,.0f}',
                                          'ha': 'right', 'size': 11},
        perpendicular_bar_func='median',
        title='Most Mentioned Python Functionality Over Time',
        bar_size=.95,
        shared_fontdict=None,
        scale='linear',
        fig=None,
        writer=None,
        bar_kwargs={'alpha': .7},
        filter_column_colors=False)

The most mentioned Python functions inside Python repositories, calculated via GitHub commits. Image by Author


We have seen how we can gather data directly from HTML using regex instead of the usual bs4. While this approach should not be used for more significant projects, it works well for simple quests such as this one. We have also seen that the most obvious data source may not always work.

Finally, we discovered a lovely new library and a way to create beautiful racing bars that will capture your viewers’ interest!


Entire Code

import re
import urllib.request
from time import sleep

import pandas as pd
import bar_chart_race as bcr

find_count = re.compile(r'([0-9,]+) (available|commit)')
# range of years to query (the docstring below states 2000 to 2020)
START_YEAR = 2000
END_YEAR = 2020


def search_git_get_count(terms, file_name='freq.csv', language="Python"):
    """
    Collects the number of commits mentioning each function in terms, for every year from 2000 to 2020
    :param terms: array of terms that we want to aggregate e.g. ["print", "len"]
    :param file_name: name of the CSV file the results are written to
    :param language: Language we want to search for e.g. "Python"
    :return: Filename of the CSV holding the results
    """
    function_calls_by_date = []
    print(f"Starting to gather the data this will take approx. {(len(terms) * (END_YEAR - START_YEAR + 1)) // 10} minutes")
    for befor_year in [str(i) for i in range(START_YEAR, END_YEAR + 1)]:
        # init a dict to store the data
        year_overview = {'date': f'{befor_year}-01-01', }
        for search_term in terms:
            year_overview[search_term + "()"] = 0
        for search_term in terms:
            while True:
                try:
                    # example: https://github.com/search?q=len%28+language%3APython+type%3Acommits+committer-date%3A%3C2000-01-01&type=commits
                    url_base = f"https://github.com/search?q={search_term}%28+language%3A{language}+type%3Acommits+committer-date%3A%3C{befor_year}-01-01&type=commits"
                    fp = urllib.request.urlopen(url_base)
                    byte_html = fp.read()
                    org_html = byte_html.decode("utf8")
                    fp.close()
                    if "You have triggered an abuse detection mechanism." in org_html:
                        # in this case we really should wait 10min
                        print("GitHub thinks we are malicious, we will wait for 10 minutes")
                        sleep(60 * 10)
                        continue
                    # the count sits in this slice of the page; searching only here is faster
                    string_html = org_html[55000:56000].replace('\n', '')
                    ret = next(re.finditer(find_count, string_html)).groups()[0]
                    ret = int(ret.replace(',', ''))
                    year_overview[search_term + "()"] = ret
                    break
                except Exception as e:
                    # We can call api 30 times per minute
                    print("We had an error with the API throttling us...", e)
                    print("With request:", url_base)
                    print("We will sleep for 60 seconds")
                    sleep(60)
                    continue
        function_calls_by_date.append(year_overview)
        print(f"Aggregated year: {befor_year}")
    function_calls_by_date = pd.DataFrame(function_calls_by_date)
    function_calls_by_date.to_csv(file_name, index=False)
    print(f"Dumped data to:{file_name}")
    return file_name


def plot_search_term_data(file):
    """
    This function plots the aggregated function calls from git as a gif
    :param file: file name of the csv, expects a "date" column
    """
    df = pd.read_csv(file).set_index('date')
    print(df)
    bcr.bar_chart_race(
        df=df,
        filename=file.replace('.csv', '.gif'),
        orientation='h',
        sort='desc',
        n_bars=len(df.columns),
        fixed_order=False,
        fixed_max=True,
        steps_per_period=10,
        period_length=700,
        interpolate_period=False,
        period_label={'x': .98, 'y': .3, 'ha': 'right', 'va': 'center'},
        period_summary_func=lambda v, r: {'x': .98, 'y': .17,
                                          's': f'Calls: {v.sum():,.0f} by Sandro Luck \nTwitter:@san_sluck',
                                          'ha': 'right', 'size': 11},
        perpendicular_bar_func='median',
        title='Most Mentioned Python Functionality Over Time',
        bar_size=.95,
        shared_fontdict=None,
        scale='linear',
        fig=None,
        writer=None,
        bar_kwargs={'alpha': .7},
        filter_column_colors=False)


if __name__ == '__main__':
    terms = ["print", "len", "join", "split", "sorted", "range", "strip", "str", "type", "map", "enumerate", "count",
             "index", "zip", "replace", "iter", "int", "append", "find", "super"]
    # plot the CSV that the collection step has just written
    file_name = search_git_get_count(terms)
    plot_search_term_data(file_name)

Translated from: https://towardsdatascience.com/my-odyssey-finding-the-most-popular-python-function-6aa216db047c




