by Vaibhav Gupta

How I used Python to help me choose an organisation for Google Summer of Code '19

In this tutorial, I'll be using Python to scrape data from the Google Summer of Code (GSoC) archive about the organizations that have participated since 2009.

My Motivation Behind This Project

While I was scrolling through the huge list of organisations that participated in GSoC '18, I realised that exploring an organisation is a repetitive task: choose one, explore its projects, and check whether it has participated in previous years. But there are 200+ organisations, and going through them all would take a whole lot of time. So, being a lazy person, I decided to use Python to ease my work.

Requirements

  • Python (I’ll be using Python 3.6, because f-strings are awesome)

  • Pipenv (for virtual environment)

  • requests (for fetching the web page)

  • Beautiful Soup 4 (for extracting data from the web pages)

Building Our Script

These are the web pages which we are going to scrape:

  1. For the years 2009–2015: Link

  2. For the years 2015–2018: Link

Coding Part

Step 1: Setting up the virtual environment and installing the dependencies

virtualenv can be used to create a virtual environment, but I would recommend using Pipenv because it minimizes the work and supports Pipfile and Pipfile.lock.

Make a new folder and enter the following series of commands in the terminal:

pip install pipenv

Then create a virtual environment and install all the dependencies with just a single command (Pipenv rocks):

pipenv install requests beautifulsoup4 --three

The above command will perform the following tasks:

  • Create a virtual environment (for Python 3).
  • Install requests and beautifulsoup4.

  • Create Pipfile and Pipfile.lock in the same folder.

Now, activate the virtual environment:

pipenv shell

Notice the name of the folder before the $ prompt upon activation, like this:

(gsoc19) $

Step 2: Scraping data for the years 2009–2015

Open any code editor and make a new Python file (I will name it 2009–2015.py). The webpage contains the links for the list of organizations of each year. First, write a utility function in a separate file utils.py which will GET any webpage for us and will raise an Exception if there’s a connection error.

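The original post embedded this code as a gist, which is missing from this copy, so below is a minimal sketch of what utils.py could contain. The helper name get_page and its exact structure are my assumptions, not taken from the author's repository:

# utils.py: minimal sketch; the function name get_page is assumed.
import requests


def get_page(url):
    """GET the given URL and return the response; raise an Exception on a connection error."""
    try:
        return requests.get(url)
    except requests.exceptions.ConnectionError:
        raise Exception(f'Connection error while fetching {url}')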

Now, get the link of the web page which contains the list of organizations for each year.

For that, create a function get_year_with_link. Before writing the function, we need to inspect this webpage a little. Right-click on any year and click on Inspect.

Note that there’s a <ul> tag, and inside it, there are <li> tags, each of which contains a <span> tag with class mdl-list__item-primary-content, inside of which is our link and year. Also, notice that this pattern is the same for each year. We want to grab that data.

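The embedded gist with this function is also not included here, so here is a rough sketch of how get_year_with_link could be written. The values of HOME_PAGE and MAIN_PAGE and the exact parsing details are assumptions based on the steps described below:

# Sketch of get_year_with_link; constants and parsing details are assumptions.
import sys

from bs4 import BeautifulSoup

from utils import get_page

HOME_PAGE = 'https://www.google-melange.com'                # assumed
MAIN_PAGE = 'https://www.google-melange.com/archive/gsoc'   # assumed


def get_year_with_link():
    """Return a dict mapping each year to the full link of its organization list page."""
    response = get_page(MAIN_PAGE)
    if response.status_code != 200:
        sys.exit('Could not fetch the main archive page')
    soup = BeautifulSoup(response.text, 'html.parser')
    years_li = soup.find_all('li', class_='mdl-list__item mdl-list__item--one-line')
    years_dict = {}
    for li in years_li:
        a_tag = li.find('span', class_='mdl-list__item-primary-content').find('a')
        year = a_tag.text.replace('\n', '').strip()
        relative_link = a_tag['href']            # e.g. /archive/gsoc/2015
        full_link = HOME_PAGE + relative_link
        years_dict[year] = full_link
    return years_dict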

The tasks performed by this function are in this order:

  1. Get the MAIN_PAGE, raise an exception if there’s a connection error.

  2. If the response code is 200 OK, parse the fetched webpage with BeautifulSoup. And if the response code is something else, exit the program.

  3. Find all the <li> tags with class mdl-list__item mdl-list__item--one-line and store the returned list in years_li.

  4. Initialize an empty years_dict dictionary.

  5. Start iterating over the years_li list.

  6. Get the year text (2009, 2010, 2011,…), remove all \n, and store it in the year.

  7. Get the relative link of each year (/archive/gsoc/2015, /archive/gsoc/2016,…) and store it in the relative_link.

  8. Convert the relative_link into the full link by combining it with the HOME_PAGE link and store it in full_link.

  9. Add this data to the years_dict dictionary with the year as key and full_link as its value.

  10. Repeat this for all years.

This will give us a dictionary with years as keys and their links as values in this format:

{
  ...
  '2009': 'https://www.google-melange.com/archive/gsoc/2009',
  '2010': 'https://www.google-melange.com/archive/gsoc/2010',
  ...
}

Now, we want to visit these links and get the name of every organization with their links from these pages. Right-click on any org name and click on Inspect.

Note that there’s a <ul> tag with class mdl-list, which has <li> tags with class mdl-list__item mdl-list__item--one-line. Inside each of them there’s an <a> tag which has the link and the organization’s name. We want to grab that. For that, let’s create another function get_organizations_list_with_links, which takes the link of the web page containing the organizations' list for a given year (which we scraped in get_year_with_link).

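Again, the original gist is not shown here; the following sketch of get_organizations_list_with_links follows the steps listed below and reuses the assumed HOME_PAGE constant and get_page helper from the earlier sketches:

def get_organizations_list_with_links(year_link):
    """Return a dict mapping each organization's name to the full link of its page."""
    response = get_page(year_link)
    if response.status_code != 200:
        sys.exit(f'Could not fetch {year_link}')
    soup = BeautifulSoup(response.text, 'html.parser')
    orgs_li = soup.find_all('li', class_='mdl-list__item mdl-list__item--one-line')
    orgs_dict = {}
    for li in orgs_li:
        a_tag = li.find('a')
        org_name = a_tag.text.replace('\n', '').strip()
        relative_link = a_tag['href']            # e.g. /archive/gsoc/2015/orgs/apache
        full_link = HOME_PAGE + relative_link
        orgs_dict[org_name] = full_link
    return orgs_dict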

The tasks performed by this function are in this order:

  1. Get the org list page (https://www.google-melange.com/archive/gsoc/2015, https://www.google-melange.com/archive/gsoc/2016, …) and raise an exception if there’s a connection error.

  2. If the response code is 200 OK, parse the fetched webpage with BeautifulSoup. And if the response code is something else, exit the program.

  3. Find all the <li> tags with class mdl-list__item mdl-list__item--one-line and store the returned list in orgs_li.

  4. Initialize an empty orgs_dict dictionary.

  5. Start iterating over the orgs_li list.

  6. Get the org name, remove all \n, and store it in org_name.

  7. Get the relative link of each org (/archive/gsoc/2015/orgs/n52, /archive/gsoc/2015/orgs/beagleboard,…) and store it in the relative_link.

  8. Convert the relative_link into the full link by combining it with the HOME_PAGE link and store it in full_link.

  9. Add this data to the orgs_dict with the org_name as key and full_link as its value.

  10. Repeat this for all the organizations for a particular year.

This will give us a dictionary with organizations’ names as keys and their links as values, like this:

{
  ...
  'ASCEND': 'https://www.google-melange.com/archive/gsoc/2015/orgs/ascend',
  'Apache Software Foundation': 'https://www.google-melange.com/archive/gsoc/2015/orgs/apache',
  ...
}

Moving ahead, we want to visit these links and get the title, description, and link of each project of each org for each year. Right-click on any project’s title and click on Inspect.

Again, the same pattern. There’s a <ul> tag with class mdl-list which contains the <li> tags with class mdl-list__item mdl-list__item--two-line, inside of which there’s a <span> which contains an <a> tag with our project’s name. Also, there’s a <span> tag with class mdl-list__item-sub-title containing the project’s description. For that, create a function get_org_projects_info to get this task done.

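As before, the gist embed is missing from this copy, so here is a sketch of get_org_projects_info that follows the steps listed below (the parsing details are assumptions):

def get_org_projects_info(org_link):
    """Return a list of dicts with the title, description and link of each project of an org."""
    response = get_page(org_link)
    if response.status_code != 200:
        sys.exit(f'Could not fetch {org_link}')
    soup = BeautifulSoup(response.text, 'html.parser')
    projects_li = soup.find_all('li', class_='mdl-list__item mdl-list__item--two-line')
    projects_info = []
    for li in projects_li:
        proj_info = {}
        a_tag = li.find('a')
        proj_title = a_tag.text.replace('\n', '').strip()
        proj_relative_link = a_tag['href']
        proj_full_link = HOME_PAGE + proj_relative_link
        proj_description = li.find('span', class_='mdl-list__item-sub-title').text.strip()
        proj_info['title'] = proj_title
        proj_info['description'] = proj_description
        proj_info['link'] = proj_full_link
        projects_info.append(proj_info)
    return projects_info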

The tasks performed by this function are in this order:

  1. Get the org’s description page (https://www.google-melange.com/archive/gsoc/2015/orgs/ascend, https://www.google-melange.com/archive/gsoc/2015/orgs/apache, …) and raise an exception if there’s a connection error.

  2. If the response code is 200 OK, parse the fetched webpage with BeautifulSoup. And if the response code is something else, exit the program.

  3. Find all the <li> tags with class equal to mdl-list__item mdl-list__item--two-line and store the returned list in projects_li.

  4. Initialize an empty projects_info list.

  5. Start iterating over the projects_li list.

  6. Initialize an empty dictionary proj_info in each loop.

  7. Get the project’s title, remove all \n, and store it in proj_title.

  8. Get the relative link of each project (https://www.google-melange.com/archive/gsoc/2015/orgs/apache/projects/djkevincr.html, ….) and store it in the proj_relative_link.

  9. Convert the proj_relative_link into the full link by combining it with the HOME_PAGE link and store it in proj_full_link.

    proj_relative_linkHOME_PAGE链接合并,将其转换为完整链接,并将其存储在proj_full_link

  10. Store the project’s title, description and link in the proj_info dictionary and append this dictionary to the projects_info list.

This will give us a list containing dictionaries with the project’s data.

[
  ...
  {
    'title': 'Project Title 1',
    'description': 'Project Description 1',
    'link': 'http://project-1-link.com/',
  },
  {
    'title': 'Project Title 2',
    'description': 'Project Description 2',
    'link': 'http://project-2-link.com/',
  }
  ...
]

Step 3: Implementing the main function

Let’s see the code first:

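The gist with the main function is also missing from this copy, so here is a sketch of what it could look like, following the steps described below (the output file name is an assumption):

import json


def main():
    final_dict = {}
    year_with_link = get_year_with_link()
    for year, year_link in year_with_link.items():
        # Organizations for this year: {org_name: org_link, ...}
        final_dict[year] = get_organizations_list_with_links(year_link)
        for org_name, org_link in final_dict[year].items():
            # Replace each org's link with the list of its projects' info.
            final_dict[year][org_name] = get_org_projects_info(org_link)
    with open('2009-2015.json', 'w') as f:       # file name assumed
        f.write(json.dumps(final_dict, indent=4))


if __name__ == '__main__':
    main()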

The tasks performed by this function are in this order:

  1. We want to have a final_dict dictionary so that we can save it as a .json file.

  2. Then, we call our function get_year_with_link(), which will return a dictionary with years as keys and links to the list of organizations as values, and store it in year_with_link.

  3. We iterate over the dictionary year_with_link.

  4. For each year, we call the function get_organizations_list_with_links() with the link for it as the parameter, which will return a dictionary with organizations’ names as keys and links to the webpages containing information about them as values. We store the returned value in final_dict, with the year as key.

  5. Then we iterate over each org for each year.
  6. For each org, we call the function get_org_projects_info() with the link for the org as parameter, which will return a list of dictionaries containing each project’s information.

  7. We store that data in the final_dict.

  8. After the loop is over, we will have a final_dict dictionary as follows:

{    "2009": {        "Org 1": [            {                "title": "Project - 1",                "description": "Project-1-Description",                "link": "http://project-1-link.com/"            },            {                "title": "Project - 2",                "description": "Project-2-Description",                "link": "http://project-2-link.com/"            }        ],        "Org 2": [            {                "title": "Project - 1",                "description": "Project-1-Description",                "link": "http://project-1-link.com/"            },            {                "title": "Project - 2",                "description": "Project-2-Description",                "link": "http://project-2-link.com/"            }        ]    },    "2010": {        ...    }}

9. Then we will save it as a JSON file with json.dumps.

Next Steps

Data for the years 2016–2018 can be scraped in a similar manner, and then Python (or any suitable language) can be used to analyze the data. Or, a web app can be made. In fact, I have already made my version of a web app using Django, Django REST Framework and ReactJS. Here is the link for the same: https://gsoc-data-analyzer.netlify.com/

All the code for this tutorial is available on my GitHub.

Improvements

The running time of the script can be improved by using multithreading. Currently it fetches one link at a time; it could be made to fetch multiple links simultaneously, as sketched below.

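For example, a thread pool (concurrent.futures here, which is not part of the original script) could fetch several organization pages at once. This is only an illustrative sketch:

from concurrent.futures import ThreadPoolExecutor


def fetch_all_orgs_concurrently(orgs_dict, max_workers=10):
    """Fetch project info for many organizations at the same time (illustrative only)."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = executor.map(get_org_projects_info, orgs_dict.values())
    return dict(zip(orgs_dict.keys(), results))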

About Me

Hi there.

I am Vaibhav Gupta, an undergrad student at Indian Institute of Information Technology, Lucknow. I love Python and JS.

See my portfolio or find me on Facebook, LinkedIn or Github.

Translated from: https://www.freecodecamp.org/news/how-i-used-python-to-help-me-chose-an-organisation-for-google-summer-of-code-19-75078de13194/
