Data Analysis with Python

By Rocky Kev

I wanted to learn Python for a long time, but I could never find a reason. When my company had a bunch of daily reports that needed to be generated, I realized I had an opportunity to explore Python to cut out all the repetition.

This article is the result of a few weeks learning Python, playing around with the various libraries, and automating some of my tasks at work.

Now I want to share what Python is capable of.

Rather than give boring office-related examples, let’s frame them with Game of Thrones!

In this post, I will be implementing web automation with the Selenium library, web scraping with the BeautifulSoup library, and generating reports with the csv module — which is sort of simulating the whole Pandas/Data Science side of Python.

And like I mentioned before, all of the examples will use Game of Thrones.

Some Quick Notes:

  1. You shouldn’t need any Python experience to do this. I’ll explain the code, and you should have enough to get going.
  2. I’m not a super-expert at Python. This is roughly a few weeks of Python experience. It was just enough to automate my work and create these examples.
  3. Python is WELL DOCUMENTED. There are so many free guides to learning Python, like Automate the Boring Stuff, Python for Beginners, and the amazing Dataquest.io data science track. There are even more links in the freeCodeCamp knowledge base.

Python, the best reptile-based computer language

For those unfamiliar with programming —

“Python is a general purpose programming language which is strictly typed, interpreted, and known for its easy readability with great design principles.” (via the Freecodecamp.com guide)

According to Stack Overflow’s 2018 Developer Survey, Python is the language most developers want to learn (and also one of the fastest growing major programming languages).

Python powers sites like Reddit, Instagram and Dropbox. It’s also a really readable language that has a lot of powerful libraries.

Python is named after Monty Python, not the reptile. BUT — in spite of that, it’s still the most popular reptile-based programming language, beating Serpent, Gecko, Cobra and Raptor! (I had to research that joke!)

If you have some background in programming (say in JavaScript)—

Some things about Python:

  • Python uses indentation instead of curly brackets to mark blocks of code.
  • Python uses class-based inheritance, so it’s more like C-style languages such as C++ and Java, whereas JavaScript can only simulate classes.
  • Python is also strongly typed. No mixing and matching. For example, if you add a string and an integer together, Python raises a TypeError.
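
As a minimal sketch of the first and third points: the function name `describe_house` and the numbers below are made up for illustration and do not come from the scripts later in this article.

```python
# Blocks are marked by indentation, not curly brackets.
def describe_house(name, army_size):
    if army_size > 10000:
        return name + " commands a large army"
    return name + " commands a small army"


print(describe_house("Stark", 20000))

# Strong typing: adding a string and an integer raises a TypeError.
try:
    message = "Soldiers: " + 20000
except TypeError as error:
    message = "Python complained: " + str(error)

# Converting explicitly with str() makes the types match.
fixed_message = "Soldiers: " + str(20000)
print(message)
print(fixed_message)
```

The explicit `str()` call is the usual way around the complaint: Python will not guess whether you meant text or arithmetic.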

Let’s jump right into it!

I’ll be breaking this into 3 pieces.

  • Game of Thrones and Python #1: Web Automation
  • Game of Thrones and Python #2: Web Scraping
  • Game of Thrones and Python #3: Generating reports with the CSV Module

Game of Thrones and Python 1: Web Automation

One of the coolest things you can do with Python is web automation.

For example — you can write a Python script that:

  1. Opens up a browser.
  2. Automatically visits a specific website.
  3. Logs you into that site.
  4. Goes to another part of that website.
  5. Finds the most recent blog post.
  6. Opens that blog post.
  7. Submits a comment that says, “Great writing! High five!”
  8. And finally logs you out of that website.

It might not seem so hard to do by hand. That takes what… 20 seconds?

But if you had to do that over and over again, it would drive you insane.

For example — what if you had a staging site that’s still in development with 100 blog posts, and you wanted to post a comment on every single page to test its functionality?

That’s 100 blog posts * 20 seconds = roughly 33 minutes

And what if there are MULTIPLE testing phases, and you had to repeat the test six more times?
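
To put a rough number on that repetition, here is the same arithmetic as a short script. The variable names are my own, and the figures are the ones from the example above (100 posts, 20 seconds each, one pass plus six repeats).

```python
posts = 100
seconds_per_post = 20
test_runs = 1 + 6  # the first pass plus six repeats

minutes_per_run = posts * seconds_per_post / 60
total_hours = minutes_per_run * test_runs / 60

print(f"One pass: about {minutes_per_run:.0f} minutes")
print(f"All {test_runs} passes: about {total_hours:.1f} hours")
```

That is most of a morning spent clicking, which is the kind of task worth handing to a script.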

Other use cases for web automation include:

  • You might want to automate account creations on your site.
  • You might want to run a bot from start to finish in your online course.
  • You might want to push 100 bots to submit a form on your site with a single script.

What we will be doing

For this part, we’ll be automating the process of logging into all of our favorite Game of Thrones fan sites.

Don’t you hate it when you have to waste time logging into westeros.org, the /r/freefolk subreddit, winteriscoming.net and all your other fan sites?

With this template, you can automatically log into various websites!

Now, for Game of Thrones!

The Code

You will need to install Python 3, Selenium, and the Firefox webdriver (geckodriver) to get started. If you want to follow along, check out my tutorial on How to automate form submissions with Python.

This one might get complicated. So I highly recommend sitting back and enjoying the ride.

    ## Game of Thrones easy login script
    ##
    ## Description: This code logs into all of your fan sites automatically

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException
    import time

    driver = webdriver.Firefox()
    driver.implicitly_wait(5)
    ## implicitly_wait makes the driver keep looking for an element
    ## for up to 5 seconds, so the site content can load up

    # Define the functions

    def login_to_westeros(username, userpass):
        ## Open the login page
        driver.get('https://asoiaf.westeros.org/index.php?/login/')

        ## Log the details
        print(username + " is logging into westeros.")

        ## Find the fields and log into the account.
        textfield_username = driver.find_element_by_id('auth')
        textfield_username.clear()
        textfield_username.send_keys(username)

        textfield_email = driver.find_element_by_id('password')
        textfield_email.clear()
        textfield_email.send_keys(userpass)

        submit_button = driver.find_element_by_id('elSignIn_submit')
        submit_button.click()

        ## Log the details
        print(username + " is logged in! -> westeros")

    def login_to_reddit_freefolk(username, userpass):
        ## Open the login page
        driver.get('https://www.reddit.com/login/?dest=https%3A%2F%2Fwww.reddit.com%2Fr%2Ffreefolk')

        ## Log the details
        print(username + " is logging into /r/freefolk.")

        ## Find the fields and log into the account.
        textfield_username = driver.find_element_by_id('loginUsername')
        textfield_username.clear()
        textfield_username.send_keys(username)

        textfield_email = driver.find_element_by_id('loginPassword')
        textfield_email.clear()
        textfield_email.send_keys(userpass)

        submit_button = driver.find_element_by_class_name('AnimatedForm__submitButton')
        submit_button.click()

        ## Log the details
        print(username + " is logged in! -> /r/freefolk.")

    ## Define the user and password combo.
    ## Replace "PASSWORDHERE" with your own password.
    login_to_westeros("gameofthronesfan86", "PASSWORDHERE")
    time.sleep(2)

    ## Open a new tab and switch to it
    driver.execute_script("window.open('');")
    Window_List = driver.window_handles
    driver.switch_to.window(Window_List[-1])

    login_to_reddit_freefolk("MyManMance", "PASSWORDHERE")
    time.sleep(2)

    driver.execute_script("window.open('');")
    Window_List = driver.window_handles
    driver.switch_to.window(Window_List[-1])

    ## wait for 2 seconds
    time.sleep(2)
    print("task complete")

Breaking the code down

To start, I’m importing the Selenium library to help with the heavy lifting.

I also imported the time library, so after each action, it will wait x seconds. Adding a wait allows the page to load.

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException
    import time

What is Selenium?

Selenium is the library we use for web automation. The Selenium project defines an API so that third-party authors can write webdrivers that handle the communication with each browser. That way, the Selenium team can focus on their codebase, while another team can focus on the middleware.

For example:

  • The Chromium team made their own webdriver for Selenium called chromedriver.
  • The Firefox team made their own webdriver for Selenium called geckodriver.
  • The Opera team made their own webdriver for Selenium called operadriver.

    driver = webdriver.Firefox()
    driver.get('...')
    driver.close()

In the code above, I’m asking Selenium to do things like “Set Firefox up as the browser of choice”, and “pass this link to Firefox”, and finally “Close Firefox”. I used the geckodriver to do that.

Logging into sites

To make it easier to read, I wrote a separate function to log into each site, to show the pattern that we are making.

    def login_to_westeros(username, userpass):
        ## Open the login page
        driver.get('https://asoiaf.westeros.org/index.php?/login/')

        ## Log the details
        print(username + " is logging into westeros.")

        ## Find the fields and log into the account.
        textfield_username = driver.find_element_by_id('auth')
        textfield_username.clear()
        textfield_username.send_keys(username)

        textfield_email = driver.find_element_by_id('password')
        textfield_email.clear()
        textfield_email.send_keys(userpass)

        submit_button = driver.find_element_by_id('elSignIn_submit')
        submit_button.click()

        ## Log the details
        print(username + " is logged in! -> westeros")

If we break that down even more — each function has the following elements.

I’m telling Python to:

  1. Visit a specific page.

        driver.get('https://asoiaf.westeros.org/index.php?/login/')

  2. Look for the login box, clear the text if there is any, and type in my username variable.

        textfield_username = driver.find_element_by_id('auth')
        textfield_username.clear()
        textfield_username.send_keys(username)

  3. Look for the password box, clear the text if there is any, and type in my password variable.

        textfield_email = driver.find_element_by_id('password')
        textfield_email.clear()
        textfield_email.send_keys(userpass)

  4. Look for the submit button, and click it.

        submit_button = driver.find_element_by_id('elSignIn_submit')
        submit_button.click()

As a note: each website has different ways to find the username/password and submit buttons. You’ll have to do a bit of searching for that.
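
Because only the URL and the element ids change from site to site, the four steps above can be sketched as one table-driven helper. This is an illustration under assumptions, not the script used in this article: the `SITES` table, `login`, and `main` are hypothetical names, and the helper passes the locator strategy as the plain string "id" (the value behind Selenium's `By.ID`) to `driver.find_element`. Sites like Reddit, where the submit button is found by class name, would need an extra field in the table.

```python
# Hypothetical lookup table: one entry per site, holding the login URL and
# the element ids found by inspecting that site's login form.
SITES = {
    "westeros": {
        "url": "https://asoiaf.westeros.org/index.php?/login/",
        "username_id": "auth",
        "password_id": "password",
        "submit_id": "elSignIn_submit",
    },
}


def login(driver, site_name, username, userpass):
    """Run the four login steps against one entry of SITES."""
    site = SITES[site_name]
    driver.get(site["url"])
    print(username + " is logging into " + site_name + ".")

    # Fill in the username box, then the password box.
    for field_id, value in ((site["username_id"], username),
                            (site["password_id"], userpass)):
        field = driver.find_element("id", field_id)
        field.clear()
        field.send_keys(value)

    driver.find_element("id", site["submit_id"]).click()
    print(username + " is logged in! -> " + site_name)


def main():
    # Imported here so the helper above can be read without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.implicitly_wait(5)
    login(driver, "westeros", "gameofthronesfan86", "PASSWORDHERE")
```

Calling `main()` opens Firefox and logs in; adding another fan site then means adding one more entry to `SITES` rather than writing another function.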

How to find the login box and password box for any website

The Selenium Library has a bunch of handy ways to find elements on a webpage. Here are some of the ones I like to use.

  • find_element_by_id
  • find_element_by_name
  • find_element_by_xpath
  • find_element_by_class_name

These are the Selenium 3 helper names used throughout this article. Newer Selenium 4 releases drop them in favor of driver.find_element(By.ID, ...), driver.find_element(By.NAME, ...), and so on. For the whole list, visit the Selenium Python documentation for locating elements.

To use asoiaf.westeros.org as an example, when I inspect the elements, they all have IDs… which is GREAT! That makes my life easier.

Running the code

Enjoying the ride

With web automation, you’re playing a game of ‘how can I get Selenium to find the element’. Once you find it, you can then manipulate it.

Game of Thrones and Python 2: Web Scraping

In this piece, we will be exploring web scraping.

The big picture process is:

  1. We’ll have Python visit a webpage.
  2. We’ll then parse that webpage with BeautifulSoup.
  3. We’ll then set up the code to grab specific data.

For example: You might want to grab all the h1 tags. Or all the links. Or in our case, all of the images on a page.

Some other use cases for Web Scraping:

  • You can grab all the links on a web page.
  • You can grab all the post titles within a forum.
  • You can use it to grab the daily NASDAQ value without ever visiting the site.
  • You can use it to download every linked file from a website that doesn’t have a ‘Download All’ option.

In short, web scraping allows you to automatically grab web content through Python.

Overall, a very simple process. Except when it isn’t!

The challenge of Web Scraping for images

My goal was to apply what I knew about scraping text content to grabbing images.

While web scraping for links, body text and headers is very straightforward, web scraping for images is significantly more complex. Let me explain.

For a web developer, hosting MULTIPLE full-sized images on a single webpage slows the whole page down. The usual fix is to show thumbnails and load the full-sized image only when a thumbnail is clicked.

For example: imagine we had twenty 1 MB images on our web page. Upon landing, a visitor would have to download 20 MB worth of images! The more common method is to make twenty 10 KB thumbnail images. Now the payload is only 200 KB, or about 1/100 of the size!

So what does this have to do with web scraping images and this tutorial?

It means that it is pretty difficult to write a generic block of code that works for every website. Websites turn a thumbnail into a full-size image in all sorts of different ways, which makes it a challenge to create a ‘one-size-fits-all’ model.

I’ll still teach what I learned. You’ll still gain a lot of skills from it. Just be aware that trying that code on other sites will require major modifications. Hurray for Zone of Proximal Development.

Python and Game of Thrones

The goal of this tutorial is to gather images of our favorite actors! That will allow us to do weird things like make a Teenage Crush Actor Collage that we can hang in our bedroom.

In order to gather those images, we’ll be using Python to do some web scraping. We’ll be using the BeautifulSoup library to visit a web page and grab all the image tags from it.

NOTE: In many website terms and conditions, they prohibit any web scraping of their data. Some develop APIs to allow you to tap into their data. Others do not. Additionally, try to be mindful that you are taking up their resources. So look to doing one request at a time rather than opening lots of connections in parallel and grinding their site to a halt.

The Code

    # Import the libraries needed
    import requests
    import time
    from bs4 import BeautifulSoup

    # The URL to scrape
    url = 'https://www.popsugar.com/celebrity/Kit-Harington-Rose-Leslie-Cutest-Pictures-42389549?stream_view=1#photo-42389576'
    #url = 'https://www.bing.com/images/search?q=jon+snow&FORM=HDRSC2'

    # Connecting
    response = requests.get(url)

    # Grab the HTML and parse it with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # A loop to run through each image tag, and download it
    for i in range(len(soup.findAll('img'))):
        tag = soup.findAll('img')[i]
        link = tag['src']

        # skip it if it doesn't start with http
        if link.startswith("http"):
            print("grabbed url: " + link)
            filename = str(i) + '.jpg'
            print("Download: " + filename)
            r = requests.get(link)
            open(filename, 'wb').write(r.content)
        else:
            print("grabbed url: " + link)
            print("skip")

        # wait 1 second before the next image, to be friendly to the site
        time.sleep(1)

Having Python Visit the Webpage

We start by importing the libraries needed, and then storing the webpage link into a variable.

  • The Requests library is used to do all sorts of HTTP requests.
  • The time module is used to put a 1 second wait after each request. If we didn’t include that, the whole loop would fire off as fast as possible, which isn’t very friendly to the sites we are scraping from.
  • The BeautifulSoup library is used to make exploring the DOM tree easier.

Parse that webpage with BeautifulSoup

Next, we pass the HTML we got back from our URL into BeautifulSoup.

Finding the content

Finally, we use a loop to grab the content.

It starts with a FOR loop. BeautifulSoup does some cool filtering: my code asks BeautifulSoup to find all the ‘img’ tags and store them in a temporary array. Then the len function asks for the length of the array.

    # A loop to run through each image tag, and download it
    for i in range(len(soup.findAll('img'))):

So in human words, if the array held 51 items, the code would look like for i in range(51): and i would count from 0 up to 50.

Next, we’ll return to our soup object, and do the real filtering.

    tag = soup.findAll('img')[i]
    link = tag['src']

Remember that we are in a For loop, so [i] represents a number.

So we are telling BeautifulSoup to find all the ‘img’ tags, store them in a temp array, and reference a specific index number based on where we are in the loop.

So instead of calling an array directly like allOfTheImages[10], we’re using soup.findAll('img')[10], and then passing it to the tag variable.
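
To see the indexing on its own, here is a small sketch that swaps the BeautifulSoup result for a plain list of made-up file names (`all_of_the_images` is a stand-in, not real scraped data). It also shows `enumerate`, a common alternative that hands back the index and the item together.

```python
# Stand-in for soup.findAll('img'): a plain list of made-up src values.
all_of_the_images = ["a.jpg", "b.jpg", "c.jpg"]

# Index-based loop, as in the article: i runs from 0 to len - 1.
by_index = []
for i in range(len(all_of_the_images)):
    by_index.append((i, all_of_the_images[i]))

# Same pairs with enumerate, without looking items up by index.
by_enumerate = []
for i, src in enumerate(all_of_the_images):
    by_enumerate.append((i, src))

print(by_index == by_enumerate)
```

With `enumerate`, the page would only need to be searched once instead of on every pass through the loop.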

The data in the tag variable will look something like:

<img src="smiley.gif" alt="Smiley face" height="42" width="42">

Which is why the next step is pulling out the ‘src’.

Downloading the Content

Finally — it’s the fun part!

We go to the final part of the loop: downloading the content.

There are a few odd design elements here that I want to point out.

  1. The IF statement is actually a hack I made for other sites I was testing. There were times when I was grabbing images that were part of the root site (like the favicon or the social media icons) that I didn’t want. So using the IF statement allowed me to ignore them.
  2. I also forced all the images to be .jpg. I could have written another chunk of IF statements to check the datatype, and then append the correct filetype. But that was adding a significant chunk of code that made this tutorial longer.
  3. I also added all the print commands. If you wanted to grab all the links of a webpage, or specific content, you can stop right here! You did it!
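
For the second point, one lightweight option is to reuse whatever extension the image URL already carries. The sketch below is an assumption-laden illustration rather than part of the original script: `filename_for` is a hypothetical helper, the example.com URLs are placeholders, and it falls back to .jpg when the URL has no recognizable extension.

```python
import os
from urllib.parse import urlparse


def filename_for(index, link, default_ext=".jpg"):
    """Build a local filename that keeps the extension found in the URL path."""
    path = urlparse(link).path  # drops the query string and fragment
    ext = os.path.splitext(path)[1].lower()
    if ext not in (".jpg", ".jpeg", ".png", ".gif", ".webp"):
        ext = default_ext  # unknown or missing extension: fall back
    return str(index) + ext


print(filename_for(3, "https://example.com/img/jon-snow.png?width=200"))
print(filename_for(4, "https://example.com/img/thumbnail"))
```

In the loop, `filename = filename_for(i, link)` would take the place of `str(i) + '.jpg'`. A URL's extension can be wrong, so checking the response's Content-Type header is a stricter alternative.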

I also want to point out the requests.get(link) and open(filename, 'wb').write(r.content) code.

    r = requests.get(link)
    open(filename, 'wb').write(r.content)

How this works:

  1. Requests gets the link.
  2. open is a built-in Python function that opens or creates a file, gives it write and binary mode access (since images are just 1s and 0s), and writes the content of the link into that file.

        # skip it if it doesn't start with http
        if link.startswith("http"):
            print("grabbed url: " + link)
            filename = str(i) + '.jpg'
            print("Download: " + filename)
            r = requests.get(link)
            open(filename, 'wb').write(r.content)
        else:
            print("grabbed url: " + link)
            print("skip")

        # wait 1 second before the next image, to be friendly to the site
        time.sleep(1)
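
One small refinement to the file-writing step is a `with` block, which closes the file even if the write fails. This is a sketch with stand-in data: `save_bytes` is a hypothetical helper, and `fake_content` plays the role of `r.content` so the example runs without downloading anything.

```python
import os
import tempfile


def save_bytes(filename, content):
    """Write raw bytes to filename; the with block closes the file for us."""
    with open(filename, "wb") as image_file:
        return image_file.write(content)


# Stand-in for r.content: a few made-up bytes instead of a real download.
fake_content = b"\x89PNG fake image bytes"
target = os.path.join(tempfile.mkdtemp(), "0.jpg")
written = save_bytes(target, fake_content)
print(written, "bytes written to", target)
```

In the scraping loop, `save_bytes(filename, r.content)` would replace the one-line `open(...).write(...)` call.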

Web Scraping has a lot of useful features.

This code won’t work right out of the box for most sites with images, but it can serve as a foundation for grabbing images on different sites.

Game of Thrones and Python 3: Generating reports and data

Gathering data is easy. Interpreting the data is difficult. Which is why there’s a huge surge of demand for data scientists who can make sense of this data. And data scientists use languages like R and Python to interpret it.

In this tutorial, we’ll be using the csv module, which will be enough to generate a report. If we were working with a huge dataset, one that’s like 50,000 rows or bigger, we’d have to tap into the Pandas library.

What we will be doing is downloading a CSV, having Python interpret the data, running a query based on the question we want answered, and then printing the answer.

Python VS basic spreadsheet functions

You might be wondering:

“Why should I use Python when I can easily just use spreadsheet functions like =SUM or =COUNT, or filter out the rows I don’t need manually?”

As with all the other automation tricks in Part 1 and 2, you can definitely do this manually.

But imagine if you had to generate a new report every day.

For example: I build online courses. And we want a daily report of every student’s progress. How many students started today? How many students are active this week? How many students made it to Module 2? How many students submitted their Module 3 homework? How many students clicked on the completion button on mobile devices?

I can either spend 15 minutes sorting through the data to generate a report for my team, or write Python code that does it daily.

Other use cases for using code instead of default spreadsheet functions:

  • You might be working with a huge set of data (huge like 50,000 rows and 20 columns).
  • You require multiple slices of filters and segmentation to get your answers.
  • You need to run the same query on a dataset that changes repeatedly.

Generating Reports with Game of Thrones

Every year, Winteriscoming.net, a Game of Thrones news site, runs its annual March Madness bracket. Visitors vote for their favorite characters, and winners move up the bracket and compete against another character. After 6 rounds of votes, a winner is declared.

Since 2019’s votes are still happening, I grabbed all 6 rounds of 2018’s data and compiled them into a CSV file. The original poll is on winteriscoming.net.

I’ve also added some additional background data (like where they are from), to make the reporting a bit more interesting.

Asking Questions

In order to generate a report, we have to ask some questions.

By definition: A report’s primary duty is to ANSWER questions.

So let’s make them up right now.

Based on this dataset… here are some questions.

  1. Who won the popularity vote?
  2. Who won based on averages?
  3. Who is the most popular non-Westeros person? (characters not born in Westeros)

Before answering questions, let’s set up our Python code

To make it easier, I wrote all the code, including revisions, in my new favorite online IDE, Repl.it.

    import csv

    # Import the data
    f_csv = open('winter-is-coming-2018.csv')
    headers = next(f_csv)
    f_reader = csv.reader(f_csv)
    file_data = list(f_reader)

    # Make all blank cells into zeroes
    # https://stackoverflow.com/questions/2862709/replacing-empty-csv-column-values-with-a-zero
    for row in file_data:
        for i, x in enumerate(row):
            if len(x) < 1:
                x = row[i] = 0

Here’s my process with the code.

  1. I imported the csv module.
  2. I imported the csv file, and turned it into a list type called file_data.

    • The way Python reads your file is by first passing the data to an object.
    • I removed the header, since it’ll fudge the data.
    • I then pass the object to a reader, and finally a list.
    • Note: I just realized I did it the Python 2 way. There’s a cleaner way to do it in Python 3. Oh well. Still works.

  3. In order to sum up any totals, I made all blank cells become 0.

    • This was one of those moments where I found a Stack Overflow solution that was better than my original version.

With this set up, we can now loop through the list of data, and answer questions!
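
For reference, the cleaner Python 3 style mentioned in the note above might look like the sketch below. It is my reading of that note, not the article's code: `load_rows` is a hypothetical helper that opens the file in a `with` block, lets the reader hand back the header row, and swaps blank cells for 0 in one pass.

```python
import csv


def load_rows(path):
    """Return (headers, rows) with blank cells replaced by 0."""
    # newline="" is what the csv module documentation recommends for files.
    with open(path, newline="") as f_csv:
        reader = csv.reader(f_csv)
        headers = next(reader)  # first row holds the column names
        rows = [[cell if cell != "" else 0 for cell in row] for row in reader]
    return headers, rows
```

Usage would be `headers, file_data = load_rows('winter-is-coming-2018.csv')`, after which the loops in the questions below work unchanged. A side benefit is that `headers` comes back as a list of column names rather than one raw line of text.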

Question 1: Who won the popularity vote?

The Spreadsheet method:

The easiest way would be to add up each cell, using a formula. Using row 2 as an example, in a blank column, you can write the formula:

=sum(E2:J2)

You can then drag that formula for the other rows.

Then, sort it by total. And you have a winner!

    ## Include the code from above

    # Push the data to a dictionary
    total_score = {}

    # Pass each character and their final score into total_score dictionary
    for row in file_data:
        total = (int(row[4]) +
                 int(row[5]) +
                 int(row[6]) +
                 int(row[7]) +
                 int(row[8]) +
                 int(row[9]))
        total_score[row[0]] = total

    # Dictionaries aren't sortable by default, we'll have to borrow from these two classes.
    # https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value
    from operator import itemgetter
    from collections import OrderedDict

    sorted_score = OrderedDict(sorted(total_score.items(), key=itemgetter(1), reverse=True))

    # We get the name of the winner and their score
    winner = list(sorted_score)[0]       # jon snow
    winner_score = sorted_score[winner]  # score

    print(winner + " with " + str(winner_score))

    ## RESULT => Jon Snow with 12959

The steps I took are:

  1. The dataset is just one big list. By using a for loop, you can then access each row.
  2. Within that for loop, I added each cell. (emulating the whole “=sum(E:J)” formula)
  3. Since dictionaries aren’t exactly sortable, I had to import two classes to help me sort the dictionary by their values, from high to low.
  4. Finally, I printed the winner and the winner’s score as text.
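
The sorting step can be tried in isolation. The scores below are made up for illustration and are not the real poll numbers; the sketch also shows `max`, which is enough when only the top entry matters.

```python
from operator import itemgetter

# Made-up scores for illustration; these are not the real poll numbers.
total_score = {"Character A": 120, "Character B": 340, "Character C": 95}

# sorted() returns a list of (name, score) pairs, highest score first.
ranking = sorted(total_score.items(), key=itemgetter(1), reverse=True)
winner, winner_score = ranking[0]

# When only the top entry matters, max() avoids sorting everything.
also_winner = max(total_score, key=total_score.get)

print(winner + " with " + str(winner_score))
```

Keeping the full `ranking` list is still handy if the report later needs a top five rather than a single winner.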

To help understand that loop, I drew a diagram.

Overall, this process is a bit longer compared to the spreadsheet method. But wait, it gets easier!

Question 2: Who won based on averages?

You might have noticed that whoever proceeded farther in the rankings would obviously get more votes.

For example: If Jon Snow got 500 points in Round One and 1000 points in Round Two, he already beats The Mountain who only had 1000 points and never made it past his bracket.

So the next best thing is to sum the total, and then divide it based on how many rounds they participated in.

The Spreadsheet Method:

This is easy. Column C holds how many rounds they participated in. You would divide the sum by the rounds, and presto!

    ## OLD CODE FROM QUESTION 1
    # Pass each character and their final score into total_score dictionary
    for row in file_data:
        total = (int(row[4]) +
                 int(row[5]) +
                 int(row[6]) +
                 int(row[7]) +
                 int(row[8]) +
                 int(row[9]))
        total_score[row[0]] = total

    ## NEW CODE
    # Pass each character and their final score into total_score dictionary
    for row in file_data:
        total = (int(row[4]) +
                 int(row[5]) +
                 int(row[6]) +
                 int(row[7]) +
                 int(row[8]) +
                 int(row[9]))
        # NEW LINE - divide by how many rounds
        new_total = total / int(row[2])
        total_score[row[0]] = new_total

    # RESULT => Davos Seaworth with 2247.6666666666665

Notice the change? I just added one additional line.

That's all it took to answer this question! NEXT!
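To see why averaging can change the winner, here is a small self-contained sketch. The helper name `average_scores` and the sample rows are hypothetical, and the column layout (rounds played in column 2, scores in columns 4 to 9) is assumed from the indexes used in the code above. The guard against a zero in the rounds column is an addition of this sketch, not something the original code does.

```python
def average_scores(rows):
    """Return {name: total points / rounds played} for each row."""
    averages = {}
    for row in rows:
        total = sum(int(cell) for cell in row[4:10])
        rounds = int(row[2])
        # Guard against a zero in the rounds column (assumed possible)
        averages[row[0]] = total / rounds if rounds else 0
    return averages


# Hypothetical sample rows, invented for illustration
sample_rows = [
    ["Jon Snow", "", "2", "westeros", "500", "1000", "0", "0", "0", "0"],
    ["The Mountain", "", "1", "westeros", "1000", "0", "0", "0", "0", "0"],
    ["Missandei", "", "3", "other", "400", "300", "200", "0", "0", "0"],
]

averages = average_scores(sample_rows)
best = max(averages, key=averages.get)
print(best, "with", averages[best])
```

With these made-up numbers, The Mountain comes out ahead on average even though Jon Snow has the higher total, which is exactly the effect the averaging question is meant to expose.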

Question 3: Who is the most popular non-Westeros person?

With the first two examples, it's pretty easy to calculate the total with the default spreadsheet functions. For this question, things are a bit more complicated.

The Spreadsheet Method:

  1. Assume you already have the sum.
  2. Filter the rows based on whether the character is Westeros/Other.
  3. Then sort by the sum.

    ## OLD CODE FROM QUESTION 2
    # Pass each character and their final score into total_score dictionary
    for row in file_data:
      total = (int(row[4]) +
               int(row[5]) +
               int(row[6]) +
               int(row[7]) +
               int(row[8]) +
               int(row[9]) )
      # NEW LINE - divide by how many rounds
      new_total = total / int(row[2])
      total_score[row[0]] = new_total

    ## NEW CODE
    # Pass each character and their final score into total_score dictionary
    for row in file_data:
      # Add IF-THEN statement
      if (row[3] == 'other'):
        total = (int(row[4]) +
                 int(row[5]) +
                 int(row[6]) +
                 int(row[7]) +
                 int(row[8]) +
                 int(row[9]) )
      else:
        total = 0
      total_score[row[0]] = total

    # RESULT => Missandei with 4811

In Question 2, I added one line of code to answer that new question.

In Question 3, I added an IF-ELSE statement. If a character is non-Westeros, count their score. Otherwise, give them a score of 0.
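A variation on the same idea, shown here only as a sketch: instead of giving Westeros characters a score of 0, leave them out of the dictionary entirely. The sample rows are invented, and the region value `'other'` in column 3 is taken from the code above.

```python
# Hypothetical sample rows, invented for illustration
sample_rows = [
    ["Jon Snow", "", "2", "westeros", "500", "1000", "0", "0", "0", "0"],
    ["Missandei", "", "3", "other", "400", "300", "200", "0", "0", "0"],
    ["Grey Worm", "", "2", "other", "500", "100", "0", "0", "0", "0"],
]

# Keep only the non-Westeros rows, then total their six score columns
other_scores = {
    row[0]: sum(int(cell) for cell in row[4:10])
    for row in sample_rows
    if row[3] == "other"
}

winner = max(other_scores, key=other_scores.get)
print(winner + " with " + str(other_scores[winner]))
```

Filtering rows out avoids a subtle edge case in the score-of-0 approach: if every non-Westeros character happened to score 0, a Westeros character could tie for first place.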

Reviewing this:

While the spreadsheet method doesn't seem like a lot of steps, it sure is a lot more clicks. The Python method took a lot longer to set up, but each additional query only involved changing a few lines of code.

Imagine if the stakeholder asked a dozen more questions.

For example:

  1. How many points did characters whose names start with L have?
  2. Or how many points did everyone who lived in Westeros get in round 3?
  3. Or what if it was 640 GoT characters instead of just 64?
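Follow-up questions like these mostly come down to swapping the filter condition. The sketch below is one hedged way to do that: `total_points` is a hypothetical helper, the rows are invented, and it assumes round 3 lives in column 6 (that is, round 1 is column 4), which the article does not state explicitly.

```python
def total_points(rows, keep):
    """Sum all six round scores for the rows where keep(row) is true."""
    return sum(int(cell) for row in rows if keep(row) for cell in row[4:10])


# Hypothetical sample rows, invented for illustration
sample_rows = [
    ["Lyanna Mormont", "", "2", "westeros", "300", "200", "0", "0", "0", "0"],
    ["Lord Varys", "", "3", "westeros", "400", "100", "50", "0", "0", "0"],
    ["Missandei", "", "3", "other", "400", "300", "200", "0", "0", "0"],
]

# Question 1: points for characters whose names start with L
l_points = total_points(sample_rows, lambda row: row[0].startswith("L"))

# Question 2: round 3 points for Westeros characters
# (round 3 is assumed to be column 6)
round3_westeros = sum(int(row[6]) for row in sample_rows if row[3] == "westeros")

print(l_points, round3_westeros)
```

The third question needs no code change at all: the same loop handles 64 rows or 640.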

But also imagine this: you're given a dataset that's roughly 50 megabytes. (Our Game of Thrones csv file was barely 50 kilobytes, roughly 1/1000 the size.) A file that large would probably take Excel a few minutes to load. Additionally, it's not unusual for data scientists to use datasets in the 10 gigabyte range!

Overall, as the dataset scales, it'll take longer and longer to process. And that's where the power of Python comes in.
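One reason Python copes with larger files is that `csv.reader` hands back rows one at a time, so you don't have to hold the whole file in memory. The sketch below illustrates that pattern under stated assumptions: an in-memory `io.StringIO` stands in for a big file on disk, the header names and rows are invented, and the column layout follows the indexes used earlier in this article.

```python
import csv
import io

# Stand-in for a large file on disk. With a real file you would use
# open("some_file.csv", newline="") instead (the file name is hypothetical).
fake_file = io.StringIO(
    "name,notes,rounds,region,r1,r2,r3,r4,r5,r6\n"
    "Jon Snow,,2,westeros,500,1000,0,0,0,0\n"
    "Missandei,,3,other,400,300,200,0,0,0\n"
)

reader = csv.reader(fake_file)
next(reader)  # skip the header row

best_name, best_total = None, -1
for row in reader:  # rows are read one at a time, not all at once
    total = sum(int(cell) for cell in row[4:10])
    if total > best_total:
        best_name, best_total = row[0], total

print(best_name, "with", best_total)
```

Because only the current row and the running best are kept, memory use stays roughly constant no matter how many rows the file has.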

Conclusion

In Part 1, I covered web automation with the Selenium library. In Part 2, I covered web scraping with the BeautifulSoup library. And in Part 3, I covered generating reports with the csv module.

While I covered them separately, there's also a synergy between them. Imagine you had a project where you had to figure out who dies next in Game of Thrones based on comments by the actors on the show. You might start by web scraping all of the actors' names off of IMDB. You might use Selenium to automatically log into various social media platforms and search for their social media names. You might then compile all the data and analyze it as a csv or, if it's really huge, with the Pandas library.

We didn't even get into Machine Learning, AI, Web Development, or the dozens of other things people use Python for.

Let this be a stepping stone into your Python journey!



Absolutely HUGE shout out to mJordan for proofing my work at the Puppies and Portfolios meetup. She is one of the most talented CSS developers I have ever met.

If you like nerding out about course building, online education, and the future of education, reach out to me on my LinkedIn or Twitter.

Translated from: https://www.freecodecamp.org/news/how-i-used-python-to-analyze-game-of-thrones-503a96028ce6/
