Scraping data from anti-crawler websites with Python

Reposted from you_are_my_dream, 2016/11/29. Tags: crawler, python, data, website

Sites with anti-crawler measures, such as Tianyancha (天眼查), can be scraped fairly easily with PhantomJS and Selenium together.

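For example, search for Baidu (百度) on Tianyancha and then view the page source: the Baidu entries shown on the page cannot be found in the source, so a plain HTTP fetch would miss them. The ng-binding and ng-bind-html attributes matched in the code below suggest that AngularJS fills the results in after the page loads, which is why a real browser engine such as PhantomJS is needed to get them.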

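If the keyword is a Chinese string, encode it as UTF-8 and percent-encode the result so that it can be placed in a URL the browser understands.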
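As a minimal sketch of that encoding step on its own: the snippet below is written for Python 3, where the `quote` function that the script reaches through `urllib2` lives in `urllib.parse`. The names `SEARCH_URL` and `build_search_url` are hypothetical, and passing `safe=""` is a choice made here, not something the original script does.

```python
from urllib.parse import quote

# URL pattern copied from the article's script.
SEARCH_URL = "http://www.tianyancha.com/search?key=%s&checkFrom=searchBox"


def build_search_url(keyword):
    """Percent-encode a (possibly Chinese) keyword and place it in the search URL."""
    # quote() encodes str input as UTF-8 by default and escapes each byte as %XX.
    # safe="" also escapes "/" so the keyword cannot change the URL structure.
    return SEARCH_URL % quote(keyword, safe="")


if __name__ == "__main__":
    print(build_search_url("百度"))
```

For the keyword 百度 this yields `key=%E7%99%BE%E5%BA%A6`, the UTF-8 bytes of the two characters written as percent escapes.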

The code is as follows (it scrapes the matching company names):

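#!/usr/bin/python
# coding: utf-8
# Python 2 script (urllib2, raw_input, print statement)

import sys
import urllib2

from bs4 import BeautifulSoup
from selenium import webdriver

# Full path to phantomjs.exe after unpacking the zip archive
driver = webdriver.PhantomJS(executable_path=r"D:\phantomjs-2.1.1-windows\bin\phantomjs.exe")

def search(keyword):
    # raw_input returns a byte string in the console encoding; calling
    # .encode() on it directly fails for Chinese input, so decode it first
    if not isinstance(keyword, unicode):
        keyword = keyword.decode(sys.stdin.encoding or "utf-8")
    # Percent-encode the UTF-8 bytes for use in the URL
    url_keyword = urllib2.quote(keyword.encode("utf-8"))
    url = "http://www.tianyancha.com/search?key=%s&checkFrom=searchBox" % url_keyword
    # print(url)
    driver.get(url)

    # page_source is the page after PhantomJS has run its JavaScript
    soup = BeautifulSoup(driver.page_source, "lxml")
    # print(soup)
    spans = soup.find_all("span", {"class": "ng-binding", "ng-bind-html": "node.name | trustHtml"})

    for s in spans:
        # Print the text content
        print s.get_text()

if __name__ == "__main__":
    try:
        while True:
            x = raw_input("Enter a search string: ")
            search(x)
    except (KeyboardInterrupt, EOFError):
        pass
    finally:
        # Shut down the PhantomJS process on exit
        driver.quit()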
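To see what the find_all filter in the script selects, here is a rough Python 3 sketch that applies the same attribute test to two small HTML fragments. It is a standard-library stand-in, not what the script itself uses: `NameSpanParser`, `extract_names`, and both fragments are made up for illustration, and the class attribute is compared as an exact string, which is stricter than BeautifulSoup's matching.

```python
from html.parser import HTMLParser

# Attribute values taken from the article's find_all() call.
TARGET_ATTRS = {"class": "ng-binding", "ng-bind-html": "node.name | trustHtml"}


class NameSpanParser(HTMLParser):
    """Collect the non-empty text of every <span> matching TARGET_ATTRS."""

    def __init__(self):
        super().__init__()
        self.names = []    # finished company names
        self._depth = 0    # > 0 while inside a matching span
        self._chunks = []  # text pieces of the span being read

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested tag inside a matching span (assumes well-formed nesting).
            self._depth += 1
        elif tag == "span" and all(
            dict(attrs).get(key) == value for key, value in TARGET_ATTRS.items()
        ):
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._chunks).strip()
                if text:  # skip spans that have not been filled in
                    self.names.append(text)

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_names(page_source):
    """Return the text of all matching spans found in page_source."""
    parser = NameSpanParser()
    parser.feed(page_source)
    parser.close()
    return parser.names


# Hypothetical fragments: an unrendered template and a rendered result.
RAW_TEMPLATE = '<span class="ng-binding" ng-bind-html="node.name | trustHtml"></span>'
RENDERED = (
    '<div><span class="ng-binding" ng-bind-html="node.name | trustHtml">'
    "<em>Example</em> Co., Ltd.</span>"
    '<span class="other">ignored</span></div>'
)

if __name__ == "__main__":
    print(extract_names(RAW_TEMPLATE))
    print(extract_names(RENDERED))
```

The unrendered fragment gives an empty list and the rendered one gives a single name, which mirrors the difference between viewing the page source and reading driver.page_source after the browser engine has run the page's scripts.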
