
Python Multi-Process Crawling and Multiple Data Storage Methods (Python Crawler in Practice, Part 2)

17 April 2019

1. Multi-process crawler

  For a crawler that has to fetch a large amount of data, or whose data needs heavier processing, Python's multi-process or multi-thread mechanisms can be used. Multi-processing distributes the work across several CPU cores that run in parallel, while multi-threading runs several cooperating "sub-tasks" inside a single process (in CPython only one thread executes Python code at any given moment). Python offers several modules for both models; here the multiprocessing module is used to build the crawler. Testing showed that, because the target site has anti-crawling measures, the crawler starts to report errors once the number of URLs and processes grows large.
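
To make the two models concrete, here is a minimal sketch (not from the original article) that fetches a handful of pages once with a process pool and once with a thread pool. The pool sizes and the page range are illustrative assumptions; multiprocessing.dummy exposes a thread-backed pool with the same map() API.

# Minimal comparison sketch: the same map-style API backed by processes vs. threads.
# The pool sizes and the number of pages are illustrative placeholders.
from multiprocessing import Pool                        # process-backed pool
from multiprocessing.dummy import Pool as ThreadPool    # thread-backed pool, same API
import requests

def fetch(url):
    '''Download one page and return its URL and HTTP status code.'''
    resp = requests.get(url, timeout=10)
    return url, resp.status_code

if __name__ == "__main__":
    urls = ['https://www.qiushibaike.com/text/page/%d' % i for i in range(1, 5)]

    process_pool = Pool(processes=2)      # real parallelism, one interpreter per worker
    print(process_pool.map(fetch, urls))
    process_pool.close()
    process_pool.join()

    thread_pool = ThreadPool(4)           # usually enough for IO-bound downloading
    print(thread_pool.map(fetch, urls))
    thread_pool.close()
    thread_pool.join()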

2. The code

#!/usr/bin/python
# -*- coding: utf-8 -*-

import re
import time 
import requests
from multiprocessing import Pool

duanzi_list = []

def get_web_html(url):
 '''
 @params:获取url地址web站点的html数据
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 response = ''  # empty fallback so the regex step degrades gracefully when the request fails
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print e
 return response

def scrap_qiushi_info(url):
 '''
 @params:url,获取段子数据信息
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall('<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 # the vote count is the <i class="number"> followed by the "好笑" label; matching every
 # <i class="number"> tag would also pick up the comment counts
 laugh_counts = re.findall('<i class="number">(\d+)</i> 好笑',html,re.S|re.M)
 comment_counts = re.findall('<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # the joke text sits inside a <span> within the content div (page structure at the time of writing)
 contents = re.findall('<div class="content">.*?<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 time.sleep(1)
 return duanzi_list

def normal_scapper(url_lists):
 '''
 定义调用函数,使用普通的爬虫函数爬取数据
 '''
 begin_time = time.time()
 for url in url_lists:
  scrap_qiushi_info(url)
 end_time = time.time()
 print "普通爬虫一共耗费时长:%f" % (end_time - begin_time)

def muti_process_scapper(url_lists,process_num=2):
 '''
 定义多进程爬虫调用函数,使用multiprocessing模块爬取web数据
 '''
 begin_time = time.time()
 pool = Pool(processes=process_num)
 pool.map(scrap_qiushi_info,url_lists)
 end_time = time.time()
 print "%d个进程爬虫爬取所耗费时长为:%s" % (process_num,(end_time - begin_time))

def main():
 '''
 定义main()函数,程序入口,通过列表推导式获取url地址,调用爬虫函数
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 normal_scapper(url_lists)
 muti_process_scapper(url_lists,process_num=2)


if __name__ == "__main__":
 main()
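
One caveat about the script above: duanzi_list is a module-level list, and under multiprocessing every worker process gets its own copy of it, so the appends made inside scrap_qiushi_info() never reach the parent process; only the values returned through pool.map() come back. The toy example below (an addition, not part of the original article) demonstrates the behaviour:

from multiprocessing import Pool

shared = []   # every worker process receives its own copy of this list

def work(n):
    shared.append(n)   # visible only inside the worker that executed the call
    return n * n       # returned values, in contrast, do reach the parent

if __name__ == "__main__":
    pool = Pool(processes=2)
    print(pool.map(work, range(5)))   # [0, 1, 4, 9, 16], in the input order
    print(shared)                     # [] in the parent: the appends were not shared
    pool.close()
    pool.join()

So if the multi-process version should also feed the storage back ends in the following sections, the parent has to collect the lists returned by pool.map() and merge them itself.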

3. Storing the scraped data in MongoDB

#!/usr/bin/python
# -*- coding: utf-8 -*-

import re
import time 
import json
import requests
import pymongo
from multiprocessing import Pool

duanzi_list = []

def get_web_html(url):
 '''
 @params:获取url地址web站点的html数据
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 response = ''  # empty fallback so the regex step degrades gracefully when the request fails
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print e
 return response

def scrap_qiushi_info(url):
 '''
 @params:url,获取段子数据信息
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall('<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 # the vote count is the <i class="number"> followed by the "好笑" label; matching every
 # <i class="number"> tag would also pick up the comment counts
 laugh_counts = re.findall('<i class="number">(\d+)</i> 好笑',html,re.S|re.M)
 comment_counts = re.findall('<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # the joke text sits inside a <span> within the content div (page structure at the time of writing)
 contents = re.findall('<div class="content">.*?<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 return duanzi_list

def write_into_mongo(datas):
 '''
 @datas: 需要插入到mongoDB的数据,封装为字典,通过遍历的方式将数据插入到mongoDB中,insert_one()表示一次插入一条数据
 '''
 client = pymongo.MongoClient('localhost',27017)
 duanzi = client['duanzi_db']
 duanzi_info = duanzi['duanzi_info']
 for data in datas:
  duanzi_info.insert_one(data)

def query_data_from_mongo():
 '''
 查询mongoDB中的数据
 '''
 client = pymongo.MongoClient('localhost',27017)['duanzi_db']['duanzi_info']
 for data in client.find():
  print data 
 print "一共查询到%d条数据" % (client.find().count())


def main():
 '''
 定义main()函数,程序入口,通过列表推导式获取url地址,调用爬虫函数
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 for url in url_lists:
  scrap_qiushi_info(url) 
  time.sleep(1)
 write_into_mongo(duanzi_list)

if __name__ == "__main__":
 main()
 #query_data_from_mongo()
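
For larger batches, insert_many() writes all documents in one round trip instead of one insert_one() per item, and on newer pymongo releases (3.7+) count_documents() replaces the deprecated cursor.count() used above. A small alternative sketch (an addition, reusing the article's database and collection names):

import pymongo

def write_into_mongo_bulk(datas):
    '''Insert the whole batch in one call and report the collection size.'''
    collection = pymongo.MongoClient('localhost', 27017)['duanzi_db']['duanzi_info']
    if datas:                       # insert_many() rejects an empty list
        collection.insert_many(datas)
    print(collection.count_documents({}))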

4. Inserting the data into a MySQL database

  To keep the scraped data as persistent storage in the relational database MySQL, a database and a table have to be created first:

1. Create the database
MariaDB [(none)]> create database qiushi;
Query OK, 1 row affected (0.00 sec)

2. Select the database
MariaDB [(none)]> use qiushi;
Database changed

3. Create the table
MariaDB [qiushi]> create table qiushi_info(id int(32) unsigned primary key auto_increment,username varchar(64) not null,level int default 0,laugh_count int default 0,comment_count int default 0,content text default '')engine=InnoDB charset='UTF8';
Query OK, 0 rows affected, 1 warning (0.06 sec)

MariaDB [qiushi]> show create table qiushi_info;
+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table       | Create Table                                                                                                                                                                                                                                                                                            |
+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| qiushi_info | CREATE TABLE `qiushi_info` (
  `id` int(32) unsigned NOT NULL AUTO_INCREMENT,
  `username` varchar(64) NOT NULL,
  `level` int(11) DEFAULT '0',
  `laugh_count` int(11) DEFAULT '0',
  `comment_count` int(11) DEFAULT '0',
  `content` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

The code that writes the data into MySQL is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#blog:http://www.cnblogs.com/cloudlab/

import re
import time 
import pymysql
import requests

duanzi_list = []

def get_web_html(url):
 '''
 @params:获取url地址web站点的html数据
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 response = ''  # empty fallback so the regex step degrades gracefully when the request fails
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print e
 return response

def scrap_qiushi_info(url):
 '''
 @params:url,获取段子数据信息
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall('<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 # the vote count is the <i class="number"> followed by the "好笑" label; matching every
 # <i class="number"> tag would also pick up the comment counts
 laugh_counts = re.findall('<i class="number">(\d+)</i> 好笑',html,re.S|re.M)
 comment_counts = re.findall('<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # the joke text sits inside a <span> within the content div (page structure at the time of writing)
 contents = re.findall('<div class="content">.*?<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 return duanzi_list

def write_into_mysql(datas):
 '''
 @params: datas,将爬虫获取的数据写入到MySQL数据库中
 '''
 conn = None
 cursor = None
 try:
  conn = pymysql.connect(host='localhost',port=3306,user='root',password='',db='qiushi',charset='utf8')
  cursor = conn.cursor(pymysql.cursors.DictCursor)
  # parameterized query: the driver escapes quotes in the content instead of the SQL
  # string being pieced together by hand
  sql = "INSERT INTO qiushi_info(username,level,laugh_count,comment_count,content) VALUES(%s,%s,%s,%s,%s)"
  for data in datas:
   data_list = (data['username'],int(data['level']),int(data['laugh_count']),int(data['comment_count']),data['content'])
   cursor.execute(sql,data_list)
  conn.commit()
 except Exception as e:
  print e
 finally:
  if cursor:
   cursor.close()
  if conn:
   conn.close()


def main():
 '''
 定义main()函数,程序入口,通过列表推导式获取url地址,调用爬虫函数
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 for url in url_lists:
  scrap_qiushi_info(url) 
  time.sleep(1)
 write_into_mysql(duanzi_list)

if __name__ == "__main__":
 main()
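
When the whole batch should go in as one transaction, pymysql's executemany() can send the same parameterized statement for every row followed by a single commit. A brief sketch (an addition, reusing the article's table and connection settings):

import pymysql

def write_into_mysql_batch(datas):
    '''Insert all rows with one executemany() call and a single commit.'''
    conn = pymysql.connect(host='localhost', port=3306, user='root',
                           password='', db='qiushi', charset='utf8')
    try:
        with conn.cursor() as cursor:
            sql = ("INSERT INTO qiushi_info"
                   "(username,level,laugh_count,comment_count,content) "
                   "VALUES(%s,%s,%s,%s,%s)")
            rows = [(d['username'], int(d['level']), int(d['laugh_count']),
                     int(d['comment_count']), d['content']) for d in datas]
            cursor.executemany(sql, rows)
        conn.commit()
    finally:
        conn.close()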

5. Writing the scraped data to a CSV file

  A CSV file is a plain-text format whose fields are separated by commas. It can be read as plain text or opened directly in Excel, which makes it a very common storage format; here the scraped items are written to a CSV file.

The code that stores the data in a CSV file is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#blog:http://www.cnblogs.com/cloudlab/

import re
import csv
import time 
import requests

duanzi_list = []

def get_web_html(url):
 '''
 @params:获取url地址web站点的html数据
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 response = ''  # empty fallback so the regex step degrades gracefully when the request fails
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print e
 return response

def scrap_qiushi_info(url):
 '''
 @params:url,获取段子数据信息
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall('<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 # the vote count is the <i class="number"> followed by the "好笑" label; matching every
 # <i class="number"> tag would also pick up the comment counts
 laugh_counts = re.findall('<i class="number">(\d+)</i> 好笑',html,re.S|re.M)
 comment_counts = re.findall('<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # the joke text sits inside a <span> within the content div (page structure at the time of writing)
 contents = re.findall('<div class="content">.*?<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 return duanzi_list

def write_into_csv(datas,filename):
 '''
 @datas: 需要写入csv文件的数据内容,是一个列表
 @params:filename,需要写入到目标文件的csv文件名
 '''
 with open(filename,'wb') as f:  # open() replaces the Python-2-only file(); binary mode avoids blank rows from the py2 csv module on Windows
  writer = csv.writer(f)
  writer.writerow(('username','level','laugh_count','comment_count','content'))
  for data in datas:
   writer.writerow((data['username'],data['level'],data['laugh_count'],data['comment_count'],data['content']))

def main():
 '''
 定义main()函数,程序入口,通过列表推导式获取url地址,调用爬虫函数
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 for url in url_lists:
  scrap_qiushi_info(url) 
  time.sleep(1)
 write_into_csv(duanzi_list,'/root/duanzi_info.csv')

if __name__ == "__main__":
 main()
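
To check the result, the file can be read back with csv.DictReader (or loaded with pandas.read_csv for further analysis). A quick sketch (an addition), assuming the output path used in main():

import csv

def read_back_csv(filename='/root/duanzi_info.csv'):
    '''Print each stored record, keyed by the header row written above.'''
    with open(filename) as f:
        for row in csv.DictReader(f):
            print("%s  laugh=%s  comments=%s" % (row['username'], row['laugh_count'], row['comment_count']))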

威尼斯人线上娱乐 2三).那个页面是静态页面,url页相比有规律,所以很轻松构造出具备的url的地点威尼斯人线上娱乐 3肆).爬虫每一个分页里面包车型客车有所的Python书和呼应的url,比如第2页里面有”笨办法那本书”,我们只必要领取书名和对应的url威尼斯人线上娱乐 4威尼斯人线上娱乐 5

[4]-
Bilibili用户爬虫。,抓取字段:用户id,外号,性别,头像,等第,经验值,听众数,破壳日,地址,注册时间,具名,品级与经验值等。抓取之后生成B站用户数量报告。

6. Writing the scraped data to a plain text file

#!/usr/bin/python
# -*- coding: utf-8 -*-
#blog:http://www.cnblogs.com/cloudlab/

import re
import csv
import time 
import requests

duanzi_list = []

def get_web_html(url):
 '''
 @params:获取url地址web站点的html数据
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 response = ''  # empty fallback so the regex step degrades gracefully when the request fails
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print e
 return response

def scrap_qiushi_info(url):
 '''
 @params:url,获取段子数据信息
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall('<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 # the vote count is the <i class="number"> followed by the "好笑" label; matching every
 # <i class="number"> tag would also pick up the comment counts
 laugh_counts = re.findall('<i class="number">(\d+)</i> 好笑',html,re.S|re.M)
 comment_counts = re.findall('<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # the joke text sits inside a <span> within the content div (page structure at the time of writing)
 contents = re.findall('<div class="content">.*?<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 return duanzi_list

def write_into_files(datas,filename):
 '''
 定义数据存入写文件的函数
 @params:datas需要写入的数据
 @filename:将数据写入到指定的文件名
 '''
 print "开始写入文件.."
 with open(filename,'w') as f:  # open() replaces the Python-2-only file() builtin
  f.write("用户名" + "\t" + "用户等级" + "\t" + "笑话数" + "\t" + "评论数" + "\t" + "段子内容" + "\n")
  for data in datas:
   f.write(data['username'] + "\t" + \
    data['level'] + "\t" + \
    data['laugh_count'] + "\t" + \
    data['comment_count'] + "\t" + \
    data['content'] + "\n" + "\n"
   )

def main():
 '''
 定义main()函数,程序入口,通过列表推导式获取url地址,调用爬虫函数
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 for url in url_lists:
  scrap_qiushi_info(url) 
  time.sleep(1)
 write_into_files(duanzi_list,'/root/duanzi.txt')

if __name__ == "__main__":
 main()
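
All of the scripts in this article target Python 2 (print statements, byte-string handling). Under Python 3 the .encode('utf8') step in get_web_html() would be dropped and the file opened with an explicit encoding; a rough Python 3 equivalent of the writer above (an addition, not part of the original scripts):

def write_into_files_py3(datas, filename):
    '''Python 3 variant: text mode with an explicit UTF-8 encoding.'''
    columns = ['username', 'level', 'laugh_count', 'comment_count', 'content']
    with open(filename, 'w', encoding='utf-8') as f:
        f.write('\t'.join(columns) + '\n')
        for data in datas:
            f.write('\t'.join(data[col] for col in columns) + '\n\n')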

 
