
Storing pyspider Crawl Results in a MySQL Database


1. To store the scraped data locally, first create a MySQL database named example, then create a table douban_db inside it. Example DDL:

CREATE TABLE `douban_db` (
  `url` varchar(100) NOT NULL,  -- widened from varchar(20): full Douban URLs exceed 20 characters
  `title` varchar(255),         -- added: detail_page below returns a 'title' field
  `direct` varchar(30),
  `performer` varchar(255),
  `type` varchar(30),
  `district` varchar(30),
  `language` varchar(30),
  `date` varchar(30),
  `time` varchar(30),
  `alias` varchar(255),
  `score` varchar(30),
  `comments` varchar(300),
  `scenario` varchar(300),
  `IMDb` varchar(30),
  PRIMARY KEY (`url`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
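
If the example database does not exist yet, it can be created first (a minimal sketch; adjust the character set to match your setup):

CREATE DATABASE IF NOT EXISTS example DEFAULT CHARACTER SET utf8;
USE example;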

2. By default, the open-source framework pyspider stores crawl results in result.db, a SQLite database. To make the data easier to work with, we store it in MySQL instead. The next step is to override the on_result method so that it instantiates and calls a SQL helper class of our own. A complete example:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2015-03-20 09:46:20
# Project: douban_movie
from pyspider.database.mysql.mysqldb import SQL
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'
    }

    crawl_config = {
        'headers': headers,
        'timeout': 100
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://movie.douban.com/tag/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc("a[href^='http://www.douban.com/tag/']").items():
            self.crawl(each.attr.href, callback=self.list_page)

    @config(age=10 * 24 * 60 * 60, priority=2)
    def list_page(self, response):
        for each in response.doc('.movie-list a').items():
            self.crawl(each.attr.href, priority=9, callback=self.detail_page)

    @config(priority=3)
    def detail_page(self, response):
        return {
            'url': response.url,
            'title': response.doc('h1').text(),
            'direct': ','.join(x.text() for x in response.doc("a[rel='v:directedBy']").items()),
            'performer': ','.join(x.text() for x in response.doc("a[rel='v:starring']").items()),
            'type': ','.join(x.text() for x in response.doc("span[property='v:genre']").items()),
            # NOTE: district, language and alias have no dedicated attribute on the
            # page; the selectors below reuse the starring selector as a placeholder
            # and should be adapted to parse the plain-text info block instead.
            'district': ''.join(x.text() for x in response.doc("a[rel='v:starring']").items()),
            'language': ''.join(x.text() for x in response.doc("a[rel='v:starring']").items()),
            'date': ','.join(x.text() for x in response.doc("span[property='v:initialReleaseDate']").items()),
            'time': ','.join(x.text() for x in response.doc("span[property='v:runtime']").items()),
            'alias': ''.join(x.text() for x in response.doc("a[rel='v:starring']").items()),
            'score': response.doc('.rating_num').text(),
            'comments': response.doc('html > body > div#wrapper > div#content > div.grid-16-8.clearfix > div.article > div#comments-section > div.mod-hd > h4 > i').text(),
            'scenario': response.doc('html > body > div#wrapper > div#content > div.grid-16-8.clearfix > div.article > div.related-info > div#link-report.indent').text(),
            'IMDb': response.doc('a[href^="http://www.imdb.com/title/"]').text(),
        }

    def on_result(self, result):
        # on_result fires for every callback; intermediate pages return None.
        if not result or not result['title']:
            return
        sql = SQL()
        sql.replace('douban_db', **result)
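
With the handler saved, the crawler can be run through pyspider's all-in-one mode and controlled from the web UI (served on port 5000 by default):

pyspider all
# then open http://localhost:5000, create a project named douban_movie,
# paste the handler above, and click Run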

 

A few points about the code above are worth explaining:
a. To keep the server from detecting the crawler and banning the client IP (which shows up as 403 Forbidden responses), we send an HTTP header set with every request that masquerades as an ordinary browser. The usage is:

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'
}

crawl_config = {
    'headers': headers,
    'timeout': 100
}

b. @every(minutes=24 * 60) schedules on_start to run once a day;
@config(age=10 * 24 * 60 * 60) marks a crawled page as valid for 10 days, so it will not be re-fetched within that window.

c. The next point is important: overriding on_result. This is essentially polymorphism: whenever a callback returns, pyspider invokes on_result with the return value. By default, on_result flushes the data into SQLite, so to insert into MySQL instead we override it as follows:

def on_result(self, result):
    if not result or not result['title']:
        return
    sql = SQL()
    sql.replace('douban_db', **result)  # first argument is the table name, not the database name

Note that the if not result or not result['title']: guard is essential; without it the method raises an error complaining that result is undefined, because on_result also fires for callbacks (such as index_page and list_page) whose return value is None.
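
A slightly more defensive variant (my own tweak, not from the original) uses dict.get so a result dict that happens to lack the 'title' key cannot raise a KeyError:

def on_result(self, result):
    # result is None for callbacks that only schedule follow-up crawls
    if not result or not result.get('title'):
        return
    sql = SQL()
    sql.replace('douban_db', **result)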

3. The code above instantiates and calls a SQL class of our own, imported with from pyspider.database.mysql.mysqldb import SQL, so that module must exist at that path. Save the following as pyspider/pyspider/database/mysql/mysqldb.py:

# -*- encoding: utf-8 -*-
from __future__ import print_function  # Python 2/3 compatible print
from six import itervalues
import mysql.connector
from mysql.connector import errorcode  # needed for the error checks below


class SQL:
    username = 'root'       # database user
    password = 'root'       # database password
    database = 'example'    # the database created in step 1
    host = '127.0.0.1'      # database host
    connection = None       # shared connection, set by connect()
    connect_on_init = True  # renamed from 'connect', which shadowed the method below
    placeholder = '%s'

    def __init__(self):
        if self.connect_on_init:
            self.connect()

    def escape(self, string):
        return '`%s`' % string

    def connect(self):
        config = {
            'user': SQL.username,
            'password': SQL.password,
            'host': SQL.host
        }
        if SQL.database is not None:
            config['database'] = SQL.database
        try:
            SQL.connection = mysql.connector.connect(**config)
            return True
        except mysql.connector.Error as err:
            if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
                print('The credentials you provided are not correct.')
            elif err.errno == errorcode.ER_BAD_DB_ERROR:
                print('The database you provided does not exist.')
            else:
                print('Something went wrong:', err)
            return False

    def replace(self, tablename=None, **values):
        if SQL.connection is None:
            print('Please connect first')
            return False
        tablename = self.escape(tablename)
        if values:
            _keys = ', '.join(self.escape(k) for k in values)
            _values = ', '.join([self.placeholder] * len(values))
            sql_query = 'REPLACE INTO %s (%s) VALUES (%s)' % (tablename, _keys, _values)
        else:
            # MySQL has no DEFAULT VALUES clause; an empty column list is the equivalent
            sql_query = 'REPLACE INTO %s () VALUES ()' % tablename
        cur = SQL.connection.cursor()
        try:
            if values:
                cur.execute(sql_query, list(itervalues(values)))
            else:
                cur.execute(sql_query)
            SQL.connection.commit()
            return True
        except mysql.connector.Error as err:
            print('An error occurred: {}'.format(err))
            return False
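
To sanity-check the helper before wiring it into the spider, it can be exercised on its own (a minimal sketch; the field values here are made up, and it assumes the MySQL server and credentials above are in place):

from pyspider.database.mysql.mysqldb import SQL

sql = SQL()
# REPLACE keys on the primary key (url), so re-running this is idempotent
ok = sql.replace('douban_db',
                 url='http://movie.douban.com/subject/1292052/',
                 title='The Shawshank Redemption',
                 score='9.6')
print('replace succeeded:', ok)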

Tutorial: http://blog.binux.me/2015/01/pyspider-tutorial-level-1-html-and-css-selector/
Live demo: http://demo.pyspider.org/
