Hard-core technical content only. If you work through all of it, I believe you will improve more than you expect.
Chapter 1: Introduction to crawlers
1. Getting to know crawlers
Chapter 2: requests in practice (basic crawlers)
1. Scraping Douban movies
2. KFC restaurant lookup
3. Cracking Baidu Translate
4. Sogou home page
5. Web page collector
6. Scraping data from the National Medical Products Administration (NMPA)
Chapter 3: Parsing crawled data (bs4, xpath, regular expressions)
1. bs4 parsing basics
2. bs4 example
3. xpath parsing basics
4. xpath example: scraping 4K images
5. xpath example: 58.com second-hand housing
6. xpath example: scraping free résumé templates from Zhanzhang Sucai (站长素材)
7. xpath example: scraping city names
8. Regex parsing
9. Regex parsing: paginated scraping
10. Scraping an image
Chapter 4: Automatic CAPTCHA recognition
1. CAPTCHA recognition on Gushiwen (古诗文网)
fateadm_api.py (configuration needed for recognition; keeping it in the same folder is recommended)
Calling the API
Chapter 5: Advanced requests module usage (simulated login)
1. Proxies
2. Simulated login to Renren
3. Scraping the logged-in user's Renren profile page
Chapter 6: High-performance asynchronous crawlers (thread pools, coroutines)
1. Multi-task asynchronous crawling with aiohttp
2. Flask server
3. Multi-task coroutines
4. Multi-task asynchronous crawler
5. Example
6. Synchronous crawler
7. Thread pool basics
8. Using a thread pool in a crawler
9. Coroutines
Chapter 7: Handling dynamically loaded data (selenium, simulated 12306 login)
1. selenium basics
2. Other selenium automation
3. 12306 login sample code
4. Action chains and iframe handling
5. Headless Chrome and detection evasion
6. Simulated 12306 login with selenium
7. Simulated login to Qzone
Chapter 8: The scrapy framework
1. Project walkthroughs and scrapy configuration changes
2. bossPro example
3. bossPro example
4. Database example
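Chapter 1: Introduction to crawlers
Level 0: Getting to know crawlers
1. A first look at crawlers
A crawler is, in essence, a program that fetches data of value to us from the web.
2. Mapping out the path
2-1. How a browser works
(1) Parsing data: after the server sends its response to the browser, the browser does not hand the raw data straight to us. The data is written in a machine-oriented format, so the browser first translates it into content we can read;
(2) Extracting data: from the data we receive, we pick out the parts that are useful to us;
(3) Storing data: the useful data we picked out is saved to a file or a database.
2-2. How a crawler works
(1) Fetching data: the crawler sends a request to the server for the URL we provide, and the server returns data;
(2) Parsing data: the crawler parses the data returned by the server into a format we can read;
(3) Extracting data: the crawler then extracts the data we need from it;
(4) Storing data: the crawler saves the useful data so that it can be used and analyzed later.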
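The four crawler steps above can be sketched with the standard library alone. This is a minimal illustration, not the code used in later chapters: the function names `fetch`, `extract_links`, and `store`, the `LinkParser` class, and the inline HTML page are all assumptions made for the example.

```python
import json
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch(url: str) -> str:
    """Step 1: request the page and return the response body as text."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class LinkParser(HTMLParser):
    """Steps 2 and 3: parse the HTML and collect (text, href) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.links.append({"text": data.strip(), "href": self._href})
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


def extract_links(html: str) -> list:
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def store(records: list, path: str) -> None:
    """Step 4: save the extracted records as JSON."""
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(records, fp, ensure_ascii=False)


# Offline demonstration on a made-up page; with network access,
# fetch("<some url>") would supply the HTML instead.
html = '<html><body><a href="/a">First</a><p>x</p><a href="/b">Second</a></body></html>'
links = extract_links(html)
print(links)
```

Calling `store(links, "links.json")` afterwards would complete step 4. The later chapters replace `urllib` with `requests` and the hand-written parser with bs4, xpath, or regular expressions.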
————————————————
Copyright notice: this is an original article by CSDN blogger "yk 坤帝", released under the CC 4.0 BY-SA license. When reposting, please include the original source link and this notice.
Original link:
https://blog.csdn.net/qq_45803923/article/details/116133325
Chapter 2: requests in practice (basic crawlers)
1. Scraping Douban movies
```python
import json

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
url = "https://movie.douban.com/j/chart/top_list"
params = {
    'type': '24',
    'interval_id': '100:90',
    'action': '',
    'start': '0',   # index of the first movie to fetch
    'limit': '20'   # number of movies to fetch per request
}
response = requests.get(url, params=params, headers=headers)
list_data = response.json()
# A context manager makes sure the file is flushed and closed
with open('douban.json', 'w', encoding='utf-8') as fp:
    json.dump(list_data, fp=fp, ensure_ascii=False)
print('over!!!!')
```
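The `start` and `limit` parameters make it straightforward to walk through the chart page by page. The helper below is a hypothetical sketch (the name `page_params` is not from the original); it only builds the query dictionaries and sends no requests.

```python
def page_params(page: int, page_size: int = 20) -> dict:
    """Build the query parameters for one page of the chart (page is 0-based)."""
    return {
        'type': '24',
        'interval_id': '100:90',
        'action': '',
        'start': str(page * page_size),  # offset of the first movie on this page
        'limit': str(page_size),         # number of movies per page
    }


first_three = [page_params(p) for p in range(3)]
starts = [p['start'] for p in first_three]
print(starts)
```

Each dictionary can be passed as `params=` to `requests.get` in the listing above to fetch the corresponding page.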
2. KFC restaurant lookup
```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
word = input('Enter a location: ')
params = {
    'cname': '',
    'pid': '',
    'keyword': word,
    'pageIndex': '1',
    'pageSize': '10'
}
# A POST request carries its fields in the form body, so pass them with data=
# rather than params= (which would append them to the query string)
response = requests.post(url, data=params, headers=headers)
page_text = response.text
fileName = word + '.txt'
with open(fileName, 'w', encoding='utf-8') as f:
    f.write(page_text)
```
3. Cracking Baidu Translate
```python
import json

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
post_url = 'https://fanyi.baidu.com/sug'
word = input('enter a word:')
data = {
    'kw': word
}
response = requests.post(url=post_url, data=data, headers=headers)
dic_obj = response.json()
fileName = word + '.json'
# ensure_ascii=False writes Chinese characters as they are instead of ASCII escapes
with open(fileName, 'w', encoding='utf-8') as fp:
    json.dump(dic_obj, fp=fp, ensure_ascii=False)
print('over!')
```
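The effect of `ensure_ascii=False` is easy to see on a small dictionary. The record below is made up for illustration; both outputs decode to the same data, only the text on disk differs.

```python
import json

record = {"kw": "dog", "v": "狗"}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keeps non-ASCII characters

print(escaped)   # the Chinese character appears as a \uXXXX escape
print(readable)  # the Chinese character appears as-is
```

When writing with `ensure_ascii=False`, open the file with an explicit `encoding='utf-8'`, as the listing above does.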
4. Sogou home page
```python
import requests

url = 'https://www.sogou.com/?pid=sogou-site-d5da28d4865fb927'
response = requests.get(url)
page_text = response.text
print(page_text)
with open('./sougou.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print('Scraping finished!!!')
```
5. Web page collector
```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
url = 'https://www.sogou.com/sogou'
kw = input('enter a word:')
param = {
    'query': kw
}
response = requests.get(url, params=param, headers=headers)
page_text = response.text
fileName = kw + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(fileName, 'saved successfully!!!')
```
6. Scraping NMPA data
```python
import json

import requests

url = "http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4385.0 Safari/537.36'
}
# Step 1: collect the record IDs from list pages 1 to 5.
# id_list is created once, before the loop, so IDs from every page accumulate.
id_list = []
for page in range(1, 6):
    data = {
        'on': 'true',
        'page': str(page),
        'pageSize': '15',
        'productName': '',
        'conditionType': '1',
        'applyname': '',
        'applysn': ''
    }
    json_ids = requests.post(url, data=data, headers=headers).json()
    for dic in json_ids['list']:
        id_list.append(dic['ID'])
# print(id_list)
# Step 2: request the detail record for every collected ID
post_url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById'
all_data_list = []
for item_id in id_list:
    data = {
        'id': item_id
    }
    datail_json = requests.post(url=post_url, data=data, headers=headers).json()
    # print(datail_json, '---------------------over')
    all_data_list.append(datail_json)
with open('allData.json', 'w', encoding='utf-8') as fp:
    json.dump(all_data_list, fp=fp, ensure_ascii=False)
print('over!!!')
```
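## Chapter 3: Parsing crawled data (bs4, xpath, regular expressions)
1. bs4 parsing basics
```python
from bs4 import BeautifulSoup

# Load a local HTML file into a BeautifulSoup object
with open('第三章 数据分析/text.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')
# print(soup)
# print(soup.a)                                       # first <a> tag
# print(soup.div)                                     # first <div> tag
# print(soup.find('div'))                             # same as soup.div
# print(soup.find('div', class_="song"))              # first <div class="song">
# print(soup.find_all('a'))                           # every <a> tag, as a list
# print(soup.select('.tang'))                         # CSS selector, returns a list
# print(soup.select('.tang > ul > li >a')[0].text)
# print(soup.find('div', class_="song").text)         # all text under the tag
# print(soup.find('div', class_="song").string)       # only the tag's direct text
print(soup.select('.tang > ul > li >a')[0]['href'])   # attribute value
```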
2. bs4 example
```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
url = "http://sanguo.5000yan.com/"
page_text = requests.get(url, headers=headers).content
# print(page_text)
soup = BeautifulSoup(page_text, 'lxml')
li_list = soup.select('.list > ul > li')
with open('./sanguo.txt', 'w', encoding='utf-8') as fp:
    for li in li_list:
        title = li.a.string
        # print(title)
        detail_url = 'http://sanguo.5000yan.com/' + li.a['href']
        print(detail_url)
        # Fetch the chapter page and pull the body text out of <div class="grap">
        detail_page_text = requests.get(detail_url, headers=headers).content
        detail_soup = BeautifulSoup(detail_page_text, 'lxml')
        div_tag = detail_soup.find('div', class_="grap")
        content = div_tag.text
        fp.write(title + ":" + content + '\n')
        print(title, 'scraped successfully!!!')
```
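3. xpath parsing basics
```python
from lxml import etree

# etree.parse defaults to a strict XML parser; passing an HTMLParser lets it
# read ordinary HTML that is not well-formed XML
parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse('第三章 数据分析/text.html', parser=parser)
# r = tree.xpath('/html/head/title')                        # absolute path from the root
# print(r)
# r = tree.xpath('/html/body/div')
# print(r)
# r = tree.xpath('/html//div')                              # // skips any number of levels
# print(r)
# r = tree.xpath('//div')                                   # every <div> in the document
# print(r)
# r = tree.xpath('//div[@class="song"]')                    # filter by attribute
# print(r)
# r = tree.xpath('//div[@class="song"]/p[3]')               # indexes start at 1
# print(r)
# r = tree.xpath('//div[@class="tang"]//li[5]/a/text()')    # direct text of the tag
# print(r)
# r = tree.xpath('//li[7]/i/text()')
# print(r)
# r = tree.xpath('//li[7]//text()')                         # all text below the tag
# print(r)
# r = tree.xpath('//div[@class="tang"]//text()')
# print(r)
# r = tree.xpath('//div[@class="song"]/img/@src')           # attribute value
# print(r)
```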
4. xpath example: scraping 4K images
```python
import os

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
url = 'http://pic.netbian.com/4kmeinv/'
response = requests.get(url, headers=headers)
# response.encoding = response.apparent_encoding
# response.encoding = 'utf-8'
page_text = response.text
tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="slist"]/ul/li')
# The images are written into picLibs, so create the folder first
if not os.path.exists('./picLibs'):
    os.mkdir('./picLibs')
for li in li_list:
    img_src = 'http://pic.netbian.com/' + li.xpath('./a/img/@src')[0]
    img_name = li.xpath('./a/img/@alt')[0] + '.jpg'
    # Repair the garbled Chinese name: re-encode the mis-decoded text as
    # ISO-8859-1 to get the raw bytes back, then decode them as GBK
    img_name = img_name.encode('iso-8859-1').decode('gbk')
    # print(img_name, img_src)
    # print(type(img_name))
    img_data = requests.get(url=img_src, headers=headers).content
    img_path = 'picLibs/' + img_name
    # print(img_path)
    with open(img_path, 'wb') as fp:
        fp.write(img_data)
    print(img_name, "downloaded")
```
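The `encode('iso-8859-1').decode('gbk')` step in the listing above reverses a wrong decoding. A small offline sketch, assuming the page bytes are GBK but were decoded as ISO-8859-1 (the sample string is made up):

```python
original = "高清壁纸"

# What you get when GBK bytes are decoded with the wrong codec
garbled = original.encode("gbk").decode("iso-8859-1")

# ISO-8859-1 maps every byte to exactly one character, so encoding the
# garbled text back recovers the raw bytes, which can then be decoded as GBK
repaired = garbled.encode("iso-8859-1").decode("gbk")

print(repr(garbled))
print(repaired)
```

Setting `response.encoding` to the page's real encoding before reading `response.text`, as the commented-out lines in the listing hint, avoids the garbling in the first place.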
5. xpath example: 58.com second-hand housing
```python
import requests
from lxml import etree

url = 'https://bj.58.com/ershoufang/p2/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
# Each <div> under the second <section> is one listing
li_list = tree.xpath('//section[@class="list-left"]/section[2]/div')
with open('58.txt', 'w', encoding='utf-8') as fp:
    for li in li_list:
        # The leading ./ makes the path relative to the current listing
        title = li.xpath('./a/div[2]/div/div/h3/text()')[0]
        print(title)
        fp.write(title + '\n')
```
6. xpath example: scraping free résumé templates from Zhanzhang Sucai (站长素材)
```python
import os

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url, headers=headers).text
```
7. xpath example: scraping city names
```python
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
# First version: two separate queries, one for the popular cities and one for all cities
# holt_li_list = tree.xpath('//div[@class="bottom"]/ul/li')
# all_city_name = []
# for li in holt_li_list:
#     host_city_name = li.xpath('./a/text()')[0]
#     all_city_name.append(host_city_name)
# city_name_list = tree.xpath('//div[@class="bottom"]/ul/div[2]/li')
# for li in city_name_list:
#     city_name = li.xpath('./a/text()')[0]
#     all_city_name.append(city_name)
# print(all_city_name, len(all_city_name))
# holt_li_list = tree.xpath('//div[@class="bottom"]/ul//li')
# Second version: the | operator joins both paths into a single query
holt_li_list = tree.xpath('//div[@class="bottom"]/ul/li | //div[@class="bottom"]/ul/div[2]/li')
all_city_name = []
for li in holt_li_list:
    host_city_name = li.xpath('./a/text()')[0]
    all_city_name.append(host_city_name)
print(all_city_name, len(all_city_name))
```
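8. Regex parsing
```python
import os
import re

import requests

if not os.path.exists('./qiutuLibs'):
    os.mkdir('./qiutuLibs')
url = 'https://www.qiushibaike.com/imgrank/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4385.0 Safari/537.36'
}
page_text = requests.get(url, headers=headers).text
# NOTE: the pattern was lost from the source listing. The one below is an
# ASSUMPTION about the page's thumbnail markup; adjust it to the actual HTML.
ex = r'<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
# re.S lets . match newlines, so one pattern can span several lines of HTML
img_src_list = re.findall(ex, page_text, re.S)
print(img_src_list)
for src in img_src_list:
    # The captured src has no scheme, so prepend one
    src = 'https:' + src
    img_data = requests.get(url=src, headers=headers).content
    img_name = src.split('/')[-1]
    imgPath = './qiutuLibs/' + img_name
    with open(imgPath, 'wb') as fp:
        fp.write(img_data)
    print(img_name, "downloaded!!!!!")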
```
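To see why the `re.S` flag matters for `re.findall` on multi-line HTML, here is an offline sketch. The HTML snippet, the host `pic.example.com`, and the pattern are made up for illustration and are not taken from the real page.

```python
import re

sample_html = """
<div class="thumb">
<a href="/article/1" target="_blank">
<img src="//pic.example.com/a/one.jpg" alt="first">
</a>
</div>
<div class="thumb">
<a href="/article/2" target="_blank">
<img src="//pic.example.com/b/two.jpg" alt="second">
</a>
</div>
"""

# Lazy .*? stops at the first following match; (.*?) captures the src value
pattern = r'<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'

srcs = re.findall(pattern, sample_html, re.S)  # . also matches newlines
no_flag = re.findall(pattern, sample_html)     # . stops at every newline

# Derive file names the same way as the listing: scheme prefix, then last path part
names = [('https:' + s).split('/')[-1] for s in srcs]
print(srcs, no_flag, names)
```

Without `re.S` the pattern cannot cross the line break after the opening `<div>`, so nothing is found.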
9. Regex parsing: paginated scraping
```python
import os
import re

import requests

if not os.path.exists('./qiutuLibs'):
    os.mkdir('./qiutuLibs')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4385.0 Safari/537.36'
}
# URL template: %d is replaced by the page number
url = 'https://www.qiushibaike.com/imgrank/page/%d/'
for pageNum in range(1, 3):
    new_url = url % pageNum
    page_text = requests.get(new_url, headers=headers).text
    # NOTE: the pattern was lost from the source listing. The one below is an
    # ASSUMPTION about the page's thumbnail markup; adjust it to the actual HTML.
    ex = r'<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
    img_src_list = re.findall(ex, page_text, re.S)
    print(img_src_list)
    for src in img_src_list:
        src = 'https:' + src
        img_data = requests.get(url=src, headers=headers).content
        img_name = src.split('/')[-1]
        imgPath = './qiutuLibs/' + img_name
        with open(imgPath, 'wb') as fp:
            fp.write(img_data)
        print(img_name, "downloaded!!!!!")
```
10. Scraping an image
```python
import requests

url = 'https://pic.qiushibaike.com/system/pictures/12404/124047919/medium/R7Y2UOCDRBXF2MIQ.jpg'
# .content returns the raw bytes, which is what an image file needs
img_data = requests.get(url).content
with open('qiutu.jpg', 'wb') as fp:
    fp.write(img_data)
```
Chapter 4: Automatic CAPTCHA recognition
1. CAPTCHA recognition on Gushiwen (古诗文网)
You can apply for a developer account and password on the recognition service's site.
```python
import requests
from lxml import etree
from fateadm_api import FateadmApi


def TestFunc(imgPath, codyType):
    pd_id = "xxxxxx"      # PD info, shown on the user center page
    pd_key = "xxxxxxxx"
    app_id = "xxxxxxx"    # developer revenue-share account, shown in the developer center
    app_key = "xxxxxxx"
    # Recognition type.
    # See the pricing page on the official site for the available types; ask customer service if unsure
    pred_type = codyType
    api = FateadmApi(app_id, app_key, pd_id, pd_key)
    # Query the balance
    balance = api.QueryBalcExtend()  # returns the balance directly
    # api.QueryBalc()
    # Recognize from a file:
    file_name = imgPath
    # Multi-site types need an extra src_url parameter; see the API docs:
    # http://docs.fateadm.com/web/#/1?page_id=6
    result = api.PredictFromFileExtend(pred_type, file_name)  # returns the recognized text directly
    # rsp = api.PredictFromFile(pred_type, file_name)  # returns the detailed response
    '''
    # If you are not recognizing from a file, call the Predict interface instead:
    # result = api.PredictExtend(pred_type,data)  # returns the recognized text directly
    rsp = api.Predict(pred_type,data)  # returns the detailed response
    '''
    # just_flag = False
    # if just_flag:
    #     if rsp.ret_code == 0:
    #         # If the recognized result does not match what was expected, this call refunds the order.
    #         # Refunds apply only when a result was recognized normally but failed the site's check;
    #         # do not abuse this, or the account may be banned
    #         api.Justice(rsp.request_id)
    # card_id = "123"
    # card_key = "123"
    # Top up
    # api.Charge(card_id, card_key)
    # LOG("print in testfunc")
    return result


# if __name__ == "__main__":
#     TestFunc()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
# Locate the CAPTCHA image on the login page and save it locally
code_img_src = 'https://so.gushiwen.cn' + tree.xpath('//*[@id="imgCode"]/@src')[0]
img_data = requests.get(code_img_src, headers=headers).content
with open('./code.jpg', 'wb') as fp:
    fp.write(img_data)
# Each call is billed by the service, so recognize the image only once
code_text = TestFunc('code.jpg', 30400)
print('Recognition result: ' + code_text)
```
fateadm_api.py (configuration needed for recognition; keeping it in the same folder as the script is recommended)
Calling the API
```python
# coding=utf-8
import os, sys
import hashlib
import time
import json
import requests

FATEA_PRED_URL = "http://pred.fateadm.com"


def LOG(log):
    # Comment out the print when logging is not needed
    print(log)
    log = None


class TmpObj():
    def __init__(self):
        self.value = None


class Rsp():
    def __init__(self):
        self.ret_code = -1
        self.cust_val = 0.0
        self.err_msg = "succ"
        self.pred_rsp = TmpObj()

    def ParseJsonRsp(self, rsp_data):
        if rsp_data is None:
            self.err_msg = "http request failed, get rsp Nil data"
            return
        jrsp = json.loads(rsp_data)
        self.ret_code = int(jrsp["RetCode"])
        self.err_msg = jrsp["ErrMsg"]
        self.request_id = jrsp["RequestId"]
        if self.ret_code == 0:
            rslt_data = jrsp["RspData"]
            if rslt_data is not None and rslt_data != "":
                jrsp_ext = json.loads(rslt_data)
                if "cust_val" in jrsp_ext:
                    data = jrsp_ext["cust_val"]
                    self.cust_val = float(data)
                if "result" in jrsp_ext:
                    data = jrsp_ext["result"]
                    self.pred_rsp.value = data


def CalcSign(pd_id, passwd, timestamp):
    # Two rounds of MD5: md5(pd_id + timestamp + md5(timestamp + passwd))
    md5 = hashlib.md5()
    md5.update((timestamp + passwd).encode())
    csign = md5.hexdigest()
    md5 = hashlib.md5()
    md5.update((pd_id + timestamp + csign).encode())
    csign = md5.hexdigest()
    return csign


def CalcCardSign(cardid, cardkey, timestamp, passwd):
    md5 = hashlib.md5()
    # hashlib needs bytes in Python 3, so encode the string first
    md5.update((passwd + timestamp + cardid + cardkey).encode())
    return md5.hexdigest()


def HttpRequest(url, body_data, img_data=""):
    rsp = Rsp()
    post_data = body_data
    files = {
        'img_data': ('img_data', img_data)
    }
    header = {
        'User-Agent': 'Mozilla/5.0',
    }
    rsp_data = requests.post(url, post_data, files=files, headers=header)
    rsp.ParseJsonRsp(rsp_data.text)
    return rsp


class FateadmApi():
    # API client class
    # Parameters: (appID, appKey, pdID, pdKey)
    def __init__(self, app_id, app_key, pd_id, pd_key):
        self.app_id = app_id
        if app_id is None:
            self.app_id = ""
        self.app_key = app_key
        self.pd_id = pd_id
        self.pd_key = pd_key
        self.host = FATEA_PRED_URL

    def SetHost(self, url):
        self.host = url

    #
    # Query the balance
    # Parameters: none
    # Returns:
    #   rsp.ret_code: 0 on success
    #   rsp.cust_val: the user's balance
    #   rsp.err_msg: error details on failure
    #
    def QueryBalc(self):
        tm = str(int(time.time()))
        sign = CalcSign(self.pd_id, self.pd_key, tm)
        param = {
            "user_id": self.pd_id,
            "timestamp": tm,
            "sign": sign
        }
        url = self.host + "/api/custval"
        rsp = HttpRequest(url, param)
        if rsp.ret_code == 0:
            LOG("query succ ret: {} cust_val: {} rsp: {} pred: {}".format(rsp.ret_code, rsp.cust_val, rsp.err_msg, rsp.pred_rsp.value))
        else:
            LOG("query failed ret: {} err: {}".format(rsp.ret_code, rsp.err_msg.encode('utf-8')))
        return rsp

    #
    # Query the network latency
    # Parameters: pred_type: recognition type
    # Returns:
    #   rsp.ret_code: 0 on success
    #   rsp.err_msg: error details on failure
    #
    def QueryTTS(self, pred_type):
        tm = str(int(time.time()))
        sign = CalcSign(self.pd_id, self.pd_key, tm)
        param = {
            "user_id": self.pd_id,
            "timestamp": tm,
            "sign": sign,
            "predict_type": pred_type,
        }
        if self.app_id != "":
            #
            asign = CalcSign(self.app_id, self.app_key, tm)
            param["appid"] = self.app_id
            param["asign"] = asign
        url = self.host + "/api/qcrtt"
        rsp = HttpRequest(url, param)
        if rsp.ret_code == 0:
            LOG("query rtt succ ret: {} request_id: {} err: {}".format(rsp.ret_code, rsp.request_id, rsp.err_msg))
        else:
            LOG("predict failed ret: {} err: {}".format(rsp.ret_code, rsp.err_msg.encode('utf-8')))
        return rsp

    #
    # Recognize a CAPTCHA
    # Parameters: pred_type: recognition type  img_data: image bytes
    # Returns:
    #   rsp.ret_code: 0 on success
    #   rsp.request_id: order number
    #   rsp.pred_rsp.value: recognition result
    #   rsp.err_msg: error details on failure
    #
    def Predict(self, pred_type, img_data, head_info=""):
        tm = str(int(time.time()))
        sign = CalcSign(self.pd_id, self.pd_key, tm)
        param = {
            "user_id": self.pd_id,
            "timestamp": tm,
            "sign": sign,
            "predict_type": pred_type,
            "up_type": "mt"
        }
        # Only send head_info when one was actually supplied
        # (the original "or" made this condition always true)
        if head_info is not None and head_info != "":
            param["head_info"] = head_info
        if self.app_id != "":
            #
            asign = CalcSign(self.app_id, self.app_key, tm)
            param["appid"] = self.app_id
            param["asign"] = asign
        url = self.host + "/api/capreg"
        files = img_data
        rsp = HttpRequest(url, param, files)
        if rsp.ret_code == 0:
            LOG("predict succ ret: {} request_id: {} pred: {} err: {}".format(rsp.ret_code, rsp.request_id, rsp.pred_rsp.value, rsp.err_msg))
        else:
            LOG("predict failed ret: {} err: {}".format(rsp.ret_code, rsp.err_msg))
            if rsp.ret_code == 4003:
                # lack of money
                LOG("cust_val <= 0 lack of money, please charge immediately")
        return rsp

    #
    # Recognize a CAPTCHA from a file
    # Parameters: pred_type: recognition type  file_name: file name
    # Returns:
    #   rsp.ret_code: 0 on success
    #   rsp.request_id: order number
    #   rsp.pred_rsp.value: recognition result
    #   rsp.err_msg: error details on failure
    #
    def PredictFromFile(self, pred_type, file_name, head_info=""):
        with open(file_name, "rb") as f:
            data = f.read()
        return self.Predict(pred_type, data, head_info=head_info)

    #
    # Request a refund after a failed recognition
    # Parameters: request_id: order number to refund
    # Returns:
    #   rsp.ret_code: 0 on success
    #   rsp.err_msg: error details on failure
    #
    # Note:
    #   Predict only charges when ret_code == 0, so a refund is only needed in that case
    # Note 2:
    #   Refunds apply only when a result was recognized normally but failed the site's check;
    #   do not abuse this, or the account may be banned
    #
    def Justice(self, request_id):
        if request_id == "":
            #
            return
        tm = str(int(time.time()))
        sign = CalcSign(self.pd_id, self.pd_key, tm)
        param = {
            "user_id": self.pd_id,
            "timestamp": tm,
            "sign": sign,
            "request_id": request_id
        }
        url = self.host + "/api/capjust"
        rsp = HttpRequest(url, param)
        if rsp.ret_code == 0:
            LOG("justice succ ret: {} request_id: {} pred: {} err: {}".format(rsp.ret_code, rsp.request_id, rsp.pred_rsp.value, rsp.err_msg))
        else:
            LOG("justice failed ret: {} err: {}".format(rsp.ret_code, rsp.err_msg.encode('utf-8')))
        return rsp

    #
    # Top-up interface
    # Parameters: cardid: top-up card number  cardkey: top-up card signature string
    # Returns:
    #   rsp.ret_code: 0 on success
    #   rsp.err_msg: error details on failure
    #
    def Charge(self, cardid, cardkey):
        tm = str(int(time.time()))
        sign = CalcSign(self.pd_id, self.pd_key, tm)
        csign = CalcCardSign(cardid, cardkey, tm, self.pd_key)
        param = {
            "user_id": self.pd_id,
            "timestamp": tm,
            "sign": sign,
            'cardid': cardid,
            'csign': csign
        }
        url = self.host + "/api/charge"
        rsp = HttpRequest(url, param)
        if rsp.ret_code == 0:
            LOG("charge succ ret: {} request_id: {} pred: {} err: {}".format(rsp.ret_code, rsp.request_id, rsp.pred_rsp.value, rsp.err_msg))
        else:
            LOG("charge failed ret: {} err: {}".format(rsp.ret_code, rsp.err_msg.encode('utf-8')))
        return rsp

    ##
    # Top up; returns only the status code
    # Parameters: cardid: top-up card number  cardkey: top-up card signature string
    # Returns: 0 when the top-up succeeds
    ##
    def ExtendCharge(self, cardid, cardkey):
        return self.Charge(cardid, cardkey).ret_code

    ##
    # Request a refund; returns only the status code
    # Parameters: request_id: order number to refund
    # Returns: 0 when the refund succeeds
    #
    # Note:
    #   Predict only charges when ret_code == 0, so a refund is only needed in that case
    # Note 2:
    #   Refunds apply only when a result was recognized normally but failed the site's check;
    #   do not abuse this, or the account may be banned
    ##
    def JusticeExtend(self, request_id):
        return self.Justice(request_id).ret_code

    ##
    # Query the balance; returns only the balance
    # Parameters: none
    # Returns: rsp.cust_val: the balance
    ##
    def QueryBalcExtend(self):
        rsp = self.QueryBalc()
        return rsp.cust_val

    ##
    # Recognize a CAPTCHA from a file; returns only the result
    # Parameters: pred_type: recognition type  file_name: file name
    # Returns: rsp.pred_rsp.value: the recognition result
    ##
    def PredictFromFileExtend(self, pred_type, file_name, head_info=""):
        rsp = self.PredictFromFile(pred_type, file_name, head_info)
        return rsp.pred_rsp.value

    ##
    # Recognition interface; returns only the result
    # Parameters: pred_type: recognition type  img_data: image bytes
    # Returns: rsp.pred_rsp.value: the recognition result
    ##
    def PredictExtend(self, pred_type, img_data, head_info=""):
        rsp = self.Predict(pred_type, img_data, head_info)
        return rsp.pred_rsp.value


def TestFunc():
    pd_id = "128292"  # PD info, shown on the user center page
    pd_key = "bASHdc/12ISJOX7pV3qhPr2ntQ6QcEkV"
    app_id = "100001"  # developer revenue-share account, shown in the developer center
    app_key = "123456"
    # Recognition type.
    # See the pricing page on the official site for the available types; ask customer service if unsure
    pred_type = "30400"
    api = FateadmApi(app_id, app_key, pd_id, pd_key)
    # Query the balance
    balance = api.QueryBalcExtend()  # returns the balance directly
    # api.QueryBalc()
    # Recognize from a file:
    file_name = 'img.gif'
    # Multi-site types need an extra src_url parameter; see the API docs:
    # http://docs.fateadm.com/web/#/1?page_id=6
    # result = api.PredictFromFileExtend(pred_type,file_name)  # returns the recognized text directly
    rsp = api.PredictFromFile(pred_type, file_name)  # returns the detailed response
    '''
    # If you are not recognizing from a file, call the Predict interface instead:
    # result = api.PredictExtend(pred_type,data)  # returns the recognized text directly
    rsp = api.Predict(pred_type,data)  # returns the detailed response
    '''
    just_flag = False
    if just_flag:
        if rsp.ret_code == 0:
            # If the recognized result does not match what was expected, this call refunds the order.
            # Refunds apply only when a result was recognized normally but failed the site's check;
            # do not abuse this, or the account may be banned
            api.Justice(rsp.request_id)
    # card_id = "123"
    # card_key = "123"
    # Top up
    # api.Charge(card_id, card_key)
    LOG("print in testfunc")


if __name__ == "__main__":
    TestFunc()
```
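The request signature computed by `CalcSign` in fateadm_api.py is two rounds of MD5 over the account ID, the timestamp, and the key. The sketch below restates that calculation on its own; the helper names `md5_hex` and `calc_sign` and the demo credentials are placeholders introduced for illustration.

```python
import hashlib
import time


def md5_hex(text: str) -> str:
    """Return the hexadecimal MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def calc_sign(account_id: str, secret: str, timestamp: str) -> str:
    """md5(account_id + timestamp + md5(timestamp + secret)), as in CalcSign."""
    inner = md5_hex(timestamp + secret)
    return md5_hex(account_id + timestamp + inner)


timestamp = str(int(time.time()))  # seconds since the epoch, as a string
sign = calc_sign("demo-id", "demo-secret", timestamp)
print(timestamp, sign)
```

Because the timestamp is part of the hash, every request needs a freshly computed signature, and the same `timestamp` value has to be sent alongside it.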
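Chapter 5: Advanced requests module usage (simulated login)
1. Proxies
```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
url = 'https://www.sogou.com/sie?query=ip'
# The key is the scheme of the target URL; the value is the proxy address.
# Recent versions of requests expect the proxy address to include a scheme.
proxies = {"https": "http://183.166.103.86:9999"}
page_text = requests.get(url, headers=headers, proxies=proxies).text
with open('ip.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)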
```
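Free proxies tend to stop working quickly, so a crawler usually rotates through several. A minimal sketch, assuming a hand-maintained list of `host:port` strings; the helper `make_proxies` and the addresses are hypothetical, and nothing is sent over the network here.

```python
from itertools import cycle


def make_proxies(address: str) -> dict:
    """Build a requests-style proxies mapping for one host:port address."""
    proxy_url = "http://" + address
    # One entry per target-URL scheme; both go through the same proxy
    return {"http": proxy_url, "https": proxy_url}


# Placeholder addresses; replace with proxies you are allowed to use
proxy_pool = cycle(["10.0.0.1:8080", "10.0.0.2:8080"])

first = make_proxies(next(proxy_pool))
second = make_proxies(next(proxy_pool))
third = make_proxies(next(proxy_pool))  # the pool wraps around
print(first, second, third)
```

Each mapping can be passed as `proxies=` to `requests.get` or `requests.post`, picking the next one whenever a request fails.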
2. Simulated login to Renren
```python
import requests
from lxml import etree
from fateadm_api import FateadmApi


def TestFunc(imgPath, codyType):
    pd_id = "xxxxx"  # PD info, shown on the user center page
    pd_key = "xxxxxxxxxxxxxxxxxx"
    app_id = "xxxxxxxx"  # developer revenue-share account, shown in the developer center
    app_key = "xxxxxx"
    # Recognition type.
    # See the pricing page on the official site for the available types; ask customer service if unsure
    pred_type = codyType
    api = FateadmApi(app_id, app_key, pd_id, pd_key)
    # Query the balance
    balance = api.QueryBalcExtend()  # returns the balance directly
    # api.QueryBalc()
    # Recognize from a file:
    file_name = imgPath
    # Multi-site types need an extra src_url parameter; see the API docs:
    # http://docs.fateadm.com/web/#/1?page_id=6
    result = api.PredictFromFileExtend(pred_type, file_name)  # returns the recognized text directly
    # rsp = api.PredictFromFile(pred_type, file_name)  # returns the detailed response
    '''
    # If you are not recognizing from a file, call the Predict interface instead:
    # result = api.PredictExtend(pred_type,data)  # returns the recognized text directly
    rsp = api.Predict(pred_type,data)  # returns the detailed response
    '''
    # just_flag = False
    # if just_flag:
    #     if rsp.ret_code == 0:
    #         # If the recognized result does not match what was expected, this call refunds the order.
    #         # Refunds apply only when a result was recognized normally but failed the site's check;
    #         # do not abuse this, or the account may be banned
    #         api.Justice(rsp.request_id)
    # card_id = "123"
    # card_key = "123"
    # Top up
    # api.Charge(card_id, card_key)
    # LOG("print in testfunc")
    return result


# if __name__ == "__main__":
#     TestFunc()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
url = 'http://www.renren.com/'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
# Download the CAPTCHA image shown on the login page
code_img_src = tree.xpath('//*[@id="verifyPic_login"]/@src')[0]
code_img_data = requests.get(code_img_src, headers=headers).content
with open('./code.jpg', 'wb') as fp:
    fp.write(code_img_data)
result = TestFunc('code.jpg', 30600)
print('Recognition result: ' + result)
# Submit the login form together with the recognized CAPTCHA text
login_url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=2021121720536'
data = {
    'email': 'xxxxxxxx',
    'icode': result,
    'origURL': 'http://www.renren.com/home',
    'domain': 'renren.com',
    'key_id': '1',
    'captcha_type': 'web_login',
    'password': '47e27dd5ef32b31041ebf56ec85a9b1e4233875e36396241c88245b188c56cdb',
    'rkey': 'c655ef0c57a72755f1240d6c0efac67d',
    'f': ''
}
response = requests.post(login_url, headers=headers, data=data)
print(response.status_code)
with open('renren.html', 'w', encoding='utf-8') as fp:
    fp.write(response.text)
```
fateadm_api.py
```python
# coding=utf-8
import os,sys
import hashlib
import time
import json
import requests
FATEA_PRED_URL = "http://pred.fateadm.com"
def LOG(log):
# 不需要測試時,注釋掉日志就可以了
print(log)
log = None
class TmpObj():
def __init__(self):
self.value = None
class Rsp():
def __init__(self):
self.ret_code = -1
self.cust_val = 0.0
self.err_msg = "succ"
self.pred_rsp = TmpObj()
def ParseJsonRsp(self, rsp_data):
if rsp_data is None:
self.err_msg = "http request failed, get rsp Nil data"
return
jrsp = json.loads( rsp_data)
self.ret_code = int(jrsp["RetCode"])
self.err_msg = jrsp["ErrMsg"]
self.request_id = jrsp["RequestId"]
if self.ret_code == 0:
rslt_data = jrsp["RspData"]
if rslt_data is not None and rslt_data != "":
jrsp_ext = json.loads( rslt_data)
if "cust_val" in jrsp_ext:
data = jrsp_ext["cust_val"]
self.cust_val = float(data)
if "result" in jrsp_ext:
data = jrsp_ext["result"]
self.pred_rsp.value = data
def CalcSign(pd_id, passwd, timestamp):
md5 = hashlib.md5()
md5.update((timestamp + passwd).encode())
csign = md5.hexdigest()
md5 = hashlib.md5()
md5.update((pd_id + timestamp + csign).encode())
csign = md5.hexdigest()
return csign
def CalcCardSign(cardid, cardkey, timestamp, passwd):
md5 = hashlib.md5()
md5.update(passwd + timestamp + cardid + cardkey)
return md5.hexdigest()
def HttpRequest(url, body_data, img_data=""):
rsp = Rsp()
post_data = body_data
files = {
'img_data':('img_data',img_data)
}
header = {
'User-Agent': 'Mozilla/5.0',
}
rsp_data = requests.post(url, post_data,files=files ,headers=header)
rsp.ParseJsonRsp( rsp_data.text)
return rsp
class FateadmApi():
# API接口調(diào)用類
# 參數(shù)(appID,appKey,pdID,pdKey)
def __init__(self, app_id, app_key, pd_id, pd_key):
self.app_id = app_id
if app_id is None:
self.app_id = ""
self.app_key = app_key
self.pd_id = pd_id
self.pd_key = pd_key
self.host = FATEA_PRED_URL
def SetHost(self, url):
self.host = url
#
# 查詢余額
# 參數(shù):無
# 返回值:
# rsp.ret_code:正常返回0
# rsp.cust_val:用戶余額
# rsp.err_msg:異常時返回異常詳情
#
def QueryBalc(self):
tm = str( int(time.time()))
sign = CalcSign( self.pd_id, self.pd_key, tm)
param = {
"user_id": self.pd_id,
"timestamp":tm,
"sign":sign
}
url = self.host + "/api/custval"
rsp = HttpRequest(url, param)
if rsp.ret_code == 0:
LOG("query succ ret: {} cust_val: {} rsp: {} pred: {}".format( rsp.ret_code, rsp.cust_val, rsp.err_msg, rsp.pred_rsp.value))
else:
LOG("query failed ret: {} err: {}".format( rsp.ret_code, rsp.err_msg.encode('utf-8')))
return rsp
#
# 查詢網(wǎng)絡(luò)延遲
# 參數(shù):pred_type:識別類型
# 返回值:
# rsp.ret_code:正常返回0
# rsp.err_msg: 異常時返回異常詳情
#
def QueryTTS(self, pred_type):
tm = str( int(time.time()))
sign = CalcSign( self.pd_id, self.pd_key, tm)
param = {
"user_id": self.pd_id,
"timestamp":tm,
"sign":sign,
"predict_type":pred_type,
}
if self.app_id != "":
#
asign = CalcSign(self.app_id, self.app_key, tm)
param["appid"] = self.app_id
param["asign"] = asign
url = self.host + "/api/qcrtt"
rsp = HttpRequest(url, param)
if rsp.ret_code == 0:
LOG("query rtt succ ret: {} request_id: {} err: {}".format( rsp.ret_code, rsp.request_id, rsp.err_msg))
else:
LOG("predict failed ret: {} err: {}".format( rsp.ret_code, rsp.err_msg.encode('utf-8')))
return rsp
#
# 識別驗證碼
# 參數(shù):pred_type:識別類型 img_data:圖片的數(shù)據(jù)
# 返回值:
# rsp.ret_code:正常返回0
# rsp.request_id:*訂單號
# rsp.pred_rsp.value:識別結(jié)果
# rsp.err_msg:異常時返回異常詳情
#
def Predict(self, pred_type, img_data, head_info = ""):
tm = str( int(time.time()))
sign = CalcSign( self.pd_id, self.pd_key, tm)
param = {
"user_id": self.pd_id,
"timestamp": tm,
"sign": sign,
"predict_type": pred_type,
"up_type": "mt"
}
if head_info is not None or head_info != "":
param["head_info"] = head_info
if self.app_id != "":
#
asign = CalcSign(self.app_id, self.app_key, tm)
param["appid"] = self.app_id
param["asign"] = asign
url = self.host + "/api/capreg"
files = img_data
rsp = HttpRequest(url, param, files)
if rsp.ret_code == 0:
LOG("predict succ ret: {} request_id: {} pred: {} err: {}".format( rsp.ret_code, rsp.request_id, rsp.pred_rsp.value, rsp.err_msg))
else:
LOG("predict failed ret: {} err: {}".format( rsp.ret_code, rsp.err_msg))
if rsp.ret_code == 4003:
#lack of money
LOG("cust_val <= 0 lack of money, please charge immediately")
return rsp
#
# 從文件進行驗證碼識別
# 參數(shù):pred_type;識別類型 file_name:文件名
# 返回值:
# rsp.ret_code:正常返回0
# rsp.request_id:*訂單號
# rsp.pred_rsp.value:識別結(jié)果
# rsp.err_msg:異常時返回異常詳情
#
def PredictFromFile( self, pred_type, file_name, head_info = ""):
with open(file_name, "rb") as f:
data = f.read()
return self.Predict(pred_type,data,head_info=head_info)
#
# 識別失敗,進行退款請求
# 參數(shù):request_id:需要退款的訂單號
# 返回值:
# rsp.ret_code:正常返回0
# rsp.err_msg:異常時返回異常詳情
#
# 注意:
# Predict識別接口,僅在ret_code == 0時才會進行扣款,才需要進行退款請求,否則無需進行退款操作
# 注意2:
# 退款僅在正常識別出結(jié)果后,無法通過網(wǎng)站驗證的情況,請勿非法或者濫用,否則可能進行封號處理
#
def Justice(self, request_id):
if request_id == "":
#
return
tm = str( int(time.time()))
sign = CalcSign( self.pd_id, self.pd_key, tm)
param = {
"user_id": self.pd_id,
"timestamp":tm,
"sign":sign,
"request_id":request_id
}
url = self.host + "/api/capjust"
rsp = HttpRequest(url, param)
if rsp.ret_code == 0:
LOG("justice succ ret: {} request_id: {} pred: {} err: {}".format( rsp.ret_code, rsp.request_id, rsp.pred_rsp.value, rsp.err_msg))
else:
LOG("justice failed ret: {} err: {}".format( rsp.ret_code, rsp.err_msg.encode('utf-8')))
return rsp
    #
    # Top-up interface
    # Parameters: cardid: top-up card number; cardkey: top-up card signature string
    # Returns:
    #   rsp.ret_code: 0 on success
    #   rsp.err_msg: error details on failure
    #
    def Charge(self, cardid, cardkey):
        tm = str(int(time.time()))
        sign = CalcSign(self.pd_id, self.pd_key, tm)
        csign = CalcCardSign(cardid, cardkey, tm, self.pd_key)
        param = {
            "user_id": self.pd_id,
            "timestamp": tm,
            "sign": sign,
            'cardid': cardid,
            'csign': csign
        }
        url = self.host + "/api/charge"
        rsp = HttpRequest(url, param)
        if rsp.ret_code == 0:
            LOG("charge succ ret: {} request_id: {} pred: {} err: {}".format(rsp.ret_code, rsp.request_id, rsp.pred_rsp.value, rsp.err_msg))
        else:
            LOG("charge failed ret: {} err: {}".format(rsp.ret_code, rsp.err_msg.encode('utf-8')))
        return rsp

    ##
    # Top up; returns only whether it succeeded
    # Parameters: cardid: top-up card number; cardkey: top-up card signature string
    # Returns: 0 when the top-up succeeds
    ##
    def ExtendCharge(self, cardid, cardkey):
        return self.Charge(cardid, cardkey).ret_code

    ##
    # Request a refund; returns only whether it succeeded
    # Parameters: request_id: the order number to refund
    # Returns: 0 when the refund succeeds
    #
    # Note:
    #   The Predict interface only charges when ret_code == 0, so a refund
    #   request is only needed in that case.
    # Note 2:
    #   Refunds are only for results that were recognized normally but failed
    #   the website's verification. Do not abuse this, or the account may be banned.
    ##
    def JusticeExtend(self, request_id):
        return self.Justice(request_id).ret_code

    ##
    # Query the balance; returns only the balance
    # Parameters: none
    # Returns: rsp.cust_val: the balance
    ##
    def QueryBalcExtend(self):
        rsp = self.QueryBalc()
        return rsp.cust_val

    ##
    # Recognize a captcha from a file; returns only the recognized text
    # Parameters: pred_type: recognition type; file_name: file name
    # Returns: rsp.pred_rsp.value: the recognized text
    ##
    def PredictFromFileExtend(self, pred_type, file_name, head_info=""):
        rsp = self.PredictFromFile(pred_type, file_name, head_info)
        return rsp.pred_rsp.value

    ##
    # Recognition interface; returns only the recognized text
    # Parameters: pred_type: recognition type; img_data: image bytes
    # Returns: rsp.pred_rsp.value: the recognized text
    ##
    def PredictExtend(self, pred_type, img_data, head_info=""):
        rsp = self.Predict(pred_type, img_data, head_info)
        return rsp.pred_rsp.value


def TestFunc():
    pd_id = "128292"  # PD credentials are shown on the user-center page
    pd_key = "bASHdc/12ISJOX7pV3qhPr2ntQ6QcEkV"
    app_id = "100001"  # developer revenue-share account, shown in the developer center
    app_key = "123456"
    # Recognition type; see the price page on the official site for the
    # available types, or ask customer service if unsure
    pred_type = "30400"
    api = FateadmApi(app_id, app_key, pd_id, pd_key)
    # Query the balance
    balance = api.QueryBalcExtend()  # returns the balance directly
    # api.QueryBalc()
    # Recognize from a file:
    file_name = 'img.gif'
    # For multi-site types, add the src_url parameter; see the API docs:
    # http://docs.fateadm.com/web/#/1?page_id=6
    # result = api.PredictFromFileExtend(pred_type, file_name)  # returns only the result
    rsp = api.PredictFromFile(pred_type, file_name)  # returns the detailed response
    '''
    # If not recognizing from a file, call the Predict interface instead:
    # result = api.PredictExtend(pred_type, data)  # returns only the result
    rsp = api.Predict(pred_type, data)  # returns the detailed response
    '''
    just_flag = False
    if just_flag:
        if rsp.ret_code == 0:
            # If the result does not match what was expected, this call refunds the order.
            # Refunds are only for results that were recognized normally but failed
            # the website's verification; abuse may get the account banned.
            api.Justice(rsp.request_id)
    # card_id = "123"
    # card_key = "123"
    # Top up
    # api.Charge(card_id, card_key)
    LOG("print in testfunc")


if __name__ == "__main__":
    TestFunc()
```
3. Scraping the current user's profile page on Renren
```python
import requests
from lxml import etree
from fateadm_api import FateadmApi


def TestFunc(imgPath, code_type):
    """Send the captcha image at imgPath to Fateadm and return the recognized text."""
    pd_id = "xxxxxxx"  # PD credentials are shown on the user-center page
    pd_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    app_id = "xxxxxxxx"  # developer revenue-share account, shown in the developer center
    app_key = "xxxxxxxxx"
    # Recognition type; see the price page on the official site for the options
    pred_type = code_type
    api = FateadmApi(app_id, app_key, pd_id, pd_key)
    # Query the balance
    balance = api.QueryBalcExtend()  # returns the balance directly
    # Recognize from a file
    file_name = imgPath
    # For multi-site types, add the src_url parameter; see the API docs:
    # http://docs.fateadm.com/web/#/1?page_id=6
    result = api.PredictFromFileExtend(pred_type, file_name)  # returns only the result
    # rsp = api.PredictFromFile(pred_type, file_name)  # returns the detailed response
    # If the recognized text fails the site's check, the order can be refunded:
    # if rsp.ret_code == 0:
    #     api.Justice(rsp.request_id)
    return result


# One session for every request, so that cookies set while fetching the
# captcha are sent along with the login POST
session = requests.Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
url = 'http://www.renren.com/'
page_text = session.get(url, headers=headers).text
tree = etree.HTML(page_text)
# Download the captcha image shown on the login page
code_img_src = tree.xpath('//*[@id="verifyPic_login"]/@src')[0]
code_img_data = session.get(code_img_src, headers=headers).content
with open('./code.jpg', 'wb') as fp:
    fp.write(code_img_data)
# Have the captcha recognized
result = TestFunc('code.jpg', 30600)
print('Recognized captcha: ' + result)
# Log in; the form fields were copied from the browser's login request
login_url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=2021121720536'
data = {
    'email': '15893301681',
    'icode': result,
    'origURL': 'http://www.renren.com/home',
    'domain': 'renren.com',
    'key_id': '1',
    'captcha_type': 'web_login',
    'password': '47e27dd5ef32b31041ebf56ec85a9b1e4233875e36396241c88245b188c56cdb',
    'rkey': 'c655ef0c57a72755f1240d6c0efac67d',
    'f': '',
}
response = session.post(login_url, headers=headers, data=data)
print(response.status_code)
with open('renren.html', 'w', encoding='utf-8') as fp:
    fp.write(response.text)
# The session now carries the login cookies, so no manual Cookie header is needed
detail_url = 'http://www.renren.com/975996803/profile'
detail_page_text = session.get(detail_url, headers=headers).text
with open('bobo.html', 'w', encoding='utf-8') as fp:
    fp.write(detail_page_text)
```
Chapter 6: High-performance asynchronous crawlers (thread pools, coroutines)
1. Multi-task asynchronous crawler with aiohttp
```python
import asyncio
import time

import aiohttp

urls = [
    'http://127.0.0.1:5000/bobo', 'http://127.0.0.1:5000/jay', 'http://127.0.0.1:5000/tom'
]


async def get_page(url):
    # requests.get() is blocking, so an asynchronous HTTP client is used instead
    async with aiohttp.ClientSession() as session:
        # session.get() already returns an async context manager; no extra await is needed
        async with session.get(url) as response:
            # Reading the body is itself asynchronous and must be awaited
            page_text = await response.text()
            print(page_text)


async def main():
    # Start all downloads concurrently and wait until every one has finished
    await asyncio.gather(*(get_page(url) for url in urls))


start = time.time()
asyncio.run(main())
end = time.time()
print('Total time:', end - start)
```
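When the URL list grows beyond a handful of entries, it is usually worth capping how many requests are in flight at once. The following is a minimal sketch of that idea using `asyncio.Semaphore`; `limited_fetch`, `crawl` and the `state` dictionary are hypothetical names, and `asyncio.sleep` stands in for the real `session.get(url)` call so the example runs without a server.

```python
import asyncio


async def limited_fetch(url, semaphore, state):
    # At most `limit` coroutines can be inside this block at the same time
    async with semaphore:
        state['active'] += 1
        state['peak'] = max(state['peak'], state['active'])
        await asyncio.sleep(0.05)  # stand-in for the real HTTP request
        state['active'] -= 1
    return url


async def crawl(urls, limit=2):
    semaphore = asyncio.Semaphore(limit)
    state = {'active': 0, 'peak': 0}  # tracks how many fetches overlap
    results = await asyncio.gather(
        *(limited_fetch(u, semaphore, state) for u in urls)
    )
    return results, state['peak']


if __name__ == '__main__':
    print(asyncio.run(crawl(['a', 'b', 'c', 'd', 'e'])))
```

In a real crawler the semaphore would wrap the `async with session.get(url)` block, with everything else unchanged.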
2. Flask test server
```python
from flask import Flask
import time

app = Flask(__name__)


# Each route sleeps for two seconds to simulate a slow page
@app.route('/bobo')
def index_bobo():
    time.sleep(2)
    return 'Hello bobo'


@app.route('/jay')
def index_jay():
    time.sleep(2)
    return 'Hello jay'


@app.route('/tom')
def index_tom():
    time.sleep(2)
    return 'Hello tom'


if __name__ == '__main__':
    # threaded=True lets the server handle the three requests in parallel
    app.run(threaded=True)
```
3. Multi-task coroutines
```python
import asyncio
import time


async def request(url):
    print('Downloading', url)
    # time.sleep(2) would block the whole event loop; asyncio.sleep yields to it
    await asyncio.sleep(2)
    print('Finished', url)


urls = ['www.baidu.com',
        'www.sogou.com',
        'www.goubanjia.com'
        ]


async def main():
    # Wrap each coroutine in a task so that all three run concurrently
    tasks = []
    for url in urls:
        c = request(url)
        task = asyncio.ensure_future(c)
        tasks.append(task)
    await asyncio.wait(tasks)


start = time.time()
asyncio.run(main())
print(time.time() - start)
```
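The three simulated downloads above overlap, so the total is close to the longest single wait rather than the sum of all waits. A minimal sketch that measures this, using shorter sleeps so it finishes quickly: `fake_download`, `run_all` and `timed_run` are names made up for illustration, and `asyncio.gather` is used here instead of `asyncio.wait` because it returns the results in input order.

```python
import asyncio
import time


async def fake_download(url, delay=0.2):
    # asyncio.sleep hands control back to the event loop, like real network I/O would
    await asyncio.sleep(delay)
    return url


async def run_all(urls):
    # gather() runs the coroutines concurrently and keeps the input order
    return await asyncio.gather(*(fake_download(u) for u in urls))


def timed_run(urls):
    start = time.perf_counter()
    results = asyncio.run(run_all(urls))
    return results, time.perf_counter() - start


if __name__ == '__main__':
    results, elapsed = timed_run(['a', 'b', 'c'])
    print(results, round(elapsed, 2))
```

Three sequential 0.2-second waits would need at least 0.6 seconds, so an elapsed time well below that shows the waits overlapped.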
4. Multi-task asynchronous crawler (with blocking requests, for comparison)
```python
import requests
import asyncio
import time
# import aiohttp

urls = [
    'http://127.0.0.1:5000/bobo', 'http://127.0.0.1:5000/jay', 'http://127.0.0.1:5000/tom'
]
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}


async def get_page(url):
    print('Downloading', url)
    # requests.get() is synchronous: it blocks the event loop, so the three
    # tasks run one after another and nothing is gained from async here
    response = requests.get(url, headers=headers)
    print('Finished', response.text)


async def main():
    tasks = []
    for url in urls:
        c = get_page(url)
        task = asyncio.ensure_future(c)
        tasks.append(task)
    await asyncio.wait(tasks)


start = time.time()
asyncio.run(main())
end = time.time()
print('Total time:', end - start)
```
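The version above gains nothing from `async`, because `requests.get` blocks the event loop while it waits. If switching to aiohttp is not an option, one common workaround is to hand each blocking call to a worker thread. The sketch below assumes that approach; `blocking_fetch` is a hypothetical stand-in for `requests.get` (it only sleeps), while `loop.run_in_executor` is part of the standard asyncio API.

```python
import asyncio
import time


def blocking_fetch(url, delay=0.2):
    # Stand-in for a blocking call such as requests.get(url)
    time.sleep(delay)
    return 'body of ' + url


async def fetch_all(urls):
    loop = asyncio.get_running_loop()
    # run_in_executor(None, ...) submits each call to the default thread pool,
    # so the event loop stays free while the worker threads block
    futures = [loop.run_in_executor(None, blocking_fetch, u) for u in urls]
    return await asyncio.gather(*futures)


if __name__ == '__main__':
    print(asyncio.run(fetch_all(['u1', 'u2', 'u3'])))
```

This keeps the familiar `requests` code but pays for one thread per in-flight request, which is why aiohttp remains the better fit for large numbers of URLs.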
5. Example
```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
url = 'https://www.pearvideo.com/videoStatus.jsp?contId=1719770&mrd=0.559512982919081'
response = requests.get(url, headers=headers)
print(response.text)
# Video address noted for this clip:
# https://video.pearvideo.com/mp4/short/20210209/1613307944808-15603370-hd.mp4
```
6. Synchronous crawler
```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
urls = [
    'https://www.cnblogs.com/shaozheng/p/12795953.html',
    'https://www.cnblogs.com/hanfe1/p/12661505.html',
    'https://www.cnblogs.com/tiger666/articles/11070427.html']


def get_content(url):
    print('Crawling:', url)
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.content
    return None


def parse_content(content):
    print('Length of the response data:', len(content))


# Pages are fetched one after another; each request waits for the previous one
for url in urls:
    content = get_content(url)
    if content is not None:  # skip pages that did not return 200
        parse_content(content)
```
7. Basic thread pool usage
```python
# Serial version for comparison: four names at 2 seconds each
# import time
# def get_page(name):
#     print('Downloading:', name)
#     time.sleep(2)
#     print('Downloaded:', name)
# name_list = ['xiaozi', 'aa', 'bb', 'cc']
# start_time = time.time()
# for i in range(len(name_list)):
#     get_page(name_list[i])
# end_time = time.time()
# print('%d second' % (end_time - start_time))

import time
from multiprocessing.dummy import Pool  # a thread pool, despite the module name

start_time = time.time()


def get_page(name):
    print('Downloading:', name)
    time.sleep(2)
    print('Downloaded:', name)


name_list = ['xiaozi', 'aa', 'bb', 'cc']
pool = Pool(4)  # four worker threads
# map() calls get_page once per name, spreading the calls across the pool
pool.map(get_page, name_list)
end_time = time.time()
print(end_time - start_time)
```
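`multiprocessing.dummy.Pool` is a pool of threads, not processes. The standard library's `concurrent.futures.ThreadPoolExecutor` supports the same `map` pattern, and the sketch below shows the equivalent code under that swap. `download` and `download_all` are hypothetical names, and the sleep is shortened so the example finishes quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def download(name, delay=0.2):
    time.sleep(delay)  # simulated blocking download
    return name.upper()


def download_all(names, workers=4):
    # Leaving the with-block waits for every worker to finish
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the inputs
        return list(pool.map(download, names))


if __name__ == '__main__':
    start = time.perf_counter()
    print(download_all(['xiaozi', 'aa', 'bb', 'cc']))
    print(round(time.perf_counter() - start, 2))
```

With four workers and four names, the run takes roughly one download's worth of time instead of four.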
8. Using a thread pool in a crawler
```python
import requests
from lxml import etree
import re
from multiprocessing.dummy import Pool

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
url = 'https://www.pearvideo.com/'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="vervideo-tlist-bd recommend-btbg clearfix"]/ul/li')
# li_list = tree.xpath('//ul[@class="vervideo-tlist-small"]/li')
urls = []  # will hold one {'name': ..., 'url': ...} dict per video
for li in li_list:
    detail_url = 'https://www.pearvideo.com/' + li.xpath('./div/a/@href')[0]
    # name = li.xpath('./div/a/div[2]/text()')[0] + '.mp4'
    name = li.xpath('./div/a/div[2]/div[2]/text()')[0] + '.mp4'
    # print(detail_url, name)
    detail_page_text = requests.get(detail_url, headers=headers).text
    # ex = 'srcUrl=https://imgs.edutt.com/skin/xxxx/image/nopic.gif,vdoUrl'
    # video_url = re.findall(ex, detail_page_text)[0]
    # video_url = tree.xpath('//img[@class="img"]/@src')[0]
    # https://video.pearvideo.com/mp4/short/20210209/{}-15603370-hd.mp4
    # The video address does not appear to be in the page source (XHR)
    print(detail_page_text)

'''
Disabled until video_url can be extracted. The first two statements
belong inside the loop above.

    dic = {
        'name': name,
        'url': video_url
    }
    urls.append(dic)


def get_video_data(dic):
    url = dic['url']
    print(dic['name'], 'downloading...')
    data = requests.get(url, headers=headers).content
    # Video data is binary, so the file is opened in 'wb' mode
    with open(dic['name'], 'wb') as fp:
        fp.write(data)
    print(dic['name'], 'downloaded!')


pool = Pool(4)
pool.map(get_video_data, urls)
pool.close()
pool.join()
'''
```
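The commented-out template `.../{}-15603370-hd.mp4` in the block above suggests that one segment of the video file name has to be substituted before the address can be used. Below is a minimal sketch of such a substitution. It rests on an assumption that this post does not confirm: that the leading numeric segment of the file name should be replaced by `cont-<contId>`. `real_video_url` is a hypothetical helper, not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def real_video_url(src_url, cont_id):
    # ASSUMPTION: the first dash-separated segment of the file name is a
    # placeholder that has to be replaced by 'cont-<contId>'
    parts = urlsplit(src_url)
    directory, _, filename = parts.path.rpartition('/')
    _, _, rest = filename.partition('-')  # drop the first segment
    new_path = '{}/cont-{}-{}'.format(directory, cont_id, rest)
    return urlunsplit((parts.scheme, parts.netloc, new_path,
                       parts.query, parts.fragment))


if __name__ == '__main__':
    src = 'https://video.pearvideo.com/mp4/short/20210209/1613307944808-15603370-hd.mp4'
    print(real_video_url(src, '1719770'))
```

If the assumption holds, the returned address could be used as `video_url` in the disabled download code; otherwise the address has to be checked by hand in the browser's network panel.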
9. Coroutines
```python
import asyncio


async def request(url):
    print('Requesting', url)
    print('Request succeeded,', url)
    return url


# Calling an async function returns a coroutine object; nothing runs yet
c = request('www.baidu.com')
loop = asyncio.new_event_loop()

# Option 1: run the coroutine directly on the loop
# loop.run_until_complete(c)

# Option 2: wrap it in a task created by the loop
# task = loop.create_task(c)
# print(task)  # pending
# loop.run_until_complete(task)
# print(task)  # finished

# Option 3: wrap it with ensure_future
# task = asyncio.ensure_future(c, loop=loop)
# print(task)
# loop.run_until_complete(task)
# print(task)


# Option 4: bind a callback that runs once the task has finished
def callback_func(task):
    # result() is the value returned by the coroutine
    print(task.result())


task = asyncio.ensure_future(c, loop=loop)
task.add_done_callback(callback_func)
loop.run_until_complete(task)
loop.close()
```
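The callback pattern above can also be written with `asyncio.run`, which creates and closes the event loop itself. The sketch below assumes Python 3.7 or later; `run_with_callback` and the `collected` list are hypothetical names used only to make the callback's effect visible.

```python
import asyncio


async def request(url):
    return 'response from ' + url


def run_with_callback(url):
    collected = []

    async def main():
        task = asyncio.ensure_future(request(url))
        # The callback receives the finished task; result() is the coroutine's return value
        task.add_done_callback(lambda t: collected.append(t.result()))
        await task
        # Yield once more so any callback still pending gets a chance to run
        await asyncio.sleep(0)

    asyncio.run(main())
    return collected


if __name__ == '__main__':
    print(run_with_callback('www.baidu.com'))
```

In a crawler, the callback is where parsing or saving of the downloaded page would typically go.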
Chapter 7: Handling dynamically loaded data (Selenium, simulated 12306 login)
![Image](https://img-blog.csdnimg.cn/20210502232901935.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQ1ODAzOTIz,size_16,color_FFFFFF,t_70)
1. Selenium basics
```python
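from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from lxml import etree
from time import sleep

# Selenium 4 takes the driver path through a Service object
bro = webdriver.Chrome(service=Service('chromedriver.exe'))
bro.get('http://scxk.nmpa.gov.cn:81/xk/')
# page_source is the page as the browser currently renders it,
# including data loaded dynamically
page_text = bro.page_source
tree = etree.HTML(page_text)
li_list = tree.xpath('//ul[@id="gzlist"]/li')
for li in li_list:
    name = li.xpath('./dl/@title')[0]
    print(name)
sleep(5)
bro.quit()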
```
2. Other automated actions with Selenium
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

bro = webdriver.Chrome()
bro.get('https://www.taobao.com/')
sleep(2)
# Locate the search box and type into it
# (Selenium 4 replaced find_element_by_xpath with find_element(By.XPATH, ...))
search_input = bro.find_element(By.XPATH, '//*[@id="q"]')
search_input.send_keys('Iphone')
sleep(2)
# Scroll to the bottom of the page by running JavaScript
# bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
# sleep(5)
# Click the search button
btn = bro.find_element(By.XPATH, '//*[@id="J_TSearchForm"]/div[1]/button')
print(type(btn))
btn.click()
bro.get('https://www.baidu.com')
sleep(2)
bro.back()     # browser "back"
sleep(2)
bro.forward()  # browser "forward"
sleep(5)
bro.quit()
```
3. 12306 login example code
```python
# Written 2021-02-18 (sophomore year; term starts March 7)
from selenium import webdriver
import time
from PIL import Image
from selenium.webdriver import ChromeOptions
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

# Headless / anti-detection variant:
# options = ChromeOptions()
# options.add_argument('--headless')
# options.add_argument('--disable-gpu')
# options.add_argument('--window-size=1920,1050')
# options.add_experimental_option('excludeSwitches', ['enable-automation'])
# bro = webdriver.Chrome(options=options)
bro = webdriver.Chrome()
bro.maximize_window()
time.sleep(5)
bro.get('https://kyfw.12306.cn/otn/resources/login.html')
time.sleep(3)
# Click the second tab of the login box
bro.find_element(By.XPATH, '/html/body/div[2]/div[2]/ul/li[2]/a').click()
# Screenshot the whole page, then crop the captcha out of it
bro.save_screenshot('aa.png')
time.sleep(2)
code_img_ele = bro.find_element(By.XPATH, '//*[@id="J-loginImg"]')
time.sleep(2)
location = code_img_ele.location  # top-left corner of the captcha image
print('location:', location)
size = code_img_ele.size  # width and height of the captcha image
print('size', size)
# (left, top, right, bottom) of the captcha inside the screenshot
rangle = (
    int(location['x']), int(location['y']),
    int(location['x'] + size['width']), int(location['y'] + size['height'])
)
print(rangle)
i = Image.open('./aa.png')
code_img_name = './code.png'
frame = i.crop(rangle)
frame.save(code_img_name)
# bro.quit()

# 2021-02-19: in a normal window the captcha coordinates come out misaligned,
# so the crop is wrong; with a headless browser the crop is correct.
'''
# Disabled: needs the Chaojiying_Client instance (chaojiying) from section 6
im = open(code_img_name, 'rb').read()
result = chaojiying.PostPic(im, 9004)['pic_str']
print(result)
all_list = []  # [[x1, y1], [x2, y2], ...] points to click
if '|' in result:
    # Several points, separated by '|'
    list_1 = result.split('|')
    count_1 = len(list_1)
    for i in range(count_1):
        xy_list = []
        x = int(list_1[i].split(',')[0])
        y = int(list_1[i].split(',')[1])
        xy_list.append(x)
        xy_list.append(y)
        all_list.append(xy_list)
else:
    # A single point
    xy_list = []
    x = int(result.split(',')[0])
    y = int(result.split(',')[1])
    xy_list.append(x)
    xy_list.append(y)
    all_list.append(xy_list)
print(all_list)
# Click each point, using offsets relative to the captcha image
for l in all_list:
    x = l[0]
    y = l[1]
    ActionChains(bro).move_to_element_with_offset(code_img_ele, x, y).click().perform()
    time.sleep(0.5)
bro.find_element(By.ID, 'J-userName').send_keys('')
time.sleep(2)
bro.find_element(By.ID, 'J-password').send_keys('')
time.sleep(2)
bro.find_element(By.ID, 'J-login').click()
bro.quit()
'''
```
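Two steps in the code above are pure data handling and can be tested without a browser: turning the recognition result into click points, and computing the crop box. The sketch below covers both. It assumes the result string has the form `x1,y1|x2,y2`, as the splitting code implies. The `scale` parameter is a guess at one possible cause of the misaligned crop the author mentions (display scaling making the screenshot larger than the coordinates Selenium reports); it is not something the post confirms. `parse_click_points` and `crop_box` are hypothetical helper names.

```python
def parse_click_points(pic_str):
    """Turn 'x1,y1|x2,y2' (or a single 'x,y') into [[x1, y1], [x2, y2]]."""
    points = []
    # split('|') also handles the single-point case: it yields one item
    for pair in pic_str.split('|'):
        x, y = pair.split(',')
        points.append([int(x), int(y)])
    return points


def crop_box(location, size, scale=1.0):
    """Return (left, top, right, bottom) for PIL's Image.crop.

    scale is an ASSUMED display-scaling factor, e.g. 1.25 when the
    operating system renders at 125%.
    """
    left = int(location['x'] * scale)
    top = int(location['y'] * scale)
    right = int((location['x'] + size['width']) * scale)
    bottom = int((location['y'] + size['height']) * scale)
    return (left, top, right, bottom)


if __name__ == '__main__':
    print(parse_click_points('108,133|173,78'))
    print(crop_box({'x': 100, 'y': 200}, {'width': 300, 'height': 150}, scale=1.25))
```

Because `split('|')` returns a one-element list when there is no separator, the separate `if`/`else` branches for one point versus several are not needed.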
4. Action chains and iframe handling
```python
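from selenium import webdriver
from time import sleep
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

bro = webdriver.Chrome()
bro.get('https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')
# Elements inside an iframe can only be located after switching into it.
# 'id' is a placeholder: pass the iframe's actual id attribute.
bro.switch_to.frame('id')
# The element id was left blank in the original; fill in the id of the element to drag
div = bro.find_element(By.ID, '')
action = ActionChains(bro)
# Press and hold the left mouse button on the element
action.click_and_hold(div)
for i in range(5):
    # Actions are only queued until perform() is called
    action.move_by_offset(17, 0).perform()
    sleep(0.3)
# Release the mouse button
action.release().perform()
print(div)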
```
5. Headless Chrome and avoiding detection
```python
from selenium import webdriver
from time import sleep
from selenium.webdriver import ChromeOptions

# One options object carries both the headless flags and the anti-detection switch
options = ChromeOptions()
# Run without a visible browser window
options.add_argument('--headless')
options.add_argument('--disable-gpu')
# Remove the switch that marks the browser as controlled by automation
options.add_experimental_option('excludeSwitches', ['enable-automation'])
bro = webdriver.Chrome(options=options)
bro.get('https://www.baidu.com')
print(bro.page_source)
sleep(2)
bro.quit()
```
6. Simulated 12306 login with Selenium
```python
# 2021-02-18
import requests
from hashlib import md5


class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        # The API expects the MD5 hex digest of the password
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: image ID of the captcha that was recognized incorrectly
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


# if __name__ == '__main__':
#     chaojiying = Chaojiying_Client('Chaojiying username', 'Chaojiying password', '96001')
#     im = open('a.jpg', 'rb').read()
#     print(chaojiying.PostPic(im, 1902))
# chaojiying = Chaojiying_Client('xxxxxxxxxx', 'xxxxxxxxxx', 'xxxxxxx')
# im = open('第七章:動態(tài)加載數(shù)據(jù)處理/12306.jpg', 'rb').read()
# print(chaojiying.PostPic(im, 9004)['pic_str'])
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

bro = webdriver.Chrome()
bro.get('https://kyfw.12306.cn/otn/resources/login.html')
time.sleep(3)
# Click the second tab of the login box
bro.find_element(By.XPATH, '/html/body/div[2]/div[2]/ul/li[2]/a').click()
```
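The client above never sends the plain-text password: `__init__` replaces it with its MD5 hex digest and stores that under `pass2`. A minimal sketch of just that step, with `build_base_params` as a hypothetical stand-alone function that mirrors the `base_params` dictionary:

```python
from hashlib import md5


def build_base_params(username, password, soft_id):
    # hexdigest() gives the 32-character lowercase hex form of the MD5 hash
    return {
        'user': username,
        'pass2': md5(password.encode('utf8')).hexdigest(),
        'softid': soft_id,
    }


if __name__ == '__main__':
    print(build_base_params('demo_user', 'hello', '96001'))
```

MD5 here is simply what this API's request format uses; it should not be read as a recommendation for storing passwords.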
7. Simulated login to QQ Zone
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

bro = webdriver.Chrome()
bro.get('https://qzone.qq.com/')
# The login form sits inside an iframe, so switch into it first
bro.switch_to.frame('login_frame')
# Switch from QR-code login to account/password login
bro.find_element(By.ID, 'switcher_plogin').click()
# account = input('Enter your account: ')
bro.find_element(By.ID, 'u').send_keys('')
# password = input('Enter your password: ')
bro.find_element(By.ID, 'p').send_keys('')
bro.find_element(By.ID, 'login_button').click()
```
Chapter 8: The Scrapy framework
1. Hands-on projects and Scrapy configuration changes
2. bossPro example
```python
# Written Tuesday, 2021-02-23 (sophomore year; term starts March 7)
import requests
from lxml import etree

# url = 'https://www.zhipin.com/c101010100/?query=python&ka=sel-city-101010100'
url = 'https://www.zhipin.com/c101120100/b_%E9%95%BF%E6%B8%85%E5%8C%BA/?ka=sel-business-5'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
print(tree)
li_list = tree.xpath('//*[@id="main"]/div/div[2]/ul/li')
print(li_list)
for li in li_list:
    # The original expression was missing the '/' before 'a'
    job_name = li.xpath('.//span[@class="job-name"]/a/text()')
    print(job_name)
```
3. qiubaiPro example
```python
# -*- coding: utf-8 -*-
# Written Sunday, 2021-02-21 (sophomore year; term starts March 7)
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}
url = 'https://www.qiushibaike.com/text/'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
# One div per post
div_list = tree.xpath('//div[@id="content"]/div[1]/div[2]/div')
print(div_list)
# print(tree.xpath('//*[@id="qiushi_tag_124072337"]/a[1]/div/span//text()'))
for div in div_list:
    author = div.xpath('./div[1]/a[2]/h2/text()')[0]
    # print(author)
    # //text() returns every text node under the span; join them into one string
    content = div.xpath('./a[1]/div/span//text()')
    content = ''.join(content)
    # content = div.xpath('//*[@id="qiushi_tag_124072337"]/a[1]/div/span')
    # print(content)
    print(author, content)
# print(tree.xpath('//*[@id="qiushi_tag_124072337"]/div[1]/a[2]/h2/text()'))
```
4. Database example
```python
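# Written Sunday, 2021-02-21 (sophomore year; term starts March 7)
import pymysql

# Connect to the database
# host: IP address of the machine running the MySQL server
# user: user name
# password: password
# database: name of the database to connect to
# PyMySQL 1.0 and later accept these as keyword arguments only
# db = pymysql.connect(host="localhost", user="root", password="200829", database="wj")
db = pymysql.connect(host="192.168.31.19", user="root", password="200829", database="wj")
# Create a cursor object
cursor = db.cursor()
sql = "select version()"
# Execute the SQL statement
cursor.execute(sql)
# Fetch the returned row
data = cursor.fetchone()
print(data)
# Disconnect
cursor.close()
db.close()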
```
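The example above only checks the connection. Storing scraped rows follows the same cursor pattern, and the sketch below shows it with the standard library's `sqlite3` swapped in for PyMySQL so that it runs without a MySQL server. The table name `posts` and the helpers `save_posts` and `count_posts` are made up for illustration. With PyMySQL the placeholder is `%s` rather than `?`.

```python
import sqlite3


def save_posts(conn, posts):
    conn.execute('CREATE TABLE IF NOT EXISTS posts (author TEXT, content TEXT)')
    # Placeholders let the driver escape the values; scraped text should
    # never be pasted into the SQL string directly
    conn.executemany('INSERT INTO posts (author, content) VALUES (?, ?)', posts)
    conn.commit()


def count_posts(conn):
    return conn.execute('SELECT COUNT(*) FROM posts').fetchone()[0]


if __name__ == '__main__':
    conn = sqlite3.connect(':memory:')  # in-memory database for the demo
    save_posts(conn, [('author_a', 'first post'), ('author_b', "it's quoted")])
    print(count_posts(conn))
    conn.close()
```

The same two functions could receive the `(author, content)` pairs produced by the qiubaiPro loop.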
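![Image](https://img-blog.csdnimg.cn/20210502233758293.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQ1ODAzOTIz,size_16,color_FFFFFF,t_70)
Scrapy projects are hard to upload here.
If you need the Scrapy material, you can download it from my resources page.
You can also get it from my WeChat official account (yk 坤帝, the same as my blog nickname).
Replies on the official account may be a bit slow: I only just registered it and am still finding my way around.
![yk坤帝](https://img-blog.csdnimg.cn/20210502234148351.png)
If you have questions or want to discuss anything, you can also leave a message on the official account.
If you have worked through everything step by step to this point, congratulations: you now have the basic skills of a crawler engineer, and it is no exaggeration to say you have reached an employable level.
That is all for today. Thank you for reading; I believe that anyone who is good at learning will keep improving.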