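When scraping sites that can only be scraped with Selenium, and you also want pages to load faster, you can proceed as follows.
On macOS, launch Chrome with command-line arguments: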
open -a "Google Chrome" --args --ignore-certificate-errors --proxy-server=127.0.0.1:8080
Download Selenium Grid here:
https://www.selenium.dev/downloads/
Download the Chrome WebDriver (ChromeDriver) here:
https://chromedriver.storage.googleapis.com/index.html
Then write a startup script:
java -jar ./selenium-server-4.1.2.jar standalone
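Before opening sessions it can help to confirm that the standalone server is actually up. The following is a minimal sketch that queries the Grid's /status endpoint. It assumes the server listens on its default address, http://localhost:4444, and the helper names is_ready and grid_is_ready are made up for illustration.

```python
import json
import urllib.request


def is_ready(status: dict) -> bool:
    """Return True if a parsed /status payload reports the Grid as ready."""
    return bool(status.get("value", {}).get("ready", False))


def grid_is_ready(grid_url: str = "http://localhost:4444") -> bool:
    """Fetch <grid_url>/status and report readiness; False if unreachable."""
    try:
        with urllib.request.urlopen(grid_url + "/status", timeout=5) as resp:
            return is_ready(json.load(resp))
    except (OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return False
```

Calling grid_is_ready() in a short retry loop before starting the scraper avoids failing on a server that is still booting.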
Use mitmproxy to filter out requests you don't need.
Install it: pip3 install mitmproxy
Write a filter script, filter_rewardstyle.com.py:
import re

from mitmproxy import ctx, http

# def http_connect(flow: http.HTTPFlow):
#     if "rewardstyle.com" not in flow.request.host:
#         ctx.log.info("Ignoring CONNECT request: " + flow.request.url)
#         flow.response = http.Response.make(404)
#         return


def request(flow: http.HTTPFlow):
    ctx.log.info("============ request: " + flow.request.url)
    # For hosts other than rewardstyle.com, answer static-asset requests
    # with a 404 right away so they are never fetched from the network.
    if "rewardstyle.com" not in flow.request.host:
        if re.search(r'\.(css|js|jpg|png|gif|woff|tiff|ico)$', flow.request.url):
            ctx.log.info("Ignoring resource: " + flow.request.url)
            flow.response = http.Response.make(404)
            return


def response(flow: http.HTTPFlow):
    """Modify the response data."""
    # ctx.log.info(flow.request.url)
    # Replace the body of every response that is not from rewardstyle.com.
    if "rewardstyle.com" not in flow.request.url:
        flow.response.text = "by mitmproxy"
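To see which requests the filter would block, the matching rule can be tried on its own, without mitmproxy. This is only a sketch: should_block is a hypothetical helper that mirrors the condition in request(), and the example.com URLs are placeholders.

```python
import re

STATIC_ASSET = re.compile(r'\.(css|js|jpg|png|gif|woff|tiff|ico)$')


def should_block(host: str, url: str) -> bool:
    """Mirror the request() rule: static assets on non-rewardstyle hosts."""
    return "rewardstyle.com" not in host and STATIC_ASSET.search(url) is not None


# Third-party script: blocked.
print(should_block("cdn.example.com", "https://cdn.example.com/app.js"))          # True
# Same kind of file on the target site: allowed through.
print(should_block("www.rewardstyle.com", "https://www.rewardstyle.com/app.js"))  # False
# The pattern is anchored at the end of the URL, so a query string defeats it.
print(should_block("cdn.example.com", "https://cdn.example.com/app.js?v=3"))      # False
```

The last case shows a limitation of matching on the full URL: assets requested with a cache-busting query string are not caught by this pattern.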
Run this command:
mitmdump -s filter_rewardstyle.com.py
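With the Selenium server and mitmdump both running, a session can be routed through the proxy from Python. The following is a minimal sketch, not a definitive setup: it assumes the Selenium 4 Python bindings are installed (pip3 install selenium), the Grid is at its default http://localhost:4444, and mitmdump listens on its default 127.0.0.1:8080. The function names and the target URL are illustrative.

```python
def chrome_proxy_args(host: str = "127.0.0.1", port: int = 8080) -> list:
    """Chrome switches that route traffic through the local proxy."""
    return [
        f"--proxy-server={host}:{port}",
        # mitmproxy re-signs HTTPS traffic with its own CA, so certificate
        # errors have to be ignored unless that CA is trusted by the system.
        "--ignore-certificate-errors",
    ]


def make_driver(grid_url: str = "http://localhost:4444"):
    """Open a remote Chrome session on the Grid (assumed default URL)."""
    from selenium import webdriver  # imported lazily; needs the selenium package

    options = webdriver.ChromeOptions()
    for arg in chrome_proxy_args():
        options.add_argument(arg)
    return webdriver.Remote(command_executor=grid_url, options=options)


def main():
    driver = make_driver()
    try:
        driver.get("https://www.rewardstyle.com/")  # illustrative target URL
        print(driver.title)
    finally:
        driver.quit()
```

Calling main() should cause the mitmdump console to log each request and show which resources were ignored.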