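When a site can only be scraped with selenium but you still want pages to load quickly, you can set things up as follows.
On a Mac, start Chrome with these arguments: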
open -a "Google Chrome" --args --ignore-certificate-errors --proxy-server=127.0.0.1:8080
Download selenium grid here:
https://www.selenium.dev/downloads/
Download the Chrome webdriver (chromedriver) here:
https://chromedriver.storage.googleapis.com/index.html
Then write a startup script:
java -jar ./selenium-server-4.1.2.jar standalone
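Once the standalone server is running, a client can request a Chrome session that carries the same flags as the launch command above. The following is only a minimal sketch: it assumes Selenium 4's Python bindings, a Grid listening on its default address http://localhost:4444, and mitmproxy on 127.0.0.1:8080. The helper names build_chrome_arguments and open_grid_session are made up for illustration.

```python
from typing import List


def build_chrome_arguments(proxy_host: str = "127.0.0.1", proxy_port: int = 8080) -> List[str]:
    """Return the Chrome flags used in the launch command above."""
    return [
        "--ignore-certificate-errors",  # accept the certificates mitmproxy generates
        f"--proxy-server={proxy_host}:{proxy_port}",  # send all traffic through the proxy
    ]


def open_grid_session(grid_url: str = "http://localhost:4444"):
    """Open a remote Chrome session on the Grid (assumed default URL)."""
    # Imported here so the helper above can be used without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in build_chrome_arguments():
        options.add_argument(argument)
    return webdriver.Remote(command_executor=grid_url, options=options)


# Usage, with the Grid and the proxy both running:
#   driver = open_grid_session()
#   driver.get("https://www.rewardstyle.com/")
#   driver.quit()
```

Keeping the flag list in its own function makes it easy to change the proxy address in one place if mitmproxy is started on a different port.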
Use mitmproxy to filter out requests you don't need.
Install it: pip3 install mitmproxy
Write a filter script: filter_rewardstyle.com.py
import re

from mitmproxy import ctx, http


# def http_connect(flow: http.HTTPFlow):
#     if "rewardstyle.com" not in flow.request.host:
#         ctx.log.info("Ignoring CONNECT request: " + flow.request.url)
#         flow.response = http.Response.make(404)
#         return


def request(flow: http.HTTPFlow):
    ctx.log.info("============ request: " + flow.request.url)
    # For hosts other than rewardstyle.com, answer static resources with a 404
    # right away so they are never fetched from the upstream server.
    if "rewardstyle.com" not in flow.request.host:
        if re.search(r'\.(css|js|jpg|png|gif|woff|tiff|ico)$', flow.request.url):
            ctx.log.info("Ignoring resource: " + flow.request.url)
            flow.response = http.Response.make(404)
            return


def response(flow: http.HTTPFlow):
    """Modify the response data."""
    # ctx.log.info(flow.request.url)
    # Replace the body of every response that does not come from rewardstyle.com.
    if "rewardstyle.com" not in flow.request.url:
        flow.response.text = "by mitmproxy"
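Run this command (mitmdump listens on port 8080 by default, which matches the --proxy-server value used when starting Chrome):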
mitmdump -s filter_rewardstyle.com.py
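One thing to watch in the filter script: the regular expression is anchored with $ against the full URL, so a resource requested as app.js?v=3 would not be matched. A possible variant, sketched here as an assumption rather than part of the original script, tests only the path component. The helper names is_static_resource and should_block are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Same extension list as in the filter script.
STATIC_EXTENSIONS = re.compile(r"\.(css|js|jpg|png|gif|woff|tiff|ico)$")


def is_static_resource(url: str) -> bool:
    """True if the URL path (query string ignored) ends in a static extension."""
    path = urlsplit(url).path
    return STATIC_EXTENSIONS.search(path) is not None


def should_block(host: str, url: str, keep_domain: str = "rewardstyle.com") -> bool:
    """True for static resources served by hosts other than keep_domain."""
    return keep_domain not in host and is_static_resource(url)
```

Inside the request hook this could be called as should_block(flow.request.host, flow.request.url) in place of the two nested if statements.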