Monthly archive: March 2022

Some tips for scraping with selenium combined with mitmproxy

Some sites can only be scraped with selenium. If you also want their pages to load faster, you can set things up as follows.

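On a Mac, launch Chrome with arguments so that it ignores certificate errors and sends its traffic through the local proxy:
open -a "Google Chrome" --args --ignore-certificate-errors --proxy-server=127.0.0.1:8080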

Download selenium grid here:
https://www.selenium.dev/downloads/

Download the chrome webdriver here:
https://chromedriver.storage.googleapis.com/index.html

Then write a startup script:
java -jar ./selenium-server-4.1.2.jar standalone
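
Once the standalone server is running, a selenium session can be opened against it with the same two Chrome switches used above. The following is only a minimal sketch, not part of the original setup: the names GRID_URL, PROXY, chrome_arguments and make_driver are made up for illustration, and it assumes the selenium 4 Python package is installed, that the Grid listens on its default port 4444, and that mitmproxy listens on 127.0.0.1:8080.

```python
# Assumed addresses: Grid standalone on its default port, mitmproxy on 8080.
GRID_URL = "http://127.0.0.1:4444"
PROXY = "127.0.0.1:8080"


def chrome_arguments(proxy=PROXY):
    """Return the Chrome switches that route traffic through the proxy."""
    return [
        "--proxy-server=" + proxy,
        # mitmproxy re-signs HTTPS traffic, so certificate errors are ignored.
        "--ignore-certificate-errors",
    ]


def make_driver(grid_url=GRID_URL, proxy=PROXY):
    """Open a remote Chrome session on the Grid (hypothetical helper)."""
    # Imported here so the module can be loaded without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in chrome_arguments(proxy):
        options.add_argument(argument)
    return webdriver.Remote(command_executor=grid_url, options=options)


if __name__ == "__main__":
    driver = make_driver()
    try:
        driver.get("https://www.rewardstyle.com/")
        print(driver.title)
    finally:
        driver.quit()
```

Keeping the switch list in its own function makes it easy to reuse the same values for a locally launched Chrome and for the Grid session.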

Use mitmproxy to filter out useless requests.
Install it: pip3 install mitmproxy

Write a filter script: filter_rewardstyle.com.py

import re
from mitmproxy import ctx, http

# Disabled: reject CONNECT requests to any host other than rewardstyle.com.
# def http_connect(flow: http.HTTPFlow):
#     if "rewardstyle.com" not in flow.request.host:
#         ctx.log.info("Ignoring CONNECT request: " + flow.request.url)
#         flow.response = http.Response.make(404)
#         return


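def request(flow: http.HTTPFlow):
    """Answer static resources requested from other hosts with a 404."""
    ctx.log.info("============ request: " + flow.request.url)
    if "rewardstyle.com" not in flow.request.host:
        # Setting flow.response here means the request is never sent upstream.
        if re.search(r'\.(css|js|jpg|png|gif|woff|tiff|ico)$', flow.request.url):
            ctx.log.info("Ignoring resource: " + flow.request.url)
            flow.response = http.Response.make(404)
            return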


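def response(flow: http.HTTPFlow):
    """Modify the response data for every URL outside rewardstyle.com."""
    # ctx.log.info(flow.request.url)
    if "rewardstyle.com" not in flow.request.url:
        # Replace the body with a short placeholder so the browser has nothing to render.
        flow.response.text = "by mitmproxy"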

Run this command:
mitmdump -s filter_rewardstyle.com.py
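
One detail of the filter script worth noting: the extension check runs against the full URL, so an address that ends in a query string, such as app.js?v=3, does not match the pattern and is not answered with a 404. A possible variant, shown here only as a sketch, checks the path component instead. The helper name is_static_resource and the case-insensitive matching are assumptions added for illustration and are not part of the original script.

```python
import re
from urllib.parse import urlsplit

# Same extension list as the filter script, matched case-insensitively.
STATIC_EXT = re.compile(r"\.(css|js|jpg|png|gif|woff|tiff|ico)$", re.IGNORECASE)


def is_static_resource(url: str) -> bool:
    """Return True when the URL path ends in one of the static extensions."""
    # urlsplit separates the query string and fragment from the path,
    # so "app.js?v=3" is checked as "/app.js".
    path = urlsplit(url).path
    return STATIC_EXT.search(path) is not None


if __name__ == "__main__":
    for url in (
        "https://cdn.example.com/app.js?v=3",
        "https://example.com/api/items?file=a.js",
    ):
        print(url, is_static_resource(url))
```

Inside the request hook, the re.search call could then be replaced by is_static_resource(flow.request.url).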