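Choosing an approach

  • Official website / mini program / WeChat official account, reverse-engineering the API: HTTP requests
  • Official website / mini program / WeChat official account, scraping rendered pages: browser
  • App reverse-engineering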

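Confirming the business requirement details

e.g.:

  • Cabin class
  • Number of remaining seats
  • Aircraft type
  • Quote currency
  • Tax included in the quote
  • Number of passengers in the query
  • Child fare
  • Baggage allowance
  • Handling of sold-out itineraries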
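The checklist above can be pinned down as a record type so that every scraper returns the same shape. The following is only a sketch: the class name `FlightQuote`, the helper `bookable`, and all field names (including `adult_fare`, which the list does not mention) are assumptions made for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional


@dataclass(frozen=True)
class FlightQuote:
    """One scraped fare. All field names are illustrative assumptions."""
    cabin: str                              # cabin class code, e.g. "Y"
    seats_left: Optional[int]               # None when the source hides the count
    aircraft_type: Optional[str]            # e.g. "A320"
    currency: str                           # currency code of the quote, e.g. "CNY"
    adult_fare: Decimal                     # assumed field, not in the checklist
    tax: Decimal                            # tax quoted alongside the fare
    passengers_queried: int                 # passenger count used in the search
    child_fare: Optional[Decimal] = None
    baggage_allowance: Optional[str] = None
    sold_out: bool = False

    def total_per_adult(self) -> Decimal:
        """Fare plus tax for one adult, in the quote currency."""
        return self.adult_fare + self.tax


def bookable(quotes: List[FlightQuote], passengers: int) -> List[FlightQuote]:
    """Drop sold-out quotes and quotes known to have too few seats."""
    return [
        q for q in quotes
        if not q.sold_out
        and (q.seats_left is None or q.seats_left >= passengers)
    ]
```

Whether a sold-out itinerary is dropped, as here, or returned with a flag is exactly the kind of detail to confirm with the business side first.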

Technology selection

Browser-based, recommended:

  • webdriver + (Python|Java|.NET|…)
  • Puppeteer + Node.js

Other headless browsers work as well; the options are listed in no particular order.

Web page parsing

Chrome WebDriver

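  • Loading an extension

// Java
import java.io.File;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

// Load a packed extension (.crx) when Chrome starts.
ChromeOptions options = new ChromeOptions();
options.addExtensions(new File("/path/to/extension.crx"));
ChromeDriver driver = new ChromeDriver(options);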
  • Setting a proxy

// Java
import java.io.File;
import org.openqa.selenium.Proxy;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

ChromeOptions options = new ChromeOptions();
// Add the WebDriver proxy capability.
Proxy proxy = new Proxy();
proxy.setHttpProxy("myhttpproxy:3337");
options.setCapability("proxy", proxy);

// Add a ChromeDriver-specific capability.
options.addExtensions(new File("/path/to/extension.crx"));
ChromeDriver driver = new ChromeDriver(options);
  • Launching a non-default Chrome binary

// Java
ChromeOptions options = new ChromeOptions();
// Point ChromeDriver at a Chrome executable other than the default install.
options.setBinary("/path/to/other/chrome/binary");
.....

Python

  • bs4
  • pyquery
  • lxml.html
  • re
Benchmark from https://gist.github.com/MercuryRising/4061368:

==== Total trials: 100000 =====
bs4 total time: 38.0
pq total time: 5.2
lxml (cssselect) total time: 5.1
lxml (xpath) total time: 3.0
regex total time: 8.4 (doesn't find all p)
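The benchmark's note that the regex run "doesn't find all p" is worth illustrating. The sketch below is not the gist's code; it uses only the standard library (`re` and `html.parser`) and a made-up HTML string to show one way a naive pattern can miss paragraphs that a real parser picks up. The names `naive_paragraphs`, `parsed_paragraphs`, and `SAMPLE_HTML` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Made-up sample: one plain <p>, one with an attribute, one in upper case.
SAMPLE_HTML = '<div><p>one</p><p class="x">two</p><P>three</P></div>'


def naive_paragraphs(html):
    """Match only the literal '<p>...</p>' form, as a quick regex might."""
    return re.findall(r"<p>(.*?)</p>", html)


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element, whatever its attributes or case."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # list of text chunks while inside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":  # HTMLParser lower-cases tag names
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer))
            self._buffer = None


def parsed_paragraphs(html):
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return collector.paragraphs


print(naive_paragraphs(SAMPLE_HTML))   # only the plain <p>
print(parsed_paragraphs(SAMPLE_HTML))  # all three paragraphs
```

A more permissive pattern would close this particular gap, but the trade-off stays the same: speed numbers only matter once the extractor returns the same elements as the parser.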

Node

Reverse-engineering the official website

Tools

Technical points

  • JavaScript engines in Python
    • js2py
    • pyv8

Productization

  • RESTful API
  • Headless or not
  • Stability
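For the stability point, one common building block is retrying a flaky fetch with exponential backoff. The following is a minimal sketch under that assumption, not a description of any particular production setup; `with_retries` and its parameters are hypothetical names, and `fetch` stands in for whatever HTTP or browser call the scraper makes.

```python
import time


def with_retries(fetch, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds or `attempts` calls have failed.

    The wait doubles after each failure: base_delay, 2*base_delay, ...
    `sleep` is injectable so the helper can be exercised without waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as err:  # a real scraper would catch narrower errors
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

Usage would look like `with_retries(lambda: scrape_once(url))`, where `scrape_once` is a placeholder for the actual request. Behind a RESTful front end, the retry budget also has to fit inside the caller's timeout.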

Puppeteer sample code

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    // Math.random must be called; a bare reference would append the function source to the path.
    userDataDir: "/tmp/lagou/" + Math.random(),
    slowMo: 300,
    // args: [
    //   '--proxy-server=http://' + curProxy
    // ]
  });

  try {
    const page = await browser.newPage();

    // Register the override before navigation so it applies to every new document;
    // a plain page.evaluate() here would be lost once page.goto() loads the site.
    await page.evaluateOnNewDocument(() => {
      Object.defineProperty(navigator, "webdriver", { get: () => false });
    });

    const searchTextId = "#search_input";
    const searchBtnId = "#search_button";

    //await page.waitFor(20000 * Math.random() + 10000);
    await page.goto("https://www.lagou.com/", {
      timeout: 3000,
    });

    // Wait until the search box exists before interacting with it.
    await page.waitForSelector(searchTextId, {
      timeout: 3000,
    });

    await page.click(searchTextId);
    await page.keyboard.type("百度"); // search term: "Baidu"
    await page.click(searchBtnId);
  } catch (err) {
    // The original called an undefined logger; console is used instead.
    console.error("open page exception:" + err);
  } finally {
    // Close the browser even when navigation or a click fails.
    await browser.close();
  }
})();