Building a web crawler with php-webdriver + docker-selenium on Linux

Docker must be installed on the Linux host; if it is not, see the earlier article.
# Pull the Docker image
docker pull selenium/standalone-chrome:4.0.0-alpha-7-prerelease-20200826

# Create the Selenium Docker container
docker run -d -p 4444:4444 --name=selenium -v /dev/shm:/dev/shm selenium/standalone-chrome:4.0.0-alpha-7-prerelease-20200826

# Check the container status
docker ps
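
`docker ps` only shows that the container is running, not that the Selenium server inside it is ready to accept sessions. A minimal sketch of a readiness check, assuming the server answers the standard WebDriver `/status` endpoint on the port mapped above (`is_ready` and `fetch_status` are hypothetical helper names, not part of any library):

```python
import json
import urllib.request


def is_ready(status_payload: dict) -> bool:
    """Return True when a WebDriver /status response reports readiness."""
    # The WebDriver status response wraps its fields in a "value" object.
    value = status_payload.get("value", {})
    return bool(value.get("ready", False))


def fetch_status(base_url: str = "http://127.0.0.1:4444", timeout: float = 5.0) -> dict:
    """Fetch and decode the /status document from the Selenium server."""
    with urllib.request.urlopen(base_url + "/status", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (needs the container from the previous step to be running):
#     print(is_ready(fetch_status()))
```

Keeping the parsing separate from the HTTP call makes it possible to check the logic without a running container.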

Setting up the PHP environment and installing Composer is not covered here.

composer require php-webdriver/webdriver
<?php
/**
 * Created by PhpStorm.
 * User: lizhiguo
 * Date: 2020/8/31
 * Time: 10:05
 */
require __DIR__ . '/vendor/autoload.php';
use \Facebook\WebDriver\Remote\RemoteWebDriver;
use \Facebook\WebDriver\Remote\DesiredCapabilities;
use \Facebook\WebDriver\Chrome\ChromeOptions;
$host='http://127.0.0.1:4444';
$desiredCapabilities = DesiredCapabilities::chrome();

// Disable accepting SSL certificates
$desiredCapabilities->setCapability('acceptSslCerts', false);

// Run headless Chrome

$chromeOptions = new ChromeOptions();
$chromeOptions->addArguments(['--no-sandbox', '--headless']);

$desiredCapabilities->setCapability(ChromeOptions::CAPABILITY_W3C, $chromeOptions);


$driver = RemoteWebDriver::create($host, $desiredCapabilities);


for ($i=1;$i<=14;$i++){
	echo $url="https://www.amazon.com/s?k=keyboard&page=".$i."&qid=".time()."&ref=sr_pg_3";
	$driver->get($url);
//	$chromeOptions->getCookies($url);
	print_r($source=$driver->getPageSource());
	file_put_contents($i.'.html',$source);
}

//$driver->manage()->getCookies();


$driver->quit();
php-webdriver API reference: https://php-webdriver.github.io/php-webdriver/latest/Facebook/WebDriver/Chrome/ChromeDriver.html

Setting up Chrome or Opera + Python + Selenium WebDriver in a CentOS 8 Docker container to crawl web data

yum update
# Using CentOS 8

# Install Python and the pip package tool
yum install python38

# After installation, check the versions
[root@7c73e1180bfb ~]# python3.8 -V
Python 3.8.0

[root@7c73e1180bfb ~]# pip3.8 -V
pip 19.2.3 from /usr/lib/python3.8/site-packages/pip (python 3.8)

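# Package dependencies can differ between OS versions; if the install fails,
# build from source as described in the article "Installing Python on Linux (CentOS)"
# Install the selenium package
pip3.8 install selenium
# Or, if downloads are slow from mainland China, install from a domestic mirror
pip3.8 install selenium -i https://pypi.tuna.tsinghua.edu.cn/simple --trusted-host pypi.tuna.tsinghua.edu.cn
# Download the Opera browser
# https://download4.operacdn.com/ftp/pub/opera/desktop/
# Download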
wget https://download4.operacdn.com/ftp/pub/opera/desktop/70.0.3728.95/linux/opera-stable_70.0.3728.95_amd64.rpm

# Install
yum localinstall opera-stable_70.0.3728.95_amd64.rpm

# Check the browser version
[root@7c73e1180bfb ~]# opera -version
70.0.3728.95

# Install the browser driver; pick the version that matches the browser
https://github.com/operasoftware/operachromiumdriver/releases

wget https://github.com/operasoftware/operachromiumdriver/releases/download/v.84.0.4147.89/operadriver_linux64.zip

unzip operadriver_linux64.zip

cp operadriver /usr/bin/operadriver 
# Download the Google Chrome browser
# https://www.chrome64bit.com/index.php/google-chrome-64-bit-for-linux
# Download
wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm

# Install
yum install google-chrome-stable_current_x86_64.rpm

# Check the browser version
[root@d289bf70da9a ~]# google-chrome --version
Google Chrome 85.0.4183.83 

# Install the browser driver; pick the version that matches the browser
https://npm.taobao.org/mirrors/chromedriver/

wget https://cdn.npm.taobao.org/dist/chromedriver/85.0.4183.87/chromedriver_linux64.zip

unzip chromedriver_linux64.zip

cp chromedriver /usr/bin/chromedriver

# Check the driver version
[root@d289bf70da9a ~]# chromedriver --version
ChromeDriver 85.0.4183.87 (cd6713ebf92fa1cacc0f1a598df280093af0c5d7-refs/branch-heads/4183@{#1689})
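
The driver has to match the installed browser, so it can help to compare the two `--version` outputs before running a crawl. A minimal sketch, assuming that comparing only the major version number is enough (a simplification), with `major_version` and `versions_match` as hypothetical helper names:

```python
import re


def major_version(version_output: str) -> int:
    """Extract the major version number from a `--version` output line."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError("no version number found in: %r" % version_output)
    return int(match.group(1))


def versions_match(browser_output: str, driver_output: str) -> bool:
    """Compare only the major versions of the browser and its driver."""
    return major_version(browser_output) == major_version(driver_output)


# The two outputs shown above.
chrome = "Google Chrome 85.0.4183.83 "
driver = ("ChromeDriver 85.0.4183.87 "
          "(cd6713ebf92fa1cacc0f1a598df280093af0c5d7-refs/branch-heads/4183@{#1689})")
print(versions_match(chrome, driver))
```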
Create a file named webdriver.py; keep the formatting (indentation) the same as below to avoid errors when it runs.
import io
import sys
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

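sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='utf8') # Force UTF-8 for standard output

# With the 'eager' strategy, Selenium WebDriver waits until the initial HTML document
# has been completely loaded and parsed (the DOMContentLoaded event), without waiting
# for stylesheets, images and subframes to finish loading.
options = Options()
options.page_load_strategy = 'eager'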
options.add_argument("--no-sandbox")
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)


# Maximize the window
driver.maximize_window()

for num in range(1,15):  # pages 1 to 14; range() excludes the upper bound
    driver.get("https://www.amazon.com/s?k=keyboard&page=%s&qid=%s&ref=sr_pg_3" %(num,int(time.time())))
    html_source = driver.page_source
    print(html_source)
    print(driver.current_url)
    print(driver.get_cookies())


# Close the browser when crawling is finished
driver.close()
driver.quit()
# Run the script
python3.8 webdriver.py
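
Both scripts assemble the page URL by string concatenation. A hedged alternative is to let the standard library encode the query string, which also copes with keywords containing spaces. `build_search_url` is a hypothetical helper; the parameter names and the fixed `ref=sr_pg_3` value are simply copied from the scripts above:

```python
import time
from urllib.parse import urlencode


def build_search_url(page, keyword="keyboard", now=None):
    """Build the search-result URL for one page, mirroring the scripts above."""
    # qid is sent as a whole-second Unix timestamp, as in the PHP version.
    qid = int(time.time() if now is None else now)
    query = urlencode({"k": keyword, "page": page, "qid": qid, "ref": "sr_pg_3"})
    return "https://www.amazon.com/s?" + query


# A fixed timestamp is passed here only to make the output reproducible.
print(build_search_url(2, now=1598839200))
```

In the crawl loop this would replace the formatted string, for example `driver.get(build_search_url(num))`.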

Remote operating system identification with TCP/IP stack fingerprinting: active and passive identification

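While building an Amazon crawler, I found Amazon's blocking rules puzzling. The traditional techniques (imitating a browser's request headers and cookies, rotating IPs) do not fully explain Amazon's anti-crawler behaviour, so other strategies must be in play. Amazon does not ban an IP outright, since that would sacrifice part of its user base; the ban is lifted after a while.

What caused the confusion:

1. With the same IP and the same scraping method, scraping from a Linux host was already blocked, yet switching to Windows allowed the data to be fetched normally.

2. On a Linux host, with CentOS, Ubuntu, Debian and other distributions and versions installed through Docker and the same scraping method, the blocking behaviour differed completely: some containers scraped normally and some were blocked, even though they all shared the same outbound IP. Why?

Guess: can Amazon see the MAC address of the network card in the server or in the Docker container? Or can it identify the operating system type and version?

My first, mistaken line of thought: if the HTTP request forges a Windows User-Agent, then surely all it can identify is Windows!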

User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 

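Of course the question was then shelved without further thought; at that time I knew very little about the TCP/IP protocols.

........

A few months later, on a whim, I read through a lot of material and discovered:

Protocol stack fingerprinting

Protocol stack fingerprinting is a powerful technique that can determine an operating system's version quickly and with high probability. Although the TCP/IP protocol suite is standardized, vendors such as Microsoft and RedHat interpreted the standard differently when writing their own TCP/IP stacks, for example in the initial TTL, the TCP window size, and the order of TCP options. Because these differences are characteristic of each implementation, they are called "fingerprints", and they can be used to pin down the operating system version.

Because of its high accuracy, TCP/IP stack fingerprinting is widely used in well-known security tools such as nmap and p0f. It comes in two forms: active identification, where probes are sent to the target and its replies are analysed, and passive identification, where existing traffic is only observed.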

p0f, a passive identification tool

# Install
yum install p0f
[root@izwz9bb1rjtnk ~]# p0f -h
--- p0f 3.09b by Michal Zalewski <lcamtuf@coredump.cx> ---

p0f: invalid option -- 'h'
Usage: p0f [ ...options... ] [ 'filter rule' ]

Network interface options:

  -i iface  - listen on the specified network interface
  -r file   - read offline pcap data from a given file
  -p        - put the listening interface in promiscuous mode
  -L        - list all available interfaces

Operating mode and output settings:

  -f file   - read fingerprint database from 'file' (/etc/p0f/p0f.fp)
  -o file   - write information to the specified log file
  -s name   - answer to API queries at a named unix socket
  -u user   - switch to the specified unprivileged account and chroot
  -d        - fork into background (requires -o or -s)

Performance-related options:

  -S limit  - limit number of parallel API connections (20)
  -t c,h    - set connection / host cache age limits (30s,120m)
  -m c,h    - cap the number of active connections / hosts (1000,10000)

Optional filter expressions (man tcpdump) can be specified in the command
line to prevent p0f from looking at incidental network traffic.

Problems? You can reach the author at <lcamtuf@coredump.cx>.
Listen on port 443 of the eth0 interface and write the log to p0f3.log
p0f -f /etc/p0f/p0f.fp -o ./p0f3.log -i eth0  'port 443'
# Output
.-[ 172.18.37.42/53464 -> 163.177.83.164/443 (syn) ]-
|
| client   = 172.18.37.42/53464
| os       = Linux 3.11 and newer
| dist     = 0
| params   = none
| raw_sig  = 4:64+0:0:1460:mss*20,7:mss,sok,ts,nop,ws:df,id+:0
|
`----

.-[ 172.18.37.42/53464 -> 163.177.83.164/443 (mtu) ]-
|
| client   = 172.18.37.42/53464
| link     = Ethernet or modem
| raw_mtu  = 1500
|
`----

.-[ 172.18.37.42/53464 -> 163.177.83.164/443 (syn+ack) ]-
|
| server   = 163.177.83.164/443
| os       = Linux 3.x
| dist     = 12
| params   = tos:0x05
| raw_sig  = 4:52+12:0:1440:mss*10,7:mss,nop,nop,sok,nop,ws:df:0
|
`----

.-[ 172.18.37.42/53464 -> 163.177.83.164/443 (mtu) ]-
|
| server   = 163.177.83.164/443
| link     = IPIP or SIT
| raw_mtu  = 1480
|
`----

.-[ 222.131.36.189/50664 -> 172.18.37.42/443 (syn) ]-
|
| client   = 222.131.36.189/50664
| os       = Mac OS X
| dist     = 12
| params   = generic fuzzy tos:0x05
| raw_sig  = 4:52+12:0:1420:65535,7:mss,nop,ws,nop,nop,ts,sok,eol+1:df,ecn:0
|
`----
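
The `raw_sig` lines carry the fingerprint itself. As a rough sketch, assuming the colon-separated layout documented in the p0f v3 README (`ver:ittl:olen:mss:wsize,scale:olayout:quirks:pclass`), the field can be split into named parts. `RawSig` and `parse_raw_sig` are hypothetical names, and the parser only handles the plain forms seen in the output above:

```python
from typing import List, NamedTuple


class RawSig(NamedTuple):
    ip_version: str
    observed_ttl: int
    distance: int
    ip_options_len: str
    mss: str
    window_size: str
    window_scale: str
    option_layout: List[str]
    quirks: List[str]
    payload_class: str

    @property
    def initial_ttl(self) -> int:
        # Observed TTL plus hop distance gives the sender's estimated initial TTL.
        return self.observed_ttl + self.distance


def parse_raw_sig(raw_sig: str) -> RawSig:
    """Split a p0f raw_sig string into its eight colon-separated fields."""
    fields = raw_sig.strip().split(":")
    if len(fields) != 8:
        raise ValueError("expected 8 fields, got %d" % len(fields))
    ver, ttl, olen, mss, win, layout, quirks, pclass = fields
    observed, _, dist = ttl.partition("+")
    wsize, _, scale = win.partition(",")
    return RawSig(
        ip_version=ver,
        observed_ttl=int(observed),
        distance=int(dist or 0),
        ip_options_len=olen,
        mss=mss,
        window_size=wsize,
        window_scale=scale,
        option_layout=layout.split(",") if layout else [],
        quirks=quirks.split(",") if quirks else [],
        payload_class=pclass,
    )


sig = parse_raw_sig("4:52+12:0:1420:65535,7:mss,nop,ws,nop,nop,ts,sok,eol+1:df,ecn:0")
print(sig.initial_ttl, sig.window_size, sig.option_layout)
```

In the output above, the `52+12` in `raw_sig` lines up with the reported `dist = 12`, and the order of the TCP options differs between the Linux and Mac OS X clients.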

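When an HTTP request is sent to the test server, the User-Agent may be disguised as Windows, but the stack fingerprint still reveals the requester's operating system type. Faking only the User-Agent is like plugging your ears while stealing a bell, so using stack fingerprints against crawlers becomes very easy. This is, as far as I can tell, one of Amazon's anti-crawler strategies.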
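
To illustrate the idea, here is a hypothetical server-side check that compares the OS claimed in the User-Agent with the OS label reported by p0f. It is only a sketch of the principle; Amazon's real implementation is not public, and the function names and the simple substring matching are assumptions made for this example:

```python
def os_family_from_user_agent(user_agent: str) -> str:
    """Rough OS family claimed by a User-Agent header."""
    ua = user_agent.lower()
    if "windows" in ua:
        return "windows"
    if "mac os x" in ua or "macintosh" in ua:
        return "mac"
    if "linux" in ua:
        return "linux"
    return "unknown"


def os_family_from_p0f(os_label: str) -> str:
    """Rough OS family from a p0f 'os =' label such as 'Linux 3.11 and newer'."""
    label = os_label.lower()
    for prefix, family in (("windows", "windows"), ("mac os x", "mac"), ("linux", "linux")):
        if label.startswith(prefix):
            return family
    return "unknown"


def looks_spoofed(user_agent: str, p0f_os_label: str) -> bool:
    """True when both sides are recognised and they disagree."""
    claimed = os_family_from_user_agent(user_agent)
    observed = os_family_from_p0f(p0f_os_label)
    if "unknown" in (claimed, observed):
        return False
    return claimed != observed


ua = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36")
print(looks_spoofed(ua, "Linux 3.11 and newer"))
```

With the forged Windows User-Agent from earlier and the `Linux 3.11 and newer` label from the p0f output, the two sides disagree, which is exactly the mismatch a server could act on.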

Active stack fingerprinting

# Install
yum install nmap
# Help output
[root@root ~]# nmap -h
Nmap 6.40 ( http://nmap.org )
Usage: nmap [Scan Type(s)] [Options] {target specification}
TARGET SPECIFICATION:
  Can pass hostnames, IP addresses, networks, etc.
  Ex: scanme.nmap.org, microsoft.com/24, 192.168.0.1; 10.0.0-255.1-254
  -iL <inputfilename>: Input from list of hosts/networks
  -iR <num hosts>: Choose random targets
  --exclude <host1[,host2][,host3],...>: Exclude hosts/networks
  --excludefile <exclude_file>: Exclude list from file
HOST DISCOVERY:
  -sL: List Scan - simply list targets to scan
  -sn: Ping Scan - disable port scan
  -Pn: Treat all hosts as online -- skip host discovery
  -PS/PA/PU/PY[portlist]: TCP SYN/ACK, UDP or SCTP discovery to given ports
  -PE/PP/PM: ICMP echo, timestamp, and netmask request discovery probes
  -PO[protocol list]: IP Protocol Ping
  -n/-R: Never do DNS resolution/Always resolve [default: sometimes]
  --dns-servers <serv1[,serv2],...>: Specify custom DNS servers
  --system-dns: Use OS's DNS resolver
  --traceroute: Trace hop path to each host
SCAN TECHNIQUES:
  -sS/sT/sA/sW/sM: TCP SYN/Connect()/ACK/Window/Maimon scans
  -sU: UDP Scan
  -sN/sF/sX: TCP Null, FIN, and Xmas scans
  --scanflags <flags>: Customize TCP scan flags
  -sI <zombie host[:probeport]>: Idle scan
  -sY/sZ: SCTP INIT/COOKIE-ECHO scans
  -sO: IP protocol scan
  -b <FTP relay host>: FTP bounce scan
PORT SPECIFICATION AND SCAN ORDER:
  -p <port ranges>: Only scan specified ports
    Ex: -p22; -p1-65535; -p U:53,111,137,T:21-25,80,139,8080,S:9
  -F: Fast mode - Scan fewer ports than the default scan
  -r: Scan ports consecutively - don't randomize
  --top-ports <number>: Scan <number> most common ports
  --port-ratio <ratio>: Scan ports more common than <ratio>
SERVICE/VERSION DETECTION:
  -sV: Probe open ports to determine service/version info
  --version-intensity <level>: Set from 0 (light) to 9 (try all probes)
  --version-light: Limit to most likely probes (intensity 2)
  --version-all: Try every single probe (intensity 9)
  --version-trace: Show detailed version scan activity (for debugging)
SCRIPT SCAN:
  -sC: equivalent to --script=default
  --script=<Lua scripts>: <Lua scripts> is a comma separated list of 
           directories, script-files or script-categories
  --script-args=<n1=v1,[n2=v2,...]>: provide arguments to scripts
  --script-args-file=filename: provide NSE script args in a file
  --script-trace: Show all data sent and received
  --script-updatedb: Update the script database.
  --script-help=<Lua scripts>: Show help about scripts.
           <Lua scripts> is a comma separted list of script-files or
           script-categories.
OS DETECTION:
  -O: Enable OS detection
  --osscan-limit: Limit OS detection to promising targets
  --osscan-guess: Guess OS more aggressively
TIMING AND PERFORMANCE:
  Options which take <time> are in seconds, or append 'ms' (milliseconds),
  's' (seconds), 'm' (minutes), or 'h' (hours) to the value (e.g. 30m).
  -T<0-5>: Set timing template (higher is faster)
  --min-hostgroup/max-hostgroup <size>: Parallel host scan group sizes
  --min-parallelism/max-parallelism <numprobes>: Probe parallelization
  --min-rtt-timeout/max-rtt-timeout/initial-rtt-timeout <time>: Specifies
      probe round trip time.
  --max-retries <tries>: Caps number of port scan probe retransmissions.
  --host-timeout <time>: Give up on target after this long
  --scan-delay/--max-scan-delay <time>: Adjust delay between probes
  --min-rate <number>: Send packets no slower than <number> per second
  --max-rate <number>: Send packets no faster than <number> per second
FIREWALL/IDS EVASION AND SPOOFING:
  -f; --mtu <val>: fragment packets (optionally w/given MTU)
  -D <decoy1,decoy2[,ME],...>: Cloak a scan with decoys
  -S <IP_Address>: Spoof source address
  -e <iface>: Use specified interface
  -g/--source-port <portnum>: Use given port number
  --data-length <num>: Append random data to sent packets
  --ip-options <options>: Send packets with specified ip options
  --ttl <val>: Set IP time-to-live field
  --spoof-mac <mac address/prefix/vendor name>: Spoof your MAC address
  --badsum: Send packets with a bogus TCP/UDP/SCTP checksum
OUTPUT:
  -oN/-oX/-oS/-oG <file>: Output scan in normal, XML, s|<rIpt kIddi3,
     and Grepable format, respectively, to the given filename.
  -oA <basename>: Output in the three major formats at once
  -v: Increase verbosity level (use -vv or more for greater effect)
  -d: Increase debugging level (use -dd or more for greater effect)
  --reason: Display the reason a port is in a particular state
  --open: Only show open (or possibly open) ports
  --packet-trace: Show all packets sent and received
  --iflist: Print host interfaces and routes (for debugging)
  --log-errors: Log errors/warnings to the normal-format output file
  --append-output: Append to rather than clobber specified output files
  --resume <filename>: Resume an aborted scan
  --stylesheet <path/URL>: XSL stylesheet to transform XML output to HTML
  --webxml: Reference stylesheet from Nmap.Org for more portable XML
  --no-stylesheet: Prevent associating of XSL stylesheet w/XML output
MISC:
  -6: Enable IPv6 scanning
  -A: Enable OS detection, version detection, script scanning, and traceroute
  --datadir <dirname>: Specify custom Nmap data file location
  --send-eth/--send-ip: Send using raw ethernet frames or IP packets
  --privileged: Assume that the user is fully privileged
  --unprivileged: Assume the user lacks raw socket privileges
  -V: Print version number
  -h: Print this help summary page.
EXAMPLES:
  nmap -v -A scanme.nmap.org
  nmap -v -sn 192.168.0.0/16 10.0.0.0/8
  nmap -v -iR 10000 -Pn -p 80
SEE THE MAN PAGE (http://nmap.org/book/man.html) FOR MORE OPTIONS AND EXAMPLES
root@MHAnode04:~# nmap -O 50.2.83.130

Starting Nmap 7.01 ( https://nmap.org ) at 2020-06-04 04:11 EDT
Nmap scan report for 50.2.83.130
Host is up (0.15s latency).
Not shown: 999 closed ports
PORT   STATE SERVICE
22/tcp open  ssh
Aggressive OS guesses: Linux 2.6.32 - 3.13 (96%), Linux 3.2 - 4.0 (94%), Linux 2.6.32 - 3.10 (93%), HP P2000 G3 NAS device (93%), Ubiquiti AirMax NanoStation WAP (Linux 2.6.32) (92%), Linux 2.6.32 (92%), Linux 3.7 (92%), Infomir MAG-250 set-top box (92%), Linux 2.6.23 - 2.6.38 (91%), Linux 2.6.32 - 3.1 (91%)
No exact OS matches for host (test conditions non-ideal).
Network Distance: 18 hops

OS detection performed. Please report any incorrect results at https://nmap.org/submit/ .
Nmap done: 1 IP address (1 host up) scanned in 9.81 seconds
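
When nmap finds no exact match, it prints a ranked list of guesses on a single line. A small sketch for turning that line into (name, confidence) pairs, assuming the `Aggressive OS guesses:` format shown above (`parse_os_guesses` is a hypothetical helper, and the sample is a shortened copy of that line):

```python
import re

# Each guess ends with " (NN%)" and is followed by ", " or the end of the line.
GUESS_RE = re.compile(r"(.+?) \((\d+)%\)(?:, |$)")


def parse_os_guesses(line: str):
    """Return a list of (os_name, percent) tuples from an nmap guess line."""
    prefix = "Aggressive OS guesses: "
    line = line.strip()
    if not line.startswith(prefix):
        raise ValueError("not an 'Aggressive OS guesses' line")
    return [(name, int(pct)) for name, pct in GUESS_RE.findall(line[len(prefix):])]


sample = ("Aggressive OS guesses: Linux 2.6.32 - 3.13 (96%), "
          "Ubiquiti AirMax NanoStation WAP (Linux 2.6.32) (92%), Linux 3.7 (92%)")
for name, pct in parse_os_guesses(sample):
    print(pct, name)
```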

References:

https://blog.csdn.net/he_and/article/details/88350861
https://blog.csdn.net/freeking101/article/details/72962349
https://www.ixueshu.com/document/977e456638c1f9cf.html
https://www.doc88.com/p-8846033523821.html
https://wenku.baidu.com/view/6bdb6c2bff4733687e21af45b307e87100f6f878.html
https://blog.csdn.net/whatday/article/details/105517801
https://wenku.baidu.com/view/c6711182e53a580216fcfe75.html
http://shouce.jb51.net/kali-linux-tutorial/21.html
https://baike.baidu.com/item/%E5%8D%8F%E8%AE%AE%E6%A0%88%E6%8C%87%E7%BA%B9/7113052?fr=aladdin