Background
- Next Monday I have a coding interview at Tencent, and the interviewer said it will cover crawling and data processing, so I figured I'd write a crawler for practice. Since I've spent the past few days job hunting on Lagou anyway, I used Lagou as the practice target.
Environment
- macOS 10.14.6 on MacBook Pro 2017
- Selenium == 3.11.0
- Python 3.6.4
- ChromeDriver 70.0.3538.97
GitHub repository
- https://github.com/supersu097/mycrawler/tree/master/lagou
Implementation references
- Requests version: http://eunsetee.com/EKKQ
- Selenium version: http://eunsetee.com/EKNx
Notes
- The Requests version is copied verbatim from the original author. For the Selenium version I only borrowed the two lines that click through to the next page; the rest is copied from a crawler I wrote earlier. See the original articles for details and compare them with my code below.
Selenium version code
# -*- coding: utf-8 -*-
from selenium.webdriver.chrome.webdriver import WebDriver as ChromeDriver
from user_agent import generate_user_agent
from os.path import join as os_join
from bs4 import BeautifulSoup
import os
import time
import pandas


class Lagou(object):
    CURR_PATH = os.path.abspath('.')
    CHROME_DRIVER_PATH: str = os_join(CURR_PATH, 'chromedriver_mac')

    def __init__(self, initial_url: str):
        from selenium.webdriver.chrome.options import Options as ChromeOptions
        self.initial_url: str = initial_url
        options: ChromeOptions = ChromeOptions()
        # Headless mode throws "element is not clickable at point",
        # so this stays commented out (see the notes below).
        # options.add_argument('headless')
        custom_ua: str = generate_user_agent(os='win')
        options.add_argument('user-agent=' + custom_ua)
        self.driver = ChromeDriver(executable_path=self.CHROME_DRIVER_PATH,
                                   chrome_options=options)

    def get_first_page_source(self):
        try:
            self.driver.implicitly_wait(5)
            self.driver.set_script_timeout(20)
            self.driver.set_page_load_timeout(25)
            self.driver.get(self.initial_url)
            page_source: str = self.driver.page_source
            return page_source
        except KeyboardInterrupt:
            self.driver.quit()
            print('KeyboardInterrupt detected '
                  'and the chrome driver has quit.')
        except Exception as e:
            print(str(e))
            self.driver.quit()
            exit(1)

    def process_data(self, page_source):
        # Each <li> under ul.item_con_list carries the job info in
        # data-* attributes, so no deeper HTML parsing is needed.
        soup = BeautifulSoup(page_source, 'lxml')
        company_list = soup.select('ul.item_con_list li')
        data_list = []
        for company in company_list:
            attrs = company.attrs
            company_name = attrs['data-company']
            job_name = attrs['data-positionname']
            job_salary = attrs['data-salary']
            data_list.append(company_name + ',' + job_name + ',' + job_salary)
        return data_list

    def get_next_page_source(self):
        try:
            self.driver.implicitly_wait(5)
            self.driver.set_script_timeout(20)
            self.driver.set_page_load_timeout(25)
            # find_elements returns a list; take the first match, then click.
            next_page = self.driver.find_elements_by_xpath("//span[@class='pager_next ']")
            next_page[0].click()
            page_source: str = self.driver.page_source
            return page_source
        except KeyboardInterrupt:
            self.driver.quit()
            print('KeyboardInterrupt detected '
                  'and the chrome driver has quit.')
        except Exception as e:
            print(str(e))
            self.driver.quit()
            exit(1)

    @staticmethod
    def save_data(data, csv_header):
        # Unused alternative that dumps the rows via pandas.
        table = pandas.DataFrame(data)
        table.to_csv(r'/Users/sharp/Desktop/LaGou.csv', header=csv_header,
                     index=False, mode='a+')

    def save_data_into_csv(self, line_data):
        with open(r'/Users/sharp/Desktop/LaGou.csv', 'a+') as f:
            f.write(line_data + '\n')


url = 'https://www.lagou.com/jobs/list_linux%E8%BF%90%E7%BB%B4?labelWords=&fromSearch=true&suginput='
lagou = Lagou(url)
print('Get page {} source'.format(str(1)))
first_page_source = lagou.get_first_page_source()
first_page_data = lagou.process_data(first_page_source)
lagou.save_data_into_csv('company_name,job_name,job_salary')
for data in first_page_data:
    lagou.save_data_into_csv(data)
for i in range(1, 30):
    print('Get page {} source'.format(str(i + 1)))
    next_page_source = lagou.get_next_page_source()
    next_page_data = lagou.process_data(next_page_source)
    for data in next_page_data:  # was first_page_data; fixed to save each new page
        lagou.save_data_into_csv(data)
    time.sleep(8)
Notes on the Selenium version code
A few things worth pointing out:
- After locating the next-page button via XPath, you have to take the element at index 0 before calling click(), since find_elements returns a list (a more robust explicit-wait alternative is sketched after this list):
next_page = self.driver.find_elements_by_xpath("//span[@class='pager_next ']")
next_page[0].click()
- Running in headless mode throws an "element is not clickable at point" error, which is why the line
options.add_argument('headless')
is commented out (a common headless workaround is also sketched below).
- Another thing is the sleep interval. On my first run I slept 5 seconds between pages and got redirected to the login page at page 12; after bumping it to 8 seconds, all 30 pages could be crawled in one go.
- Finally, the original blog's code is honestly rather hard to read. Mine is purely procedural: handle the first page, then the remaining pages in order. The flow is obvious at a glance, and the code is clear and easy to follow.
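As promised above, instead of relying on implicitly_wait before clicking, Selenium's explicit waits can block until the button is actually clickable. This is not from the original post, just a minimal sketch assuming the same pager_next span and the Selenium 3.x API:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


def click_next_page(driver, timeout=10):
    # Block until the next-page button is present and clickable,
    # then click it; raises TimeoutException if it never shows up.
    next_btn = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.XPATH, "//span[@class='pager_next ']"))
    )
    next_btn.click()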
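On the headless error: it is often caused by headless Chrome's small default viewport pushing the button off-screen, so setting an explicit window size is a commonly suggested workaround (an assumption I have not verified against Lagou). Randomizing the sleep can also make the pacing look less robotic. A sketch:

import random
import time
from selenium.webdriver.chrome.options import Options as ChromeOptions

options = ChromeOptions()
options.add_argument('headless')
# A larger explicit viewport often fixes "element is not clickable
# at point" in headless mode (workaround, not verified here).
options.add_argument('window-size=1920,1080')

# A randomized 8-12 second pause between pages instead of a fixed 8.
time.sleep(random.uniform(8, 12))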
Data screenshot
The top half of the screenshot was generated by the Selenium crawler, the bottom half by the Requests version.
Aside
Both versions of the script currently default to crawling Linux ops positions. If you want command-line support for crawling arbitrary positions, email me ([email protected]) for a paid customized script.
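For the curious, the search keyword is simply percent-encoded into the list_ path segment of the URL used above (linux%E8%BF%90%E7%BB%B4 is just 'linux运维'), so parameterizing it is mostly URL building. A rough sketch, assuming the URL format stays as shown and with hypothetical argparse names:

import argparse
from urllib.parse import quote


def build_search_url(keyword: str) -> str:
    # quote('linux运维') -> 'linux%E8%BF%90%E7%BB%B4', matching the URL above.
    return ('https://www.lagou.com/jobs/list_{}'
            '?labelWords=&fromSearch=true&suginput='.format(quote(keyword)))


parser = argparse.ArgumentParser(description='Lagou job crawler')
parser.add_argument('keyword', help="job keyword, e.g. 'linux运维'")
args = parser.parse_args()
lagou = Lagou(build_search_url(args.keyword))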