Background

  • I have a coding interview at Tencent next Monday, and the interviewer said it will cover web scraping and data processing, so I figured I'd write a scraper for practice. Since I've been job hunting on Lagou for the past few days anyway, Lagou became the practice target.

Environment

  • macOS 10.14.6 on a MacBook Pro (2017)
  • Selenium==3.11.0
  • Python 3.6.4
  • ChromeDriver 70.0.3538.97

GitHub Repository

Reference Implementations

  • Requests version: http://eunsetee.com/EKKQ
  • Selenium version: http://eunsetee.com/EKNx

Notes

  • The Requests version is copied verbatim from the original author. For the Selenium version I only borrowed the two lines that click through to the next page; the rest is reused from scrapers I wrote earlier. Have a look at the original articles and compare them with my code below.

Selenium Version Code

# coding: utf-8

from selenium.webdriver.chrome.webdriver import WebDriver as ChromeDriver
from user_agent import generate_user_agent
from os.path import join as os_join
from bs4 import BeautifulSoup
import os
import time
import pandas


class Lagou(object):
    CURR_PATH = os.path.abspath('.')
    CHROME_DRIVER_PATH: str = os_join(CURR_PATH, 'chromedriver_mac')

    def __init__(self, initial_url: str):
        from selenium.webdriver.chrome.options import Options as ChromeOptions
        self.initial_url: str = initial_url
        options: ChromeOptions = ChromeOptions()
        # options.add_argument('headless')
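        # headless mode is disabled: clicking the next-page button while headless
        # raised "element is not clickable at point" (see the notes below)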
        custom_ua: str = generate_user_agent(os='win')
        options.add_argument('user-agent=' + custom_ua)
        self.driver = ChromeDriver(executable_path=self.CHROME_DRIVER_PATH,
                                   chrome_options=options)

    def get_first_page_source(self):
        try:
            self.driver.implicitly_wait(5)
            self.driver.set_script_timeout(20)
            self.driver.set_page_load_timeout(25)
            self.driver.get(self.initial_url)
            page_source: str = self.driver.page_source
            return page_source
        except KeyboardInterrupt:
            self.driver.quit()
            print('KeyboardInterrupt detected; '
                  'the Chrome driver has quit.')
        except Exception as e:
            print(str(e))
            self.driver.quit()
            exit(1)

    def process_data(self, page_source):
        soup = BeautifulSoup(page_source, 'lxml')
        company_list = soup.select('ul.item_con_list li')
        data_list = []
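        # each <li> in the result list carries the job info in its data-* attributes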
        for company in company_list:
            attrs = company.attrs
            company_name = attrs['data-company']
            job_name = attrs['data-positionname']
            job_salary = attrs['data-salary']
            data_list.append(company_name + ',' + job_name + ',' + job_salary)
        return data_list

    def get_next_page_source(self):
        try:
            self.driver.implicitly_wait(5)
            self.driver.set_script_timeout(20)
            self.driver.set_page_load_timeout(25)
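            # find_elements_by_xpath returns a list; the next-page button is the
            # first match, so click element 0 (see the notes below)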
            next_page = self.driver.find_elements_by_xpath("//span[@class='pager_next ']")
            next_page[0].click()
            page_source: str = self.driver.page_source
            return page_source
        except KeyboardInterrupt:
            self.driver.quit()
            print('KeyboardInterrupt detected; '
                  'the Chrome driver has quit.')
        except Exception as e:
            print(str(e))
            self.driver.quit()
            exit(1)

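    # alternative pandas-based writer; not called in the script below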
    @staticmethod
    def save_data(data, csv_header):
        table = pandas.DataFrame(data)
        table.to_csv(r'/Users/sharp/Desktop/LaGou.csv', header=csv_header, index=False, mode='a+')

    def save_data_into_csv(self, line_data):
        with open(r'/Users/sharp/Desktop/LaGou.csv', 'a+') as f:
            f.write(line_data+'\n')


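# Lagou search-results URL for "linux运维" (Linux ops) positions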
url = 'https://www.lagou.com/jobs/list_linux%E8%BF%90%E7%BB%B4?labelWords=&fromSearch=true&suginput='
lagou = Lagou(url)
print('Get page {} source'.format(str(1)))
first_page_source = lagou.get_first_page_source()
first_page_data = lagou.process_data(first_page_source)
lagou.save_data_into_csv('company_name,job_name,job_salary')
for data in first_page_data:
    lagou.save_data_into_csv(data)

for i in range(1, 30):
    print('Get page {} source'.format(str(i+1)))
    next_page_source = lagou.get_next_page_source()
    next_page_data = lagou.process_data(next_page_source)
    for data in next_page_data:
        lagou.save_data_into_csv(data)
    time.sleep(8)

Notes on the Selenium Version Code

A few points worth noting – the xpath lookup for the next-page button returns a list, so you have to take the first element (index 0) before calling click(); the code looks like this:

next_page = self.driver.find_elements_by_xpath("//span[@class='pager_next ']")
next_page[0].click()
  • Running in headless mode raises an "element is not clickable at point" error, so the options.add_argument('headless') line is commented out (a possible workaround is sketched after this list).
  • Another point is the sleep interval. On my first run I used 5 seconds and got redirected to the login page at page 12; after bumping it to 8 seconds the scraper made it through all 30 pages in one go.
  • Finally, the original author's code is honestly rather hard to read. Mine is purely procedural: handle the first page, then work through the remaining pages in order, so the flow is obvious and the code is easy to follow.
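
If you do want to try headless mode, one workaround (a sketch only, not verified against Lagou) is to give the headless browser an explicit window size so the page actually lays out, and to fall back to a JavaScript click if the normal click() still fails, since a JS click skips the "clickable at point" check. The delay between pages can also be randomized around the 8-second mark so the request pattern looks less mechanical. The driver and next_page names in the comments refer to the same objects as in the class above.

import random
import time

from selenium.webdriver.chrome.options import Options as ChromeOptions

# give the headless window a real size so elements are laid out and clickable
options = ChromeOptions()
options.add_argument('headless')
options.add_argument('window-size=1920,1080')

# if next_page[0].click() still raises "element is not clickable at point",
# a JavaScript click bypasses the clickability check:
# self.driver.execute_script("arguments[0].click();", next_page[0])

# randomize the pause between pages instead of a fixed 8 seconds
time.sleep(random.uniform(8, 12))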

Data Screenshot

The Lagou scraper implemented with both requests and Selenium: the upper half of the screenshot was produced by the Selenium version, the lower half by the requests version.

Aside

At the moment both versions of the script scrape Linux ops (linux运维) positions by default. If you'd like support for specifying an arbitrary position on the command line, email me ([email protected]) for a paid custom script.