Background
- I have a coding interview at Tencent next Monday, and the interviewer said it will cover crawlers and data processing, so I figured I'd write a crawler for practice. Since I've been job hunting on Lagou for the past few days anyway, Lagou became the practice target.
Environment
- macOS 10.14.6 on a 2017 MacBook Pro
- Selenium==3.11.0
- Python 3.6.4
- Chromedriver 70.0.3538.97
GitHub repository
- https://github.com/supersu097/mycrawler/tree/master/lagou
Implementation references
- Requests version: http://eunsetee.com/EKKQ
- Selenium version: http://eunsetee.com/EKNx
Notes
- The Requests version is copied verbatim from the original author. For the Selenium version I only borrowed the two lines that click through to the next page; the rest is copied from a crawler I wrote earlier. Have a look at the original articles and compare them with my code below.
Selenium version code
```python
# coding: utf-8
from selenium.webdriver.chrome.webdriver import WebDriver as ChromeDriver
from user_agent import generate_user_agent
from os.path import join as os_join
from bs4 import BeautifulSoup
import os
import time
import pandas


class Lagou(object):
    CURR_PATH = os.path.abspath('.')
    CHROME_DRIVER_PATH: str = os_join(CURR_PATH, 'chromedriver_mac')

    def __init__(self, initial_url: str):
        from selenium.webdriver.chrome.options import Options as ChromeOptions
        self.initial_url: str = initial_url
        options: ChromeOptions = ChromeOptions()
        # headless mode triggers "element is not clickable at point", so it stays disabled
        # options.add_argument('headless')
        custom_ua: str = generate_user_agent(os='win')
        options.add_argument('user-agent=' + custom_ua)
        self.driver = ChromeDriver(executable_path=self.CHROME_DRIVER_PATH,
                                   chrome_options=options)

    def get_first_page_source(self):
        """Load the initial search URL and return its page source."""
        try:
            self.driver.implicitly_wait(5)
            self.driver.set_script_timeout(20)
            self.driver.set_page_load_timeout(25)
            self.driver.get(self.initial_url)
            page_source: str = self.driver.page_source
            return page_source
        except KeyboardInterrupt:
            self.driver.quit()
            print('KeyboardInterrupt detected and the chrome driver has quit.')
        except Exception as e:
            print(str(e))
            self.driver.quit()
            exit(1)

    def process_data(self, page_source):
        """Extract company name, position name and salary from the job list."""
        soup = BeautifulSoup(page_source, 'lxml')
        company_list = soup.select('ul.item_con_list li')
        data_list = []
        for company in company_list:
            attrs = company.attrs
            company_name = attrs['data-company']
            job_name = attrs['data-positionname']
            job_salary = attrs['data-salary']
            data_list.append(company_name + ',' + job_name + ',' + job_salary)
        return data_list

    def get_next_page_source(self):
        """Click the "next page" button and return the new page source."""
        try:
            self.driver.implicitly_wait(5)
            self.driver.set_script_timeout(20)
            self.driver.set_page_load_timeout(25)
            next_page = self.driver.find_elements_by_xpath("//span[@class='pager_next ']")
            next_page[0].click()
            page_source: str = self.driver.page_source
            return page_source
        except KeyboardInterrupt:
            self.driver.quit()
            print('KeyboardInterrupt detected and the chrome driver has quit.')
        except Exception as e:
            print(str(e))
            self.driver.quit()
            exit(1)

    @staticmethod
    def save_data(data, csv_header):  # unused; left over from an earlier pandas-based version
        table = pandas.DataFrame(data)
        table.to_csv(r'/Users/sharp/Desktop/LaGou.csv', header=csv_header,
                     index=False, mode='a+')

    def save_data_into_csv(self, line_data):
        with open(r'/Users/sharp/Desktop/LaGou.csv', 'a+') as f:
            f.write(line_data + '\n')


url = 'https://www.lagou.com/jobs/list_linux%E8%BF%90%E7%BB%B4?labelWords=&fromSearch=true&suginput='
lagou = Lagou(url)
print('Get page {} source'.format(str(1)))
first_page_source = lagou.get_first_page_source()
first_page_data = lagou.process_data(first_page_source)
lagou.save_data_into_csv('company_name,job_name,job_salary')
for data in first_page_data:
    lagou.save_data_into_csv(data)
for i in range(1, 30):
    print('Get page {} source'.format(str(i + 1)))
    next_page_source = lagou.get_next_page_source()
    next_page_data = lagou.process_data(next_page_source)
    for data in next_page_data:  # was first_page_data, which rewrote page 1 over and over
        lagou.save_data_into_csv(data)
    time.sleep(8)
```
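For context, process_data leans entirely on the fact that Lagou's list page embeds the company, position and salary as data-* attributes on each `<li>` inside `ul.item_con_list`. Here is a minimal standalone sketch of that parsing against a hand-written HTML fragment; only the attribute names and the CSS selector come from the script above, the markup and values are made up:

```python
# Standalone sketch: same BeautifulSoup logic as process_data, run on a fake fragment.
from bs4 import BeautifulSoup

sample_html = """
<ul class="item_con_list">
  <li data-company="SomeCompany" data-positionname="Linux SRE" data-salary="15k-25k"></li>
  <li data-company="OtherCompany" data-positionname="DevOps" data-salary="20k-30k"></li>
</ul>
"""

soup = BeautifulSoup(sample_html, 'lxml')
for li in soup.select('ul.item_con_list li'):
    attrs = li.attrs
    print(attrs['data-company'], attrs['data-positionname'], attrs['data-salary'])
```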
Notes on the Selenium version code
A few things worth noting:
- find_elements_by_xpath returns a list, so after locating the next-page button you take the first element (index 0) and only then call click():
```python
next_page = self.driver.find_elements_by_xpath("//span[@class='pager_next ']")
next_page[0].click()
```
- Running in headless mode fails with an "element is not clickable at point" error, so the line
options.add_argument('headless')
is commented out.
- The other thing is the sleep interval. On my first run I slept 5 seconds between pages and got redirected to the login page at page 12; after raising it to 8 seconds the crawler got through all 30 pages in one go. (A more defensive variant of the click-and-wait is sketched after this list.)
- Finally, the original author's code is honestly rather hard to read. Mine is purely procedural: handle the first page, then work through the remaining pages in order, so it's easy to follow at a glance.
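If you do want to retry headless mode or make the pagination less brittle, one option is to wait explicitly for the button to become clickable instead of relying only on the implicit wait, and to randomize the pause between pages. This is just a sketch of an alternative built on Selenium's standard WebDriverWait / expected_conditions helpers, not something the script above actually does:

```python
# Sketch: explicit wait for the "next page" button plus a randomized delay.
# Assumes the same driver instance and the same 'pager_next ' class as the script above.
import random
import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def click_next_page(driver, timeout=10):
    # Wait until the button is actually clickable rather than merely present in the DOM.
    next_btn = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.XPATH, "//span[@class='pager_next ']"))
    )
    next_btn.click()
    # Randomize the pause between pages (the post settled on a fixed 8 seconds).
    time.sleep(random.uniform(8, 12))
```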
Data screenshot
The top half of the screenshot was generated by the Selenium crawler, the bottom half by the requests version.
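If you'd rather inspect the result programmatically than eyeball a screenshot, the CSV written by save_data_into_csv can be loaded with pandas; the path and column names below come from the script above, the snippet itself is only a suggestion:

```python
# Quick look at the CSV produced by the crawler.
import pandas

df = pandas.read_csv(r'/Users/sharp/Desktop/LaGou.csv')
print(df.shape)   # number of scraped rows and columns
print(df.head())  # first few company_name / job_name / job_salary rows
```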
Aside
At the moment both scripts crawl Linux ops (linux运维) positions by default. If you want a version that takes an arbitrary job title on the command line, you can email me ([email protected]) for a paid custom script.