A simple crawler written in Python

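Over the past few days I have been writing a simple crawler in Python, mainly to scrape vulnerability databases that are updated in real time.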

My skills are still pretty weak, so I ran into quite a few problems along the way. Here is a short summary.

The site being crawled is https://www.seebug.org

The Python version is 3.x

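Because the site is served over https, fetching the page source with requests kept failing with SSL errors at first. Frustrated, I ended up switching to selenium. After installing selenium with pip, the next step is to download a browser driver. I downloaded chromedriver, and it has to be placed in the Application folder.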
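
Selenium can only start Chrome if it can locate the chromedriver executable. As a minimal sketch (the helper name `find_driver` and its `extra_dirs` parameter are hypothetical, not part of selenium), the lookup can be checked with the standard library before launching the browser:

```python
import os
import shutil


def find_driver(name="chromedriver", extra_dirs=()):
    """Return the full path of the driver executable, or None if not found.

    Directories in `extra_dirs` are searched before those listed in PATH.
    """
    dirs = list(extra_dirs) + os.environ.get("PATH", "").split(os.pathsep)
    search_path = os.pathsep.join(d for d in dirs if d)
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    driver_path = find_driver()
    if driver_path is None:
        print("chromedriver not found; add its folder to PATH or pass it in extra_dirs")
    else:
        print("using driver at", driver_path)
```

If this prints a path, the driver is reachable from the current environment. If not, the folder holding chromedriver probably needs to be added to PATH.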

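Getting the site's cookies: at first I used cookiejar, but it did not work well. In the end selenium turned out to be more convenient, since it simply drives a real browser. Easy and reliable!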
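
Once the browser has the cookies, they still need to be turned into a `Cookie` header for requests. Here is a small sketch of that step. The helper `build_cookie_header` is hypothetical and the cookie values are made up. The input has the shape selenium's `get_cookies()` returns, a list of dicts with `name` and `value` keys:

```python
def build_cookie_header(cookies, names):
    """Join selected cookies into a 'name=value; name=value' header string.

    `cookies` is a list of dicts with 'name' and 'value' keys.
    Names that are not present in the list are skipped instead of raising.
    """
    by_name = {c["name"]: c["value"] for c in cookies}
    return "; ".join(f"{n}={by_name[n]}" for n in names if n in by_name)


# Stand-in for chrome.get_cookies(); these values are invented for the example.
sample = [
    {"name": "__jsluid", "value": "abc123"},
    {"name": "__jsl_clearance", "value": "1510000000.0|0|xyz"},
]
header = build_cookie_header(sample, ["__jsluid", "__jsl_clearance", "csrftoken"])
print(header)
```

Skipping missing names avoids the TypeError that occurs when a cookie lookup returns None and the code then indexes it with `['value']`.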

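Extracting the content of the relevant tags: I started out parsing with beautifulsoup, but there are five fields to extract, and testing showed that xpath was simpler than beautifulsoup, so I went with xpath. The result returned by html.xpath is a list, so I converted each one to str before writing it to the csv.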
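
The list-to-str conversion can be wrapped in a small helper. This is only a sketch: `xpath_text` is a hypothetical name, and the lists below are made-up stand-ins for what a `text()` query might return, so the example runs without lxml or a network connection:

```python
def xpath_text(nodes, sep=""):
    """Collapse an XPath text() result (a list of strings) into one clean string."""
    return sep.join(part.strip() for part in nodes if part.strip())


# Stand-ins for html.xpath('.../text()') results; the values are invented.
title = ["\n  Example vulnerability title  \n"]
missing = []  # the XPath matched nothing

print(repr(xpath_text(title)))
print(repr(xpath_text(missing)))
```

An empty list becomes an empty string, so a field that is missing on the page ends up as an empty csv cell rather than an error.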

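The crawl results need to be saved as a csv file. At first nothing would get written, and the cause turned out to be "w" versus "wb": in Python 3 the csv module expects a file opened in text mode, so "w" is the one to use. With csv files it is also worth paying attention to the difference between writerow (one row) and writerows (a list of rows).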
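
A minimal sketch of the csv step in Python 3, showing both the file mode and the two writer methods. The function name `save_rows` and the explicit utf-8 encoding are my own choices for the example:

```python
import csv


def save_rows(path, header, rows):
    """Write a header plus data rows to `path` as CSV (Python 3)."""
    # Text mode "w" with newline="" lets the csv module control line endings.
    # Binary mode "wb" is the Python 2 idiom; in Python 3 the writer would
    # raise TypeError because it writes str, not bytes.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)  # one row: a single list of fields
        writer.writerows(rows)   # many rows: a list of lists
```

For example, `save_rows("demo.csv", ["title", "date"], [["t1", "2017-11-01"], ["t2", "2017-11-02"]])` writes three lines. Note that passing a plain string to writerow splits it into one field per character, which is a common source of confusing output.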

While writing this program I hit TypeError and AttributeError many times, but in the end Google and Baidu got me through. Below is a simple demo, to be improved later.

#coding:utf-8

import requests
import re
from selenium import webdriver
import csv
import datetime
import time
from lxml import etree


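def doSth(url):
    # Load the page in Chrome so the cookies the site sets in a browser are available.
    chrome = webdriver.Chrome()
    chrome.get(url)
    #time.sleep(5)

    # Copy the two cookies from the browser session.
    # get_cookie() returns None if the cookie is missing, which raises TypeError here.
    __jsluid = '__jsluid=' + chrome.get_cookie('__jsluid')['value'] + ';'
    __jsl_clearance = '__jsl_clearance=' + chrome.get_cookie('__jsl_clearance')['value'] + ';'
    #csrftoken='csrftoken='+ chrome.get_cookie('csrftoken')['value']+';'

    chrome.quit()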

    headers={
        "Host": "www.seebug.org",
        "Connection": "close",
        "Cache-Control": "max-age=0",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Referer": "https://www.seebug.org/",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cookie": __jsluid + __jsl_clearance #+ csrftoken
            }

    # verify=False skips certificate checks, so silence the urllib3 warnings it triggers.
    requests.packages.urllib3.disable_warnings()
    res=requests.get(url,headers=headers,verify=False).content
    html=etree.HTML(res)


    # Each xpath() call returns a list; join it into one string per field.
    title=html.xpath('//*[@id="j-vul-title"]/span/text()')
    title_str="".join(title)
    # Not named `time`, so the imported time module is not shadowed.
    pub_time=html.xpath('//*[@id="j-vul-basic-info"]/div/div[1]/dl[2]/dd/text()')
    time_str="".join(pub_time)
    number=html.xpath('//*[@id="j-vul-basic-info"]/div/div[3]/dl[1]/dd/a/text()')
    number_str="".join(number)
    step=html.xpath('//*[@id="j-vul-basic-info"]/div/div[1]/dl[4]/dd/div[1]/@data-original-title')
    step_str="".join(step)
    desc=html.xpath('//*[@id="j-affix-target"]/div[2]/div[1]/section[2]/div[2]/div[2]/p[1]/text()')
    desc_str="".join(desc)
    articles.append([title_str,time_str,number_str,step_str,desc_str])
    


    # Save to a csv file (the whole file is rewritten on every call).
    with open("seebug.csv","w",newline="") as f:
        writer=csv.writer(f,dialect="excel")
        writer.writerow(["Title","Date","ID","Severity","Description"])
        writer.writerows(articles)

articles=[]
url="https://www.seebug.org/vuldb/ssvid-"
for i in range(97210,97213):
    url_data=url+str(i)
    doSth(url_data)


#def main(h=1,m=0):
#    while True:
#        now = datetime.datetime.now()
#        # print(now.hour, now.minute)
#        if now.hour == h and now.minute == m:
#            break
#        # check again every 60 seconds
#        time.sleep(60)
#    doSth()  # the crawler
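
The commented-out scheduler above polls the clock once a minute until a given hour and minute. As an alternative sketch (the helper `seconds_until` is hypothetical and not part of the original script), the delay can be computed once and slept through in a single call:

```python
import datetime


def seconds_until(hour, minute, now=None):
    """Seconds from `now` until the next occurrence of hour:minute."""
    now = now or datetime.datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # The time has already passed today, so aim for tomorrow.
        target += datetime.timedelta(days=1)
    return (target - now).total_seconds()
```

It could then be used as `time.sleep(seconds_until(1, 0))` followed by the crawl loop, which avoids missing the target minute if a poll happens to land just outside it.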