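Scraping bjidex.com product data with Selenium: no matter how many times I try, I only get 40 rows. I've been racking my brain for ages and can't figure it out.

c137Toma · 3 days ago · 1001 views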

    My code:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import pandas as pd
    import time
    
    # Set Chrome options to enable headless mode and a custom user agent
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument(
        f'--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"'
    )
    
    # Initialize the WebDriver
    driver = webdriver.Chrome(
        executable_path="D:/lab/chromedriver-win64/chromedriver.exe", options=chrome_options
    )  
    url = "https://webs.bjidex.com/sys-bsc-home/#/bscConsole/tradingMarket"
    
    # Open the page
    driver.get(url)
    
    # Initialize a list to hold the data
    data_list = []
    
    # Scrape the data
    for page in range(184):
    
        # Build the XPath of the next-page button
        if page < 4:
            next_button_xpath = "/html/body/div[2]/div[2]/div/div/div[2]/div/div/section/div/div/div/div[7]/div/div[2]/ul/li[10]/button"
        elif 4 < page < 181:
            next_button_xpath = "/html/body/div[2]/div[2]/div/div/div[2]/div/div/section/div/div/div/div[7]/div/div[2]/ul/li[12]/button"
        elif 181 < page:
            next_button_xpath = "/html/body/div[2]/div[2]/div/div/div[2]/div/div/section/div/div/div/div[7]/div/div[2]/ul/li[10]/button"
        else:
            next_button_xpath = "/html/body/div[2]/div[2]/div/div/div[2]/div/div/section/div/div/div/div[7]/div/div[2]/ul/li[11]/button"
    
        # Scrape the 10 items on each page
        for i in range(1, 11):
            time.sleep(1)  # wait for the page to load new content
            # Build the XPaths for this item
            product_name_xpath = f"/html/body/div[2]/div[2]/div/div/div[2]/div/div/section/div/div/div/div[7]/div/div[1]/div/ul/li[{i}]/div[1]/div/div[1]/div/span[1]"
            supplier_list_xpath = f"/html/body/div[2]/div[2]/div/div/div[2]/div/div/section/div/div/div/div[7]/div/div[1]/div/ul/li[{i}]/div[1]/div/div[3]/span"
            product_type_xpath = f"/html/body/div[2]/div[2]/div/div/div[2]/div/div/section/div/div/div/div[7]/div/div[1]/div/ul/li[{i}]/div[1]/div/div[1]/div/span[2]"
            application_scenario_xpath = f"/html/body/div[2]/div[2]/div/div/div[2]/div/div/section/div/div/div/div[7]/div/div[1]/div/ul/li[{i}]/div[1]/div/div[3]/div"
            product_description_xpath = f"/html/body/div[2]/div[2]/div/div/div[2]/div/div/section/div/div/div/div[7]/div/div[1]/div/ul/li[{i}]/div[1]/div/div[2]"
            price_xpath = f"/html/body/div[2]/div[2]/div/div/div[2]/div/div/section/div/div/div/div[7]/div/div[1]/div/ul/li[{i}]/div[2]/div[1]"
    
            try:
                # Product name as given by the supplier
                product_name = driver.find_element(By.XPATH, product_name_xpath).text
    
                # Data supplier list
                supplier_list = driver.find_element(By.XPATH, supplier_list_xpath).text
    
                # Product type
                product_type = driver.find_element(By.XPATH, product_type_xpath).text
    
                # Application scenario
                application_scenario = driver.find_element(
                    By.XPATH, application_scenario_xpath
                ).text
    
                # Product description
                product_description = driver.find_element(
                    By.XPATH, product_description_xpath
                ).text
    
                # Price
                price = driver.find_element(By.XPATH, price_xpath).text
    
                # Append the record to the list
                data_list.append(
                    {
                        "页数": page + 1,
                        "供应商提供商品名称": product_name,
                        "数据供应商名单": supplier_list,
                        "商品类型": product_type,
                        "应用场景": application_scenario,
                        "商品描述": product_description,
                        "价格": price,
                    }
                )
            except Exception as e:
                print(f"Error on page {page + 1}, item {i}: {e}")
    
        # Click the next-page button
        try:
            next_button = WebDriverWait(driver, 2).until(  # wait up to 2 seconds
                EC.element_to_be_clickable((By.XPATH, next_button_xpath))
            )
            next_button.click()
        except Exception as e:
            print("翻页出错或已经是最后一页:", e)
            break  # leave the loop if the page cannot be turned
    
    # Close the browser
    driver.quit()
    
    # Convert the list to a DataFrame
    data_df = pd.DataFrame(data_list)
    
    # Write the table to disk
    data_df.to_csv(
        "bjidex.com_data.csv", index=False, encoding="utf_8_sig"
    )  # save as a CSV file
    print(data_df)  # print the DataFrame
    

    Output:

    PS D:\lab\bigdata24.9.9> & C:/tools/miniconda3/python.exe d:/lab/bigdata24.9.9/bjidex.com.py
    
    DevTools listening on ws://127.0.0.1:60057/devtools/browser/66a61aa5-3598-4069-94bd-d4f10be20d96
    [42892:8184:0915/232611.717:ERROR:ssl_client_socket_impl.cc(882)] handshake failed; returned -1, SSL error code 1, net_error -101
    [42892:8184:0915/232611.834:ERROR:ssl_client_socket_impl.cc(882)] handshake failed; returned -1, SSL error code 1, net_error -101
    [42892:8184:0915/232630.213:ERROR:ssl_client_socket_impl.cc(882)] handshake failed; returned -1, SSL error code 1, net_error -101
    翻页出错或已经是最后一页: Message: 
    
        页数  ...       价格
    0    1  ...   0.5 元/次
    1    1  ...     0 元/次
    2    1  ...     0 元/次
    3    1  ...   2.5 元/次
    ...
    37   4  ...   0.2 元/次
    38   4  ...   0.1 元/次
    39   4  ...  0.15 元/次
    
    [40 rows x 7 columns]
    
    8 replies · 2024-09-17 00:02:31 +08:00
    mumbler · #1 · 3 days ago
    Debug with a visible desktop browser: remove headless and check whether the page renders correctly.
    xe2vdw · #2 · 3 days ago via Android
    elif 4 < page < 181:
    This has no case for = 4.
    HUZHUANGZHUANG · #3 · 2 days ago
    Get an AI to look over your code for problems.
    c137Toma (OP) · #4 · 2 days ago
    @xe2vdw The else branch covers = 4 and = 181.
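
To see which branch each loop index actually hits, the conditions from the original post can be copied into a small function and printed for a few values. This is only a sketch: `next_button_li` is a hypothetical helper name, and it returns just the `li[...]` index used in the XPath rather than the full path.

```python
def next_button_li(page: int) -> int:
    """Mirror the branch logic of the original post.

    `page` is the 0-based loop index, so the page shown in the
    browser is `page + 1`.
    """
    if page < 4:
        return 10
    elif 4 < page < 181:
        return 12
    elif 181 < page:
        return 10
    else:  # only reached when page == 4 or page == 181
        return 11


for page in (0, 2, 3, 4, 5, 180, 181, 182):
    print(
        f"loop index {page:>3} -> displayed page {page + 1:>3} "
        f"-> li[{next_button_li(page)}]"
    )
```

Loop index 3 (displayed page 4) still falls into the first branch and gets `li[10]`; the `else` branch only fires at loop indices 4 and 181, that is, on displayed pages 5 and 182. If the button does move on displayed pages 4 and 181, as the OP reports in this thread, that would be consistent with the run stopping after 4 pages (40 rows).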
    c137Toma (OP) · #5 · 2 days ago
    I noticed that the next-page button changes on page 4 and on page 181. Inside the for loop, doesn't page = 3 correspond to page 4, and page = 180 to page 181?
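
One way to avoid tracking the button's position at all is to select the last `li` of the pagination list with the XPath `last()` function. The sketch below assumes the next-page button always sits in the final `li`, which the `li[10]`, `li[11]` and `li[12]` paths in the original post suggest but which has not been verified against the site. It demonstrates the idea on a toy list with the standard library's `xml.etree.ElementTree`; `make_pager` and `find_next_button` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Pagination <ul> path copied from the original post.
PAGINATION_UL_XPATH = (
    "/html/body/div[2]/div[2]/div/div/div[2]/div/div/section"
    "/div/div/div/div[7]/div/div[2]/ul"
)
# Assumption: the next-page button is always inside the last <li>.
NEXT_BUTTON_XPATH = PAGINATION_UL_XPATH + "/li[last()]/button"


def make_pager(n_items: int) -> ET.Element:
    """Build a toy <ul> whose last <li> holds the 'next' button."""
    ul = ET.Element("ul")
    for i in range(1, n_items + 1):
        li = ET.SubElement(ul, "li")
        button = ET.SubElement(li, "button")
        button.text = "next" if i == n_items else str(i)
    return ul


def find_next_button(ul: ET.Element) -> ET.Element:
    """Pick the button in the last <li>, however many <li> there are."""
    return ul.find("li[last()]/button")


for n_items in (10, 11, 12):
    button = find_next_button(make_pager(n_items))
    print(f"{n_items} items -> selected button: {button.text}")
```

In the Selenium script, `NEXT_BUTTON_XPATH` could then stand in for the four hard-coded branches, for example in `EC.element_to_be_clickable((By.XPATH, NEXT_BUTTON_XPATH))`.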
    1423 · #6 · 2 days ago
    Is this for work, or are you a student?
    If it's for work, this is far too amateurish.
    blackeeper · #7 · 2 days ago
    Wouldn't it be nicer to call the API directly?
    ```python
    import requests

    url = 'https://webs.bjidex.com/api/dstp/data-asset-server/dataProduct/deal/list'
    data = {"searchName": "", "productType": "", "industry": "", "viewCode": 1, "pageSize": 10, "pageNum": 8}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Referer': 'https://webs.bjidex.com/sys-bsc-home/',
        'Origin': 'https://webs.bjidex.com',
    }
    response = requests.post(url, json=data, headers=headers)
    print(response.text)
    ```
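
The single request above fetches one page (`pageNum` 8). A rough sketch of extending it to every page follows. It assumes `pageNum` is 1-based and that the endpoint returns JSON, neither of which is confirmed in this thread, and it stores each response as-is because the response schema is not shown here. `build_payload`, `fetch_page` and `crawl` are hypothetical helper names; `requests` is imported inside `fetch_page` so the other helpers can be tried without it.

```python
import time
from typing import Callable, Dict, List

API_URL = "https://webs.bjidex.com/api/dstp/data-asset-server/dataProduct/deal/list"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    ),
    "Referer": "https://webs.bjidex.com/sys-bsc-home/",
    "Origin": "https://webs.bjidex.com",
}


def build_payload(page_num: int, page_size: int = 10) -> Dict:
    """Request body from the reply above, with the page number variable."""
    return {
        "searchName": "",
        "productType": "",
        "industry": "",
        "viewCode": 1,
        "pageSize": page_size,
        "pageNum": page_num,
    }


def fetch_page(page_num: int) -> Dict:
    """POST one page request and return the decoded JSON (assumed format)."""
    import requests  # third-party; only needed for real network calls

    response = requests.post(
        API_URL, json=build_payload(page_num), headers=HEADERS, timeout=10
    )
    response.raise_for_status()
    return response.json()


def crawl(
    total_pages: int,
    fetch: Callable[[int], Dict] = fetch_page,
    delay: float = 0.5,
) -> List[Dict]:
    """Fetch pages 1..total_pages in order, pausing `delay` seconds between."""
    results = []
    for page_num in range(1, total_pages + 1):
        results.append(fetch(page_num))
        time.sleep(delay)
    return results
```

Calling `crawl(184)` would mirror the `range(184)` loop in the original post, although the real page count may differ. The `fetch` parameter accepts any function that takes a page number, so the loop can be tried offline with a stub before sending real requests.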
    c137Toma (OP) · #8 · 2 days ago
    I'm a student. Thanks, everyone.