Selenium Crawler

What is Selenium?#

Selenium is a comprehensive project of tools and libraries that support web browser automation.

It provides extensions to emulate user interaction with browsers, a distribution server for scaling browser allocation, and the infrastructure for implementations of the W3C WebDriver specification, which lets you write interchangeable code for all major web browsers.

The selenium library in Python is the Python-language interface for Selenium. It can operate a browser and interact with pages the way a human user would, which lets it retrieve information from fully rendered web pages.


Because of this, Selenium keeps the code logic simple when scraping certain websites, and there is no need to reverse-engineer obfuscated or encrypted JavaScript.

However, because it works by simulating user operations in a real browser, its scraping efficiency is lower than that of request-based crawlers.
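
Before the full example, here is a minimal sketch of what driving a browser with the library looks like (it assumes Chrome and a matching chromedriver are available; the URL is only a placeholder):

from selenium import webdriver

# Start a Chrome session (requires a matching chromedriver)
driver = webdriver.Chrome()

# Load a page the way a real user would, then read the rendered title
driver.get("https://www.example.com")
print(driver.title)

# Always close the browser when done
driver.quit()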

To demonstrate the power of Selenium, let's take an example:

Scraping the names of a user's fans, and each fan's own follower count, from a Bilibili personal space page.


Note: When scraping data, be aware of the rules in the website's robots.txt, and do not scrape at too high a frequency, to avoid putting load on the website. The fan names and follower counts scraped in this article are public content.
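
A quick way to check such rules from Python is the standard library's robotparser (a sketch; the robots.txt location and the example URL below are only illustrative):

from urllib.robotparser import RobotFileParser

# Load the site's robots.txt (location assumed; adjust to the host you crawl)
rp = RobotFileParser()
rp.set_url("https://space.bilibili.com/robots.txt")
rp.read()

# Ask whether a generic user agent may fetch a given page (uid here is hypothetical)
print(rp.can_fetch("*", "https://space.bilibili.com/1/fans/fans"))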

Installation#

$ pip install selenium

Analyzing the Website#

On the fan page of the personal space, the fan information is located within the <li> elements under the <ul>.

Fan Page
However, each fan's follower count is not stored in the <li>: it is kept on the server rather than in the initially loaded page. Hovering the mouse over a fan's avatar or name triggers JavaScript that fetches the data from the server. This is implemented with AJAX.

AJAX (Asynchronous JavaScript and XML) is a technology that allows for updating parts of a web page without reloading the entire page.

When the mouse hovers over the avatar, a <div id="id-card"> is generated at the end of the <body>:

id-card

The number of fans is located under the <div id="id-card"> in the <span class="idc-meta-item">:

fansNum
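
Putting these two observations together, the hover-then-read pattern looks roughly like this (a rough sketch only, assuming a driver that has already opened the fans page; the selectors are the ones used later in the full script, and the waiting is simplified):

from selenium.webdriver.common.action_chains import ActionChains
import time

# Hover over the first fan's avatar to trigger the AJAX request
avatar = driver.find_element("xpath", '//ul[@class="relation-list"]/li//a[@class="cover"]')
ActionChains(driver).move_to_element(avatar).perform()
time.sleep(2)  # crude wait for the card to be injected into <body>

# Read the follower count from the generated card
card = driver.find_element("xpath", '//div[@id="id-card"]')
fans = card.find_elements("css selector", "span.idc-meta-item")[1].get_attribute("textContent")
print(fans)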

Matching Methods#

There are many methods to match elements in Selenium:

  • xpath (most commonly used)
  • by id
  • by name/tag name/class name
  • by link
  • by css selector

XPath is useful because it can match elements with relative paths and has a simple syntax.
For example, to match the info card <div id="id-card">, you can write:

//div[@id="id-card"]

And the position of that element in the HTML:

<html>
  ...
  <body>
    ...
    <div id="id-card">
      ...
    </div>
  </body>
</html>

Of course, CSS selectors can also be very useful sometimes:

HTML:

<html>
 <body>
  <p class="content">Site content goes here.</p>
 </body>
</html>

CSS selector:

p.content
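
Both styles plug into the same find_element call. A small sketch using the By constants, which are the idiomatic way to spell the locator type (the plain strings "xpath" and "css selector", used later in this post, work as well):

from selenium.webdriver.common.by import By

# XPath: the info card injected at the end of <body>
card = driver.find_element(By.XPATH, '//div[@id="id-card"]')

# CSS selector: the <p class="content"> element from the snippet above
content = driver.find_element(By.CSS_SELECTOR, "p.content")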

Writing the Crawler#

Crawler flowchart


Initialization:

def initDriver(url):
    # Set up a headless Chrome browser
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    # Suppress chromedriver's console log messages
    options.add_experimental_option('excludeSwitches', ['enable-logging'])

    # Initialize the driver and the action chain used for mouse hovers
    driver = webdriver.Chrome(options=options)
    actions = ActionChains(driver)

    # Open the link and wait implicitly for elements to appear
    driver.get(url)
    driver.implicitly_wait(10)

    return driver, actions

Getting the page number:

def getPageNum(driver):
    # Locate the pager at the bottom of the page and read the total page count.
    # The pager text is split on spaces; the second token is assumed to be the page count.
    text = driver.find_element("xpath", '//ul[@class="be-pager"]/span[@class="be-pager-total"]')\
                 .get_attribute("textContent")\
                 .split(' ')
    return text[1]

Iterating through all pages:

def spawnCards(page, driver, actions):
    # Iterate through all pages
    for i in range(1, int(page) + 1):
        print(f"get data in page {i}\n")
        # Trigger ajax to generate card
        spawn(driver, actions)
        if i != int(page):
            # Go to next page
            goNextPage(driver, actions)
            time.sleep(6)

Generating cards:

def spawn(driver, actions):
    # Get the list of fans on the current page
    ulList = driver.find_elements("xpath", '//ul[@class="relation-list"]/li')
    # Generate a card for each fan
    for li in ulList:
        getCard(li, actions)
        time.sleep(2)

def getCard(li, actions):
    # Hover over the fan's avatar to trigger the AJAX request that creates the info card
    cover = li.find_element("xpath", './/a[@class="cover"]')
    actions.move_to_element(cover)
    actions.perform()
    actions.reset_actions()

Getting and storing data:

def writeData(driver):
    # Get the list of generated info cards
    cardList = driver.find_elements("xpath", '//div[@id="id-card"]')
    for card in cardList:
        up_name = card.find_element("xpath", './/img[@class="idc-avatar"]').get_attribute("alt")
        # The second idc-meta-item holds the follower count
        up_fansNum = card.find_elements('css selector', 'span.idc-meta-item')[1].get_attribute("textContent")
        print(f'name:{up_name}, {up_fansNum}')
        # Append to the csv file
        with open('.\\date.csv', mode='a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow([up_name, up_fansNum])
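
As a small optional refinement (a sketch, not part of the original script): the CSV file only needs to be opened once, and a header row makes the output easier to read later. Note that mode 'w' overwrites the file on every run:

def writeData(driver):
    cardList = driver.find_elements("xpath", '//div[@id="id-card"]')
    with open('.\\date.csv', mode='w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['name', 'fans'])
        for card in cardList:
            up_name = card.find_element("xpath", './/img[@class="idc-avatar"]').get_attribute("alt")
            up_fansNum = card.find_elements('css selector', 'span.idc-meta-item')[1].get_attribute("textContent")
            writer.writerow([up_name, up_fansNum])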

Complete code:

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time
import csv

def initDriver(url):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    driver = webdriver.Chrome(options=options)
    actions = ActionChains(driver)
    driver.get(url)
    driver.implicitly_wait(10)
    return driver, actions

def getPageNum(driver):
    text = driver.find_element("xpath", '//ul[@class="be-pager"]/span[@class="be-pager-total"]').get_attribute("textContent").split(' ')
    return text[1]

def goNextPage(driver, actions):
    bottom = driver.find_element("xpath", '//li[@class="be-pager-next"]/a')
    actions.click(bottom)
    actions.perform()
    actions.reset_actions()

def getCard(li, actions):
    cover = li.find_element("xpath", './/a[@class="cover"]')
    actions.move_to_element(cover)
    actions.perform()
    actions.reset_actions()

def writeData(driver):
    # Get card list
    cardList = driver.find_elements("xpath", '//div[@id="id-card"]')
    for card in cardList:
        up_name = card.find_element("xpath", './/img[@class="idc-avatar"]').get_attribute("alt")
        up_fansNum = card.find_elements('css selector','span.idc-meta-item')[1].get_attribute("textContent")
        print(f'name:{up_name}, {up_fansNum}')
        # Write info into csv file
        with open('.\\date.csv', mode='a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow([up_name, up_fansNum])

def spawn(driver, actions):
    # Get card list
    ulList = driver.find_elements("xpath", '//ul[@class="relation-list"]/li')
    # Spawn card
    for li in ulList:
        getCard(li, actions)
        time.sleep(2)
    
def spawnCards(page, driver, actions):
    for i in range(1, int(page) + 1):
        print(f"get data in page {i}\n")
        spawn(driver, actions)
        if i != int(page):
            goNextPage(driver, actions)
            time.sleep(6)

def main():
    # Init driver
    uid = input("bilibili uid:")
    url = "https://space.bilibili.com/" + uid + "/fans/fans"
    driver, actions = initDriver(url)
    page = getPageNum(driver)

    # Spawn card info (ajax)
    spawnCards(page, driver, actions)
    writeData(driver)

    driver.quit()

if __name__ == "__main__":
    main()

Results#

Scraped results

Reflection#

Areas for improvement:

  • Due to AJAX asynchronous loading, the crawler has to wait for content to appear before locating elements. Using time.sleep() for this is neither efficient nor elegant; WebDriverWait() solves it by polling for an expected condition and returning as soon as it is met (or timing out). See the sketch after this list.
  • Several XPath expressions repeat the same path prefixes; the shared elements could be located once and reused instead of being re-queried each time.
  • Data extraction could be done concurrently for faster results; however, out of consideration for the server load, only a single-threaded version was written.
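
For the first point, a sketch of how WebDriverWait() could replace the fixed sleeps (the timeout and the condition are illustrative, not tuned values):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Poll for up to 10 seconds until the hover-generated card appears,
# instead of sleeping for a fixed amount of time
card = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//div[@id="id-card"]'))
)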

References#

Selenium documentation

Recommended reading:

AJAX

XPath
