What is Selenium?#
Selenium is a comprehensive project of tools and libraries that support web browser automation.
It provides extensions that emulate user interaction with browsers, a distribution server for scaling browser allocation, and the infrastructure for implementations of the W3C WebDriver specification, which lets you write interchangeable code for all major web browsers.
The Selenium library in Python is the Python binding for Selenium. It can drive a browser the way a human would, operating on pages and retrieving their contents.
Thanks to this, Selenium keeps the code logic simple when scraping certain websites and avoids reverse-engineering JS-encrypted requests.
The trade-off is that, because it simulates real user operations, its scraping efficiency is lower than that of other crawlers.
To demonstrate the power of Selenium, let's work through an example: scraping, from a Bilibili user's personal space, the names of their fans and each fan's own follower count.
Note: when scraping data, respect the rules in the website's robots.txt and keep the request frequency low so as not to burden the site. The fan names and follower counts scraped in this article are public content.
Installation#
$ pip install selenium
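Selenium 4.6 and later ship with Selenium Manager, which resolves a matching ChromeDriver automatically, so no separate driver download is needed. A quick smoke test (assuming Chrome is installed):

from selenium import webdriver

driver = webdriver.Chrome()  # Selenium Manager fetches a matching driver if needed
driver.get("https://www.bilibili.com")
print(driver.title)
driver.quit()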
Analyzing the Website#
On the fan page of the personal space, each fan's information sits in an <li> element under a <ul>.
However, a fan's own follower count is not stored in that <li>; it lives on the server rather than in the page. Moving the mouse over the fan's avatar or name triggers a JS request that fetches the data into the page. This operation is implemented with AJAX.
AJAX (Asynchronous JavaScript and XML) is a technology that allows for updating parts of a web page without reloading the entire page.
When the mouse hovers over an avatar, a <div id="id-card"> is generated at the end of the <body>.
The follower count sits under that <div id="id-card">, in a <span class="idc-meta-item"> (the second such span, as the code below indexes it with [1]).
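Piecing together the selectors used by the crawler below, the generated card looks roughly like this (a reconstruction for orientation, not the exact markup):

<body>
  ...
  <div id="id-card">
    <img class="idc-avatar" alt="fan name" src="...">
    <span class="idc-meta-item">...</span>
    <span class="idc-meta-item">follower count</span>
  </div>
</body>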
Matching Methods#
There are many methods to match elements in Selenium:
- xpath (most commonly used)
- by id
- by name/tag name/class name
- by link text
- by css selector
XPath is useful because it can use relative paths for matching and has simple syntax.
For example, to match the fan avatar, you can write:
//div[@id="id-card"]
And the position of that element in the HTML:
<html>
  ...
  <body>
    ...
    <div id="id-card">
      ...
    </div>
  </body>
</html>
Of course, CSS selectors can also be very useful sometimes:
HTML:
<html>
  <body>
    <p class="content">Site content goes here.</p>
  </body>
</html>
CSS selector:
p.content
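In Selenium's Python API, both strategies go through find_element; a brief sketch (the URL is a placeholder, and the calls assume the loaded page actually contains these elements):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

# XPath: match the hover card by its id attribute
card = driver.find_element(By.XPATH, '//div[@id="id-card"]')

# CSS selector: tag name plus class
content = driver.find_element(By.CSS_SELECTOR, "p.content")

driver.quit()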
Writing the Crawler#
Initialization:
def initDriver(url):
    # Set up a headless browser ('--headless=new' on recent Chrome versions)
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    # Initialize the driver plus an ActionChains helper for mouse actions
    driver = webdriver.Chrome(options=options)
    actions = ActionChains(driver)
    # Open the link; implicitly_wait polls up to 10 s when locating elements
    driver.get(url)
    driver.implicitly_wait(10)
    return driver, actions
Getting the page number:
def getPageNum(driver):
    # Locate the pager total at the bottom of the page and split its text;
    # the text is space-separated and the second token is the page count
    text = driver.find_element("xpath", '//ul[@class="be-pager"]/span[@class="be-pager-total"]')\
        .get_attribute("textContent")\
        .split(' ')
    return text[1]
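For instance, if the pager total reads "共 5 页" (an assumed example of Bilibili's pager text), splitting on the space yields ['共', '5', '页'], and text[1] is the page count '5'.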
Iterating through all pages:
def spawnCards(page, driver, actions):
    # Iterate through all pages
    for i in range(1, int(page) + 1):
        print(f"get data in page {i}\n")
        # Trigger the AJAX that generates the cards on this page
        spawn(driver, actions)
        if i != int(page):
            # Go to the next page and give it time to load
            goNextPage(driver, actions)
            time.sleep(6)
Generating cards:
def spawn(driver, actions):
    # Collect the <li> entries in the fan list
    ulList = driver.find_elements("xpath", '//ul[@class="relation-list"]/li')
    # Hover over each entry to generate its card
    for li in ulList:
        getCard(li, actions)
        time.sleep(2)

def getCard(li, actions):
    # Moving the mouse onto the avatar triggers the AJAX request
    cover = li.find_element("xpath", './/a[@class="cover"]')
    actions.move_to_element(cover)
    actions.perform()
    actions.reset_actions()
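ActionChains queues actions until perform() runs them; calling reset_actions() afterwards clears the queue so the next hover starts from a clean slate.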
Getting and storing data:
def writeData(driver):
    # Collect every generated card
    cardList = driver.find_elements("xpath", '//div[@id="id-card"]')
    for card in cardList:
        # The fan's name is in the avatar's alt attribute;
        # the follower count is the second idc-meta-item span
        up_name = card.find_element("xpath", './/img[@class="idc-avatar"]').get_attribute("alt")
        up_fansNum = card.find_elements('css selector', 'span.idc-meta-item')[1].get_attribute("textContent")
        print(f'name:{up_name}, {up_fansNum}')
        # Append the record to the csv file
        with open('data.csv', mode='a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow([up_name, up_fansNum])
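As a side note, opening the file once outside the loop avoids re-opening it for every card; a minor variant:

def writeData(driver):
    cardList = driver.find_elements("xpath", '//div[@id="id-card"]')
    with open('data.csv', mode='a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for card in cardList:
            up_name = card.find_element("xpath", './/img[@class="idc-avatar"]').get_attribute("alt")
            up_fansNum = card.find_elements('css selector', 'span.idc-meta-item')[1].get_attribute("textContent")
            writer.writerow([up_name, up_fansNum])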
Complete code:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time
import csv

def initDriver(url):
    # Headless Chrome; use '--headless=new' on recent Chrome versions
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    driver = webdriver.Chrome(options=options)
    actions = ActionChains(driver)
    driver.get(url)
    driver.implicitly_wait(10)
    return driver, actions

def getPageNum(driver):
    # The pager total text is space-separated; the second token is the page count
    text = driver.find_element("xpath", '//ul[@class="be-pager"]/span[@class="be-pager-total"]').get_attribute("textContent").split(' ')
    return text[1]

def goNextPage(driver, actions):
    # Click the "next page" button
    button = driver.find_element("xpath", '//li[@class="be-pager-next"]/a')
    actions.click(button)
    actions.perform()
    actions.reset_actions()

def getCard(li, actions):
    # Hovering over the avatar triggers the AJAX that generates the card
    cover = li.find_element("xpath", './/a[@class="cover"]')
    actions.move_to_element(cover)
    actions.perform()
    actions.reset_actions()

def writeData(driver):
    # Collect every generated card
    cardList = driver.find_elements("xpath", '//div[@id="id-card"]')
    for card in cardList:
        up_name = card.find_element("xpath", './/img[@class="idc-avatar"]').get_attribute("alt")
        up_fansNum = card.find_elements('css selector', 'span.idc-meta-item')[1].get_attribute("textContent")
        print(f'name:{up_name}, {up_fansNum}')
        # Append the record to the csv file
        with open('data.csv', mode='a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow([up_name, up_fansNum])

def spawn(driver, actions):
    # Hover over each fan entry to spawn its card
    ulList = driver.find_elements("xpath", '//ul[@class="relation-list"]/li')
    for li in ulList:
        getCard(li, actions)
        time.sleep(2)

def spawnCards(page, driver, actions):
    for i in range(1, int(page) + 1):
        print(f"get data in page {i}\n")
        spawn(driver, actions)
        if i != int(page):
            goNextPage(driver, actions)
            time.sleep(6)

def main():
    # Init driver
    uid = input("bilibili uid:")
    url = "https://space.bilibili.com/" + uid + "/fans/fans"
    driver, actions = initDriver(url)
    page = getPageNum(driver)
    # Spawn card info (ajax)
    spawnCards(page, driver, actions)
    writeData(driver)
    driver.quit()

if __name__ == "__main__":
    main()
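Assuming the script is saved as fans.py (the name and the uid below are illustrative), a run looks like this, with the records appended to data.csv:

$ python fans.py
bilibili uid:1234567
get data in page 1

get data in page 2

...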
Results#
Reflection#
Areas for improvement:
- Due to AJAX's asynchronous loading, the page must finish loading before elements can be located. Waiting with time.sleep() is neither efficient nor elegant; the WebDriverWait() method solves this by polling the page state and returning as soon as the wait condition is met (see the sketch after this list).
- Multiple XPath expressions with repeated path prefixes were used, which consumed more memory than necessary.
- Data extraction could be done concurrently for faster results; however, out of consideration for the server load, only a single-threaded version was written.
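As a minimal sketch of the first point, using the pager locator from this article (the 10-second timeout is arbitrary):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Block until the pager total is present (polls every 0.5 s by default),
# then return the element; raises TimeoutException after 10 s
pager = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, '//ul[@class="be-pager"]/span[@class="be-pager-total"]')))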
References#
- Selenium documentation

Recommended reading:
- AJAX
- XPath