Recently I've been watching "The Daily Life of a Female High School Student", but Bilibili compresses the video heavily, so I went looking for a video-parsing website. The site's JavaScript turned out to be obfuscated, so I decided to reverse engineer it.
Reverse Engineering Target
aHR0cHM6Ly9qeC5wbGF5ZXJqeS5jb20v
Analysis
The video has clearly loaded, yet no request looks obviously related to it.
The response data for these requests cannot be previewed, but they are presumably video segments. Opening the file shows the following:
So, the approximate request process is:
An M3U8 file is a plain-text playlist that lists the URLs of the video segments. With it, we can obtain the URLs for the entire video.
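To make this concrete, here is a minimal sketch of pulling the segment URLs out of a playlist. The sample playlist, the example.com URLs, and the helper name `segment_urls` are all made up for illustration; the real playlist from the site may carry extra tags that this sketch simply skips.

```python
from urllib.parse import urljoin


def segment_urls(playlist_text, playlist_url):
    """Return absolute URLs for every segment listed in an M3U8 playlist."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments; blank lines are skipped.
        if not line or line.startswith("#"):
            continue
        # Segment entries may be relative to the playlist's own URL.
        urls.append(urljoin(playlist_url, line))
    return urls


# Hypothetical playlist, for illustration only.
SAMPLE = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
seg-0001.ts
#EXTINF:8.5,
https://cdn.example.com/video/seg-0002.ts
#EXT-X-ENDLIST
"""

for url in segment_urls(SAMPLE, "https://example.com/1675000000/a/b.m3u8"):
    print(url)
```

Relative entries are resolved against the playlist URL with `urljoin`, while entries that are already absolute pass through unchanged.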
Tracing back, I found the request that fetches the M3U8 file:
The composition of the request URL is:
https://domain/timestamp/unknown/unknown.m3u8 + fixed parameters
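As a small illustration of that layout, a URL of this shape can be split into its pieces with the standard library. The URL below is invented, and the field names in the result are just my own labels for the parts, not anything the site defines.

```python
from urllib.parse import parse_qs, urlsplit


def describe_m3u8_url(url):
    """Split a URL shaped like https://domain/timestamp/.../name.m3u8?params."""
    parts = urlsplit(url)
    # Drop empty strings produced by the leading slash.
    segments = [s for s in parts.path.split("/") if s]
    return {
        "domain": parts.netloc,
        "timestamp": segments[0],
        "unknown": segments[1:-1],   # path pieces whose meaning is not known
        "filename": segments[-1],
        "params": parse_qs(parts.query),
    }


# Hypothetical URL, for illustration only.
example = "https://example.com/1675000000/abcdef/0123.m3u8?foo=1&bar=2"
print(describe_m3u8_url(example))
```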
If you're not careful here, it's tempting to set an XHR breakpoint and start tracing the JavaScript code. I fell into this trap myself at the beginning of my analysis, going back and forth through complex code and various encryption methods.
However, we don't need to construct this URL ourselves. It is returned in the response to the previous request, and in plain text at that!
Let's take a look at the payload of this request:
There are two parameters, time and key. time is a timestamp, and key looks like an MD5 hash. After reverse engineering the JavaScript code, it turned out to be produced with CBC-mode encryption. What left me speechless, though, is that just as I finished analyzing the JavaScript and started simulating the request in Python, I found that key and time were also delivered in plain text in the response to the previous request!!!
Writing Code
# getPram.py
import requests as rq
from lxml import html

headers = {
    "referer": "https://jx.playerjy.com/",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    "Cache-Control": "no-cache",
    "Host": "jy.we-vip.com:5433"
}


def getHtml(ep):
    # Fetch the parser page for the given Bilibili episode URL.
    url = "https://jy.we-vip.com:5433/?url=" + ep
    return rq.get(url=url, headers=headers, verify=False).content


# get time and key
def getData(bili_url):
    htmlString = html.fromstring(getHtml(bili_url))
    # Text of the <script> elements directly under <body>; the first one holds the parameters.
    result = htmlString.xpath('//body/script/text()')
    script = str(result[0])
    # get time: the value starts 8 characters after the start of 'time' and ends at '",'
    timebegin = script.index('time') + 8
    timeend = script.index('",', timebegin)
    time = script[timebegin:timeend]
    # get key: the value starts 7 characters after the start of 'key' and ends at '",'
    keybegin = script.index('key') + 7
    keyend = script.index('",', keybegin)
    key = script[keybegin:keyend]
    return time, key


if __name__ == "__main__":
    # getData needs the Bilibili URL; ask for it instead of calling it with no argument.
    time, key = getData(input("bilibili_url:"))
    print(f"time is {time}, key is {key}")

# Second script: send time and key to API.php
import requests as rq
import getPram

bili_url = input("bilibili_url:")
time, key = getPram.getData(bili_url)
headers = {
    "authority": "jy.we-vip.com:5433",
    "accept": "application/json, text/javascript, */*; q=0.01",
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    "Host": "jy.we-vip.com:5433",
    # Content-Length is left out on purpose: requests computes it from the body.
    "Cache-Control": "no-cache"
}
url = "https://jy.we-vip.com:5433/API.php"
data = {
    # Use the same URL that time and key were fetched for, e.g.
    # https://www.bilibili.com/bangumi/play/ep276690
    "url": bili_url,
    "time": time,
    "key": key
}
rsp = rq.post(url, headers=headers, data=data, verify=False)
print(rsp.content)
Reflection
- My analysis approach was a mess: I went around in circles without properly analyzing the request flow first, and wasted a lot of time. Reverse engineering is like a high school math problem: you start with only a vague direction and need patience to trace things back step by step.
- The code is just as messy, and its logic is not very clear. There are many ways to extract substrings, but I chose the simplest yet least readable one...
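For comparison, here is a sketch of a more readable way to do the substring extraction in getData, using a regular expression instead of hand-counted offsets. It assumes the inline script contains entries shaped like `"time": "…"` and `"key": "…"`, which is my guess based on the offsets used in the code; the helper name `extract_field` and the sample script text are invented for illustration.

```python
import re


def extract_field(script_text, name):
    """Return the double-quoted string value that follows `name:` in the script text."""
    # \b keeps 'key' from matching inside a longer name such as 'hotkey'.
    pattern = r'\b' + re.escape(name) + r'\b["\']?\s*:\s*"([^"]*)"'
    match = re.search(pattern, script_text)
    if match is None:
        raise ValueError(f"field {name!r} not found in script text")
    return match.group(1)


# Hypothetical script content, for illustration only.
SAMPLE_SCRIPT = (
    'var config = {"url": "https://example.com/v", '
    '"time": "1675000000", "key": "0123abcd"};'
)

print(extract_field(SAMPLE_SCRIPT, "time"))
print(extract_field(SAMPLE_SCRIPT, "key"))
```

Unlike `index` with fixed offsets, this fails with a clear error when a field is missing, and it keeps working if the spacing around the colon changes.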