This is a crawler that collects anime data from various streaming websites and generates output in a specific CSV format. The output can then be copied and pasted into the data files used by the Enhanced Bangumi API project. This crawler currently supports the following streaming websites:
Running this crawler requires Python 3. Simply run `pip install -r requirements.txt` to install all the dependencies, or `conda install --yes --file requirements.txt` if you are using a Conda environment.
This crawler requires the following arguments. Taking Non Non Biyori Season 1 as an example, you need:
- Subject URL (required; flags `-u` or `--url`) - The URL of the anime series, for example, `https://www.crunchyroll.com/non-non-biyori`.
- Bangumi subject ID (required; flags `-s` or `--subject`) - The subject ID on Bangumi. For example, if the Bangumi URL is https://bgm.tv/subject/78405, then `78405` is the subject ID.
- Bangumi episode IDs (required; flags `-e` or `--episodes`) - A list of episode IDs on Bangumi. The sequence of IDs must match the sequence of episodes listed on the streaming website, and the IDs must be provided in the page-range format (see the definition here). For example, `319289-319299,320742` means that episode 1 on Crunchyroll has episode ID 319289 on Bangumi, episode 11 has episode ID 319299, and episode 12 has episode ID 320742. You may also skip episodes that are listed on the streaming website but not on Bangumi by using two consecutive commas, like `705488-705491,,705492-705494`.
- Crunchyroll collection ID (optional; flag `--cr-collection`) - Collection ID on Crunchyroll. This argument is only required when a Crunchyroll series has multiple collections (multiple seasons or dubbed versions). For example, season 1 of Non Non Biyori has collection ID 21335, so you must provide `--cr-collection 21335` to crawl data for that particular season. If you do not specify this argument but multiple collections exist in the series, the crawler will print all available collections and stop. You can use this trick to find out which collection ID to use.
- Funimation season ID (optional; flag `--funi-season`) - Season ID on Funimation, similar to the Crunchyroll collection ID.
- Netflix season ID (optional; flag `--nflx_season`) - Season ID on Netflix, similar to the Crunchyroll collection ID.
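The page-range episode ID format described above can be expanded with a short helper like the following. This is an illustrative sketch of the parsing rules, not the crawler's actual implementation; the function name `parse_episode_ids` is hypothetical.

```python
def parse_episode_ids(spec):
    """Expand a page-range spec such as '319289-319299,320742' into a list.

    An empty segment (two consecutive commas) yields None, marking an
    episode that is listed on the streaming website but not on Bangumi.
    """
    ids = []
    for part in spec.split(","):
        if not part:            # empty segment: skipped episode
            ids.append(None)
        elif "-" in part:       # inclusive range, e.g. 319289-319299
            start, end = part.split("-")
            ids.extend(range(int(start), int(end) + 1))
        else:                   # single episode ID
            ids.append(int(part))
    return ids
```

For the Non Non Biyori example, `parse_episode_ids("319289-319299,320742")` yields twelve IDs: the eleven-episode range followed by the single ID 320742.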
With all the arguments prepared, the crawler can be run in two modes from the root directory:
- Interactive mode: Execute without arguments, that is, `python crawler`. It will prompt you for the arguments one by one.
- Quiet mode: Execute with arguments, for example, `python crawler -u "https://www.crunchyroll.com/non-non-biyori" -s 78405 -e "319289-319299,320742" --cr-collection 21335`.
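The documented flags map naturally onto Python's `argparse`. The sketch below is a hypothetical declaration of the command-line interface based solely on the flags listed above; the crawler's real argument handling may differ.

```python
import argparse

# Declare the flags documented in this README (illustrative only).
parser = argparse.ArgumentParser(prog="crawler")
parser.add_argument("-u", "--url", help="URL of the anime series")
parser.add_argument("-s", "--subject", help="Bangumi subject ID")
parser.add_argument("-e", "--episodes", help="Bangumi episode IDs in page-range format")
parser.add_argument("--cr-collection", help="Crunchyroll collection ID (optional)")
parser.add_argument("--funi-season", help="Funimation season ID (optional)")
parser.add_argument("--nflx_season", help="Netflix season ID (optional)")

# Parse the quiet-mode example from above.
args = parser.parse_args([
    "-u", "https://www.crunchyroll.com/non-non-biyori",
    "-s", "78405",
    "-e", "319289-319299,320742",
    "--cr-collection", "21335",
])
```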
The first line of the output is the source record, and the following are episode records.
Due to geo restrictions, crawling anime from Crunchyroll only works with a US IP address.
If you need to crawl anime series with explicit content on Crunchyroll, you must have an adult Crunchyroll account to bypass the maturity wall. Set the following environment variables before running the crawler:

```shell
export CR_ACCOUNT=<account>
export CR_PASSWORD=<password>
```
Because there is no good way to bypass Incapsula bot detection, crawling anime from Funimation requires some extra manual work. On the anime series page, use your browser's inspector and search for `TITLE_DATA` to get the series ID. Then construct the URL containing that ID, for example, https://www.funimation.com/shows/594522/. URLs like https://www.funimation.com/shows/sword-art-online/ are not accepted at the moment.
Due to geo restrictions, crawling anime from iQIYI only works with a Chinese IP address. If you are not in mainland China, you can build a simple proxy service on Alibaba Cloud Function Compute, Tencent Cloud Serverless Cloud Function, etc., and set the following environment variable before running the crawler:

```shell
export CN_PROXY=<proxy-url>
# Example of proxy-url: https://xxx.cn-shanghai.fc.aliyuncs.com/2016-08-15/proxy/proxy/proxy/?url=%s
```

This proxy service should take an encoded iQIYI API URL as input, make the request to that API, and return the response to the crawler. You can customize your proxy service's query string or URL format, but make sure the proxy URL contains exactly one `%s` so that the crawler can substitute the API URL correctly.
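The `%s` substitution described above presumably works like the sketch below: the iQIYI API URL is percent-encoded and inserted into the `CN_PROXY` template. This is an assumption about the mechanism for illustration; the function name `build_proxied_url` and the fallback template are hypothetical.

```python
import os
from urllib.parse import quote

def build_proxied_url(api_url, proxy_template):
    # Encode with safe="" so that "://", "?", and "=" in the API URL
    # are all percent-encoded before substitution into the template.
    return proxy_template % quote(api_url, safe="")

# Read the template from the environment, falling back to a placeholder.
template = os.environ.get("CN_PROXY", "https://example.com/proxy/?url=%s")
proxied = build_proxied_url("https://pcw-api.iqiyi.com/albums/album/1", template)
```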
All references can be found in the source code. Special thanks to the following open-source projects:
- Bangumi Data Helper: APIs for most Chinese streaming websites
- CrunchyrollDownloaderPy, Crunchyroll API Wiki and CR-Unblocker Server: Crunchyroll APIs
- Tencent Video Spider: Tencent Video APIs
- ViuTV API: Viu APIs
- WeVideo: LeTV APIs
- youtube-dl: Some APIs and extracting data from websites
© 101对双生儿 2020.