
Data Crawler for Enhanced Bangumi API

This is a crawler that crawls anime data from various streaming websites and generates output in a specific CSV format. The output can then be copied and pasted into the data files used by the Enhanced Bangumi API project. This crawler currently supports the following streaming websites:

  • Crunchyroll
  • Funimation
  • iQIYI (爱奇艺)
  • Netflix

Installation

Running this crawler requires Python 3. Simply run pip install -r requirements.txt to install all the dependencies, or conda install --yes --file requirements.txt if you are using a Conda environment.
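
For example, from the repository root:

pip install -r requirements.txt
# or, in a Conda environment:
conda install --yes --file requirements.txt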

Usage

This crawler requires the following arguments. Taking Non Non Biyori Season 1 as an example, you need:

  • Subject URL (required; flags -u or --url)
    • The URL of the anime series, for example, https://www.crunchyroll.com/non-non-biyori.
  • Bangumi subject ID (required; flags -s or --subject)
    • The subject ID on Bangumi, for example, 78405 for Non Non Biyori Season 1.
  • Bangumi episode IDs (required; flags -e or --episodes)
    • A comma-separated list of Bangumi episode IDs, in which consecutive IDs can be written as a range, for example, 319289-319299,320742.
  • Crunchyroll collection ID (optional; flag --cr-collection)
    • Collection ID on Crunchyroll. This argument is only required when a Crunchyroll series has multiple collections (multiple seasons or a dubbed version). For example, season 1 of Non Non Biyori has collection ID 21335, so you must provide --cr-collection 21335 in order to crawl data for that particular season. If you do not specify this argument but multiple collections exist in the series, the crawler will print all available collections and stop; you can use this behavior to find out which collection ID to use.
  • Funimation season ID (optional; flag --funi-season)
    • Season ID on Funimation, similar to the Crunchyroll collection ID.
  • Netflix season ID (optional; flag --nflx_season)
    • Season ID on Netflix, similar to the Crunchyroll collection ID.
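
For orientation, here is a minimal sketch of how these flags could be declared with Python's argparse. The flag names and example values come from this README; the parser structure itself is illustrative, not necessarily the project's actual code:

import argparse

parser = argparse.ArgumentParser(description="Data crawler for the Enhanced Bangumi API")
# The "required" arguments are enforced by the crawler itself rather than by
# argparse, so that interactive mode can still prompt for them when omitted.
parser.add_argument("-u", "--url", help="subject URL, e.g. https://www.crunchyroll.com/non-non-biyori")
parser.add_argument("-s", "--subject", help="Bangumi subject ID, e.g. 78405")
parser.add_argument("-e", "--episodes", help="Bangumi episode IDs, e.g. 319289-319299,320742")
parser.add_argument("--cr-collection", help="Crunchyroll collection ID, e.g. 21335")
parser.add_argument("--funi-season", help="Funimation season ID")
parser.add_argument("--nflx_season", help="Netflix season ID")
args = parser.parse_args()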

With all the arguments prepared, the crawler can be run in two modes from the directory containing the crawler package (the repository root):

  • Interactive mode: Execute without arguments, that is, python crawler. The crawler will prompt you for each argument in turn.
  • Quiet mode: Execute with arguments, for example, python crawler -u "https://www.crunchyroll.com/non-non-biyori" -s 78405 -e "319289-319299,320742" --cr-collection 21335.

The first line of the output is the source record, and the lines that follow are episode records.

Crunchyroll

Due to geo restrictions, crawling anime from Crunchyroll only works with a US IP address.

If you need to crawl anime series with explicit content on Crunchyroll, you must have an adult Crunchyroll account to bypass the maturity wall. Set the following environment variables before running the crawler:

export CR_ACCOUNT=<account>
export CR_PASSWORD=<password>
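
A minimal sketch of how these variables would be picked up from the environment (the variable names are from this README; everything else, including the warning message, is illustrative):

import os

cr_account = os.environ.get("CR_ACCOUNT")
cr_password = os.environ.get("CR_PASSWORD")
if not (cr_account and cr_password):
    # Without credentials, titles behind the maturity wall cannot be crawled.
    print("CR_ACCOUNT/CR_PASSWORD not set; mature titles will be inaccessible")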

Funimation

Because there is no good way to bypass Incapsula's bot detection, crawling anime from Funimation requires some extra manual work. On the anime series page, use your browser's inspector to search for TITLE_DATA and get the series ID, then construct the URL that contains that ID, for example, https://www.funimation.com/shows/594522/. URLs like https://www.funimation.com/shows/sword-art-online/ are not accepted at the moment.
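
Since the search itself has to happen in the browser, automation can only start once you have the page source. As a hedged sketch, assuming you saved the series page's HTML locally, extracting the ID could look like this (the TITLE_DATA marker is from this README; the regex around it is an assumption about the page's markup):

import re

# Assumes the Funimation series page was saved from the browser,
# since direct requests are blocked by Incapsula.
with open("series_page.html", encoding="utf-8") as f:
    html = f.read()

# The exact markup around TITLE_DATA is an assumption; adjust the pattern
# to match what the inspector actually shows.
match = re.search(r"TITLE_DATA.*?id\W+(\d+)", html, re.DOTALL)
if match:
    print("https://www.funimation.com/shows/%s/" % match.group(1))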

iQIYI (爱奇艺)

Due to geo restrictions, crawling anime from iQIYI only works with a Chinese IP address.

If you are not in mainland China, you can build a simple proxy service on Alibaba Cloud Function Compute, Tencent Cloud Serverless Cloud Function, etc., and set the following environment variable before running the crawler:

export CN_PROXY=<proxy-url>
# Example of proxy-url: https://xxx.cn-shanghai.fc.aliyuncs.com/2016-08-15/proxy/proxy/proxy/?url=%s

This proxy service should take a percent-encoded iQIYI API URL as input, make the request to the API, and return the response to the crawler. You can customize your proxy service's query string or URL format, but make sure the proxy URL contains exactly one %s so that the crawler can substitute the target URL correctly.
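
For illustration, here is a sketch of the client side of that contract, assuming the requests library; the function name and error handling are hypothetical, not the crawler's actual code:

import os
import urllib.parse

import requests

def fetch_via_proxy(api_url):
    # CN_PROXY must contain exactly one %s, which receives the target URL.
    proxy_template = os.environ["CN_PROXY"]
    # safe="" percent-encodes every reserved character (including ':' and '/'),
    # matching the url=%s query parameter shown in the example above.
    encoded = urllib.parse.quote(api_url, safe="")
    response = requests.get(proxy_template % encoded, timeout=30)
    response.raise_for_status()
    return response.text  # the proxy relays the iQIYI API response verbatim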

References

All references can be found in the source code. Special thanks to the open-source projects listed there.


© 101对双生儿 2020.
