🐛[BUG] Official sites' links not working from details dictionary #9

Open
chauhannaman98 opened this issue Jan 3, 2021 · 0 comments
Labels
bug Something isn't working help wanted Extra attention is needed

This issue has also been raised on Stack Overflow.

I am trying to scrape the official-site links from a title page on IMDb using Beautiful Soup. For example, to get the links for Interstellar, I have this code:

from bs4 import BeautifulSoup
import requests

url = 'https://www.imdb.com/title/tt0816692/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
title_detail_soup = soup.find('div', {'id': 'titleDetails'})
details_soup = title_detail_soup.find_all('div', class_='txt-block')
detail_list = ['Official Sites:', 'Country:', 'Language:',
                'Release Date:', 'Also Known As:', 'Filming Locations:']
details = {}
for detail in details_soup:
    try:
        # Each heading (h4) has detail heading
        head = detail.find('h4')
        if head.get_text() in detail_list:
            # If the detail heading is in the detail list
            if head.get_text() == 'Official Sites:':
                # If details is about official sites
                official_site = {}
                detail.h4.decompose()    # remove <h4> tags
                a_tags = detail.find_all('a')
                for a_tag in a_tags:
                    # exclude See more>> links
                    if a_tag.get_text() != 'See more':
                        data = url+a_tag['href']    # final link is base URL + hyperlink
                        official_site[a_tag.get_text()] = data
                details['official-sites'] = official_site
    except Exception as e:
        print(e)
print(details)    # Print the detail dictionary

HTML of the page:

<div class="article" id="titleDetails">
    <span class="rightcornerlink">
        <a href="https://contribute.imdb.com/updates?edit=tt0816692/details&amp;ref_=tt_dt_dt">Edit</a>
    </span>
    <h2>Details</h2>
    <div class="txt-block">
        <h4 class="inline">Official Sites:</h4>
            <a href="/offsite/?page-action=offsite-facebook&amp;token=BCYpckvEa_ZSPp2TC3Ztr1DNqde5ZCUHig7950CLYvsgSHOzBCfJSHpgg71IYRsZYP1DuUpTZb9H%0D%0AhK4BzY5AiKU5Vy2oFn7i91MVFT_TnR39yhU5V5NBAse2mY_ht5WdsmSBxQPGRBC6pIJJym7IXbao%0D%0ATz9SG3r8MjKfwIe9hBrJU5Y-vNdnR_uaDq_24s2NGj5ikJYWl_093YIHy_I2lnK-I6jK9OvOpwgw%0D%0AupABQOymuxA%0D%0A&amp;ref_=tt_pdt_ofs_offsite_0" rel="nofollow">Official Facebook</a>
        <span class="ghost">|</span>
            <a href="/offsite/?page-action=offsite-interstellarmovie&amp;token=BCYuB9Ouy5QXl_3W_k3RrnnXUdrfSLbBFfOcrJTX0yo5TtTDqsSLpry8x7drK8l0xpOJSEqt73Hz%0D%0A08qyki3_i83CrCym7SXSkevFQpT32TjuuJLgIlQ-W5CpRd-wZC9eD4R3SZOMdOfSjeoOtqiE5uU_%0D%0Az-YG1i5AImXY2xLmHSNwABh1hU7VHS-FnqKDW9G-4KOF78zpKdDIfrwlRs8px0yef9u51LojZz05%0D%0A0OBfTmRs_JI%0D%0A&amp;ref_=tt_pdt_ofs_offsite_1" rel="nofollow">Official site</a>
        <span class="ghost">|</span>
        <span class="see-more inline">
            <a href="externalsites?ref_=tt_dt_dt#official">See more</a>&nbsp;»
        </span>
    </div>
</div>

Output dictionary:

{
    'official-sites': {
        'Official Facebook': 'https://www.imdb.com/title/tt0816692/offsite/?page-action=offsite-facebook&token=BCYqzjQrP9OA_yaYNwA9Q8hI5gt41EmHuu0_ePjZPHKui-hEmAEySo-0SHzZmSjpeeEVy3Art6SH%0D%0ATseW16b3uKMjIH8iOyO-ZVYR025mQ4YCbZIWUKEcEM-z0eOeUvud3KGbuQTCxrNhTGAx7xgFIB89%0D%0Al9jT6pvqSpSCdNYACnBhk_8MuNjCn8GIJZk-6PR1MZ1xQB5yDrqRNhNt9Dg8IDMXVpxTR8-LFu2I%0D%0Amf5KmXbmXos%0D%0A', 
        'Official site': 'https://www.imdb.com/title/tt0816692/offsite/?page-action=offsite-interstellarmovie&token=BCYsMb9WTKJLH9M9nmxvLDpn8ikQDnQmpVQZBurp9Trd1-XXbA_Bh4xoKx6yf3Qx4YNn3fT9UhFe%0D%0AnzcULcEY5SFJ7CW8kBj6dQvZA9GyvqfZMyIDS7daNe6rne6DkdL23CDPAkk1Xwr9rjiE6FF_m0vX%0D%0ASLH2NnzOf8BcKnaWILhGGdvHTYeZ_uRGm4QCIOzxw-CvLM2rag04ZbXM2ZUEvQm6OedW9XumtsnQ%0D%0AoP7ce67sytE%0D%0A'
    }
}
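
A likely cause, judging only from the HTML and output above: the `href` values start with `/`, so they are relative to the site root, while the code appends them to the full title URL and gets `.../title/tt0816692/offsite/...`. A minimal sketch of resolving them with the standard library's `urljoin` instead of string concatenation follows. The `resolve_links` helper and the `TOKEN_A`/`TOKEN_B` placeholders are illustrative assumptions, not part of the project, and whether IMDb accepts the resolved links has not been checked here.

```python
from urllib.parse import urljoin

# Title page URL used in the report above
page_url = 'https://www.imdb.com/title/tt0816692/'

# href values shaped like those in the markup above.
# TOKEN_A / TOKEN_B are placeholders standing in for the long real tokens.
hrefs = {
    'Official Facebook': '/offsite/?page-action=offsite-facebook&token=TOKEN_A',
    'Official site': '/offsite/?page-action=offsite-interstellarmovie&token=TOKEN_B',
}


def resolve_links(base_url, links):
    """Resolve each href against base_url the way a browser would (hypothetical helper)."""
    return {text: urljoin(base_url, href) for text, href in links.items()}


official_site = resolve_links(page_url, hrefs)
for text, link in official_site.items():
    print(text, '->', link)

# A document-relative href (no leading slash) still resolves under the title page
see_more = urljoin(page_url, 'externalsites?ref_=tt_dt_dt#official')
print(see_more)
```

In the original loop this would amount to replacing `url+a_tag['href']` with `urljoin(url, a_tag['href'])`, which handles both root-relative and document-relative links.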
chauhannaman98 added the bug and help wanted labels on Jan 3, 2021