Encoding issues for websites in non-English languages such as Chinese, Japanese, etc. #64

Open
gaowanliang opened this issue Jan 30, 2021 · 7 comments

Comments

@gaowanliang

The text of the downloaded website is saved as Unicode numeric character references, and with this encoding the browser does not display the real content.

4ifOH.png
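
For anyone trying to reproduce the symptom, here is a minimal standard-library sketch of one plausible cause. It is an assumption, not something confirmed in this thread: if the page's UTF-8 bytes are first decoded as ISO-8859-1 and then serialized to ASCII, every byte becomes its own numeric character reference, and the browser faithfully renders the wrong characters.

```python
text = "中文"

# Correct path: the text is decoded properly, so each reference
# carries the real code point (b'&#20013;&#25991;').
good = text.encode("ascii", "xmlcharrefreplace")

# Suspected failure path (assumption): the UTF-8 bytes are mis-decoded as
# ISO-8859-1, so every single byte turns into a separate character.
mojibake = text.encode("utf-8").decode("iso-8859-1")
bad = mojibake.encode("ascii", "xmlcharrefreplace")

print(good)
print(bad)

# The wrong decode is lossless, so it can be undone before serializing.
recovered = mojibake.encode("iso-8859-1").decode("utf-8")
print(recovered == text)
```

If the saved file contains six short references for two Chinese characters rather than two long ones, the wrong decode step is the likely culprit, not the references themselves.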

@mima3

mima3 commented Apr 11, 2022

I got around this with the following approach:

  • Create a new class that inherits from WebPage.
  • Override save_html in it so that the tree is written with the page's encoding:
# (omitted)
            root.getroottree().write(file_name, method="html", encoding=self.encoding)
# (omitted)

@rajatomar788
Owner

rajatomar788 commented Apr 11, 2022

@mima3 this is one way to do it. The other is to change the .encoding attribute of the WebPage object.

@muzicstation

Is there a valid method for ver 7.0 or later versions?

@BradKML

BradKML commented Apr 2, 2023

Is it possible to just check the encoding of the webpage based on what it declares? There are two major ways of getting the encoding to decode with.
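
The two ways are not spelled out in this thread. Assuming they are the charset parameter of the HTTP Content-Type header and the charset declared in a <meta> tag, a rough standard-library sketch for reading both could look like this. The function names and the 1024-byte scan window are made up for illustration and are not part of pywebcopy.

```python
import re
from email.message import Message

# Matches both <meta charset="..."> and the http-equiv form whose
# content attribute contains "charset=...".
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def charset_from_header(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def charset_from_meta(body, scan=1024):
    """Return the lowercased charset declared in a <meta> tag near the top, or None."""
    match = META_CHARSET_RE.search(body[:scan])
    return match.group(1).decode("ascii").lower() if match else None


header = "text/html; charset=Shift_JIS"
body = b'<html><head><meta charset="utf-8"></head><body></body></html>'
print(charset_from_header(header), charset_from_meta(body))
```

The sample header and body deliberately disagree, which is the situation where a declared encoding alone is not enough.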

@rajatomar788
Owner

@BrandonKMLee your first example works on top of the second one, so they are not two separate things. Also, most of the time the encoding reported by the website is wrong, so finding the best encoding is always a matter of trial and error on the user's side.
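
The trial and error could be automated roughly as below. This is only a sketch: the helper name and the candidate list are assumptions, and the right order depends on the sites being downloaded.

```python
def first_working_encoding(data, candidates=("utf-8", "gb18030", "shift_jis", "euc_jp")):
    """Return the first candidate that decodes `data` without errors, or None."""
    for name in candidates:
        try:
            data.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
        return name
    return None


sample = "中文".encode("gb18030")
print(first_working_encoding(sample))
```

A clean decode does not prove the guess is right. Permissive codecs such as gb18030 accept many byte sequences that were written in another encoding, so strict ones like utf-8 should come first and the result still needs a visual check.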

@BradKML

BradKML commented Apr 3, 2023

@rajatomar788 in that case it would need to run the text through Python's chardet or cChardet to "smell" it. Even if that is not a guarantee, it is a good default to have.
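
A minimal sketch of that default, assuming the third-party chardet package is installed (the code falls back to the declared encoding when it is not). The helper name and the 0.5 confidence threshold are illustrative choices, not part of pywebcopy.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None


def sniff_encoding(data, declared=None, min_confidence=0.5):
    """Guess the encoding of `data`, falling back to `declared` or utf-8."""
    if chardet is not None:
        guess = chardet.detect(data)  # dict with 'encoding' and 'confidence'
        if guess["encoding"] and guess["confidence"] >= min_confidence:
            return guess["encoding"]
    return declared or "utf-8"


page = ("日本語のテキストです。" * 20).encode("utf-8")
print(sniff_encoding(page, declared="utf-8"))
```

The detected name could then be assigned to the .encoding attribute mentioned above before the page is saved.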

@PeterBon

PeterBon commented Sep 6, 2023

Is there a valid method for ver 7.0 or later versions?

I solved it by modifying schedulers.py:

class Scheduler(SchedulerBase):
    def _handle_resource(self, resource):
        try:
            self.logger.debug('Scheduler trying to get resource at: [%s]' % resource.url)
            resource.get(resource.context.url)
            # NOTE :meth:`get` can change the :attr:`filepath` of the resource
            resource.encoding = 'utf-8'  # added this line
            self.index.add_resource(resource)
        except ConnectionError:
            self.logger.error(
                "Scheduler ConnectionError Failed to retrieve resource from [%s]"
                % resource.url)
            # self.index.add_entry(resource.url, resource.filepath)
        except Exception as e:
            self.logger.exception(e)
            # self.index.add_entry(resource.url, resource.filepath)
        else:
            self.logger.debug('Scheduler running handler for: [%s]' % resource.url)
            resource.retrieve()
        self.index.add_resource(resource)
