Segfault or encoding error when parsing a URL #59

Open
lopuhin opened this issue Jul 29, 2019 · 3 comments

Comments

@lopuhin
Member

lopuhin commented Jul 29, 2019

See #58 (comment) and #58 (comment)

Repeating the traceback here:

Traceback (most recent call last):
  File "./bin/triage_links", line 34, in get_url_parts
    link = urljoin(record.url, record.href)
  File "scurl/cgurl.pyx", line 308, in scurl.cgurl.urljoin
  File "scurl/cgurl.pyx", line 353, in scurl.cgurl.urljoin
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 503: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./bin/triage_links", line 102, in <module>
    main()
  File "./bin/triage_links", line 13, in main
    CSVPipeline(callback=process).execute()
  File "./bin/../crawler/utils/csvpipeline.py", line 42, in execute
    self.save_csv()
  File "./bin/../crawler/utils/csvpipeline.py", line 96, in save_csv
    df = df.compute()
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 175, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 446, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dask/threaded.py", line 82, in get
Segmentation fault: 11
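
For what it's worth, the UnicodeDecodeError by itself is ordinary UTF-8 behavior rather than anything scurl-specific: 0xf0 opens a four-byte UTF-8 sequence, so a 0xf0 followed by a byte outside the 0x80-0xBF continuation range fails in exactly this way. A minimal illustration:

>>> b"\xf0abc".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: invalid continuation byte

The segfault itself then happens while that exception is being handled, per the second traceback above.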

To reproduce, run a broad crawl on this dataset and extract all links:

https://www.kaggle.com/cheedcheed/top1m

Then run urljoin() and urlsplit() on each one.
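
A minimal sketch of the kind of loop meant here (untested; the file name and column names are made up for illustration):

import csv
from scurl.cgurl import urljoin, urlsplit

# "links.csv" with url/href columns is a hypothetical stand-in for
# the crawl output; any sufficiently broad set of links should do.
with open("links.csv", newline="") as f:
    for row in csv.DictReader(f):
        joined = urljoin(row["url"], row["href"])
        parts = urlsplit(joined)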

@lopuhin lopuhin changed the title Segfault when parsing a URL Segfault or encoding error when parsing a URL Jul 29, 2019
@ddebernardy

For clarity, you only need to extract all links from the front page.

@ddebernardy

scurl.csv.zip

^ Seems to be enough to reproduce on my system.

$ python
Python 3.7.3 (default, Mar 27 2019, 09:23:15) 
[Clang 10.0.1 (clang-1001.0.46.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> from scurl.cgurl import urlparse
>>> df = pd.read_csv('data/scurl.csv')
>>> test = df.drop_duplicates()
>>> test.url.apply(lambda r: urlparse(r))
[... works fine...]
Name: url, Length: 2150750, dtype: object
>>> df.url.apply(lambda r: urlparse(r))
Segmentation fault: 11
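
A possible follow-up (untested) to check whether the crash follows the duplicated rows rather than the sheer size of the frame:

>>> dups = df[df.duplicated()]  # only the rows drop_duplicates() removes
>>> dups.url.apply(lambda r: urlparse(r))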

@ddebernardy

ddebernardy commented Jul 29, 2019

That it works fine when I drop duplicates is somewhat intriguing. Maybe the code is running out of memory, or perhaps there's a memory leak in there somewhere.

If you have more memory than I do and it doesn't choke on your system as a result, you can probably call df.append() a few times to make the frame large enough to segfault.
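
Something along these lines (untested; the repeat count is a guess) should inflate the frame enough; df.append() was current pandas at the time, pd.concat([df] * n) is the newer equivalent:

>>> big = df
>>> for _ in range(3):
...     big = big.append(df, ignore_index=True)
...
>>> big.url.apply(lambda r: urlparse(r))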
