Installation instructions are wrong #58

Open
ddebernardy opened this issue Jul 11, 2019 · 7 comments

Comments

@ddebernardy

ddebernardy commented Jul 11, 2019

I was following the install instructions from the README (macOS 10.14.5).

There was one warning about

s3fs 0.2.1 has requirement six>=1.12.0, but you'll have six 1.11.0 which is incompatible.

... which I ignored. And then this failed:

[...]
$ make build_ext
python setup.py build_ext --inplace
Compiling scurl/cgurl.pyx because it changed.
Compiling scurl/canonicalize.pyx because it changed.
[1/2] Cythonizing scurl/canonicalize.pyx
[2/2] Cythonizing scurl/cgurl.pyx
running build_ext
building 'scurl.cgurl' extension
creating build
creating build/temp.macosx-10.14-x86_64-3.7
creating build/temp.macosx-10.14-x86_64-3.7/scurl
creating build/temp.macosx-10.14-x86_64-3.7/third_party
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/base
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/base/strings
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/base/third_party
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/base/third_party/icu
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/url
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/url/third_party
creating build/temp.macosx-10.14-x86_64-3.7/third_party/chromium/url/third_party/mozilla
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -I. -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/include/python3.7m -c scurl/cgurl.cpp -o build/temp.macosx-10.14-x86_64-3.7/scurl/cgurl.o -std=gnu++14 -I./third_party/chromium/ -fPIC -Ofast -pthread -w -DU_COMMON_IMPLEMENTATION
scurl/cgurl.cpp:638:10: fatal error: 
      '../third_party/chromium/url/third_party/mozilla/url_parse.h' file not
      found
#include "../third_party/chromium/url/third_party/mozilla/url_parse.h"
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
error: command 'clang' failed with exit status 1
make: *** [build_ext] Error 1

The offending folder exists but is empty.

@nyov

nyov commented Jul 21, 2019

This may be because that subfolder lives inside a git submodule.
Did you run git submodule init?

@ddebernardy
Author

@nyov: not that I can recollect - certainly not if this wasn't in the installation instructions...

@ddebernardy
Author

ddebernardy commented Jul 21, 2019

Running this before pip install -r requirements.txt did the trick:

git submodule init
git submodule update --init --recursive
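The symptom reported earlier in this thread (the submodule folder exists but is empty) can be checked for before building. The following is only a sketch, not part of the project's tooling: the function name `empty_submodules` is hypothetical, and it assumes the repository lists its submodules in a standard `.gitmodules` file with `path = ...` entries.

```python
import re
from pathlib import Path


def empty_submodules(repo_root="."):
    """Return the submodule paths from .gitmodules that are missing or empty.

    Hypothetical helper: an uninitialized submodule normally shows up as an
    empty directory, which is the situation described in this issue.
    """
    root = Path(repo_root)
    gitmodules = root / ".gitmodules"
    if not gitmodules.is_file():
        return []

    # Collect every "path = <dir>" entry, ignoring leading tabs or spaces.
    paths = re.findall(
        r"^[ \t]*path[ \t]*=[ \t]*(.+?)[ \t]*$",
        gitmodules.read_text(),
        flags=re.MULTILINE,
    )

    missing = []
    for rel in paths:
        target = root / rel
        # Missing directory or directory with no entries: not checked out.
        if not target.is_dir() or not any(target.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in empty_submodules("."):
        print(f"submodule not checked out: {rel}")
```

If this prints anything, running the two git submodule commands above before the build should populate the listed folders.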

@ddebernardy
Author

ddebernardy commented Jul 21, 2019

But then it segfaults on one of the URLs in my dataset. Oh well...

Traceback (most recent call last):
  File "./bin/triage_links", line 34, in get_url_parts
    link = urljoin(record.url, record.href)
  File "scurl/cgurl.pyx", line 308, in scurl.cgurl.urljoin
  File "scurl/cgurl.pyx", line 353, in scurl.cgurl.urljoin
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 503: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./bin/triage_links", line 102, in <module>
    main()
  File "./bin/triage_links", line 13, in main
    CSVPipeline(callback=process).execute()
  File "./bin/../crawler/utils/csvpipeline.py", line 42, in execute
    self.save_csv()
  File "./bin/../crawler/utils/csvpipeline.py", line 96, in save_csv
    df = df.compute()
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 175, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dask/base.py", line 446, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dask/threaded.py", line 82, in get
Segmentation fault: 11

(The offending strings are buried in a file with millions of entries, so I'm afraid I can't locate them easily, but the UTF-8 encoding error is hopefully a good enough hint as to what the issue is.)

@nyov

nyov commented Jul 21, 2019

Glad you managed to figure it out, I forgot about the "update init".
The error sucks, but if you think it's a bug, it should be a new ticket.

You could throw in some logging to get the position in the file (for example, dump part of the raw byte string of the line so you can grep for it).
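One way to do that kind of logging is to scan the input in binary mode and report every line that fails strict UTF-8 decoding. This is a sketch under assumptions: the function name `find_invalid_utf8` and the `context` parameter are made up for illustration, and it assumes the data is a newline-delimited file on disk.

```python
def find_invalid_utf8(path, context=20):
    """Yield (line_number, byte_offset, snippet) for lines that are not valid UTF-8.

    byte_offset is the position of the first bad byte within the line, and
    snippet is a slice of raw bytes around it that can be grepped for.
    """
    with open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                start = max(exc.start - context, 0)
                yield line_number, exc.start, raw[start:exc.start + context]


if __name__ == "__main__":
    import sys

    for line_number, offset, snippet in find_invalid_utf8(sys.argv[1]):
        print(f"line {line_number}, byte {offset}: {snippet!r}")
```

Reading bytes rather than text matters here: opening the file in text mode would raise on the first bad line instead of reporting it.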

(I haven't actually used scurl, or I might be able to help you with that error. But it looks obvious: wrong encoding on some of your text ➡ mojibake.)

@ddebernardy
Author

ddebernardy commented Jul 21, 2019

No no, that's totally a bug in the library, not the data. The library is supposed to join URLs from out there in the wild (this being part of scrapy), so it cannot possibly expect valid data, let alone segfault when it encounters something wrong.

To reproduce, run a broad crawl on this dataset and extract all links:

https://www.kaggle.com/cheedcheed/top1m

Then use urljoin() and urlsplit() on each one.
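A reproduction harness for this could look roughly like the sketch below. It uses the standard library's `urllib.parse.urljoin` and `urlsplit` as stand-ins so that it runs anywhere; passing the scurl functions through the `join` and `split` parameters instead is an assumption about how one would wire it up, and `find_failures` is a hypothetical name.

```python
from urllib.parse import urljoin, urlsplit  # stand-ins for the scurl functions


def find_failures(pairs, join=urljoin, split=urlsplit):
    """Join and split every (base, href) pair, collecting inputs that raise.

    Returns a list of (index, base, href, error_repr) tuples. A segfault
    cannot be caught here; it would kill the whole process, so log progress
    separately if that is the failure being chased.
    """
    failures = []
    for index, (base, href) in enumerate(pairs):
        try:
            split(join(base, href))
        except Exception as exc:  # record the input and keep going
            failures.append((index, base, href, repr(exc)))
    return failures


if __name__ == "__main__":
    sample = [
        ("http://example.com/a/", "b"),
        ("http://example.com/", "../c?d=1"),
    ]
    for failure in find_failures(sample):
        print(failure)
```

Because a segfault takes the interpreter down with it, printing or flushing the index before each call (or running batches in a subprocess) is what would actually pinpoint the crashing input.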

@lopuhin
Member

lopuhin commented Jul 29, 2019

Thanks for the reports, @ddebernardy, and for the help, @nyov. I created a separate issue to track the segfault/encoding issue.
