Fails on groups with "+" character on name #30

Open
zipob opened this issue Jul 11, 2019 · 3 comments

Comments


zipob commented Jul 11, 2019

The crawler fails when downloading messages from groups that have a "+" in their name, such as sfnet.harrastus.audio+video. The request for the first topics page returns HTTP 400 (Bad Request):

anon@anon ~/google-groups % ./crawler.sh -sh          
#!/usr/bin/env bash

export _ORG="${_ORG:-}"
export _GROUP="${_GROUP:-sfnet.harrastus.audio+video}"
export _D_OUTPUT="${_D_OUTPUT:-./sfnet.harrastus.audio+video/}"
export _USER_AGENT="${_USER_AGENT:-Mozilla/5.0 (X11; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0}"
export _WGET_OPTIONS="${_WGET_OPTIONS:--4}"

__wget_hook () 
{ 
    :
}
__wget__ () 
{ 
    if [[ ! -f "$1" ]]; then
        wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS "$2" -O "$1";
        __wget_hook "$1" "$2";
    fi
}
mkdir: created directory './sfnet.harrastus.audio+video'
mkdir: created directory './sfnet.harrastus.audio+video//threads/'
mkdir: created directory './sfnet.harrastus.audio+video//msgs/'
mkdir: created directory './sfnet.harrastus.audio+video//mbox/'
:: Downloading all topics (thread) pages...
:: Creating './sfnet.harrastus.audio+video//threads/t.0' with 'categories/sfnet.harrastus.audio+video'
:: Fetching data from 'https://groups.google.com/forum/?_escaped_fragment_=categories/sfnet.harrastus.audio+video'...
--2019-07-11 09:42:47--  https://groups.google.com/forum/?_escaped_fragment_=categories/sfnet.harrastus.audio+video
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving groups.google.com (groups.google.com)... 64.233.165.102, 64.233.165.113, 64.233.165.101, ...
Connecting to groups.google.com (groups.google.com)|64.233.165.102|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2019-07-11 09:42:48 ERROR 400: Bad Request.

:: Downloading list of all messages...
:: Downloading all raw messages...

Owner

icy commented Apr 9, 2020

Hi @zipob, thanks a lot for the report. This is probably because the query wasn't encoded correctly before being sent to the Google server. I will try to come up with a fix soon.

And I'm sorry for my belated response.

Owner

icy commented Apr 13, 2020

Google has a spec here: https://developers.google.com/search/docs/ajax-crawling/docs/specification, but it doesn't seem to work. For example,

https://groups.google.com/forum/?_escaped_fragment_=forum/sfnet.harrastus.audio%2Bvideo

will generate an invalid group name error.
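
For reference, the encoded form of the group name used in the URL above can be produced in bash with a small percent-encoding helper. This is only a minimal sketch: `urlencode` is a hypothetical function, not part of crawler.sh, and it is meant for ASCII group names. As noted above, the server still rejected the encoded name, so encoding alone may not be enough to fix this issue.

```bash
# Hypothetical helper (not part of crawler.sh): percent-encode a string
# so that reserved characters such as "+" become "%2B".
urlencode() {
    local LC_ALL=C            # treat the input byte by byte, ASCII ranges
    local s="$1" out="" c i
    for (( i = 0; i < ${#s}; i++ )); do
        c="${s:i:1}"
        case "$c" in
            # Unreserved characters are copied through unchanged.
            [a-zA-Z0-9.~_-]) out+="$c" ;;
            # Everything else becomes %XX (uppercase hex of the byte).
            *) printf -v c '%%%02X' "'$c"; out+="$c" ;;
        esac
    done
    printf '%s\n' "$out"
}

# Usage: build the URL for the group from this report.
_GROUP="sfnet.harrastus.audio+video"
echo "https://groups.google.com/forum/?_escaped_fragment_=forum/$(urlencode "$_GROUP")"
```

With the group from this report, the last line prints the same URL as the one quoted above, ending in `sfnet.harrastus.audio%2Bvideo`.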

Owner

icy commented Apr 13, 2020
