Writing dbGap metadata includes a newline, which splits a record #5

Open
seandavi opened this issue Aug 14, 2017 · 12 comments

@seandavi
Contributor

seandavi commented Aug 14, 2017

phs000007/phs000007.v13/supplemental_data/phs000007.v13_study_variable_code_value.txt.gz

readLines('phs000007/phs000007.v13/supplemental_data/phs000007.v13_study_variable_code_value.txt.gz')[14002:14005]

Note how the last record is split into two lines. This breaks reading the record as a tsv file. Records may need to be written as csv with quotes or have newlines stripped before writing. Alternatively (and this might be better), if dbGaPR pulls data directly into R without file creation, just use that rather than writing files.

[1] "7\t13\tphs000007.v13\t4117\t1\tphv00004117.v1\t1\tNORMAL\t2"
[2] "7\t13\tphs000007.v13\t4117\t1\tphv00004117.v1\t2\tPOSSIBLE DEMENTIA\t1"
[3] "7\t13\tphs000007.v13\t4117\t1\tphv00004117.v1\t3\tFACTORS SUCH AS ILLITERACY, NOT"
[4] " FLUENT IN ENGLISH, OR DEPRESSION THAT CAUSES POOR TESTING\t4"
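As a stopgap on the reading side, split records can be stitched back together before parsing. The following is a minimal sketch, not part of dbgapr: `rejoin_split_records` is a hypothetical helper, and it assumes every complete record has exactly nine tab-separated fields, as in the lines shown above. It cannot detect a newline embedded in the final field, because the truncated line would already contain all eight tabs.

```r
# Hypothetical helper: rejoin records that an embedded newline split in two.
# Assumes every complete record has exactly `n_fields` tab-separated fields.
rejoin_split_records <- function(lines, n_fields = 9) {
  out <- character(0)
  buffer <- NULL
  for (line in lines) {
    # Append to a pending partial record, or start a new one.
    buffer <- if (is.null(buffer)) line else paste0(buffer, line)
    # A complete record contains n_fields - 1 tab characters.
    n_tabs <- lengths(regmatches(buffer, gregexpr("\t", buffer, fixed = TRUE)))
    if (n_tabs >= n_fields - 1) {
      out <- c(out, buffer)
      buffer <- NULL
    }
  }
  # Keep a trailing partial record rather than silently dropping it.
  if (!is.null(buffer)) out <- c(out, buffer)
  out
}

# The last three lines from the output above.
raw_lines <- c(
  "7\t13\tphs000007.v13\t4117\t1\tphv00004117.v1\t2\tPOSSIBLE DEMENTIA\t1",
  "7\t13\tphs000007.v13\t4117\t1\tphv00004117.v1\t3\tFACTORS SUCH AS ILLITERACY, NOT",
  " FLUENT IN ENGLISH, OR DEPRESSION THAT CAUSES POOR TESTING\t4"
)
fixed <- rejoin_split_records(raw_lines)
```

The repaired vector can then be parsed with, for example, `read.delim(text = fixed, header = FALSE, quote = "")`.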
@davemcg
Collaborator

davemcg commented Aug 14, 2017

The ftpdownload script creates files as a default. So this is, technically, a bug in James's code. I'll see if there's a way around this.

@seandavi
Contributor Author

@jameslhao, could you take a look?

@davemcg
Collaborator

davemcg commented Aug 14, 2017

I don't see any options to change the behavior of ftpDownload

Script:
https://github.com/NCBI-Hackathons/ComplexPhenotypes/blob/master/src/Downloading_dbGaP_metadata.R

@jameslhao
Collaborator

jameslhao commented Aug 15, 2017 via email

@davemcg
Collaborator

davemcg commented Aug 15, 2017

The script is above.

The problem file is:
phs000007/phs000007.v13/supplemental_data/phs000007.v13_study_variable_code_value.txt.gz
(see Sean's post above).

@jameslhao
Collaborator

Thanks for pointing out the problem. I will see what I can do. I am thinking about replacing all the ftp files with JSON format, which is better than XML for parsing and involves less drama when dealing with free text.

@seandavi
Contributor Author

Thanks, @jameslhao, for looking into the issue. Are you talking about replacing the files in the NCBI ftp directory with json, or some other set of files?

@jameslhao
Collaborator

JSON, or simply filtering out potential newlines and tabs before creating the files.
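For illustration, the filtering option could look roughly like the sketch below. It is not the actual dbGaP export code: `sanitize_free_text` and the sample `codes` data frame are hypothetical, and the approach assumes that collapsing every run of whitespace in a free-text field to a single space is acceptable.

```r
# Hypothetical helper: collapse tabs, carriage returns, newlines and repeated
# spaces inside character columns so that each record stays on one line.
sanitize_free_text <- function(df) {
  is_text <- vapply(df, is.character, logical(1))
  df[is_text] <- lapply(df[is_text], function(x) gsub("[[:space:]]+", " ", x))
  df
}

# Made-up sample with an embedded newline in the second value.
codes <- data.frame(
  code = c("2", "3"),
  value = c("POSSIBLE DEMENTIA",
            "FACTORS SUCH AS ILLITERACY, NOT\n FLUENT IN ENGLISH"),
  stringsAsFactors = FALSE
)

clean <- sanitize_free_text(codes)

# Write as unquoted TSV; after sanitising there is one line per record.
out_file <- tempfile(fileext = ".txt")
write.table(clean, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
n_lines <- length(readLines(out_file))
```

If the original line breaks need to be preserved, writing quoted CSV with `write.csv` and reading it back with `read.csv` is the other route mentioned earlier in the thread, since embedded newlines inside quoted fields survive that round trip.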

@jameslhao
Collaborator

I have full control of the ftp sub-dir called by dbgapr. I have been thinking about the JSON format lately. Maybe it is time to implement it.

@jameslhao
Collaborator

As for the files in the ftp studies dir (the official version), we can provide a JSON version as well, but it will take some time.

@davemcg
Collaborator

davemcg commented Aug 16, 2017

James, has this been fixed? I can't build a proper sqlite database because the data is jumbled.

@jameslhao
Collaborator

jameslhao commented Aug 16, 2017 via email
