Writing dbGap metadata includes a newline, which splits a record #5

seandavi · 2017-08-14T22:05:18Z

phs000007/phs000007.v13/supplemental_data/phs000007.v13_study_variable_code_value.txt.gz

readLines('phs000007/phs000007.v13/supplemental_data/phs000007.v13_study_variable_code_value.txt.gz')[14002:14005]

[1] "7\t13\tphs000007.v13\t4117\t1\tphv00004117.v1\t1\tNORMAL\t2"
[2] "7\t13\tphs000007.v13\t4117\t1\tphv00004117.v1\t2\tPOSSIBLE DEMENTIA\t1"
[3] "7\t13\tphs000007.v13\t4117\t1\tphv00004117.v1\t3\tFACTORS SUCH AS ILLITERACY, NOT"
[4] " FLUENT IN ENGLISH, OR DEPRESSION THAT CAUSES POOR TESTING\t4"

Note how the last record is split across two lines ([3] and [4]). This breaks reading the file as TSV. Records may need to be written as CSV with quoted fields, or have newlines stripped before writing. Alternatively (and this might be better), if dbGaPR pulls data directly into R without file creation, just use that rather than writing files.
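The two file-based workarounds mentioned in this post can be sketched in base R. This is only an illustration: the data frame `codes` and its column names are made up and are not the actual dbGaP schema, and nothing here reflects how the download script really writes its files.

```r
# Hypothetical two-column table with a newline embedded in a free-text value.
codes <- data.frame(
  code  = c("2", "3"),
  value = c("POSSIBLE DEMENTIA",
            "FACTORS SUCH AS ILLITERACY, NOT\n FLUENT IN ENGLISH"),
  stringsAsFactors = FALSE
)

# Option 1: write CSV with quoted fields (write.csv quotes character
# columns by default), so the newline stays inside a single field.
quoted_file <- tempfile(fileext = ".csv")
write.csv(codes, quoted_file, row.names = FALSE)
roundtrip <- read.csv(quoted_file, colClasses = "character")

# Option 2: collapse embedded newlines and tabs (plus surrounding
# whitespace) into one space, then write an unquoted TSV.
clean <- codes
clean$value <- gsub("\\s*[\r\n\t]+\\s*", " ", clean$value)
tsv_file <- tempfile(fileext = ".tsv")
write.table(clean, tsv_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

Option 1 preserves the original text but requires every reader to honor quoting. Option 2 alters the text slightly but keeps the one-record-per-line assumption that `readLines` and simple TSV parsers rely on.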


davemcg · 2017-08-14T23:16:06Z

The ftpDownload script creates files by default. So this is, technically, a bug in James's code. I'll see if there's a way around this.

seandavi · 2017-08-14T23:19:59Z

@jameslhao, could you take a look?

davemcg · 2017-08-14T23:22:11Z

I don't see any options to change the behavior of ftpDownload.

Script:
https://github.com/NCBI-Hackathons/ComplexPhenotypes/blob/master/src/Downloading_dbGaP_metadata.R

jameslhao · 2017-08-15T01:30:46Z

Hi David, Sure, I can fix it if it is a bug. Could you be more specific about how the function is called, the current behavior, and the desired behavior? James


davemcg · 2017-08-15T12:36:21Z

The script is above.

The problem file is:
phs000007/phs000007.v13/supplemental_data/phs000007.v13_study_variable_code_value.txt.gz
(see Sean's post above).

jameslhao · 2017-08-15T13:48:23Z

Thanks for pointing out the problem. I will see what I can do. I am thinking about replacing all the FTP files with JSON format, which is better than XML for parsing and causes less drama when dealing with free text.

seandavi · 2017-08-15T13:52:04Z

Thanks, @jameslhao, for looking into the issue. Are you talking about replacing the files in the NCBI FTP directory with JSON, or some other set of files?

jameslhao · 2017-08-15T13:52:12Z

JSON, or simply filtering out potential newlines and tabs before creating the files.

jameslhao · 2017-08-15T13:57:06Z

I have full control of the FTP sub-directory called by dbgapr. I have been thinking about JSON format lately. Maybe it is time to implement it.

jameslhao · 2017-08-15T14:02:04Z

As for the files in the FTP studies directory (official version), we can provide a JSON version as well, but it will take some time.

davemcg · 2017-08-16T14:24:16Z

James, has this been fixed? I can't build a proper SQLite database because the data is jumbled.

jameslhao · 2017-08-16T15:33:46Z

I took a simple approach: I filtered out all potential newlines and tabs and regenerated the files. The problem should be fixed now. Please let me know if not. Thanks again for bringing this up. James
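One way to check whether a regenerated file still contains split records is to compare the tab-separated field count of every line against the first line. The sketch below uses base R's `count.fields`; the helper name `find_split_records` is hypothetical, and the sample lines are the ones quoted at the top of this thread, written to a temporary file rather than read from the real download.

```r
# Hypothetical helper: return the line numbers whose tab-separated field
# count differs from that of the first line of the file.
find_split_records <- function(path) {
  n_fields <- count.fields(path, sep = "\t", quote = "",
                           comment.char = "", blank.lines.skip = FALSE)
  which(n_fields != n_fields[1])
}

# The four lines quoted in the original report, including the split record.
sample_lines <- c(
  "7\t13\tphs000007.v13\t4117\t1\tphv00004117.v1\t1\tNORMAL\t2",
  "7\t13\tphs000007.v13\t4117\t1\tphv00004117.v1\t2\tPOSSIBLE DEMENTIA\t1",
  "7\t13\tphs000007.v13\t4117\t1\tphv00004117.v1\t3\tFACTORS SUCH AS ILLITERACY, NOT",
  " FLUENT IN ENGLISH, OR DEPRESSION THAT CAUSES POOR TESTING\t4"
)
sample_file <- tempfile(fileext = ".txt")
writeLines(sample_lines, sample_file)

bad_lines <- find_split_records(sample_file)
```

Quoting and comment handling are switched off so that apostrophes, quotes, or `#` characters in free text are not misread. An empty result means every line has the same number of fields as the first one; it does not prove the content itself is correct.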

