Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Links #120

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Links #120

wants to merge 3 commits into from

Conversation

edsu
Copy link
Contributor

@edsu edsu commented Dec 6, 2018

Currently if you attempt to bag a directory that contains symbolic links bagit.py happily creates the bag, but the manifests will not include the linked files or directories.

bagit.py --validate will say this bag is valid, because the manifests don't mention the linked content, and the code that walks the filesystem during validation ignores links.

However when you are looking at the filesystem you can see there is content present that is not in the manifest. This seems to be contrary to 3.A.4:

For BagIt 1.0, every payload file MUST be listed in every payload manifest.

This PR includes changes to support following symbolic links when creating and validating bags:

bagit.py --follow-links mydir
bagit.py --validate --follow-links mydir

I think an argument could be made that follow links should be the default. I'm thinking of situations where bags are quite large and may use symlinks to manage large payload files/directories.

Is this a crazy idea?

This commit adds a follow_links parameter to Bag.make_bag and
Bag.__init__. When set to true Bag._is_dangerous will allow
links to files that exist outside of the payload directory. By
default it is set to False, but I think a reasonable argument
could be made for making the default set to True.

See LibraryOfCongress#115 for more context.
There were several places in the code where filesystem walking happens.
These needed to be adjusted to take the follow_links option that
is provided to make_bag or the Bag constructor. In addition there are a
few functions outside of the Bag class which walk the fileystem which
needed to be aware of whether the user wants to follow links.

Interestingly this work surfaced a bit of a bug in bagit-python. If the
directory to be bagged contains a symlink before bagging, bagit will happily
atomically move that directory into the bag's payload directory. However
when the manifests are created they totally ignore the symbolic links.
So the data is there present on the filesystem but totally not
represented in the manifests.

With the addition of --follow-links you can now create a bag that
contains symlinks, and verify it later. I think an argument could be
made that this should be the default behavior, and perhaps you should
only be able to turn it off?
real_path = os.path.realpath(norm_path)
real_bag_path = os.path.realpath(norm_bag_path)
if os.path.commonprefix((real_path, real_bag_path)) != real_bag_path \
and not self.follow_links:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So --follow-links allows a symlink pointing to /etc/passwd?

@kba
Copy link
Contributor

kba commented Dec 7, 2018

I see two different use cases for symbolic links:

  1. Organize complex data within the payload dir
  2. Avoid redundancy across bags by replacing file/dir with symlink to file/dir outside payload dir

For 1) the most reasonable behavior seems to follow the links as this PR implements and should indeed be the default.

The bahvior for 2) should only be enabled explicitly, e.g. while developing or for testing, with a flag like --allow-external-symlinks. If not enabled explicitly, symlinks pointing beyond the payload directory should be a validation error. Because while that bag is considered valid, serialized (which must keep symlinks intact so 1) won't break), transferred to another client and deserialized, it will not be valid anymore.

There's also security issues when allowing payload data to point outside the bag dir, data/legit.data pointing to /etc/passwd and such.

@edsu
Copy link
Contributor Author

edsu commented Dec 7, 2018

Yes. It seems simplest for bagit-python to trust the integrity of the filesystem, and if it can read files in the payload directory it should do so. Perhaps this is naive. But the current behavior where symlinks could be present in the payload and totally ignored by the bagging and validation process seems wrong.

test.py Outdated
src = j(os.path.dirname(__file__), "README.rst")
dst = j(self.tmpdir, "README.rst")
os.symlink(src, dst)
self.assertRaisesRegex(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it must be assertRaisesRegexp to work in 2.7

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes -- nice catch!

@coveralls
Copy link

coveralls commented Dec 8, 2018

Coverage Status

Coverage decreased (-1.4%) to 82.166% when pulling 1dc1cc6 on edsu:links into c39b650 on LibraryOfCongress:master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants