
Clarification on how to get a list of domains associated with fingerprinting #125

Open
birdsarah opened this issue Oct 20, 2018 · 4 comments


@birdsarah
Contributor

I've been working through the various data options, trying to compile a list of domains that classify as fingerprinting. I'm getting mixed results and was wondering if you could clarify what you consider the canonical approach.

Apologies if I'm just misreading the documentation. I'm happy to submit a PR to the docs if you think it would be useful.

I can use the data source and get a list of tracker IDs as follows:

from whotracksme.data.loader import DataSource

fp_trackers = set()
regions = {'de', 'eu', 'fr', 'global', 'us'}
for region in regions:
    who_tracks_data = DataSource(region=region)
    trackers_df = who_tracks_data.trackers.df
    # Keep trackers whose fingerprinting score exceeds the 0.1 threshold
    who_tracks_fp = trackers_df[trackers_df.bad_qs > 0.1]
    fp_trackers.update(who_tracks_fp.tracker.values)

This gives me 193 trackers. I can then map these tracker IDs to domains using the map produced by create_tracker_map.

# tracker_info is the mapping produced by create_tracker_map
could_not_find = []
domains = set()
for tracker in fp_trackers:
    try:
        domains.update(tracker_info['trackers'][tracker]['domains'])
    except KeyError:
        could_not_find.append(tracker)

This gives me 326 domains.

If I take a different route and read in all the domains.csv files under the assets folders, I can get a list of domains like this:

import pandas as pd

domains_df = pd.concat([
    pd.read_csv(file, parse_dates=['month'])
    for file in asset_paths['domains']  # paths to the domains.csv files, assembled previously
])
fingerprinting_trackers = domains_df[domains_df.bad_qs > 0.1].host_tld.unique()

But this gives me a list of 292 domains.

I can think of one explanation for this: not all host_tlds may have a bad_qs that meets the threshold, but they have been added to the tracker map for other reasons.
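One quick way to test that hypothesis is to diff the two domain sets directly. A sketch with toy stand-ins for the 326- and 292-domain lists built above (the hostnames and variable names here are hypothetical):

```python
# Toy stand-ins for the two lists derived above: the tracker-map route
# and the per-domain threshold route. With real data, these would come
# from the snippets earlier in this issue.
domains_via_tracker_map = {"tracker-a.com", "tracker-b.net", "cdn.tracker-b.net"}
domains_via_threshold = {"tracker-a.com", "tracker-b.net"}

# Domains mapped to a fingerprinting tracker that never cross the
# bad_qs threshold on their own (e.g. CDN hostnames).
only_in_map = domains_via_tracker_map - domains_via_threshold
print(sorted(only_in_map))  # ['cdn.tracker-b.net']

# The reverse difference: domains crossing the threshold that are
# missing from the tracker map. A non-empty set here would point to
# an unmapped tracker rather than the aggregation explanation.
only_in_threshold = domains_via_threshold - domains_via_tracker_map
print(sorted(only_in_threshold))  # []
```

If the second set is empty on the real data, the tracker-map explanation above would account for the whole 326 vs 292 gap.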

However, given that the other csv files may also be relevant, I was starting to lose confidence and so wanted to check in.

Many thanks in advance for your help.

@sammacbeth
Contributor

The domains.csv and trackers.csv files represent different aggregations of the same data. If we consider the fingerprinting case:

  • domains.csv counts the proportion of times when each hostname (at TLD+1 level) was seen sending a fingerprint (or suspected fingerprint) in a third-party context on a page.
  • trackers.csv counts the proportion for any of the hostnames associated with a tracker - from the mapping in the tracker database.

For the majority of trackers the relationship between domains and trackers is one-to-one. For the others, the domains files show which domains fingerprinting data is sent to, while the trackers view shows a more aggregated picture of what the tracker is doing.

For example, Facebook uses facebook.net as a CDN, and the stats show little evidence of tracking on that domain. The tracking requests are aimed at facebook.com, where Facebook has the user's login cookie. The tracker view reports both domains together, giving an aggregate picture of Facebook's third-party traffic.
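That Facebook example can be made concrete with a toy calculation. The numbers below are invented, and the page-weighted aggregation is my assumption of how a tracker-level rate could be formed from domain-level rates; the real WhoTracks.Me pipeline may aggregate differently:

```python
import pandas as pd

# Hypothetical per-domain fingerprinting rates for one tracker that
# serves traffic from two hostnames (numbers are illustrative only).
domains = pd.DataFrame({
    "tracker":  ["facebook", "facebook"],
    "host_tld": ["facebook.com", "facebook.net"],
    "pages":    [1000, 1000],      # pages the domain was seen on
    "bad_qs":   [0.30, 0.01],      # share of pages with suspected fingerprints
})

# Tracker-level view: pool all of the tracker's domains, weighting each
# domain's rate by how often it was seen.
weighted = (domains.bad_qs * domains.pages).groupby(domains.tracker).sum()
total = domains.pages.groupby(domains.tracker).sum()
tracker_bad_qs = weighted / total

print(tracker_bad_qs["facebook"])                        # ~0.155, above 0.1
print(domains[domains.bad_qs > 0.1].host_tld.tolist())   # ['facebook.com']
```

Under this sketch the trackers view flags the tracker (and hence both of its domains, via the tracker map), while the domains view only flags facebook.com, which is one way the 326 vs 292 discrepancy can arise.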

I hope that clears things up a little for you. For your use case, the domains.csv data view looks like the better fit.

@ecnmst
Contributor

ecnmst commented Oct 22, 2018

Hi @birdsarah, many thanks for the PR and issues raised. domains.csv is currently not exposed via the API. If you'd find this useful, you can extend our API by adding this to loader.py. Here's one way to do it:

class Domains(PandasDataLoader):
    def __init__(self, data_months, region="global"):
        super().__init__(data_months, name="domains", region=region)

then add this to class DataSource, still in loader.py:

        ...
        self.domains = Domains(
            data_months=self.data_months,
            region=region,
        )

Now you can consume domains via the DataSource:

data = DataSource(region="global")
domains = data.domains.df 

where domains is a pandas DataFrame covering all months for which domains.csv is available.

@birdsarah
Contributor Author

Thanks so much for this feedback @sammacbeth @ecnmst. This is extremely helpful.
I'll leave this open and plan to make the addition to loader.py that @ecnmst proposes.

@birdsarah
Contributor Author

But if, on reflection, you don't want the update to loader.py, feel free to close the issue.
