Releases: MuMiN-dataset/mumin-build

v1.0.1

05 Dec 18:54

Fixed

  • Added the POSTED relation. Leaving it out meant that all the new tweets
    were filtered out during compilation.

v1.0.0

03 Dec 11:30

Changed

  • Added a new version of the dataset, which now includes a sample of ~100
    timeline tweets for every user. This approximately doubles the dataset size,
    to ~200MB before compilation. The new dataset also uses different
    train/val/test splits, which are now 80/10/10 rather than 60/10/30. This
    means that the training split now covers a more varied set of events (6-7),
    compared to the previous 2.
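
As a rough sketch of why the split ratio affects how many events end up in training, the snippet below assigns whole events to splits until each split is close to its target share. It assumes that events are kept intact within a single split, which the entry above suggests but does not state; the function name, the greedy strategy and the event sizes are all made up for illustration and are not the procedure used to build the dataset.

```python
def split_by_event(event_sizes, fractions=(0.8, 0.1, 0.1)):
    """Greedily assign whole events to train/val/test (illustrative only)."""
    names = ("train", "val", "test")
    total = sum(event_sizes.values())
    targets = {name: frac * total for name, frac in zip(names, fractions)}
    assigned = {name: [] for name in names}
    filled = {name: 0 for name in names}

    # Largest events first; each goes to the split furthest below its target.
    for event, size in sorted(event_sizes.items(), key=lambda kv: -kv[1]):
        best = max(names, key=lambda name: targets[name] - filled[name])
        assigned[best].append(event)
        filled[best] += size
    return assigned


# Ten hypothetical events of equal size.
events = {f"event_{i}": 10 for i in range(10)}
splits_80_10_10 = split_by_event(events, fractions=(0.8, 0.1, 0.1))
splits_60_10_30 = split_by_event(events, fractions=(0.6, 0.1, 0.3))
```

With equally sized events, the 80/10/10 setting places more events in the training split than 60/10/30 does.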

v0.7.0

02 Dec 14:57

Changed

  • Changed include_images to include_tweet_images, which now only includes
    the images from the tweets themselves. Further, include_user_images is
    changed to include_extra_images, which now includes both profile pictures
    and the top images from articles. The tweet pictures are included by default,
    and the extras are not. This is to reduce the size of the default dataset, to
    make it easier to use.
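
A minimal sketch of the renamed options and their defaults follows. Only the two argument names and their default values come from the entry above; how they are passed to the dataset class is shown as a comment, since the other constructor arguments are not covered here.

```python
# Image options after this release, with the defaults described above.
image_options = {
    "include_tweet_images": True,   # images from the tweets themselves
    "include_extra_images": False,  # profile pictures and top article images
}

# Assumed usage, with the remaining constructor arguments left out:
# dataset = MuminDataset(..., **image_options)

# Opting in to the larger dataset that also contains the extra images.
full_image_options = {**image_options, "include_extra_images": True}
```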

v0.6.0

01 Dec 11:16

Changed

  • Split include_images into include_images and include_user_images, with
    the former covering images from tweets and articles, and the latter
    covering profile pictures. The former is True by default, and the latter
    False. This is because the large number of profile pictures made the
    dataset excessively large.

Fixed

  • Now catches connection errors when attempting to rehydrate tweets.

v0.5.3

26 Nov 18:12

Fixed

  • Masks have been changed to boolean tensors, as otherwise indexing did not
    work properly.
  • When a claim/tweet does not have any label, this produced NaN values in the
    mask and label tensors. These are now replaced with zeroes. This means that
    such entries will always be masked out, so their label will not matter
    anyway.
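
The two fixes above can be illustrated without any tensor library. The sketch below uses plain Python lists as stand-ins for the mask and label tensors; the function name is hypothetical and this is not the library's own code.

```python
import math


def clean_mask_and_labels(raw_mask, raw_labels):
    """Replace NaN entries with zero and return a boolean mask with int labels."""

    def nan_to_zero(value):
        return 0.0 if math.isnan(value) else value

    # bool(nan) is True in Python, so NaN must be replaced before casting.
    mask = [bool(nan_to_zero(m)) for m in raw_mask]
    labels = [int(nan_to_zero(y)) for y in raw_labels]
    return mask, labels


# The third entry has no label, so both its mask and label are NaN.
raw_mask = [1.0, 0.0, math.nan, 1.0]
raw_labels = [1.0, 0.0, math.nan, 0.0]

mask, labels = clean_mask_and_labels(raw_mask, raw_labels)

# Boolean masking keeps only the entries whose mask is True.
selected = [y for y, keep in zip(labels, mask) if keep]
```

The unlabelled entry ends up with a mask of False, so the zero written into its label never reaches `selected`.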

v0.5.2

24 Nov 14:38

Fixed

  • Now converting masks to long tensors, which is required for them to be used
    as indexing tensors in PyTorch.

Changed

  • Now only dumping dataset once while adding embeddings, where previously it
    dumped the dataset after adding embeddings to each node type. This is done to
    add embeddings faster, as the dumping of the dataset can take quite a long
    time.
  • Now blanket catching all errors when processing images and articles, as there
    were too many edge cases.

v0.5.1

24 Nov 14:38

Fixed

  • When encountering HTTP status 401 (unauthorized) during rehydration, we skip
    that batch of tweets.
  • Image relations were extracted incorrectly, because images coming directly
    from the tweets via the media_key identifier and images coming from URLs
    present in the tweets themselves were handled differently. Both are now
    correctly included in a uniform fashion.
  • Datatypes are now only set for a given node if the node is included in the
    dataset. For instance, datatypes for the article features are only set if
    include_articles == True.
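
The first fix in this release, skipping a batch on HTTP status 401, might look roughly like the sketch below. The exception class, the `fetch_batch` callable and the fake fetcher are all hypothetical stand-ins; the real rehydration code and its HTTP client are not shown in these notes.

```python
class HTTPStatusError(Exception):
    """Hypothetical error carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP status {status}")
        self.status = status


def rehydrate(batches, fetch_batch):
    """Fetch every batch, skipping those that fail with status 401."""
    tweets, skipped = [], 0
    for batch in batches:
        try:
            tweets.extend(fetch_batch(batch))
        except HTTPStatusError as err:
            if err.status != 401:
                raise  # other failures are not silently ignored here
            skipped += 1
    return tweets, skipped


def fake_fetch(batch):
    """Stand-in for a real API call; rejects batches marked as private."""
    if "private" in batch:
        raise HTTPStatusError(401)
    return [f"tweet:{tweet_id}" for tweet_id in batch]


tweets, skipped = rehydrate([["1", "2"], ["private"], ["3"]], fake_fetch)
```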

v0.5.0

08 Nov 17:45

Added

  • The Claim nodes now have language, keywords, cluster_keywords and
    cluster attributes.
  • Now sets datatypes for all the dataframes, to reduce memory usage.

Fixed

  • Updated the README to state that the dataset is saved as a single zip file,
    rather than as a bunch of CSV files.
  • Fixed image embedding shape from (1, 768) to (768,).
  • Article embeddings are now computed correctly.
  • Catch IndexError and LocationParseError when processing images.

Changed

  • Now dumping files incrementally rather than keeping all of them in memory, to
    avoid out-of-memory issues when saving the dataset.
  • Dataset size argument now defaults to 'small', rather than 'large'.
  • Updated the dataset. This is still not the final version: timelines of users
    are currently missing.
  • Now storing the dataset in a zip file of Pickle files instead of HDF. This is
    because HDF requires extra installation and imposes maximum storage limits
    on the dataframes. The resulting zip file of Pickle files is stored with
    protocol 4, making it compatible with Python 3.4 and newer. Further, the
    downloaded dataset has been heavily compressed, taking up a quarter of the
    disk space compared to the previous CSV approach. Once the dataset has been
    downloaded, it is converted to a less compressed version, which takes up
    more space but makes loading and saving much faster.
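
As an illustration of the storage format described above, the following sketch writes a few tables into a zip archive one at a time, each pickled with protocol 4, and reads them back. It uses only the standard library; the function names, the file naming scheme and the toy tables are assumptions, and real dataframes are replaced by plain lists of dicts.

```python
import io
import pickle
import zipfile


def dump_to_zip(tables, target):
    """Write each table as its own pickle file (protocol 4) inside a zip."""
    with zipfile.ZipFile(target, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, table in tables.items():
            # One table is serialised and written at a time.
            zf.writestr(f"{name}.pickle", pickle.dumps(table, protocol=4))


def load_from_zip(source):
    """Read every pickle file in the zip back into a dict of tables."""
    tables = {}
    with zipfile.ZipFile(source, mode="r") as zf:
        for filename in zf.namelist():
            name = filename[: -len(".pickle")]
            tables[name] = pickle.loads(zf.read(filename))
    return tables


# An in-memory buffer stands in for a file on disk.
buffer = io.BytesIO()
original = {"claim": [{"id": 1}], "tweet": [{"id": 2}]}
dump_to_zip(original, buffer)
restored = load_from_zip(buffer)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.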

v0.4.0

26 Oct 12:56

Fixed

  • All embeddings are now extracted from the pooler output, corresponding to the
    [CLS] token.
  • Ensured that train/val/test masks are boolean tensors when exporting to DGL,
    as opposed to binary integers.
  • Content embeddings for articles were not aggregated per chunk, but now a mean
    is taken across all content chunks.
  • Assign zero embeddings to user descriptions if they are not available.
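
The last two fixes above can be sketched in plain Python: a mean taken across chunk embeddings, with a zero vector as the fallback when there is nothing to embed. The function name and the two-dimensional toy vectors are illustrative only, and treating both cases in one helper is a simplification.

```python
def mean_embedding(chunk_embeddings, dim):
    """Average chunk embeddings element-wise; return zeros if there are none."""
    if not chunk_embeddings:
        return [0.0] * dim
    count = len(chunk_embeddings)
    return [sum(chunk[i] for chunk in chunk_embeddings) / count for i in range(dim)]


# An article split into two content chunks, each with its own embedding.
article_embedding = mean_embedding([[1.0, 2.0], [3.0, 4.0]], dim=2)

# A user without a description gets a zero embedding of the same size.
missing_embedding = mean_embedding([], dim=2)
```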

Changed

  • The to_dgl method now returns a bidirectional DGL graph.
  • The verbose argument of MuminDataset now defaults to True.
  • Now storing the dataset as a single HDF file instead of a zipped folder of
    CSV files, primarily because this preserves data types, and because HDF is
    a binary format supported by Pandas which can handle multidimensional
    ndarrays as entries in a dataframe.
  • The default models used to embed texts and images are now xlm-roberta-base
    and google/vit-base-patch16-224-in21k.
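
To make the first change above concrete, the sketch below shows one way to turn a typed edge list into a bidirectional one by adding a reverse relation for every relation. It does not use DGL, and the `_inv` suffix for reverse relation names is an assumption made for this example rather than something stated in these notes.

```python
def make_bidirectional(edges):
    """Add a reversed copy of every relation.

    `edges` maps (source type, relation, destination type) to a list of
    (source id, destination id) pairs.
    """
    result = dict(edges)
    for (src_type, relation, dst_type), pairs in edges.items():
        reverse_key = (dst_type, f"{relation}_inv", src_type)
        result[reverse_key] = [(dst, src) for src, dst in pairs]
    return result


edges = {("user", "posted", "tweet"): [(0, 10), (1, 11)]}
bidirectional = make_bidirectional(edges)
```

With reverse edges in place, information can flow in both directions along a relation, for example from tweets back to the users who posted them.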

Removed

  • Removed the poll and place nodes, as they were too few to matter.
  • Removed the (:User)-[:HAS_PINNED]->(:Tweet) relation, as there were too few
    of them to matter.

v0.3.1

19 Oct 10:52

Fixed

  • Fixed the shape of the user description embeddings.