Releases: MuMiN-dataset/mumin-build
v1.0.1
Fixed
- Added in the `POSTED` relation, as leaving this out effectively meant that
all the new tweets were filtered out during compilation.
v1.0.0
Changed
- Added a new version of the dataset, which now includes a sample of ~100
timeline tweets for every user. This approximately doubles the dataset size,
to ~200MB before compilation. The new dataset also uses different
train/val/test splits, which are now 80/10/10 rather than 60/10/30. This
means that the training split now covers a more varied set of events (6-7),
compared to the previous 2.
v0.7.0
Changed
- Changed `include_images` to `include_tweet_images`, which now only includes
the images from the tweets themselves. Further, `include_user_images` is
changed to `include_extra_images`, which now includes both profile pictures
and the top images from articles. The tweet pictures are included by default,
and the extras are not. This is to reduce the size of the default dataset, to
make it easier to use.
v0.6.0
Changed
- Split up `include_images` into `include_images` and `include_user_images`,
with the former including images from tweets and articles, and the latter
being profile pictures. The former has been set to True by default, and the
latter to False. This is due to the large amount of profile pictures making
the dataset excessively large.
Fixed
- Now catches connection errors when attempting to rehydrate tweets.
v0.5.3
Fixed
- Masks have been changed to boolean tensors, as otherwise indexing did not
work properly.
- In the case where a claim/tweet does not have any label, this produces NaN
values in the mask and label tensors. These are now replaced with zeroes.
This means that they will always be masked out, and so the label will not
matter anyway.
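
To illustrate the two fixes above, here is a minimal pure-Python sketch. It is not the package's code, which operates on PyTorch tensors; the sample values and the `nan_to_zero` helper are made up for the example. It shows why replacing NaN with zero means unlabelled nodes are always masked out, and why the mask is treated as per-node flags rather than as positions.

```python
import math

# Hypothetical raw values: NaN marks a node without any label.
raw_labels = [1.0, float("nan"), 0.0, 1.0]
raw_train_mask = [1.0, float("nan"), 0.0, 1.0]


def nan_to_zero(values):
    """Replace NaN entries with 0.0, leaving other values untouched."""
    return [0.0 if math.isnan(value) else value for value in values]


labels = [int(value) for value in nan_to_zero(raw_labels)]

# Convert to booleans so the mask acts as a per-node flag, not as positions.
train_mask = [bool(value) for value in nan_to_zero(raw_train_mask)]

# Boolean selection keeps only the nodes whose flag is True, so the
# placeholder label of the unlabelled node (index 1) is never used.
train_labels = [label for label, keep in zip(labels, train_mask) if keep]
```

With integer masks such as `[1, 0, 0, 1]`, tensor indexing would instead pick the elements at positions 1, 0, 0 and 1, which is the behaviour the boolean conversion avoids.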
v0.5.2
Fixed
- Now converting masks to long tensors, which is required for them to be used
as indexing tensors in PyTorch.
Changed
- Now only dumping the dataset once while adding embeddings, where previously
it dumped the dataset after adding embeddings to each node type. This is done
to add embeddings faster, as the dumping of the dataset can take quite a long
time.
- Now blanket catching all errors when processing images and articles, as there
were too many edge cases.
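
The two changes above can be sketched roughly as follows. This is a simplified illustration and not the package's implementation; `add_embeddings`, `process_all`, and the `embed`, `dump` and `process` callables are hypothetical names introduced for the example.

```python
def add_embeddings(nodes_by_type, embed, dump):
    """Embed every node type, then dump the dataset a single time."""
    for node_type, nodes in nodes_by_type.items():
        nodes_by_type[node_type] = [embed(node) for node in nodes]
    dump(nodes_by_type)  # one dump at the end instead of one per node type


def process_all(items, process):
    """Process items, skipping any item whose processing raises."""
    results = []
    for item in items:
        try:
            results.append(process(item))
        except Exception:  # deliberately broad: too many edge cases to list
            continue
    return results


# Record every dump so that the number of dumps can be inspected.
dump_calls = []
add_embeddings(
    {"tweet": ["a", "b"], "user": ["c"]},
    embed=str.upper,
    dump=lambda data: dump_calls.append(dict(data)),
)

# "x" cannot be parsed as an integer, so it is skipped rather than crashing.
parsed = process_all(["1", "x", "3"], int)
```

The broad `except Exception` trades precise error reporting for robustness, which matches the rationale given in the changelog entry.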
v0.5.1
Fixed
- When encountering HTTP status 401 (unauthorized) during rehydration, we skip
that batch of tweets.
- Image relations were extracted incorrectly, due to a wrong treatment of the
images coming directly from the tweets via the `media_key` identifier, and
the images coming from URLs present in the tweets themselves. Both are now
correctly included in a uniform fashion.
- Datatypes are now only set for a given node if the node is included in the
dataset. For instance, datatypes for the article features are only set if
`include_articles == True`.
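
As a rough sketch of the first fix above, the loop below skips a batch when a 401 is raised and lets every other error propagate. The `HTTPStatusError` class, `rehydrate`, and `fake_fetch` are hypothetical stand-ins; the package's actual request handling is not shown in this changelog.

```python
class HTTPStatusError(Exception):
    """Hypothetical error carrying the HTTP status code of a failed request."""

    def __init__(self, status_code):
        super().__init__(f"HTTP status {status_code}")
        self.status_code = status_code


def rehydrate(batches, fetch_batch):
    """Fetch every batch, skipping batches that fail with HTTP 401."""
    tweets = []
    for batch in batches:
        try:
            tweets.extend(fetch_batch(batch))
        except HTTPStatusError as error:
            if error.status_code == 401:
                continue  # unauthorized: skip this batch and carry on
            raise  # any other status is still treated as a real failure
    return tweets


def fake_fetch(batch):
    """Stand-in for a real API call, used only for this example."""
    if "private" in batch:
        raise HTTPStatusError(401)
    return [f"tweet:{tweet_id}" for tweet_id in batch]


tweets = rehydrate([["1", "2"], ["private"], ["3"]], fake_fetch)
```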
v0.5.0
Added
- The `Claim` nodes now have `language`, `keywords`, `cluster_keywords` and
`cluster` attributes.
- Now sets datatypes for all the dataframes, to reduce memory usage.
Fixed
- Updated the `README` to state that the dataset is saved as a single zip
file, rather than stating that the dataset is saved as a bunch of CSV files.
- Fixed image embedding shape from (1, 768) to (768,).
- Article embeddings are now computed correctly.
- Catch `IndexError` and `LocationParseError` when processing images.
Changed
- Now dumping files incrementally rather than keeping all of them in memory, to
avoid out-of-memory issues when saving the dataset.
- The dataset `size` argument now defaults to 'small', rather than 'large'.
- Updated the dataset. This is still not the final version: timelines of users
are currently missing.
- Now storing the dataset in a zip file of Pickle files instead of HDF. This is
because HDF requires extra installation, and because of there being maximal
storage requirements in the dataframes when storing as HDF. The resulting zip
file of Pickle files is stored with protocol 4, making it compatible with
Python 3.4 and newer. Further, the dataset being downloaded has been heavily
compressed, taking up a quarter of the disk space compared to the previous
CSV approach. When the dataset has been downloaded it will be converted to a
less compressed version, taking up more space but making loading and saving
much faster.
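
The storage format described above can be illustrated with the standard library alone. This is a minimal sketch and not the package's code: the real dataset stores dataframes, whereas the example uses plain lists, and the `dump_to_zip` and `load_from_zip` helpers and the `.pickle` file naming are assumptions made for the illustration.

```python
import io
import pickle
import zipfile


def dump_to_zip(tables, target):
    """Pickle each table with protocol 4 and add it to the archive in turn."""
    with zipfile.ZipFile(
        target, mode="w", compression=zipfile.ZIP_DEFLATED
    ) as archive:
        for name, table in tables.items():
            # Only one serialised table is held in memory at a time.
            archive.writestr(f"{name}.pickle", pickle.dumps(table, protocol=4))


def load_from_zip(source):
    """Read every pickled table back from the archive."""
    with zipfile.ZipFile(source, mode="r") as archive:
        return {
            filename[: -len(".pickle")]: pickle.loads(archive.read(filename))
            for filename in archive.namelist()
        }


# An in-memory buffer stands in for a file on disk.
buffer = io.BytesIO()
original = {"tweet": [{"tweet_id": 1}], "user": [{"user_id": 7}]}
dump_to_zip(original, buffer)
restored = load_from_zip(buffer)
```

Choosing `zipfile.ZIP_STORED` instead of `zipfile.ZIP_DEFLATED` would give a larger archive that is quicker to read and write, which is the same trade-off the entry describes for the downloaded and converted versions.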
v0.4.0
Fixed
- All embeddings are now extracted from the pooler output, corresponding to the
`[CLS]` token.
- Ensured that train/val/test masks are boolean tensors when exporting to DGL,
as opposed to binary integers.
- Content embeddings for articles were not aggregated per chunk, but now a mean
is taken across all content chunks.
- Assign zero embeddings to user descriptions if they are not available.
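
The last two fixes above amount to simple vector operations, sketched here in pure Python. The function names and the tiny embedding sizes are hypothetical; the package itself works with model outputs of much higher dimension.

```python
def mean_embedding(chunk_embeddings):
    """Average a list of equally sized chunk embeddings element-wise."""
    num_chunks = len(chunk_embeddings)
    return [sum(values) / num_chunks for values in zip(*chunk_embeddings)]


def description_embedding(embedding, dim):
    """Fall back to a zero vector when no description embedding exists."""
    return embedding if embedding is not None else [0.0] * dim


# An article split into two content chunks, each embedded separately.
article = mean_embedding([[1.0, 2.0], [3.0, 6.0]])

# A user without a description gets a zero vector of the same size.
missing = description_embedding(None, dim=3)
```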
Changed
- The `to_dgl` method now returns a bidirectional DGL graph.
- The `verbose` argument of `MuminDataset` now defaults to `True`.
- Now storing the dataset as a single HDF file instead of a zipped folder of
CSV files, primarily because data types are being preserved in this way, and
because HDF is a binary format supported by Pandas which can handle
multidimensional ndarrays as entries in a dataframe.
- The default models used to embed texts and images are now
`xlm-roberta-base` and `google/vit-base-patch16-224-in21k`.
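
As a rough illustration of what a bidirectional graph means in the first change above, the sketch below adds the reverse of every edge in a plain edge list. It does not use DGL and is not the package's implementation; how `to_dgl` names or types the reverse relations is not stated here, and `make_bidirectional` is a hypothetical helper.

```python
def make_bidirectional(edges):
    """Return the edges plus the reverse of each edge, without duplicates."""
    seen = set()
    result = []
    for src, dst in edges:
        for edge in ((src, dst), (dst, src)):
            if edge not in seen:
                seen.add(edge)
                result.append(edge)
    return result


edges = make_bidirectional([("user_1", "tweet_1"), ("tweet_1", "tweet_2")])
```

With reverse edges present, information can flow in both directions during message passing, for example from a tweet back to the user who posted it.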
Removed
- Removed the `poll` and `place` nodes, as they were too few to matter.
- Removed the `(:User)-[:HAS_PINNED]->(:Tweet)` relation, as there were too few
of them to matter.
v0.3.1
Fixed
- Fixed the shape of the user description embeddings.