Releases: MuMiN-dataset/mumin-build

v1.10.0

31 Jul 14:39
ff3f564

Added

  • Added n_jobs and chunksize arguments to MuminDataset, so that the number of
    parallel jobs and the chunk size can be customised.
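As a rough illustration of what a chunksize setting controls, the sketch below processes items in fixed-size chunks, so that only one chunk is held in memory at a time. The helper name iter_chunks is hypothetical and is not part of mumin.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(items: Iterable[T], chunksize: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `chunksize` items (hypothetical helper)."""
    if chunksize < 1:
        raise ValueError("chunksize must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


# 25 items with chunksize=10 give chunks of 10, 10 and 5 items.
chunks = list(iter_chunks(range(25), chunksize=10))
chunk_lengths = [len(chunk) for chunk in chunks]
```

A smaller chunksize lowers peak memory at the cost of more scheduling overhead per item.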

Changed

  • Lowered the default value of chunksize from 50 to 10, which also lowers the memory
    requirements when processing articles and images, as fewer of these are kept in
    memory at a time.
  • Now stores all images as uint8 NumPy arrays rather than int64, reducing memory
    usage of images significantly.
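The effect of the dtype change can be checked with NumPy directly. The sketch below assumes a 224 x 224 RGB image purely for illustration; pixel values in the range 0 to 255 fit in uint8, which takes one byte per value instead of eight.

```python
import numpy as np

# Illustrative image size; the actual sizes in the dataset will vary.
height, width, channels = 224, 224, 3

image_int64 = np.zeros((height, width, channels), dtype=np.int64)
image_uint8 = image_int64.astype(np.uint8)

bytes_int64 = image_int64.nbytes  # 8 bytes per value
bytes_uint8 = image_uint8.nbytes  # 1 byte per value
ratio = bytes_int64 // bytes_uint8
```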

v1.9.0

22 Jul 10:53
29abdc0

Added

  • Added checkpoint after rehydration. This means that if compilation fails for whatever
    reason after this point, the next compilation will resume after the rehydration
    process.
  • Added some more unit tests.
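The checkpointing idea can be sketched as follows. This is a generic pattern under stated assumptions, not the mumin implementation: the function name compile_with_checkpoint, the JSON file format and the callables are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def compile_with_checkpoint(
    checkpoint_path: Path,
    rehydrate: Callable[[], List[dict]],
    post_process: Callable[[List[dict]], List[dict]],
) -> List[dict]:
    """Run `rehydrate` at most once, caching its result before post-processing."""
    if checkpoint_path.exists():
        # Resume: reuse the result of an earlier rehydration.
        tweets = json.loads(checkpoint_path.read_text())
    else:
        tweets = rehydrate()
        checkpoint_path.write_text(json.dumps(tweets))
    return post_process(tweets)


calls = {"rehydrate": 0}


def fake_rehydrate() -> List[dict]:
    calls["rehydrate"] += 1
    return [{"id": 1}, {"id": 2}]


def failing_step(tweets: List[dict]) -> List[dict]:
    raise RuntimeError("simulated failure after rehydration")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "checkpoint.json"
    try:
        compile_with_checkpoint(path, fake_rehydrate, failing_step)
    except RuntimeError:
        pass  # the first compilation fails after the checkpoint was written
    # The second compilation resumes from the checkpoint.
    result = compile_with_checkpoint(path, fake_rehydrate, lambda tweets: tweets)
```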

Fixed

  • Fixed bug on Windows where some tweet IDs were negative.
  • Fixed another bug on Windows where the timeout decorator did not work, due to the use
    of signals, which are not available on Windows machines.
  • Fixed bug on macOS causing Python to crash during parallel extraction of articles and
    images.
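For context on the timeout fix: signal-based timeouts typically rely on signal.SIGALRM, which does not exist on Windows. One portable alternative is a thread-based decorator, sketched below. This is an assumption about a possible approach, not necessarily the one used in mumin, and the decorator name timeout is illustrative.

```python
import concurrent.futures
import functools
import time


def timeout(seconds: float):
    """Raise TimeoutError if the decorated function runs longer than `seconds`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
            future = executor.submit(func, *args, **kwargs)
            try:
                return future.result(timeout=seconds)
            except concurrent.futures.TimeoutError:
                raise TimeoutError(f"{func.__name__} exceeded {seconds} s") from None
            finally:
                # Do not block on the worker; it keeps running in the background.
                executor.shutdown(wait=False)

        return wrapper

    return decorator


@timeout(0.05)
def slow() -> str:
    time.sleep(0.3)
    return "done"


@timeout(1.0)
def fast() -> str:
    return "done"


try:
    slow()
    timed_out = False
except TimeoutError:
    timed_out = True

fast_result = fast()
```

Note that this approach abandons the worker thread rather than killing it, so it suits I/O-bound calls better than long CPU-bound ones.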

Changed

  • Refactored repository to use the more modern pyproject.toml with poetry.

v1.8.0

14 Apr 12:04

Changed

  • Now allows instantiation of MuminDataset without a Twitter bearer token,
    whether as an explicit argument or as an environment variable, which is
    useful for pre-compiled datasets. If the dataset needs to be compiled, a
    RuntimeError is raised when calling the compile method.
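The behaviour described above amounts to deferring the token check from construction to compilation. A minimal sketch of that pattern follows; the class TokenOptionalDataset and its attributes are hypothetical stand-ins, not the actual MuminDataset code.

```python
from typing import Optional


class TokenOptionalDataset:
    """Hypothetical stand-in; not the actual MuminDataset implementation."""

    def __init__(self, bearer_token: Optional[str] = None, compiled: bool = False):
        # No token check here, so construction always succeeds.
        self.bearer_token = bearer_token
        self.compiled = compiled

    def compile(self) -> "TokenOptionalDataset":
        if self.compiled:
            return self  # nothing to fetch, so no token is needed
        if self.bearer_token is None:
            raise RuntimeError("A Twitter bearer token is required to compile.")
        self.compiled = True
        return self


# A pre-compiled dataset works without any token.
precompiled = TokenOptionalDataset(compiled=True).compile()

# An uncompiled dataset without a token fails only when compile is called.
try:
    TokenOptionalDataset().compile()
    raised = False
except RuntimeError:
    raised = True
```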

v1.7.0

24 Mar 10:05

Added

  • Now allows setting twitter_bearer_token=None in the constructor of
    MuminDataset, in which case the environment variable TWITTER_API_KEY is
    used instead. This variable can be stored in a separate .env file. None is
    now the default value of twitter_bearer_token.
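The fallback logic can be sketched with the standard library alone. The function name resolve_bearer_token is hypothetical, and the token values are placeholders; loading a .env file into the environment usually needs a helper package such as python-dotenv, which is omitted here.

```python
import os
from typing import Optional


def resolve_bearer_token(twitter_bearer_token: Optional[str] = None) -> Optional[str]:
    """Return the explicit token if given, else the TWITTER_API_KEY variable."""
    if twitter_bearer_token is not None:
        return twitter_bearer_token
    return os.environ.get("TWITTER_API_KEY")


# Placeholder value, set here only so that the example is self-contained.
os.environ["TWITTER_API_KEY"] = "token-from-environment"

from_env = resolve_bearer_token()
explicit = resolve_bearer_token("explicit-token")
```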

Changed

  • Replaced DataFrame.append calls with pd.concat, as the former is
    deprecated and will be removed from pandas in the future.
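For reference, the migration looks roughly like this. The column names are made up for illustration. DataFrame.append was removed in pandas 2.0, so pd.concat is the form that works across versions.

```python
import pandas as pd

# Illustrative data; the column names are not taken from the dataset.
df = pd.DataFrame({"tweet_id": [1, 2], "text": ["a", "b"]})
new_rows = pd.DataFrame({"tweet_id": [3], "text": ["c"]})

# Formerly: df = df.append(new_rows, ignore_index=True)
df = pd.concat([df, new_rows], ignore_index=True)
```

When appending many times in a loop, collecting the pieces in a list and calling pd.concat once at the end avoids repeated copying.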

v1.6.2

21 Mar 18:57

Fixed

  • Now removes claims that are only connected to deleted tweets when calling
    to_dgl. Previously, these claims caused a mismatch between the nodes in the
    dataset (which include deleted ones) and the nodes in the DGL graph (which
    do not).
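The filtering step can be illustrated on toy data. The identifiers and the edge list below are invented for the example, and the real implementation in to_dgl may differ.

```python
# Toy data: three claims, three tweets, two of which have been deleted.
claims = {"c1", "c2", "c3"}
deleted_tweets = {"t2", "t3"}

# (tweet, claim) pairs, meaning that the tweet discusses the claim.
discusses = [("t1", "c1"), ("t2", "c1"), ("t2", "c2"), ("t3", "c3")]

# A claim is kept only if at least one non-deleted tweet is connected to it.
claims_with_live_tweet = {
    claim for tweet, claim in discusses if tweet not in deleted_tweets
}
kept_claims = claims & claims_with_live_tweet
```

Here c1 survives because t1 still exists, while c2 and c3 are dropped because all of their tweets were deleted.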

v1.6.1

17 Mar 13:13

Fixed

  • Now correctly catches JSONDecodeError during rehydration.
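A minimal sketch of catching this error when decoding a response body is shown below. The function name parse_response and the choice to return None on failure are assumptions made for the example.

```python
import json
from typing import Optional


def parse_response(body: str) -> Optional[dict]:
    """Return the decoded JSON object, or None if the body is not valid JSON."""
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        # For example, an HTML error page returned instead of JSON.
        return None


ok = parse_response('{"data": [{"id": "1"}]}')
bad = parse_response("<html>Service unavailable</html>")
```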

v1.6.0

10 Mar 11:52

v1.5.0

19 Feb 20:04

Changed

  • Now using dicts rather than Series in to_dgl. This reduced the wall time
    from 1.5 hours to 2 seconds!
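The general idea is that repeated label lookups on a pandas Series carry far more per-call overhead than lookups in a plain dict. The sketch below shows the pattern on invented data; it is one plausible reading of the change, not the actual to_dgl code.

```python
import pandas as pd

# Invented mapping from node identifiers to integer node indices.
node_index = pd.Series([0, 1, 2], index=["tweet_a", "tweet_b", "tweet_c"])
edges = [("tweet_a", "tweet_c"), ("tweet_b", "tweet_a")]

# Convert once, then use plain dict lookups inside the loop.
lookup = node_index.to_dict()
edge_indices = [(lookup[src], lookup[dst]) for src, dst in edges]
```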

Fixed

  • Fixed a bug in the call to dgl.data.utils.load_graphs that caused
    load_dgl_graph to fail.

v1.4.1

19 Feb 17:03

Changed

  • Now only saves dataset at the end of add_embeddings if any embeddings were
    added.

v1.4.0

19 Feb 16:16

Added

  • The to_dgl method is now parallelised, which speeds up export
    significantly.
  • Added convenience functions save_dgl_graph and load_dgl_graph, which
    store the Boolean train/val/test masks as unsigned 8-bit integers and
    handle the conversion. Using the DGL-native save_graphs and load_graphs
    causes an error, as they cannot handle Boolean tensors. The two convenience
    functions can be imported with
    from mumin import save_dgl_graph, load_dgl_graph.
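The mask conversion itself is a simple round trip. The sketch below uses NumPy arrays as a stand-in for the tensors stored on the DGL graph, so it only illustrates the dtype conversion and does not call any mumin or DGL function.

```python
import numpy as np

# Stand-in for a Boolean train mask stored on the graph.
train_mask = np.array([True, False, True, True])

stored = train_mask.astype(np.uint8)  # representation written to disk
restored = stored.astype(bool)        # converted back after loading
```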