Recent Releases of mumin
mumin - v1.10.0
Added
- Added `n_jobs` and `chunksize` arguments to `MuminDataset`, to enable customisation of these.
Changed
- Lowered the default value of `chunksize` from 50 to 10, which also lowers the memory requirements when processing articles and images, as fewer of these are kept in memory at a time.
- Now stores all images as `uint8` NumPy arrays rather than `int64`, reducing the memory usage of images significantly.
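A minimal sketch of how the new arguments might be passed; the import path matches the library's other examples, but the bearer token value is a placeholder:

```python
from mumin import MuminDataset

# Placeholder bearer token; n_jobs and chunksize are the
# arguments introduced in this release
dataset = MuminDataset(
    twitter_bearer_token='<your-bearer-token>',
    n_jobs=4,       # number of parallel workers
    chunksize=10,   # articles/images kept in memory per chunk (new default)
)
dataset.compile()
```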
Published by saattrupdan over 3 years ago
mumin - v1.9.0
Added
- Added checkpoint after rehydration. This means that if compilation fails for whatever reason after this point, the next compilation will resume after the rehydration process.
- Added some more unit tests.
Fixed
- Fixed a bug on Windows where some tweet IDs were negative.
- Fixed another bug on Windows where the timeout decorator did not work, due to its use of signals, which are not available on Windows machines.
- Fixed a bug on macOS causing Python to crash during parallel extraction of articles and images.
Changed
- Refactored the repository to use the more modern `pyproject.toml` with `poetry`.
Published by saattrupdan over 3 years ago
mumin - v1.8.0
Changed
- Now allows instantiation of `MuminDataset` without any Twitter bearer token, neither as an explicit argument nor as an environment variable, which is useful for pre-compiled datasets. If the dataset needs to be compiled then a `RuntimeError` will be raised when calling the `compile` method.
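A minimal sketch of the new behaviour, assuming a pre-compiled dataset is available:

```python
from mumin import MuminDataset

# No bearer token supplied, neither as an argument nor via the
# environment; this is fine for pre-compiled datasets
dataset = MuminDataset()

# If the dataset actually needs compiling, this raises a RuntimeError
dataset.compile()
```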
Published by saattrupdan almost 4 years ago
mumin - v1.7.0
Added
- Now allows setting `twitter_bearer_token=None` in the constructor of `MuminDataset`, in which case the environment variable `TWITTER_API_KEY` is used instead; this variable can be stored in a separate `.env` file. `None` is now the default value of `twitter_bearer_token`.
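A sketch of the workflow, assuming a `.env` file containing a line such as `TWITTER_API_KEY=<your-bearer-token>`:

```python
from mumin import MuminDataset

# twitter_bearer_token=None is now the default, so this is equivalent
# to MuminDataset(); the token is read from the TWITTER_API_KEY
# environment variable, which may be defined in a .env file
dataset = MuminDataset(twitter_bearer_token=None)
```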
Changed
- Replaced `DataFrame.append` calls with `pd.concat`, as the former is deprecated and will be removed from `pandas` in the future.
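For illustration, the deprecated pattern and its replacement look like this (toy dataframes, not taken from the library):

```python
import pandas as pd

df = pd.DataFrame({'tweet_id': [1, 2]})
new_rows = pd.DataFrame({'tweet_id': [3]})

# Deprecated, and removed in pandas 2.0:
# df = df.append(new_rows, ignore_index=True)

# Replacement:
df = pd.concat([df, new_rows], ignore_index=True)
```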
Published by saattrupdan almost 4 years ago
mumin - v1.6.2
Fixed
- Now removes claims that are only connected to deleted tweets when calling `to_dgl`. This previously caused a bug, due to a mismatch between the nodes in the dataset (which include deleted ones) and the nodes in the DGL graph (which do not).
Published by saattrupdan almost 4 years ago
mumin - v1.6.0
- Changed the download link from Git-LFS to the official data.bris data repository, with URI https://doi.org/10.5523/bris.23yv276we2mll25fjakkfim2ml.
Published by saattrupdan almost 4 years ago
mumin - v1.4.0
Added
- The `to_dgl` method is now parallelised, speeding up the export significantly.
- Added convenience functions `save_dgl_graph` and `load_dgl_graph`, which store the Boolean train/val/test masks as unsigned 8-bit integers and handle the conversion. Using the `dgl`-native `save_graphs` and `load_graphs` causes an error, as they cannot handle Boolean tensors. These two convenience functions can be imported simply as `from mumin import save_dgl_graph, load_dgl_graph`.
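A sketch of the intended usage; the exact signatures are not documented here, so the file path argument is an assumption:

```python
from mumin import MuminDataset, save_dgl_graph, load_dgl_graph

dataset = MuminDataset(twitter_bearer_token='<your-bearer-token>')
dataset.compile()
dgl_graph = dataset.to_dgl()

# Boolean train/val/test masks are stored as uint8 and converted back
# on load, avoiding the error raised by dgl's save_graphs/load_graphs
save_dgl_graph(dgl_graph, 'mumin.dgl')   # file path argument assumed
dgl_graph = load_dgl_graph('mumin.dgl')
```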
Published by saattrupdan about 4 years ago
mumin - v1.2.0
Changed
- If tweets have been deleted (and thus cannot be rehydrated) then we keep them along with their related entities, just without being able to populate their features. When exporting to DGL, neither these tweets nor their replies are included.
Added
- Now includes a check that tweets are actually rehydrated, raising an error if they are not. Such an error is usually due to the supplied Twitter bearer token being invalid.
Published by saattrupdan about 4 years ago
mumin - v1.1.0
Fixed
- Updated the dataset with deduplicated entries. The deduplication is done such that the duplicate with the largest `relevance` parameter is kept.
- Now includes checks of whether nodes and relations exist before extracting data from them.
Added
- Added an `include_timelines` option, which allows one to leave out the extra timeline tweets when they are not needed. As timelines greatly increase the number of tweets that need to be rehydrated, the option defaults to `False`.
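A sketch of opting back in to timelines (bearer token is a placeholder):

```python
from mumin import MuminDataset

# include_timelines defaults to False; set it explicitly to also
# rehydrate the extra timeline tweets
dataset = MuminDataset(
    twitter_bearer_token='<your-bearer-token>',
    include_timelines=True,
)
```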
Published by saattrupdan about 4 years ago
mumin - v1.0.2
Fixed
- Removed the relations from the dump that are obtained through compilation anyway.
- Updated the filtering mechanism, so that the `relevance` parameter is built into all nodes and relations upon download.
- Now deals with the situation where no relations of a certain type exist above a specified threshold.
Published by saattrupdan about 4 years ago
mumin - v1.0.0
Changed
- Added a new version of the dataset, which now includes a sample of ~100 timeline tweets for every user. This approximately doubles the dataset size to ~200MB before compilation. The new dataset also uses different train/val/test splits, now 80/10/10 rather than 60/10/30, meaning that the training dataset will cover a much more varied set of events (6-7) compared to the previous 2.
Published by saattrupdan about 4 years ago
mumin - v0.7.0
Changed
- Changed `include_images` to `include_tweet_images`, which now only includes the images from the tweets themselves. Further, `include_user_images` has been changed to `include_extra_images`, which now includes both profile pictures and the top images from articles. The tweet pictures are included by default and the extras are not, in order to reduce the size of the default dataset and make it easier to use.
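A sketch of the renamed flags and their new defaults (bearer token is a placeholder):

```python
from mumin import MuminDataset

dataset = MuminDataset(
    twitter_bearer_token='<your-bearer-token>',
    include_tweet_images=True,   # default: images attached to the tweets
    include_extra_images=False,  # default: no profile pictures or article images
)
```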
Published by saattrupdan about 4 years ago
mumin - v0.6.0
Changed
- Split up `include_images` into `include_images` and `include_user_images`, with the former including images from tweets and articles, and the latter profile pictures. The former defaults to `True` and the latter to `False`, as the large number of profile pictures made the dataset excessively large.
Fixed
- Now catches connection errors when attempting to rehydrate tweets.
Published by saattrupdan about 4 years ago
mumin - v0.5.3
Fixed
- Masks have been changed to boolean tensors, as otherwise indexing did not work properly.
- In the case where a claim/tweet does not have any label, NaN values appear in the mask and label tensors. These are now substituted for zeroes, which means that such entries will always be masked out, so the label does not matter anyway.
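The effect can be illustrated with a small PyTorch example (toy tensors, not the library's internals):

```python
import torch

# A claim without a label yields NaN in the label and mask tensors
labels = torch.tensor([1.0, float('nan'), 0.0])
mask = torch.tensor([1.0, float('nan'), 1.0])

# NaNs are substituted for zeroes, so unlabelled entries are masked out
labels = torch.nan_to_num(labels, nan=0.0)
mask = torch.nan_to_num(mask, nan=0.0).bool()

labelled = labels[mask]  # Boolean indexing now works: tensor([1., 0.])
```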
Published by saattrupdan about 4 years ago
mumin - v0.5.1
Fixed
- When encountering HTTP status 401 (unauthorized) during rehydration, we skip that batch of tweets.
- Image relations were extracted incorrectly, due to a wrong treatment of the images coming directly from the tweets via the `media_key` identifier versus the images coming from URLs present in the tweets themselves. Both are now correctly included in a uniform fashion.
- Datatypes are now only set for a given node if the node is included in the dataset. For instance, datatypes for the article features are only set if `include_articles == True`.
Published by saattrupdan about 4 years ago
mumin - v0.5.2
Fixed
- Now converting masks to long tensors, which is required for them to be used as indexing tensors in PyTorch.
Changed
- Now only dumping dataset once while adding embeddings, where previously it dumped the dataset after adding embeddings to each node type. This is done to add embeddings faster, as the dumping of the dataset can take quite a long time.
- Now blanket catching all errors when processing images and articles, as there were too many edge cases.
Published by saattrupdan about 4 years ago
mumin - v0.5.0
Added
- The `Claim` nodes now have `language`, `keywords`, `cluster_keywords` and `cluster` attributes.
- Now sets datatypes for all the dataframes, to reduce memory usage.
Fixed
- Updated the `README` to state that the dataset is stored as a single zip file, rather than as a bunch of CSV files.
- Fixed the image embedding shape from (1, 768) to (768,).
- Article embeddings are now computed correctly.
- Catch `IndexError` and `LocationParseError` when processing images.
Changed
- Now dumping files incrementally rather than keeping all of them in memory, to avoid out-of-memory issues when saving the dataset.
- The dataset `size` argument now defaults to 'small' rather than 'large'.
- Updated the dataset. This is still not the final version: timelines of users are currently missing.
- Now storing the dataset as a zip file of Pickle files instead of HDF. This is because HDF requires an extra installation, and storing the dataframes as HDF imposes maximal storage requirements. The resulting zip file of Pickle files is stored with protocol 4, making it compatible with Python 3.4 and newer. Further, the downloaded dataset has been heavily compressed, taking up a quarter of the disk space compared to the previous CSV approach. Once the dataset has been downloaded, it is converted to a less compressed version, taking up more space but making loading and saving much faster.
Published by saattrupdan over 4 years ago
mumin - v0.4.0
Fixed
- All embeddings are now extracted from the pooler output, corresponding to the `[CLS]` tag.
- Ensured that train/val/test masks are Boolean tensors when exporting to DGL, as opposed to binary integers.
- Content embeddings for articles were not aggregated per chunk, but now a mean is taken across all content chunks.
- Assign zero embeddings to user descriptions if they are not available.
Changed
- The DGL graph returned by the `to_dgl` method is now bidirectional.
- The `verbose` argument of `MuminDataset` now defaults to `True`.
- Now storing the dataset as a single HDF file instead of a zipped folder of CSV files, primarily because data types are preserved this way, and because HDF is a binary format supported by Pandas which can handle multidimensional ndarrays as entries in a dataframe.
- The default models used to embed texts and images are now `xlm-roberta-base` and `google/vit-base-patch16-224-in21k`.
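A sketch of how pooler-output text embeddings can be extracted with Hugging Face `transformers`; the library's actual embedding pipeline may differ in its details:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
model = AutoModel.from_pretrained('xlm-roberta-base')

inputs = tokenizer('Example claim text', truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# The pooler output corresponds to the transformed [CLS]/<s> token
embedding = outputs.pooler_output.squeeze(0)  # shape: (768,)
```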
Removed
- Removed the `poll` and `place` nodes, as they were too few to matter.
- Removed the `(:User)-[:HAS_PINNED]->(:Tweet)` relation, as there were too few of them to matter.
Published by saattrupdan over 4 years ago
mumin - v0.3.0
Fixed
- Now catches `SSLError` and `OSError` when processing images.
- Now catches `ReadTimeoutError` when processing articles.
- The `(:Tweet)-[:MENTIONS]->(:User)` relation was missing in the dataset. It has now been added back in.
- Added tokenizer truncation when adding node embeddings.
- Fixed an issue with embedding user descriptions when the description is not available.
Changed
- Changed the download link to the dataset, which now fetches the dataset from a specific commit, enabling proper dataset versioning.
- Changed the timeout parameter when downloading images from five seconds to ten seconds.
- Now processing 50 articles and images on each worker, compared to the previous 5.
- When loading in an existing dataset, auxiliaries and islands are removed. This ensures that `to_dgl` works properly.
Removed
- Removed the review warning from the `README` and from dataset initialisation. The dataset is still not complete, in the sense that retweets and timelines are yet to be added, but we will simply keep versioning the dataset until these extra features have been included.
Published by saattrupdan over 4 years ago
mumin - v0.2.0
Added
- Added claim embeddings to Claim nodes, being the transformer embeddings of the claims translated to English, as described in the paper.
- Added a train/val/test split to claim nodes. When exporting to DGL using the `to_dgl` method, the Claim and Tweet nodes will have `train_mask`, `val_mask` and `test_mask` attributes that can be used to control loss and metric calculation. These are consistent, meaning that tweets connected to claims will always belong to the same split.
- Added labels to both Tweet and Claim nodes.
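A sketch of how these masks might be used on the exported graph; the node-type and feature key names follow the description above, but the exact keys are an assumption:

```python
dgl_graph = dataset.to_dgl()

# Restrict loss/metric computation to the training split of the claims
claim_data = dgl_graph.nodes['claim'].data       # node type name assumed
train_mask = claim_data['train_mask'].bool()
train_labels = claim_data['label'][train_mask]   # 'label' key assumed
```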
Fixed
- Properly embeds reviewers of claims in case a claim has been reviewed by multiple reviewers.
- Load claim embeddings properly.
- Catches the `TooManyRequests` exception when extracting images.
- Load dataset CSVs with the Python engine, as the C engine caused errors.
- Disable tokenizer parallelism, which caused warning messages during rehydration of tweets.
- Ensure proper quoting of strings when dumping dataset to CSVs.
- Enable truncation of strings before tokenizing, when embedding texts.
- Converted masks to integers, as they previously caused an issue when exporting to a DGL graph.
- Fixed a bug when computing reviewer embeddings for claims.
- Now properly shows `compiled=True` when printing the dataset after compilation.
Changed
- Changed disclaimer about review period.
Published by saattrupdan over 4 years ago
mumin - v0.1.2
Fixed
- The replies were not reduced correctly when the `small` or `medium` variants of the dataset were compiled.
- The reply features were not filtered and renamed properly, to keep them consistent with the tweet nodes.
- Users without any description are now assigned a zero vector as their description embedding.
- If a relation does not have any node pairs then do not try to create a corresponding DGL relation.
- Reset the `nodes` and `rels` attributes when loading a dataset.
- Added embeddings for `Reply` nodes.
Published by saattrupdan over 4 years ago