https://github.com/amazon-science/sumren

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.1%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: amazon-science
  • License: other
  • Language: Python
  • Default Branch: main
  • Size: 435 KB
Statistics
  • Stars: 7
  • Watchers: 1
  • Forks: 1
  • Open Issues: 3
  • Releases: 0
Created over 3 years ago · Last pushed over 1 year ago
Metadata Files
Readme Contributing License Code of conduct

README.md

This repository contains the code used for creating the dataset in the paper:

SumREN: Summarizing Reported Speech about Events in News Revanth Gangi Reddy, Heba Elfardy, Hou Pong Chan, Kevin Small, and Heng Ji. AAAI 2023.

Installation

Please follow the steps below for installation:

conda create --name sumren python=3.8.15
conda activate sumren
pip install -r requirements.txt

Getting the data

1. Training Data

The gold training data was scraped from the Wayback Machine. You can create the training data by running the following script from the parent directory of the data folder.

python expand_train.py

This generates expanded_train.json, which contains the gold training data with news article text and gold summaries.
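The layout of expanded_train.json is not described here, so a quick sanity check that assumes no field names can be sketched as below. The helper name summarize_json is hypothetical and is not part of the repository; it only reports the top-level JSON type and how many entries it holds.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return (top-level type name, number of top-level entries) for a JSON file."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    # Only lists and dicts have a meaningful top-level size.
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size


if __name__ == "__main__":
    # Assumption: the file is written to the current working directory.
    target = Path("expanded_train.json")
    if target.exists():
        kind, size = summarize_json(target)
        print(f"{target}: top-level {kind} with {size} entries")
    else:
        print(f"{target} not found; run expand_train.py first")
```

An unexpectedly small entry count is a hint that some articles could not be retrieved.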

2. Evaluation Data

The evaluation data, which comprises the dev and test sets, contains articles from 2017 - 2021 obtained from CC-News. Getting the news corpus for the dev and test sets involves first downloading the CC-News dump for these years and then extracting news articles for the URLs in the eval data.

Downloading CC-News

Note that the CC-News corpus requires considerable storage space (up to 25 TB), so we suggest running the scripts below on a cloud provider.

We also recommend downloading each year's data into a separate directory/volume since it might not be possible to create a single storage volume with size up to 25 TB.
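Before starting a multi-terabyte download, it can help to confirm that each per-year volume has room. The following is a minimal sketch using only the standard library. The function names are hypothetical, and required_tb is a threshold you choose yourself, since only the overall upper bound of 25 TB is stated above, not a per-year figure.

```python
import shutil

TB = 10 ** 12  # decimal terabyte, in bytes


def free_terabytes(path):
    """Free space, in TB, on the volume that holds `path` (which must exist)."""
    return shutil.disk_usage(path).free / TB


def volumes_short_on_space(year_to_dir, required_tb):
    """Return the years whose output volume has less than `required_tb` free."""
    return [
        year
        for year, directory in sorted(year_to_dir.items())
        if free_terabytes(directory) < required_tb
    ]
```

For example, volumes_short_on_space({2017: "/mnt/cc2017", 2018: "/mnt/cc2018"}, required_tb=5) lists the years that need a larger volume; both the mount points and the value 5 are placeholders.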

Installing and configuring AWS CLI

Before starting, you will need to install the AWS CLI to be able to download CC-News from S3. To do so, please follow the instructions here: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

Using the S3 bucket to download CC-News requires your AWS CLI to be authenticated. Please follow: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-config

Downloading CC-News from S3

Run the script download_cc.sh with the corresponding output directory for each year:

bash download_cc.sh 2017 <output_dir_for_2017>
bash download_cc.sh 2018 <output_dir_for_2018>
bash download_cc.sh 2019 <output_dir_for_2019>
bash download_cc.sh 2020 <output_dir_for_2020>
bash download_cc.sh 2021 <output_dir_for_2021>
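If you prefer to drive the five downloads from one place, the per-year invocations can be generated programmatically. This is a sketch, not part of the repository: build_download_commands and run_all are hypothetical names, and the only thing taken from the instructions above is the argument order of download_cc.sh (year, then output directory).

```python
import subprocess

YEARS = (2017, 2018, 2019, 2020, 2021)


def build_download_commands(year_to_dir):
    """Build one `bash download_cc.sh <year> <output_dir>` command per year."""
    missing = [y for y in YEARS if y not in year_to_dir]
    if missing:
        raise ValueError(f"no output directory given for: {missing}")
    return [["bash", "download_cc.sh", str(y), str(year_to_dir[y])] for y in YEARS]


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in commands:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)
```

Calling run_all(build_download_commands({...})) from the directory that contains download_cc.sh runs the downloads sequentially; passing argument lists rather than a shell string avoids quoting problems with directory names that contain spaces.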

Mapping URLs in SumREN evaluation to CC-News

To extract the news articles from CC-News that correspond to the URLs in SumREN, run the script below for each year:

(Note: Please make sure to run the below scripts in the parent directory of the data folder.)

bash map_cc.sh <dir_for_cc_download_2017> <out_dir>
bash map_cc.sh <dir_for_cc_download_2018> <out_dir>
bash map_cc.sh <dir_for_cc_download_2019> <out_dir>
bash map_cc.sh <dir_for_cc_download_2020> <out_dir>
bash map_cc.sh <dir_for_cc_download_2021> <out_dir>

python expand_eval.py

This script generates expanded_dev.json and expanded_test.json which comprise the dev and test sets respectively with the news article text.

If expand_eval.py reports that some files are missing for a particular year, the CC-News download for that year is incomplete. To resolve this, re-run download_cc.sh and map_cc.sh for the affected year.
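The recovery step above can be sketched as a small helper that first checks whether the two output files exist and then lists the commands to repeat for one year. The function names are hypothetical, and looking for the outputs in the current directory is an assumption; the command arguments follow the forms shown earlier in this README.

```python
from pathlib import Path

EXPECTED_OUTPUTS = ("expanded_dev.json", "expanded_test.json")


def missing_outputs(directory="."):
    """Names of expected output files that are absent or empty in `directory`."""
    base = Path(directory)
    return [
        name
        for name in EXPECTED_OUTPUTS
        if not (base / name).is_file() or (base / name).stat().st_size == 0
    ]


def rerun_commands(year, cc_dir, out_dir):
    """Commands to repeat for one year, in order: download, then map."""
    return [
        ["bash", "download_cc.sh", str(year), str(cc_dir)],
        ["bash", "map_cc.sh", str(cc_dir), str(out_dir)],
    ]
```

Each returned command is an argument list that can be passed to subprocess.run(cmd, check=True) from the parent directory of the data folder.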

License

The code is licensed under the license here and the data is licensed under the license here.

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

GitHub Events

Total
  • Delete event: 1
  • Issue comment event: 1
  • Pull request event: 3
  • Create event: 2
Last Year
  • Delete event: 1
  • Issue comment event: 1
  • Pull request event: 3
  • Create event: 2

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 0
  • Total pull requests: 29
  • Average time to close issues: N/A
  • Average time to close pull requests: 30 days
  • Total issue authors: 0
  • Total pull request authors: 3
  • Average comments per issue: 0
  • Average comments per pull request: 0.1
  • Merged pull requests: 24
  • Bot issues: 0
  • Bot pull requests: 27
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 2 months
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.5
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 2
Top Authors
Issue Authors
Pull Request Authors
  • dependabot[bot] (42)
  • helfardy (1)
  • AndreSlavescu (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (41)

Dependencies

requirements.txt pypi
  • GitPython ==3.1.41
  • Jinja2 ==3.1.3
  • MarkupSafe ==2.1.2
  • Pillow ==10.0.1
  • PyYAML ==6.0
  • Pygments ==2.15.0
  • allennlp ==2.10.1
  • attrs ==22.2.0
  • base58 ==2.1.1
  • beautifulsoup4 ==4.11.1
  • blis ==0.7.9
  • boto3 ==1.26.71
  • botocore ==1.29.71
  • cached-path ==1.1.6
  • cachetools ==5.3.0
  • catalogue ==2.0.8
  • charset-normalizer ==2.1.1
  • click ==8.1.3
  • commonmark ==0.9.1
  • cssselect ==1.2.0
  • cymem ==2.0.7
  • dill ==0.3.6
  • docker-pycreds ==0.4.0
  • exceptiongroup ==1.1.0
  • fairscale ==0.4.6
  • feedfinder2 ==0.0.4
  • feedparser ==6.0.10
  • filelock ==3.7.1
  • gitdb ==4.0.10
  • google-api-core ==2.11.0
  • google-auth ==2.16.0
  • google-cloud-core ==2.3.2
  • google-cloud-storage ==2.7.0
  • google-crc32c ==1.5.0
  • google-resumable-media ==2.4.1
  • googleapis-common-protos ==1.58.0
  • h5py ==3.8.0
  • huggingface-hub ==0.10.1
  • idna ==3.4
  • iniconfig ==2.0.0
  • install ==1.3.5
  • jieba3k ==0.35.1
  • jmespath ==1.0.1
  • joblib ==1.2.0
  • jsonnet ==0.19.1
  • langcodes ==3.3.0
  • lmdb ==1.4.0
  • lxml ==4.9.2
  • more-itertools ==9.0.0
  • murmurhash ==1.0.9
  • newspaper3k ==0.2.8
  • nltk ==3.8
  • numpy ==1.24.2
  • packaging ==23.0
  • pathtools ==0.1.2
  • pathy ==0.10.1
  • pluggy ==1.0.0
  • preshed ==3.0.8
  • promise ==2.3
  • protobuf ==3.20.3
  • psutil ==5.9.4
  • pyasn1 ==0.4.8
  • pyasn1-modules ==0.2.8
  • pydantic ==1.8.2
  • pytest ==7.2.1
  • python-dateutil ==2.8.2
  • regex ==2022.10.31
  • requests ==2.31.0
  • requests-file ==1.5.1
  • rich ==12.6.0
  • rsa ==4.9
  • s3transfer ==0.6.0
  • sacremoses ==0.0.53
  • scikit-learn ==1.2.1
  • scipy ==1.10.0
  • sentencepiece ==0.1.97
  • sentry-sdk ==1.15.0
  • setproctitle ==1.3.2
  • sgmllib3k ==1.0.0
  • shortuuid ==1.0.11
  • six ==1.16.0
  • smart-open ==6.3.0
  • smmap ==5.0.0
  • soupsieve ==2.3.2.post1
  • spacy ==3.3.2
  • spacy-legacy ==3.0.12
  • spacy-loggers ==1.0.4
  • srsly ==2.4.5
  • tensorboardX ==2.6
  • termcolor ==1.1.0
  • thinc ==8.0.17
  • threadpoolctl ==3.1.0
  • tinysegmenter ==0.3
  • tldextract ==3.4.0
  • tokenizers ==0.12.1
  • tomli ==2.0.1
  • torch ==1.13.1
  • torchvision ==0.13.1
  • tqdm ==4.64.1
  • traitlets ==5.9.0
  • transformers ==4.36.0
  • typer ==0.4.2
  • typing_extensions ==4.5.0
  • urllib3 ==1.26.18
  • wandb ==0.12.21
  • warcio ==1.7.4
  • wasabi ==0.10.1