https://github.com/arianna-bienati/uid_np

Code to extract NPs from the RSC and calculate surprisal-based complexity measures

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
✓ .zenodo.json file (found .zenodo.json file)
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (9.4%) to scientific vocabulary

Last synced: 9 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: arianna-bienati
License: cc0-1.0
Language: Python
Default Branch: main
Size: 42 KB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created 12 months ago · Last pushed 10 months ago

Metadata Files

Readme License

uid_np

Code to extract NPs from the RSC and calculate surprisal-based complexity measures

To run the code:

1. Create a virtual environment and install numpy (it should be the only dependency).
2. In a terminal, run:

   python get_NP_data.py <your_input_folder> <your_output_folder/csv_file>

The input and output folders must already exist before running the pipeline.
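The command-line contract described above (one input folder, one output CSV path, both locations already existing) can be sketched with argparse and pathlib. This is a minimal illustration, not the repository's actual code: the names `parse_args`, `check_paths`, `input_folder`, and `output_csv` are hypothetical.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the two positional arguments (names are assumptions for this sketch)."""
    parser = argparse.ArgumentParser(
        description="Illustrative CLI: input folder and output CSV path."
    )
    parser.add_argument("input_folder", type=Path, help="existing folder with input files")
    parser.add_argument("output_csv", type=Path, help="CSV file inside an existing folder")
    return parser.parse_args(argv)


def check_paths(input_folder, output_csv):
    """Return a list of problems; an empty list means both locations exist."""
    problems = []
    if not input_folder.is_dir():
        problems.append(f"input folder not found: {input_folder}")
    if not output_csv.parent.is_dir():
        problems.append(f"output folder not found: {output_csv.parent}")
    return problems
```

A script could call `parse_args()` at start-up and exit with the collected messages if `check_paths` returns anything, so a missing folder is reported before any processing begins.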

TODO:

[x] write core function identify_NPs_in_sentence (Isa) -> see getNPdata.py for a full implementation
[x] clean up paths and use pathlib for better path handling and folder creation (Ari)
[x] use argparse instead of sys for better cli (Ari)
[ ] set up requirements / package the thing (Ari)
[x] check fluctuation cpx in light of Paolo's corrections (Ari)

Decisions

20250730 meeting:

* Model(s) specification: UIDdev ~ yearcentered + NPlength + headsyntrole + (1|author) + (1|headlemma)
* UIDdev and IFC are calculated only for NPs with at least 3 constituents. Both measures are based on the difference between the information content of preceding and following tokens, so at least two transitions are needed to compute UID_dev or IFC meaningfully.
* To keep the model parsimonious, we concentrate only on Series A journals.
* Threshold for lemma frequency: 5.
* Coordination: TBD. Cases like "Isabell and Arianna have submitted an abstract and a presentation" are excluded for now, and only the first NP gets extracted (Isabell; an abstract). Coordination remains in relative clauses such as "Isabell and Arianna who have submitted an abstract and a presentation will go to Siena" --> (Isabell who have submitted an abstract and a presentation).
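The restriction to NPs with at least three constituents follows from counting transitions: n values give n - 1 differences, so three constituents are the minimum for two transitions. The sketch below illustrates that arithmetic only. It assumes one information-content value per constituent, and `mean_abs_transition` is a hypothetical summary used for illustration, not the project's definition of UID_dev or IFC.

```python
def transitions(surprisals):
    """Differences between each value and the one preceding it (n values -> n - 1 differences)."""
    return [b - a for a, b in zip(surprisals, surprisals[1:])]


def mean_abs_transition(surprisals, min_constituents=3):
    """Hypothetical summary measure: mean absolute transition.

    Returns None for NPs with fewer than min_constituents values,
    mirroring the length restriction described above.
    """
    if len(surprisals) < min_constituents:
        return None
    steps = transitions(surprisals)
    return sum(abs(s) for s in steps) / len(steps)
```

For example, an NP with values `[2.0, 5.0, 4.0]` has the two transitions `[3.0, -1.0]`, while a two-constituent NP has only one transition and is skipped.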

Logical next steps:

- [ ] write some tests to check the functionality of the NP extractor and its adherence to the definition of NPs
- [ ] prepare slides
- [ ] run statistical analysis

Owner

Name: Arianna Bienati
Login: arianna-bienati
Kind: user
Company: Institute for Applied Linguistics, Eurac Research

Repositories: 1
Profile: https://github.com/arianna-bienati

GitHub Events

Total

Push event: 5

Last Year

Push event: 5
