https://github.com/arianna-bienati/uid_np
Code to extract NPs from the RSC and calculate surprisal-based complexity measures
Science Score: 26.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: arianna-bienati
- License: cc0-1.0
- Language: Python
- Default Branch: main
- Size: 42 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
uid_np
Code to extract NPs from the RSC and calculate surprisal-based complexity measures
To run the code:
- Create a virtual environment and install numpy (it should be the only dependency).
- In a terminal, run:
```bash
python get_NP_data.py <your_input_folder> <your_output_folder/csv_file>
```
The input and output folders must already exist before running the pipeline.
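The command line above could be parsed and checked along the following lines. This is only a minimal sketch, not the code in `get_NP_data.py`: the names `build_parser`, `check_paths`, `input_folder` and `output_csv` are made up for illustration, and the only behaviour assumed is what is stated above (two positional arguments, both folders already present).

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Two positional arguments, mirroring the command shown above (names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Extract NPs and write the measures to a CSV file."
    )
    parser.add_argument("input_folder", type=Path, help="existing folder with the input files")
    parser.add_argument("output_csv", type=Path, help="CSV file inside an existing output folder")
    return parser


def check_paths(input_folder: Path, output_csv: Path) -> None:
    """Fail early if either folder is missing, instead of failing halfway through a run."""
    if not input_folder.is_dir():
        raise FileNotFoundError(f"input folder not found: {input_folder}")
    if not output_csv.parent.is_dir():
        raise FileNotFoundError(f"output folder not found: {output_csv.parent}")
```

Calling `build_parser().parse_args()` and then `check_paths(args.input_folder, args.output_csv)` at the start of a script would report a missing folder before any file is processed.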
TODO:
- [x] write core function identify_NPs_in_sentence (Isa) -> see getNPdata.py for a full implementation
- [x] clean up paths and use pathlib for better path handling and folder creation (Ari)
- [x] use argparse instead of sys for better cli (Ari)
- [ ] set up requirements / package the thing (Ari)
- [x] check fluctuation cpx in light of Paolo's corrections (Ari)
Decisions
20250730 meeting:
* Model(s) specification: UID_dev ~ yearcentered + NPlength + headsyntrole + (1|author) + (1|headlemma)
* UID_dev and IFC are calculated only for NPs with at least 3 constituents. Both measures are based on the difference between the information content of preceding and following tokens, so at least two transitions are needed to compute them meaningfully.
* To keep the model parsimonious, we concentrate only on Series A journals.
* Threshold for lemma frequency: 5.
* Coordination: TBD. For now, cases like "Isabell and Arianna have submitted an abstract and a presentation" are excluded as coordinated NPs, and only the first NP of each coordination is extracted (Isabell; an abstract). Coordination is kept inside relative clauses, as in "Isabell and Arianna who have submitted an abstract and a presentation will go to Siena" --> (Isabell who have submitted an abstract and a presentation).
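The minimum-length decision can be illustrated with a small sketch. It only shows why three units are needed: n information-content values give n - 1 transitions, so three values are the smallest input with two transitions. The function `mean_abs_transition` is a stand-in invented for this example and is not the repository's definition of UID_dev or IFC; `MIN_UNITS` and `transitions` are hypothetical names as well.

```python
from typing import List, Optional

MIN_UNITS = 3  # threshold taken from the 20250730 meeting notes


def transitions(info: List[float]) -> List[float]:
    """Difference between each unit's information content and the preceding one."""
    return [b - a for a, b in zip(info, info[1:])]


def mean_abs_transition(info: List[float]) -> Optional[float]:
    """Illustrative stand-in measure; NOT the repository's UID_dev or IFC."""
    if len(info) < MIN_UNITS:
        # Fewer than two transitions: skip the NP rather than report a value.
        return None
    steps = transitions(info)
    return sum(abs(s) for s in steps) / len(steps)
```

With this sketch, an NP with two units yields a single transition and is skipped, while an NP with three units yields two transitions and gets a value.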
Logical next steps:
- [ ] write some tests to be sure about the functionality of the NPs extractor and its adherence to the definition of NPs
- [ ] prepare slides
- [ ] run statistical analysis
Owner
- Name: Arianna Bienati
- Login: arianna-bienati
- Kind: user
- Company: Institute for Applied Linguistics, Eurac Research
- Repositories: 1
- Profile: https://github.com/arianna-bienati
GitHub Events
Total
- Push event: 5
Last Year
- Push event: 5