xml-parser
A tool for parsing XML-encoded texts to obtain textual metadata
https://github.com/australian-text-analytics-platform/xml-parser
Science Score: 65.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ✓ Institutional organization owner: Organization australian-text-analytics-platform has institutional domain (atap.edu.au)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Australian-Text-Analytics-Platform
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://binderhub.atap-binder.cloud.edu.au/v2/gh/Australian-Text-Analytics-Platform/xml-parser/main?labpath=parser.ipynb
- Size: 795 KB
Statistics
- Stars: 0
- Watchers: 4
- Forks: 0
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
XML Parser
The XML Parser is a tool for parsing XML-encoded texts to obtain the content and speaker/author of utterances within the text.
Inputs
The XML corpus can be provided either as a collection of text files or as an Excel/CSV table. The loader currently supports the following file types: txt, odt, docx, csv, tsv, xlsx, ods, xml.
Following Hardie (2014), each utterance must be enclosed in a 'u' tag that includes a 'who' attribute, and 'who' must be the first attribute in the tag. Utterances that do not follow this format are excluded. The following utterances are valid:
```xml
<u who="PETER">Hello, world.</u>
<u who="WORLD">Hello, Peter.</u>
<u who="PETER">Wow!</u>
```
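To make the format rule concrete, here is a minimal sketch of how such utterances could be matched. It is not the XML Parser's own implementation: the pattern, the `UTTERANCE_RE` name, and the `extract_utterances` helper are assumptions made for illustration, and the sketch uses the standard-library `re` module.

```python
import re

# Hypothetical pattern: a 'u' element whose FIRST attribute is 'who'.
# Group 1 captures the speaker, group 2 the raw utterance content.
UTTERANCE_RE = re.compile(r'<u\s+who="([^"]*)"[^>]*>(.*?)</u>', re.DOTALL)


def extract_utterances(text: str) -> list[tuple[str, str]]:
    """Return (speaker, content) pairs for every utterance matching the pattern."""
    return [
        (match.group(1), match.group(2).strip())
        for match in UTTERANCE_RE.finditer(text)
    ]


sample = """
<u who="PETER">Hello, world.</u>
<u who="WORLD">Hello, Peter.</u>
<u n="3" who="PETER">Ignored: 'who' is not the first attribute.</u>
<u>Ignored: no speaker.</u>
"""

pairs = extract_utterances(sample)
for speaker, content in pairs:
    print(f"{speaker}: {content}")
```

In this sketch, the last two utterances in `sample` are ignored because they do not begin with a `who` attribute, which mirrors the rule described above.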
Output
Once a corpus has been parsed, it can be exported in one of three formats: csv, xlsx, or zip. The zip format provides a zip file containing each utterance as a txt file, with a metadata.csv containing the corpus metadata. The csv and xlsx formats are structured as a table where each row represents an utterance. The table for the above input example would look as follows:
| document_     | speaker |
|---------------|---------|
| Hello, world. | PETER   |
| Hello, Peter. | WORLD   |
| Wow!          | PETER   |
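The export layouts described above can be approximated with the standard library. This is only a sketch under stated assumptions: the `to_csv` and `to_zip` helpers, the `utterance_<n>.txt` file names, and the `filename` column in `metadata.csv` are hypothetical and may differ from what the tool actually writes.

```python
import csv
import io
import zipfile

# The parsed example corpus: one row per utterance.
rows = [
    {"document_": "Hello, world.", "speaker": "PETER"},
    {"document_": "Hello, Peter.", "speaker": "WORLD"},
    {"document_": "Wow!", "speaker": "PETER"},
]


def to_csv(rows: list[dict]) -> str:
    """Serialise the rows as a CSV table with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["document_", "speaker"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_zip(rows: list[dict]) -> bytes:
    """Write each utterance to its own txt file plus a metadata.csv (assumed layout)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        metadata = io.StringIO()
        writer = csv.writer(metadata)
        writer.writerow(["filename", "speaker"])  # hypothetical column names
        for index, row in enumerate(rows):
            name = f"utterance_{index}.txt"  # hypothetical naming scheme
            archive.writestr(name, row["document_"])
            writer.writerow([name, row["speaker"]])
        archive.writestr("metadata.csv", metadata.getvalue())
    return buffer.getvalue()


csv_text = to_csv(rows)
zip_bytes = to_zip(rows)
```

Note that `csv` quotes the fields containing commas (such as `Hello, world.`), so the table can be read back without losing its column structure.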
Instructions
- Upload your document files to the 'corpus_data' directory.
- Run the cell below and use the Corpus Loader to build a corpus from your selected documents.
- Once the corpus is built, navigate to the 'XML Parser' tab. Here, select your corpus in the dropdown and click 'Parse XML'.
- When parsing is complete, navigate to the 'Corpus Overview' tab to export the parsed corpus.
See the user guide for detailed instructions and hover over the tooltips in the loader for simplified instructions on how to load and build the corpus.
Notes
- The XML Parser keeps all existing corpus metadata and adds a metadata column called 'speaker'. If a metadata column called 'speaker' already exists, it will be overwritten.
- When parsing utterances, the XML Parser will skip any utterance that does not have a speaker.
- When parsing utterances, the XML Parser will remove any XML tags within the contents of the utterance.
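Two of the behaviours noted above, tag removal and overwriting of the 'speaker' metadata, can be illustrated as follows. This is a simplified sketch rather than the tool's code: `TAG_RE`, `strip_tags`, and `with_speaker` are hypothetical names, and the tag pattern is a naive assumption that does not handle comments or CDATA sections.

```python
import re

# Naive pattern for any XML tag, e.g. <emph rend="bold"> or </emph>.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(content: str) -> str:
    """Remove XML tags nested inside an utterance, keeping their text."""
    return TAG_RE.sub("", content)


def with_speaker(metadata: dict, speaker: str) -> dict:
    """Keep the existing metadata, replacing any existing 'speaker' value."""
    return {**metadata, "speaker": speaker}


content = strip_tags('Hello, <emph rend="bold">world</emph>.')
merged = with_speaker({"title": "Greeting", "speaker": "UNKNOWN"}, "PETER")

print(content)
print(merged)
```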
Demo
A demo is deployed on BinderHub: https://binderhub.atap-binder.cloud.edu.au/v2/gh/Australian-Text-Analytics-Platform/xml-parser/main?labpath=parser.ipynb
Authors
- Hamish Croser - h-croser
License
This project is licensed under the MIT License - see the LICENSE file for details.
Owner
- Name: Australian-Text-Analytics-Platform
- Login: Australian-Text-Analytics-Platform
- Kind: organization
- Website: https://atap.edu.au
- Repositories: 9
- Profile: https://github.com/Australian-Text-Analytics-Platform
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: XML Parser
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Hamish
    family-names: Croser
    email: hamish.croser@sydney.edu.au
    affiliation: Sydney Informatics Hub
repository-code: >-
  https://github.com/Australian-Text-Analytics-Platform/xml-parser
abstract: >-
  The XML Parser is a tool for parsing XML-encoded texts to
  obtain the content and speaker/author of utterances within
  the text.
keywords:
  - xml
  - corpus linguistics
license: MIT
version: 1.0.0
date-released: '2024-12-16'
GitHub Events
Total
- Release event: 2
- Push event: 6
- Create event: 5
Last Year
- Release event: 2
- Push event: 6
- Create event: 5
Dependencies
- 164 dependencies
- ipywidgets ~=8.1.0 develop
- jupyterlab ~=4.0.0 develop
- atap-corpus-loader ~=1.7.6
- panel ~=1.4
- python >=3.10,<3.12
- regex ~=2024.9.11
- annotated-types ==0.7.0
- asttokens ==2.4.1
- atap-corpus ==0.1.15
- atap-corpus-loader ==1.7.6
- bleach ==6.1.0
- blis ==1.0.1
- bokeh ==3.4.3
- bottleneck ==1.4.2
- catalogue ==2.0.10
- certifi ==2024.8.30
- chardet ==5.2.0
- charset-normalizer ==3.4.0
- click ==8.1.7
- cloudpathlib ==0.20.0
- colorama ==0.4.6
- colorlog ==6.8.2
- comm ==0.2.2
- confection ==0.1.5
- contourpy ==1.3.0
- coolname ==2.2.0
- cymem ==2.0.8
- decorator ==5.1.1
- defusedxml ==0.7.1
- et-xmlfile ==2.0.0
- exceptiongroup ==1.2.2
- executing ==2.1.0
- idna ==3.10
- ipython ==8.29.0
- ipywidgets ==8.1.5
- jedi ==0.19.1
- jinja2 ==3.1.4
- joblib ==1.4.2
- jupyterlab-widgets ==3.0.13
- langcodes ==3.4.1
- language-data ==1.2.0
- linkify-it-py ==2.0.3
- llvmlite ==0.43.0
- lxml ==5.3.0
- marisa-trie ==1.2.1
- markdown ==3.7
- markdown-it-py ==3.0.0
- markupsafe ==3.0.2
- matplotlib-inline ==0.1.7
- mdit-py-plugins ==0.4.2
- mdurl ==0.1.2
- murmurhash ==1.0.10
- numba ==0.60.0
- numexpr ==2.10.1
- numpy ==2.0.2
- odfpy ==1.4.1
- openpyxl ==3.1.5
- packaging ==24.1
- pandas ==2.2.3
- panel ==1.4.5
- param ==2.1.1
- parso ==0.8.4
- pexpect ==4.9.0
- pillow ==11.0.0
- preshed ==3.0.9
- prompt-toolkit ==3.0.48
- ptyprocess ==0.7.0
- pure-eval ==0.2.3
- pyarrow ==17.0.0
- pydantic ==2.9.2
- pydantic-core ==2.23.4
- pygments ==2.18.0
- python-calamine ==0.2.3
- python-dateutil ==2.9.0.post0
- python-docx ==1.1.2
- pytz ==2024.2
- pyviz-comms ==3.0.3
- pyxlsb ==1.0.10
- pyyaml ==6.0.2
- regex ==2024.9.11
- requests ==2.32.3
- rich ==13.9.3
- scikit-learn ==1.5.2
- scipy ==1.14.1
- setuptools ==75.2.0
- shellingham ==1.5.4
- six ==1.16.0
- smart-open ==7.0.5
- spacy ==3.8.2
- spacy-legacy ==3.0.12
- spacy-loggers ==1.0.5
- srsly ==2.4.8
- stack-data ==0.6.3
- thinc ==8.3.2
- threadpoolctl ==3.5.0
- tornado ==6.4.1
- tqdm ==4.66.5
- traitlets ==5.14.3
- typer ==0.12.5
- typing-extensions ==4.12.2
- tzdata ==2024.2
- uc-micro-py ==1.0.3
- urllib3 ==2.2.3
- wasabi ==1.1.3
- wcwidth ==0.2.13
- weasel ==0.4.1
- webencodings ==0.5.1
- widgetsnbextension ==4.0.13
- wrapt ==1.16.0
- xlrd ==2.0.1
- xlsxwriter ==3.2.0
- xyzservices ==2024.9.0