source-inclusion-analysis
This repository contains code related to the study "Source Inclusion in Synthesis Writing: An NLP Approach to Understanding Argumentation, Sourcing, and Essay Quality" (Scott Crossley, Qian Wan, Laura Allen, Danielle McNamara, 2021) published in the Reading and Writing journal.
Science Score: 18.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: wanqian0202
- License: MIT
- Language: Python
- Default Branch: main
- Size: 11.7 KB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Paper Information
Title: Source Inclusion in Synthesis Writing: An NLP Approach to Understanding Argumentation, Sourcing, and Essay Quality
Authors: Scott Crossley, Qian Wan, Laura Allen, Danielle McNamara
Publication Date: November 11, 2021
Journal: Reading and Writing
Pages: 1-31
Publisher: Springer Netherlands
Link to Paper: https://files.eric.ed.gov/fulltext/ED619918.pdf
Description: This repository contains code related to the study "Source Inclusion in Synthesis Writing: An NLP Approach to Understanding Argumentation, Sourcing, and Essay Quality" published in the Reading and Writing journal. Synthesis writing, a vital skill across domains, requires writers to integrate information from source materials. This study investigates how the integration of source material influences writing quality for synthesis tasks. Approximately 900 source-based essays, scored for holistic quality, argumentation, and source use, were analyzed using hand-crafted natural language processing (NLP) features. This repository provides access to the code used in the study, facilitating further research and exploration in the field of synthesis writing and NLP analysis.
Setup Instructions
Install Required Packages:
```bash
pip install -r requirements.txt
```
Update Paths:
- Open the `main.py` file in a text editor.
- Ensure that the text files in the folder to be processed are in .txt format.
- Specify the path to the folder containing the text files to be analyzed in the `main.py` file.
- Specify the path to the source text file (currently, only one .txt file can be used as the source file) in the `main.py` file.
- Specify the path for the output CSV file where the results will be stored in the `main.py` file.
- Save the changes.
Run the Code:
Run the `main.py` file to start the analysis.
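The steps above can be sketched as a small driver. This is a hypothetical illustration, not the repository's `main.py`: the function name `run_folder`, its parameters, and the `essay` column are assumptions, and `feature_fn` stands in for whatever function computes the features (for example `citation_features` from `Citation.py`). The source text path is left out because this page does not show how `main.py` uses it.

```python
import csv
from pathlib import Path


def run_folder(essay_dir, output_csv, feature_fn):
    """Apply feature_fn to every .txt file in essay_dir and write one CSV row per file.

    Hypothetical helper for illustration; feature_fn must return a dict of features.
    Returns the number of essays processed.
    """
    rows = []
    for path in sorted(Path(essay_dir).glob("*.txt")):  # only .txt files are read
        text = path.read_text(encoding="utf-8")
        row = {"essay": path.name}
        row.update(feature_fn(text))
        rows.append(row)

    if not rows:
        return 0  # no essays found, so no CSV is written

    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Under these assumptions, a call such as `run_folder("essays", "results.csv", citation_features)` would produce one row per essay, with the feature names as column headers.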
Owner
- Name: Qian Wan
- Login: wanqian0202
- Kind: user
- Repositories: 1
- Profile: https://github.com/wanqian0202
Citation (Citation.py)
import re, string
import numpy as np
from collections import Counter
from statistics import mean, stdev
def split_into_sentences(text):
    '''
    This function tokenizes a string into sentences
    :param text: a string
    :return: sentences: a list of sentences
    '''
    alphabets = "([A-Za-z])"
    prefixes = "(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]"
    suffixes = "(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    websites = "[.](com|net|org|io|gov|me|edu)"
    text = " " + text + " "
    text = text.replace("\n"," ")
    # protect periods that do not end a sentence by replacing them with <prd>
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    # single letters followed by a period (initials); this also covers a sentence ending in "Source A. "
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    # move sentence-final punctuation outside closing quotation marks
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    if "..." in text: text = text.replace("...", "<prd><prd><prd>")
    if "......" in text: text = text.replace("......", "<prd><prd><prd><prd><prd><prd>")
    if "e.g." in text: text = text.replace("e.g.", "e<prd>g<prd>")
    if "i.e." in text: text = text.replace("i.e.", "i<prd>e<prd>")
    # mark the remaining sentence terminators, restore the protected periods, and split
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    if len(sentences) == 0:
        sentences.append(text.strip())
    # drop fragments of two characters or fewer
    sentences = [ s for s in sentences if len(s) > 2 ]
    return sentences
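The placeholder idea used by `split_into_sentences` can be shown in miniature. The sketch below is a simplified illustration rather than the function in this listing: `mini_split` and its short abbreviation list are assumptions made for the example, and it handles far fewer cases.

```python
import re


def mini_split(text):
    """Split text into sentences while protecting a few abbreviations (illustrative only)."""
    # Step 1: replace periods that do not end a sentence with a placeholder.
    text = re.sub(r"\b(Mr|Mrs|Ms|Dr|Prof)[.]", r"\1<prd>", text)
    text = text.replace("e.g.", "e<prd>g<prd>").replace("i.e.", "i<prd>e<prd>")
    # Step 2: mark every remaining sentence terminator as a split point.
    text = re.sub(r"([.?!])", r"\1<stop>", text)
    # Step 3: restore the protected periods, then split on the markers.
    text = text.replace("<prd>", ".")
    return [s.strip() for s in text.split("<stop>") if s.strip()]


print(mini_split("Dr. Lee cites the report. Is that enough? See e.g. the blog post."))
# ['Dr. Lee cites the report.', 'Is that enough?', 'See e.g. the blog post.']
```

Protecting the periods first keeps the final split a plain string operation, so no sentence-boundary logic is needed at split time.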
def identify_citation(target):
    '''
    This function identifies citations according to fixed formats (e.g., Source A, (Source A))
    or keywords from the source text
    :return: direct_citation: a list of the citation strings found in target
    '''
    # expression_1 ... expression_6 are globals set by citation_features(), which must be called first
    keywords = ["source", "text", "article", "essay", "report", "blog", "post", "book", "chapter", "editorial",
                "excerpt", "interview", "journal", "lecture", "magazine", "newspaper", "paper", "passage",
                "quote", "research", "study", "speech", "website"]
    def citation_filter_1(target, expression):
        # drop matches whose second token is the pronoun "I"
        filtered_list = [s for s in expression.findall(target) if s.split()[1] != "I"]
        # print("list1: ", filtered_list)
        return filtered_list
    def citation_filter_2(expression):
        # keep only matches whose first word is one of the keywords
        filtered_list_2 = [s for s in expression if
                           s.translate(str.maketrans('', '', string.punctuation)).split()[0].lower() in keywords]
        # print("list2: ", filtered_list_2)
        return filtered_list_2
    direct_list_1 = citation_filter_2(citation_filter_1(target, expression_1))
    direct_list_2 = citation_filter_2(citation_filter_1(target, expression_2))
    direct_list_3 = citation_filter_2(citation_filter_1(target, expression_3))
    direct_list_4 = citation_filter_2(citation_filter_1(target, expression_4))
    # author-year citations are not checked against the keyword list
    direct_list_5 = citation_filter_1(target, expression_5)
    direct_list_6 = citation_filter_1(target, expression_6)
    direct_citation = direct_list_1 + direct_list_2 + direct_list_3 + direct_list_4 + direct_list_5 + direct_list_6
    return direct_citation
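To see how the two filters interact with the patterns, the following self-contained sketch applies the six expressions from `citation_features` to a made-up sentence. The name `demo_identify`, the shortened keyword set, and the sample text are assumptions for illustration; the regular expressions are copied from this listing.

```python
import re
import string

KEYWORDS = {"source", "text", "article", "essay", "report"}  # shortened list for the demo

# Patterns whose matches must start with a keyword.
KEYWORD_PATTERNS = [
    re.compile(r"([a-zA-Z]+ [A-Z]) "),                     # Letters A-Z
    re.compile(r"\([a-zA-Z]+\s[0-9]+\)", re.IGNORECASE),   # (Letters 0-9)
    re.compile(r"([a-zA-Z]+\s[0-9] )", re.IGNORECASE),     # Letters 0-9
    re.compile(r"\([a-zA-Z]+ [A-Z]\)", re.IGNORECASE),     # (Letters A-Z)
]
# Author-year patterns, accepted without the keyword check.
AUTHOR_YEAR_PATTERNS = [
    re.compile(r"\([a-zA-Z]+ et al\., \d{4}\)", re.IGNORECASE),
    re.compile(r"\([a-zA-Z]+, \d{4}\)", re.IGNORECASE),
]
STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def demo_identify(text):
    """Return citation-like strings in text, mirroring the two filters in the listing."""
    found = []
    for pattern in KEYWORD_PATTERNS:
        for match in pattern.findall(text):
            first_word = match.translate(STRIP_PUNCT).split()[0].lower()
            if match.split()[1] != "I" and first_word in KEYWORDS:
                found.append(match)
    for pattern in AUTHOR_YEAR_PATTERNS:
        found.extend(m for m in pattern.findall(text) if m.split()[1] != "I")
    return found


sample = "Source A argues this (Source B). However B is unclear (Smith, 2020)."
print(demo_identify(sample))
# ['Source A', '(Source B)', '(Smith, 2020)']
```

The first pattern also matches "However B", but the keyword check discards it because "however" is not in the keyword list.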
def build_sent_dict(content):
    '''
    This function builds a dict for each sentence in an essay. Each dict contains:
    (1) the type of the sentence (citation/non-citation)
    (2) the position of the sentence in the paragraph and in the essay
    :return: a list of dicts for all sentences in an essay
    '''
    def tokenize(content):
        '''
        This function tokenizes the content of an essay into lists of paragraphs and sentences:
        [[sentence1, sentence2, sentence...], [sentence1, sentence2, ...]]
        :return: content_list: list of paragraphs and sentences
        '''
        content_list = list(filter(bool, content.splitlines()))
        content_list = [para.strip() for para in content_list]
        content_list = [split_into_sentences(para) for para in content_list]
        return content_list
    sent_dict_list = []
    content_list = tokenize(content)
    # a flat list of all sentences in the essay
    all_sent_list = [ sent for para in content_list for sent in para ]
    how_many_sent_in_essay = len(all_sent_list)
    how_many_para_in_essay = len(content_list)
    sent_no = 0
    # for each paragraph
    for i in range(len(content_list)):
        in_which_para = i + 1
        norm_para_position = in_which_para / how_many_para_in_essay
        # for each sentence
        for j in range(len(content_list[i])):
            sent_no += 1
            sent_dict = {}
            raw_location_in_para = j + 1
            how_many_sent_in_para = len(content_list[i])
            norm_sent_position_in_para = raw_location_in_para / how_many_sent_in_para
            raw_location_in_essay = sent_no
            norm_sent_position_in_essay = raw_location_in_essay / how_many_sent_in_essay
            # identify the type of the sentence: contains citations / does not contain citations
            citation_all = identify_citation(content_list[i][j])
            if len(citation_all) != 0:
                sent_type = 'citation'
            else:
                sent_type = 'non-citation'
            # info about the sentence (type and position)
            sent_dict['sentence'] = content_list[i][j]
            sent_dict['sent_type'] = sent_type
            sent_dict['in_which_para'] = in_which_para
            sent_dict['how_many_para'] = how_many_para_in_essay
            sent_dict['norm_para_position'] = norm_para_position
            sent_dict['raw_location_in_para'] = raw_location_in_para
            sent_dict['how_many_sent_in_para'] = how_many_sent_in_para
            sent_dict['norm_sent_position_in_para'] = norm_sent_position_in_para
            sent_dict['raw_location_in_essay'] = raw_location_in_essay
            sent_dict['how_many_sent_in_essay'] = how_many_sent_in_essay
            sent_dict['norm_sent_position_in_essay'] = norm_sent_position_in_essay
            sent_dict_list.append(sent_dict)
    return sent_dict_list
def citation_sent_position(result_dict, input):
    '''
    This function calculates the position of citation sentences in the paragraph and in the essay
    :return: None (the function adds the positional info to the feature dict)
    '''
    def average_sentence_position(citation_data_list, number_unique_para, number_para):
        '''
        Calculate the average sentence position of citations in the paragraph and in the essay
        :param citation_data_list: one row per citation sentence, holding
            [raw position in essay, normalized position in essay, raw position in paragraph, normalized position in paragraph]
        :return: None (result_dict is modified in place)
        '''
        dataframe = np.array(citation_data_list)
        # calculate the means
        result_dict['average_raw_citation_sentence_location_in_essay'] = np.mean(dataframe, axis=0)[0]
        result_dict['average_norm_citation_sentence_location_in_essay'] = np.mean(dataframe, axis=0)[1]
        result_dict['average_raw_citation_sentence_location_in_para'] = np.mean(dataframe, axis=0)[2]
        result_dict['average_norm_citation_sentence_location_in_para'] = np.mean(dataframe, axis=0)[3]
        # calculate the SD of the position info (np.std defaults to the population SD, ddof=0);
        # if an essay has fewer than 2 citations, fill the SD values with 0
        if len(citation_data_list) >= 2:
            result_dict['sd_raw_citation_sentence_location_in_essay'] = np.std(dataframe, axis=0)[0]
            result_dict['sd_norm_citation_sentence_location_in_essay'] = np.std(dataframe, axis=0)[1]
            result_dict['sd_raw_citation_sentence_location_in_para'] = np.std(dataframe, axis=0)[2]
            result_dict['sd_norm_citation_sentence_location_in_para'] = np.std(dataframe, axis=0)[3]
        else:
            result_dict['sd_norm_citation_sentence_location_in_essay'] = 0
            result_dict['sd_raw_citation_sentence_location_in_essay'] = 0
            result_dict['sd_raw_citation_sentence_location_in_para'] = 0
            result_dict['sd_norm_citation_sentence_location_in_para'] = 0
        # calculate the proportion (0 to 1) of paragraphs that contain citations
        result_dict['percentage_of_paragraphs_with_citations'] = number_unique_para / number_para
    sent_dict_list = build_sent_dict(input)
    # only extract info from the sentences that contain citations
    citation_data_list = []
    para_position_list = []
    number_para_list = []
    for sent_dict in sent_dict_list:
        if sent_dict['sent_type'] == 'citation':
            sent_position_list = []
            para_position_list.append(sent_dict['in_which_para'])
            number_para_list.append(sent_dict['how_many_para'])
            sent_position_list.append(sent_dict['raw_location_in_essay'])
            sent_position_list.append(sent_dict['norm_sent_position_in_essay'])
            sent_position_list.append(sent_dict['raw_location_in_para'])
            sent_position_list.append(sent_dict['norm_sent_position_in_para'])
            citation_data_list.append(sent_position_list)
    if len(citation_data_list) != 0:
        # calculate average citation position in essay based on sentences
        number_unique_para = len(set(para_position_list))
        number_para = number_para_list[0]
        average_sentence_position(citation_data_list, number_unique_para, number_para)
    else:
        # if there is no citation in an essay at all, fill the positions with 0
        result_dict['average_raw_citation_sentence_location_in_essay'] = 0
        result_dict['sd_raw_citation_sentence_location_in_essay'] = 0
        result_dict['average_norm_citation_sentence_location_in_essay'] = 0
        result_dict['sd_norm_citation_sentence_location_in_essay'] = 0
        result_dict['average_raw_citation_sentence_location_in_para'] = 0
        result_dict['sd_raw_citation_sentence_location_in_para'] = 0
        result_dict['average_norm_citation_sentence_location_in_para'] = 0
        result_dict['sd_norm_citation_sentence_location_in_para'] = 0
        result_dict['percentage_of_paragraphs_with_citations'] = 0
def citation_word_position(result_dict, content):
    '''
    This function calculates the word-based position of citations in the essay
    :return: None (the function adds the positional info to the feature dict)
    '''
    def replace_citations(rep, content):
        # replace the original mark of citations with a new mark that starts with "***"
        for citation, marked_citation in rep.items():
            content = content.replace(citation, marked_citation)
        # print(content)
        return content
    # get the unique citations in the essay
    direct_citation = identify_citation(content)
    direct_citation_set = set(direct_citation)
    direct_citation = list(direct_citation_set)
    # dict that shows how each citation mark should be replaced
    rep = {citation: "***" + citation for citation in direct_citation}
    # replace the mark of citations in the essay
    if len(direct_citation) != 0:
        text = replace_citations(rep, content)
    else:
        text = content
    # remove parentheses in the essay
    text = text.replace('(', '').replace(')', '')
    # print(text)
    tokens = text.split()
    # get the position index (0-based) of words that start with "***", which should be citations
    index = [i for i in range(len(tokens)) if tokens[i].startswith('***')]
    # calculate mean and SD (statistics.stdev is the sample SD) of word-based position of citations in essay
    if len(index) > 0:
        result_dict["average_citation_word_location_in_essay"] = mean(index)
        if len(index) >= 2:
            result_dict["sd_citation_word_location_in_essay"] = stdev(index)
        else:
            result_dict["sd_citation_word_location_in_essay"] = 0
    else:
        result_dict["average_citation_word_location_in_essay"] = 0
        result_dict["sd_citation_word_location_in_essay"] = 0
def citation_char_position(result_dict, content):
    '''
    This function calculates the character-based position of citations and adds the result to the feature dict
    :return: None (the function adds the positional info to the feature dict)
    '''
    # get all character-based position indices of citations in an essay
    all_citations = identify_citation(content)
    all_char_position = []
    for s in set(all_citations):
        # note: s is used as a regular expression here, so parentheses in a citation such as
        # "(Source A)" act as a capture group rather than as literal characters
        char_position = [m.start() for m in re.finditer(s, content)]
        for element in char_position:
            all_char_position.append(element)
    all_char_position = list(set(all_char_position))
    # calculate the mean and SD (sample SD) of the character-based position of citations in the essay
    if len(all_char_position) > 0:
        result_dict["average_citation_character_location_in_essay"] = mean(all_char_position)
        if len(all_char_position) >= 2:
            result_dict["sd_citation_character_location_in_essay"] = stdev(all_char_position)
        else:
            result_dict["sd_citation_character_location_in_essay"] = 0
    else:
        result_dict["average_citation_character_location_in_essay"] = 0
        result_dict["sd_citation_character_location_in_essay"] = 0
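The positional features in this listing use two different standard deviation conventions: the sentence-based features call `np.std`, whose default (`ddof=0`) is the population SD, while the word- and character-based features call `statistics.stdev`, which is the sample SD. The sketch below shows the size of the difference on made-up positions; it uses `statistics.pstdev` as a standard-library stand-in for `np.std` with its default settings.

```python
from statistics import pstdev, stdev

positions = [2, 4, 4, 6]  # made-up citation positions, for illustration only

population_sd = pstdev(positions)  # divides by n, like np.std with its default ddof=0
sample_sd = stdev(positions)       # divides by n - 1, like np.std(..., ddof=1)

print(round(population_sd, 4), round(sample_sd, 4))
```

For these four values the population SD is about 1.414 and the sample SD about 1.633. The gap shrinks as the number of citations grows, but for essays with only a few citations the two families of SD features are not directly comparable.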
def source_citation_coverage(result_dict, content):
    '''
    This function calculates the share of the most cited source, the number of unique sources cited,
    and the frequency of citations
    :return: None (the function adds the frequency info to the feature dict)
    '''
    # get all citations in the essay
    direct_citation = identify_citation(content)
    number_citation = len(direct_citation)
    # number of citations in the essay
    result_dict["count_of_citations"] = number_citation
    # calculate the frequency of citations in the essay (citations per word)
    word_count = len(content.split())
    if word_count != 0:
        frequency_citation = number_citation / word_count
        result_dict["frequency_of_citations"] = frequency_citation
    else:
        result_dict["frequency_of_citations"] = 0
    # calculate the proportion of citations that refer to the most commonly cited source text,
    # and the number of different source texts cited in the essay
    if number_citation != 0:
        # lowercase and strip punctuation so that "(Source A)" and "Source A" count as the same source
        # note: matches from expression_3 keep their trailing space, so "Source 1 " and "(Source 1)"
        # are counted as different sources
        cleaned_citation_list = [re.sub(r'[^\w\s]', '', citation.lower()) for citation in direct_citation]
        # print(cleaned_citation_list)
        citation_freq_dict = Counter(cleaned_citation_list)
        most_common_source = citation_freq_dict.most_common(1)
        most_common_percent = most_common_source[0][1] / number_citation
        result_dict["percent_most_common_cited_source"] = most_common_percent
        citation_set = set(cleaned_citation_list)
        type_citation = len(citation_set)
        result_dict["number_of_unique_citations"] = type_citation
    else:
        result_dict["percent_most_common_cited_source"] = 0
        result_dict["number_of_unique_citations"] = 0
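The coverage calculation can be traced on a small made-up list of citation strings. This sketch repeats the normalization and counting steps from `source_citation_coverage` outside the function; the sample list and the variable names `share` and `unique` are assumptions for illustration.

```python
import re
from collections import Counter

citations = ["(Source A)", "Source A", "(Source B)", "source a"]  # made-up matches

# Lowercase and drop punctuation so that formatting variants collapse to one key.
cleaned = [re.sub(r"[^\w\s]", "", c.lower()) for c in citations]
counts = Counter(cleaned)

top_source, top_count = counts.most_common(1)[0]
share = top_count / len(citations)  # proportion of citations that go to the top source
unique = len(counts)                # number of distinct sources cited

print(top_source, share, unique)
# source a 0.75 2
```

Here three of the four citations refer to the same source, so the most-cited share is 0.75 and two unique sources are counted.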
def citation_features(input):
    # the patterns are declared global because identify_citation() reads them by name
    global expression_1, expression_2, expression_3, expression_4, expression_5, expression_6
    expression_1 = re.compile(r"([a-zA-Z]+ [A-Z]) ")  # Letters A-Z
    expression_2 = re.compile(r"\([a-zA-Z]+\s[0-9]+\)", re.IGNORECASE)  # (Letters 0-9)
    expression_3 = re.compile(r"([a-zA-Z]+\s[0-9] )", re.IGNORECASE)  # Letters 0-9
    expression_4 = re.compile(r"\([a-zA-Z]+ [A-Z]\)", re.IGNORECASE)  # (Letters A-Z)
    expression_5 = re.compile(r"\([a-zA-Z]+ et al\., \d{4}\)", re.IGNORECASE)  # (Letters et al., 4-digit number)
    expression_6 = re.compile(r"\([a-zA-Z]+, \d{4}\)", re.IGNORECASE)  # (Letters, 4-digit number)
    result_dict = {}
    citation_sent_position(result_dict, input)
    citation_word_position(result_dict, input)
    citation_char_position(result_dict, input)
    source_citation_coverage(result_dict, input)
    # print("Citation result: ", result_dict)
    return result_dict
Dependencies
- Flask ==2.3.3
- Jinja2 ==3.1.2
- MarkupSafe ==2.1.1
- Pillow ==9.2.0
- Werkzeug ==2.3.7
- amqp ==5.1.1
- annotated-types ==0.5.0
- async-timeout ==4.0.3
- billiard ==4.1.0
- blinker ==1.6.2
- blis ==0.7.9
- catalogue ==2.0.8
- celery ==5.3.4
- certifi ==2022.9.24
- cffi ==1.15.1
- charset-normalizer ==2.1.1
- click ==8.1.3
- click-didyoumean ==0.3.0
- click-plugins ==1.1.1
- click-repl ==0.3.0
- confection ==0.0.3
- contourpy ==1.1.1
- cycler ==0.11.0
- cymem ==2.0.7
- distlib ==0.3.7
- filelock ==3.12.4
- fonttools ==4.42.1
- gensim ==4.2.0
- gevent ==23.9.1
- greenlet ==3.0.0
- idna ==3.4
- importlib-metadata ==6.8.0
- itsdangerous ==2.1.2
- joblib ==1.2.0
- kiwisolver ==1.4.5
- kombu ==5.3.2
- langcodes ==3.3.0
- matplotlib ==3.8.0
- murmurhash ==1.0.9
- numpy ==1.23.4
- packaging ==21.3
- pandas ==1.5.1
- pathy ==0.6.2
- platformdirs ==3.10.0
- preshed ==3.0.8
- prompt-toolkit ==3.0.39
- pycparser ==2.21
- pydantic ==1.10.2
- pydantic_core ==0.42.0
- pyparsing ==3.0.9
- python-dateutil ==2.8.2
- pytz ==2022.5
- redis ==4.6.0
- requests ==2.28.1
- rpy2 ==3.5.1
- scikit-learn ==1.1.2
- scipy ==1.9.3
- seaborn ==0.12.2
- six ==1.16.0
- smart-open ==5.2.1
- spacy ==3.4.2
- spacy-legacy ==3.0.10
- spacy-loggers ==1.0.3
- spylls ==0.1.7
- srsly ==2.4.5
- syllapy ==0.7.2
- thinc ==8.1.5
- threadpoolctl ==3.1.0
- tqdm ==4.64.1
- ttkthemes ==3.2.2
- typer ==0.4.2
- typing_extensions ==4.4.0
- tzdata ==2023.3
- tzlocal ==5.0.1
- urllib3 ==1.26.12
- vaderSentiment ==3.3.2
- vine ==5.0.0
- virtualenv ==20.24.5
- wasabi ==0.10.1
- wcwidth ==0.2.6
- zipp ==3.16.2
- zope.event ==5.0
- zope.interface ==6.0