Repository
Basic Info
- Host: GitHub
- Owner: cshnican
- License: mit
- Language: R
- Default Branch: main
- Size: 94.5 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 4
Metadata Files
README.md
Cross-cultural structures of personal name systems reflect general communicative principles
Michael Ramscar, Sihan Chen, Richard Futrell, Kyle Mahowald

This GitHub repository contains the data, the analysis scripts, and the figures from the paper.
A description of the repository content
Note: as mentioned in the paper, we call the element in a personal name that goes first the prefix-name and the element that goes second the byname. For example, in the name John Smith, John is the prefix-name and Smith is the byname. In the name Li Xiaoping, Li is the prefix-name and Xiaoping is the byname.
Scripts
To reproduce all the main figures, download finnish_data_selected.csv from here and put the file inside the Data folder. Run prediction1.R, top3_english.R, prediction2.R, and prediction3.R first, and then run plot_merged_figs.R.
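The run order above could be wrapped in a small helper. This is only a sketch: `run_all_figures`, `script_dir`, `data_dir`, and `runner` are hypothetical names not found in the repository, and the scripts themselves may expect a particular working directory.

```r
# Hypothetical helper: run the analysis scripts in the order the README gives.
# `script_dir` and `data_dir` are assumptions about the local folder layout.
run_all_figures <- function(script_dir = ".", data_dir = "Data", runner = source) {
  finnish_file <- file.path(data_dir, "finnish_data_selected.csv")
  if (!file.exists(finnish_file)) {
    stop("Missing ", finnish_file,
         ": download it and place it in the Data folder first.")
  }
  # plot_merged_figs.R comes last because it combines output of the others.
  scripts <- c("prediction1.R", "top3_english.R", "prediction2.R",
               "prediction3.R", "plot_merged_figs.R")
  for (s in scripts) {
    path <- file.path(script_dir, s)
    message("Running ", path)
    runner(path)
  }
  invisible(scripts)
}
```

Calling `run_all_figures()` from the repository root would then stop early if the Finnish data file has not been downloaded yet.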
A description of the scripts:
- prediction1.R: code for the first analysis and Figure 2a
- prediction2.R: code for the second analysis, generating Figure 3, Supplementary Figure 1 and Supplementary Figure 2
- prediction3.R: code for the third analysis, generating Figure 4
- plot_merged_figs.R: code to generate multi-panel figures (Figure 2)
- top3_english.R: code for Figure 2b
- make_vietnamese_names_from_US_census.R: code to extract Vietnamese prefix-name data from the 2010 US Census
- supp_transcription.R: code to compare the entropy calculated from the original data (containing name and spelling variations) with that from the pre-processed data (no such variations). See more details in Table 5 in the Methods section.
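To illustrate the kind of comparison supp_transcription.R describes, here is a minimal sketch of Shannon entropy computed before and after merging spelling variants. The names, the `standardise` mapping, and the function `shannon_entropy` are invented for this example and are not taken from the repository's data or code.

```r
# Shannon entropy (in bits) of a name distribution given raw counts.
shannon_entropy <- function(counts) {
  counts <- counts[counts > 0]
  p <- counts / sum(counts)
  -sum(p * log2(p))
}

# Toy data (hypothetical): the same names with and without spelling variants.
names_raw <- c("Catherine", "Katherine", "Mary", "Mary", "John", "Jon")
standardise <- c(Katherine = "Catherine", Jon = "John")  # assumed mapping
names_clean <- ifelse(names_raw %in% names(standardise),
                      standardise[names_raw], names_raw)

h_raw   <- shannon_entropy(table(names_raw))
h_clean <- shannon_entropy(table(names_clean))
c(raw = h_raw, clean = h_clean)
```

Merging variants can only keep the number of distinct names the same or reduce it, so in this toy case the entropy of the cleaned data is lower than that of the raw data.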
Data
This folder contains the data in our analysis.
Data availability statement
The American, Korean, and Taiwanese name census data, as well as the birth records of parishes in Scotland and northern England, have been deposited in Github (github.com/cshnican/names) and Zenodo (https://zenodo.org/doi/10.5281/zenodo.13755110). The Finnish birth records can be obtained from Malmi (2018) via this link. A preprocessed dataset can be made available on request. The scientist name data is not available due to data privacy concerns but can be accessed on request.
American
US_census_all: US baby name data, a dataset published by the Social Security Administration (SSA). The dataset is downloaded from here (see State-specific data). The dataset was made available by the SSA for the express use of "researchers interested in naming trends" (see their statement).
- AK.txt, ..., WY.txt: the data in each state and DC
- StateReadMe.pdf: a PDF describing the dataset
CA.txt, DE.txt: copies of the same two files from the US_census_all folder, containing baby name data from California (the most populous US state) and Delaware (one of the least populous US states). The dataset was made available by the SSA for the express use of "researchers interested in naming trends" (see their statement).
us_census_name_2010: a folder containing the byname data from the 2010 US Census, downloaded from the US Census Bureau (link). The terms on their website expressly state that this data is made available for use in research (see their statement).
- the file used in this study is ./surnames_appearing_more_than_100_times/Names_2010Census.csv.
us_population_change: contains the file population_change_data.csv, a dataset of the population census data for each state (plus DC and Puerto Rico) every 10 years. The dataset is downloaded from here. The terms on their website expressly state that this data is made available for use in research (see their statement).
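As a rough sketch of how one of the state files might be loaded: the SSA state files are, to our understanding, headerless comma-separated records of state, sex, year, name, and count, but check StateReadMe.pdf before relying on this. The function name `read_ssa_state` and the sample rows below are invented for illustration.

```r
# Hypothetical reader for an SSA state file such as CA.txt.
# Column classes are set explicitly so that a sex column containing only "F"
# is not silently converted to logical FALSE.
read_ssa_state <- function(file) {
  read.csv(file, header = FALSE,
           col.names  = c("state", "sex", "year", "name", "count"),
           colClasses = c("character", "character", "integer",
                          "character", "integer"))
}

# Made-up rows in the assumed format (not real SSA figures).
sample_lines <- c("CA,F,1910,Mary,30",
                  "CA,F,1910,Helen,20",
                  "CA,F,1910,Ruth,10")
ca <- read_ssa_state(textConnection(sample_lines))

# Total count per name, most frequent first.
totals <- aggregate(count ~ name, data = ca, FUN = sum)
totals[order(-totals$count), ]
```

For a real file, one would pass its path instead of the `textConnection`, for example the path to CA.txt inside the Data folder.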
Taiwanese
Chinese_name_data: a folder containing the raw files related to Chinese names
- taiwan_2018.csv: a list of the 500 most common Taiwanese prefix-names. The data is manually extracted from the 2018 population census conducted by the Taiwanese Ministry of Interior (link; see Table 57, pp. 282-304). The Taiwanese government allows their published data to be freely used for noncommercial purposes (see the statement here).
- taiwan_givenname.csv: a list of the 100 most common bynames for Taiwanese men and the 100 most common bynames for Taiwanese women. The data is manually extracted from the 2018 population census conducted by the Taiwanese Ministry of Interior (link; see Table 51, pp. 264-265). The Taiwanese government allows their published data to be freely used for noncommercial purposes (see the statement here).
Korean
Korea: a folder containing the raw files related to Korean names
- korea_2015_hanja: a list of Korean prefix-names with a population greater than 5. The data is taken from the 2015 population census data published by Korean Statistical Information Service (link). One major function of the Korean National Statistic Office is to make statistical data available for researchers (see the original statement here, in the Dissemination of Statistical Information paragraph).
England
english-names-pop.csv: the population of England between 1801 and 1901 and the portion of the population having the 3 most popular prefix-names in each gender.
- the population data is taken from populationdata.org.uk. Population Data UK is a site dedicated to providing information about the population of the United Kingdom. Our use of this data is in full compliance with the site's stated terms.
- the prefix-name data is taken from Table 1 in Douglas A Galbi. Long-term trends in personal given name frequencies in the UK. Available at SSRN 366240 (2002). One purpose of such a dataset, according to Galbi, is "to spur further analysis of given names". Our use of this data is in full compliance with this (see the original statement here).
northern_england.csv: names from two pre-modern English counties for the period between 1700 and 1800. Extracted from George Bell's parish marriage register transcriptions for Northumberland and Durham between 1701 and 1800 (link). This dataset, obtained from public records, is made available by Douglas Galbi explicitly for academic use. One purpose of such a dataset, according to Galbi, is "to spur further analysis of given names". Our use of this data is in full compliance with this (see the original statement here).
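The "portion of the population having the 3 most popular prefix-names" can be computed from a table of name counts as in the sketch below. The function `top_k_share` and the counts are hypothetical and are not the column layout or values of english-names-pop.csv.

```r
# Share of all name bearers who carry one of the k most frequent names.
top_k_share <- function(counts, k = 3) {
  stopifnot(all(counts >= 0), sum(counts) > 0)
  top <- sort(counts, decreasing = TRUE)[seq_len(min(k, length(counts)))]
  sum(top) / sum(counts)
}

# Toy counts (invented): 100 bearers spread over five names.
toy_counts <- c(John = 50, William = 30, Thomas = 10, Henry = 6, George = 4)
top_k_share(toy_counts)  # 90 of 100 bearers, i.e. 0.9
```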
Scottish
Scotland: prefix-names from four pre-modern Scottish parishes for the period between 1700 and 1800, extracted by Alice Crook (2012) from the National Records of Scotland (https://www.nrscotland.gov.uk/). All content of the National Records of Scotland is available under the Open Government Licence v3.0, which allows its data to be copied, published, distributed, and transmitted as long as the source is acknowledged [source](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/). This data is contained in an appendix of Crook's MPhil thesis [Crook, Alice Louise (2012) Personal naming patterns in Scotland, 1700 - 1800: a comparative study of the parishes of Beith, Dingwall, Earlston, and Govan. MPhil(R) thesis]. The thesis is publicly available here.
- 2012crookmphil.pdf: a copy of Crook's thesis
- beith.csv, beith.xlsx: prefix-name distribution in Beith
- dingwall.csv, dingwall.xlsx: prefix-name distribution in Dingwall
- earlstone.csv, earlstone.xlsx: prefix-name distribution in Earlston
- govan.csv, govan.xlsx: prefix-name distribution in Govan
- scot_source_info.txt: a .txt file summarizing the content above
Vietnamese-American
Vietnam (US 2010): contains the file vietnamese_american_data.csv, a subset of Vietnamese-American bynames pulled from the US 2010 Census
Finnish
finnish_data_selected.csv: Finnish birth records from 1700 to 1917. This data was made available by Eric Malmi and was built for Malmi et al. (2018) from the HisKi Finnish genealogical data set. The data set was gathered and cleaned as described in Malmi et al. (2018) (link to Malmi et al., 2018). We focused in particular on birth name data and used standardized spellings for all names, as in Malmi et al. (2018). According to Malmi (private communication), permission was given to Malmi and other researchers to use the HisKi dataset. There are other papers that use the HisKi database (e.g. this) or Malmi's database (e.g. this).
Scientist names
downsampled_scinames.csv (not available): a list of 2550 scientist names from the national academies of 6 countries (we sampled 425 names from each). As indicated below, we sourced American, Chinese, and French scientist names from Wikipedia, which is available for reuse under a CC-BY-SA license. We consider the Finnish scientist names to be public information under the privacy policy of the Finnish Academy of Science and Letters, since members have the right to prohibit access to their information (see the original statement in Finnish here, at "6. Säännönmukaiset tietolähteet"), and further correspondence with the academy confirmed this. However, we did not find explicit licensing information on the website of the Korean Academy of Sciences or on the website of the Russian Academy of Sciences. We emailed both institutions but did not receive a response. Because of this, we decided not to make this dataset publicly available. However, it can be provided upon request.
- 425 names from the National Academy of Sciences (USA) link
- 425 names from the Chinese Academy of Sciences link
- 425 names from the French Academy of Sciences link
- 425 names from the Finnish Academy of Science and Letters link
- 425 names from the Russian Academy of Sciences link
- 425 names from the Korean National Academy of Sciences link
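The README does not say how the 425 names per academy were drawn. As one possible reading, the sketch below takes a simple random sample of equal size from each group. The function `downsample_by_group`, the column names, the seed, and the placeholder data are all assumptions made for illustration.

```r
# Hypothetical equal-size random sample per group (e.g. per academy).
downsample_by_group <- function(df, group_col, n, seed = 1) {
  set.seed(seed)  # assumed seed, only to make the sketch repeatable
  pieces <- lapply(split(df, df[[group_col]]), function(g) {
    g[sample(nrow(g), size = min(n, nrow(g))), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

# Placeholder data: 6 academies with 500 invented name labels each.
academies <- c("USA", "China", "France", "Finland", "Russia", "Korea")
all_names <- data.frame(
  academy = rep(academies, each = 500),
  name    = paste0("name_", seq_len(6 * 500)),
  stringsAsFactors = FALSE
)

sampled <- downsample_by_group(all_names, "academy", n = 425)
table(sampled$academy)  # 425 rows per academy, 2550 in total
```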
plot_dupes_helper.csv: scientist names indexed in different styles, as an illustration (Figure 4a).
Data generated by R scripts
exp3plots.RData, fig_population_relative_to_top.RData: RData files generated by R scripts, in order to make multi-panel figures.
imgs
This folder contains all the images, in PDF format
Figure 1
The main file is figure1.pdf
Figure 2
The main file is figure2.pdf
Figure 3
The main file is figure3.pdf
Figure 4
The main file is figure4.pdf
Supplementary Figure 1
The main file is supp_fig1.pdf
Supplementary Figure 2
The main file is supp_fig2.pdf
Figure 2a (for the cover image of this repository)
The main file is proportion_relative_to_top.png
Owner
- Name: Sihan Chen
- Login: cshnican
- Kind: user
- Location: Cambridge, MA
- Company: MIT Brain and Cognitive Sciences
- Twitter: cshnican
- Repositories: 2
- Profile: https://github.com/cshnican
GitHub Events
Total
- Release event: 1
- Push event: 7
- Create event: 1
Last Year
- Release event: 1
- Push event: 7
- Create event: 1