names

https://github.com/cshnican/names

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: cshnican
License: mit
Language: R
Default Branch: main
Size: 94.5 MB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 4

Created almost 2 years ago · Last pushed about 1 year ago

Metadata Files

Readme, License, Citation

Cross-cultural structures of personal name systems reflect general communicative principles

Michael Ramscar, Sihan Chen, Richard Futrell, Kyle Mahowald

Figure 2A

This GitHub repository contains the data, the analysis scripts, and the figures from the paper.

A description of the repository content

Note: as mentioned in the paper, we call the element of a personal name that goes first the prefix-name and the element that goes second the byname. For example, in the name John Smith, John is the prefix-name and Smith is the byname. In the name Li Xiaoping, Li is the prefix-name and Xiaoping is the byname.
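The terminology above can be illustrated with a minimal sketch. The helper `split_name` is hypothetical (it is not part of the repository) and assumes that name elements are separated by whitespace and that everything after the first element counts as the byname.

```r
# Hypothetical helper: split a full name into prefix-name and byname.
# Assumption: elements are whitespace-separated; the first element is the
# prefix-name and the remainder is the byname.
split_name <- function(full_name) {
  parts <- strsplit(trimws(full_name), "\\s+")[[1]]
  list(
    prefix_name = parts[1],
    byname = paste(parts[-1], collapse = " ")
  )
}

split_name("John Smith")   # prefix_name "John", byname "Smith"
split_name("Li Xiaoping")  # prefix_name "Li", byname "Xiaoping"
```

Note that the split is purely positional: it follows the written order of the name, not the given-name/family-name distinction.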

Scripts

To reproduce all the main figures, download finnish_data_selected.csv from here and put the file inside the Data folder. Run prediction1.R, top3_english.R, prediction2.R, and prediction3.R first, and then run plot_merged_figs.R.
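The run order described above could be scripted roughly as follows. This is a sketch, not part of the repository: `run_all`, `script_dir`, and `data_dir` are hypothetical names, and it assumes the working directory is the repository root so that the scripts' relative paths resolve.

```r
# Scripts that must run first, followed by the script that merges the panels.
first_pass <- c("prediction1.R", "top3_english.R", "prediction2.R", "prediction3.R")
final_pass <- "plot_merged_figs.R"

# Hypothetical helper: check for the Finnish data, then source each script.
# Assumption: scripts live in `script_dir` and the data folder is `data_dir`.
run_all <- function(script_dir = ".", data_dir = "Data") {
  finnish <- file.path(data_dir, "finnish_data_selected.csv")
  if (!file.exists(finnish)) {
    stop("Missing ", finnish, "; download it before running the scripts.")
  }
  for (script in c(first_pass, final_pass)) {
    message("Running ", script)
    source(file.path(script_dir, script))
  }
  invisible(TRUE)
}
```

Calling `run_all()` from the repository root would then stop early with an error if the Finnish file has not been downloaded yet.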

A description of the scripts:
- prediction1.R: code for the first analysis and Figure 2a
- prediction2.R: code for the second analysis, generating Figure 3, Supplementary Figure 1 and Supplementary Figure 2
- prediction3.R: code for the third analysis, generating Figure 4
- plot_merged_figs.R: code to generate multi-panel figures (Figure 2)
- top3_english.R: code for Figure 2b
- make_vietnamese_names_from_US_census.R: code to extract Vietnamese prefix-name data from the 2010 US Census
- supp_transcription.R: code to compare the entropy calculated from the original data (containing name and spelling variations) with that from the pre-processed data (no such variations). See more details in Table 5 in the Methods section.
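To make the entropy comparison in supp_transcription.R concrete, here is a minimal sketch of a plug-in Shannon entropy over name counts. The README does not state which estimator or log base the script uses, so base 2 (bits) is an assumption, and the counts below are made up for illustration.

```r
# Plug-in Shannon entropy (in bits) of a distribution given as counts.
# Assumption: base-2 logarithm; the actual script may differ.
shannon_entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]  # drop zero counts so that 0 * log2(0) does not yield NaN
  -sum(p * log2(p))
}

# Toy counts (made up): spelling variants kept apart vs. merged.
raw <- c(Ann = 30, Anne = 20, Mary = 50)
merged <- c(Ann = 50, Mary = 50)

shannon_entropy(raw)
shannon_entropy(merged)  # 1 bit: two equally frequent names
```

Merging spelling variants can only lower or preserve the entropy, which is why comparing original and pre-processed data is informative.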

Data

This folder contains the data in our analysis.

Data availability statement

The American, Korean, and Taiwanese name census data, as well as the birth records of parishes in Scotland and northern England, have been deposited on GitHub (github.com/cshnican/names) and Zenodo (https://zenodo.org/doi/10.5281/zenodo.13755110). The Finnish birth records can be obtained from Malmi (2018) via this link. A preprocessed dataset can be made available on request. The scientist name data is not publicly available due to data privacy concerns but can be accessed on request.

American

US_census_all: US baby name data, a dataset published by the Social Security Administration (SSA). The dataset is downloaded from here (see State-specific data). The dataset was made available by the SSA for the express use of "researchers interested in naming trends" (see their statement).
- AK.txt, ..., WY.txt: the data in each state and DC
- StateReadMe.pdf: a pdf describing the dataset
CA.txt, DE.txt: copies of the same two files from the US_census_all folder - baby name data from California (the most populous US state) and Delaware (one of the least populous US states). The dataset was made available by the SSA for the express use of "researchers interested in naming trends" (see their statement).
us_census_name_2010: a folder containing the byname data from the 2010 US Census, downloaded from the US Census Bureau (link). The terms on their website expressly state that this data is made available for use in research (see their statement).
- the file used in this study is ./surnames_appearing_more_than_100_times/Names_2010Census.csv.
us_population_change: a folder containing the file population_change_data.csv, a dataset of population census data for each state (plus DC and Puerto Rico) every 10 years. The dataset is downloaded from here. The terms on their website expressly state that this data is made available for use in research (see their statement).
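As a rough sketch of how a state file such as CA.txt or DE.txt might be loaded: the SSA state files are, to our understanding, headerless comma-separated text with the fields state, sex, year, name, and count. Treat that layout as an assumption and check StateReadMe.pdf. The function names and the example path are hypothetical.

```r
# Hypothetical reader for one SSA state file.
# Assumption: no header row; fields are state, sex, year, name, count.
read_state_names <- function(path) {
  read.csv(
    path,
    header = FALSE,
    col.names = c("state", "sex", "year", "name", "count"),
    # Explicit classes keep sex "F" from being parsed as logical FALSE.
    colClasses = c("character", "character", "integer", "character", "integer")
  )
}

# Total count per name in one year, most frequent first.
name_totals <- function(df, yr) {
  one_year <- df[df$year == yr, ]
  totals <- aggregate(count ~ name, data = one_year, FUN = sum)
  totals[order(-totals$count), ]
}

# Example call (the path is an assumption about where the file sits):
# head(name_totals(read_state_names("Data/CA.txt"), 1950))
```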

Taiwanese

Chinese_name_data: a folder containing the raw files related to Chinese names
- taiwan_2018.csv: a list of the 500 most common Taiwanese prefix-names. The data is manually extracted from the 2018 population census conducted by the Taiwanese Ministry of Interior (link; see Table 57, pp. 282-304). The Taiwanese government allows their published data to be freely used for noncommercial purposes (see the statement here).
- taiwan_givenname.csv: a list of the 100 most common Taiwanese men's bynames and the 100 most common Taiwanese women's bynames. The data is manually extracted from the 2018 population census conducted by the Taiwanese Ministry of Interior (link; see Table 51, pp. 264-265). The Taiwanese government allows their published data to be freely used for noncommercial purposes (see the statement here).

Korean

Korea: a folder containing the raw files related to Korean names
- korea_2015_hanja: a list of Korean prefix-names with a population greater than 5. The data is taken from the 2015 population census data published by the Korean Statistical Information Service (link). One major function of the Korean National Statistical Office is to make statistical data available to researchers (see the original statement here, in the Dissemination of Statistical Information paragraph).

England

english-names-pop.csv: the population of England between 1801 and 1901 and the proportion of the population having the 3 most popular prefix-names for each gender.
- the population data is taken from populationdata.org.uk. Population Data UK is a site dedicated to providing information about the population of the United Kingdom. Our use of this data is in full compliance with the site's stated terms.
- the prefix-name data is taken from Table 1 in Douglas A Galbi. Long-term trends in personal given name frequencies in the UK. Available at SSRN 366240 (2002). One purpose of this dataset, according to Galbi, is "to spur further analysis of given names". Our use of this data is in full compliance with this (see the original statement here).
northern_england.csv: names from two pre-modern English counties for the period between 1700 and 1800, extracted from George Bell's parish marriage register transcriptions for Northumberland and Durham between 1701 and 1800 (link). This dataset, comprising data obtained from public records, is made available explicitly for academic use by Douglas Galbi. One purpose of this dataset, according to Galbi, is "to spur further analysis of given names". Our use of this data is in full compliance with this (see the original statement here).
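The quantity recorded in english-names-pop.csv, the share of people bearing one of the three most popular prefix-names, can be computed from raw counts along these lines. This is a sketch with made-up counts; `top_k_share` is a hypothetical helper, not a function from the repository.

```r
# Share of the population carrying one of the k most frequent names.
top_k_share <- function(counts, k = 3) {
  sorted <- sort(counts, decreasing = TRUE)
  sum(head(sorted, k)) / sum(counts)
}

# Toy counts (made up for illustration).
toy <- c(John = 5, William = 3, Thomas = 2, Robert = 1, George = 1)
top_k_share(toy)  # (5 + 3 + 2) / 12
```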

Scottish

Scotland: prefix-names from four pre-modern Scottish parishes for the period 1700-1800, extracted by Alice Crook (2012) from the National Records of Scotland (https://www.nrscotland.gov.uk/). All content from the National Records of Scotland is available under the Open Government Licence v3.0, which allows its data to be copied, published, distributed, and transmitted as long as the source is acknowledged [source](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/). This data is contained in an appendix of Crook's MPhil thesis [Crook, Alice Louise (2012) Personal naming patterns in Scotland, 1700 - 1800: a comparative study of the parishes of Beith, Dingwall, Earlston, and Govan. MPhil(R) thesis]. The thesis is publicly available here.
- 2012crookmphil.pdf: a copy of Crook's thesis
- beith.csv, beith.xlsx: prefix-name distribution in Beith
- dingwall.csv, dingwall.xlsx: prefix-name distribution in Dingwall
- earlstone.csv, earlstone.xlsx: prefix-name distribution in Earlston
- govan.csv, govan.xlsx: prefix-name distribution in Govan
- scot_source_info.txt: a .txt file summarizing the content above

Vietnamese-American

Vietnam (US 2010): a folder containing the file vietnamese_american_data.csv, a subset of Vietnamese-American bynames pulled from the 2010 US Census

Finnish

finnish_data_selected.csv: Finnish birth records from 1700 to 1917. This data was made available by Eric Malmi and was built for Malmi et al. (2018) from the HisKi Finnish genealogical data set. The data set was gathered and cleaned as described in Malmi et al. (2018) (link to Malmi et al., 2018). We focused in particular on birth-name data and, for all names, used the standardized spellings of Malmi et al. (2018). According to Malmi (private communication), permission was given to Malmi and other researchers to use the HisKi dataset. There are other papers that use the HisKi database (e.g. this) or Malmi's database (e.g. this).

Scientist names

downsampled_scinames.csv (not available): a list of 2550 scientist names from the national academies of 6 countries (we sampled 425 names from each). As indicated below, we sourced American, Chinese, and French scientist names from Wikipedia, which is available for reuse under a CC-BY-SA license. We consider the Finnish scientist names to be public information under the privacy policy of the Finnish Academy of Science and Letters, since members have the right to prohibit access to their information (see the original statement in Finnish here, at "6. Säännönmukaiset tietolähteet"), and further correspondence with the academy confirmed this. However, we did not find explicit licensing information on the website of the Korean Academy of Sciences or on the website of the Russian Academy of Sciences. We emailed both institutions but did not receive a response. Because of this, we decided not to make this dataset publicly available. However, it can be provided upon request.
- 425 names from the National Academy of Sciences (USA) link
- 425 names from the Chinese Academy of Sciences link
- 425 names from the French Academy of Sciences link
- 425 names from the Finnish Academy of Science and Letters link
- 425 names from the Russian Academy of Sciences link
- 425 names from the Korean National Academy of Sciences link
plot_dupes_helper.csv: scientist names indexed in different styles, as an illustration (Figure 4a).
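The equal-sized sampling described for downsampled_scinames.csv (425 names from each of 6 academies, 2550 in total) could be done along the following lines. This is a sketch under stated assumptions: the function, the `academy` column name, and the seed are hypothetical, and the authors' actual sampling procedure is not documented here.

```r
# Hypothetical sketch: draw the same number of rows from each group.
# Assumption: `df` has one row per scientist and a grouping column
# (named `academy` here purely for illustration).
downsample_by_group <- function(df, group_col = "academy", n = 425, seed = 1) {
  set.seed(seed)  # fixed seed so the draw is repeatable
  pieces <- lapply(split(df, df[[group_col]]), function(g) {
    g[sample(nrow(g), n), , drop = FALSE]  # sample without replacement
  })
  do.call(rbind, pieces)
}

# Synthetic data: 6 groups of 500 made-up entries each.
toy <- data.frame(
  academy = rep(LETTERS[1:6], each = 500),
  name = paste0("n", 1:3000)
)
out <- downsample_by_group(toy)
nrow(out)  # 6 * 425 = 2550
```

Sampling without replacement requires every group to hold at least `n` rows; `sample()` raises an error otherwise.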

Data generated by R scripts

exp3plots.RData, fig_population_relative_to_top.RData: RData files generated by R scripts, in order to make multi-panel figures.

imgs

This folder contains all the images, in PDF format

Figure 1

The main file is figure1.pdf

Figure 2

The main file is figure2.pdf

Figure 3

The main file is figure3.pdf

Figure 4

The main file is figure4.pdf

Supplementary Figure 1

The main file is supp_fig1.pdf

Supplementary Figure 2

The main file is supp_fig2.pdf

Figure 2a (for the cover image of this repository)

The main file is proportion_relative_to_top.png

Owner

Name: Sihan Chen
Login: cshnican
Kind: user
Location: Cambridge, MA
Company: MIT Brain and Cognitive Sciences

Twitter: cshnican
Repositories: 2
Profile: https://github.com/cshnican

GitHub Events

Total

Release event: 1
Push event: 7
Create event: 1

Last Year

Release event: 1
Push event: 7
Create event: 1


Science Score: 49.0%
