ingredient_cooccurrence

https://github.com/kmcooper/ingredient_cooccurrence

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (12.1%) to scientific vocabulary

Last synced: 6 months ago · JSON representation ·

Repository

Basic Info

Host: GitHub
Owner: kmcooper
License: cc-by-sa-4.0
Language: Perl
Default Branch: master
Size: 7.38 MB

Statistics

Stars: 2
Watchers: 0
Forks: 2
Open Issues: 0
Releases: 0

Created almost 7 years ago · Last pushed over 3 years ago

Metadata Files

Readme License Citation

The Ingredient Co-occurrence Network of packaged foods distributed in the United States

Title: "Open Food Database: Ingredient Co-Occurence Network"
Author: "Kate Cooper"
Creation date: "Feb-28-2019"
Last updated: "Jun-27-2019"
Output: html_document
License for use: License

The following presents a detailed and reproducible* methodology of how the ingredient co-occurrence network from the Open Food Database was created. While the code below was not used to generate all the tables and figures in the manuscript, it was used to generate a majority of them. The code used to generate specific tables and figures is noted where applicable. Please direct any questions, comments, or concerns to kmcooper [at] unomaha [dot] edu.

*=hopefully

This work has been reproduced using the USDA Food Data Central database (and converting my Perl scripts to Python) by Ricky Flores and can be found here: https://github.com/r-flores/Ingredient_Classification

1. Data Download

The data used for this analysis was downloaded from https://world.openfoodfacts.org/data on 03-21-2019 at 9:57am as a CSV file. The file folder hierarchy I have set up assumes you are in your home directory and have created a folder called ingredient_network with subfolders ingredient_network/data and ingredient_network/code. All code needed for the ingredient_network/code folder is available on the Github.

Environment setup:
1. Set your working directory as the variable workingDir
2. The code assumes in your working directory, you have subfolders: *workingDir/data* and *workingDir/code*

You will need to set each of the following variables: * workingDir: The main directory where you want to work; I used /ingredient_network/ * dataDir: This should point to the location of the workingDir/data/ directory * codeDir: This should point to the location of the workingDir/code/ directory * hash_infile: The name of the file you will use to create the ingredient network * hash_outfile: The name of the file containing a hash of ingredients as keys and count as values * network_rawfile: The name of the network .sif file generated from the code below

```python

Set variables

Preferred working directory

workingDir = "~/ingredient_network/"

Where data and code are stored

dataDir = "~/ingredientnetwork/data/" codeDir = "~/ingredientnetwork/code/"

Names you would like to give these files, including

The ingredients file output from the R script (after pre-processing)

hash_infile = "ingredients.tab"

The ingredient hash file (counting occurrence of ingredients)

hash_outfile = "ingredients.hash.txt"

The name of the co-occurrence network

network_rawfile = "ingredients.network.sif"

The name of the co-occurrence network that has been filtered for only highly co-occurring ingredients

parsednetworkrawfile = "ingredients.network.parsed.sif"

The threshold to be used to determine when an ingredient is "highly co-occurring"

edgeweightthreshold = 2000

```

At time of download (March 2019), the CSV file downloaded was 2.21GB, so plan accordingly if you need space. The file itself contained 798,919 lines by wordcount and is tab-delimited.

```python

Change to the data directory and download the file from the Open Food Database

setwd(paste0(workingDir,"data")) ret = download.file("https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv","rawdata.csv") system('wc -l rawdata.csv')

```

The dataset itself contains a lot of information; at this exploratory stage we only would like to investigate ingredients per product.
We also have to check the following: * Are there dupliate products in the data? * Are all ingredients named in the same way?

First, we will check if there are duplicates by barcode, the first column in the dataset. We will search only for foods sold in the United States (column 32). We also want to include the ingredients (columnt 35), and any allergens that may be noted (columns 36 and 37).

python system('cut -f 1,32,35,36,37 raw_data.csv | grep \'United States\' | uniq | wc -l')

We want to look at product by barcode as a unique ID for the product and we will make our ingredient network by comparing ingredients from the "ingredients" text in column 35, so in the next steps we extract columns 1 and 35 only from the data, and remove duplicates. This cuts our file, now named raw_ingredients.txt down to a much more manageable size of 38.5MB.

python system('cut -f 1,32,35 raw_data.csv | grep \'^\\d*\tUnited States\t.*\' | cut -f 1,3 > raw_ingredients.txt ')

We note that there are 174,785 rows (unique barcodes for foods in the United States) listed in our raw_ingredients file. It is assumed that entries into the Open Food Database are not reviewed for correctness [citation needed], but the 2004 Food Allergen Labeling Consumer Protection Act (FALCPA), which took effect in January 2006, requires all food labels in the United States to identify if a product contains one of the eight major allergens. (Source)

python system('cut -f 1 raw_ingredients.txt | uniq | wc -l')

2. Data Pre-processing

Next, we want to remove any barcodes that do not have ingredients associated with them.
To do this, lets read the file into R and begin manipulating it in memory.

python raw_ingredients = read.csv("raw_ingredients.txt",sep="\t",header = TRUE) names(raw_ingredients) <- c("barcode","ingredients_text")

This should result in a dataframe called raw_ingredients that contains 499,879 observations of 2 variables. Next, we remove duplicates by removing rows for which ingredients_text is empty.

```python ingredients <- as.data.frame(rawingredients[-which(rawingredients$ingredients_text == ""), ])

You can now remove the raw_ingredients variable

if you are feeling confident in the reproducibility of your project

To do this, uncomment the command below

remove(raw_ingredients)

```

Once this is complete, we can begin to investigate and compare ingredients. Taking a brief look at the data, the first challenge to overcome is the case of the text; evaluating "Salt" and "salt" as equal ingredients will be easier if they are written in the same case. So next we change the ingredients list to all lowercase.

python ingredients$ingredients_text <- tolower(ingredients$ingredients_text)

Certain ingredients are followed by a percentage (i.e. "milk chocolate 32.7%"). For purposes of building our co-occurence network, we will disregard these percentages and remove them, both for US (32.7%) and European (32,7%) formatting styles.

python ingredients$ingredients_text <-gsub('\\d+[.,]*\\d*\\s{0,1}%','',ingredients$ingredients_text)

Further, we want to make similar changes in formatting of special characters to typical characters to make comparability of ingredients similar. For example, we want to remove the term "organic" from ingredients as organic is not regulated by the FDA.

python ingredients$ingredients_text <-gsub('organic ','',ingredients$ingredients_text, ignore.case = TRUE) ingredients$ingredients_text <-gsub('org ','',ingredients$ingredients_text, ignore.case = TRUE) ingredients$ingredients_text <-gsub('certified organic','',ingredients$ingredients_text, ignore.case = TRUE)

2.1 Remove special letters and characters and replace with standard [a-z] or otherwise

Note: The next few steps of substituting x for y are clearly more of a "brute force" approach for purposes of proof-of-concept. These steps we plan to improve in the future using more sophisticated IR approaches.

python ingredients$ingredients_text <-gsub('[éèë]','e',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('ï','i',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('[âà]','a',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('ô','o',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('_','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('?','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('\\[\\]','',ingredients$ingredients_text)

2.2 Remove the term "ingredients" as it is redundant

python ingredients$ingredients_text <-gsub('ingredient[s]*\\s{0,1}\\:*','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('ingrédient[s]*\\s{0,1}\\:*','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('ingrédiente[ns]*\\s{0,1}\\:*','',ingredients$ingredients_text)

2.3 Remove preparatory, quantities, or provenance terms

python ingredients$ingredients_text <-gsub('amount','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('serving','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('nourishment','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('nourishment','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('made from ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('only ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('pure ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('contains less than','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('vital ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('cultured','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('pasteurized','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('distilled','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('california','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('grade a+','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('extra','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('virgin','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('free range','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('french','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('high fiber','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('low fat','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('whole ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('rolled ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('expeller ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('pressed ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('raw ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('evaporated ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('cow\'s milk','milk',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('non-gmo','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('refined ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('milled ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('soft ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('hard ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('toasted ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('sliced ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('mediterranean ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('peeled ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('cored ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('dry ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('roasted ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('mountain ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('spring ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('diced ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('sparkling','carbonated',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('cooked ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('freshly made from: ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('concentrated ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('stone-ground ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('partially hydrogenated','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('steel-cut ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('steel cut ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('thick cut ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('thick-cut ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('gourmet ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('fresh ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('frozen ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('natural ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('non-genetically modified ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('genetically modified','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('and nothing else','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('love ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('nothing ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('italian ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('imported ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('hand-picked ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('dried ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('crushed ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('added to preserve freshness','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('for freshness','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('to preserve freshness','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('mechanically','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('mini','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('naturally','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('h2o','water',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('\\*','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('as a preservative','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('juice from','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('exclusively','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('ground','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('shaved','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('kosher','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('ripe','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('\\.','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('nitrogen-flushed to maintain freshness','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('farm-raised','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('boneless','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('skinless','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('wild caught','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('parboiled','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('preparboiled','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('individually wrapped','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('all nat','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('wq all nat','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('young','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('aged','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('premium','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('added to retain color','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('blend of','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('in shell','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('with their juices','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('selected ','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('with calcium ascorbate to promote whiteness','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('to promote freshness','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('not from concentrate','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('for color','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('real','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('propellant-free','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('gluten-free','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('grass-fed','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('grass fed','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('modified','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('pitted','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('reconstituted','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('cage\\-free','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('cage free','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('free\\-range','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('freeze-','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('wild-harvested','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('contains:','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('flavors','flavor',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('less than','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('of:','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('filtered','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('contents:','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('pre-washed','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('in-shell','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('in-the-shell','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('carragenan','carrageenan',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('carrageena','carrageenan',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('carrageenen','carrageenan',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('carrageen','carrageenan',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('enriched','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('unbleached','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('bleached','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('mono and diglycerides','monoglycerides and diglycerides',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('or less of the following','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('or less of each of the following','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('or less of','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('or less','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('distributed by','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('the','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('contains','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('with','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('textured','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('more of','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('following','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('product','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('strach','starch',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('emulsidiers','emulsifiers',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('occurring','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('or of following','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('\$\$','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('defatted','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('fractionated','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('hydrogenated','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('preserved','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('dry','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('modified','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('partially','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('cultured','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('or less of each of the following','',ingredients$ingredients_text)

2.4 Remove names

```python ingredients$ingredientstext <-gsub('bragg\'s ','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('bart \& judy\'s','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('nagai\'s','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('best annie\'s friends','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('annie\'s','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('what\'s inside?','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('newman\'s own','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('dr\.\s*kefir\'s','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('rockin\' poppin\'','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('hellmann\'s','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('us grown ','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('new zealand ','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('orville redenbacher\'s ','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('opal ','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('of modena','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('(sourced from the united kingdom)','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub(' may contain a blend of united states, brazilian, mexican, belize and south african concentrates.','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('from emilia romagna.','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('spanish','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('himalayan','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('irish\-style','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('scott\'s','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('pedro ximenez','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('best','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('friends','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('teacher\'s ','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('barley\'s','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('baker\'s','bakers',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('brewer\'s','brewers',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('sheep\'s','sheeps',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('kroger co','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('confectioner\'s','confectioners',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('trader joe\'s','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('m\&m\'s','',ingredients$ingredientstext) ingredients$ingredientstext <-gsub('m\,m\'s','',ingredients$ingredientstext)

```

2.5 Prepare for hashing by removing non-alphabetic special characters

python ingredients$ingredients_text <-gsub('&','and',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('\\]',')',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('\\[',')',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('\\}',')',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('\\{',')',ingredients$ingredients_text) ingredients$ingredients_text <-gsub(';',',',ingredients$ingredients_text) ingredients$ingredients_text <-gsub(':',',',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('-','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('-','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('@','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('#','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('!','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('\\/',' or ',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('\\|','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('\\%','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('†','',ingredients$ingredients_text) ingredients$ingredients_text <-gsub('`','',ingredients$ingredients_text)

3. Create a simple co-occurrence network

Next, we want to create a co-occurrence network, where nodes represent ingredients, and an edge drawn between the two nodes will represent a food where the two ingredients co-occur in the ingredients list. We will keep track of the number of times each co-occurrence edge occurs with a simple count. I have written two scripts (ingredient_hash.pl and ingredient_network.pl) located in the workingDir/code/ directory that will create a hash of all ingredients and a count of their overall occurrence, as well as the co-occurrence network.

3.1 Write the ingredients to a .tab file and run the perl scripts

```python write.table(ingredients, file = hash_infile,sep='\t',eol='\n',quote=FALSE,append=FALSE,row.names = FALSE)

Create an ingredient hash and an ingredient network with the scripts provided on this repository

makehashcommand = paste0("perl ",codeDir,"ingredienthash.pl ",dataDir,hashinfile," ",dataDir,hashoutfile) system(makehashcommand) makenetworkcommand = paste0("perl ",codeDir,"ingredientnetwork.pl ",dataDir,hashinfile," ",dataDir,networkrawfile) system(makenetworkcommand) ```

4. Basic Network Analysis in R with igraph

Next, we want to read in the network and find out what its like! To do so, we will utilize igraph's network analysis capabilities in R. If you do not have igraph installed, you will need to do so before you can successfully continue by uncommenting the line below:

igraph Online Documentation: Source

```python

Remember, only perform this step if you do not have igraph already installed

install.packages("igraph") ```

4.1 Read in the network and remove duplicate edges and self-loops

```python library(igraph) gdata <- read.csv(networkrawfile,sep = " ",header = F,as.is = T,col.names = c("v1","v2","weight")) g <- graphfromdataframe(gdata)

Check that the graph is weighted

Should say TRUE

is_weighted(g)

Check that the graph is not directed

Should say FALSE

is.directed(g)

Is the graph totally connected?

Should say FALSE

is.connected(g)

Remove loops and multiple edges

g <-simplify(g,remove.multiple = T,remove.loops = T) ```

4.2 Gather network descriptives (Table 3)

```python

Get number of nodes, edges, and edge density

length(V(g)) length(E(g)) graph.density(g)

Get the transitivity or clustering coefficient

cc <- transitivity(g,type="global")

Is the graph totally connected? We expect not.

is.connected(g)

Centrality Measures

Assortativity

ass <- assortativity_degree(g, directed=F)

Closeness

close <- closeness(g, mode="all", weights=NA)

Eigenvector centrality

eigen <- eigen_centrality(g, directed=FALSE, weights=NA)

Note: The commands below take time since the network is large,

so only uncomment if you can step away

Get the diameter, using edge weights

diam <- diameter(g,directed=FALSE,weights=E(g)$weight)

Betweenness centrality

bet <- betweenness(g, directed=FALSE, weights=NA)

```

4.3 Examine and plot the degree distribution (Figure 1A)

```python deg.dist <- degree.distribution(g,cumulative = TRUE,mode="all")

Plot the degree distribution as calculated

This will plot the graph shown in Figure 1A

plot(deg.dist, main = "Degree Distribution, only top 1000 nodes by degree shown", col="darkblue",pch=10,cex=0.75,xlab="Degree", ylab="Cumulative Frequency",xlim=c(0,1000))

Plot the log-log of the degree distribution

This is just a re-statement of the figure in 1A but is not presented in the manuscript

plot(log(deg.dist,base=10), main = "Degree Distribution of the Ingredient Co-Occurence Network", col="darkblue",pch=10,cex=0.75,xlab="Log10(Degree)", ylab="Log10(Cumulative Frequency)") ```

We can perform a number of tests to determine if the network itself is scale-free, although due to the incomplete nature of the source database we do not formally report on the "scale-free" nature of the network itself. One test you may run, however, is the Kolmogorov-Smirnov tests of degree-distributions in simulated scale-free and random networks to see how their distributions compare. We share some simple code to do this below.

```python

Compare the degree distribution of the ingredient co-occurrence network against

a random Erdos Reyni (random) graph and a Barabasi preferential attachment simulated graph

Create the graphs to compare against, an "ab" (Albert-Barabasi) network and an "er"(Erdös-Reyni) network

random.er.graph = erdos.renyi.game(n=as.numeric(length(V(g))), p.or.m = 1/1175,directed = F, loops = F) random.ab.graph = barabasi.game(n=length(V(g)),m=37,directed=F)

Compare vertex count

length(V(g)) length(V(random.ab.graph)) length(V(random.er.graph))

Compare the edge count

length(E(g)) length(E(random.ab.graph)) length(E(random.er.graph))

Compare the histograms

hist(degree(g),xlim=c(0,6000)) hist(degree(random.ab.graph)) hist(degree(random.er.graph))

Generate degree distributions for the random graphs

random.er.degdist = degree.distribution(random.er.graph,cumulative=T,mode='all') random.ab.degdist = degree.distribution(random.ab.graph,cumulative=T,mode='all')

Perform the Kolmogorov Smirnov tests of the real graph against the random graphs

ks.test(deg.dist,random.er.degdist) ks.test(deg.dist,random.ab.degdist) ```

4.4 Examine the density of an induced subgraph from top K nodes (Figure 1B)

```python

Get the top X nodes by degree and their edge density

Obtain the degrees for all nodes in the graph

deg <- as.data.frame(degree(g, mode="all")) names(deg)= c("degree") deg$vertex_id = V(g)

Sort the dataframe by decreasing order (highest degree nodes will be first)

deg <- deg[order(deg$degree,decreasing = T),]

We begin by examining the top 3 nodes and then increase that size by intervals, see below

top_x = 3

Create new empty lists to hold incoming data

hublist = list() #A list for the hub nodes klist = list() #A list for keeping track of what values of k we have tried den_list = list() #A list for the densities of each induced subgraph we examine stop = False #A stop condition to tell us when we have sampled enough of the network

This loop will run until we have sampled the top 5000 nodes in intervals

while(stop == FALSE){ #Sanity check print statement to know your progress (its not fast, but not too slow either) print(c("at ",top_x))

#Create a list of the top x nodes by degree for(i in c(1:topx)){ hublist = c(hublist,deg[i,]$vertexid) }

#Get the induced subgraph of the top x hub nodes in the network subg <- inducedsubgraph(g,vids=c(hublist)) subg <- as.undirected(subg, mode= "collapse")

#Reset the hublist if you havent already hublist=list()

#Add this value of k to the list for use in plotting klist = c(klist,top_x)

#Get the density of the induced subgraph used above and add it to our list of subgraph densities for k
denlist = c(denlist,as.numeric(graph.density(subg,loops = F)))

#Start with k = 3, then go to use the top 50 nodes, increase by 50 nodes at each step until we hit 5000 nodes
if(topx ==3){ topx = 50 }else{ topx = topx+50 } if(top_x >= 5000){ stop=TRUE } }

Round the density list for better plotting

round(den_list,digits=2)

Plot the sampled density of the top k nodes

This will result in the plot shown in Figure 1B

plot(klist,denlist,type = "o", xlab="Top k Nodes by Degree", ylab="Density of the Induced Subgraph of the Top k Nodes", col="darkgreen",cex=0.90,pch=19, main = "Sampled Edge Density of the Top k Nodes" ) ```

We can use the lists created above to answer the following questions - when do we hit the (arbitrarily chosen) threshold for subgraphs with 90% density or higher?

```python

Top 400 nodes are 90%+ density; we see the difference at the split between 400 and 450 nodes.

90% density threshold is chosen arbitrarily

denlist[9] klist[9]

denlist[10] klist[10] ```

How many edges are represented in the top 400 nodes in the graph?

```python topx = 400 hublist = list() for(i in c(1:topx)){ hublist = c(hublist,deg[i,]$vertexid) }

Get the induced subgraph of the top i hub nodes in the network

subg <- inducedsubgraph(g,vids=c(hublist)) subg <- as.undirected(subg, mode= "collapse")

This number should be 400

length(V(subg))

How many edges in the induced subgraph by top 400 nodes? Whats the graph density?

graph.density(subg)

```

How many edges are represented in the top 5000 nodes in the graph?

```python

How many edges are represented in the top 5000 nodes in the graph?

topx = 5000 hublist = list() for(i in c(1:topx)){ hublist = c(hublist,deg[i,]$vertexid) }

Get the induced subgraph of the top i hub nodes in the network

subg <- inducedsubgraph(g,vids=c(hublist)) subg <- as.undirected(subg, mode= "collapse")

This number should be 400

length(V(subg))

How many edges in the induced subgraph by top 5000 nodes? Whats the graph density?

graph.density(subg) ```

4.5 Examine the network with only top 20 nodes (Figure 2)

The code below will retrieve the top 20 nodes by degree, create the induced subgraph, and print it out to be imported into Cytoscape for visualization. The file for the network as created in Figure 2 is also given as a .cys Cytoscape file under this repo as /data/top20.cys for users to open and manipulate themselves. We chose to use Cytoscape for network visualization over the plotting capabilities in R only for personal preference for a point-and-click interface at this stage.

```python

Get the induced subgraph for the top 500 nodes by degree and their edge density

deg <- as.data.frame(degree(g, mode="all")) names(deg)= c("degree") deg$vertexid = V(g) deg <- deg[order(deg$degree,decreasing = T),] topx = 20 hublist = list() for(i in c(1:topx)){ hublist = c(hublist,deg[i,]$vertex_id) }

Get the induced subgraph of the top i hub nodes in the network

subg <- inducedsubgraph(g,vids=c(hublist)) subg <- as.undirected(subg, mode= "collapse") write.graph(subg,file = paste0("top",topx,"network.txt"),format="ncol") write.csv(degree(subg),file = paste0("top",topx,"node_degree.txt")) ```

4.6 Examine the network with only edge weights >= 2000 (Figure 3)

The code below will parse the network file for only edge weights >= 2000, analyze it in R with igraph() functions, and output it to imported into Cytoscape for visualization. The file for the network as created in Figure 3 is also given as a .cys Cytoscape file under this repo as /data/parsed.cys for users to open and manipulate themselves. We chose to use Cytoscape for network visualization over the plotting capabilities in R only for personal preference for a point-and-click interface at this stage. Some of the network descriptives calculated below are not reported in the manuscript to avoid being long-winded, but they may be of interest or use so we included them here.

```python

Parse the network so only edges with weights higher than 2000 are used

This step may take a few minutes

parsenetworkcommand = paste0("perl ",codeDir,"parsenetworkbyweight.pl ",dataDir,networkrawfile," ",dataDir,parsednetworkrawfile," ",edgeweightthreshold) system(parsenetworkcommand)

Read in the parsed network

gdata <- read.csv(parsednetworkrawfile,sep = " ",header = F,as.is = T,col.names = c("v1","v2","weight")) g <- graphfromdataframe(g_data,directed = F)

Check that the graph is weighted

Should say TRUE

is_weighted(g)

Check that the graph is not directed

Should say FALSE

is.directed(g)

Is the graph totally connected?

is.connected(g)

Remove loops and multiple edges

g <-simplify(g,remove.multiple = T,remove.loops = T)

Get number of nodes

length(V(g))

Get number of edges

length(E(g))

Edge Density

graph.density(g,loops = FALSE)

Transitivity

cc <- transitivity(g,type="global")

Diameter, using weights

diam <- diameter(g,directed=FALSE)

Diameter, not using weights

diam <- diameter(g,directed=FALSE,weights=NA)

assortativity

ass <- assortativitydegree(g, directed=F) close <- closeness(g, mode="all", weights=NA) eigen <- eigencentrality(g, directed=FALSE, weights=NA) bet <- betweenness(g, directed=FALSE, weights=NA)

Examine and plot the degree distribution

deg.dist <- degree.distribution(g,cumulative = TRUE,mode="all") plot(deg.dist, main = "Degree Distribution, Parsed Co-Occurence Network", col="darkblue",pch=10,cex=0.75,xlab="Degree", ylab="Cumulative Frequency",xlim=c(0,1000)) ```

4.6 Maximal cliques in the parsed network with only edge weights >= 2000 (Table 4)

We also used the code below to identify maximal cliques in the parsed network (Table 4). You will find that there are actually 6 cliques reported, but 2 of these 6 contain a vertex that is not a food item ("or") so they were excluded in our manuscript report.

```python cliq<- max_cliques(g)

Examining the cliq variable allows us to see that the maximal clique size found is 16,

which is used below to investigate their contents

cliq2<-cliques(g,min=16,max=16) length(cliq2) cliq2[1] cliq2[2] cliq2[3] cliq2[4] cliq2[5] cliq2[6] ```

Owner

Name: Kate Cooper | kmcooper
Login: kmcooper
Kind: user

Twitter: katecooperOMA
Repositories: 1
Profile: https://github.com/kmcooper

BS Bioinformatics · PhD Pathology and Microbiology · Current: Assistant Prof w/ interests in microbiome, consumer health informatics, & bioinformatics education

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Cooper"
  given-names: "Kate"
  orcid: "https://orcid.org/0000-0002-6341-9648"
title: "The ingredient co-occurrence network of packaged foods distributed in the United States"
version: 1.0.0
doi: 10.1016/j.jfca.2019.103391
date-released: 2019-12-14
url: "https://github.com/kmcooper/Ingredient_CoOccurrence"

ingredient_cooccurrence

Science Score: 44.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

The Ingredient Co-occurrence Network of packaged foods distributed in the United States

1. Data Download

Set variables

Preferred working directory

Where data and code are stored

Names you would like to give these files, including

The ingredients file output from the R script (after pre-processing)

The ingredient hash file (counting occurrence of ingredients)

The name of the co-occurrence network

The name of the co-occurrence network that has been filtered for only highly co-occurring ingredients

The threshold to be used to determine when an ingredient is "highly co-occurring"

Change to the data directory and download the file from the Open Food Database

2. Data Pre-processing

You can now remove the raw_ingredients variable

if you are feeling confident in the reproducibility of your project

To do this, uncomment the command below

remove(raw_ingredients)

2.1 Remove special letters and characters and replace with standard [a-z] or otherwise

2.2 Remove the term "ingredients" as it is redundant

2.3 Remove preparatory, quantities, or provenance terms

2.4 Remove names

2.5 Prepare for hashing by removing non-alphabetic special characters

3. Create a simple co-occurrence network

3.1 Write the ingredients to a .tab file and run the perl scripts

Create an ingredient hash and an ingredient network with the scripts provided on this repository

4. Basic Network Analysis in R with igraph

Remember, only perform this step if you do not have igraph already installed

4.1 Read in the network and remove duplicate edges and self-loops

Check that the graph is weighted

Should say TRUE

Check that the graph is not directed

Should say FALSE

Is the graph totally connected?

Should say FALSE

Remove loops and multiple edges

4.2 Gather network descriptives (Table 3)

Get number of nodes, edges, and edge density

Get the transitivity or clustering coefficient

Is the graph totally connected? We expect not.

Centrality Measures

Assortativity

Closeness

Eigenvector centrality

Note: The commands below take time since the network is large,

so only uncomment if you can step away

Get the diameter, using edge weights

diam <- diameter(g,directed=FALSE,weights=E(g)$weight)

Betweenness centrality

bet <- betweenness(g, directed=FALSE, weights=NA)

4.3 Examine and plot the degree distribution (Figure 1A)

Plot the degree distribution as calculated

This will plot the graph shown in Figure 1A

Plot the log-log of the degree distribution

This is just a re-statement of the figure in 1A but is not presented in the manuscript

Compare the degree distribution of the ingredient co-occurrence network against

a random Erdos Reyni (random) graph and a Barabasi preferential attachment simulated graph

Create the graphs to compare against, an "ab" (Albert-Barabasi) network and an "er"(Erdös-Reyni) network

Compare vertex count

Compare the edge count

Compare the histograms

Generate degree distributions for the random graphs

Perform the Kolmogorov Smirnov tests of the real graph against the random graphs

4.4 Examine the density of an induced subgraph from top K nodes (Figure 1B)

Get the top X nodes by degree and their edge density

Obtain the degrees for all nodes in the graph

Sort the dataframe by decreasing order (highest degree nodes will be first)

We begin by examining the top 3 nodes and then increase that size by intervals, see below

Create new empty lists to hold incoming data

This loop will run until we have sampled the top 5000 nodes in intervals

Round the density list for better plotting

Plot the sampled density of the top k nodes

This will result in the plot shown in Figure 1B

Top 400 nodes are 90%+ density; we see the difference at the split between 400 and 450 nodes.