citationgraphs

A go package for citation graphs.

https://github.com/wujunfeng1/citationgraphs

Science Score: 18.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (0.5%) to scientific vocabulary
Last synced: 10 months ago · JSON representation ·

Repository

A go package for citation graphs.

Basic Info
  • Host: GitHub
  • Owner: wujunfeng1
  • License: mit
  • Language: Go
  • Default Branch: main
  • Size: 72.3 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 5 years ago · Last pushed about 5 years ago
Metadata Files
Readme License Citation

README.md

CitationGraphs.go

A go package for citation graphs.

Citation (CitationGraphs.go)

// CitationGraphs.go project CitationGraphs.go
package CitationGraphs

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"math"
	"math/rand"
	"os"
	"regexp"
	"runtime"
	"sort"
	"strconv"
	"strings"
	"sync"
	"time"

	"github.com/chrisport/go-lang-detector/langdet"
	"github.com/chrisport/go-lang-detector/langdet/langdetdef"
	"github.com/wujunfeng1/ConcurrenceBasedClustering"
	"github.com/wujunfeng1/KeyphraseExtraction"
	"github.com/wujunfeng1/wego/pkg/model/modelutil/vector"
	"github.com/wujunfeng1/wego/pkg/model/word2vec"
)

var reUnicodeHex *regexp.Regexp
var reUnicodeDec *regexp.Regexp
var langDetector langdet.Detector

func init() {
	reUnicodeHex = regexp.MustCompile("&//[Xx]([A-Fa-f0-9])+;")
	reUnicodeDec = regexp.MustCompile("&//([0-9])+;")
	rand.Seed(time.Now().Unix())
	langDetector = langdetdef.NewWithDefaultLanguages()
}

// =================================================================================================
// struct CitationNode
// brief description: The node structure of a citation graph, the data of which are collected
//                    from crawling the website https://academic.microsoft.com , which is the
//                    public service website of MAG(Microsoft Academic Graph)
// fields:
//   id: The MAG(Microsoft Academic Graph) id of the paper at this node.
//       We can access the detail of the paper with the id by navigating the web link:
//       https://academic.microsoft.com/paper/$id
//   year: The year of publication of this paper.
//   title: The title of the paper.
//   labels: The labels from MAG(Microsoft Academic Graph).
//   refs: The references of this paper collected from MAG.
//         Please note that this field could be inaccurate: MAG suffers delay of info.
//         Many new papers with refs don't have refs listed in MAG.
//   cites: The citations to this paper from other papers (also collected from MAG).
//         Please note that this field is as inaccurate as refs due to the same reason.
type CitationNode struct {
	ID     int64
	Year   int64
	Title  string
	Labels []string
	Refs   []int64
	Cites  []int64
}

// =================================================================================================
// struct CitationGraph
// brief description: A data structure of citation graph, the data of which are collected from
//                    the website https://academic.microsoft.com , which is the public service
//                    website of MAG(Microsoft Academic Graph)
// fields:
//   nodes: The node of struct CitationNode stored in a dictionary with key = id
//   toBeAnalyzed: The list of nodes to be analyzed. In ijcai_clustering dataset, these nodes
//                 are those published in IJCAI.
type CitationGraph struct {
	Nodes        map[int64]*CitationNode
	ToBeAnalyzed []int64
	idxMainNodes map[int64]int
}

// =================================================================================================
// struct Corpus
// brief description: the corpus data structure
type Corpus struct {
	Vocab map[string]int // the vocabulary
	Docs  []map[int]int  // keys: docID, wordID, value: word count
}

// =================================================================================================
// struct CorpusX
// brief description: the extended corpus data structure
type CorpusX struct {
	Vocab map[string]int  // the vocabulary
	Docs  [][]map[int]int // keys: docID, wordID, value: word count
}

// =================================================================================================
// struct CorpusSeq
// brief description: the extended corpus data structure
type CorpusSeq struct {
	Vocab map[string]int // the vocabulary
	Docs  [][]int        // keys: docID, wordIdx, value: word id
}

// =================================================================================================
// func NewCorpus
// brief description: create an empty corpus
func NewCorpus() *Corpus {
	return &Corpus{
		Vocab: map[string]int{},
		Docs:  []map[int]int{},
	}
}

// =================================================================================================
// func NewCorpusX
// brief description: create an empty corpus
func NewCorpusX() *CorpusX {
	return &CorpusX{
		Vocab: map[string]int{},
		Docs:  [][]map[int]int{},
	}
}

// =================================================================================================
// func NewCorpusSeq
// brief description: create an empty corpus
func NewCorpusSeq() *CorpusSeq {
	return &CorpusSeq{
		Vocab: map[string]int{},
		Docs:  [][]int{},
	}
}

// =================================================================================================
// func (this *Corpus) AddDoc
// brief description: add one document to corpus with specified docId and word count list, if the
// 	specified docId already exists in corpus, the old doc will be overwritted
func (this *Corpus) AddDoc(words []string) {
	// ---------------------------------------------------------------------------------------------
	// step 1: create a new doc and count word counts
	doc := map[int]int{}
	for _, word := range words {
		wordID, exists := this.Vocab[word]
		if !exists {
			wordID = len(this.Vocab)
			this.Vocab[word] = wordID
		}
		count, exists := doc[wordID]
		if !exists {
			count = 0
		}
		doc[wordID] = count + 1
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: add the new doc into docs
	this.Docs = append(this.Docs, doc)
}

// =================================================================================================
// func (this *CorpusX) AddDoc
// brief description: add one document to corpus with specified docId and word count list, if the
// 	specified docId already exists in corpus, the old doc will be overwritted
func (this *CorpusX) AddDoc(words [][]string) {
	// ---------------------------------------------------------------------------------------------
	// step 1: create a new doc and count word counts
	doc := make([]map[int]int, len(words))
	for idx, wordGroup := range words {
		doc[idx] = map[int]int{}
		for _, word := range wordGroup {
			wordID, exists := this.Vocab[word]
			if !exists {
				wordID = len(this.Vocab)
				this.Vocab[word] = wordID
			}
			count, exists := doc[idx][wordID]
			if !exists {
				count = 0
			}
			doc[idx][wordID] = count + 1
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: add the new doc into docs
	this.Docs = append(this.Docs, doc)
}

// =================================================================================================
// func (this *CorpusSeq) AddDoc
// brief description: add one document to corpus with specified docId and word count list, if the
// 	specified docId already exists in corpus, the old doc will be overwritted
func (this *CorpusSeq) AddDoc(words []string) {
	// ---------------------------------------------------------------------------------------------
	// step 1: create a new doc and count word counts
	doc := make([]int, len(words))
	for idx, word := range words {
		wordID, exists := this.Vocab[word]
		if !exists {
			wordID = len(this.Vocab)
			this.Vocab[word] = wordID
		}
		doc[idx] = wordID
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: add the new doc into docs
	this.Docs = append(this.Docs, doc)
}

// =================================================================================================
// func (this *CorpusSeq) GetConcurrences
// brief description: get concurrences from corpus
func (this *CorpusSeq) GetConcurrences() map[uint]map[uint]float64 {
	// ---------------------------------------------------------------------------------------------
	// step 1: create concurrences
	concurrences := map[uint]map[uint]float64{}

	// ---------------------------------------------------------------------------------------------
	// step 2: spawn goroutines to count concurrences when words are in the same document
	numCPUs := runtime.NumCPU()
	numDocs := len(this.Docs)
	chI := make(chan int)
	chWorkers := make(chan bool)
	lock := sync.RWMutex{}
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		go func() {
			myConcurrences := map[uint]map[uint]float64{}
			for i0 := range chI {
				i1 := i0 + 100
				if i1 > numDocs {
					i1 = numDocs
				}
				for i := i0; i < i1; i++ {
					wordCount := this.Docs[i]
					for w1, count1 := range wordCount {
						concurrencesOfW1, exists := myConcurrences[uint(w1)]
						if !exists {
							concurrencesOfW1 = map[uint]float64{}
							myConcurrences[uint(w1)] = concurrencesOfW1
						}
						for w2, count2 := range wordCount {
							if w1 != w2 {
								count, exists := concurrencesOfW1[uint(w2)]
								if !exists {
									count = 0.0
								}
								concurrencesOfW1[uint(w2)] = count + float64(count1*count2)
							}
						}
					}
				}
			}

			// now we merge the results into concurrences
			lock.Lock()
			defer lock.Unlock()
			for key1, value1 := range myConcurrences {
				row, exists := concurrences[key1]
				if !exists {
					concurrences[key1] = value1
				} else {
					for key2, value2 := range value1 {
						oldValue, exists := row[key2]
						if !exists {
							oldValue = 0.0
						}
						row[key2] = oldValue + value2
					}
				}
			}

			// tell main thread that this worker finishes
			chWorkers <- true
		}()
	}

	// ---------------------------------------------------------------------------------------------
	// step	3: send tasks through chI
	for i := 0; i < numDocs; i += 100 {
		chI <- i
	}
	close(chI)

	// ---------------------------------------------------------------------------------------------
	// step 4: wait for all workers
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		<-chWorkers
	}

	// ---------------------------------------------------------------------------------------------
	// step 5: return the concurrences
	return concurrences
}

// =================================================================================================
// func (this *CorpusSeq) Word2Phrase
func (this *CorpusSeq) Word2Phrase(numIters, minFreq int, minScore float64) *CorpusSeq {
	if numIters <= 0 {
		return this
	}

	// ---------------------------------------------------------------------------------------------
	// step 1: count unigram/bigram freqs
	freqs1 := map[int]int{}
	freqs2 := map[int]map[int]int{}
	for _, doc := range this.Docs {
		for idx, wordID := range doc {
			oldFreqUnigram, exists := freqs1[wordID]
			if !exists {
				oldFreqUnigram = 0
			}
			freqs1[wordID] = oldFreqUnigram + 1
			if idx > 0 {
				prevWordID := doc[idx-1]
				freqsOfPrev, exists := freqs2[prevWordID]
				if !exists {
					freqsOfPrev = map[int]int{}
					freqs2[prevWordID] = freqsOfPrev
				}
				oldFreqBigram, exists := freqsOfPrev[wordID]
				if !exists {
					oldFreqBigram = 0
				}
				freqsOfPrev[wordID] = oldFreqBigram + 1
			}
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: clone the vacabulary and creat the reverse map
	vocab := map[string]int{}
	words := make([]string, len(this.Vocab))
	for word, wordID := range this.Vocab {
		vocab[word] = wordID
		words[wordID] = word
	}

	// ---------------------------------------------------------------------------------------------
	// step 3: compute score for all those bigrams occur at least minFreq times
	for prevWordID, freqsOfPrev := range freqs2 {
		for wordID, freq := range freqsOfPrev {
			// skip those occur less than or equal to minFreq times
			if freq <= minFreq {
				continue
			}

			// compute score
			score := float64(freq-minFreq) / float64(freqs1[prevWordID]*freqs1[wordID])

			// skip those with small scores
			if score < minScore {
				continue
			}

			// put the detected phrase into vocab
			phraseID := len(vocab)
			phraseText := words[prevWordID] + " " + words[wordID]
			vocab[phraseText] = phraseID
			fmt.Printf("%s: score %f\n", phraseText, score)
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 4: create docs with these unigrams and selected bigrams
	newDocs := make([][]int, len(this.Docs))
	for idxDoc, doc := range this.Docs {
		newDoc := []int{}
		lenDoc := len(doc)
		idxWord := 0
		for idxWord < lenDoc {
			wordID := doc[idxWord]

			useBigram := true
			if idxWord+1 < lenDoc {
				nextWordID := doc[idxWord+1]
				bigramFreq := freqs2[wordID][nextWordID]
				if bigramFreq > minFreq {
					score := float64(bigramFreq-minFreq) /
						float64(freqs1[wordID]*freqs1[nextWordID])
					if score < minScore {
						useBigram = false
					}
				} else {
					useBigram = false
				}
			} else {
				useBigram = false
			}

			if useBigram {
				nextWordID := doc[idxWord+1]
				bigramText := words[wordID] + " " + words[nextWordID]
				phraseID, exists := vocab[bigramText]
				if !exists {
					log.Fatalln("corrupted vocab")
				}
				newDoc = append(newDoc, phraseID)
				idxWord += 2
			} else {
				newDoc = append(newDoc, wordID)
				idxWord++
			}
		}
		newDocs[idxDoc] = newDoc
	}

	// ---------------------------------------------------------------------------------------------
	// step 5: create a new corpus with unigrams and selected bigrams
	newCorpus := &CorpusSeq{Vocab: vocab, Docs: newDocs}

	// ---------------------------------------------------------------------------------------------
	// step 6: return the result
	if numIters == 1 {
		return newCorpus
	} else {
		return newCorpus.Word2Phrase(numIters-1, minFreq, minScore)
	}
}

// =================================================================================================
// func (this *CorpusSeq) Word2PhraseEx
func (this *CorpusSeq) Word2PhraseEx(numIters, minFreq int, minScore float64) *CorpusSeq {
	if numIters <= 0 {
		return this
	}

	// ---------------------------------------------------------------------------------------------
	// step 1: count unigram/bigram freqs
	freqs1 := map[int]int{}
	freqs2 := map[int]map[int]int{}
	freqs3 := map[int]map[int]int{}
	for _, doc := range this.Docs {
		for idx, wordID := range doc {
			oldFreqUnigram, exists := freqs1[wordID]
			if !exists {
				oldFreqUnigram = 0
			}
			freqs1[wordID] = oldFreqUnigram + 1
			if idx > 0 {
				prevWordID := doc[idx-1]
				freqsOfPrev, exists := freqs2[prevWordID]
				if !exists {
					freqsOfPrev = map[int]int{}
					freqs2[prevWordID] = freqsOfPrev
				}
				oldFreqBigram, exists := freqsOfPrev[wordID]
				if !exists {
					oldFreqBigram = 0
				}
				freqsOfPrev[wordID] = oldFreqBigram + 1

				freqsOfCurr, exists := freqs3[wordID]
				if !exists {
					freqsOfCurr = map[int]int{}
					freqs3[wordID] = freqsOfCurr
				}
				oldFreqBigramRev, exists := freqsOfCurr[prevWordID]
				if !exists {
					oldFreqBigramRev = 0
				}
				freqsOfCurr[prevWordID] = oldFreqBigramRev + 1
			}
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: clone the vacabulary and creat the reverse map
	vocab := map[string]int{}
	words := make([]string, len(this.Vocab))
	for word, wordID := range this.Vocab {
		vocab[word] = wordID
		words[wordID] = word
	}

	// ---------------------------------------------------------------------------------------------
	// step 3: compute score for all those bigrams occur at least minFreq times
	for prevWordID, freqsOfPrev := range freqs2 {
		for wordID, freq := range freqsOfPrev {
			// skip those occur less than or equal to minFreq times
			if freq <= minFreq {
				continue
			}

			// compute score
			cond1 := float64(freq) / float64(freqs1[prevWordID])
			cond2 := float64(freq) / float64(freqs1[wordID])
			avg1 := 1.0 / float64(len(freqs2[prevWordID]))
			avg2 := 1.0 / float64(len(freqs3[wordID]))
			score := cond1 * cond2 / (avg1 * avg2)

			// skip those with small scores
			if score < minScore {
				continue
			}

			// put the detected phrase into vocab
			phraseID := len(vocab)
			phraseText := words[prevWordID] + " " + words[wordID]
			vocab[phraseText] = phraseID
			//fmt.Printf("%s: score %f\n", phraseText, score)
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 4: create docs with these unigrams and selected bigrams
	newDocs := make([][]int, len(this.Docs))
	for idxDoc, doc := range this.Docs {
		newDoc := []int{}
		lenDoc := len(doc)
		idxWord := 0
		for idxWord < lenDoc {
			wordID := doc[idxWord]

			useBigram := true
			if idxWord+1 < lenDoc {
				nextWordID := doc[idxWord+1]
				bigramFreq := freqs2[wordID][nextWordID]
				if bigramFreq > minFreq {
					cond1 := float64(bigramFreq) / float64(freqs1[wordID])
					cond2 := float64(bigramFreq) / float64(freqs1[nextWordID])
					avg1 := 1.0 / float64(len(freqs2[wordID]))
					avg2 := 1.0 / float64(len(freqs3[nextWordID]))
					score := cond1 * cond2 / (avg1 * avg2)
					if score < minScore {
						useBigram = false
					}
				} else {
					useBigram = false
				}
			} else {
				useBigram = false
			}

			if useBigram {
				nextWordID := doc[idxWord+1]
				bigramText := words[wordID] + " " + words[nextWordID]
				phraseID, exists := vocab[bigramText]
				if !exists {
					log.Fatalln("corrupted vocab")
				}
				newDoc = append(newDoc, phraseID)
				idxWord += 2
			} else {
				newDoc = append(newDoc, wordID)
				idxWord++
			}
		}
		newDocs[idxDoc] = newDoc
	}

	// ---------------------------------------------------------------------------------------------
	// step 5: create a new corpus with unigrams and selected bigrams
	newCorpus := &CorpusSeq{Vocab: vocab, Docs: newDocs}

	// ---------------------------------------------------------------------------------------------
	// step 6: return the result
	if numIters == 1 {
		return newCorpus
	} else {
		return newCorpus.Word2PhraseEx(numIters-1, minFreq, minScore)
	}
}

// =================================================================================================
// func (this *Corpus) GetConcurrences
// brief description: get concurrences from corpus
func (this *Corpus) GetConcurrences() map[uint]map[uint]float64 {
	// ---------------------------------------------------------------------------------------------
	// step 1: create concurrences
	concurrences := map[uint]map[uint]float64{}

	// ---------------------------------------------------------------------------------------------
	// step 2: spawn goroutines to count concurrences when words are in the same document
	numCPUs := runtime.NumCPU()
	numDocs := len(this.Docs)
	chI := make(chan int)
	chWorkers := make(chan bool)
	lock := sync.RWMutex{}
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		go func() {
			myConcurrences := map[uint]map[uint]float64{}
			for i0 := range chI {
				i1 := i0 + 100
				if i1 > numDocs {
					i1 = numDocs
				}
				for i := i0; i < i1; i++ {
					wordCount := this.Docs[i]
					for w1, count1 := range wordCount {
						concurrencesOfW1, exists := myConcurrences[uint(w1)]
						if !exists {
							concurrencesOfW1 = map[uint]float64{}
							myConcurrences[uint(w1)] = concurrencesOfW1
						}
						for w2, count2 := range wordCount {
							if w1 != w2 {
								count, exists := concurrencesOfW1[uint(w2)]
								if !exists {
									count = 0.0
								}
								concurrencesOfW1[uint(w2)] = count + float64(count1*count2)
							}
						}
					}
				}
			}

			// now we merge the results into concurrences
			lock.Lock()
			defer lock.Unlock()
			for key1, value1 := range myConcurrences {
				row, exists := concurrences[key1]
				if !exists {
					concurrences[key1] = value1
				} else {
					for key2, value2 := range value1 {
						oldValue, exists := row[key2]
						if !exists {
							oldValue = 0.0
						}
						row[key2] = oldValue + value2
					}
				}
			}

			// tell main thread that this worker finishes
			chWorkers <- true
		}()
	}

	// ---------------------------------------------------------------------------------------------
	// step	3: send tasks through chI
	for i := 0; i < numDocs; i += 100 {
		chI <- i
	}
	close(chI)

	// ---------------------------------------------------------------------------------------------
	// step 4: wait for all workers
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		<-chWorkers
	}

	// ---------------------------------------------------------------------------------------------
	// step 5: return the concurrences
	return concurrences
}

// =================================================================================================
// func (this *CorpusX) GetExclusions
// brief description: get exclusions from corpus
func (this *CorpusX) GetExclusions() map[uint]map[uint]bool {
	// ---------------------------------------------------------------------------------------------
	// step 1: create empty exclusions
	exclusions := map[uint]map[uint]bool{}

	// ---------------------------------------------------------------------------------------------
	// step 2: create texts
	numTexts := len(this.Vocab)
	texts := make([]string, numTexts)
	for text, id := range this.Vocab {
		texts[id] = text
	}

	// ---------------------------------------------------------------------------------------------
	// step 3: spawn goroutines to find exclusions when words are in the same group
	numCPUs := runtime.NumCPU()
	numDocs := len(this.Docs)
	chI := make(chan int)
	chExclusions := make(chan map[uint]map[uint]bool)
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		go func() {
			myExclusions := map[uint]map[uint]bool{}
			for i0 := range chI {
				i1 := i0 + 100
				if i1 > numDocs {
					i1 = numDocs
				}
				for i := i0; i < i1; i++ {
					wordGroups := this.Docs[i]
					for _, wordCount := range wordGroups {
						for w1, _ := range wordCount {
							exclusionsOfW1, exists := myExclusions[uint(w1)]
							if !exists {
								exclusionsOfW1 = map[uint]bool{}
								myExclusions[uint(w1)] = exclusionsOfW1
							}
							text1 := texts[w1]
							for w2, _ := range wordCount {
								if w2 != w1 {
									text2 := texts[w2]
									if KeyphraseExtraction.Overlaps(text1, text2) {
										exclusionsOfW1[uint(w2)] = true
									}
								}
							}
						}
					}

				}
			}

			// give main thread my results
			chExclusions <- myExclusions
		}()
	}

	// ---------------------------------------------------------------------------------------------
	// step 4: send tasks through chI
	for i := 0; i < numDocs; i += 100 {
		chI <- i
	}
	close(chI)

	// ---------------------------------------------------------------------------------------------
	// step 5: merge results
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		myExclusions := <-chExclusions
		for w1, deltaExclusionsOfW1 := range myExclusions {
			exclusionsOfW1, exists := exclusions[w1]
			if exists {
				for w2, _ := range deltaExclusionsOfW1 {
					exclusionsOfW1[w2] = true
				}
			} else {
				exclusions[w1] = deltaExclusionsOfW1
			}
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 6: return results
	return exclusions
}

// =================================================================================================
// func (this *CorpusX) GetConcurrences
// brief description: get concurrences from corpus
func (this *CorpusX) GetConcurrences() map[uint]map[uint]float64 {
	// ---------------------------------------------------------------------------------------------
	// step 1: create empty concurrences
	concurrences := map[uint]map[uint]float64{}

	// ---------------------------------------------------------------------------------------------
	// step 2: spawn goroutines to count concurrences when words are in the same document
	numCPUs := runtime.NumCPU()
	numDocs := len(this.Docs)
	chI := make(chan int)
	chWorkers := make(chan bool)
	lock := sync.RWMutex{}
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		go func() {
			myConcurrences := map[uint]map[uint]float64{}
			for i0 := range chI {
				i1 := i0 + 100
				if i1 > numDocs {
					i1 = numDocs
				}
				for i := i0; i < i1; i++ {
					wordGroups := this.Docs[i]
					for idx1, wordCount1 := range wordGroups {
						for w1, count1 := range wordCount1 {
							concurrencesOfW1, exists := myConcurrences[uint(w1)]
							if !exists {
								concurrencesOfW1 = map[uint]float64{}
								myConcurrences[uint(w1)] = concurrencesOfW1
							}
							for idx2, wordCount2 := range wordGroups {
								// skip same word group
								if idx1 == idx2 {
									continue
								}

								for w2, count2 := range wordCount2 {
									if w1 != w2 {
										count, exists := concurrencesOfW1[uint(w2)]
										if !exists {
											count = 0.0
										}
										concurrencesOfW1[uint(w2)] = count + float64(count1*count2)
									}
								}
							}
						}
					}

				}
			}

			// now we merge the results into concurrences
			lock.Lock()
			defer lock.Unlock()
			for key1, value1 := range myConcurrences {
				row, exists := concurrences[key1]
				if !exists {
					concurrences[key1] = value1
				} else {
					for key2, value2 := range value1 {
						oldValue, exists := row[key2]
						if !exists {
							oldValue = 0.0
						}
						row[key2] = oldValue + value2
					}
				}
			}

			// tell main thread that this worker finishes
			chWorkers <- true
		}()
	}

	// ---------------------------------------------------------------------------------------------
	// step	3: send tasks through chI
	for i := 0; i < numDocs; i += 100 {
		chI <- i
	}
	close(chI)

	// ---------------------------------------------------------------------------------------------
	// step 4: wait for all workers
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		<-chWorkers
	}

	// ---------------------------------------------------------------------------------------------
	// step 5: return the concurrences
	return concurrences
}

// =================================================================================================
// func (this *CorpusX) GetConcurrences
// brief description: get concurrences from corpus
func (this *CorpusX) GetDocConcurrences() map[uint]map[uint]float64 {
	// ---------------------------------------------------------------------------------------------
	// step 1: create empty concurrences
	concurrences := map[uint]map[uint]float64{}

	// ---------------------------------------------------------------------------------------------
	// step 2: spawn goroutines to count concurrences when words are in the same document
	numCPUs := runtime.NumCPU()
	numDocs := len(this.Docs)
	chI := make(chan int)
	chWorkers := make(chan bool)
	lock := sync.RWMutex{}
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		go func() {
			myConcurrences := map[uint]map[uint]float64{}
			for i0 := range chI {
				i1 := i0 + 100
				if i1 > numDocs {
					i1 = numDocs
				}
				for i := i0; i < i1; i++ {
					wordGroups := this.Docs[i]
					for idx1, wordCount1 := range wordGroups {
						for w1, _ := range wordCount1 {
							concurrencesOfW1, exists := myConcurrences[uint(w1)]
							visited := map[int]bool{}
							if !exists {
								concurrencesOfW1 = map[uint]float64{}
								myConcurrences[uint(w1)] = concurrencesOfW1
							}
							for idx2, wordCount2 := range wordGroups {
								// skip same word group
								if idx1 == idx2 {
									continue
								}

								for w2, _ := range wordCount2 {
									_, w2Visited := visited[w2]
									if w1 != w2 && !w2Visited {
										count, exists := concurrencesOfW1[uint(w2)]
										if !exists {
											count = 0.0
										}
										concurrencesOfW1[uint(w2)] = count + 1.0
										visited[w2] = true
									}
								}
							}
						}
					}

				}
			}

			// now we merge the results into concurrences
			lock.Lock()
			defer lock.Unlock()
			for key1, value1 := range myConcurrences {
				row, exists := concurrences[key1]
				if !exists {
					concurrences[key1] = value1
				} else {
					for key2, value2 := range value1 {
						oldValue, exists := row[key2]
						if !exists {
							oldValue = 0.0
						}
						row[key2] = oldValue + value2
					}
				}
			}

			// tell main thread that this worker finishes
			chWorkers <- true
		}()
	}

	// ---------------------------------------------------------------------------------------------
	// step	3: send tasks through chI
	for i := 0; i < numDocs; i += 100 {
		chI <- i
	}
	close(chI)

	// ---------------------------------------------------------------------------------------------
	// step 4: wait for all workers
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		<-chWorkers
	}

	// ---------------------------------------------------------------------------------------------
	// step 5: return the concurrences
	return concurrences
}

// =================================================================================================
// func (this *Corpus) translate
func (this *Corpus) translate(subCorpus *Corpus) *Corpus {
	oldSubVocab := subCorpus.Vocab
	translation := map[int]int{}
	for word, oldID := range oldSubVocab {
		newID, exists := this.Vocab[word]
		if !exists {
			log.Fatal("not a sub corpus")
		}
		translation[oldID] = newID
	}

	newDocs := make([]map[int]int, len(subCorpus.Docs))
	for docID, oldDoc := range subCorpus.Docs {
		newDoc := map[int]int{}
		for oldID, count := range oldDoc {
			newDoc[translation[oldID]] = count
		}
		newDocs[docID] = newDoc
	}

	return &Corpus{Vocab: this.Vocab, Docs: newDocs}
}

// =================================================================================================
// func (this *CorpusX) translate
func (this *CorpusX) translate(subCorpus *CorpusX) *CorpusX {
	oldSubVocab := subCorpus.Vocab
	translation := map[int]int{}
	for word, oldID := range oldSubVocab {
		newID, exists := this.Vocab[word]
		if !exists {
			log.Fatal("not a sub corpus")
		}
		translation[oldID] = newID
	}

	newDocs := make([][]map[int]int, len(subCorpus.Docs))
	for docID, oldDoc := range subCorpus.Docs {
		newDoc := make([]map[int]int, len(oldDoc))
		for idxGroup, oldWordGroup := range oldDoc {
			newDoc[idxGroup] = map[int]int{}
			for oldID, count := range oldWordGroup {
				newDoc[idxGroup][translation[oldID]] = count
			}
		}
		newDocs[docID] = newDoc
	}

	return &CorpusX{Vocab: this.Vocab, Docs: newDocs}
}

// =================================================================================================
// interface LDAModel
// brief description: the common interface of LDA models
type TopicModel interface {
	// train model for iter iteration
	Train(numIters int)
	// do inference for new doc with its wordCounts
	Infer(doc map[int]int) []float64
	// compute entropy
	ComputeEntropy() float64
	// compute relative entropy
	ComputeRelativeEntropy() float64
}

// =================================================================================================
// struct DocWord
type DocWord struct {
	DocId  int
	WordId int
}

// =================================================================================================
// struct LDA
// brief description: the data structure of LDA model with Collapsed Gibbs Sampler
// note:
//	The fast collapsed gibbs sampler algorithm can be found in reference:
//	Porteous, I., Newman, D., Ihler, A., Asuncion, A., Smyth, P., & Welling, M. (2008, August). Fast
//	collapsed gibbs sampling for latent dirichlet allocation. In Proceedings of the 14th ACM SIGKDD
//	international conference on Knowledge discovery and data mining (pp. 569-577).
type LDA struct {
	Alpha     float64 // document topic mixture hyperparameter
	Beta      float64 // topic word mixture hyperparameter
	NumTopics int     // number of topics

	Data *Corpus // the input corpus

	WordTopicCount [][]int           // word-topic count table
	DocTopicCount  [][]int           // doc-topic count table
	TopicCountSum  []int             // word-topic-sum count table
	DocWordToTopic map[DocWord][]int // doc-word-topic count table
}

// =================================================================================================
// func NewLDA
// brief description: create an LDA instance with collapsed gibbs sampler
func NewLDA(numTopics int, alpha float64, beta float64, data *Corpus) *LDA {
	// ---------------------------------------------------------------------------------------------
	// step 1: check parameters
	if data == nil {
		log.Fatalln("corpus is nil")
	}

	if numTopics <= 0 {
		log.Fatalln("numTopics cannot <= 0.")
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: create wordTopicCount
	numWords := len(data.Vocab)
	wordTopicCount := make([][]int, numWords)
	for w := 0; w < numWords; w++ {
		topicCountOfW := make([]int, numTopics)
		wordTopicCount[w] = topicCountOfW
	}

	// ---------------------------------------------------------------------------------------------
	// step 3: create docTopicCount
	numDocs := len(data.Docs)
	docTopicCount := make([][]int, numDocs)
	for doc := 0; doc < numDocs; doc++ {
		topicCountOfDoc := make([]int, numTopics)
		docTopicCount[doc] = topicCountOfDoc
	}

	// ---------------------------------------------------------------------------------------------
	// step 4: create topicCountSum
	topicCountSum := make([]int, numTopics)

	// ---------------------------------------------------------------------------------------------
	// step 5: create docWordToTopic
	docWordToTopic := map[DocWord][]int{}
	for doc, wordCounts := range data.Docs {
		for w, count := range wordCounts {
			docWord := DocWord{doc, w}
			toTopic := make([]int, count)
			docWordToTopic[docWord] = toTopic
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 6: assemble the result
	result := &LDA{
		Alpha:     alpha,
		Beta:      beta,
		NumTopics: numTopics,
		Data:      data,

		WordTopicCount: wordTopicCount,
		DocTopicCount:  docTopicCount,
		TopicCountSum:  topicCountSum,
		DocWordToTopic: docWordToTopic,
	}

	// ---------------------------------------------------------------------------------------------
	// step 7: return the result
	return result
}

// =================================================================================================
// func (this *LDA) updateCounters
func (this *LDA) updateCounters() {
	// ---------------------------------------------------------------------------------------------
	// step 1: initialize Counters
	numTopics := this.NumTopics
	for idxTopic := 0; idxTopic < numTopics; idxTopic++ {
		this.TopicCountSum[idxTopic] = 0
	}

	numWords := len(this.Data.Vocab)
	for w := 0; w < numWords; w++ {
		topicCountOfW := this.WordTopicCount[w]
		for idxTopic := 0; idxTopic < numTopics; idxTopic++ {
			topicCountOfW[idxTopic] = 0
		}
	}

	numDocs := len(this.Data.Docs)
	for doc := 0; doc < numDocs; doc++ {
		topicCountOfDoc := this.DocTopicCount[doc]
		for idxTopic := 0; idxTopic < numTopics; idxTopic++ {
			topicCountOfDoc[idxTopic] = 0
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: count them!
	for doc, wordCounts := range this.Data.Docs {
		for w, count := range wordCounts {
			toTopic := this.DocWordToTopic[DocWord{doc, w}]
			for i := 0; i < count; i++ {
				// sample word topic
				k := toTopic[i]

				// update word topic count
				this.WordTopicCount[w][k]++

				// update doc topic count
				this.DocTopicCount[doc][k]++

				// update topic count sum
				this.TopicCountSum[k]++
			}
		}
	}
}

// =================================================================================================
// func (this *LDA) Init
func (this *LDA) Init() {
	// ---------------------------------------------------------------------------------------------
	// step 1: randomly assign topic to word
	for doc, wordCounts := range this.Data.Docs {
		for w, count := range wordCounts {
			toTopic := this.DocWordToTopic[DocWord{doc, w}]
			for i := 0; i < count; i++ {
				// sample word topic
				k := rand.Intn(this.NumTopics)

				// record the topic assignment
				toTopic[i] = k
			}
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: update counters
	this.updateCounters()
}

// =================================================================================================
// func (this *LDA) probTopicOfDocWord
func (this *LDA) probTopicOfDocWord(doc int, w int, k int, idxK int) float64 {
	numWords := float64(len(this.Data.Vocab))
	myDocTopicCount := float64(this.DocTopicCount[doc][idxK])
	myWordTopicCount := float64(this.WordTopicCount[w][idxK])
	myTopicCountSum := float64(this.TopicCountSum[idxK])
	if idxK == k {
		myDocTopicCount--
		myWordTopicCount--
		myTopicCountSum--
	}

	docPart := this.Alpha + myDocTopicCount
	wordPart := (this.Beta + myWordTopicCount) / (this.Beta*numWords + myTopicCountSum)
	return docPart * wordPart
}

// =================================================================================================
// func (this *LDA) ResampleTopics
func (this *LDA) ResampleTopics(numIters int) {
	// ---------------------------------------------------------------------------------------------
	// step 1: resample topics in parallel
	numCPUs := runtime.NumCPU()
	for idxIter := 0; idxIter < numIters; idxIter++ {
		log.Printf("iter %5d, relative entropy %f", idxIter, this.ComputeRelativeEntropy())

		// (1.1) create channels for input and output
		chDocs := make(chan []int)
		chWorkers := make(chan bool)

		// (1.2) create goroutines for resampling
		for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
			go func() {
				prefixSum := make([]float64, this.NumTopics)
				for docs := range chDocs {
					for doc := range docs {
						wordCounts := this.Data.Docs[doc]
						for w, count := range wordCounts {
							docWord := DocWord{doc, w}
							toTopic := this.DocWordToTopic[docWord]
							for i := 0; i < count; i++ {
								k := toTopic[i]

								// (1.2.1) compute resampling probabilities
								for idxK := 0; idxK < this.NumTopics; idxK += 1 {
									prob := this.probTopicOfDocWord(doc, w, k, idxK)
									if idxK == 0 {
										prefixSum[idxK] = prob
									} else {
										prefixSum[idxK] = prefixSum[idxK-1] + prob
									}
								}

								// (1.2.2) use resampling probabilities to resample topic
								u := rand.Float64() * prefixSum[this.NumTopics-1]
								for idxK := 0; idxK < this.NumTopics; idxK++ {
									if u < prefixSum[idxK] {
										k = idxK
										break
									}
								}

								// (1.2.3) record the new topic
								toTopic[i] = k
							}
						}
					}
				}
				chWorkers <- true
			}()
		}

		// (1.3) generate inputs
		numDocs := len(this.Data.Docs)
		for doc := 0; doc < numDocs; doc += 100 {
			lenDocs := 100
			if doc+lenDocs > numDocs {
				lenDocs = numDocs - doc
			}
			docs := make([]int, lenDocs)
			for i := 0; i < lenDocs; i++ {
				docs[i] = doc + i
			}
			chDocs <- docs
		}
		close(chDocs)

		// (1.4) wait for all workers to complete their jobs
		for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
			<-chWorkers
		}

		// (1.5) update the counters
		this.updateCounters()
	}

	// step 2: report entropy
	log.Printf("final relative entropy %f", this.ComputeRelativeEntropy())
}

// =================================================================================================
// func (this *LDA) Train
// brief description: train model
func (this *LDA) Train(numIters int) {
	// randomly init
	this.Init()

	// resample topics
	this.ResampleTopics(numIters)
}

// =================================================================================================
// func (this *LDA) Infer
// brief description: infer topics on new documents
func (this *LDA) Infer(wordCounts map[int]int) []float64 {
	// ---------------------------------------------------------------------------------------------
	// step 1: compute unscaled probabilities
	probs := make([]float64, this.NumTopics)
	sumProbs := float64(0.0)
	numWords := len(this.Data.Vocab)
	for idxK := 0; idxK < this.NumTopics; idxK++ {
		myProb := float64(0.0)
		for w, count := range wordCounts {
			myWordTopicCount := float64(this.WordTopicCount[w][idxK])
			myTopicCountSum := float64(this.TopicCountSum[idxK])
			myProb += float64(count) * (this.Beta + myWordTopicCount) /
				(this.Beta*float64(numWords) + myTopicCountSum)
		}
		sumProbs += myProb
		probs[idxK] = myProb
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: scale the probs to make their sum == 1.0
	if sumProbs == 0.0 {
		sumProbs = 1.0
	}
	for idxK := 0; idxK < this.NumTopics; idxK++ {
		probs[idxK] /= sumProbs
	}

	// ---------------------------------------------------------------------------------------------
	// step 3: return the result
	return probs
}

// =================================================================================================
// func (this *LDA) ComputeEntropy
// brief description: compute entropy
func (this *LDA) ComputeEntropy() float64 {
	entropy := 0.0
	sumCount := 0
	for doc, _ := range this.Data.Docs {
		myTopicCount := this.DocTopicCount[doc]
		mySumCount := 0
		for k := 0; k < this.NumTopics; k++ {
			mySumCount += myTopicCount[k]
		}
		myEntropy := 0.0
		for k := 0; k < this.NumTopics; k++ {
			if myTopicCount[k] > 0 {
				myP := float64(myTopicCount[k]) / float64(mySumCount)
				myEntropy -= myP * math.Log(myP)
			}
		}
		sumCount += mySumCount
		entropy += float64(mySumCount) * myEntropy
	}
	entropy /= float64(sumCount)
	return entropy
}

// =================================================================================================
// func (this *LDA) ComputeEntropy
// brief description: compute entropy
func (this *LDA) ComputeRelativeEntropy() float64 {
	entropy := this.ComputeEntropy()
	maxEntropy := -math.Log(1.0 / float64(this.NumTopics))
	return entropy / maxEntropy
}

// =================================================================================================
// function convertUnicodeHex
// brief description:
// 	Convert a hex representation string of unicode to the unicode.
// input:
//	s: The hex representation.
// output:
//	The unicode.
func convertUnicodeHex(s []byte) []byte {
	ss := s[3 : len(s)-1]
	u, err := strconv.ParseInt(string(ss), 16, 64)
	if err != nil {
		return []byte("<?>")
	} else {
		return []byte(fmt.Sprintf("%c", u))
	}
}

// =================================================================================================
// function convertUnicodeDec
// brief description:
// 	Convert a dec representation string of unicode to the unicode.
// input:
//	s: The dec representation.
// output:
//	The unicode.
func convertUnicodeDec(s []byte) []byte {
	ss := s[2 : len(s)-1]
	u, err := strconv.ParseInt(string(ss), 10, 64)
	if err != nil {
		return []byte("<?>")
	} else {
		return []byte(fmt.Sprintf("%c", u))
	}
}

// =================================================================================================
// function TidyTitle
// brief description: Make a title more tidy before using it. The procedure include the
//                    steps:
//                    (1) remove the spaces at the head and the tail of the title,
//                    (2) replace "&lt;" with "<",
//                    (3) replace "&gt;" with ">",
//                    (4) replace "&amp;" with "&",
//                    (5) replace "&quot;" with "\"",
//                    (6) replace "&apos;" with "'",
//                    (7) replace "&//[number];" with a corresponding unicode
// input:
//   title: The String (text) of a title.
// output:
//   A string of the tidied-up title.
func TidyTitle(title string) string {
	// --------------------------------------------------------------------------------------
	// step 1: remove the spaces at the head and the tail
	result := strings.TrimSpace(title)

	// --------------------------------------------------------------------------------------
	// step 2: replace "&lt;" with "<"
	result = strings.Replace(result, "&lt;", "<", -1)

	// --------------------------------------------------------------------------------------
	// step 3: replace "&gt;" with ">"
	result = strings.Replace(result, "&gt;", ">", -1)

	// --------------------------------------------------------------------------------------
	// step 4: replace "&amp;" with "&"
	result = strings.Replace(result, "&amp;", "&", -1)

	// --------------------------------------------------------------------------------------
	// step 5: replace "&quot;" with "\""
	result = strings.Replace(result, "&quot;", "\"", -1)

	// --------------------------------------------------------------------------------------
	// step 6: replace "&apos;" with "'"
	result = strings.Replace(result, "&apos;", "'", -1)

	// --------------------------------------------------------------------------------------
	// step 7: replace "&//[number];" with a corresponding unicode
	byteResult := []byte(result)
	byteResult = reUnicodeHex.ReplaceAllFunc(byteResult, convertUnicodeHex)
	byteResult = reUnicodeDec.ReplaceAllFunc(byteResult, convertUnicodeDec)
	result = string(byteResult)

	// --------------------------------------------------------------------------------------
	// step 8: return the result
	return result
}

// =================================================================================================
// function LoadCitationGraph
// brief description:
//	Load data from three files (nodes, edges, labels) of a citation graph.
// input:
//   path: The name of the path where the three files are stored.
//   prefix: The prefix of the names of the three files. For example, the prefix of
//           ijcai-citation-graph-nodes.csv is ijcai.
// output:
//    The citation graph represented by the three files.
func LoadCitationGraph(path string, prefix string) *CitationGraph {
	// --------------------------------------------------------------------------------------
	// step 1: Prepare the data structure of the result.
	result := new(CitationGraph)
	result.Nodes = make(map[int64]*CitationNode)
	result.idxMainNodes = map[int64]int{}

	// --------------------------------------------------------------------------------------
	// step 2: Assemble the file names for nodes, edges, and labels.
	fileNameOfNodes := path + "/" + prefix + "-citation-graph-nodes.csv"
	fileNameOfEdges := path + "/" + prefix + "-citation-graph-edges.csv"
	fileNameOfLabels := path + "/" + prefix + "-citation-graph-labels.csv"

	// --------------------------------------------------------------------------------------
	// step 3: Load the nodes.
	// (3.1) open the file of Nodes
	fileOfNodes, err := os.Open(fileNameOfNodes)
	if err != nil {
		log.Fatal(err)
	}
	defer fileOfNodes.Close()
	scannerOfNodes := bufio.NewScanner(fileOfNodes)

	// (3.2) Examine the first line to check whether the file format is correct.
	if !scannerOfNodes.Scan() {
		log.Fatal("Cannot read " + fileNameOfNodes)
	}
	firstLine := scannerOfNodes.Text()
	columnNames := strings.Split(firstLine, ",")
	if len(columnNames) != 4 {
		log.Fatal("Incorrect file format of " + fileNameOfNodes)
	}
	if strings.TrimSpace(columnNames[0]) != "#id" ||
		strings.TrimSpace(columnNames[1]) != "in-"+prefix ||
		strings.TrimSpace(columnNames[2]) != "year" ||
		strings.TrimSpace(columnNames[3]) != "title" {
		log.Fatal("Incorrect file format of " + fileNameOfNodes)
	}

	// (3.3) Read the rest of lines.
	for scannerOfNodes.Scan() {
		line := scannerOfNodes.Text()
		columns := strings.Split(line, ",")
		for i := 1; i < 4; i++ {
			columns[i] = strings.TrimSpace(columns[i])
		}
		id, _ := strconv.ParseInt(columns[0], 10, 64)
		isMainNode, _ := strconv.ParseBool(columns[1])
		year, _ := strconv.ParseInt(columns[2], 10, 64)
		title := strings.ReplaceAll(columns[3], "[comma]", ",")
		node := new(CitationNode)
		node.ID = id
		node.Year = year
		node.Title = title
		result.Nodes[id] = node
		if isMainNode {
			result.idxMainNodes[id] = len(result.ToBeAnalyzed)
			result.ToBeAnalyzed = append(result.ToBeAnalyzed, id)
		}
	}

	// --------------------------------------------------------------------------------------
	// step 4: Load the edges.
	// (4.1) Open the file of edges
	fileOfEdges, err := os.Open(fileNameOfEdges)
	if err != nil {
		log.Fatal(err)
	}
	defer fileOfEdges.Close()
	scannerOfEdges := bufio.NewScanner(fileOfEdges)

	// (4.2) Examine the first line to check whether the file format is correct.
	if !scannerOfEdges.Scan() {
		log.Fatal("Cannot read " + fileNameOfEdges)
	}
	firstLine = scannerOfEdges.Text()
	columnNames = strings.Split(firstLine, ",")
	if len(columnNames) != 2 {
		log.Fatal("Incorrect file format of " + fileNameOfEdges)
	}
	if strings.TrimSpace(columnNames[0]) != "#id" ||
		strings.TrimSpace(columnNames[1]) != "ref-id" {
		log.Fatal("Incorrect file format of " + fileNameOfEdges)
	}

	// (4.3) read the rest of lines
	for scannerOfEdges.Scan() {
		line := scannerOfEdges.Text()
		columns := strings.Split(line, ",")
		for i := 1; i < 2; i++ {
			columns[i] = strings.TrimSpace(columns[i])
		}
		id, _ := strconv.ParseInt(columns[0], 10, 64)
		refID, _ := strconv.ParseInt(columns[1], 10, 64)
		node, _ := result.Nodes[id]
		refNode, _ := result.Nodes[refID]
		node.Refs = append(node.Refs, refID)
		refNode.Cites = append(refNode.Cites, id)
	}

	// --------------------------------------------------------------------------------------
	// step 5: Load the labels.
	// (5.1) Open the file of labels
	fileOfLabels, err := os.Open(fileNameOfLabels)
	if err != nil {
		log.Fatal(err)
	}
	defer fileOfLabels.Close()
	scannerOfLabels := bufio.NewScanner(fileOfLabels)

	// (5.2) Examine the first line to check whether the file format is correct.
	if !scannerOfLabels.Scan() {
		log.Fatal("Cannot read " + fileNameOfLabels)
	}
	firstLine = scannerOfLabels.Text()
	columnNames = strings.Split(firstLine, ",")
	if len(columnNames) != 2 {
		log.Fatal("Incorrect file format of " + fileNameOfLabels)
	}
	if strings.TrimSpace(columnNames[0]) != "#id" ||
		strings.TrimSpace(columnNames[1]) != "label" {
		log.Fatal("Incorrect file format of " + fileNameOfEdges)
	}

	// (5.3) Read the rest of lines.
	for scannerOfLabels.Scan() {
		line := scannerOfLabels.Text()
		columns := strings.Split(line, ",")
		for i := 1; i < 2; i++ {
			columns[i] = strings.TrimSpace(columns[i])
		}
		id, _ := strconv.ParseInt(columns[0], 10, 64)
		label := strings.TrimSpace(columns[1])
		node, _ := result.Nodes[id]
		node.Labels = append(node.Labels, label)
	}

	// --------------------------------------------------------------------------------------
	// step 6: Return the result
	return result
}

// =================================================================================================
// struct GSDMM
// brief description: the data structure of GSDMM model
type GSDMM struct {
	Alpha     float64 // document topic mixture hyperparameter
	Beta      float64 // topic word mixture hyperparameter
	NumTopics int     // number of topics

	Data          *Corpus // the input corpus
	NumWordsInDoc []int   // the number of words in documents

	DocTopic          []int   // doc-topic count table
	TopicWordCount    [][]int // word-topic count table
	TopicWordCountSum []int   // word-topic-sum count table
	TopicDocCount     []int   // topic-doc-sum count table
}

// =================================================================================================
// func NewGSDMM
// brief description: create an LDA instance with collapsed gibbs sampler
func NewGSDMM(numTopics int, alpha float64, beta float64, data *Corpus) *GSDMM {
	// ---------------------------------------------------------------------------------------------
	// step 1: check parameters
	if data == nil {
		log.Fatalln("corpus is nil")
	}

	if numTopics <= 0 {
		log.Fatalln("numTopics cannot <= 0.")
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: create docTopic
	numDocs := len(data.Docs)
	docTopic := make([]int, numDocs)

	// ---------------------------------------------------------------------------------------------
	// step 3: create topicWordCount
	numWords := len(data.Vocab)
	topicWordCount := make([][]int, numTopics)
	for k := 0; k < numTopics; k++ {
		topicWordCount[k] = make([]int, numWords)
	}

	// ---------------------------------------------------------------------------------------------
	// step 4: create topicWordCountSum
	topicWordCountSum := make([]int, numTopics)

	// ---------------------------------------------------------------------------------------------
	// step 5: create topicDocCount
	topicDocCount := make([]int, numTopics)

	// ---------------------------------------------------------------------------------------------
	// step 6: create NumWordsInDoc
	numWordsInDoc := make([]int, numDocs)
	for doc, wordCounts := range data.Docs {
		numWordsInDoc[doc] = 0
		for _, count := range wordCounts {
			numWordsInDoc[doc] += count
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 7: assemble the result
	result := &GSDMM{
		Alpha:         alpha,
		Beta:          beta,
		NumTopics:     numTopics,
		Data:          data,
		NumWordsInDoc: numWordsInDoc,

		DocTopic:          docTopic,
		TopicWordCount:    topicWordCount,
		TopicWordCountSum: topicWordCountSum,
		TopicDocCount:     topicDocCount,
	}

	// ---------------------------------------------------------------------------------------------
	// step 8: return the result
	return result
}

// =================================================================================================
// func (this *GSDMM) updateCounters
func (this *GSDMM) updateCounters() {
	// ---------------------------------------------------------------------------------------------
	// step 1: initialize Counters
	numTopics := this.NumTopics
	numWords := len(this.Data.Vocab)
	for idxTopic := 0; idxTopic < numTopics; idxTopic++ {
		this.TopicWordCountSum[idxTopic] = 0
		this.TopicDocCount[idxTopic] = 0
		wordCountOfTopic := this.TopicWordCount[idxTopic]
		for w := 0; w < numWords; w++ {
			wordCountOfTopic[w] = 0
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: count them!
	for doc, wordCounts := range this.Data.Docs {
		k := this.DocTopic[doc]
		this.TopicDocCount[k]++
		for w, count := range wordCounts {
			this.TopicWordCount[k][w] += count
			this.TopicWordCountSum[k] += count
		}
	}
}

// =================================================================================================
// func (this *GSDMM) Init
func (this *GSDMM) Init() {
	// ---------------------------------------------------------------------------------------------
	// step 1: randomly assign topic to doc
	for doc, _ := range this.Data.Docs {
		// sample document topic
		this.DocTopic[doc] = rand.Intn(this.NumTopics)
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: update counters
	this.updateCounters()
}

// =================================================================================================
// func (this *GSDMM) probTopicOfDoc
func (this *GSDMM) probTopicOfDoc(doc int, k int, idxK int) float64 {
	numTopics := float64(this.NumTopics)
	numDocs := float64(len(this.Data.Docs))
	docCountOfTopic := float64(this.TopicDocCount[idxK])
	if idxK == k {
		docCountOfTopic--
	}
	docPart := (docCountOfTopic + this.Alpha) / (numDocs - 1.0 + this.Alpha*numTopics)

	topicWordCountSum := float64(this.TopicWordCountSum[idxK])
	wordCounts := this.Data.Docs[doc]
	if idxK == k {
		topicWordCountSum -= float64(this.NumWordsInDoc[doc])
	}

	numWords := float64(len(this.Data.Vocab))
	idxWordInDoc := 0
	wordPart := 1.0
	for w, count := range wordCounts {
		wordCountOfTopic := float64(this.TopicWordCount[idxK][w])
		if idxK == k {
			wordCountOfTopic -= float64(count)
		}
		for j := 0; j < count; j++ {
			wordPart *= (wordCountOfTopic + this.Beta + float64(j)) / (topicWordCountSum +
				this.Beta*numWords + float64(idxWordInDoc))
			idxWordInDoc++
		}
	}

	//fmt.Printf("docPart = %f, wordPart = %v, prod = %v\n", docPart, wordPart, docPart*wordPart)
	return docPart * wordPart
}

// =================================================================================================
// func (this *GSDMM) ResampleTopics
func (this *GSDMM) ResampleTopics(numIters int) {
	// ---------------------------------------------------------------------------------------------
	// step 1: resample topics in parallel
	numCPUs := runtime.NumCPU()
	for idxIter := 0; idxIter < numIters; idxIter++ {
		log.Printf("iter %5d, relative entropy %f", idxIter, this.ComputeRelativeEntropy())

		// (1.1) create channels for input and output
		chDocs := make(chan []int)
		chWorkers := make(chan bool)

		// (1.2) create goroutines for resampling
		for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
			go func() {
				prefixSum := make([]float64, this.NumTopics)
				for docs := range chDocs {
					//fmt.Println("-------------------------------------------")
					for _, doc := range docs {
						// (1.2.1) compute resampling probabilities
						k := this.DocTopic[doc]
						for idxK := 0; idxK < this.NumTopics; idxK++ {
							prob := this.probTopicOfDoc(doc, k, idxK)
							if idxK == 0 {
								prefixSum[idxK] = prob
							} else {
								prefixSum[idxK] = prefixSum[idxK-1] + prob
							}
						}
						// if doc < 10 {
						// 	fmt.Printf("%d: %v\n", doc, prefixSum)
						// }

						// (1.2.2) use resampling probabilities to resample topic
						u := rand.Float64() * prefixSum[this.NumTopics-1]
						for idxK := 0; idxK < this.NumTopics; idxK++ {
							if u < prefixSum[idxK] {
								k = idxK
								break
							}
						}

						// (1.2.3) record the new topic
						this.DocTopic[doc] = k
					}
				}
				chWorkers <- true
			}()
		}

		// (1.3) generate inputs
		numDocs := len(this.Data.Docs)
		for doc := 0; doc < numDocs; doc += 100 {
			lenDocs := 100
			if doc+lenDocs > numDocs {
				lenDocs = numDocs - doc
			}
			docs := make([]int, lenDocs)
			for i := 0; i < lenDocs; i++ {
				docs[i] = doc + i
			}
			chDocs <- docs
		}
		close(chDocs)

		// (1.4) wait for all workers to complete their jobs
		for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
			<-chWorkers
		}

		// (1.5) update the counters
		this.updateCounters()
	}

	// step 2: report entropy
	log.Printf("final relative entropy %f", this.ComputeRelativeEntropy())
}

// =================================================================================================
// func (this *GSDMM) Train
// brief description: train model
func (this *GSDMM) Train(numIters int) {
	// randomly init
	this.Init()

	// resample topics
	this.ResampleTopics(numIters)
}

// =================================================================================================
// func (this *GSDMM) Infer
// brief description: infer topics on new documents
func (this *GSDMM) Infer(wordCounts map[int]int) []float64 {
	// ---------------------------------------------------------------------------------------------
	// step 1: compute unscaled probabilities
	probs := make([]float64, this.NumTopics)
	sumProbs := float64(0.0)
	numTopics := float64(len(this.Data.Docs))
	numWords := float64(len(this.Data.Vocab))
	for idxK := 0; idxK < this.NumTopics; idxK++ {
		// -----------------------------------------------------------------------------------------
		// (1.1) compute doc part of probs
		docCountOfTopic := float64(this.TopicDocCount[idxK])
		docPart := (docCountOfTopic + this.Alpha) / (docCountOfTopic - 1.0 + this.Alpha*numTopics)

		// -----------------------------------------------------------------------------------------
		// (1.2) compute word part of probs
		topicWordCountSum := float64(this.TopicWordCountSum[idxK])
		idxWordInDoc := 0
		wordPart := 1.0
		for w, count := range wordCounts {
			wordCountOfTopic := float64(this.TopicWordCount[idxK][w])
			for j := 0; j < count; j++ {
				wordPart *= (wordCountOfTopic + this.Beta + float64(j)) / (topicWordCountSum +
					this.Beta*numWords + float64(idxWordInDoc))
				idxWordInDoc++
			}
		}

		// -----------------------------------------------------------------------------------------
		// (1.3) assemble myProb
		myProb := docPart * wordPart
		sumProbs += myProb
		probs[idxK] = myProb
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: scale the probs to make their sum == 1.0
	if sumProbs == 0.0 {
		sumProbs = 1.0
	}
	for idxK := 0; idxK < this.NumTopics; idxK++ {
		probs[idxK] /= sumProbs
	}

	// ---------------------------------------------------------------------------------------------
	// step 3: return the result
	return probs
}

// =================================================================================================
// func (this *GSDMM) ComputeEntropy
// brief description: compute entropy
func (this *GSDMM) ComputeEntropy() float64 {
	entropy := 0.0
	numDocs := float64(len(this.Data.Docs))
	for doc, _ := range this.Data.Docs {
		k := this.DocTopic[doc]

		myProbs := make([]float64, this.NumTopics)
		mySumProbs := 0.0
		for idxK := 0; idxK < this.NumTopics; idxK++ {
			myProbs[idxK] = this.probTopicOfDoc(doc, k, idxK)
			mySumProbs += myProbs[idxK]
		}

		myEntropy := 0.0
		if mySumProbs > 0.0 {
			for idxK := 0; idxK < this.NumTopics; idxK++ {
				myP := myProbs[idxK] / mySumProbs
				if myP == 0.0 {
					continue
				}
				//fmt.Printf("%d: myP = %f\n", idxK, myP)
				myEntropy -= myP * math.Log(myP)
			}
		}

		entropy += myEntropy / numDocs
	}
	return entropy
}

// =================================================================================================
// func (this *GSDMM) ComputeEntropy
// brief description: compute entropy
func (this *GSDMM) ComputeRelativeEntropy() float64 {
	entropy := this.ComputeEntropy()
	maxEntropy := -math.Log(1.0 / float64(this.NumTopics))
	return entropy / maxEntropy
}

// =================================================================================================
// struct WPDM
// brief description: the data structure of WPDM model
type WPDM struct {
	eps    float64 // radius of neighborhood
	minPts uint    // minimum points for a dense neighborhood in DBScan, set this to 0 for AHC

	Data      *Corpus // the input corpus
	DocTopic  []int   // doc-topic table
	NumTopics int     // number of topics
}

// ===========================================================================================
// function SaveCitationGraph
// brief description: Save data to three files (nodes, edges, labels) of a citation graph.
// input:
//   path: The name of the path where the three files are stored.
//   prefix: The prefix of the names of the three files. For example, the prefix of
//           ijcai-citation-graph-nodes.csv is ijcai.
//   citationGraph: The citation graph represented by the three files.
// output:
//   nothing
func SaveCitationGraph(path string, prefix string, citationGraph *CitationGraph) {
	// --------------------------------------------------------------------------------------
	// step 1: Assemble the file names for nodes, edges, and labels.
	fileNameOfNodes := path + "/" + prefix + "-citation-graph-nodes.csv"
	fileNameOfEdges := path + "/" + prefix + "-citation-graph-edges.csv"
	fileNameOfLabels := path + "/" + prefix + "-citation-graph-labels.csv"

	// --------------------------------------------------------------------------------------
	// step 2: Save the nodes.
	// (2.1) Open the file of Nodes for writing
	fileOfNodes, err := os.Create(fileNameOfNodes)
	if err != nil {
		log.Fatal(err)
	}
	defer fileOfNodes.Close()

	// (2.2) print the first line with the file format info
	_, err = fileOfNodes.WriteString("#id, in-" + prefix + ", year, title\n")
	if err != nil {
		log.Fatal(err)
	}

	// (2.3) save data into the rest of lines
	idSetForAnalysis := make(map[int64]int)
	for i, v := range citationGraph.ToBeAnalyzed {
		idSetForAnalysis[v] = i
	}
	for id, node := range citationGraph.Nodes {
		_, toBeAnalyzed := idSetForAnalysis[id]
		year := node.Year
		title := TidyTitle(strings.ReplaceAll(node.Title, ",", "[comma]"))
		_, err = fileOfNodes.WriteString(fmt.Sprintf("%d, %t, %d, %s\n", id, toBeAnalyzed, year, title))
		if err != nil {
			log.Fatal(err)
		}
	}

	// --------------------------------------------------------------------------------------
	// step 3: Save the edges.
	// (3.1) Open the file of Edges for writing
	fileOfEdges, err := os.Create(fileNameOfEdges)
	if err != nil {
		log.Fatal(err)
	}
	defer fileOfEdges.Close()

	// (3.2) print the first line with the file format info
	_, err = fileOfEdges.WriteString("#id, ref-id\n")
	if err != nil {
		log.Fatal(err)
	}

	// (3.3) save data into the rest of lines
	edgeSet := make(map[int64]map[int64]bool)
	for id, node := range citationGraph.Nodes {
		setOfID, exists := edgeSet[id]
		if !exists {
			setOfID = make(map[int64]bool)
			edgeSet[id] = setOfID
		}
		for _, refID := range node.Refs {
			setOfID[refID] = true
		}
		for _, citeID := range node.Cites {
			setOfCiteID, exists := edgeSet[citeID]
			if !exists {
				setOfCiteID = make(map[int64]bool)
				edgeSet[citeID] = setOfCiteID
			}
			setOfCiteID[id] = true
		}
	}
	for id, setOfID := range edgeSet {
		for refID, _ := range setOfID {
			fileOfEdges.WriteString(fmt.Sprintf("%d, %d\n", id, refID))
		}
	}

	// --------------------------------------------------------------------------------------
	// step 4: Save the labels.
	// (4.1) Open the file of labels for writing
	fileOfLabels, err := os.Create(fileNameOfLabels)
	if err != nil {
		log.Fatal(err)
	}
	defer fileOfLabels.Close()

	// (4.2) print the first line with the file format info
	_, err = fileOfLabels.WriteString("#id, label\n")
	if err != nil {
		log.Fatal(err)
	}

	// (4.3) save data into the rest of lines
	for id, node := range citationGraph.Nodes {
		for _, label := range node.Labels {
			fileOfLabels.WriteString(fmt.Sprintf("%d, %s\n", id, label))
		}
	}
}

// =================================================================================================
// method TFIDF
// brief description: compute the TFIDF of possible key phrases for each main node of a citationGraph
// input:
// 	nothing
// output:
//	the result of TFIDF grouped by the main nodes of the CitationGraph
func (g *CitationGraph) TFIDF() []map[string]float64 {
	candidateTFGroups := []map[string]uint{}
	phraseCandidateGroups := [][]string{}

	for _, id := range g.ToBeAnalyzed {
		// get phrase candidates
		node := g.Nodes[id]
		phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(node.Title)

		// collect a list of auxiliary phrases
		auxPhrases := []string{}
		for _, refID := range node.Refs {
			refPhrases := KeyphraseExtraction.ExtractKeyPhraseCandidates(g.Nodes[refID].Title)
			for _, phrase := range refPhrases {
				auxPhrases = append(auxPhrases, phrase)
			}
		}

		// compute TF
		candidateTFGroups = append(candidateTFGroups, KeyphraseExtraction.TF(phraseCandidates, auxPhrases))

		// append this group of phraseCandidates to the candidate groups
		phraseCandidateGroups = append(phraseCandidateGroups, phraseCandidates)
	}

	// compute IDF
	IDFs := KeyphraseExtraction.IDF(phraseCandidateGroups)

	// TFIDF = TF * IDF
	result := []map[string]float64{}
	for _, candidateTFs := range candidateTFGroups {
		groupResult := map[string]float64{}
		for text, tf := range candidateTFs {
			idf, exists := IDFs[text]
			if !exists {
				log.Fatal(fmt.Sprintf("phrase %s not found in IDFs", text))
			}
			groupResult[text] = float64(tf) * idf
		}
		result = append(result, groupResult)
	}

	// return the result
	return result
}

// =================================================================================================
// method SimTFIDF
// brief description: compute the Fuzzy TFIDF of possible key phrases for each main node of a citationGraph
// input:
// 	phraseSimilarity: similarities of phrases
// output:
//	the result of Fuzzy TFIDF grouped by the main nodes of the CitationGraph
func (g *CitationGraph) SimTFIDF(phraseSimilarity map[string]map[string]float64) []map[string]float64 {
	numMainNodes := len(g.ToBeAnalyzed)
	candidateTFGroups := make([]map[string]float64, numMainNodes)
	phraseCandidateGroups := make([][]string, numMainNodes)

	numCPUs := runtime.NumCPU()
	chI := make(chan int)
	chWorkers := make(chan bool)
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		go func() {
			for idx0 := range chI {
				idx1 := idx0 + 100
				if idx1 > numMainNodes {
					idx1 = numMainNodes
				}
				for idxID := idx0; idxID < idx1; idxID++ {
					// get phrase candidates
					id := g.ToBeAnalyzed[idxID]
					node := g.Nodes[id]
					phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(node.Title)

					// collect a list of auxiliary phrases
					auxPhrases := []string{}
					for _, refID := range node.Refs {
						refPhrases := KeyphraseExtraction.ExtractKeyPhraseCandidates(g.Nodes[refID].Title)
						for _, phrase := range refPhrases {
							auxPhrases = append(auxPhrases, phrase)
						}
					}

					// compute Sim TF
					candidateTFGroups[idxID] = KeyphraseExtraction.SimTF(phraseCandidates, auxPhrases, phraseSimilarity)

					// insert this group of phraseCandidates to the candidate groups
					phraseCandidateGroups[idxID] = phraseCandidates
				}
			}
			chWorkers <- true
		}()
	}
	for i := 0; i < numMainNodes; i += 100 {
		chI <- i
	}
	close(chI)
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		<-chWorkers
	}
	fmt.Println("All sim TF computed")

	// compute IDF
	IDFs := KeyphraseExtraction.IDF(phraseCandidateGroups)
	fmt.Println(("sim IDF computed"))

	// TFIDF = TF * IDF
	result := make([]map[string]float64, numMainNodes)
	for idx, candidateTFs := range candidateTFGroups {
		groupResult := map[string]float64{}
		for text, tf := range candidateTFs {
			idf, exists := IDFs[text]
			if !exists {
				log.Fatal(fmt.Sprintf("phrase %s not found in IDFs", text))
			}
			groupResult[text] = float64(tf) * idf
		}
		sortedGroupResult := KeyphraseExtraction.ArgSort(groupResult)
		selectedGroupResult := map[string]float64{}
		for i := 0; i < len(sortedGroupResult); i++ {
			phraseI := sortedGroupResult[i]
			selected := true
			for phraseJ, _ := range selectedGroupResult {
				if KeyphraseExtraction.Includes(phraseJ, phraseI) ||
					KeyphraseExtraction.Includes(phraseI, phraseJ) {
					selected = false
					break
				}
			}
			if selected {
				selectedGroupResult[phraseI] = groupResult[phraseI]
			}
		}
		result[idx] = selectedGroupResult
	}

	// return the result
	return result
}

// =================================================================================================
// method SimTFSimIDF
// brief description: compute the Fuzzy TFIDF of possible key phrases for each main node of a citationGraph
// input:
// 	nothing
// output:
//	the result of Fuzzy TFIDF grouped by the main nodes of the CitationGraph
func (g *CitationGraph) SimTFSimIDF(phraseSimilarity map[string]map[string]float64) []map[string]float64 {
	candidateTFGroups := []map[string]float64{}
	phraseCandidateGroups := [][]string{}

	for idxID, id := range g.ToBeAnalyzed {
		// get phrase candidates
		node := g.Nodes[id]
		phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(node.Title)

		// collect a list of auxiliary phrases
		auxPhrases := []string{}
		for _, refID := range node.Refs {
			refPhrases := KeyphraseExtraction.ExtractKeyPhraseCandidates(g.Nodes[refID].Title)
			for _, phrase := range refPhrases {
				auxPhrases = append(auxPhrases, phrase)
			}
		}

		// compute Fuzzy TF
		candidateTFGroups = append(candidateTFGroups, KeyphraseExtraction.SimTF(phraseCandidates, auxPhrases, phraseSimilarity))

		// append this group of phraseCandidates to the candidate groups
		phraseCandidateGroups = append(phraseCandidateGroups, phraseCandidates)

		if (idxID+1)%1000 == 0 {
			fmt.Printf("%d of %d sim TF computed\n", idxID+1, len(g.ToBeAnalyzed))
		}
	}
	fmt.Println("All sim TF computed")

	// compute IDF
	IDFs := KeyphraseExtraction.SimIDF(phraseCandidateGroups, phraseSimilarity)
	fmt.Println(("All sim IDF computed"))

	// TFIDF = TF * IDF
	result := []map[string]float64{}
	for _, candidateTFs := range candidateTFGroups {
		groupResult := map[string]float64{}
		for text, tf := range candidateTFs {
			idf, exists := IDFs[text]
			if !exists {
				log.Fatal(fmt.Sprintf("phrase %s not found in IDFs", text))
			}
			groupResult[text] = float64(tf) * idf
		}
		result = append(result, groupResult)
	}

	// return the result
	return result
}

// =================================================================================================
// method GetPhraseSimilarity
// brief description: compute the similarities between pairs of phrases and gives them in form of a
//	sparse matrix using Title Link method
// input:
// 	nothing
// output:
//	the sparse matrix of phrase similarities
// note:
//	The title link method can be found in:
//	Bogomolova, A., Ryazanova, M., & Balk, I. (2021). Cluster approach to analysis of publication
//	titles. In Journal of Physics: Conference Series (Vol. 1727, No. 1, p. 012016). IOP Publishing.
func (g *CitationGraph) GetPhraseSimilarity(simType int) map[string]map[string]float64 {
	// --------------------------------------------------------------------------------------------
	// step 1: create corpus using all nodes
	corpus := g.CreateCorpus(2)

	// --------------------------------------------------------------------------------------------
	// step 2: get concurrences from corpuse
	concurrences := corpus.GetConcurrences()

	// --------------------------------------------------------------------------------------------
	// step 3: create concurrence model from concurrences
	cm := ConcurrenceBasedClustering.NewConcurrenceModel()
	cm.SetConcurrences(uint(len(corpus.Vocab)), concurrences)

	// --------------------------------------------------------------------------------------------
	// step 4: get similarity
	simMat := map[uint]map[uint]float64{}
	switch simType {
	case 0:
		simMat = cm.InduceSimilarities()
	case 1:
		simMat = cm.InduceNormalizedSimilarities()
	case 2:
		simMat = cm.InduceJaccardSimilarities()
	case 3:
		simMat = cm.InduceWeightedJaccardSimilarities()
	case 4:
		simMat = cm.InduceNormalizedJaccardSimilarities()
	}

	// --------------------------------------------------------------------------------------------
	// step 6: conver simMat to the result then return it
	result := map[string]map[string]float64{}
	words := make([]string, len(corpus.Vocab))
	for word, idx := range corpus.Vocab {
		words[idx] = word
	}
	for i, row := range simMat {
		wordI := words[i]
		rowI := map[string]float64{}
		for j, value := range row {
			wordJ := words[j]
			rowI[wordJ] = value
		}
		result[wordI] = rowI
	}
	return result
}

// =================================================================================================
// method GetPhraseSimilarityX
// brief description: compute the similarities between pairs of phrases and gives them in form of a
//	sparse matrix using Title Link method
// input:
// 	nothing
// output:
//	the sparse matrix of phrase similarities
// note:
//	The title link method can be found in:
//	Bogomolova, A., Ryazanova, M., & Balk, I. (2021). Cluster approach to analysis of publication
//	titles. In Journal of Physics: Conference Series (Vol. 1727, No. 1, p. 012016). IOP Publishing.
func (g *CitationGraph) GetPhraseSimilarityX(simType int) map[string]map[string]float64 {
	// --------------------------------------------------------------------------------------------
	// step 1: create corpus using all nodes
	corpus := g.CreateCorpusX(2)

	// --------------------------------------------------------------------------------------------
	// step 2: get concurrences and exclusions from corpuse
	concurrences := corpus.GetConcurrences()
	exclusions := corpus.GetExclusions()

	// --------------------------------------------------------------------------------------------
	// step 3: create concurrence model from concurrences
	cm := ConcurrenceBasedClustering.NewConcurrenceModel()
	cm.SetConcurrences(uint(len(corpus.Vocab)), concurrences)
	cm.SetExclusions(exclusions)

	// --------------------------------------------------------------------------------------------
	// step 4: get similarity
	simMat := map[uint]map[uint]float64{}
	switch simType {
	case 0:
		simMat = cm.InduceSimilarities()
	case 1:
		simMat = cm.InduceNormalizedSimilarities()
	case 2:
		simMat = cm.InduceJaccardSimilarities()
	case 3:
		simMat = cm.InduceWeightedJaccardSimilarities()
	case 4:
		simMat = cm.InduceNormalizedJaccardSimilarities()
	}

	// --------------------------------------------------------------------------------------------
	// step 6: conver simMat to the result then return it
	result := map[string]map[string]float64{}
	words := make([]string, len(corpus.Vocab))
	for word, idx := range corpus.Vocab {
		words[idx] = word
	}
	for i, row := range simMat {
		wordI := words[i]
		rowI := map[string]float64{}
		for j, value := range row {
			wordJ := words[j]
			rowI[wordJ] = value
		}
		result[wordI] = rowI
	}
	return result
}

// =================================================================================================
// func (g *CitationGraph) CreateCorpus
// brief description: create a corpus from a citation graph
// input
//	corpusType:
//		0 for title + ref titles per document for main nodes,
//		1 for title per document for main nodes,
//		2 for title per document for all nodes,
//		3 for labels per document for main nodes,
func (g *CitationGraph) CreateCorpus(corpusType int) *Corpus {
	// ---------------------------------------------------------------------------------------------
	// step 1: create an empty corpus
	corpus := NewCorpus()

	// ---------------------------------------------------------------------------------------------
	// step 2: create goroutines to add documents to the corpuse
	numCPUs := runtime.NumCPU()
	chNodes := make(chan map[int]*CitationNode)
	chWorkers := make(chan bool)
	numNodes := len(g.ToBeAnalyzed)
	if corpusType == 2 {
		numNodes = len(g.Nodes)
	}
	wordsOfNode := make([][]string, numNodes)
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		go func() {
			for nodes := range chNodes {
				for idxNode, node := range nodes {
					// (2.1) create an empty phrase list
					words := []string{}

					// (2.2) extract words of phrases in title and put them in the phrase list
					if corpusType <= 2 {
						phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(node.Title)
						for _, candidate := range phraseCandidates {
							candidateWords := strings.Split(candidate, " ")
							for _, word := range candidateWords {
								words = append(words, word)
							}
						}
					}

					// (2.3) extract words of phrases in reference titles and put them in the
					// phrase list
					if corpusType <= 0 {
						for _, refID := range node.Refs {
							refNode := g.Nodes[refID]
							phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(refNode.Title)
							for _, candidate := range phraseCandidates {
								candidateWords := strings.Split(candidate, " ")
								for _, word := range candidateWords {
									words = append(words, word)
								}
							}
						}
					}

					// (2.4) stem labels of the node and put them in phrase list
					if corpusType == 3 {
						stemmedLabels := KeyphraseExtraction.StemPhrases(node.Labels)
						for _, stemmedLabel := range stemmedLabels {
							words = append(words, stemmedLabel)
						}
					}

					// (2.5) record the phrases
					wordsOfNode[idxNode] = words
				}
			}
			chWorkers <- true
		}()
	}

	// ---------------------------------------------------------------------------------------------
	// step 3: put nodes into the input channel
	nodes := map[int]*CitationNode{}
	nodeIDsSent := map[int]bool{}
	for idx, id := range g.ToBeAnalyzed {
		// (1.1) get a node
		node := g.Nodes[id]

		// (1.2) append nodes with this node
		nodes[idx] = node
		nodeIDsSent[int(id)] = true

		// (1.3) send nodes when nodes are large enough
		if len(nodes) == 100 || idx+1 == len(g.ToBeAnalyzed) {
			chNodes <- nodes
			nodes = map[int]*CitationNode{}
		}
	}
	if corpusType == 2 {
		// send nodes not in main nodes
		idx := len(g.ToBeAnalyzed)
		for id, node := range g.Nodes {
			// (1.4) skips those already sent
			_, alreadySent := nodeIDsSent[int(id)]
			if alreadySent {
				continue
			}

			// (1.5) append nodes with this node
			nodes[idx] = node
			nodeIDsSent[int(id)] = true

			// (1.6) send nodes when nodes are large enough
			if len(nodes) == 100 {
				chNodes <- nodes
				nodes = map[int]*CitationNode{}
			}
			idx++
		}
		// (1.7) send the rest of nodes
		if len(nodes) > 0 {
			chNodes <- nodes
		}
	}
	// (1.8) close channel
	close(chNodes)

	// ---------------------------------------------------------------------------------------------
	// step 4: wait for all workers
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		<-chWorkers
	}
	for _, words := range wordsOfNode {
		corpus.AddDoc(words)
	}
	return corpus
}

// =================================================================================================
// func (g *CitationGraph) CreateCorpusX
// brief description: create a corpusX from a citation graph
// input
//	corpusType:
//		0 for title + ref titles per document for main nodes,
//		1 for title per document for main nodes,
//		2 for title per document for all nodes,
func (g *CitationGraph) CreateCorpusX(corpusType int) *CorpusX {
	// ---------------------------------------------------------------------------------------------
	// step 1: create an empty corpus
	corpus := NewCorpusX()

	// ---------------------------------------------------------------------------------------------
	// step 2: create goroutines to add documents to the corpuse
	numCPUs := runtime.NumCPU()
	chNodes := make(chan map[int]*CitationNode)
	chWorkers := make(chan bool)
	numNodes := len(g.ToBeAnalyzed)
	if corpusType == 2 {
		numNodes = len(g.Nodes)
	}
	wordsOfNode := make([][][]string, numNodes)
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		go func() {
			for nodes := range chNodes {
				for idxNode, node := range nodes {
					// (2.1) create an empty phrase list
					words := [][]string{}

					// (2.2) extract words of phrases in title and put them in the phrase list
					if corpusType <= 2 {
						phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(node.Title)
						for _, candidate := range phraseCandidates {
							candidateWords := KeyphraseExtraction.GetAllPossiblePhrases(candidate)
							words = append(words, candidateWords)
						}
					}

					// (2.3) extract words of phrases in reference titles and put them in the
					// phrase list
					if corpusType <= 0 {
						for _, refID := range node.Refs {
							refNode := g.Nodes[refID]
							phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(refNode.Title)
							for _, candidate := range phraseCandidates {
								candidateWords := KeyphraseExtraction.GetAllPossiblePhrases(candidate)
								words = append(words, candidateWords)
							}
						}
					}

					// (2.4) record the phrases
					wordsOfNode[idxNode] = words
				}
			}
			chWorkers <- true
		}()
	}

	// ---------------------------------------------------------------------------------------------
	// step 3: put nodes into the input channel
	nodes := map[int]*CitationNode{}
	nodeIDsSent := map[int]bool{}
	for idx, id := range g.ToBeAnalyzed {
		// (1.1) get a node
		node := g.Nodes[id]

		// (1.2) append nodes with this node
		nodes[idx] = node
		nodeIDsSent[int(id)] = true

		// (1.3) send nodes when nodes are large enough
		if len(nodes) == 100 || idx+1 == len(g.ToBeAnalyzed) {
			chNodes <- nodes
			nodes = map[int]*CitationNode{}
		}
	}
	if corpusType == 2 {
		// send nodes not in main nodes
		idx := len(g.ToBeAnalyzed)
		for id, node := range g.Nodes {
			// (1.4) skips those already sent
			_, alreadySent := nodeIDsSent[int(id)]
			if alreadySent {
				continue
			}

			// (1.5) append nodes with this node
			nodes[idx] = node
			nodeIDsSent[int(id)] = true

			// (1.6) send nodes when nodes are large enough
			if len(nodes) == 100 {
				chNodes <- nodes
				nodes = map[int]*CitationNode{}
			}
			idx++
		}
		// (1.7) send the rest of nodes
		if len(nodes) > 0 {
			chNodes <- nodes
		}
	}
	// (1.8) close channel
	close(chNodes)

	// ---------------------------------------------------------------------------------------------
	// step 4: wait for all workers
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		<-chWorkers
	}
	for _, words := range wordsOfNode {
		corpus.AddDoc(words)
	}
	return corpus
}

// =================================================================================================
// func (g *CitationGraph) CreateCorpusSeq
// brief description: create a corpus from a citation graph
// input
//	corpusType:
//		0 for title + ref titles per document for main nodes,
//		1 for title per document for main nodes,
//		2 for title per document for all nodes,
//		3 for labels per document for main nodes,
func (g *CitationGraph) CreateCorpusSeq(corpusType int) (corpus *CorpusSeq, years []int,
	nodeIDs []int64, isEnglish []bool) {
	// ---------------------------------------------------------------------------------------------
	// step 1: create an empty corpus and the array for years
	corpus = NewCorpusSeq()
	numNodes := len(g.ToBeAnalyzed)
	if corpusType == 2 {
		numNodes = len(g.Nodes)
	}
	years = make([]int, numNodes)
	nodeIDs = make([]int64, numNodes)
	isEnglish = make([]bool, numNodes)

	// ---------------------------------------------------------------------------------------------
	// step 2: create goroutines to add documents to the corpuse
	numCPUs := runtime.NumCPU()
	chNodes := make(chan map[int]*CitationNode)
	chWorkers := make(chan bool)
	wordsOfNode := make([][]string, numNodes)
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		go func() {
			for nodes := range chNodes {
				for idxNode, node := range nodes {
					// (2.1) create an empty phrase list
					words := []string{}

					// (2.2) extract words of phrases in title and put them in the phrase list
					if corpusType <= 2 {
						phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(node.Title)
						for _, candidate := range phraseCandidates {
							candidateWords := strings.Split(candidate, " ")
							for _, word := range candidateWords {
								words = append(words, word)
							}
						}
					}

					// (2.3) extract words of phrases in reference titles and put them in the
					// phrase list
					if corpusType <= 0 {
						for _, refID := range node.Refs {
							refNode := g.Nodes[refID]
							phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(refNode.Title)
							for _, candidate := range phraseCandidates {
								candidateWords := strings.Split(candidate, " ")
								for _, word := range candidateWords {
									words = append(words, word)
								}
							}
						}
					}

					// (2.4) stem labels of the node and put them in phrase list
					if corpusType == 3 {
						stemmedLabels := KeyphraseExtraction.StemPhrases(node.Labels)
						for _, stemmedLabel := range stemmedLabels {
							words = append(words, stemmedLabel)
						}
					}

					// (2.5) record the phrases
					if !isEnglish[idxNode] {
						lang := langDetector.GetClosestLanguage(strings.ToLower(node.Title))
						if lang == "english" {
							isEnglish[idxNode] = true
						}
					}
					wordsOfNode[idxNode] = words
				}
			}
			chWorkers <- true
		}()
	}

	// ---------------------------------------------------------------------------------------------
	// step 3: put nodes into the input channel
	nodes := map[int]*CitationNode{}
	nodeIDsSent := map[int]bool{}
	for idx, id := range g.ToBeAnalyzed {
		// (1.1) get a node and record its year
		node := g.Nodes[id]
		years[idx] = int(node.Year)
		nodeIDs[idx] = id
		isEnglish[idx] = true

		// (1.2) append nodes with this node
		nodes[idx] = node
		nodeIDsSent[int(id)] = true

		// (1.3) send nodes when nodes are large enough
		if len(nodes) == 100 || idx+1 == len(g.ToBeAnalyzed) {
			chNodes <- nodes
			nodes = map[int]*CitationNode{}
		}
	}
	if corpusType == 2 {
		// send nodes not in main nodes
		idx := len(g.ToBeAnalyzed)
		for id, node := range g.Nodes {
			// (1.4) skips those already sent
			_, alreadySent := nodeIDsSent[int(id)]
			if alreadySent {
				continue
			}

			// (1.5) append nodes with this node and record its year
			nodes[idx] = node
			nodeIDsSent[int(id)] = true
			years[idx] = int(node.Year)
			isEnglish[idx] = false
			nodeIDs[idx] = id

			// (1.6) send nodes when nodes are large enough
			if len(nodes) == 100 {
				chNodes <- nodes
				nodes = map[int]*CitationNode{}
			}
			idx++
		}
		// (1.7) send the rest of nodes
		if len(nodes) > 0 {
			chNodes <- nodes
		}
	}
	// (1.8) close channel
	close(chNodes)

	// ---------------------------------------------------------------------------------------------
	// step 4: wait for all workers
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		<-chWorkers
	}
	for _, words := range wordsOfNode {
		corpus.AddDoc(words)
	}
	return
}

// =================================================================================================
// func (g *CitationGraph) GetIdxMainNode
func (g *CitationGraph) GetIdxMainNode(nodeID int64) uint {
	idx, exists := g.idxMainNodes[nodeID]
	if !exists {
		log.Fatalln(fmt.Sprintf("cannot find %v in main node indices\n", nodeID))
	}
	return uint(idx)
}

// =================================================================================================
// func (g *CitationGraph) ClusterByLDA
// brief description: cluster the main nodes of g with their titles and their reference titles using
//	using LDA.
// input:
//	numTopics: number of topics
//	alpha, beta: the parameters of LDA
//	numIters: number of iterations for LDA
// output:
//	for each main node, gives the likelihoods the node belonging to a cluster.
func (g *CitationGraph) ClusterByLDA(numTopics int, alpha, beta float64, numIters int,
) map[int64][]float64 {
	if numTopics <= 0 || alpha <= 0.0 || beta <= 0.0 || numIters <= 0 { //
		log.Fatalln("all parameters of ClusterByLDA must be > 0")
	}
	// ---------------------------------------------------------------------------------------------
	// step 1: create corpus for the main nodes
	corpus := g.CreateCorpus(0)

	// ---------------------------------------------------------------------------------------------
	// step 2: use the corpus to compute LDA
	LDA := NewLDA(numTopics, alpha, beta, corpus)
	LDA.Train(numIters)

	// ---------------------------------------------------------------------------------------------
	// step 3: infer cluster memberships for each node
	memberships := map[int64][]float64{}
	for docID, doc := range corpus.Docs {
		nodeID := g.ToBeAnalyzed[docID]
		membershipsOfNode := LDA.Infer(doc)
		memberships[nodeID] = membershipsOfNode
	}

	// ---------------------------------------------------------------------------------------------
	// step 4: return the result
	return memberships
}

// =================================================================================================
// func (g *CitationGraph) ClusterTitlesByWPDM
// brief description: cluster the main nodes of g with their titles and their reference titles using
//	using LDA.
// input:
//	eps: the radius of neighborhood.
//	minPts: Only if the neighborhood of a point contains at least minPt points
//		(the center point of the neighborhood included), the neighborhood is
//		called dense. Only dense neighborhoods are connected to communities.
//	simType: the type of similarity, 0 for simple induced similarity, 1 for normalized
//		similarity, 2 for jaccard similarity, 4 for weighted jaccard similarity, 4 for
//		normalized jaccard similarity
//	workSpaceFileName: a file name for intermediate result
// output:
//	for each main node, gives the likelihoods the node belonging to a cluster.
func (g *CitationGraph) ClusterTitlesByWPDM(eps float64, minPts uint, simType int,
	workSpaceFileName string) []map[uint]bool {
	// ---------------------------------------------------------------------------------------------
	// step 1: create corpus for all nodes
	corpus := g.CreateCorpus(2)

	// ---------------------------------------------------------------------------------------------
	// step 2: create concurrences from corpus
	concurrences := corpus.GetConcurrences()

	// ---------------------------------------------------------------------------------------------
	// step 3: create corpus for main nodes
	mainCorpus := g.CreateCorpus(1)

	// ---------------------------------------------------------------------------------------------
	// step 4: translate mainCorpus to corpus
	mainCorpus = corpus.translate(mainCorpus)

	// ---------------------------------------------------------------------------------------------
	// step 5: cluster mainCorpus using WPDM
	cm := ConcurrenceBasedClustering.NewConcurrenceModel()
	cm.SetPairFilter(0.1, 3.0)
	cm.SetConcurrences(uint(len(corpus.Vocab)), concurrences)
	groups := make([]map[uint]bool, len(mainCorpus.Docs))
	for docID, doc := range mainCorpus.Docs {
		group := map[uint]bool{}
		for w, _ := range doc {
			group[uint(w)] = true
		}
		groups[docID] = group
	}
	communities := []map[uint]bool{}
	if minPts > 0 {
		communities = cm.GroupPairDBScan(groups, eps, minPts, simType, workSpaceFileName)
	} else {
		communities = cm.GroupPairAHC(groups, eps, simType, workSpaceFileName)
	}

	// ---------------------------------------------------------------------------------------------
	// step 6: return the result
	return communities
}

// =================================================================================================
// func (g *CitationGraph) ClusterLabelsByWPDM
// brief description: cluster the main nodes of g with their titles and their reference titles using
//	using LDA.
// input:
//	eps: the radius of neighborhood.
//	minPts: Only if the neighborhood of a point contains at least minPt points
//		(the center point of the neighborhood included), the neighborhood is
//		called dense. Only dense neighborhoods are connected to communities.
//	simType: the type of similarity, 0 for simple induced similarity, 1 for normalized
//		similarity, 2 for jaccard similarity, 4 for weighted jaccard similarity, 4 for
//		normalized jaccard similarity
//	workSpaceFileName: a file name for intermediate result
// output:
//	for each main node, gives the likelihoods the node belonging to a cluster.
func (g *CitationGraph) ClusterLabelsByWPDM(eps float64, minPts uint, simType int,
	workSpaceFileName string) []map[uint]bool {
	// ---------------------------------------------------------------------------------------------
	// step 1: create corpus for labels at main nodes
	corpus := g.CreateCorpus(3)

	// ---------------------------------------------------------------------------------------------
	// step 2: create concurrences from corpus
	concurrences := corpus.GetConcurrences()

	// ---------------------------------------------------------------------------------------------
	// step 5: cluster corpuse using WPDM
	cm := ConcurrenceBasedClustering.NewConcurrenceModel()
	cm.SetConcurrences(uint(len(corpus.Vocab)), concurrences)
	groups := make([]map[uint]bool, len(corpus.Docs))
	for docID, doc := range corpus.Docs {
		group := map[uint]bool{}
		for w, _ := range doc {
			group[uint(w)] = true
		}
		groups[docID] = group
	}
	communities := []map[uint]bool{}
	if minPts > 0 {
		communities = cm.GroupPairDBScan(groups, eps, minPts, simType, workSpaceFileName)
	} else {
		communities = cm.GroupPairAHC(groups, eps, simType, workSpaceFileName)
	}

	// ---------------------------------------------------------------------------------------------
	// step 6: return the result
	return communities
}

// =================================================================================================
// func (g *CitationGraph) ClusterTitlesByDBScan
// brief description: cluster the main nodes of g with their titles and their reference titles using
//	using LDA.
// input:
//	eps: the radius of neighborhood.
//	minPts: Only if the neighborhood of a point contains at least minPt points
//		(the center point of the neighborhood included), the neighborhood is
//		called dense. Only dense neighborhoods are connected to communities.
//	simType: the type of similarity, 0 for simple induced similarity, 1 for normalized
//		similarity, 2 for jaccard similarity, 4 for weighted jaccard similarity, 4 for
//		normalized jaccard similarity
// output:
//	for each main node, gives the likelihoods the node belonging to a cluster.
// func (g *CitationGraph) ClusterTitlesByDBScan(eps float64, minPts uint, simType int,
// ) map[int64][]float64 {

// }

// =================================================================================================
// func (g *CitationGraph) ClusterLabelsByDBScan
// brief description: cluster the main nodes of g with their titles and their reference titles using
//	using LDA.
// input:
//	eps: the radius of neighborhood.
//	minPts: Only if the neighborhood of a point contains at least minPt points
//		(the center point of the neighborhood included), the neighborhood is
//		called dense. Only dense neighborhoods are connected to communities.
//	simType: the type of similarity, 0 for simple induced similarity, 1 for normalized
//		similarity, 2 for jaccard similarity, 4 for weighted jaccard similarity, 4 for
//		normalized jaccard similarity
// output:
//	for each main node, gives the likelihoods the node belonging to a cluster.
// func (g *CitationGraph) ClusterTitlesByDBScan(eps float64, minPts uint, simType int,
// ) map[int64][]float64 {

// }

// =================================================================================================
// func (g *CitationGraph) ClusterTitlesByGSDMM
// brief description: cluster the main nodes of g with their titles and their reference titles using
//	using LDA.
// input:
//	simType: the type of similarity, 0 for simple induced similarity, 1 for normalized
//		similarity, 2 for jaccard similarity, 4 for weighted jaccard similarity, 4 for
//		normalized jaccard similarity
//	numTopics: number of topics
//	alpha, beta: the parameters of GSDMM
//	numIters: number of iterations
// output:
//	for each main node, gives the likelihoods the node belonging to a cluster.
func (g *CitationGraph) ClusterTitlesByGSDMM(simType, numTopics int, alpha, beta float64, numIters int,
) map[int64][]float64 {
	if numTopics <= 0 || alpha <= 0.0 || beta <= 0.0 || numIters <= 0 {
		log.Fatalln("all parameters of ClusterByLDA must be > 0")
	}
	// ---------------------------------------------------------------------------------------------
	// step 1: create corpus for the main nodes
	phraseSimilarities := g.GetPhraseSimilarityX(simType)
	tfidfGroups := g.SimTFIDF(phraseSimilarities)
	corpus := NewCorpus()
	for _, tfidfs := range tfidfGroups {
		texts := []string{}
		sumWeights := 0.0
		for _, weight := range tfidfs {
			sumWeights += weight
		}
		meanWeight := sumWeights / float64(len(tfidfs))
		for text, weight := range tfidfs {
			if weight < 0.5*meanWeight {
				continue
			}
			texts = append(texts, text)
		}
		corpus.AddDoc(texts)
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: use the corpus to compute LDA
	GSDMM := NewGSDMM(numTopics, alpha, beta, corpus)
	GSDMM.Train(numIters)

	// ---------------------------------------------------------------------------------------------
	// step 3: infer cluster memberships for each node
	memberships := map[int64][]float64{}
	for docID, doc := range corpus.Docs {
		nodeID := g.ToBeAnalyzed[docID]
		membershipsOfNode := GSDMM.Infer(doc)
		memberships[nodeID] = membershipsOfNode
	}

	// ---------------------------------------------------------------------------------------------
	// step 4: return the result
	return memberships
}

// =================================================================================================
// func (g *CitationGraph) ClusterLabelsByGSDMM
// brief description: cluster the main nodes of g with their titles and their reference titles using
//	using LDA.
// input:
//	eps: the radius of neighborhood.
//	minPts: Only if the neighborhood of a point contains at least minPt points
//		(the center point of the neighborhood included), the neighborhood is
//		called dense. Only dense neighborhoods are connected to communities.
//	simType: the type of similarity, 0 for simple induced similarity, 1 for normalized
//		similarity, 2 for jaccard similarity, 4 for weighted jaccard similarity, 4 for
//		normalized jaccard similarity
// output:
//	for each main node, gives the likelihoods the node belonging to a cluster.
// func (g *CitationGraph) ClusterTitlesByGSDMM(eps float64, minPts uint, simType int,
// ) map[int64][]float64 {

// }
func (g *CitationGraph) ClusterLabelsByGSDMM(numTopics int, alpha, beta float64, numIters int,
) map[int64][]float64 {
	if numTopics <= 0 || alpha <= 0.0 || beta <= 0.0 || numIters <= 0 {
		log.Fatalln("all parameters of ClusterByLDA must be > 0")
	}
	// ---------------------------------------------------------------------------------------------
	// step 1: create corpus for labels of the main nodes
	corpus := g.CreateCorpus(3)
	// reverse corpus Vocab
	reverseVocab := map[int]string{}
	for w, j := range corpus.Vocab {
		reverseVocab[j] = w
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: use the corpus to compute LDA
	GSDMM := NewGSDMM(numTopics, alpha, beta, corpus)
	GSDMM.Train(numIters)

	// ---------------------------------------------------------------------------------------------
	// step 3: infer cluster memberships for each node
	memberships := map[int64][]float64{}
	for docID, doc := range corpus.Docs {
		nodeID := g.ToBeAnalyzed[docID]
		membershipsOfNode := GSDMM.Infer(doc)
		memberships[nodeID] = membershipsOfNode
	}

	// ---------------------------------------------------------------------------------------------
	// step 4: return the result
	return memberships
}

// =================================================================================================
// func (g *CitationGraph) checkMemberships
func (g *CitationGraph) checkMemberships(memberships map[int64][]float64) {
	if len(memberships) != len(g.ToBeAnalyzed) {
		log.Fatalln("length of memberships is incorrect.")
	}

	numDims := 0
	for _, nodeID := range g.ToBeAnalyzed {
		membership, exists := memberships[nodeID]
		if !exists {
			log.Fatalln("some nodes for analysis cannot be found in memberships")
		}
		if numDims == 0 {
			numDims = len(membership)
		} else if numDims != len(membership) {
			log.Fatalln("number of dimensions are inconsistent in memberships")
		}
	}
}

// =================================================================================================
// func (g *CitationGraph) checkCommunities
func (g *CitationGraph) checkCommunities(communities []map[uint]bool) {
	numNodes := 0
	for _, community := range communities {
		numNodes += len(community)
	}
	if numNodes != len(g.ToBeAnalyzed) {
		log.Fatalln("total number of nodes in communities is incorrect.")
	}
}

// =================================================================================================
// func membCos
func membCos(membershipI, membershipJ []float64) float64 {
	if len(membershipI) != len(membershipJ) {
		log.Fatalln("lengthes of membershipI and membershipJ don't match.")
	}
	sim := 0.0
	sizeI := 0.0
	sizeJ := 0.0
	for k, membI := range membershipI {
		membJ := membershipJ[k]
		sim += membI * membJ
		sizeI += membI * membI
		sizeJ += membJ * membJ
	}
	sizeI = math.Sqrt(sizeI)
	sizeJ = math.Sqrt(sizeJ)
	sim /= sizeI * sizeJ
	return sim
}

// =================================================================================================
// func (g *CitationGraph) CompareByModularity
func (g *CitationGraph) CompareByModularity(communities []map[uint]bool,
	memberships map[int64][]float64) float64 {
	// ---------------------------------------------------------------------------------------------
	// step 1: check parameters
	g.checkCommunities(communities)
	g.checkMemberships(memberships)

	// ---------------------------------------------------------------------------------------------
	// step 2: compute sumSims
	n := len(g.ToBeAnalyzed)
	sumSims := make([]float64, n)
	numCPUs := runtime.NumCPU()
	chI := make(chan int)
	chWorkers := make(chan bool)
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		go func() {
			for i0 := range chI {
				i1 := i0 + 100
				if i1 > n {
					i1 = n
				}
				for i := i0; i < i1; i++ {
					nodeIDOfI := g.ToBeAnalyzed[i]
					membershipI := memberships[nodeIDOfI]
					sumSims[i] = 0.0
					for _, nodeIDOfJ := range g.ToBeAnalyzed {
						membershipJ := memberships[nodeIDOfJ]
						sim := membCos(membershipI, membershipJ)
						sumSims[i] += sim
					}
				}
			}
			chWorkers <- true
		}()
	}
	for i := 0; i < n; i += 100 {
		chI <- i
	}
	close(chI)
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		<-chWorkers
	}
	totalSumSims := 0.0
	for i := 0; i < n; i++ {
		totalSumSims += sumSims[i]
	}

	// ---------------------------------------------------------------------------------------------
	// step 3: compute modularity
	modularity := 0.0
	for _, community := range communities {
		for i, _ := range community {
			nodeIDOfI := g.ToBeAnalyzed[i]
			membershipI := memberships[nodeIDOfI]
			for j, _ := range community {
				nodeIDOfJ := g.ToBeAnalyzed[j]
				membershipJ := memberships[nodeIDOfJ]
				sim := membCos(membershipI, membershipJ)
				modularity += sim - sumSims[i]*sumSims[j]/totalSumSims
			}
		}
	}
	modularity /= totalSumSims

	// ---------------------------------------------------------------------------------------------
	// step 4: return result
	return modularity
}

// =================================================================================================
// func (g *CitationGraph) CompareByCPM
func (g *CitationGraph) CompareByCPM(gamma float64, communities []map[uint]bool,
	memberships map[int64][]float64) float64 {
	// ---------------------------------------------------------------------------------------------
	// step 1: check parameters
	g.checkCommunities(communities)
	g.checkMemberships(memberships)

	// ---------------------------------------------------------------------------------------------
	// step 2: compute CPM
	cpm := 0.0
	for _, community := range communities {
		for i, _ := range community {
			nodeIDOfI := g.ToBeAnalyzed[i]
			membershipI := memberships[nodeIDOfI]
			for j, _ := range community {
				nodeIDOfJ := g.ToBeAnalyzed[j]
				membershipJ := memberships[nodeIDOfJ]
				sim := membCos(membershipI, membershipJ)
				cpm += sim - gamma
			}
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 3: return result
	return cpm
}

// =================================================================================================
// func (g *CitationGraph) getCommunitiesFromMemberships
func (g *CitationGraph) GetCommunitiesFromMemberships(memberships map[int64][]float64) []map[uint]bool {
	numCommunities := 0
	for _, membership := range memberships {
		numCommunities = len(membership)
		break
	}
	result := make([]map[uint]bool, numCommunities)
	for i := 0; i < numCommunities; i++ {
		result[i] = map[uint]bool{}
	}
	for nodeID, membership := range memberships {
		maxMemb := 0.0
		idxMaxMemb := 0
		for k := 0; k < numCommunities; k++ {
			memb := membership[k]
			if memb > maxMemb {
				maxMemb = memb
				idxMaxMemb = k
			}
		}
		result[idxMaxMemb][g.GetIdxMainNode(nodeID)] = true
	}
	return result
}

// =================================================================================================
// func getCommunityIDFromCommunities
func getCommunityIDsFromCommunities(n int, communities []map[uint]bool) []int {
	result := make([]int, n)
	for i, community := range communities {
		for u, _ := range community {
			result[u] = i
		}
	}
	return result
}

// =================================================================================================
// func (g *CitationGraph) CompareByRI
func (g *CitationGraph) CompareByRI(communities []map[uint]bool,
	memberships map[int64][]float64) float64 {
	// ---------------------------------------------------------------------------------------------
	// step 1: check parameters
	g.checkCommunities(communities)
	g.checkMemberships(memberships)

	// ---------------------------------------------------------------------------------------------
	// step 2: compute the intersection part
	partIntersect := 0.0
	myCommunities := g.GetCommunitiesFromMemberships(memberships)
	for _, ci := range communities {
		if len(ci) == 0 {
			continue
		}
		for _, cj := range myCommunities {
			if len(cj) == 0 {
				continue
			}
			cu, cv := ci, cj
			if len(cu) > len(cv) {
				cu, cv = cv, cu
			}
			numInBoth := 0
			for u, _ := range cu {
				_, inCv := cv[u]
				if inCv {
					numInBoth++
				}
			}
			partIntersect += float64(numInBoth * (numInBoth - 1) / 2)
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 3: compute the cross-boundary part
	n := len(g.ToBeAnalyzed)
	commuID1s := getCommunityIDsFromCommunities(n, communities)
	commuID2s := getCommunityIDsFromCommunities(n, myCommunities)
	partCross := 0.0
	for u := 0; u < n; u++ {
		c1u := commuID1s[u]
		c2u := commuID2s[u]
		for v := 0; v < n; v++ {
			c1v := commuID1s[v]
			c2v := commuID2s[v]
			if c1u == c1v || c2u == c2v {
				continue
			}
			partCross += 0.5
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 4: normalize and return the result
	partN := float64(n * (n - 1) / 2)
	fmt.Printf("part intersect = %f, partCross = %f, partN = %f\n", partIntersect, partCross, partN)
	result := (partIntersect + partCross) / partN
	return result
}

// =================================================================================================
// func (g *CitationGraph) CompareByARI
func (g *CitationGraph) CompareByARI(communities []map[uint]bool,
	memberships map[int64][]float64) float64 {
	// ---------------------------------------------------------------------------------------------
	// step 1: check parameters
	g.checkCommunities(communities)
	g.checkMemberships(memberships)

	// ---------------------------------------------------------------------------------------------
	// step 2: compute the intersection part
	result := 0.0
	myCommunities := g.GetCommunitiesFromMemberships(memberships)
	for _, ci := range communities {
		if len(ci) == 0 {
			continue
		}
		for _, cj := range myCommunities {
			if len(cj) == 0 {
				continue
			}
			cu, cv := ci, cj
			if len(cu) > len(cv) {
				cu, cv = cv, cu
			}
			numInBoth := 0
			for u, _ := range cu {
				_, inCv := cv[u]
				if inCv {
					numInBoth++
				}
			}
			result += float64(numInBoth * (numInBoth - 1) / 2)
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 3: compute the cross-boundary part
	n := len(g.ToBeAnalyzed)
	partN := float64(n * (n - 1) / 2)
	partI := 0.0
	for _, ci := range communities {
		ni := len(ci)
		if ni <= 1 {
			continue
		}
		partI += float64(ni * (ni - 1) / 2)
	}
	partJ := 0.0
	for _, cj := range myCommunities {
		nj := len(cj)
		if nj <= 1 {
			continue
		}
		partJ += float64(nj * (nj - 1) / 2)
	}
	partCross := partI * partJ / partN

	fmt.Printf("part intersect = %f, partCross = %f, partI = %f, partJ = %f\n", result, partCross, partI, partJ)

	// ---------------------------------------------------------------------------------------------
	// step 4: normalize and return the result
	result = (result - partCross) / (0.5*(partI+partJ) - partCross)
	return result
}

// =================================================================================================
// func (g *CitationGraph) ComputeEntropies
func (g *CitationGraph) ComputeEntropies(communities []map[uint]bool,
	memberships map[int64][]float64) (float64, float64, float64) {
	// ---------------------------------------------------------------------------------------------
	// step 1: check parameters
	g.checkCommunities(communities)
	g.checkMemberships(memberships)

	// ---------------------------------------------------------------------------------------------
	// step 2: compute the cross entropy
	crossEntropy := 0.0
	n := len(g.ToBeAnalyzed)
	myCommunities := g.GetCommunitiesFromMemberships(memberships)
	for _, ci := range communities {
		if len(ci) == 0 {
			continue
		}
		for _, cj := range myCommunities {
			if len(cj) == 0 {
				continue
			}
			cu, cv := ci, cj
			if len(cu) > len(cv) {
				cu, cv = cv, cu
			}
			numInBoth := 0
			for u, _ := range cu {
				_, inCv := cv[u]
				if inCv {
					numInBoth++
				}
			}
			if numInBoth == 0 {
				continue
			}
			p := float64(numInBoth) / float64(n)
			crossEntropy -= p * math.Log(p)
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: compute the entropies
	entropy1 := 0.0
	for _, ci := range communities {
		ni := len(ci)
		if ni == 0 {
			continue
		}
		p := float64(ni) / float64(n)
		entropy1 -= p * math.Log(p)
	}

	entropy2 := 0.0
	for _, cj := range myCommunities {
		nj := len(cj)
		if nj == 0 {
			continue
		}
		p := float64(nj) / float64(n)
		entropy2 -= p * math.Log(p)
	}

	// ---------------------------------------------------------------------------------------------
	// step 3: return the entropies
	return crossEntropy, entropy1, entropy2
}

// =================================================================================================
// func (g *CitationGraph) CompareByMI
func (g *CitationGraph) CompareByMI(communities []map[uint]bool,
	memberships map[int64][]float64) float64 {
	// ---------------------------------------------------------------------------------------------
	// step 1: compute the entropies
	crossEntropy, entropy1, entropy2 := g.ComputeEntropies(communities, memberships)

	// ---------------------------------------------------------------------------------------------
	// step 2: compute and return the result
	return entropy1 + entropy2 - crossEntropy
}

// =================================================================================================
// func (g *CitationGraph) CompareByNMI
func (g *CitationGraph) CompareByNMI(communities []map[uint]bool,
	memberships map[int64][]float64) float64 {
	// ---------------------------------------------------------------------------------------------
	// step 1: compute the entropies
	crossEntropy, entropy1, entropy2 := g.ComputeEntropies(communities, memberships)

	// ---------------------------------------------------------------------------------------------
	// step 2: compute and return the result
	return 2.0 * (entropy1 + entropy2 - crossEntropy) / (entropy1 + entropy2)
}

func numCombs(a ...int) float64 {
	m := len(a)
	m1 := m
	for i := 0; i < m; i++ {
		if a[i] < 0 {
			m1 = i
			break
		}
	}
	if m1 == m {
		log.Fatalln(fmt.Sprintf("in numComb, m1 == m\n"))
	}

	b := make([]int, m)
	n1 := 0
	for i := 0; i < m1; i++ {
		b[i] = a[i]
		n1 += a[i]
	}
	n2 := 0
	for i := m1 + 1; i < m; i++ {
		b[i] = a[i]
		n2 += a[i]
	}

	if n1 != n2 {
		log.Fatalln(fmt.Sprintf("in numComb, n1 = %d != n2 = %d\n", n1, n2))
	}

	result := 1.0
	for k := 0; k < n1; k++ {
		i1Max := 0
		for i1 := 1; i1 < m1; i1++ {
			if b[i1] > b[i1Max] {
				i1Max = i1
			}
		}

		i2Max := m1 + 1
		for i2 := m1 + 2; i2 < m; i2++ {
			if b[i2] > b[i2Max] {
				i2Max = i2
			}
		}

		result *= float64(b[i1Max]) / float64(b[i2Max])
		b[i1Max]--
		b[i2Max]--
	}
	return result
}

// =================================================================================================
// func (g *CitationGraph) ComputeEMI
func (g *CitationGraph) ComputeEMI(communities []map[uint]bool,
	memberships map[int64][]float64) float64 {
	// ---------------------------------------------------------------------------------------------
	// step 1: check parameters
	g.checkCommunities(communities)
	g.checkMemberships(memberships)

	// ---------------------------------------------------------------------------------------------
	// step 2: compute the cross entropy
	result := 0.0
	n := len(g.ToBeAnalyzed)
	myCommunities := g.GetCommunitiesFromMemberships(memberships)
	for _, ci := range communities {
		ni := len(ci)
		if ni == 0 {
			continue
		}
		for _, cj := range myCommunities {
			nj := len(cj)
			if nj == 0 {
				continue
			}
			k0 := ni + nj - n
			if k0 < 1 {
				k0 = 1
			}
			k1 := ni
			if k1 > nj {
				k1 = nj
			}
			for k := k0; k <= k1; k++ {
				result += numCombs(ni, nj, n-ni, n-nj, -1, n, k, ni-k, nj-k, n-ni-nj+k) *
					float64(k) / float64(n) * math.Log(float64(k*n)/float64(ni*nj))
			}
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: return the result
	return result
}

// =================================================================================================
// func (g *CitationGraph) CompareByAMI
func (g *CitationGraph) CompareByAMI(communities []map[uint]bool,
	memberships map[int64][]float64) float64 {
	// ---------------------------------------------------------------------------------------------
	// step 1: compute the entropies
	crossEntropy, entropy1, entropy2 := g.ComputeEntropies(communities, memberships)

	// ---------------------------------------------------------------------------------------------
	// step 2: compute and expected mutual entropy
	emi := g.ComputeEMI(communities, memberships)

	// ---------------------------------------------------------------------------------------------
	// step 3: compute and return the result
	mi := entropy1 + entropy2 - crossEntropy
	return (mi - emi) / (math.Max(entropy1, entropy2) - emi)
}

// =================================================================================================
// func SaveMemberships
func SaveMemberships(memberships map[int64][]float64, fileName string) {
	fileMembership, err := os.Create(fileName)
	if err != nil {
		log.Fatal(err)
	}
	jsonBytes, err := json.Marshal(memberships)
	if err != nil {
		log.Fatal(err)
	}
	jsonString := string(jsonBytes)
	_, err = fileMembership.WriteString(jsonString)
	if err != nil {
		log.Fatal(err)
	}
	fileMembership.Close()
}

// =================================================================================================
// func LoadMemberships
func LoadMemberships(fileName string) map[int64][]float64 {
	memberships := map[int64][]float64{}
	fileMembership, err := os.Open(fileName)
	if err != nil {
		log.Fatal(err)
	}
	bufLen := 1024 * 1024
	bytes := make([]byte, bufLen)
	jsonBytes := []byte{}
	readLen, err := fileMembership.Read(bytes)
	if err != nil {
		log.Fatal(err)
	}
	for readLen == bufLen {
		jsonBytes = append(jsonBytes, bytes...)
		readLen, err = fileMembership.Read(bytes)
		if err != nil {
			log.Fatal(err)
		}
	}
	if readLen > 0 {
		jsonBytes = append(jsonBytes, bytes[0:readLen]...)
	}

	fileMembership.Close()
	json.Unmarshal(jsonBytes, &memberships)
	return memberships
}

// =================================================================================================
// struct PhrasePair
type PhrasePair struct {
	Phrase1, Phrase2 string
}

type PairFreq struct {
	Actual   float64
	Expected float64
}

// =================================================================================================
// func (g *CitationGraph) preparePhrasePairSearch
func (g *CitationGraph) preparePhrasePairSearch() (numDocs int,
	docConcurrences map[uint]map[uint]float64, docFreqsOf []float64, phrases []string) {
	// --------------------------------------------------------------------------------------------
	// step 1: create corpus using all nodes
	corpus := g.CreateCorpusX(2)

	// --------------------------------------------------------------------------------------------
	// step 2: get concurrences and exclusions from corpuse
	docConcurrences = corpus.GetDocConcurrences()

	// --------------------------------------------------------------------------------------------
	// step 3: compute sum of document frequencies for each phrase
	numPhrases := len(corpus.Vocab)
	docFreqsOf = make([]float64, numPhrases)
	for i := 0; i < numPhrases; i++ {
		docFreqsOf[i] = 0.0
	}
	numCPUs := runtime.NumCPU()
	numDocs = len(corpus.Docs)
	chI := make(chan int)
	chResult := make(chan map[int]int)
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		go func() {
			myResult := map[int]int{}

			for i0 := range chI {
				i1 := i0 + 100
				if i1 > numDocs {
					i1 = numDocs
				}
				for i := i0; i < i1; i++ {
					candidates := corpus.Docs[i]

					// first find the set of texts in this document
					docResult := map[int]bool{}
					for _, candidate := range candidates {
						for wordID, _ := range candidate {
							docResult[wordID] = true
						}
					}

					// then update my document frequency with this set
					for wordID, _ := range docResult {
						oldFreq, exists := myResult[wordID]
						if !exists {
							oldFreq = 0
						}
						myResult[wordID] = oldFreq + 1
					}
				}
			}

			chResult <- myResult
		}()
	}
	for i := 0; i < numDocs; i += 100 {
		chI <- i
	}
	close(chI)
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		localResult := <-chResult
		for wordID, freq := range localResult {
			docFreqsOf[wordID] += float64(freq)
		}
	}

	// --------------------------------------------------------------------------------------------
	// step 4: create concurrence model from concurrences
	phrases = make([]string, numPhrases)
	for phrase, idx := range corpus.Vocab {
		phrases[idx] = phrase
	}

	return
}

// =================================================================================================
// func (g *CitationGraph) findStronglyConnectedPhrases
func (g *CitationGraph) findStronglyConnectedPhrases(numDocs int,
	docConcurrences map[uint]map[uint]float64, docFreqsOf []float64, phrases []string,
	thresFreq, thresRatio float64,
) map[PhrasePair]PairFreq {
	// --------------------------------------------------------------------------------------------
	// step 1: apply thresholds in filtering
	n := len(g.Nodes)
	result := map[PhrasePair]PairFreq{}
	for w1, concurrencesOfW1 := range docConcurrences {
		relFreq1 := docFreqsOf[w1] / float64(numDocs)
		for w2, freq := range concurrencesOfW1 {
			if w1 >= w2 {
				continue
			}
			if freq < thresFreq {
				continue
			}
			relFreq2 := docFreqsOf[w2] / float64(numDocs)
			expectedFreq := relFreq1 * relFreq2 * float64(n)
			if freq >= thresRatio*expectedFreq {
				result[PhrasePair{phrases[w1], phrases[w2]}] =
					PairFreq{Actual: freq, Expected: expectedFreq}
			}
		}
	}

	// ---------------------------------------------------------------------------------------------
	// step 2: return the result
	return result
}

// =================================================================================================
// func (g *CitationGraph) GetStronglyConnectedPhrases
func (g *CitationGraph) GetStronglyConnectedPhrases(thresFreq, thresRatio float64) map[PhrasePair]PairFreq {
	// --------------------------------------------------------------------------------------------
	// step 1: get concurrences, sumConcurrencesOf, and phrases
	numDocs, docConcurrences, docFreqsOf, phrases := g.preparePhrasePairSearch()

	// --------------------------------------------------------------------------------------------
	// step 2: apply thresholds in filtering and return the result
	return g.findStronglyConnectedPhrases(numDocs, docConcurrences, docFreqsOf, phrases, thresFreq, thresRatio)

}

// =================================================================================================
// func (g *CitationGraph) SaveWord2VecTrainingData
// brief description:
// 	save tokenized and stemmed publication titles to a file for training of word2vec
func (g *CitationGraph) SaveWord2VecTrainingData(fileNamePrefix string, numIters, minFreq int,
	minScore float64, useW2PEx bool, yearStartFrom int) {
	// ---------------------------------------------------------------------------------------------
	// step 1: create the sequential corpus
	corpus, years, IDs, isEnglish := g.CreateCorpusSeq(2)

	// ---------------------------------------------------------------------------------------------
	// step 2: use word2phrase to detect multi-word phrases
	if useW2PEx {
		corpus = corpus.Word2PhraseEx(numIters, minFreq, minScore)
	} else {
		corpus = corpus.Word2Phrase(numIters, minFreq, minScore)
	}

	// ---------------------------------------------------------------------------------------------
	// step 3: extract a list of phrases from corpus
	words := make([]string, len(corpus.Vocab))
	for word, idxWord := range corpus.Vocab {
		words[idxWord] = word
	}

	// ---------------------------------------------------------------------------------------------
	// step 4: detect year ranges
	minYear := years[0]
	maxYear := years[0]
	lenYears := len(years)
	for idxYear := 1; idxYear < lenYears; idxYear++ {
		if years[idxYear] < minYear {
			minYear = years[idxYear]
		} else if years[idxYear] > maxYear {
			maxYear = years[idxYear]
		}
	}
	if minYear < yearStartFrom {
		minYear = yearStartFrom
	}

	// ---------------------------------------------------------------------------------------------
	// step 5: create a file for each [minYear, year] range
	for year := minYear; year <= maxYear; year++ {
		// (5.1) create the file
		fileName := fileNamePrefix + fmt.Sprintf("-%d.txt", year)
		idFileName := fileNamePrefix + fmt.Sprintf("-IDs-%d.txt", year)
		file, err := os.Create(fileName)
		if err != nil {
			log.Fatalln(fmt.Sprintf("cannot create file %s", fileName))
		}
		defer file.Close()
		idFile, err := os.Create(idFileName)
		if err != nil {
			log.Fatalln(fmt.Sprintf("cannot create file %s", idFileName))
		}
		defer idFile.Close()

		// (5.2) write the file
		for idxDoc, doc := range corpus.Docs {
			if years[idxDoc] > year || !isEnglish[idxDoc] {
				continue
			}
			for _, idxWord := range doc {
				word := strings.Replace(words[idxWord], " ", "_", -1)
				file.WriteString(word + " ")
			}
			file.WriteString("\n")
			idFile.WriteString(fmt.Sprintf("%d\n", IDs[idxDoc]))
		}
	}
}

// =================================================================================================
// func (g *CitationGraph) GetEmergingTrends
func (g *CitationGraph) GetEmergingTrends(yearToday, yearRecent, yearFarAway, lowThreshold,
	highThreshold int) map[string][]int {
	if yearToday <= yearRecent || yearRecent <= yearFarAway {
		log.Fatal("Must make sure yearFarAway < yearRecent < yearToday")
	}

	result := map[string][]int{}
	numYears := yearToday - yearFarAway + 1
	numYearsFarAway := yearRecent - yearFarAway
	for _, nodeID := range g.ToBeAnalyzed {
		node := g.Nodes[int64(nodeID)]
		year := int(node.Year)
		if year < yearFarAway || year > yearToday {
			continue
		}
		stemmedLabels := KeyphraseExtraction.StemPhrases(node.Labels)
		for _, label := range stemmedLabels {
			labelHistory, exists := result[label]
			if !exists {
				labelHistory = make([]int, numYears)
				result[label] = labelHistory
				for i := 0; i < numYears; i++ {
					labelHistory[i] = 0
				}
			}
			labelHistory[year-yearFarAway]++
		}
	}

	toBeDeleted := map[string]bool{}
	for label, labelHistory := range result {
		countFarAway := 0
		for i := 0; i < numYearsFarAway; i++ {
			countFarAway += labelHistory[i]
		}
		if countFarAway >= lowThreshold {
			toBeDeleted[label] = true
			continue
		}

		countRecent := 0
		for i := numYearsFarAway; i < numYears; i++ {
			countRecent += labelHistory[i]
		}
		if countRecent < highThreshold {
			toBeDeleted[label] = true
		}
	}

	for label, _ := range toBeDeleted {
		delete(result, label)
	}

	return result
}

// =================================================================================================
// func (g *CitationGraph) GetEmergingTopicPublications
// note: The publications of emerging topic have the following characteristics:
//	(1) cold start: in the year of publication, it receives citations <= lowThreshold per year
//	(2) break out: in recent years, it receives avg citations >= highThreshold per year
func (g *CitationGraph) GetEmergingTopicPublications(yearToday, yearRecent, yearFarAway,
	lowThreshold, highThreshold int) map[int64][]int {
	if yearToday <= yearRecent || yearRecent <= yearFarAway {
		log.Fatal("Must make sure yearFarAway < yearRecent < yearToday")
	}

	result := map[int64][]int{}
	numYears := yearToday - yearFarAway + 1
	for _, nodeID := range g.ToBeAnalyzed {
		node := g.Nodes[int64(nodeID)]
		year := int(node.Year)

		if year < yearFarAway || year > yearToday {
			continue
		}

		citeHistory := make([]int, numYears)
		result[nodeID] = citeHistory
		for i := 0; i < numYears; i++ {
			citeHistory[i] = 0
		}

		for _, citeID := range node.Cites {
			citer := g.Nodes[citeID]
			citeYear := int(citer.Year)
			if citeYear < yearFarAway || citeYear > yearToday {
				continue
			}
			citeHistory[citeYear-yearFarAway]++
		}
	}

	toBeDeleted := map[int64]bool{}
	for nodeID, citeHistory := range result {
		// find cold start
		node := g.Nodes[int64(nodeID)]
		year := int(node.Year)
		if citeHistory[year-yearFarAway] > lowThreshold {
			toBeDeleted[nodeID] = true
			continue
		}
		coldYear := year
		for coldYear+1 < yearToday {
			if citeHistory[coldYear+1-yearFarAway] <= lowThreshold {
				coldYear++
			} else {
				break
			}
		}

		// skip those cold year ends at next year of publication
		if coldYear-year < 2 {
			toBeDeleted[nodeID] = true
			continue
		}

		// skip those cold year ends before yearRecent and those no hot year
		if coldYear < yearRecent || coldYear >= yearToday {
			toBeDeleted[nodeID] = true
			continue
		}

		// compute hot year average
		hotYearAvg := 0.0
		numHotYears := 0
		for hotYear := coldYear + 1; hotYear <= yearToday; hotYear++ {
			hotYearAvg += float64(citeHistory[hotYear-yearFarAway])
			numHotYears++
		}
		hotYearAvg /= float64(numHotYears)

		// skip those with too low average during hot years
		if hotYearAvg < float64(highThreshold) {
			toBeDeleted[nodeID] = true
		}
	}

	for nodeID, _ := range toBeDeleted {
		delete(result, nodeID)
	}

	return result
}

// =================================================================================================
// func (g *CitationGraph) GetHotTopicPublications
// note: The publications of hot topic have the following characteristics:
//	hot start: in the year of publication or next year, it receives citations >= highThreshold per year
func (g *CitationGraph) GetHotTopicPublications(yearToday, yearRecent, yearFarAway,
	lowThreshold, highThreshold int) map[int64][]int {
	if yearToday <= yearRecent || yearRecent <= yearFarAway {
		log.Fatal("Must make sure yearFarAway < yearRecent < yearToday")
	}

	result := map[int64][]int{}
	numYears := yearToday - yearFarAway + 1
	for _, nodeID := range g.ToBeAnalyzed {
		node := g.Nodes[int64(nodeID)]
		year := int(node.Year)

		if year < yearFarAway || year > yearToday {
			continue
		}

		citeHistory := make([]int, numYears)
		result[nodeID] = citeHistory
		for i := 0; i < numYears; i++ {
			citeHistory[i] = 0
		}

		for _, citeID := range node.Cites {
			citer := g.Nodes[citeID]
			citeYear := int(citer.Year)
			if citeYear < yearFarAway || citeYear > yearToday || citeYear < year {
				continue
			}
			citeHistory[citeYear-yearFarAway]++
		}
	}

	toBeDeleted := map[int64]bool{}
	for nodeID, citeHistory := range result {
		// find hot start
		node := g.Nodes[int64(nodeID)]
		year := int(node.Year)
		if citeHistory[year-yearFarAway] < highThreshold {
			if year+1 <= yearToday {
				if citeHistory[year+1-yearFarAway] < highThreshold {
					toBeDeleted[nodeID] = true
					continue
				}
			} else {
				toBeDeleted[nodeID] = true
				continue
			}

		}
	}

	for nodeID, _ := range toBeDeleted {
		delete(result, nodeID)
	}

	return result
}

// =================================================================================================
// func (g *CitationGraph) SortByYear
// brief description: return a sorted list of titles
// input:
//	None
// output:
//	A map with year as key and slice of titles as value.
func (g *CitationGraph) SortByYear() map[int][]string {
	// ---------------------------------------------------------------------------------------------
	// step 1: create an empty map of the result
	result := map[int][]string{}

	// ---------------------------------------------------------------------------------------------
	// step 2: iterate through the citation graph and put titles into result according to year
	for _, node := range g.Nodes {
		year := int(node.Year)
		resultOfYear, exists := result[year]
		if !exists {
			resultOfYear = []string{}
		}
		result[year] = append(resultOfYear, node.Title)
	}

	// ---------------------------------------------------------------------------------------------
	// step 3: return the result
	return result
}

func (g *CitationGraph) Word2Vec(fileNamePrefix string, numIters, minFreq int,
	minScore float64, useW2PEx bool, yearStartFrom, yearEndWith int) {
	g.SaveWord2VecTrainingData(fileNamePrefix, numIters, minFreq, minScore, useW2PEx, yearStartFrom)

	for year := yearStartFrom; year <= yearEndWith; year++ {
		model, err := word2vec.New(
			word2vec.Window(5),
			word2vec.Model(word2vec.Cbow),
			word2vec.Optimizer(word2vec.NegativeSampling),
			word2vec.NegativeSampleSize(5),
			word2vec.Dim(100),
			word2vec.Iter(100),
			//word2vec.Verbose(),
			word2vec.LogBatch(10000),
		)
		if err != nil {
			log.Fatalln(err)
		}

		inputFileName := fileNamePrefix + fmt.Sprintf("-%d.txt", year)
		input, _ := os.Open(inputFileName)
		defer input.Close()
		if err = model.Train(input); err != nil {
			log.Fatalln(err)
		}

		// write word vector.
		outputFileName := fileNamePrefix + fmt.Sprintf("-cbow-%d.vec", year)
		output, _ := os.Create(outputFileName)
		model.Save(output, vector.Agg)
	}

}

func computePhraseSimilarities(phraseVectors map[string][]float64, highFreqPhrases map[string]bool,
) map[string]map[string]float64 {
	chPhrases := make(chan []string)
	chSimilarities := make(chan map[string]map[string]float64)
	numCPUs := runtime.NumCPU()
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		go func() {
			mySimilarities := map[string]map[string]float64{}
			for phrases := range chPhrases {
				for _, phrase := range phrases {
					vector, _ := phraseVectors[phrase]
					simToPhrase := map[string]float64{}
					mySimilarities[phrase] = simToPhrase
					for another, anotherVector := range phraseVectors {
						_, isHighFreq := highFreqPhrases[another]
						if !isHighFreq {
							continue
						}

						sim := 0.0
						size1 := 0.0
						size2 := 0.0
						for idx, elem := range vector {
							anotherElem := anotherVector[idx]
							sim += elem * anotherElem
							size1 += elem * elem
							size2 += anotherElem * anotherElem
						}
						sim /= math.Sqrt(size1 * size2)
						simToPhrase[another] = sim
					}
				}
			}
			chSimilarities <- mySimilarities
		}()
	}

	phrases := make([]string, 100)
	idxPhrase := 0
	for phrase, _ := range phraseVectors {
		_, isHighFreq := highFreqPhrases[phrase]
		if !isHighFreq {
			continue
		}

		phrases[idxPhrase] = phrase
		idxPhrase++
		if idxPhrase == 100 {
			chPhrases <- phrases
			phrases = make([]string, 100)
			idxPhrase = 0
		}
	}
	if idxPhrase > 0 {
		chPhrases <- phrases[0:idxPhrase]
	}
	close(chPhrases)

	similarities := map[string]map[string]float64{}
	for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
		localSimilarities := <-chSimilarities
		for phrase, simToPhrase := range localSimilarities {
			similarities[phrase] = simToPhrase
		}
	}
	return similarities
}

func computeSimRanks(similarities map[string]map[string]float64) map[string]map[string]int {
	result := map[string]map[string]int{}
	for phrase1, simToPhrase1 := range similarities {
		numNeighbors := len(simToPhrase1) - 1
		neighbors := make([]string, numNeighbors)
		idxNeighbor := 0
		for phrase2, _ := range simToPhrase1 {
			if phrase2 == phrase1 {
				continue
			}
			neighbors[idxNeighbor] = phrase2
			idxNeighbor++
		}
		sort.Slice(neighbors, func(i, j int) bool {
			return simToPhrase1[neighbors[i]] > simToPhrase1[neighbors[j]]
		})
		resultOfPhrase1 := map[string]int{}
		for idx, neighbor := range neighbors {
			resultOfPhrase1[neighbor] = idx
		}
		result[phrase1] = resultOfPhrase1
	}
	return result
}

type RankJump struct {
	phrase1, phrase2 string
	jump             int
}

func (g *CitationGraph) Leap2Trend(fileNamePrefix string, yearStartFrom, yearEndWith, minFreq, minJump int) {
	prevSimRanks := map[string]map[string]int{}
	for year := yearStartFrom; year <= yearEndWith; year++ {
		// open vector file and id file
		vectorFileName := fileNamePrefix + fmt.Sprintf("-cbow-%d.vec", year)
		idFileName := fileNamePrefix + fmt.Sprintf("-IDs-%d.txt", year)
		trainFileName := fileNamePrefix + fmt.Sprintf("-%d.txt", year)
		vectorFile, _ := os.Open(vectorFileName)
		idFile, _ := os.Open(idFileName)
		trainFile, _ := os.Open(trainFileName)
		defer vectorFile.Close()
		defer idFile.Close()
		defer trainFile.Close()

		// load vectors from vector file
		vectorScanner := bufio.NewScanner(vectorFile)
		phraseVectors := map[string][]float64{}
		for vectorScanner.Scan() {
			textLine := vectorScanner.Text()
			fields := strings.Split(textLine, " ")
			if len(fields) < 101 {
				continue
			}
			phrase := fields[0]
			phraseVector := make([]float64, 100)
			phraseVectors[phrase] = phraseVector
			for j := 0; j < 100; j++ {
				phraseVector[j], _ = strconv.ParseFloat(fields[j+1], 64)
			}
		}

		// load ids from id file
		idScanner := bufio.NewScanner(idFile)
		ids := []int64{}
		for idScanner.Scan() {
			idText := idScanner.Text()
			id, _ := strconv.ParseInt(idText, 10, 64)
			ids = append(ids, id)
		}

		// find high freq phrases
		phraseFreqs := map[string]int{}
		trainScanner := bufio.NewScanner(trainFile)
		for trainScanner.Scan() {
			phrases := strings.Split(trainScanner.Text(), " ")
			for _, phrase := range phrases {
				oldFreq, exists := phraseFreqs[phrase]
				if !exists {
					oldFreq = 0
				}
				phraseFreqs[phrase] = oldFreq + 1
			}
		}

		highFreqPhrases := map[string]bool{}
		for phrase, freq := range phraseFreqs {
			if freq < minFreq {
				continue
			}
			highFreqPhrases[phrase] = true
		}

		// compute phrase similarity matrix
		phraseSimilarities := computePhraseSimilarities(phraseVectors, highFreqPhrases)

		// compute ranks
		simRanks := computeSimRanks(phraseSimilarities)

		// find top jumps in phrase similarities
		jumps := []RankJump{}
		for phrase1, rankWithPhrase1 := range prevSimRanks {
			for phrase2, prevRank := range rankWithPhrase1 {
				rank := simRanks[phrase1][phrase2]
				jump := rank - prevRank
				if jump < minJump {
					continue
				}
				jumps = append(jumps, RankJump{phrase1: phrase1, phrase2: phrase2, jump: jump})
			}
		}
		sort.Slice(jumps, func(i, j int) bool {
			return jumps[i].jump > jumps[j].jump
		})

		outputFileName := fileNamePrefix + fmt.Sprintf("-jumpranking-%d.csv", year)
		outputFile, _ := os.Create(outputFileName)
		defer outputFile.Close()
		for idx, rankJump := range jumps {
			outputFile.WriteString(fmt.Sprintf("%d, %s, %s, %d\n", idx, rankJump.phrase1, rankJump.phrase2, rankJump.jump))
		}

		// record phraseSimilarities for next round
		prevSimRanks = simRanks
	}
}

GitHub Events

Total
Last Year

Issues and Pull Requests

Last synced: about 3 years ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads: unknown
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 2
proxy.golang.org: github.com/wujunfeng1/citationgraphs

CitationGraphs.go project CitationGraphs.go CitationGraphs.go document

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 7.0%
Average: 8.2%
Dependent repos count: 9.3%
Last synced: 10 months ago
proxy.golang.org: github.com/wujunfeng1/CitationGraphs

CitationGraphs.go project CitationGraphs.go CitationGraphs.go document

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 7.0%
Average: 8.2%
Dependent repos count: 9.3%
Last synced: over 1 year ago