Science Score: 18.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
○codemeta.json file
-
○.zenodo.json file
-
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (0.5%) to scientific vocabulary
Last synced: 10 months ago
·
JSON representation
·
Repository
A go package for citation graphs.
Basic Info
- Host: GitHub
- Owner: wujunfeng1
- License: mit
- Language: Go
- Default Branch: main
- Size: 72.3 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Created over 5 years ago
· Last pushed about 5 years ago
Metadata Files
Readme
License
Citation
README.md
CitationGraphs.go
A go package for citation graphs.
Citation (CitationGraphs.go)
// CitationGraphs.go project CitationGraphs.go
package CitationGraphs
import (
"bufio"
"encoding/json"
"fmt"
"log"
"math"
"math/rand"
"os"
"regexp"
"runtime"
"sort"
"strconv"
"strings"
"sync"
"time"
"github.com/chrisport/go-lang-detector/langdet"
"github.com/chrisport/go-lang-detector/langdet/langdetdef"
"github.com/wujunfeng1/ConcurrenceBasedClustering"
"github.com/wujunfeng1/KeyphraseExtraction"
"github.com/wujunfeng1/wego/pkg/model/modelutil/vector"
"github.com/wujunfeng1/wego/pkg/model/word2vec"
)
var reUnicodeHex *regexp.Regexp
var reUnicodeDec *regexp.Regexp
var langDetector langdet.Detector
func init() {
reUnicodeHex = regexp.MustCompile("&//[Xx]([A-Fa-f0-9])+;")
reUnicodeDec = regexp.MustCompile("&//([0-9])+;")
rand.Seed(time.Now().Unix())
langDetector = langdetdef.NewWithDefaultLanguages()
}
// =================================================================================================
// struct CitationNode
// brief description: The node structure of a citation graph, the data of which are collected
// from crawling the website https://academic.microsoft.com , which is the
// public service website of MAG(Microsoft Academic Graph)
// fields:
// id: The MAG(Microsoft Academic Graph) id of the paper at this node.
// We can access the detail of the paper with the id by navigating the web link:
// https://academic.microsoft.com/paper/$id
// year: The year of publication of this paper.
// title: The title of the paper.
// labels: The labels from MAG(Microsoft Academic Graph).
// refs: The references of this paper collected from MAG.
// Please note that this field could be inaccurate: MAG suffers delay of info.
// Many new papers with refs don't have refs listed in MAG.
// cites: The citations to this paper from other papers (also collected from MAG).
// Please note that this field is as inaccurate as refs due to the same reason.
type CitationNode struct {
ID int64
Year int64
Title string
Labels []string
Refs []int64
Cites []int64
}
// =================================================================================================
// struct CitationGraph
// brief description: A data structure of citation graph, the data of which are collected from
// the website https://academic.microsoft.com , which is the public service
// website of MAG(Microsoft Academic Graph)
// fields:
// nodes: The node of struct CitationNode stored in a dictionary with key = id
// toBeAnalyzed: The list of nodes to be analyzed. In ijcai_clustering dataset, these nodes
// are those published in IJCAI.
type CitationGraph struct {
Nodes map[int64]*CitationNode
ToBeAnalyzed []int64
idxMainNodes map[int64]int
}
// =================================================================================================
// struct Corpus
// brief description: the corpus data structure
type Corpus struct {
Vocab map[string]int // the vocabulary
Docs []map[int]int // keys: docID, wordID, value: word count
}
// =================================================================================================
// struct CorpusX
// brief description: the extended corpus data structure
type CorpusX struct {
Vocab map[string]int // the vocabulary
Docs [][]map[int]int // keys: docID, wordID, value: word count
}
// =================================================================================================
// struct CorpusSeq
// brief description: the extended corpus data structure
type CorpusSeq struct {
Vocab map[string]int // the vocabulary
Docs [][]int // keys: docID, wordIdx, value: word id
}
// =================================================================================================
// func NewCorpus
// brief description: create an empty corpus
func NewCorpus() *Corpus {
return &Corpus{
Vocab: map[string]int{},
Docs: []map[int]int{},
}
}
// =================================================================================================
// func NewCorpusX
// brief description: create an empty corpus
func NewCorpusX() *CorpusX {
return &CorpusX{
Vocab: map[string]int{},
Docs: [][]map[int]int{},
}
}
// =================================================================================================
// func NewCorpusSeq
// brief description: create an empty corpus
func NewCorpusSeq() *CorpusSeq {
return &CorpusSeq{
Vocab: map[string]int{},
Docs: [][]int{},
}
}
// =================================================================================================
// func (this *Corpus) AddDoc
// brief description: add one document to corpus with specified docId and word count list, if the
// specified docId already exists in corpus, the old doc will be overwritted
func (this *Corpus) AddDoc(words []string) {
// ---------------------------------------------------------------------------------------------
// step 1: create a new doc and count word counts
doc := map[int]int{}
for _, word := range words {
wordID, exists := this.Vocab[word]
if !exists {
wordID = len(this.Vocab)
this.Vocab[word] = wordID
}
count, exists := doc[wordID]
if !exists {
count = 0
}
doc[wordID] = count + 1
}
// ---------------------------------------------------------------------------------------------
// step 2: add the new doc into docs
this.Docs = append(this.Docs, doc)
}
// =================================================================================================
// func (this *CorpusX) AddDoc
// brief description: add one document to corpus with specified docId and word count list, if the
// specified docId already exists in corpus, the old doc will be overwritted
func (this *CorpusX) AddDoc(words [][]string) {
// ---------------------------------------------------------------------------------------------
// step 1: create a new doc and count word counts
doc := make([]map[int]int, len(words))
for idx, wordGroup := range words {
doc[idx] = map[int]int{}
for _, word := range wordGroup {
wordID, exists := this.Vocab[word]
if !exists {
wordID = len(this.Vocab)
this.Vocab[word] = wordID
}
count, exists := doc[idx][wordID]
if !exists {
count = 0
}
doc[idx][wordID] = count + 1
}
}
// ---------------------------------------------------------------------------------------------
// step 2: add the new doc into docs
this.Docs = append(this.Docs, doc)
}
// =================================================================================================
// func (this *CorpusSeq) AddDoc
// brief description: add one document to corpus with specified docId and word count list, if the
// specified docId already exists in corpus, the old doc will be overwritted
func (this *CorpusSeq) AddDoc(words []string) {
// ---------------------------------------------------------------------------------------------
// step 1: create a new doc and count word counts
doc := make([]int, len(words))
for idx, word := range words {
wordID, exists := this.Vocab[word]
if !exists {
wordID = len(this.Vocab)
this.Vocab[word] = wordID
}
doc[idx] = wordID
}
// ---------------------------------------------------------------------------------------------
// step 2: add the new doc into docs
this.Docs = append(this.Docs, doc)
}
// =================================================================================================
// func (this *CorpusSeq) GetConcurrences
// brief description: get concurrences from corpus
func (this *CorpusSeq) GetConcurrences() map[uint]map[uint]float64 {
// ---------------------------------------------------------------------------------------------
// step 1: create concurrences
concurrences := map[uint]map[uint]float64{}
// ---------------------------------------------------------------------------------------------
// step 2: spawn goroutines to count concurrences when words are in the same document
numCPUs := runtime.NumCPU()
numDocs := len(this.Docs)
chI := make(chan int)
chWorkers := make(chan bool)
lock := sync.RWMutex{}
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
go func() {
myConcurrences := map[uint]map[uint]float64{}
for i0 := range chI {
i1 := i0 + 100
if i1 > numDocs {
i1 = numDocs
}
for i := i0; i < i1; i++ {
wordCount := this.Docs[i]
for w1, count1 := range wordCount {
concurrencesOfW1, exists := myConcurrences[uint(w1)]
if !exists {
concurrencesOfW1 = map[uint]float64{}
myConcurrences[uint(w1)] = concurrencesOfW1
}
for w2, count2 := range wordCount {
if w1 != w2 {
count, exists := concurrencesOfW1[uint(w2)]
if !exists {
count = 0.0
}
concurrencesOfW1[uint(w2)] = count + float64(count1*count2)
}
}
}
}
}
// now we merge the results into concurrences
lock.Lock()
defer lock.Unlock()
for key1, value1 := range myConcurrences {
row, exists := concurrences[key1]
if !exists {
concurrences[key1] = value1
} else {
for key2, value2 := range value1 {
oldValue, exists := row[key2]
if !exists {
oldValue = 0.0
}
row[key2] = oldValue + value2
}
}
}
// tell main thread that this worker finishes
chWorkers <- true
}()
}
// ---------------------------------------------------------------------------------------------
// step 3: send tasks through chI
for i := 0; i < numDocs; i += 100 {
chI <- i
}
close(chI)
// ---------------------------------------------------------------------------------------------
// step 4: wait for all workers
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
<-chWorkers
}
// ---------------------------------------------------------------------------------------------
// step 5: return the concurrences
return concurrences
}
// =================================================================================================
// func (this *CorpusSeq) Word2Phrase
func (this *CorpusSeq) Word2Phrase(numIters, minFreq int, minScore float64) *CorpusSeq {
if numIters <= 0 {
return this
}
// ---------------------------------------------------------------------------------------------
// step 1: count unigram/bigram freqs
freqs1 := map[int]int{}
freqs2 := map[int]map[int]int{}
for _, doc := range this.Docs {
for idx, wordID := range doc {
oldFreqUnigram, exists := freqs1[wordID]
if !exists {
oldFreqUnigram = 0
}
freqs1[wordID] = oldFreqUnigram + 1
if idx > 0 {
prevWordID := doc[idx-1]
freqsOfPrev, exists := freqs2[prevWordID]
if !exists {
freqsOfPrev = map[int]int{}
freqs2[prevWordID] = freqsOfPrev
}
oldFreqBigram, exists := freqsOfPrev[wordID]
if !exists {
oldFreqBigram = 0
}
freqsOfPrev[wordID] = oldFreqBigram + 1
}
}
}
// ---------------------------------------------------------------------------------------------
// step 2: clone the vacabulary and creat the reverse map
vocab := map[string]int{}
words := make([]string, len(this.Vocab))
for word, wordID := range this.Vocab {
vocab[word] = wordID
words[wordID] = word
}
// ---------------------------------------------------------------------------------------------
// step 3: compute score for all those bigrams occur at least minFreq times
for prevWordID, freqsOfPrev := range freqs2 {
for wordID, freq := range freqsOfPrev {
// skip those occur less than or equal to minFreq times
if freq <= minFreq {
continue
}
// compute score
score := float64(freq-minFreq) / float64(freqs1[prevWordID]*freqs1[wordID])
// skip those with small scores
if score < minScore {
continue
}
// put the detected phrase into vocab
phraseID := len(vocab)
phraseText := words[prevWordID] + " " + words[wordID]
vocab[phraseText] = phraseID
fmt.Printf("%s: score %f\n", phraseText, score)
}
}
// ---------------------------------------------------------------------------------------------
// step 4: create docs with these unigrams and selected bigrams
newDocs := make([][]int, len(this.Docs))
for idxDoc, doc := range this.Docs {
newDoc := []int{}
lenDoc := len(doc)
idxWord := 0
for idxWord < lenDoc {
wordID := doc[idxWord]
useBigram := true
if idxWord+1 < lenDoc {
nextWordID := doc[idxWord+1]
bigramFreq := freqs2[wordID][nextWordID]
if bigramFreq > minFreq {
score := float64(bigramFreq-minFreq) /
float64(freqs1[wordID]*freqs1[nextWordID])
if score < minScore {
useBigram = false
}
} else {
useBigram = false
}
} else {
useBigram = false
}
if useBigram {
nextWordID := doc[idxWord+1]
bigramText := words[wordID] + " " + words[nextWordID]
phraseID, exists := vocab[bigramText]
if !exists {
log.Fatalln("corrupted vocab")
}
newDoc = append(newDoc, phraseID)
idxWord += 2
} else {
newDoc = append(newDoc, wordID)
idxWord++
}
}
newDocs[idxDoc] = newDoc
}
// ---------------------------------------------------------------------------------------------
// step 5: create a new corpus with unigrams and selected bigrams
newCorpus := &CorpusSeq{Vocab: vocab, Docs: newDocs}
// ---------------------------------------------------------------------------------------------
// step 6: return the result
if numIters == 1 {
return newCorpus
} else {
return newCorpus.Word2Phrase(numIters-1, minFreq, minScore)
}
}
// =================================================================================================
// func (this *CorpusSeq) Word2PhraseEx
func (this *CorpusSeq) Word2PhraseEx(numIters, minFreq int, minScore float64) *CorpusSeq {
if numIters <= 0 {
return this
}
// ---------------------------------------------------------------------------------------------
// step 1: count unigram/bigram freqs
freqs1 := map[int]int{}
freqs2 := map[int]map[int]int{}
freqs3 := map[int]map[int]int{}
for _, doc := range this.Docs {
for idx, wordID := range doc {
oldFreqUnigram, exists := freqs1[wordID]
if !exists {
oldFreqUnigram = 0
}
freqs1[wordID] = oldFreqUnigram + 1
if idx > 0 {
prevWordID := doc[idx-1]
freqsOfPrev, exists := freqs2[prevWordID]
if !exists {
freqsOfPrev = map[int]int{}
freqs2[prevWordID] = freqsOfPrev
}
oldFreqBigram, exists := freqsOfPrev[wordID]
if !exists {
oldFreqBigram = 0
}
freqsOfPrev[wordID] = oldFreqBigram + 1
freqsOfCurr, exists := freqs3[wordID]
if !exists {
freqsOfCurr = map[int]int{}
freqs3[wordID] = freqsOfCurr
}
oldFreqBigramRev, exists := freqsOfCurr[prevWordID]
if !exists {
oldFreqBigramRev = 0
}
freqsOfCurr[prevWordID] = oldFreqBigramRev + 1
}
}
}
// ---------------------------------------------------------------------------------------------
// step 2: clone the vacabulary and creat the reverse map
vocab := map[string]int{}
words := make([]string, len(this.Vocab))
for word, wordID := range this.Vocab {
vocab[word] = wordID
words[wordID] = word
}
// ---------------------------------------------------------------------------------------------
// step 3: compute score for all those bigrams occur at least minFreq times
for prevWordID, freqsOfPrev := range freqs2 {
for wordID, freq := range freqsOfPrev {
// skip those occur less than or equal to minFreq times
if freq <= minFreq {
continue
}
// compute score
cond1 := float64(freq) / float64(freqs1[prevWordID])
cond2 := float64(freq) / float64(freqs1[wordID])
avg1 := 1.0 / float64(len(freqs2[prevWordID]))
avg2 := 1.0 / float64(len(freqs3[wordID]))
score := cond1 * cond2 / (avg1 * avg2)
// skip those with small scores
if score < minScore {
continue
}
// put the detected phrase into vocab
phraseID := len(vocab)
phraseText := words[prevWordID] + " " + words[wordID]
vocab[phraseText] = phraseID
//fmt.Printf("%s: score %f\n", phraseText, score)
}
}
// ---------------------------------------------------------------------------------------------
// step 4: create docs with these unigrams and selected bigrams
newDocs := make([][]int, len(this.Docs))
for idxDoc, doc := range this.Docs {
newDoc := []int{}
lenDoc := len(doc)
idxWord := 0
for idxWord < lenDoc {
wordID := doc[idxWord]
useBigram := true
if idxWord+1 < lenDoc {
nextWordID := doc[idxWord+1]
bigramFreq := freqs2[wordID][nextWordID]
if bigramFreq > minFreq {
cond1 := float64(bigramFreq) / float64(freqs1[wordID])
cond2 := float64(bigramFreq) / float64(freqs1[nextWordID])
avg1 := 1.0 / float64(len(freqs2[wordID]))
avg2 := 1.0 / float64(len(freqs3[nextWordID]))
score := cond1 * cond2 / (avg1 * avg2)
if score < minScore {
useBigram = false
}
} else {
useBigram = false
}
} else {
useBigram = false
}
if useBigram {
nextWordID := doc[idxWord+1]
bigramText := words[wordID] + " " + words[nextWordID]
phraseID, exists := vocab[bigramText]
if !exists {
log.Fatalln("corrupted vocab")
}
newDoc = append(newDoc, phraseID)
idxWord += 2
} else {
newDoc = append(newDoc, wordID)
idxWord++
}
}
newDocs[idxDoc] = newDoc
}
// ---------------------------------------------------------------------------------------------
// step 5: create a new corpus with unigrams and selected bigrams
newCorpus := &CorpusSeq{Vocab: vocab, Docs: newDocs}
// ---------------------------------------------------------------------------------------------
// step 6: return the result
if numIters == 1 {
return newCorpus
} else {
return newCorpus.Word2PhraseEx(numIters-1, minFreq, minScore)
}
}
// =================================================================================================
// func (this *Corpus) GetConcurrences
// brief description: get concurrences from corpus
func (this *Corpus) GetConcurrences() map[uint]map[uint]float64 {
// ---------------------------------------------------------------------------------------------
// step 1: create concurrences
concurrences := map[uint]map[uint]float64{}
// ---------------------------------------------------------------------------------------------
// step 2: spawn goroutines to count concurrences when words are in the same document
numCPUs := runtime.NumCPU()
numDocs := len(this.Docs)
chI := make(chan int)
chWorkers := make(chan bool)
lock := sync.RWMutex{}
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
go func() {
myConcurrences := map[uint]map[uint]float64{}
for i0 := range chI {
i1 := i0 + 100
if i1 > numDocs {
i1 = numDocs
}
for i := i0; i < i1; i++ {
wordCount := this.Docs[i]
for w1, count1 := range wordCount {
concurrencesOfW1, exists := myConcurrences[uint(w1)]
if !exists {
concurrencesOfW1 = map[uint]float64{}
myConcurrences[uint(w1)] = concurrencesOfW1
}
for w2, count2 := range wordCount {
if w1 != w2 {
count, exists := concurrencesOfW1[uint(w2)]
if !exists {
count = 0.0
}
concurrencesOfW1[uint(w2)] = count + float64(count1*count2)
}
}
}
}
}
// now we merge the results into concurrences
lock.Lock()
defer lock.Unlock()
for key1, value1 := range myConcurrences {
row, exists := concurrences[key1]
if !exists {
concurrences[key1] = value1
} else {
for key2, value2 := range value1 {
oldValue, exists := row[key2]
if !exists {
oldValue = 0.0
}
row[key2] = oldValue + value2
}
}
}
// tell main thread that this worker finishes
chWorkers <- true
}()
}
// ---------------------------------------------------------------------------------------------
// step 3: send tasks through chI
for i := 0; i < numDocs; i += 100 {
chI <- i
}
close(chI)
// ---------------------------------------------------------------------------------------------
// step 4: wait for all workers
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
<-chWorkers
}
// ---------------------------------------------------------------------------------------------
// step 5: return the concurrences
return concurrences
}
// =================================================================================================
// func (this *CorpusX) GetExclusions
// brief description: get exclusions from corpus
func (this *CorpusX) GetExclusions() map[uint]map[uint]bool {
// ---------------------------------------------------------------------------------------------
// step 1: create empty exclusions
exclusions := map[uint]map[uint]bool{}
// ---------------------------------------------------------------------------------------------
// step 2: create texts
numTexts := len(this.Vocab)
texts := make([]string, numTexts)
for text, id := range this.Vocab {
texts[id] = text
}
// ---------------------------------------------------------------------------------------------
// step 3: spawn goroutines to find exclusions when words are in the same group
numCPUs := runtime.NumCPU()
numDocs := len(this.Docs)
chI := make(chan int)
chExclusions := make(chan map[uint]map[uint]bool)
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
go func() {
myExclusions := map[uint]map[uint]bool{}
for i0 := range chI {
i1 := i0 + 100
if i1 > numDocs {
i1 = numDocs
}
for i := i0; i < i1; i++ {
wordGroups := this.Docs[i]
for _, wordCount := range wordGroups {
for w1, _ := range wordCount {
exclusionsOfW1, exists := myExclusions[uint(w1)]
if !exists {
exclusionsOfW1 = map[uint]bool{}
myExclusions[uint(w1)] = exclusionsOfW1
}
text1 := texts[w1]
for w2, _ := range wordCount {
if w2 != w1 {
text2 := texts[w2]
if KeyphraseExtraction.Overlaps(text1, text2) {
exclusionsOfW1[uint(w2)] = true
}
}
}
}
}
}
}
// give main thread my results
chExclusions <- myExclusions
}()
}
// ---------------------------------------------------------------------------------------------
// step 4: send tasks through chI
for i := 0; i < numDocs; i += 100 {
chI <- i
}
close(chI)
// ---------------------------------------------------------------------------------------------
// step 5: merge results
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
myExclusions := <-chExclusions
for w1, deltaExclusionsOfW1 := range myExclusions {
exclusionsOfW1, exists := exclusions[w1]
if exists {
for w2, _ := range deltaExclusionsOfW1 {
exclusionsOfW1[w2] = true
}
} else {
exclusions[w1] = deltaExclusionsOfW1
}
}
}
// ---------------------------------------------------------------------------------------------
// step 6: return results
return exclusions
}
// =================================================================================================
// func (this *CorpusX) GetConcurrences
// brief description: get concurrences from corpus
func (this *CorpusX) GetConcurrences() map[uint]map[uint]float64 {
// ---------------------------------------------------------------------------------------------
// step 1: create empty concurrences
concurrences := map[uint]map[uint]float64{}
// ---------------------------------------------------------------------------------------------
// step 2: spawn goroutines to count concurrences when words are in the same document
numCPUs := runtime.NumCPU()
numDocs := len(this.Docs)
chI := make(chan int)
chWorkers := make(chan bool)
lock := sync.RWMutex{}
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
go func() {
myConcurrences := map[uint]map[uint]float64{}
for i0 := range chI {
i1 := i0 + 100
if i1 > numDocs {
i1 = numDocs
}
for i := i0; i < i1; i++ {
wordGroups := this.Docs[i]
for idx1, wordCount1 := range wordGroups {
for w1, count1 := range wordCount1 {
concurrencesOfW1, exists := myConcurrences[uint(w1)]
if !exists {
concurrencesOfW1 = map[uint]float64{}
myConcurrences[uint(w1)] = concurrencesOfW1
}
for idx2, wordCount2 := range wordGroups {
// skip same word group
if idx1 == idx2 {
continue
}
for w2, count2 := range wordCount2 {
if w1 != w2 {
count, exists := concurrencesOfW1[uint(w2)]
if !exists {
count = 0.0
}
concurrencesOfW1[uint(w2)] = count + float64(count1*count2)
}
}
}
}
}
}
}
// now we merge the results into concurrences
lock.Lock()
defer lock.Unlock()
for key1, value1 := range myConcurrences {
row, exists := concurrences[key1]
if !exists {
concurrences[key1] = value1
} else {
for key2, value2 := range value1 {
oldValue, exists := row[key2]
if !exists {
oldValue = 0.0
}
row[key2] = oldValue + value2
}
}
}
// tell main thread that this worker finishes
chWorkers <- true
}()
}
// ---------------------------------------------------------------------------------------------
// step 3: send tasks through chI
for i := 0; i < numDocs; i += 100 {
chI <- i
}
close(chI)
// ---------------------------------------------------------------------------------------------
// step 4: wait for all workers
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
<-chWorkers
}
// ---------------------------------------------------------------------------------------------
// step 5: return the concurrences
return concurrences
}
// =================================================================================================
// func (this *CorpusX) GetConcurrences
// brief description: get concurrences from corpus
func (this *CorpusX) GetDocConcurrences() map[uint]map[uint]float64 {
// ---------------------------------------------------------------------------------------------
// step 1: create empty concurrences
concurrences := map[uint]map[uint]float64{}
// ---------------------------------------------------------------------------------------------
// step 2: spawn goroutines to count concurrences when words are in the same document
numCPUs := runtime.NumCPU()
numDocs := len(this.Docs)
chI := make(chan int)
chWorkers := make(chan bool)
lock := sync.RWMutex{}
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
go func() {
myConcurrences := map[uint]map[uint]float64{}
for i0 := range chI {
i1 := i0 + 100
if i1 > numDocs {
i1 = numDocs
}
for i := i0; i < i1; i++ {
wordGroups := this.Docs[i]
for idx1, wordCount1 := range wordGroups {
for w1, _ := range wordCount1 {
concurrencesOfW1, exists := myConcurrences[uint(w1)]
visited := map[int]bool{}
if !exists {
concurrencesOfW1 = map[uint]float64{}
myConcurrences[uint(w1)] = concurrencesOfW1
}
for idx2, wordCount2 := range wordGroups {
// skip same word group
if idx1 == idx2 {
continue
}
for w2, _ := range wordCount2 {
_, w2Visited := visited[w2]
if w1 != w2 && !w2Visited {
count, exists := concurrencesOfW1[uint(w2)]
if !exists {
count = 0.0
}
concurrencesOfW1[uint(w2)] = count + 1.0
visited[w2] = true
}
}
}
}
}
}
}
// now we merge the results into concurrences
lock.Lock()
defer lock.Unlock()
for key1, value1 := range myConcurrences {
row, exists := concurrences[key1]
if !exists {
concurrences[key1] = value1
} else {
for key2, value2 := range value1 {
oldValue, exists := row[key2]
if !exists {
oldValue = 0.0
}
row[key2] = oldValue + value2
}
}
}
// tell main thread that this worker finishes
chWorkers <- true
}()
}
// ---------------------------------------------------------------------------------------------
// step 3: send tasks through chI
for i := 0; i < numDocs; i += 100 {
chI <- i
}
close(chI)
// ---------------------------------------------------------------------------------------------
// step 4: wait for all workers
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
<-chWorkers
}
// ---------------------------------------------------------------------------------------------
// step 5: return the concurrences
return concurrences
}
// =================================================================================================
// func (this *Corpus) translate
func (this *Corpus) translate(subCorpus *Corpus) *Corpus {
oldSubVocab := subCorpus.Vocab
translation := map[int]int{}
for word, oldID := range oldSubVocab {
newID, exists := this.Vocab[word]
if !exists {
log.Fatal("not a sub corpus")
}
translation[oldID] = newID
}
newDocs := make([]map[int]int, len(subCorpus.Docs))
for docID, oldDoc := range subCorpus.Docs {
newDoc := map[int]int{}
for oldID, count := range oldDoc {
newDoc[translation[oldID]] = count
}
newDocs[docID] = newDoc
}
return &Corpus{Vocab: this.Vocab, Docs: newDocs}
}
// =================================================================================================
// func (this *CorpusX) translate
func (this *CorpusX) translate(subCorpus *CorpusX) *CorpusX {
oldSubVocab := subCorpus.Vocab
translation := map[int]int{}
for word, oldID := range oldSubVocab {
newID, exists := this.Vocab[word]
if !exists {
log.Fatal("not a sub corpus")
}
translation[oldID] = newID
}
newDocs := make([][]map[int]int, len(subCorpus.Docs))
for docID, oldDoc := range subCorpus.Docs {
newDoc := make([]map[int]int, len(oldDoc))
for idxGroup, oldWordGroup := range oldDoc {
newDoc[idxGroup] = map[int]int{}
for oldID, count := range oldWordGroup {
newDoc[idxGroup][translation[oldID]] = count
}
}
newDocs[docID] = newDoc
}
return &CorpusX{Vocab: this.Vocab, Docs: newDocs}
}
// =================================================================================================
// interface LDAModel
// brief description: the common interface of LDA models
type TopicModel interface {
// train model for iter iteration
Train(numIters int)
// do inference for new doc with its wordCounts
Infer(doc map[int]int) []float64
// compute entropy
ComputeEntropy() float64
// compute relative entropy
ComputeRelativeEntropy() float64
}
// =================================================================================================
// struct DocWord
type DocWord struct {
DocId int
WordId int
}
// =================================================================================================
// struct LDA
// brief description: the data structure of LDA model with Collapsed Gibbs Sampler
// note:
// The fast collapsed gibbs sampler algorithm can be found in reference:
// Porteous, I., Newman, D., Ihler, A., Asuncion, A., Smyth, P., & Welling, M. (2008, August). Fast
// collapsed gibbs sampling for latent dirichlet allocation. In Proceedings of the 14th ACM SIGKDD
// international conference on Knowledge discovery and data mining (pp. 569-577).
type LDA struct {
Alpha float64 // document topic mixture hyperparameter
Beta float64 // topic word mixture hyperparameter
NumTopics int // number of topics
Data *Corpus // the input corpus
WordTopicCount [][]int // word-topic count table
DocTopicCount [][]int // doc-topic count table
TopicCountSum []int // word-topic-sum count table
DocWordToTopic map[DocWord][]int // doc-word-topic count table
}
// =================================================================================================
// func NewLDA
// brief description: create an LDA instance with collapsed gibbs sampler
func NewLDA(numTopics int, alpha float64, beta float64, data *Corpus) *LDA {
// ---------------------------------------------------------------------------------------------
// step 1: check parameters
if data == nil {
log.Fatalln("corpus is nil")
}
if numTopics <= 0 {
log.Fatalln("numTopics cannot <= 0.")
}
// ---------------------------------------------------------------------------------------------
// step 2: create wordTopicCount
numWords := len(data.Vocab)
wordTopicCount := make([][]int, numWords)
for w := 0; w < numWords; w++ {
topicCountOfW := make([]int, numTopics)
wordTopicCount[w] = topicCountOfW
}
// ---------------------------------------------------------------------------------------------
// step 3: create docTopicCount
numDocs := len(data.Docs)
docTopicCount := make([][]int, numDocs)
for doc := 0; doc < numDocs; doc++ {
topicCountOfDoc := make([]int, numTopics)
docTopicCount[doc] = topicCountOfDoc
}
// ---------------------------------------------------------------------------------------------
// step 4: create topicCountSum
topicCountSum := make([]int, numTopics)
// ---------------------------------------------------------------------------------------------
// step 5: create docWordToTopic
docWordToTopic := map[DocWord][]int{}
for doc, wordCounts := range data.Docs {
for w, count := range wordCounts {
docWord := DocWord{doc, w}
toTopic := make([]int, count)
docWordToTopic[docWord] = toTopic
}
}
// ---------------------------------------------------------------------------------------------
// step 6: assemble the result
result := &LDA{
Alpha: alpha,
Beta: beta,
NumTopics: numTopics,
Data: data,
WordTopicCount: wordTopicCount,
DocTopicCount: docTopicCount,
TopicCountSum: topicCountSum,
DocWordToTopic: docWordToTopic,
}
// ---------------------------------------------------------------------------------------------
// step 7: return the result
return result
}
// =================================================================================================
// func (this *LDA) updateCounters
func (this *LDA) updateCounters() {
// ---------------------------------------------------------------------------------------------
// step 1: initialize Counters
numTopics := this.NumTopics
for idxTopic := 0; idxTopic < numTopics; idxTopic++ {
this.TopicCountSum[idxTopic] = 0
}
numWords := len(this.Data.Vocab)
for w := 0; w < numWords; w++ {
topicCountOfW := this.WordTopicCount[w]
for idxTopic := 0; idxTopic < numTopics; idxTopic++ {
topicCountOfW[idxTopic] = 0
}
}
numDocs := len(this.Data.Docs)
for doc := 0; doc < numDocs; doc++ {
topicCountOfDoc := this.DocTopicCount[doc]
for idxTopic := 0; idxTopic < numTopics; idxTopic++ {
topicCountOfDoc[idxTopic] = 0
}
}
// ---------------------------------------------------------------------------------------------
// step 2: count them!
for doc, wordCounts := range this.Data.Docs {
for w, count := range wordCounts {
toTopic := this.DocWordToTopic[DocWord{doc, w}]
for i := 0; i < count; i++ {
// sample word topic
k := toTopic[i]
// update word topic count
this.WordTopicCount[w][k]++
// update doc topic count
this.DocTopicCount[doc][k]++
// update topic count sum
this.TopicCountSum[k]++
}
}
}
}
// =================================================================================================
// func (this *LDA) Init
func (this *LDA) Init() {
// ---------------------------------------------------------------------------------------------
// step 1: randomly assign topic to word
for doc, wordCounts := range this.Data.Docs {
for w, count := range wordCounts {
toTopic := this.DocWordToTopic[DocWord{doc, w}]
for i := 0; i < count; i++ {
// sample word topic
k := rand.Intn(this.NumTopics)
// record the topic assignment
toTopic[i] = k
}
}
}
// ---------------------------------------------------------------------------------------------
// step 2: update counters
this.updateCounters()
}
// =================================================================================================
// func (this *LDA) probTopicOfDocWord
func (this *LDA) probTopicOfDocWord(doc int, w int, k int, idxK int) float64 {
numWords := float64(len(this.Data.Vocab))
myDocTopicCount := float64(this.DocTopicCount[doc][idxK])
myWordTopicCount := float64(this.WordTopicCount[w][idxK])
myTopicCountSum := float64(this.TopicCountSum[idxK])
if idxK == k {
myDocTopicCount--
myWordTopicCount--
myTopicCountSum--
}
docPart := this.Alpha + myDocTopicCount
wordPart := (this.Beta + myWordTopicCount) / (this.Beta*numWords + myTopicCountSum)
return docPart * wordPart
}
// =================================================================================================
// func (this *LDA) ResampleTopics
func (this *LDA) ResampleTopics(numIters int) {
// ---------------------------------------------------------------------------------------------
// step 1: resample topics in parallel
numCPUs := runtime.NumCPU()
for idxIter := 0; idxIter < numIters; idxIter++ {
log.Printf("iter %5d, relative entropy %f", idxIter, this.ComputeRelativeEntropy())
// (1.1) create channels for input and output
chDocs := make(chan []int)
chWorkers := make(chan bool)
// (1.2) create goroutines for resampling
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
go func() {
prefixSum := make([]float64, this.NumTopics)
for docs := range chDocs {
for doc := range docs {
wordCounts := this.Data.Docs[doc]
for w, count := range wordCounts {
docWord := DocWord{doc, w}
toTopic := this.DocWordToTopic[docWord]
for i := 0; i < count; i++ {
k := toTopic[i]
// (1.2.1) compute resampling probabilities
for idxK := 0; idxK < this.NumTopics; idxK += 1 {
prob := this.probTopicOfDocWord(doc, w, k, idxK)
if idxK == 0 {
prefixSum[idxK] = prob
} else {
prefixSum[idxK] = prefixSum[idxK-1] + prob
}
}
// (1.2.2) use resampling probabilities to resample topic
u := rand.Float64() * prefixSum[this.NumTopics-1]
for idxK := 0; idxK < this.NumTopics; idxK++ {
if u < prefixSum[idxK] {
k = idxK
break
}
}
// (1.2.3) record the new topic
toTopic[i] = k
}
}
}
}
chWorkers <- true
}()
}
// (1.3) generate inputs
numDocs := len(this.Data.Docs)
for doc := 0; doc < numDocs; doc += 100 {
lenDocs := 100
if doc+lenDocs > numDocs {
lenDocs = numDocs - doc
}
docs := make([]int, lenDocs)
for i := 0; i < lenDocs; i++ {
docs[i] = doc + i
}
chDocs <- docs
}
close(chDocs)
// (1.4) wait for all workers to complete their jobs
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
<-chWorkers
}
// (1.5) update the counters
this.updateCounters()
}
// step 2: report entropy
log.Printf("final relative entropy %f", this.ComputeRelativeEntropy())
}
// =================================================================================================
// func (this *LDA) Train
// brief description: train model
func (this *LDA) Train(numIters int) {
// randomly init
this.Init()
// resample topics
this.ResampleTopics(numIters)
}
// =================================================================================================
// func (this *LDA) Infer
// brief description: infer topics on new documents
func (this *LDA) Infer(wordCounts map[int]int) []float64 {
// ---------------------------------------------------------------------------------------------
// step 1: compute unscaled probabilities
probs := make([]float64, this.NumTopics)
sumProbs := float64(0.0)
numWords := len(this.Data.Vocab)
for idxK := 0; idxK < this.NumTopics; idxK++ {
myProb := float64(0.0)
for w, count := range wordCounts {
myWordTopicCount := float64(this.WordTopicCount[w][idxK])
myTopicCountSum := float64(this.TopicCountSum[idxK])
myProb += float64(count) * (this.Beta + myWordTopicCount) /
(this.Beta*float64(numWords) + myTopicCountSum)
}
sumProbs += myProb
probs[idxK] = myProb
}
// ---------------------------------------------------------------------------------------------
// step 2: scale the probs to make their sum == 1.0
if sumProbs == 0.0 {
sumProbs = 1.0
}
for idxK := 0; idxK < this.NumTopics; idxK++ {
probs[idxK] /= sumProbs
}
// ---------------------------------------------------------------------------------------------
// step 3: return the result
return probs
}
// =================================================================================================
// func (this *LDA) ComputeEntropy
// brief description: compute entropy
func (this *LDA) ComputeEntropy() float64 {
entropy := 0.0
sumCount := 0
for doc, _ := range this.Data.Docs {
myTopicCount := this.DocTopicCount[doc]
mySumCount := 0
for k := 0; k < this.NumTopics; k++ {
mySumCount += myTopicCount[k]
}
myEntropy := 0.0
for k := 0; k < this.NumTopics; k++ {
if myTopicCount[k] > 0 {
myP := float64(myTopicCount[k]) / float64(mySumCount)
myEntropy -= myP * math.Log(myP)
}
}
sumCount += mySumCount
entropy += float64(mySumCount) * myEntropy
}
entropy /= float64(sumCount)
return entropy
}
// =================================================================================================
// func (this *LDA) ComputeEntropy
// brief description: compute entropy
func (this *LDA) ComputeRelativeEntropy() float64 {
entropy := this.ComputeEntropy()
maxEntropy := -math.Log(1.0 / float64(this.NumTopics))
return entropy / maxEntropy
}
// =================================================================================================
// function convertUnicodeHex
// brief description:
// Convert a hex representation string of unicode to the unicode.
// input:
// s: The hex representation.
// output:
// The unicode.
func convertUnicodeHex(s []byte) []byte {
ss := s[3 : len(s)-1]
u, err := strconv.ParseInt(string(ss), 16, 64)
if err != nil {
return []byte("<?>")
} else {
return []byte(fmt.Sprintf("%c", u))
}
}
// =================================================================================================
// function convertUnicodeDec
// brief description:
// Convert a dec representation string of unicode to the unicode.
// input:
// s: The dec representation.
// output:
// The unicode.
func convertUnicodeDec(s []byte) []byte {
ss := s[2 : len(s)-1]
u, err := strconv.ParseInt(string(ss), 10, 64)
if err != nil {
return []byte("<?>")
} else {
return []byte(fmt.Sprintf("%c", u))
}
}
// =================================================================================================
// function TidyTitle
// brief description: Make a title more tidy before using it. The procedure include the
// steps:
// (1) remove the spaces at the head and the tail of the title,
// (2) replace "<" with "<",
// (3) replace ">" with ">",
// (4) replace "&" with "&",
// (5) replace """ with "\"",
// (6) replace "'" with "'",
// (7) replace "&//[number];" with a corresponding unicode
// input:
// title: The String (text) of a title.
// output:
// A string of the tidied-up title.
func TidyTitle(title string) string {
// --------------------------------------------------------------------------------------
// step 1: remove the spaces at the head and the tail
result := strings.TrimSpace(title)
// --------------------------------------------------------------------------------------
// step 2: replace "<" with "<"
result = strings.Replace(result, "<", "<", -1)
// --------------------------------------------------------------------------------------
// step 3: replace ">" with ">"
result = strings.Replace(result, ">", ">", -1)
// --------------------------------------------------------------------------------------
// step 4: replace "&" with "&"
result = strings.Replace(result, "&", "&", -1)
// --------------------------------------------------------------------------------------
// step 5: replace """ with "\""
result = strings.Replace(result, """, "\"", -1)
// --------------------------------------------------------------------------------------
// step 6: replace "'" with "'"
result = strings.Replace(result, "'", "'", -1)
// --------------------------------------------------------------------------------------
// step 7: replace "&//[number];" with a corresponding unicode
byteResult := []byte(result)
byteResult = reUnicodeHex.ReplaceAllFunc(byteResult, convertUnicodeHex)
byteResult = reUnicodeDec.ReplaceAllFunc(byteResult, convertUnicodeDec)
result = string(byteResult)
// --------------------------------------------------------------------------------------
// step 8: return the result
return result
}
// =================================================================================================
// function LoadCitationGraph
// brief description:
// Load data from three files (nodes, edges, labels) of a citation graph.
// input:
// path: The name of the path where the three files are stored.
// prefix: The prefix of the names of the three files. For example, the prefix of
// ijcai-citation-graph-nodes.csv is ijcai.
// output:
// The citation graph represented by the three files.
func LoadCitationGraph(path string, prefix string) *CitationGraph {
// --------------------------------------------------------------------------------------
// step 1: Prepare the data structure of the result.
result := new(CitationGraph)
result.Nodes = make(map[int64]*CitationNode)
result.idxMainNodes = map[int64]int{}
// --------------------------------------------------------------------------------------
// step 2: Assemble the file names for nodes, edges, and labels.
fileNameOfNodes := path + "/" + prefix + "-citation-graph-nodes.csv"
fileNameOfEdges := path + "/" + prefix + "-citation-graph-edges.csv"
fileNameOfLabels := path + "/" + prefix + "-citation-graph-labels.csv"
// --------------------------------------------------------------------------------------
// step 3: Load the nodes.
// (3.1) open the file of Nodes
fileOfNodes, err := os.Open(fileNameOfNodes)
if err != nil {
log.Fatal(err)
}
defer fileOfNodes.Close()
scannerOfNodes := bufio.NewScanner(fileOfNodes)
// (3.2) Examine the first line to check whether the file format is correct.
if !scannerOfNodes.Scan() {
log.Fatal("Cannot read " + fileNameOfNodes)
}
firstLine := scannerOfNodes.Text()
columnNames := strings.Split(firstLine, ",")
if len(columnNames) != 4 {
log.Fatal("Incorrect file format of " + fileNameOfNodes)
}
if strings.TrimSpace(columnNames[0]) != "#id" ||
strings.TrimSpace(columnNames[1]) != "in-"+prefix ||
strings.TrimSpace(columnNames[2]) != "year" ||
strings.TrimSpace(columnNames[3]) != "title" {
log.Fatal("Incorrect file format of " + fileNameOfNodes)
}
// (3.3) Read the rest of lines.
for scannerOfNodes.Scan() {
line := scannerOfNodes.Text()
columns := strings.Split(line, ",")
for i := 1; i < 4; i++ {
columns[i] = strings.TrimSpace(columns[i])
}
id, _ := strconv.ParseInt(columns[0], 10, 64)
isMainNode, _ := strconv.ParseBool(columns[1])
year, _ := strconv.ParseInt(columns[2], 10, 64)
title := strings.ReplaceAll(columns[3], "[comma]", ",")
node := new(CitationNode)
node.ID = id
node.Year = year
node.Title = title
result.Nodes[id] = node
if isMainNode {
result.idxMainNodes[id] = len(result.ToBeAnalyzed)
result.ToBeAnalyzed = append(result.ToBeAnalyzed, id)
}
}
// --------------------------------------------------------------------------------------
// step 4: Load the edges.
// (4.1) Open the file of edges
fileOfEdges, err := os.Open(fileNameOfEdges)
if err != nil {
log.Fatal(err)
}
defer fileOfEdges.Close()
scannerOfEdges := bufio.NewScanner(fileOfEdges)
// (4.2) Examine the first line to check whether the file format is correct.
if !scannerOfEdges.Scan() {
log.Fatal("Cannot read " + fileNameOfEdges)
}
firstLine = scannerOfEdges.Text()
columnNames = strings.Split(firstLine, ",")
if len(columnNames) != 2 {
log.Fatal("Incorrect file format of " + fileNameOfEdges)
}
if strings.TrimSpace(columnNames[0]) != "#id" ||
strings.TrimSpace(columnNames[1]) != "ref-id" {
log.Fatal("Incorrect file format of " + fileNameOfEdges)
}
// (4.3) read the rest of lines
for scannerOfEdges.Scan() {
line := scannerOfEdges.Text()
columns := strings.Split(line, ",")
for i := 1; i < 2; i++ {
columns[i] = strings.TrimSpace(columns[i])
}
id, _ := strconv.ParseInt(columns[0], 10, 64)
refID, _ := strconv.ParseInt(columns[1], 10, 64)
node, _ := result.Nodes[id]
refNode, _ := result.Nodes[refID]
node.Refs = append(node.Refs, refID)
refNode.Cites = append(refNode.Cites, id)
}
// --------------------------------------------------------------------------------------
// step 5: Load the labels.
// (5.1) Open the file of labels
fileOfLabels, err := os.Open(fileNameOfLabels)
if err != nil {
log.Fatal(err)
}
defer fileOfLabels.Close()
scannerOfLabels := bufio.NewScanner(fileOfLabels)
// (5.2) Examine the first line to check whether the file format is correct.
if !scannerOfLabels.Scan() {
log.Fatal("Cannot read " + fileNameOfLabels)
}
firstLine = scannerOfLabels.Text()
columnNames = strings.Split(firstLine, ",")
if len(columnNames) != 2 {
log.Fatal("Incorrect file format of " + fileNameOfLabels)
}
if strings.TrimSpace(columnNames[0]) != "#id" ||
strings.TrimSpace(columnNames[1]) != "label" {
log.Fatal("Incorrect file format of " + fileNameOfEdges)
}
// (5.3) Read the rest of lines.
for scannerOfLabels.Scan() {
line := scannerOfLabels.Text()
columns := strings.Split(line, ",")
for i := 1; i < 2; i++ {
columns[i] = strings.TrimSpace(columns[i])
}
id, _ := strconv.ParseInt(columns[0], 10, 64)
label := strings.TrimSpace(columns[1])
node, _ := result.Nodes[id]
node.Labels = append(node.Labels, label)
}
// --------------------------------------------------------------------------------------
// step 6: Return the result
return result
}
// =================================================================================================
// struct GSDMM
// brief description: the data structure of GSDMM model
type GSDMM struct {
Alpha float64 // document topic mixture hyperparameter
Beta float64 // topic word mixture hyperparameter
NumTopics int // number of topics
Data *Corpus // the input corpus
NumWordsInDoc []int // the number of words in documents
DocTopic []int // doc-topic count table
TopicWordCount [][]int // word-topic count table
TopicWordCountSum []int // word-topic-sum count table
TopicDocCount []int // topic-doc-sum count table
}
// =================================================================================================
// func NewGSDMM
// brief description: create an LDA instance with collapsed gibbs sampler
func NewGSDMM(numTopics int, alpha float64, beta float64, data *Corpus) *GSDMM {
// ---------------------------------------------------------------------------------------------
// step 1: check parameters
if data == nil {
log.Fatalln("corpus is nil")
}
if numTopics <= 0 {
log.Fatalln("numTopics cannot <= 0.")
}
// ---------------------------------------------------------------------------------------------
// step 2: create docTopic
numDocs := len(data.Docs)
docTopic := make([]int, numDocs)
// ---------------------------------------------------------------------------------------------
// step 3: create topicWordCount
numWords := len(data.Vocab)
topicWordCount := make([][]int, numTopics)
for k := 0; k < numTopics; k++ {
topicWordCount[k] = make([]int, numWords)
}
// ---------------------------------------------------------------------------------------------
// step 4: create topicWordCountSum
topicWordCountSum := make([]int, numTopics)
// ---------------------------------------------------------------------------------------------
// step 5: create topicDocCount
topicDocCount := make([]int, numTopics)
// ---------------------------------------------------------------------------------------------
// step 6: create NumWordsInDoc
numWordsInDoc := make([]int, numDocs)
for doc, wordCounts := range data.Docs {
numWordsInDoc[doc] = 0
for _, count := range wordCounts {
numWordsInDoc[doc] += count
}
}
// ---------------------------------------------------------------------------------------------
// step 7: assemble the result
result := &GSDMM{
Alpha: alpha,
Beta: beta,
NumTopics: numTopics,
Data: data,
NumWordsInDoc: numWordsInDoc,
DocTopic: docTopic,
TopicWordCount: topicWordCount,
TopicWordCountSum: topicWordCountSum,
TopicDocCount: topicDocCount,
}
// ---------------------------------------------------------------------------------------------
// step 8: return the result
return result
}
// =================================================================================================
// func (this *GSDMM) updateCounters
func (this *GSDMM) updateCounters() {
// ---------------------------------------------------------------------------------------------
// step 1: initialize Counters
numTopics := this.NumTopics
numWords := len(this.Data.Vocab)
for idxTopic := 0; idxTopic < numTopics; idxTopic++ {
this.TopicWordCountSum[idxTopic] = 0
this.TopicDocCount[idxTopic] = 0
wordCountOfTopic := this.TopicWordCount[idxTopic]
for w := 0; w < numWords; w++ {
wordCountOfTopic[w] = 0
}
}
// ---------------------------------------------------------------------------------------------
// step 2: count them!
for doc, wordCounts := range this.Data.Docs {
k := this.DocTopic[doc]
this.TopicDocCount[k]++
for w, count := range wordCounts {
this.TopicWordCount[k][w] += count
this.TopicWordCountSum[k] += count
}
}
}
// =================================================================================================
// func (this *GSDMM) Init
func (this *GSDMM) Init() {
// ---------------------------------------------------------------------------------------------
// step 1: randomly assign topic to doc
for doc, _ := range this.Data.Docs {
// sample document topic
this.DocTopic[doc] = rand.Intn(this.NumTopics)
}
// ---------------------------------------------------------------------------------------------
// step 2: update counters
this.updateCounters()
}
// =================================================================================================
// func (this *GSDMM) probTopicOfDoc
func (this *GSDMM) probTopicOfDoc(doc int, k int, idxK int) float64 {
numTopics := float64(this.NumTopics)
numDocs := float64(len(this.Data.Docs))
docCountOfTopic := float64(this.TopicDocCount[idxK])
if idxK == k {
docCountOfTopic--
}
docPart := (docCountOfTopic + this.Alpha) / (numDocs - 1.0 + this.Alpha*numTopics)
topicWordCountSum := float64(this.TopicWordCountSum[idxK])
wordCounts := this.Data.Docs[doc]
if idxK == k {
topicWordCountSum -= float64(this.NumWordsInDoc[doc])
}
numWords := float64(len(this.Data.Vocab))
idxWordInDoc := 0
wordPart := 1.0
for w, count := range wordCounts {
wordCountOfTopic := float64(this.TopicWordCount[idxK][w])
if idxK == k {
wordCountOfTopic -= float64(count)
}
for j := 0; j < count; j++ {
wordPart *= (wordCountOfTopic + this.Beta + float64(j)) / (topicWordCountSum +
this.Beta*numWords + float64(idxWordInDoc))
idxWordInDoc++
}
}
//fmt.Printf("docPart = %f, wordPart = %v, prod = %v\n", docPart, wordPart, docPart*wordPart)
return docPart * wordPart
}
// =================================================================================================
// func (this *GSDMM) ResampleTopics
func (this *GSDMM) ResampleTopics(numIters int) {
// ---------------------------------------------------------------------------------------------
// step 1: resample topics in parallel
numCPUs := runtime.NumCPU()
for idxIter := 0; idxIter < numIters; idxIter++ {
log.Printf("iter %5d, relative entropy %f", idxIter, this.ComputeRelativeEntropy())
// (1.1) create channels for input and output
chDocs := make(chan []int)
chWorkers := make(chan bool)
// (1.2) create goroutines for resampling
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
go func() {
prefixSum := make([]float64, this.NumTopics)
for docs := range chDocs {
//fmt.Println("-------------------------------------------")
for _, doc := range docs {
// (1.2.1) compute resampling probabilities
k := this.DocTopic[doc]
for idxK := 0; idxK < this.NumTopics; idxK++ {
prob := this.probTopicOfDoc(doc, k, idxK)
if idxK == 0 {
prefixSum[idxK] = prob
} else {
prefixSum[idxK] = prefixSum[idxK-1] + prob
}
}
// if doc < 10 {
// fmt.Printf("%d: %v\n", doc, prefixSum)
// }
// (1.2.2) use resampling probabilities to resample topic
u := rand.Float64() * prefixSum[this.NumTopics-1]
for idxK := 0; idxK < this.NumTopics; idxK++ {
if u < prefixSum[idxK] {
k = idxK
break
}
}
// (1.2.3) record the new topic
this.DocTopic[doc] = k
}
}
chWorkers <- true
}()
}
// (1.3) generate inputs
numDocs := len(this.Data.Docs)
for doc := 0; doc < numDocs; doc += 100 {
lenDocs := 100
if doc+lenDocs > numDocs {
lenDocs = numDocs - doc
}
docs := make([]int, lenDocs)
for i := 0; i < lenDocs; i++ {
docs[i] = doc + i
}
chDocs <- docs
}
close(chDocs)
// (1.4) wait for all workers to complete their jobs
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
<-chWorkers
}
// (1.5) update the counters
this.updateCounters()
}
// step 2: report entropy
log.Printf("final relative entropy %f", this.ComputeRelativeEntropy())
}
// =================================================================================================
// func (this *GSDMM) Train
// brief description: train model
func (this *GSDMM) Train(numIters int) {
// randomly init
this.Init()
// resample topics
this.ResampleTopics(numIters)
}
// =================================================================================================
// func (this *GSDMM) Infer
// brief description: infer topics on new documents
func (this *GSDMM) Infer(wordCounts map[int]int) []float64 {
// ---------------------------------------------------------------------------------------------
// step 1: compute unscaled probabilities
probs := make([]float64, this.NumTopics)
sumProbs := float64(0.0)
numTopics := float64(len(this.Data.Docs))
numWords := float64(len(this.Data.Vocab))
for idxK := 0; idxK < this.NumTopics; idxK++ {
// -----------------------------------------------------------------------------------------
// (1.1) compute doc part of probs
docCountOfTopic := float64(this.TopicDocCount[idxK])
docPart := (docCountOfTopic + this.Alpha) / (docCountOfTopic - 1.0 + this.Alpha*numTopics)
// -----------------------------------------------------------------------------------------
// (1.2) compute word part of probs
topicWordCountSum := float64(this.TopicWordCountSum[idxK])
idxWordInDoc := 0
wordPart := 1.0
for w, count := range wordCounts {
wordCountOfTopic := float64(this.TopicWordCount[idxK][w])
for j := 0; j < count; j++ {
wordPart *= (wordCountOfTopic + this.Beta + float64(j)) / (topicWordCountSum +
this.Beta*numWords + float64(idxWordInDoc))
idxWordInDoc++
}
}
// -----------------------------------------------------------------------------------------
// (1.3) assemble myProb
myProb := docPart * wordPart
sumProbs += myProb
probs[idxK] = myProb
}
// ---------------------------------------------------------------------------------------------
// step 2: scale the probs to make their sum == 1.0
if sumProbs == 0.0 {
sumProbs = 1.0
}
for idxK := 0; idxK < this.NumTopics; idxK++ {
probs[idxK] /= sumProbs
}
// ---------------------------------------------------------------------------------------------
// step 3: return the result
return probs
}
// =================================================================================================
// func (this *GSDMM) ComputeEntropy
// brief description: compute entropy
func (this *GSDMM) ComputeEntropy() float64 {
entropy := 0.0
numDocs := float64(len(this.Data.Docs))
for doc, _ := range this.Data.Docs {
k := this.DocTopic[doc]
myProbs := make([]float64, this.NumTopics)
mySumProbs := 0.0
for idxK := 0; idxK < this.NumTopics; idxK++ {
myProbs[idxK] = this.probTopicOfDoc(doc, k, idxK)
mySumProbs += myProbs[idxK]
}
myEntropy := 0.0
if mySumProbs > 0.0 {
for idxK := 0; idxK < this.NumTopics; idxK++ {
myP := myProbs[idxK] / mySumProbs
if myP == 0.0 {
continue
}
//fmt.Printf("%d: myP = %f\n", idxK, myP)
myEntropy -= myP * math.Log(myP)
}
}
entropy += myEntropy / numDocs
}
return entropy
}
// =================================================================================================
// func (this *GSDMM) ComputeEntropy
// brief description: compute entropy
func (this *GSDMM) ComputeRelativeEntropy() float64 {
entropy := this.ComputeEntropy()
maxEntropy := -math.Log(1.0 / float64(this.NumTopics))
return entropy / maxEntropy
}
// =================================================================================================
// struct WPDM
// brief description: the data structure of WPDM model
type WPDM struct {
eps float64 // radius of neighborhood
minPts uint // minimum points for a dense neighborhood in DBScan, set this to 0 for AHC
Data *Corpus // the input corpus
DocTopic []int // doc-topic table
NumTopics int // number of topics
}
// ===========================================================================================
// function SaveCitationGraph
// brief description: Save data to three files (nodes, edges, labels) of a citation graph.
// input:
// path: The name of the path where the three files are stored.
// prefix: The prefix of the names of the three files. For example, the prefix of
// ijcai-citation-graph-nodes.csv is ijcai.
// citationGraph: The citation graph represented by the three files.
// output:
// nothing
func SaveCitationGraph(path string, prefix string, citationGraph *CitationGraph) {
// --------------------------------------------------------------------------------------
// step 1: Assemble the file names for nodes, edges, and labels.
fileNameOfNodes := path + "/" + prefix + "-citation-graph-nodes.csv"
fileNameOfEdges := path + "/" + prefix + "-citation-graph-edges.csv"
fileNameOfLabels := path + "/" + prefix + "-citation-graph-labels.csv"
// --------------------------------------------------------------------------------------
// step 2: Save the nodes.
// (2.1) Open the file of Nodes for writing
fileOfNodes, err := os.Create(fileNameOfNodes)
if err != nil {
log.Fatal(err)
}
defer fileOfNodes.Close()
// (2.2) print the first line with the file format info
_, err = fileOfNodes.WriteString("#id, in-" + prefix + ", year, title\n")
if err != nil {
log.Fatal(err)
}
// (2.3) save data into the rest of lines
idSetForAnalysis := make(map[int64]int)
for i, v := range citationGraph.ToBeAnalyzed {
idSetForAnalysis[v] = i
}
for id, node := range citationGraph.Nodes {
_, toBeAnalyzed := idSetForAnalysis[id]
year := node.Year
title := TidyTitle(strings.ReplaceAll(node.Title, ",", "[comma]"))
_, err = fileOfNodes.WriteString(fmt.Sprintf("%d, %t, %d, %s\n", id, toBeAnalyzed, year, title))
if err != nil {
log.Fatal(err)
}
}
// --------------------------------------------------------------------------------------
// step 3: Save the edges.
// (3.1) Open the file of Edges for writing
fileOfEdges, err := os.Create(fileNameOfEdges)
if err != nil {
log.Fatal(err)
}
defer fileOfEdges.Close()
// (3.2) print the first line with the file format info
_, err = fileOfEdges.WriteString("#id, ref-id\n")
if err != nil {
log.Fatal(err)
}
// (3.3) save data into the rest of lines
edgeSet := make(map[int64]map[int64]bool)
for id, node := range citationGraph.Nodes {
setOfID, exists := edgeSet[id]
if !exists {
setOfID = make(map[int64]bool)
edgeSet[id] = setOfID
}
for _, refID := range node.Refs {
setOfID[refID] = true
}
for _, citeID := range node.Cites {
setOfCiteID, exists := edgeSet[citeID]
if !exists {
setOfCiteID = make(map[int64]bool)
edgeSet[citeID] = setOfCiteID
}
setOfCiteID[id] = true
}
}
for id, setOfID := range edgeSet {
for refID, _ := range setOfID {
fileOfEdges.WriteString(fmt.Sprintf("%d, %d\n", id, refID))
}
}
// --------------------------------------------------------------------------------------
// step 4: Save the labels.
// (4.1) Open the file of labels for writing
fileOfLabels, err := os.Create(fileNameOfLabels)
if err != nil {
log.Fatal(err)
}
defer fileOfLabels.Close()
// (4.2) print the first line with the file format info
_, err = fileOfLabels.WriteString("#id, label\n")
if err != nil {
log.Fatal(err)
}
// (4.3) save data into the rest of lines
for id, node := range citationGraph.Nodes {
for _, label := range node.Labels {
fileOfLabels.WriteString(fmt.Sprintf("%d, %s\n", id, label))
}
}
}
// =================================================================================================
// method TFIDF
// brief description: compute the TFIDF of possible key phrases for each main node of a citationGraph
// input:
// nothing
// output:
// the result of TFIDF grouped by the main nodes of the CitationGraph
func (g *CitationGraph) TFIDF() []map[string]float64 {
candidateTFGroups := []map[string]uint{}
phraseCandidateGroups := [][]string{}
for _, id := range g.ToBeAnalyzed {
// get phrase candidates
node := g.Nodes[id]
phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(node.Title)
// collect a list of auxiliary phrases
auxPhrases := []string{}
for _, refID := range node.Refs {
refPhrases := KeyphraseExtraction.ExtractKeyPhraseCandidates(g.Nodes[refID].Title)
for _, phrase := range refPhrases {
auxPhrases = append(auxPhrases, phrase)
}
}
// compute TF
candidateTFGroups = append(candidateTFGroups, KeyphraseExtraction.TF(phraseCandidates, auxPhrases))
// append this group of phraseCandidates to the candidate groups
phraseCandidateGroups = append(phraseCandidateGroups, phraseCandidates)
}
// compute IDF
IDFs := KeyphraseExtraction.IDF(phraseCandidateGroups)
// TFIDF = TF * IDF
result := []map[string]float64{}
for _, candidateTFs := range candidateTFGroups {
groupResult := map[string]float64{}
for text, tf := range candidateTFs {
idf, exists := IDFs[text]
if !exists {
log.Fatal(fmt.Sprintf("phrase %s not found in IDFs", text))
}
groupResult[text] = float64(tf) * idf
}
result = append(result, groupResult)
}
// return the result
return result
}
// =================================================================================================
// method SimTFIDF
// brief description: compute the Fuzzy TFIDF of possible key phrases for each main node of a citationGraph
// input:
// phraseSimilarity: similarities of phrases
// output:
// the result of Fuzzy TFIDF grouped by the main nodes of the CitationGraph
func (g *CitationGraph) SimTFIDF(phraseSimilarity map[string]map[string]float64) []map[string]float64 {
numMainNodes := len(g.ToBeAnalyzed)
candidateTFGroups := make([]map[string]float64, numMainNodes)
phraseCandidateGroups := make([][]string, numMainNodes)
numCPUs := runtime.NumCPU()
chI := make(chan int)
chWorkers := make(chan bool)
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
go func() {
for idx0 := range chI {
idx1 := idx0 + 100
if idx1 > numMainNodes {
idx1 = numMainNodes
}
for idxID := idx0; idxID < idx1; idxID++ {
// get phrase candidates
id := g.ToBeAnalyzed[idxID]
node := g.Nodes[id]
phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(node.Title)
// collect a list of auxiliary phrases
auxPhrases := []string{}
for _, refID := range node.Refs {
refPhrases := KeyphraseExtraction.ExtractKeyPhraseCandidates(g.Nodes[refID].Title)
for _, phrase := range refPhrases {
auxPhrases = append(auxPhrases, phrase)
}
}
// compute Sim TF
candidateTFGroups[idxID] = KeyphraseExtraction.SimTF(phraseCandidates, auxPhrases, phraseSimilarity)
// insert this group of phraseCandidates to the candidate groups
phraseCandidateGroups[idxID] = phraseCandidates
}
}
chWorkers <- true
}()
}
for i := 0; i < numMainNodes; i += 100 {
chI <- i
}
close(chI)
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
<-chWorkers
}
fmt.Println("All sim TF computed")
// compute IDF
IDFs := KeyphraseExtraction.IDF(phraseCandidateGroups)
fmt.Println(("sim IDF computed"))
// TFIDF = TF * IDF
result := make([]map[string]float64, numMainNodes)
for idx, candidateTFs := range candidateTFGroups {
groupResult := map[string]float64{}
for text, tf := range candidateTFs {
idf, exists := IDFs[text]
if !exists {
log.Fatal(fmt.Sprintf("phrase %s not found in IDFs", text))
}
groupResult[text] = float64(tf) * idf
}
sortedGroupResult := KeyphraseExtraction.ArgSort(groupResult)
selectedGroupResult := map[string]float64{}
for i := 0; i < len(sortedGroupResult); i++ {
phraseI := sortedGroupResult[i]
selected := true
for phraseJ, _ := range selectedGroupResult {
if KeyphraseExtraction.Includes(phraseJ, phraseI) ||
KeyphraseExtraction.Includes(phraseI, phraseJ) {
selected = false
break
}
}
if selected {
selectedGroupResult[phraseI] = groupResult[phraseI]
}
}
result[idx] = selectedGroupResult
}
// return the result
return result
}
// =================================================================================================
// method SimTFSimIDF
// brief description: compute the Fuzzy TFIDF of possible key phrases for each main node of a citationGraph
// input:
// nothing
// output:
// the result of Fuzzy TFIDF grouped by the main nodes of the CitationGraph
func (g *CitationGraph) SimTFSimIDF(phraseSimilarity map[string]map[string]float64) []map[string]float64 {
candidateTFGroups := []map[string]float64{}
phraseCandidateGroups := [][]string{}
for idxID, id := range g.ToBeAnalyzed {
// get phrase candidates
node := g.Nodes[id]
phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(node.Title)
// collect a list of auxiliary phrases
auxPhrases := []string{}
for _, refID := range node.Refs {
refPhrases := KeyphraseExtraction.ExtractKeyPhraseCandidates(g.Nodes[refID].Title)
for _, phrase := range refPhrases {
auxPhrases = append(auxPhrases, phrase)
}
}
// compute Fuzzy TF
candidateTFGroups = append(candidateTFGroups, KeyphraseExtraction.SimTF(phraseCandidates, auxPhrases, phraseSimilarity))
// append this group of phraseCandidates to the candidate groups
phraseCandidateGroups = append(phraseCandidateGroups, phraseCandidates)
if (idxID+1)%1000 == 0 {
fmt.Printf("%d of %d sim TF computed\n", idxID+1, len(g.ToBeAnalyzed))
}
}
fmt.Println("All sim TF computed")
// compute IDF
IDFs := KeyphraseExtraction.SimIDF(phraseCandidateGroups, phraseSimilarity)
fmt.Println(("All sim IDF computed"))
// TFIDF = TF * IDF
result := []map[string]float64{}
for _, candidateTFs := range candidateTFGroups {
groupResult := map[string]float64{}
for text, tf := range candidateTFs {
idf, exists := IDFs[text]
if !exists {
log.Fatal(fmt.Sprintf("phrase %s not found in IDFs", text))
}
groupResult[text] = float64(tf) * idf
}
result = append(result, groupResult)
}
// return the result
return result
}
// =================================================================================================
// method GetPhraseSimilarity
// brief description: compute the similarities between pairs of phrases and gives them in form of a
// sparse matrix using Title Link method
// input:
// nothing
// output:
// the sparse matrix of phrase similarities
// note:
// The title link method can be found in:
// Bogomolova, A., Ryazanova, M., & Balk, I. (2021). Cluster approach to analysis of publication
// titles. In Journal of Physics: Conference Series (Vol. 1727, No. 1, p. 012016). IOP Publishing.
func (g *CitationGraph) GetPhraseSimilarity(simType int) map[string]map[string]float64 {
// --------------------------------------------------------------------------------------------
// step 1: create corpus using all nodes
corpus := g.CreateCorpus(2)
// --------------------------------------------------------------------------------------------
// step 2: get concurrences from corpuse
concurrences := corpus.GetConcurrences()
// --------------------------------------------------------------------------------------------
// step 3: create concurrence model from concurrences
cm := ConcurrenceBasedClustering.NewConcurrenceModel()
cm.SetConcurrences(uint(len(corpus.Vocab)), concurrences)
// --------------------------------------------------------------------------------------------
// step 4: get similarity
simMat := map[uint]map[uint]float64{}
switch simType {
case 0:
simMat = cm.InduceSimilarities()
case 1:
simMat = cm.InduceNormalizedSimilarities()
case 2:
simMat = cm.InduceJaccardSimilarities()
case 3:
simMat = cm.InduceWeightedJaccardSimilarities()
case 4:
simMat = cm.InduceNormalizedJaccardSimilarities()
}
// --------------------------------------------------------------------------------------------
// step 6: conver simMat to the result then return it
result := map[string]map[string]float64{}
words := make([]string, len(corpus.Vocab))
for word, idx := range corpus.Vocab {
words[idx] = word
}
for i, row := range simMat {
wordI := words[i]
rowI := map[string]float64{}
for j, value := range row {
wordJ := words[j]
rowI[wordJ] = value
}
result[wordI] = rowI
}
return result
}
// =================================================================================================
// method GetPhraseSimilarityX
// brief description: compute the similarities between pairs of phrases and gives them in form of a
// sparse matrix using Title Link method
// input:
// nothing
// output:
// the sparse matrix of phrase similarities
// note:
// The title link method can be found in:
// Bogomolova, A., Ryazanova, M., & Balk, I. (2021). Cluster approach to analysis of publication
// titles. In Journal of Physics: Conference Series (Vol. 1727, No. 1, p. 012016). IOP Publishing.
func (g *CitationGraph) GetPhraseSimilarityX(simType int) map[string]map[string]float64 {
// --------------------------------------------------------------------------------------------
// step 1: create corpus using all nodes
corpus := g.CreateCorpusX(2)
// --------------------------------------------------------------------------------------------
// step 2: get concurrences and exclusions from corpuse
concurrences := corpus.GetConcurrences()
exclusions := corpus.GetExclusions()
// --------------------------------------------------------------------------------------------
// step 3: create concurrence model from concurrences
cm := ConcurrenceBasedClustering.NewConcurrenceModel()
cm.SetConcurrences(uint(len(corpus.Vocab)), concurrences)
cm.SetExclusions(exclusions)
// --------------------------------------------------------------------------------------------
// step 4: get similarity
simMat := map[uint]map[uint]float64{}
switch simType {
case 0:
simMat = cm.InduceSimilarities()
case 1:
simMat = cm.InduceNormalizedSimilarities()
case 2:
simMat = cm.InduceJaccardSimilarities()
case 3:
simMat = cm.InduceWeightedJaccardSimilarities()
case 4:
simMat = cm.InduceNormalizedJaccardSimilarities()
}
// --------------------------------------------------------------------------------------------
// step 6: conver simMat to the result then return it
result := map[string]map[string]float64{}
words := make([]string, len(corpus.Vocab))
for word, idx := range corpus.Vocab {
words[idx] = word
}
for i, row := range simMat {
wordI := words[i]
rowI := map[string]float64{}
for j, value := range row {
wordJ := words[j]
rowI[wordJ] = value
}
result[wordI] = rowI
}
return result
}
// =================================================================================================
// func (g *CitationGraph) CreateCorpus
// brief description: create a corpus from a citation graph
// input
// corpusType:
// 0 for title + ref titles per document for main nodes,
// 1 for title per document for main nodes,
// 2 for title per document for all nodes,
// 3 for labels per document for main nodes,
func (g *CitationGraph) CreateCorpus(corpusType int) *Corpus {
// ---------------------------------------------------------------------------------------------
// step 1: create an empty corpus
corpus := NewCorpus()
// ---------------------------------------------------------------------------------------------
// step 2: create goroutines to add documents to the corpuse
numCPUs := runtime.NumCPU()
chNodes := make(chan map[int]*CitationNode)
chWorkers := make(chan bool)
numNodes := len(g.ToBeAnalyzed)
if corpusType == 2 {
numNodes = len(g.Nodes)
}
wordsOfNode := make([][]string, numNodes)
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
go func() {
for nodes := range chNodes {
for idxNode, node := range nodes {
// (2.1) create an empty phrase list
words := []string{}
// (2.2) extract words of phrases in title and put them in the phrase list
if corpusType <= 2 {
phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(node.Title)
for _, candidate := range phraseCandidates {
candidateWords := strings.Split(candidate, " ")
for _, word := range candidateWords {
words = append(words, word)
}
}
}
// (2.3) extract words of phrases in reference titles and put them in the
// phrase list
if corpusType <= 0 {
for _, refID := range node.Refs {
refNode := g.Nodes[refID]
phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(refNode.Title)
for _, candidate := range phraseCandidates {
candidateWords := strings.Split(candidate, " ")
for _, word := range candidateWords {
words = append(words, word)
}
}
}
}
// (2.4) stem labels of the node and put them in phrase list
if corpusType == 3 {
stemmedLabels := KeyphraseExtraction.StemPhrases(node.Labels)
for _, stemmedLabel := range stemmedLabels {
words = append(words, stemmedLabel)
}
}
// (2.5) record the phrases
wordsOfNode[idxNode] = words
}
}
chWorkers <- true
}()
}
// ---------------------------------------------------------------------------------------------
// step 3: put nodes into the input channel
nodes := map[int]*CitationNode{}
nodeIDsSent := map[int]bool{}
for idx, id := range g.ToBeAnalyzed {
// (1.1) get a node
node := g.Nodes[id]
// (1.2) append nodes with this node
nodes[idx] = node
nodeIDsSent[int(id)] = true
// (1.3) send nodes when nodes are large enough
if len(nodes) == 100 || idx+1 == len(g.ToBeAnalyzed) {
chNodes <- nodes
nodes = map[int]*CitationNode{}
}
}
if corpusType == 2 {
// send nodes not in main nodes
idx := len(g.ToBeAnalyzed)
for id, node := range g.Nodes {
// (1.4) skips those already sent
_, alreadySent := nodeIDsSent[int(id)]
if alreadySent {
continue
}
// (1.5) append nodes with this node
nodes[idx] = node
nodeIDsSent[int(id)] = true
// (1.6) send nodes when nodes are large enough
if len(nodes) == 100 {
chNodes <- nodes
nodes = map[int]*CitationNode{}
}
idx++
}
// (1.7) send the rest of nodes
if len(nodes) > 0 {
chNodes <- nodes
}
}
// (1.8) close channel
close(chNodes)
// ---------------------------------------------------------------------------------------------
// step 4: wait for all workers
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
<-chWorkers
}
for _, words := range wordsOfNode {
corpus.AddDoc(words)
}
return corpus
}
// =================================================================================================
// func (g *CitationGraph) CreateCorpusX
// brief description: create a corpusX from a citation graph
// input
// corpusType:
// 0 for title + ref titles per document for main nodes,
// 1 for title per document for main nodes,
// 2 for title per document for all nodes,
func (g *CitationGraph) CreateCorpusX(corpusType int) *CorpusX {
// ---------------------------------------------------------------------------------------------
// step 1: create an empty corpus
corpus := NewCorpusX()
// ---------------------------------------------------------------------------------------------
// step 2: create goroutines to add documents to the corpuse
numCPUs := runtime.NumCPU()
chNodes := make(chan map[int]*CitationNode)
chWorkers := make(chan bool)
numNodes := len(g.ToBeAnalyzed)
if corpusType == 2 {
numNodes = len(g.Nodes)
}
wordsOfNode := make([][][]string, numNodes)
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
go func() {
for nodes := range chNodes {
for idxNode, node := range nodes {
// (2.1) create an empty phrase list
words := [][]string{}
// (2.2) extract words of phrases in title and put them in the phrase list
if corpusType <= 2 {
phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(node.Title)
for _, candidate := range phraseCandidates {
candidateWords := KeyphraseExtraction.GetAllPossiblePhrases(candidate)
words = append(words, candidateWords)
}
}
// (2.3) extract words of phrases in reference titles and put them in the
// phrase list
if corpusType <= 0 {
for _, refID := range node.Refs {
refNode := g.Nodes[refID]
phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(refNode.Title)
for _, candidate := range phraseCandidates {
candidateWords := KeyphraseExtraction.GetAllPossiblePhrases(candidate)
words = append(words, candidateWords)
}
}
}
// (2.4) record the phrases
wordsOfNode[idxNode] = words
}
}
chWorkers <- true
}()
}
// ---------------------------------------------------------------------------------------------
// step 3: put nodes into the input channel
nodes := map[int]*CitationNode{}
nodeIDsSent := map[int]bool{}
for idx, id := range g.ToBeAnalyzed {
// (1.1) get a node
node := g.Nodes[id]
// (1.2) append nodes with this node
nodes[idx] = node
nodeIDsSent[int(id)] = true
// (1.3) send nodes when nodes are large enough
if len(nodes) == 100 || idx+1 == len(g.ToBeAnalyzed) {
chNodes <- nodes
nodes = map[int]*CitationNode{}
}
}
if corpusType == 2 {
// send nodes not in main nodes
idx := len(g.ToBeAnalyzed)
for id, node := range g.Nodes {
// (1.4) skips those already sent
_, alreadySent := nodeIDsSent[int(id)]
if alreadySent {
continue
}
// (1.5) append nodes with this node
nodes[idx] = node
nodeIDsSent[int(id)] = true
// (1.6) send nodes when nodes are large enough
if len(nodes) == 100 {
chNodes <- nodes
nodes = map[int]*CitationNode{}
}
idx++
}
// (1.7) send the rest of nodes
if len(nodes) > 0 {
chNodes <- nodes
}
}
// (1.8) close channel
close(chNodes)
// ---------------------------------------------------------------------------------------------
// step 4: wait for all workers
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
<-chWorkers
}
for _, words := range wordsOfNode {
corpus.AddDoc(words)
}
return corpus
}
// =================================================================================================
// func (g *CitationGraph) CreateCorpusSeq
// brief description: create a corpus from a citation graph
// input
// corpusType:
// 0 for title + ref titles per document for main nodes,
// 1 for title per document for main nodes,
// 2 for title per document for all nodes,
// 3 for labels per document for main nodes,
func (g *CitationGraph) CreateCorpusSeq(corpusType int) (corpus *CorpusSeq, years []int,
nodeIDs []int64, isEnglish []bool) {
// ---------------------------------------------------------------------------------------------
// step 1: create an empty corpus and the array for years
corpus = NewCorpusSeq()
numNodes := len(g.ToBeAnalyzed)
if corpusType == 2 {
numNodes = len(g.Nodes)
}
years = make([]int, numNodes)
nodeIDs = make([]int64, numNodes)
isEnglish = make([]bool, numNodes)
// ---------------------------------------------------------------------------------------------
// step 2: create goroutines to add documents to the corpuse
numCPUs := runtime.NumCPU()
chNodes := make(chan map[int]*CitationNode)
chWorkers := make(chan bool)
wordsOfNode := make([][]string, numNodes)
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
go func() {
for nodes := range chNodes {
for idxNode, node := range nodes {
// (2.1) create an empty phrase list
words := []string{}
// (2.2) extract words of phrases in title and put them in the phrase list
if corpusType <= 2 {
phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(node.Title)
for _, candidate := range phraseCandidates {
candidateWords := strings.Split(candidate, " ")
for _, word := range candidateWords {
words = append(words, word)
}
}
}
// (2.3) extract words of phrases in reference titles and put them in the
// phrase list
if corpusType <= 0 {
for _, refID := range node.Refs {
refNode := g.Nodes[refID]
phraseCandidates := KeyphraseExtraction.ExtractKeyPhraseCandidates(refNode.Title)
for _, candidate := range phraseCandidates {
candidateWords := strings.Split(candidate, " ")
for _, word := range candidateWords {
words = append(words, word)
}
}
}
}
// (2.4) stem labels of the node and put them in phrase list
if corpusType == 3 {
stemmedLabels := KeyphraseExtraction.StemPhrases(node.Labels)
for _, stemmedLabel := range stemmedLabels {
words = append(words, stemmedLabel)
}
}
// (2.5) record the phrases
if !isEnglish[idxNode] {
lang := langDetector.GetClosestLanguage(strings.ToLower(node.Title))
if lang == "english" {
isEnglish[idxNode] = true
}
}
wordsOfNode[idxNode] = words
}
}
chWorkers <- true
}()
}
// ---------------------------------------------------------------------------------------------
// step 3: put nodes into the input channel
nodes := map[int]*CitationNode{}
nodeIDsSent := map[int]bool{}
for idx, id := range g.ToBeAnalyzed {
// (1.1) get a node and record its year
node := g.Nodes[id]
years[idx] = int(node.Year)
nodeIDs[idx] = id
isEnglish[idx] = true
// (1.2) append nodes with this node
nodes[idx] = node
nodeIDsSent[int(id)] = true
// (1.3) send nodes when nodes are large enough
if len(nodes) == 100 || idx+1 == len(g.ToBeAnalyzed) {
chNodes <- nodes
nodes = map[int]*CitationNode{}
}
}
if corpusType == 2 {
// send nodes not in main nodes
idx := len(g.ToBeAnalyzed)
for id, node := range g.Nodes {
// (1.4) skips those already sent
_, alreadySent := nodeIDsSent[int(id)]
if alreadySent {
continue
}
// (1.5) append nodes with this node and record its year
nodes[idx] = node
nodeIDsSent[int(id)] = true
years[idx] = int(node.Year)
isEnglish[idx] = false
nodeIDs[idx] = id
// (1.6) send nodes when nodes are large enough
if len(nodes) == 100 {
chNodes <- nodes
nodes = map[int]*CitationNode{}
}
idx++
}
// (1.7) send the rest of nodes
if len(nodes) > 0 {
chNodes <- nodes
}
}
// (1.8) close channel
close(chNodes)
// ---------------------------------------------------------------------------------------------
// step 4: wait for all workers
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
<-chWorkers
}
for _, words := range wordsOfNode {
corpus.AddDoc(words)
}
return
}
// =================================================================================================
// func (g *CitationGraph) GetIdxMainNode
func (g *CitationGraph) GetIdxMainNode(nodeID int64) uint {
idx, exists := g.idxMainNodes[nodeID]
if !exists {
log.Fatalln(fmt.Sprintf("cannot find %v in main node indices\n", nodeID))
}
return uint(idx)
}
// =================================================================================================
// func (g *CitationGraph) ClusterByLDA
// brief description: cluster the main nodes of g with their titles and their reference titles using
// using LDA.
// input:
// numTopics: number of topics
// alpha, beta: the parameters of LDA
// numIters: number of iterations for LDA
// output:
// for each main node, gives the likelihoods the node belonging to a cluster.
func (g *CitationGraph) ClusterByLDA(numTopics int, alpha, beta float64, numIters int,
) map[int64][]float64 {
if numTopics <= 0 || alpha <= 0.0 || beta <= 0.0 || numIters <= 0 { //
log.Fatalln("all parameters of ClusterByLDA must be > 0")
}
// ---------------------------------------------------------------------------------------------
// step 1: create corpus for the main nodes
corpus := g.CreateCorpus(0)
// ---------------------------------------------------------------------------------------------
// step 2: use the corpus to compute LDA
LDA := NewLDA(numTopics, alpha, beta, corpus)
LDA.Train(numIters)
// ---------------------------------------------------------------------------------------------
// step 3: infer cluster memberships for each node
memberships := map[int64][]float64{}
for docID, doc := range corpus.Docs {
nodeID := g.ToBeAnalyzed[docID]
membershipsOfNode := LDA.Infer(doc)
memberships[nodeID] = membershipsOfNode
}
// ---------------------------------------------------------------------------------------------
// step 4: return the result
return memberships
}
// =================================================================================================
// func (g *CitationGraph) ClusterTitlesByWPDM
// brief description: cluster the main nodes of g with their titles and their reference titles using
// using LDA.
// input:
// eps: the radius of neighborhood.
// minPts: Only if the neighborhood of a point contains at least minPt points
// (the center point of the neighborhood included), the neighborhood is
// called dense. Only dense neighborhoods are connected to communities.
// simType: the type of similarity, 0 for simple induced similarity, 1 for normalized
// similarity, 2 for jaccard similarity, 4 for weighted jaccard similarity, 4 for
// normalized jaccard similarity
// workSpaceFileName: a file name for intermediate result
// output:
// for each main node, gives the likelihoods the node belonging to a cluster.
func (g *CitationGraph) ClusterTitlesByWPDM(eps float64, minPts uint, simType int,
workSpaceFileName string) []map[uint]bool {
// ---------------------------------------------------------------------------------------------
// step 1: create corpus for all nodes
corpus := g.CreateCorpus(2)
// ---------------------------------------------------------------------------------------------
// step 2: create concurrences from corpus
concurrences := corpus.GetConcurrences()
// ---------------------------------------------------------------------------------------------
// step 3: create corpus for main nodes
mainCorpus := g.CreateCorpus(1)
// ---------------------------------------------------------------------------------------------
// step 4: translate mainCorpus to corpus
mainCorpus = corpus.translate(mainCorpus)
// ---------------------------------------------------------------------------------------------
// step 5: cluster mainCorpus using WPDM
cm := ConcurrenceBasedClustering.NewConcurrenceModel()
cm.SetPairFilter(0.1, 3.0)
cm.SetConcurrences(uint(len(corpus.Vocab)), concurrences)
groups := make([]map[uint]bool, len(mainCorpus.Docs))
for docID, doc := range mainCorpus.Docs {
group := map[uint]bool{}
for w, _ := range doc {
group[uint(w)] = true
}
groups[docID] = group
}
communities := []map[uint]bool{}
if minPts > 0 {
communities = cm.GroupPairDBScan(groups, eps, minPts, simType, workSpaceFileName)
} else {
communities = cm.GroupPairAHC(groups, eps, simType, workSpaceFileName)
}
// ---------------------------------------------------------------------------------------------
// step 6: return the result
return communities
}
// =================================================================================================
// func (g *CitationGraph) ClusterLabelsByWPDM
// brief description: cluster the main nodes of g with their titles and their reference titles using
// using LDA.
// input:
// eps: the radius of neighborhood.
// minPts: Only if the neighborhood of a point contains at least minPt points
// (the center point of the neighborhood included), the neighborhood is
// called dense. Only dense neighborhoods are connected to communities.
// simType: the type of similarity, 0 for simple induced similarity, 1 for normalized
// similarity, 2 for jaccard similarity, 4 for weighted jaccard similarity, 4 for
// normalized jaccard similarity
// workSpaceFileName: a file name for intermediate result
// output:
// for each main node, gives the likelihoods the node belonging to a cluster.
func (g *CitationGraph) ClusterLabelsByWPDM(eps float64, minPts uint, simType int,
workSpaceFileName string) []map[uint]bool {
// ---------------------------------------------------------------------------------------------
// step 1: create corpus for labels at main nodes
corpus := g.CreateCorpus(3)
// ---------------------------------------------------------------------------------------------
// step 2: create concurrences from corpus
concurrences := corpus.GetConcurrences()
// ---------------------------------------------------------------------------------------------
// step 5: cluster corpuse using WPDM
cm := ConcurrenceBasedClustering.NewConcurrenceModel()
cm.SetConcurrences(uint(len(corpus.Vocab)), concurrences)
groups := make([]map[uint]bool, len(corpus.Docs))
for docID, doc := range corpus.Docs {
group := map[uint]bool{}
for w, _ := range doc {
group[uint(w)] = true
}
groups[docID] = group
}
communities := []map[uint]bool{}
if minPts > 0 {
communities = cm.GroupPairDBScan(groups, eps, minPts, simType, workSpaceFileName)
} else {
communities = cm.GroupPairAHC(groups, eps, simType, workSpaceFileName)
}
// ---------------------------------------------------------------------------------------------
// step 6: return the result
return communities
}
// =================================================================================================
// func (g *CitationGraph) ClusterTitlesByDBScan
// brief description: cluster the main nodes of g with their titles and their reference titles using
// using LDA.
// input:
// eps: the radius of neighborhood.
// minPts: Only if the neighborhood of a point contains at least minPt points
// (the center point of the neighborhood included), the neighborhood is
// called dense. Only dense neighborhoods are connected to communities.
// simType: the type of similarity, 0 for simple induced similarity, 1 for normalized
// similarity, 2 for jaccard similarity, 4 for weighted jaccard similarity, 4 for
// normalized jaccard similarity
// output:
// for each main node, gives the likelihoods the node belonging to a cluster.
// func (g *CitationGraph) ClusterTitlesByDBScan(eps float64, minPts uint, simType int,
// ) map[int64][]float64 {
// }
// =================================================================================================
// func (g *CitationGraph) ClusterLabelsByDBScan
// brief description: cluster the main nodes of g with their titles and their reference titles using
// using LDA.
// input:
// eps: the radius of neighborhood.
// minPts: Only if the neighborhood of a point contains at least minPt points
// (the center point of the neighborhood included), the neighborhood is
// called dense. Only dense neighborhoods are connected to communities.
// simType: the type of similarity, 0 for simple induced similarity, 1 for normalized
// similarity, 2 for jaccard similarity, 4 for weighted jaccard similarity, 4 for
// normalized jaccard similarity
// output:
// for each main node, gives the likelihoods the node belonging to a cluster.
// func (g *CitationGraph) ClusterTitlesByDBScan(eps float64, minPts uint, simType int,
// ) map[int64][]float64 {
// }
// =================================================================================================
// func (g *CitationGraph) ClusterTitlesByGSDMM
// brief description: cluster the main nodes of g with their titles and their reference titles using
// using LDA.
// input:
// simType: the type of similarity, 0 for simple induced similarity, 1 for normalized
// similarity, 2 for jaccard similarity, 4 for weighted jaccard similarity, 4 for
// normalized jaccard similarity
// numTopics: number of topics
// alpha, beta: the parameters of GSDMM
// numIters: number of iterations
// output:
// for each main node, gives the likelihoods the node belonging to a cluster.
func (g *CitationGraph) ClusterTitlesByGSDMM(simType, numTopics int, alpha, beta float64, numIters int,
) map[int64][]float64 {
if numTopics <= 0 || alpha <= 0.0 || beta <= 0.0 || numIters <= 0 {
log.Fatalln("all parameters of ClusterByLDA must be > 0")
}
// ---------------------------------------------------------------------------------------------
// step 1: create corpus for the main nodes
phraseSimilarities := g.GetPhraseSimilarityX(simType)
tfidfGroups := g.SimTFIDF(phraseSimilarities)
corpus := NewCorpus()
for _, tfidfs := range tfidfGroups {
texts := []string{}
sumWeights := 0.0
for _, weight := range tfidfs {
sumWeights += weight
}
meanWeight := sumWeights / float64(len(tfidfs))
for text, weight := range tfidfs {
if weight < 0.5*meanWeight {
continue
}
texts = append(texts, text)
}
corpus.AddDoc(texts)
}
// ---------------------------------------------------------------------------------------------
// step 2: use the corpus to compute LDA
GSDMM := NewGSDMM(numTopics, alpha, beta, corpus)
GSDMM.Train(numIters)
// ---------------------------------------------------------------------------------------------
// step 3: infer cluster memberships for each node
memberships := map[int64][]float64{}
for docID, doc := range corpus.Docs {
nodeID := g.ToBeAnalyzed[docID]
membershipsOfNode := GSDMM.Infer(doc)
memberships[nodeID] = membershipsOfNode
}
// ---------------------------------------------------------------------------------------------
// step 4: return the result
return memberships
}
// =================================================================================================
// func (g *CitationGraph) ClusterLabelsByGSDMM
// brief description: cluster the main nodes of g with their titles and their reference titles using
// using LDA.
// input:
// eps: the radius of neighborhood.
// minPts: Only if the neighborhood of a point contains at least minPt points
// (the center point of the neighborhood included), the neighborhood is
// called dense. Only dense neighborhoods are connected to communities.
// simType: the type of similarity, 0 for simple induced similarity, 1 for normalized
// similarity, 2 for jaccard similarity, 4 for weighted jaccard similarity, 4 for
// normalized jaccard similarity
// output:
// for each main node, gives the likelihoods the node belonging to a cluster.
// func (g *CitationGraph) ClusterTitlesByGSDMM(eps float64, minPts uint, simType int,
// ) map[int64][]float64 {
// }
func (g *CitationGraph) ClusterLabelsByGSDMM(numTopics int, alpha, beta float64, numIters int,
) map[int64][]float64 {
if numTopics <= 0 || alpha <= 0.0 || beta <= 0.0 || numIters <= 0 {
log.Fatalln("all parameters of ClusterByLDA must be > 0")
}
// ---------------------------------------------------------------------------------------------
// step 1: create corpus for labels of the main nodes
corpus := g.CreateCorpus(3)
// reverse corpus Vocab
reverseVocab := map[int]string{}
for w, j := range corpus.Vocab {
reverseVocab[j] = w
}
// ---------------------------------------------------------------------------------------------
// step 2: use the corpus to compute LDA
GSDMM := NewGSDMM(numTopics, alpha, beta, corpus)
GSDMM.Train(numIters)
// ---------------------------------------------------------------------------------------------
// step 3: infer cluster memberships for each node
memberships := map[int64][]float64{}
for docID, doc := range corpus.Docs {
nodeID := g.ToBeAnalyzed[docID]
membershipsOfNode := GSDMM.Infer(doc)
memberships[nodeID] = membershipsOfNode
}
// ---------------------------------------------------------------------------------------------
// step 4: return the result
return memberships
}
// =================================================================================================
// func (g *CitationGraph) checkMemberships
func (g *CitationGraph) checkMemberships(memberships map[int64][]float64) {
if len(memberships) != len(g.ToBeAnalyzed) {
log.Fatalln("length of memberships is incorrect.")
}
numDims := 0
for _, nodeID := range g.ToBeAnalyzed {
membership, exists := memberships[nodeID]
if !exists {
log.Fatalln("some nodes for analysis cannot be found in memberships")
}
if numDims == 0 {
numDims = len(membership)
} else if numDims != len(membership) {
log.Fatalln("number of dimensions are inconsistent in memberships")
}
}
}
// =================================================================================================
// func (g *CitationGraph) checkCommunities
func (g *CitationGraph) checkCommunities(communities []map[uint]bool) {
numNodes := 0
for _, community := range communities {
numNodes += len(community)
}
if numNodes != len(g.ToBeAnalyzed) {
log.Fatalln("total number of nodes in communities is incorrect.")
}
}
// =================================================================================================
// func membCos
func membCos(membershipI, membershipJ []float64) float64 {
if len(membershipI) != len(membershipJ) {
log.Fatalln("lengthes of membershipI and membershipJ don't match.")
}
sim := 0.0
sizeI := 0.0
sizeJ := 0.0
for k, membI := range membershipI {
membJ := membershipJ[k]
sim += membI * membJ
sizeI += membI * membI
sizeJ += membJ * membJ
}
sizeI = math.Sqrt(sizeI)
sizeJ = math.Sqrt(sizeJ)
sim /= sizeI * sizeJ
return sim
}
// =================================================================================================
// func (g *CitationGraph) CompareByModularity
func (g *CitationGraph) CompareByModularity(communities []map[uint]bool,
memberships map[int64][]float64) float64 {
// ---------------------------------------------------------------------------------------------
// step 1: check parameters
g.checkCommunities(communities)
g.checkMemberships(memberships)
// ---------------------------------------------------------------------------------------------
// step 2: compute sumSims
n := len(g.ToBeAnalyzed)
sumSims := make([]float64, n)
numCPUs := runtime.NumCPU()
chI := make(chan int)
chWorkers := make(chan bool)
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
go func() {
for i0 := range chI {
i1 := i0 + 100
if i1 > n {
i1 = n
}
for i := i0; i < i1; i++ {
nodeIDOfI := g.ToBeAnalyzed[i]
membershipI := memberships[nodeIDOfI]
sumSims[i] = 0.0
for _, nodeIDOfJ := range g.ToBeAnalyzed {
membershipJ := memberships[nodeIDOfJ]
sim := membCos(membershipI, membershipJ)
sumSims[i] += sim
}
}
}
chWorkers <- true
}()
}
for i := 0; i < n; i += 100 {
chI <- i
}
close(chI)
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
<-chWorkers
}
totalSumSims := 0.0
for i := 0; i < n; i++ {
totalSumSims += sumSims[i]
}
// ---------------------------------------------------------------------------------------------
// step 3: compute modularity
modularity := 0.0
for _, community := range communities {
for i, _ := range community {
nodeIDOfI := g.ToBeAnalyzed[i]
membershipI := memberships[nodeIDOfI]
for j, _ := range community {
nodeIDOfJ := g.ToBeAnalyzed[j]
membershipJ := memberships[nodeIDOfJ]
sim := membCos(membershipI, membershipJ)
modularity += sim - sumSims[i]*sumSims[j]/totalSumSims
}
}
}
modularity /= totalSumSims
// ---------------------------------------------------------------------------------------------
// step 4: return result
return modularity
}
// =================================================================================================
// func (g *CitationGraph) CompareByCPM
func (g *CitationGraph) CompareByCPM(gamma float64, communities []map[uint]bool,
memberships map[int64][]float64) float64 {
// ---------------------------------------------------------------------------------------------
// step 1: check parameters
g.checkCommunities(communities)
g.checkMemberships(memberships)
// ---------------------------------------------------------------------------------------------
// step 2: compute CPM
cpm := 0.0
for _, community := range communities {
for i, _ := range community {
nodeIDOfI := g.ToBeAnalyzed[i]
membershipI := memberships[nodeIDOfI]
for j, _ := range community {
nodeIDOfJ := g.ToBeAnalyzed[j]
membershipJ := memberships[nodeIDOfJ]
sim := membCos(membershipI, membershipJ)
cpm += sim - gamma
}
}
}
// ---------------------------------------------------------------------------------------------
// step 3: return result
return cpm
}
// =================================================================================================
// func (g *CitationGraph) getCommunitiesFromMemberships
func (g *CitationGraph) GetCommunitiesFromMemberships(memberships map[int64][]float64) []map[uint]bool {
numCommunities := 0
for _, membership := range memberships {
numCommunities = len(membership)
break
}
result := make([]map[uint]bool, numCommunities)
for i := 0; i < numCommunities; i++ {
result[i] = map[uint]bool{}
}
for nodeID, membership := range memberships {
maxMemb := 0.0
idxMaxMemb := 0
for k := 0; k < numCommunities; k++ {
memb := membership[k]
if memb > maxMemb {
maxMemb = memb
idxMaxMemb = k
}
}
result[idxMaxMemb][g.GetIdxMainNode(nodeID)] = true
}
return result
}
// =================================================================================================
// func getCommunityIDFromCommunities
func getCommunityIDsFromCommunities(n int, communities []map[uint]bool) []int {
result := make([]int, n)
for i, community := range communities {
for u, _ := range community {
result[u] = i
}
}
return result
}
// =================================================================================================
// func (g *CitationGraph) CompareByRI
func (g *CitationGraph) CompareByRI(communities []map[uint]bool,
memberships map[int64][]float64) float64 {
// ---------------------------------------------------------------------------------------------
// step 1: check parameters
g.checkCommunities(communities)
g.checkMemberships(memberships)
// ---------------------------------------------------------------------------------------------
// step 2: compute the intersection part
partIntersect := 0.0
myCommunities := g.GetCommunitiesFromMemberships(memberships)
for _, ci := range communities {
if len(ci) == 0 {
continue
}
for _, cj := range myCommunities {
if len(cj) == 0 {
continue
}
cu, cv := ci, cj
if len(cu) > len(cv) {
cu, cv = cv, cu
}
numInBoth := 0
for u, _ := range cu {
_, inCv := cv[u]
if inCv {
numInBoth++
}
}
partIntersect += float64(numInBoth * (numInBoth - 1) / 2)
}
}
// ---------------------------------------------------------------------------------------------
// step 3: compute the cross-boundary part
n := len(g.ToBeAnalyzed)
commuID1s := getCommunityIDsFromCommunities(n, communities)
commuID2s := getCommunityIDsFromCommunities(n, myCommunities)
partCross := 0.0
for u := 0; u < n; u++ {
c1u := commuID1s[u]
c2u := commuID2s[u]
for v := 0; v < n; v++ {
c1v := commuID1s[v]
c2v := commuID2s[v]
if c1u == c1v || c2u == c2v {
continue
}
partCross += 0.5
}
}
// ---------------------------------------------------------------------------------------------
// step 4: normalize and return the result
partN := float64(n * (n - 1) / 2)
fmt.Printf("part intersect = %f, partCross = %f, partN = %f\n", partIntersect, partCross, partN)
result := (partIntersect + partCross) / partN
return result
}
// =================================================================================================
// func (g *CitationGraph) CompareByARI
func (g *CitationGraph) CompareByARI(communities []map[uint]bool,
memberships map[int64][]float64) float64 {
// ---------------------------------------------------------------------------------------------
// step 1: check parameters
g.checkCommunities(communities)
g.checkMemberships(memberships)
// ---------------------------------------------------------------------------------------------
// step 2: compute the intersection part
result := 0.0
myCommunities := g.GetCommunitiesFromMemberships(memberships)
for _, ci := range communities {
if len(ci) == 0 {
continue
}
for _, cj := range myCommunities {
if len(cj) == 0 {
continue
}
cu, cv := ci, cj
if len(cu) > len(cv) {
cu, cv = cv, cu
}
numInBoth := 0
for u, _ := range cu {
_, inCv := cv[u]
if inCv {
numInBoth++
}
}
result += float64(numInBoth * (numInBoth - 1) / 2)
}
}
// ---------------------------------------------------------------------------------------------
// step 3: compute the cross-boundary part
n := len(g.ToBeAnalyzed)
partN := float64(n * (n - 1) / 2)
partI := 0.0
for _, ci := range communities {
ni := len(ci)
if ni <= 1 {
continue
}
partI += float64(ni * (ni - 1) / 2)
}
partJ := 0.0
for _, cj := range myCommunities {
nj := len(cj)
if nj <= 1 {
continue
}
partJ += float64(nj * (nj - 1) / 2)
}
partCross := partI * partJ / partN
fmt.Printf("part intersect = %f, partCross = %f, partI = %f, partJ = %f\n", result, partCross, partI, partJ)
// ---------------------------------------------------------------------------------------------
// step 4: normalize and return the result
result = (result - partCross) / (0.5*(partI+partJ) - partCross)
return result
}
// =================================================================================================
// func (g *CitationGraph) ComputeEntropies
func (g *CitationGraph) ComputeEntropies(communities []map[uint]bool,
memberships map[int64][]float64) (float64, float64, float64) {
// ---------------------------------------------------------------------------------------------
// step 1: check parameters
g.checkCommunities(communities)
g.checkMemberships(memberships)
// ---------------------------------------------------------------------------------------------
// step 2: compute the cross entropy
crossEntropy := 0.0
n := len(g.ToBeAnalyzed)
myCommunities := g.GetCommunitiesFromMemberships(memberships)
for _, ci := range communities {
if len(ci) == 0 {
continue
}
for _, cj := range myCommunities {
if len(cj) == 0 {
continue
}
cu, cv := ci, cj
if len(cu) > len(cv) {
cu, cv = cv, cu
}
numInBoth := 0
for u, _ := range cu {
_, inCv := cv[u]
if inCv {
numInBoth++
}
}
if numInBoth == 0 {
continue
}
p := float64(numInBoth) / float64(n)
crossEntropy -= p * math.Log(p)
}
}
// ---------------------------------------------------------------------------------------------
// step 2: compute the entropies
entropy1 := 0.0
for _, ci := range communities {
ni := len(ci)
if ni == 0 {
continue
}
p := float64(ni) / float64(n)
entropy1 -= p * math.Log(p)
}
entropy2 := 0.0
for _, cj := range myCommunities {
nj := len(cj)
if nj == 0 {
continue
}
p := float64(nj) / float64(n)
entropy2 -= p * math.Log(p)
}
// ---------------------------------------------------------------------------------------------
// step 3: return the entropies
return crossEntropy, entropy1, entropy2
}
// =================================================================================================
// func (g *CitationGraph) CompareByMI
func (g *CitationGraph) CompareByMI(communities []map[uint]bool,
memberships map[int64][]float64) float64 {
// ---------------------------------------------------------------------------------------------
// step 1: compute the entropies
crossEntropy, entropy1, entropy2 := g.ComputeEntropies(communities, memberships)
// ---------------------------------------------------------------------------------------------
// step 2: compute and return the result
return entropy1 + entropy2 - crossEntropy
}
// =================================================================================================
// func (g *CitationGraph) CompareByNMI
func (g *CitationGraph) CompareByNMI(communities []map[uint]bool,
memberships map[int64][]float64) float64 {
// ---------------------------------------------------------------------------------------------
// step 1: compute the entropies
crossEntropy, entropy1, entropy2 := g.ComputeEntropies(communities, memberships)
// ---------------------------------------------------------------------------------------------
// step 2: compute and return the result
return 2.0 * (entropy1 + entropy2 - crossEntropy) / (entropy1 + entropy2)
}
func numCombs(a ...int) float64 {
m := len(a)
m1 := m
for i := 0; i < m; i++ {
if a[i] < 0 {
m1 = i
break
}
}
if m1 == m {
log.Fatalln(fmt.Sprintf("in numComb, m1 == m\n"))
}
b := make([]int, m)
n1 := 0
for i := 0; i < m1; i++ {
b[i] = a[i]
n1 += a[i]
}
n2 := 0
for i := m1 + 1; i < m; i++ {
b[i] = a[i]
n2 += a[i]
}
if n1 != n2 {
log.Fatalln(fmt.Sprintf("in numComb, n1 = %d != n2 = %d\n", n1, n2))
}
result := 1.0
for k := 0; k < n1; k++ {
i1Max := 0
for i1 := 1; i1 < m1; i1++ {
if b[i1] > b[i1Max] {
i1Max = i1
}
}
i2Max := m1 + 1
for i2 := m1 + 2; i2 < m; i2++ {
if b[i2] > b[i2Max] {
i2Max = i2
}
}
result *= float64(b[i1Max]) / float64(b[i2Max])
b[i1Max]--
b[i2Max]--
}
return result
}
// =================================================================================================
// func (g *CitationGraph) ComputeEMI
func (g *CitationGraph) ComputeEMI(communities []map[uint]bool,
memberships map[int64][]float64) float64 {
// ---------------------------------------------------------------------------------------------
// step 1: check parameters
g.checkCommunities(communities)
g.checkMemberships(memberships)
// ---------------------------------------------------------------------------------------------
// step 2: compute the cross entropy
result := 0.0
n := len(g.ToBeAnalyzed)
myCommunities := g.GetCommunitiesFromMemberships(memberships)
for _, ci := range communities {
ni := len(ci)
if ni == 0 {
continue
}
for _, cj := range myCommunities {
nj := len(cj)
if nj == 0 {
continue
}
k0 := ni + nj - n
if k0 < 1 {
k0 = 1
}
k1 := ni
if k1 > nj {
k1 = nj
}
for k := k0; k <= k1; k++ {
result += numCombs(ni, nj, n-ni, n-nj, -1, n, k, ni-k, nj-k, n-ni-nj+k) *
float64(k) / float64(n) * math.Log(float64(k*n)/float64(ni*nj))
}
}
}
// ---------------------------------------------------------------------------------------------
// step 2: return the result
return result
}
// =================================================================================================
// func (g *CitationGraph) CompareByAMI
func (g *CitationGraph) CompareByAMI(communities []map[uint]bool,
memberships map[int64][]float64) float64 {
// ---------------------------------------------------------------------------------------------
// step 1: compute the entropies
crossEntropy, entropy1, entropy2 := g.ComputeEntropies(communities, memberships)
// ---------------------------------------------------------------------------------------------
// step 2: compute and expected mutual entropy
emi := g.ComputeEMI(communities, memberships)
// ---------------------------------------------------------------------------------------------
// step 3: compute and return the result
mi := entropy1 + entropy2 - crossEntropy
return (mi - emi) / (math.Max(entropy1, entropy2) - emi)
}
// =================================================================================================
// func SaveMemberships
func SaveMemberships(memberships map[int64][]float64, fileName string) {
fileMembership, err := os.Create(fileName)
if err != nil {
log.Fatal(err)
}
jsonBytes, err := json.Marshal(memberships)
if err != nil {
log.Fatal(err)
}
jsonString := string(jsonBytes)
_, err = fileMembership.WriteString(jsonString)
if err != nil {
log.Fatal(err)
}
fileMembership.Close()
}
// =================================================================================================
// func LoadMemberships
func LoadMemberships(fileName string) map[int64][]float64 {
memberships := map[int64][]float64{}
fileMembership, err := os.Open(fileName)
if err != nil {
log.Fatal(err)
}
bufLen := 1024 * 1024
bytes := make([]byte, bufLen)
jsonBytes := []byte{}
readLen, err := fileMembership.Read(bytes)
if err != nil {
log.Fatal(err)
}
for readLen == bufLen {
jsonBytes = append(jsonBytes, bytes...)
readLen, err = fileMembership.Read(bytes)
if err != nil {
log.Fatal(err)
}
}
if readLen > 0 {
jsonBytes = append(jsonBytes, bytes[0:readLen]...)
}
fileMembership.Close()
json.Unmarshal(jsonBytes, &memberships)
return memberships
}
// =================================================================================================
// struct PhrasePair
type PhrasePair struct {
Phrase1, Phrase2 string
}
type PairFreq struct {
Actual float64
Expected float64
}
// =================================================================================================
// func (g *CitationGraph) preparePhrasePairSearch
func (g *CitationGraph) preparePhrasePairSearch() (numDocs int,
docConcurrences map[uint]map[uint]float64, docFreqsOf []float64, phrases []string) {
// --------------------------------------------------------------------------------------------
// step 1: create corpus using all nodes
corpus := g.CreateCorpusX(2)
// --------------------------------------------------------------------------------------------
// step 2: get concurrences and exclusions from corpuse
docConcurrences = corpus.GetDocConcurrences()
// --------------------------------------------------------------------------------------------
// step 3: compute sum of document frequencies for each phrase
numPhrases := len(corpus.Vocab)
docFreqsOf = make([]float64, numPhrases)
for i := 0; i < numPhrases; i++ {
docFreqsOf[i] = 0.0
}
numCPUs := runtime.NumCPU()
numDocs = len(corpus.Docs)
chI := make(chan int)
chResult := make(chan map[int]int)
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
go func() {
myResult := map[int]int{}
for i0 := range chI {
i1 := i0 + 100
if i1 > numDocs {
i1 = numDocs
}
for i := i0; i < i1; i++ {
candidates := corpus.Docs[i]
// first find the set of texts in this document
docResult := map[int]bool{}
for _, candidate := range candidates {
for wordID, _ := range candidate {
docResult[wordID] = true
}
}
// then update my document frequency with this set
for wordID, _ := range docResult {
oldFreq, exists := myResult[wordID]
if !exists {
oldFreq = 0
}
myResult[wordID] = oldFreq + 1
}
}
}
chResult <- myResult
}()
}
for i := 0; i < numDocs; i += 100 {
chI <- i
}
close(chI)
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
localResult := <-chResult
for wordID, freq := range localResult {
docFreqsOf[wordID] += float64(freq)
}
}
// --------------------------------------------------------------------------------------------
// step 4: create concurrence model from concurrences
phrases = make([]string, numPhrases)
for phrase, idx := range corpus.Vocab {
phrases[idx] = phrase
}
return
}
// =================================================================================================
// func (g *CitationGraph) findStronglyConnectedPhrases
func (g *CitationGraph) findStronglyConnectedPhrases(numDocs int,
docConcurrences map[uint]map[uint]float64, docFreqsOf []float64, phrases []string,
thresFreq, thresRatio float64,
) map[PhrasePair]PairFreq {
// --------------------------------------------------------------------------------------------
// step 1: apply thresholds in filtering
n := len(g.Nodes)
result := map[PhrasePair]PairFreq{}
for w1, concurrencesOfW1 := range docConcurrences {
relFreq1 := docFreqsOf[w1] / float64(numDocs)
for w2, freq := range concurrencesOfW1 {
if w1 >= w2 {
continue
}
if freq < thresFreq {
continue
}
relFreq2 := docFreqsOf[w2] / float64(numDocs)
expectedFreq := relFreq1 * relFreq2 * float64(n)
if freq >= thresRatio*expectedFreq {
result[PhrasePair{phrases[w1], phrases[w2]}] =
PairFreq{Actual: freq, Expected: expectedFreq}
}
}
}
// ---------------------------------------------------------------------------------------------
// step 2: return the result
return result
}
// =================================================================================================
// func (g *CitationGraph) GetStronglyConnectedPhrases
func (g *CitationGraph) GetStronglyConnectedPhrases(thresFreq, thresRatio float64) map[PhrasePair]PairFreq {
// --------------------------------------------------------------------------------------------
// step 1: get concurrences, sumConcurrencesOf, and phrases
numDocs, docConcurrences, docFreqsOf, phrases := g.preparePhrasePairSearch()
// --------------------------------------------------------------------------------------------
// step 2: apply thresholds in filtering and return the result
return g.findStronglyConnectedPhrases(numDocs, docConcurrences, docFreqsOf, phrases, thresFreq, thresRatio)
}
// =================================================================================================
// func (g *CitationGraph) SaveWord2VecTrainingData
// brief description:
// save tokenized and stemmed publication titles to a file for training of word2vec
func (g *CitationGraph) SaveWord2VecTrainingData(fileNamePrefix string, numIters, minFreq int,
minScore float64, useW2PEx bool, yearStartFrom int) {
// ---------------------------------------------------------------------------------------------
// step 1: create the sequential corpus
corpus, years, IDs, isEnglish := g.CreateCorpusSeq(2)
// ---------------------------------------------------------------------------------------------
// step 2: use word2phrase to detect multi-word phrases
if useW2PEx {
corpus = corpus.Word2PhraseEx(numIters, minFreq, minScore)
} else {
corpus = corpus.Word2Phrase(numIters, minFreq, minScore)
}
// ---------------------------------------------------------------------------------------------
// step 3: extract a list of phrases from corpus
words := make([]string, len(corpus.Vocab))
for word, idxWord := range corpus.Vocab {
words[idxWord] = word
}
// ---------------------------------------------------------------------------------------------
// step 4: detect year ranges
minYear := years[0]
maxYear := years[0]
lenYears := len(years)
for idxYear := 1; idxYear < lenYears; idxYear++ {
if years[idxYear] < minYear {
minYear = years[idxYear]
} else if years[idxYear] > maxYear {
maxYear = years[idxYear]
}
}
if minYear < yearStartFrom {
minYear = yearStartFrom
}
// ---------------------------------------------------------------------------------------------
// step 5: create a file for each [minYear, year] range
for year := minYear; year <= maxYear; year++ {
// (5.1) create the file
fileName := fileNamePrefix + fmt.Sprintf("-%d.txt", year)
idFileName := fileNamePrefix + fmt.Sprintf("-IDs-%d.txt", year)
file, err := os.Create(fileName)
if err != nil {
log.Fatalln(fmt.Sprintf("cannot create file %s", fileName))
}
defer file.Close()
idFile, err := os.Create(idFileName)
if err != nil {
log.Fatalln(fmt.Sprintf("cannot create file %s", idFileName))
}
defer idFile.Close()
// (5.2) write the file
for idxDoc, doc := range corpus.Docs {
if years[idxDoc] > year || !isEnglish[idxDoc] {
continue
}
for _, idxWord := range doc {
word := strings.Replace(words[idxWord], " ", "_", -1)
file.WriteString(word + " ")
}
file.WriteString("\n")
idFile.WriteString(fmt.Sprintf("%d\n", IDs[idxDoc]))
}
}
}
// =================================================================================================
// func (g *CitationGraph) GetEmergingTrends
func (g *CitationGraph) GetEmergingTrends(yearToday, yearRecent, yearFarAway, lowThreshold,
highThreshold int) map[string][]int {
if yearToday <= yearRecent || yearRecent <= yearFarAway {
log.Fatal("Must make sure yearFarAway < yearRecent < yearToday")
}
result := map[string][]int{}
numYears := yearToday - yearFarAway + 1
numYearsFarAway := yearRecent - yearFarAway
for _, nodeID := range g.ToBeAnalyzed {
node := g.Nodes[int64(nodeID)]
year := int(node.Year)
if year < yearFarAway || year > yearToday {
continue
}
stemmedLabels := KeyphraseExtraction.StemPhrases(node.Labels)
for _, label := range stemmedLabels {
labelHistory, exists := result[label]
if !exists {
labelHistory = make([]int, numYears)
result[label] = labelHistory
for i := 0; i < numYears; i++ {
labelHistory[i] = 0
}
}
labelHistory[year-yearFarAway]++
}
}
toBeDeleted := map[string]bool{}
for label, labelHistory := range result {
countFarAway := 0
for i := 0; i < numYearsFarAway; i++ {
countFarAway += labelHistory[i]
}
if countFarAway >= lowThreshold {
toBeDeleted[label] = true
continue
}
countRecent := 0
for i := numYearsFarAway; i < numYears; i++ {
countRecent += labelHistory[i]
}
if countRecent < highThreshold {
toBeDeleted[label] = true
}
}
for label, _ := range toBeDeleted {
delete(result, label)
}
return result
}
// =================================================================================================
// func (g *CitationGraph) GetEmergingTopicPublications
// note: The publications of emerging topic have the following characteristics:
// (1) cold start: in the year of publication, it receives citations <= lowThreshold per year
// (2) break out: in recent years, it receives avg citations >= highThreshold per year
func (g *CitationGraph) GetEmergingTopicPublications(yearToday, yearRecent, yearFarAway,
lowThreshold, highThreshold int) map[int64][]int {
if yearToday <= yearRecent || yearRecent <= yearFarAway {
log.Fatal("Must make sure yearFarAway < yearRecent < yearToday")
}
result := map[int64][]int{}
numYears := yearToday - yearFarAway + 1
for _, nodeID := range g.ToBeAnalyzed {
node := g.Nodes[int64(nodeID)]
year := int(node.Year)
if year < yearFarAway || year > yearToday {
continue
}
citeHistory := make([]int, numYears)
result[nodeID] = citeHistory
for i := 0; i < numYears; i++ {
citeHistory[i] = 0
}
for _, citeID := range node.Cites {
citer := g.Nodes[citeID]
citeYear := int(citer.Year)
if citeYear < yearFarAway || citeYear > yearToday {
continue
}
citeHistory[citeYear-yearFarAway]++
}
}
toBeDeleted := map[int64]bool{}
for nodeID, citeHistory := range result {
// find cold start
node := g.Nodes[int64(nodeID)]
year := int(node.Year)
if citeHistory[year-yearFarAway] > lowThreshold {
toBeDeleted[nodeID] = true
continue
}
coldYear := year
for coldYear+1 < yearToday {
if citeHistory[coldYear+1-yearFarAway] <= lowThreshold {
coldYear++
} else {
break
}
}
// skip those cold year ends at next year of publication
if coldYear-year < 2 {
toBeDeleted[nodeID] = true
continue
}
// skip those cold year ends before yearRecent and those no hot year
if coldYear < yearRecent || coldYear >= yearToday {
toBeDeleted[nodeID] = true
continue
}
// compute hot year average
hotYearAvg := 0.0
numHotYears := 0
for hotYear := coldYear + 1; hotYear <= yearToday; hotYear++ {
hotYearAvg += float64(citeHistory[hotYear-yearFarAway])
numHotYears++
}
hotYearAvg /= float64(numHotYears)
// skip those with too low average during hot years
if hotYearAvg < float64(highThreshold) {
toBeDeleted[nodeID] = true
}
}
for nodeID, _ := range toBeDeleted {
delete(result, nodeID)
}
return result
}
// =================================================================================================
// func (g *CitationGraph) GetHotTopicPublications
// note: The publications of hot topic have the following characteristics:
// hot start: in the year of publication or next year, it receives citations >= highThreshold per year
func (g *CitationGraph) GetHotTopicPublications(yearToday, yearRecent, yearFarAway,
lowThreshold, highThreshold int) map[int64][]int {
if yearToday <= yearRecent || yearRecent <= yearFarAway {
log.Fatal("Must make sure yearFarAway < yearRecent < yearToday")
}
result := map[int64][]int{}
numYears := yearToday - yearFarAway + 1
for _, nodeID := range g.ToBeAnalyzed {
node := g.Nodes[int64(nodeID)]
year := int(node.Year)
if year < yearFarAway || year > yearToday {
continue
}
citeHistory := make([]int, numYears)
result[nodeID] = citeHistory
for i := 0; i < numYears; i++ {
citeHistory[i] = 0
}
for _, citeID := range node.Cites {
citer := g.Nodes[citeID]
citeYear := int(citer.Year)
if citeYear < yearFarAway || citeYear > yearToday || citeYear < year {
continue
}
citeHistory[citeYear-yearFarAway]++
}
}
toBeDeleted := map[int64]bool{}
for nodeID, citeHistory := range result {
// find hot start
node := g.Nodes[int64(nodeID)]
year := int(node.Year)
if citeHistory[year-yearFarAway] < highThreshold {
if year+1 <= yearToday {
if citeHistory[year+1-yearFarAway] < highThreshold {
toBeDeleted[nodeID] = true
continue
}
} else {
toBeDeleted[nodeID] = true
continue
}
}
}
for nodeID, _ := range toBeDeleted {
delete(result, nodeID)
}
return result
}
// =================================================================================================
// func (g *CitationGraph) SortByYear
// brief description: return a sorted list of titles
// input:
// None
// output:
// A map with year as key and slice of titles as value.
func (g *CitationGraph) SortByYear() map[int][]string {
// ---------------------------------------------------------------------------------------------
// step 1: create an empty map of the result
result := map[int][]string{}
// ---------------------------------------------------------------------------------------------
// step 2: iterate through the citation graph and put titles into result according to year
for _, node := range g.Nodes {
year := int(node.Year)
resultOfYear, exists := result[year]
if !exists {
resultOfYear = []string{}
}
result[year] = append(resultOfYear, node.Title)
}
// ---------------------------------------------------------------------------------------------
// step 3: return the result
return result
}
func (g *CitationGraph) Word2Vec(fileNamePrefix string, numIters, minFreq int,
minScore float64, useW2PEx bool, yearStartFrom, yearEndWith int) {
g.SaveWord2VecTrainingData(fileNamePrefix, numIters, minFreq, minScore, useW2PEx, yearStartFrom)
for year := yearStartFrom; year <= yearEndWith; year++ {
model, err := word2vec.New(
word2vec.Window(5),
word2vec.Model(word2vec.Cbow),
word2vec.Optimizer(word2vec.NegativeSampling),
word2vec.NegativeSampleSize(5),
word2vec.Dim(100),
word2vec.Iter(100),
//word2vec.Verbose(),
word2vec.LogBatch(10000),
)
if err != nil {
log.Fatalln(err)
}
inputFileName := fileNamePrefix + fmt.Sprintf("-%d.txt", year)
input, _ := os.Open(inputFileName)
defer input.Close()
if err = model.Train(input); err != nil {
log.Fatalln(err)
}
// write word vector.
outputFileName := fileNamePrefix + fmt.Sprintf("-cbow-%d.vec", year)
output, _ := os.Create(outputFileName)
model.Save(output, vector.Agg)
}
}
func computePhraseSimilarities(phraseVectors map[string][]float64, highFreqPhrases map[string]bool,
) map[string]map[string]float64 {
chPhrases := make(chan []string)
chSimilarities := make(chan map[string]map[string]float64)
numCPUs := runtime.NumCPU()
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
go func() {
mySimilarities := map[string]map[string]float64{}
for phrases := range chPhrases {
for _, phrase := range phrases {
vector, _ := phraseVectors[phrase]
simToPhrase := map[string]float64{}
mySimilarities[phrase] = simToPhrase
for another, anotherVector := range phraseVectors {
_, isHighFreq := highFreqPhrases[another]
if !isHighFreq {
continue
}
sim := 0.0
size1 := 0.0
size2 := 0.0
for idx, elem := range vector {
anotherElem := anotherVector[idx]
sim += elem * anotherElem
size1 += elem * elem
size2 += anotherElem * anotherElem
}
sim /= math.Sqrt(size1 * size2)
simToPhrase[another] = sim
}
}
}
chSimilarities <- mySimilarities
}()
}
phrases := make([]string, 100)
idxPhrase := 0
for phrase, _ := range phraseVectors {
_, isHighFreq := highFreqPhrases[phrase]
if !isHighFreq {
continue
}
phrases[idxPhrase] = phrase
idxPhrase++
if idxPhrase == 100 {
chPhrases <- phrases
phrases = make([]string, 100)
idxPhrase = 0
}
}
if idxPhrase > 0 {
chPhrases <- phrases[0:idxPhrase]
}
close(chPhrases)
similarities := map[string]map[string]float64{}
for idxCPU := 0; idxCPU < numCPUs; idxCPU++ {
localSimilarities := <-chSimilarities
for phrase, simToPhrase := range localSimilarities {
similarities[phrase] = simToPhrase
}
}
return similarities
}
func computeSimRanks(similarities map[string]map[string]float64) map[string]map[string]int {
result := map[string]map[string]int{}
for phrase1, simToPhrase1 := range similarities {
numNeighbors := len(simToPhrase1) - 1
neighbors := make([]string, numNeighbors)
idxNeighbor := 0
for phrase2, _ := range simToPhrase1 {
if phrase2 == phrase1 {
continue
}
neighbors[idxNeighbor] = phrase2
idxNeighbor++
}
sort.Slice(neighbors, func(i, j int) bool {
return simToPhrase1[neighbors[i]] > simToPhrase1[neighbors[j]]
})
resultOfPhrase1 := map[string]int{}
for idx, neighbor := range neighbors {
resultOfPhrase1[neighbor] = idx
}
result[phrase1] = resultOfPhrase1
}
return result
}
type RankJump struct {
phrase1, phrase2 string
jump int
}
func (g *CitationGraph) Leap2Trend(fileNamePrefix string, yearStartFrom, yearEndWith, minFreq, minJump int) {
prevSimRanks := map[string]map[string]int{}
for year := yearStartFrom; year <= yearEndWith; year++ {
// open vector file and id file
vectorFileName := fileNamePrefix + fmt.Sprintf("-cbow-%d.vec", year)
idFileName := fileNamePrefix + fmt.Sprintf("-IDs-%d.txt", year)
trainFileName := fileNamePrefix + fmt.Sprintf("-%d.txt", year)
vectorFile, _ := os.Open(vectorFileName)
idFile, _ := os.Open(idFileName)
trainFile, _ := os.Open(trainFileName)
defer vectorFile.Close()
defer idFile.Close()
defer trainFile.Close()
// load vectors from vector file
vectorScanner := bufio.NewScanner(vectorFile)
phraseVectors := map[string][]float64{}
for vectorScanner.Scan() {
textLine := vectorScanner.Text()
fields := strings.Split(textLine, " ")
if len(fields) < 101 {
continue
}
phrase := fields[0]
phraseVector := make([]float64, 100)
phraseVectors[phrase] = phraseVector
for j := 0; j < 100; j++ {
phraseVector[j], _ = strconv.ParseFloat(fields[j+1], 64)
}
}
// load ids from id file
idScanner := bufio.NewScanner(idFile)
ids := []int64{}
for idScanner.Scan() {
idText := idScanner.Text()
id, _ := strconv.ParseInt(idText, 10, 64)
ids = append(ids, id)
}
// find high freq phrases
phraseFreqs := map[string]int{}
trainScanner := bufio.NewScanner(trainFile)
for trainScanner.Scan() {
phrases := strings.Split(trainScanner.Text(), " ")
for _, phrase := range phrases {
oldFreq, exists := phraseFreqs[phrase]
if !exists {
oldFreq = 0
}
phraseFreqs[phrase] = oldFreq + 1
}
}
highFreqPhrases := map[string]bool{}
for phrase, freq := range phraseFreqs {
if freq < minFreq {
continue
}
highFreqPhrases[phrase] = true
}
// compute phrase similarity matrix
phraseSimilarities := computePhraseSimilarities(phraseVectors, highFreqPhrases)
// compute ranks
simRanks := computeSimRanks(phraseSimilarities)
// find top jumps in phrase similarities
jumps := []RankJump{}
for phrase1, rankWithPhrase1 := range prevSimRanks {
for phrase2, prevRank := range rankWithPhrase1 {
rank := simRanks[phrase1][phrase2]
jump := rank - prevRank
if jump < minJump {
continue
}
jumps = append(jumps, RankJump{phrase1: phrase1, phrase2: phrase2, jump: jump})
}
}
sort.Slice(jumps, func(i, j int) bool {
return jumps[i].jump > jumps[j].jump
})
outputFileName := fileNamePrefix + fmt.Sprintf("-jumpranking-%d.csv", year)
outputFile, _ := os.Create(outputFileName)
defer outputFile.Close()
for idx, rankJump := range jumps {
outputFile.WriteString(fmt.Sprintf("%d, %s, %s, %d\n", idx, rankJump.phrase1, rankJump.phrase2, rankJump.jump))
}
// record phraseSimilarities for next round
prevSimRanks = simRanks
}
}
GitHub Events
Total
Last Year
Issues and Pull Requests
Last synced: about 3 years ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 2
- Total downloads: unknown
-
Total dependent packages: 0
(may contain duplicates) -
Total dependent repositories: 0
(may contain duplicates) - Total versions: 2
proxy.golang.org: github.com/wujunfeng1/citationgraphs
CitationGraphs.go project CitationGraphs.go CitationGraphs.go document
- Homepage: https://github.com/wujunfeng1/citationgraphs
- Documentation: https://pkg.go.dev/github.com/wujunfeng1/citationgraphs#section-documentation
- License: MIT
-
Latest release: v0.0.0-20210401000901-cc6e43bb5f00
published about 5 years ago
Rankings
Dependent packages count: 7.0%
Average: 8.2%
Dependent repos count: 9.3%
Last synced:
10 months ago
proxy.golang.org: github.com/wujunfeng1/CitationGraphs
CitationGraphs.go project CitationGraphs.go CitationGraphs.go document
- Homepage: https://github.com/wujunfeng1/CitationGraphs
- Documentation: https://pkg.go.dev/github.com/wujunfeng1/CitationGraphs#section-documentation
- License: MIT
-
Latest release: v0.0.0-20210401000901-cc6e43bb5f00
published about 5 years ago
Rankings
Dependent packages count: 7.0%
Average: 8.2%
Dependent repos count: 9.3%
Last synced:
over 1 year ago