https://github.com/bitcs-information-retrieval-2021-2022/project1-let-s-go-for-neurips
project1-let-s-go-for-neurips created by GitHub Classroom
Repository
Basic Info
- Host: GitHub
- Owner: BITCS-Information-Retrieval-2021-2022
- Language: Python
- Default Branch: main
- Size: 2.98 MB
Statistics
- Stars: 2
- Watchers: 0
- Forks: 1
- Open Issues: 0
- Releases: 0
Created over 4 years ago
· Last pushed over 4 years ago
https://github.com/BITCS-Information-Retrieval-2021-2022/project1-let-s-go-for-neurips/blob/main/
# Let's Go For NeurIPS
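## Overview
This project uses [scrapy](https://scrapy.org) to crawl paper metadata from [ACM](https://dl.acm.org), [Springer](https://www.springer.com), and [ScienceDirect](https://www.sciencedirect.com), including each paper's PDF URL (`pdf_url`). The records are stored in [MongoDB](https://www.mongodb.com) and indexed with [Elasticsearch](https://www.elastic.co) + [Kibana](https://www.elastic.co).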
## Members
| Name | Student ID | Responsibilities |
| --------------------------------------- | ---------- | ------------------------------ |
| | 3120211034 | ACM, Elasticsearch |
| | 3120211080 | MongoDB |
| | 3120211035 | ScienceDirect, IP proxy |
| | 3120211055 | IP proxy |
| | 3120211026 | Springer |
| | 3120211001 | ACM |
## Features
#### 1
-
#### 2 IP proxy
- Requests are sent through an IP proxy pool, and the proxy IP is rotated while crawling.
#### 3
-
#### 4 Exception handling
- Errors during crawling are handled with try-catch.
#### 5
-
#### 6
-
## Statistics
- Total papers: 1815845
- PDF files: 237977

| Source | Papers | PDFs |
| ------------- | -------- | ----------- |
| ACM | 1014506 | 197884 |
| Springer | 197272 | 40093 |
| ScienceDirect | 604067 | 0 |

| Field | Count | Coverage |
| ------------ | ------ | ------ |
| title | 1815845 | 100.00% |
| abstract | 1254346 | 69.08% |
| authors | 1790760 | 98.62% |
| doi | 1595599 | 87.87%|
| url | 1815845 | 100.00% |
| year | 1760724 | 96.96%|
| month | 1760722 | 96.96% |
| type | 1815845 | 100.00%|
| venue | 1815833 | 100.00% |
| source | 1815845 | 100.00% |
| video_url | 18841 | 1.04% |
| video_path | 18841 | 1.04% |
| thumbnail_url | 18841 | 1.46% |
| pdf_url | 916891 | 50.49% |
| pdf_path | 916891 | 50.49% |
| inCitations | 1202192 | 66.21% |
| outCitations | 1285093 | 70.78% |
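The percentages in the field table are each field's count divided by the total number of papers. A minimal sketch that reproduces a few of them from the figures above; the dictionary and function names are illustrative and not part of the project:

```python
# Counts copied from the tables above.
papers_by_source = {"ACM": 1014506, "Springer": 197272, "ScienceDirect": 604067}
field_counts = {"abstract": 1254346, "doi": 1595599, "pdf_url": 916891}

# The per-source counts add up to the total number of papers.
total = sum(papers_by_source.values())


def coverage(count: int, total: int) -> str:
    """Share of records that carry a field, as a percentage with two decimals."""
    return f"{100 * count / total:.2f}%"


for field, count in field_counts.items():
    print(field, count, coverage(count, total))
```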
## Getting Started
### 1. Environment
- Operating system: Windows, Linux, or macOS
- Python 3
### 2. Install dependencies
```
pip install -r requirements.txt
```
### 3. Install and start services
#### 3.1 ElasticSearch
1. Download ElasticSearch
``` wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.16.2-linux-x86_64.tar.gz ```
2. Extract the archive
``` tar -xzf elasticsearch-7.16.2-linux-x86_64.tar.gz ```
3. Start ElasticSearch
``` cd elasticsearch-7.16.2/ ```
``` ./bin/elasticsearch ```
4. Verify that it is running
``` curl 'localhost:9200' ```
#### 3.2 MongoDB
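1. Install dependencies
The required packages differ between Linux distributions. On Ubuntu:
```
sudo apt-get install libcurl4 openssl
```
2. Download and install
MongoDB download page: https://www.mongodb.com/download-center/community

Download the tgz package and extract it:
```
wget https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-ubuntu1804-5.0.5.tgz # download
tar -zxvf mongodb-linux-x86_64-ubuntu1804-5.0.5.tgz # extract
```
Move the extracted directory to the install location:
```
mv mongodb-linux-x86_64-ubuntu1804-5.0.5 /usr/local/mongodb # the directory name matches the archive name
```
Add the MongoDB bin directory to PATH: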
```
export PATH=/usr/local/mongodb/bin:$PATH
```
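3. Configure MongoDB
Create the data and log directories for MongoDB and make the current user their owner:
```
sudo mkdir -p /var/lib/mongo # data directory
sudo mkdir -p /var/log/mongodb # log directory
sudo chown `whoami` /var/lib/mongo # set ownership of the data directory
sudo chown `whoami` /var/log/mongodb # set ownership of the log directory
```
Create a `.conf` file named `mongodb.conf` in the MongoDB `bin` directory with the following content: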
```
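# data directory
dbpath=/var/lib/mongo
# log file
logpath=/var/log/mongodb/mongod.log
# append to the log instead of overwriting it
logappend=true
# port, 27017 by default
port=27017
# IP addresses to listen on (0.0.0.0 accepts connections from any address)
bind_ip=0.0.0.0
# run in the background
fork=true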
```
4. Start MongoDB
```
cd /usr/local/mongodb/
./bin/mongod -f ./bin/mongodb.conf
```
5. Connect with the mongo shell to verify that it is running
```
./bin/mongo
```
#### 3.3 Kibana
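1. Download Kibana (the macOS build is shown here)
``` curl -O https://artifacts.elastic.co/downloads/kibana/kibana-7.16.2-darwin-x86_64.tar.gz ```
2. Extract the archive
``` tar -xzf kibana-7.16.2-darwin-x86_64.tar.gz ```
3. Start Kibana (ElasticSearch must already be running)
``` cd kibana-7.16.2-darwin-x86_64/ ```
``` ./bin/kibana ```
4. Verify that it is running by opening the following address in a browser:
`localhost:5601`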
#### 3.4 ProxyPool
1. Clone ProxyPool
```
git clone https://github.com/Python3WebSpider/ProxyPool.git
cd ProxyPool
```
2. Install its dependencies
``` pip3 install -r requirements.txt ```
3. Install Redis and set the Redis environment variables
Redis installation guide: https://www.runoob.com/redis/redis-install.html
```
export REDIS_HOST='localhost'
export REDIS_PORT=6379
export REDIS_PASSWORD=''
export REDIS_DB=0
```
4. Start Redis
``` ./redis-server ```
5. Start ProxyPool
``` python3 run.py ```
6. Verify that it is running by requesting a random proxy:
http://localhost:5555/random
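
The `/random` endpoint answers with a single proxy as plain text in `host:port` form. A minimal sketch of consuming it from Python with only the standard library; the function names are hypothetical, and the scheme-to-URL mapping is the shape that HTTP clients such as `requests` accept for their `proxies` argument:

```python
from urllib.request import urlopen

PROXYPOOL_URL = "http://localhost:5555/random"


def fetch_random_proxy(url: str = PROXYPOOL_URL, timeout: float = 5.0) -> str:
    """Ask the running ProxyPool service for one proxy, returned as 'host:port'."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def as_proxy_mapping(proxy: str) -> dict:
    """Build a scheme -> proxy URL mapping from a 'host:port' string."""
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}
```

`fetch_random_proxy()` only works while ProxyPool and Redis are running; `as_proxy_mapping("1.2.3.4:8080")` can be tried without them.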
### 4. Run
- Run the following commands in order:
- ```python prepare.py```
- ```cd Reptiles```
- ```chmod +x run.sh```
- ```./run.sh ACM```

`ACM` can be replaced with `Springer` or `ScienceDirect`. Logs are written to `Logs`.
## Dependencies
See `./requirements.txt`.
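## Architecture
### Scrapy framework

The crawler is built on Scrapy and customizes its Middleware, Pipeline, and Spiders components, with an IP proxy added in the middleware.

**Scrapy components**
1. Spiders: process Responses, extract Items and follow-up URLs, and hand the new requests back to the Engine so that they reach the Scheduler again.
2. Engine: coordinates communication and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.
3. Scheduler: receives Requests from the Engine, queues them, and returns them when the Engine asks for the next one.
4. Downloader: downloads every Request sent by the Scrapy Engine and returns the Responses to the Scrapy Engine, which passes them to the Spider.
5. Spider Middlewares: hooks between the Engine and the Spider that can process the Responses entering the Spider and the Requests leaving it.
6. Item Pipeline: post-processes the Items produced by the Spider (for example filtering and storage).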
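
The data flow between these components can be illustrated with a toy model in plain Python. This is not Scrapy's API: the fake pages, function names, and the use of plain strings as requests are all assumptions made to keep the sketch self-contained.

```python
from collections import deque

# Fake "web": URL -> (paper titles on the page, links to follow). Purely illustrative.
PAGES = {
    "page/1": (["Paper A", "Paper B"], ["page/2"]),
    "page/2": (["Paper C"], []),
}


def downloader(url):
    """Downloader: turn a Request (here just a URL) into a Response."""
    return {"url": url, "body": PAGES[url]}


def spider_parse(response):
    """Spider: yield Items (dicts) and follow-up Requests (URL strings)."""
    titles, links = response["body"]
    for title in titles:
        yield {"title": title, "url": response["url"]}
    for link in links:
        yield link


def run_engine(start_url):
    """Engine: move data between Scheduler, Downloader, Spider, and Item Pipeline."""
    scheduler = deque([start_url])  # Scheduler: queue of pending Requests
    seen = {start_url}              # avoid scheduling the same URL twice
    pipeline_output = []            # Item Pipeline: here it only stores Items
    while scheduler:
        response = downloader(scheduler.popleft())
        for result in spider_parse(response):
            if isinstance(result, dict):
                pipeline_output.append(result)
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)
    return pipeline_output


items = run_engine("page/1")
print([item["title"] for item in items])
```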
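### Crawler workflow

Requests are sent through the IP proxy pool, and the proxy IP is switched periodically. The spiders extract the paper fields, the pipeline downloads the PDF files, and the records are stored in MongoDB.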
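## Module Details
### 1. ACM
#### 1.1 Crawling strategy
- The ACM search endpoint is "/dl.acm.org/action/doSearch".
- Query parameters:
  - AfterYear/AfterMonth/AfterDay: start year/month/day of the date range
  - BeforeYear/BeforeMonth/BeforeDay: end year/month/day of the date range
  - concept:
  - sorted:
  - startPage
  - pageSize
- An ACM search exposes at most 2000 results, so each query is narrowed to stay under that limit.
- For each query, "/dl.acm.org/action/doSearch" is requested repeatedly with an increasing startPage.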
- pdf
- pdf
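
A minimal sketch of how paged search URLs for one date range could be generated from the parameters listed above. The `https://` scheme, zero-based `startPage`, the default page size, and reading the 2000 figure as a per-query cap are assumptions; the function names are hypothetical.

```python
from datetime import date
from urllib.parse import urlencode

ACM_SEARCH = "https://dl.acm.org/action/doSearch"


def acm_search_url(after: date, before: date, start_page: int = 0, page_size: int = 50) -> str:
    """Build one search URL for a date range and a page index."""
    params = {
        "AfterYear": after.year, "AfterMonth": after.month, "AfterDay": after.day,
        "BeforeYear": before.year, "BeforeMonth": before.month, "BeforeDay": before.day,
        "startPage": start_page, "pageSize": page_size,
    }
    return f"{ACM_SEARCH}?{urlencode(params)}"


def acm_page_urls(after: date, before: date, total_hits: int,
                  page_size: int = 50, max_hits: int = 2000) -> list:
    """List the page URLs needed to walk through the reachable results of one query."""
    reachable = min(total_hits, max_hits)
    pages = -(-reachable // page_size)  # ceiling division
    return [acm_search_url(after, before, page, page_size) for page in range(pages)]


urls = acm_page_urls(date(2020, 1, 1), date(2020, 1, 31), total_hits=120)
```

When a date range reports more hits than the cap, splitting it into shorter ranges and calling `acm_page_urls` for each is one way to reach every result.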
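#### 1.2 Field extraction
Fields are extracted with xpath.
1. Citing papers (inCitations)
2. Video URL (video_url) and thumbnail URL (thumbnail_url): the thumbnail URL is built as "videodelivery.net/" + source + "/thumbnails/thumbnail.jpg?time=10.0s"
3. Abstract (abstract): pdf
4. PDF URL (pdf_url) and PDF path (pdf_path):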


### 2. Springer
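#### 2.1 Crawling strategy
- Papers are collected through the Springer search page.
- A search exposes at most 1000 pages of 20 results each, i.e. 20000 results.
- Queries are therefore narrowed so that each one returns fewer than 20000 results.
- Journal and Conference content are crawled separately.
- Query parameters:
  - facet-content-type: Journal or ConferenceProceedings
  - search-within: Journal or Chapter
  - facet-journal-id
  - query
  - facet-eisbn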
#### 2.2 Field extraction
Fields are extracted with xpath.
1. Journal and Conference pages
2. Year (year) and month (month)
3. References (outCitations)



### 3. ScienceDirect
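#### 3.1 Crawling strategy
- Papers are collected through the ScienceDirect search.
- The search results are returned as json, so the json response is parsed directly.

- The ScienceDirect search endpoint is "www.sciencedirect.com/search/api?".
- Query parameters:
  - date
  - cid: venue id
  - show:
  - offset:
  - t: token
- A ScienceDirect search exposes at most 6000 results, so each query is narrowed to stay under 6000.
- "www.sciencedirect.com/search/api?" is requested repeatedly while iterating over date, cid, and offset.
- The results are read from the 'searchResults' field of the json response.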
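
A minimal sketch of the request loop for one `date` and `cid` combination, built from the parameter names above. The `https://` scheme, the page size passed as `show`, the example `cid` and token values, and reading the 6000 figure as a per-query cap are assumptions; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode

SD_SEARCH_API = "https://www.sciencedirect.com/search/api"
MAX_RESULTS = 6000  # assumed cap on reachable results per query


def sd_search_url(date: str, cid: str, offset: int, token: str, show: int = 100) -> str:
    """Build one search API URL for a year, a venue id, and a result offset."""
    params = {"date": date, "cid": cid, "show": show, "offset": offset, "t": token}
    return f"{SD_SEARCH_API}?{urlencode(params)}"


def sd_offsets(total_hits: int, show: int = 100) -> range:
    """Offsets needed to page through the reachable results of one query."""
    return range(0, min(total_hits, MAX_RESULTS), show)


def extract_results(body: str) -> list:
    """Read the result list from the 'searchResults' field of a json response."""
    return json.loads(body).get("searchResults", [])


# Placeholder cid and token values, for illustration only.
urls = [sd_search_url("2020", "271506", offset, "TOKEN") for offset in sd_offsets(250)]
```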
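#### 3.2 Field extraction
- Fields are extracted with xpath.

- Title (title): taken from the meta tags

- Authors (authors): built from given_name and surname

- Year (year) and month (month): taken from "Publication date" in the page scripts

- Type (type): taken from publicationType in the page scripts

- References (outCitations): taken from document-references in the page scripts

- Citing papers (inCitations): taken from hitCount in the json response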

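### 4. IP proxy
- The ACM, Springer, and ScienceDirect crawlers all use an IP proxy, based on the GitHub project ProxyPool: https://github.com/Python3WebSpider/ProxyPool
- The proxy IPs are stored in redis, and the crawler fetches a random IP from the pool.
- The proxy is applied by ProxiesMiddleware: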
```python
import logging

import requests


class ProxiesMiddleware(object):
    """Downloader middleware that routes requests through a proxy from ProxyPool."""

    def __init__(self, settings):
        super(ProxiesMiddleware, self).__init__()
        self.step = 0  # number of requests processed so far
        self.proxypool_url = 'http://127.0.0.1:5555/random'
        self.proxy = self.get_random_proxy()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def get_random_proxy(self):
        # ProxyPool answers with one proxy as plain text ("host:port")
        proxy = requests.get(self.proxypool_url).text.strip()
        logging.info('---get_random_proxy--- ' + str(proxy))
        return proxy

    def process_request(self, request, spider):
        self.step += 1
        # Switch to a new proxy every 1000 requests
        if self.step % 1000 == 0:
            self.proxy = self.get_random_proxy()
        request.meta['proxy'] = 'http://' + self.proxy
        request.headers["Connection"] = "close"
```
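### 5. Data storage
The records are stored in MongoDB as json-style documents.
#### 5.1 Deduplication
Each record carries a `checksum` field, and records with the same `checksum` are treated as the same paper.
#### 5.2 Computing the checksum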
```python
checksum = re.sub(r'[\W\d\_]', "", info['title']).lower()
```
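The `checksum` is the title with all non-word characters, digits, and underscores removed, converted to lowercase.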
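
A minimal sketch of what this normalization does to titles that differ only in case, spacing, punctuation, or digits. The regular expression is the one shown above; the function names and the keep-first policy in `deduplicate` are assumptions, not the project's code.

```python
import re


def title_checksum(title: str) -> str:
    """Strip non-word characters, digits, and underscores, then lowercase."""
    return re.sub(r'[\W\d\_]', "", title).lower()


def deduplicate(records: list) -> list:
    """Keep the first record seen for each checksum."""
    unique = {}
    for info in records:
        unique.setdefault(title_checksum(info["title"]), info)
    return list(unique.values())


a = title_checksum("Attention Is All You Need")
b = title_checksum("attention is all you need (2017)")
papers = deduplicate([
    {"title": "Attention Is All You Need", "source": "ACM"},
    {"title": "Attention is all you need.", "source": "Springer"},
    {"title": "Deep Residual Learning", "source": "ACM"},
])
```

Because digits are removed as well, titles that differ only in a number map to the same `checksum`.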
#### 5.3 Database interface
The `python` interface is implemented with `pymongo`.
The code is in `./Reptiles/Reptiles/data_manager.py`.
- Initialize
```python
Mongo = MongoManager()
```
- Insert
```python
Mongo.mongodb_insert(site, info)
```
- Delete
```python
Mongo.mongodb_delete(site, field, value)
```
- Find
```python
Mongo.mongodb_find(site, field, value)
```
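
To try the call pattern above without a running MongoDB, a dictionary-backed stand-in can be used. This is not the project's `MongoManager` (which uses `pymongo`): the class name, the return values, and the equality-based matching are assumptions made for illustration.

```python
from collections import defaultdict


class InMemoryManager:
    """Stand-in with the same method names as the calls above, backed by plain lists."""

    def __init__(self):
        self._collections = defaultdict(list)  # site name -> list of documents

    def mongodb_insert(self, site, info):
        """Store a copy of the record under the given site."""
        self._collections[site].append(dict(info))

    def mongodb_find(self, site, field, value):
        """Return every record of the site whose field equals value."""
        return [doc for doc in self._collections[site] if doc.get(field) == value]

    def mongodb_delete(self, site, field, value):
        """Remove matching records and return how many were removed."""
        before = len(self._collections[site])
        self._collections[site] = [
            doc for doc in self._collections[site] if doc.get(field) != value
        ]
        return before - len(self._collections[site])


Mongo = InMemoryManager()
Mongo.mongodb_insert("ACM", {"title": "Paper A", "year": 2020})
Mongo.mongodb_insert("ACM", {"title": "Paper B", "year": 2021})
found = Mongo.mongodb_find("ACM", "year", 2020)
removed = Mongo.mongodb_delete("ACM", "title", "Paper B")
```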
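### 6. Elasticsearch
1. Import
After elasticsearch and Kibana are installed, the items stored in mongodb are imported into Elasticsearch through its api.
2. Query
With Elasticsearch running, open localhost:5601 and use the dashboard to query the paper data through the api.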
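
One way to move documents from mongodb into Elasticsearch is the bulk api, which takes newline-delimited json: an action line followed by the document itself. A minimal sketch of building such a payload with the standard library; the index name `papers` and the function name are hypothetical, and this is not necessarily how the project's `convert_json.py` works.

```python
import json


def to_bulk_lines(docs, index="papers"):
    """Convert MongoDB-style documents into a bulk api payload (NDJSON)."""
    lines = []
    for doc in docs:
        # `_id` is metadata in Elasticsearch, so it goes into the action line
        # and is left out of the document body.
        source = {key: value for key, value in doc.items() if key != "_id"}
        action = {"index": {"_index": index}}
        if "_id" in doc:
            action["index"]["_id"] = str(doc["_id"])
        lines.append(json.dumps(action))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk payload must end with a newline.
    return "\n".join(lines) + "\n"


payload = to_bulk_lines([{"_id": 1, "title": "Paper A", "source": "ACM"}])
```

The payload can then be sent to the `_bulk` endpoint with the `Content-Type: application/x-ndjson` header.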

## Project Structure
```
README.md
requirements.txt
prepare.py
Reptiles
    scrapy.cfg
    run.sh                      // launch script
    Reptiles
        convert_json.py
        items.py                // item definitions
        middlewares.py          // ip proxy middleware
        data_manager.py         // MongoDB interface
        venue_cid               // ScienceDirect venue ids
        pipelines.py            // PDF download
        proxy.py                // ip proxy
        settings.py             // settings
        __init__.py
        configs
            proxylist_big.txt   // ip proxy list
        spiders
            ACM.py              // ACM spider
            ScienceDirect.py    // ScienceDirect spider
            Springer.py         // Springer spider
            __init__.py
```