https://github.com/bitcs-information-retrieval-2021-2022/project1-let-s-go-for-neurips

project1-let-s-go-for-neurips created by GitHub Classroom

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: sciencedirect.com, springer.com, acm.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (1.6%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: BITCS-Information-Retrieval-2021-2022
  • Language: Python
  • Default Branch: main
  • Size: 2.98 MB
Statistics
  • Stars: 2
  • Watchers: 0
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created over 4 years ago · Last pushed over 4 years ago

https://github.com/BITCS-Information-Retrieval-2021-2022/project1-let-s-go-for-neurips/blob/main/

# Let's Go For NeurIPS


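## Overview

This project uses [scrapy](https://scrapy.org) to crawl paper metadata from [ACM](https://dl.acm.org), [Springer](https://www.springer.com), and [ScienceDirect](https://www.sciencedirect.com), including each paper's PDF URL (`pdf_url`). The records are stored in [MongoDB](https://www.mongodb.com) and are indexed and visualized with [Elasticsearch](https://www.elastic.co) + [Kibana](https://www.elastic.co).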

## Team

| Name | Student ID | Responsibilities        |
| ---- | ---------- | ----------------------- |
|      | 3120211034 | ACM, Elasticsearch      |
|      | 3120211080 | MongoDB                 |
|      | 3120211035 | ScienceDirect, IP proxy |
|      | 3120211055 | IP proxy                |
|      | 3120211026 | Springer                |
|      | 3120211001 | ACM                     |


##  

####  1

- 

#### 2. IP proxy pool

- Requests are routed through an IP proxy pool, and the proxy IP is rotated during the crawl.

#### 3

- 

#### 4. Exception handling

- Errors raised during crawling are handled with try-catch.

####  5

- 

####  6

- 

## Crawl statistics

- Total papers crawled: 1815845

- PDFs downloaded: 237977

| Source        | Papers  | PDFs   |
| ------------- | ------- | ------ |
| ACM           | 1014506 | 197884 |
| Springer      | 197272  | 40093  |
| ScienceDirect | 604067  | 0      |

| Field         | Count   | Coverage |
| ------------- | ------- | -------- |
| title         | 1815845 | 100.00%  |
| abstract      | 1254346 | 69.08%   |
| authors       | 1790760 | 98.62%   |
| doi           | 1595599 | 87.87%   |
| url           | 1815845 | 100.00%  |
| year          | 1760724 | 96.96%   |
| month         | 1760722 | 96.96%   |
| type          | 1815845 | 100.00%  |
| venue         | 1815833 | 100.00%  |
| source        | 1815845 | 100.00%  |
| video_url     | 18841   | 1.04%    |
| video_path    | 18841   | 1.04%    |
| thumbnail_url | 18841   | 1.04%    |
| pdf_url       | 916891  | 50.49%   |
| pdf_path      | 916891  | 50.49%   |
| inCitations   | 1202192 | 66.21%   |
| outCitations  | 1285093 | 70.77%   |
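
The percentages in the field table are each field's count divided by the total of 1815845 records. A minimal sketch of that arithmetic, using a few counts copied from the table; the helper name `coverage` is hypothetical and not part of the project:

```python
TOTAL_RECORDS = 1815845  # total number of crawled papers, from the statistics above


def coverage(count: int, total: int = TOTAL_RECORDS) -> str:
    """Return the share of records that contain a field, as a percentage string."""
    return f"{100 * count / total:.2f}%"


# A few counts copied from the field table.
field_counts = {"abstract": 1254346, "doi": 1595599, "pdf_url": 916891}

for field, count in field_counts.items():
    print(f"{field:<10} {count:>8} {coverage(count)}")
```

Recomputing the column this way is a quick check that the counts and percentages in the table agree.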

## Installation and usage

### 1. Environment

- Operating system: Windows, Linux, or macOS

- Python 3

### 2. Install dependencies

```
pip install -r requirements.txt
```


### 3. Install third-party tools

#### 3.1 ElasticSearch

1. Download ElasticSearch

   ```wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.16.2-linux-x86_64.tar.gz```

2. Extract the archive

   ```tar -xzf elasticsearch-7.16.2-linux-x86_64.tar.gz```

3. Start ElasticSearch

   ```cd elasticsearch-7.16.2/```

   ```./bin/elasticsearch```

4. Verify that it is running

   ```curl 'localhost:9200'```
    

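#### 3.2 MongoDB

1. Install dependencies

On Linux (Ubuntu is used as the example here), install the required libraries:

```
sudo apt-get install libcurl4 openssl
```

2. Download and install

Download MongoDB from https://www.mongodb.com/download-center/community

![MongoDB](./extra/DownloadMongoDB.png)

Download the tgz package and extract it: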

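```
wget https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-ubuntu1804-5.0.5.tgz    # download the package
tar -zxvf mongodb-linux-x86_64-ubuntu1804-5.0.5.tgz                                # extract it
```

Move the extracted directory to the install location:

```
mv mongodb-linux-x86_64-ubuntu1804-5.0.5  /usr/local/mongodb      # move to the install directory
```

Add the MongoDB `bin` directory to `PATH`: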
```
export PATH=/usr/local/mongodb/bin:$PATH
```

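3. Configure MongoDB

Create the MongoDB data and log directories and give the current user ownership of them:

```
sudo mkdir -p /var/lib/mongo           # data directory
sudo mkdir -p /var/log/mongodb         # log directory
sudo chown `whoami` /var/lib/mongo     # make the current user the owner
sudo chown `whoami` /var/log/mongodb   # make the current user the owner
```

Create a `.conf` configuration file, for example `mongodb.conf` in the `bin` directory: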

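```
# data directory
dbpath=/var/lib/mongo
# log file
logpath=/var/log/mongodb/mongod.log
# append to the log file instead of overwriting it
logappend=true
# port (default 27017)
port=27017
# IP address to listen on (0.0.0.0 accepts connections on every interface)
bind_ip=0.0.0.0
# run as a background process
fork=true
```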

4. Start MongoDB
```
cd /usr/local/mongodb/
./bin/mongod -f ./bin/mongodb.conf
```

5. Verify the installation by connecting with the `mongo` shell


```
./bin/mongo
```


#### 3.3 Kibana

1. Download Kibana (the macOS build is shown here)

   ```curl -O https://artifacts.elastic.co/downloads/kibana/kibana-7.16.2-darwin-x86_64.tar.gz```

2. Extract the archive

   ```tar -xzf kibana-7.16.2-darwin-x86_64.tar.gz```

3. Start Kibana

   ```cd kibana-7.16.2-darwin-x86_64/```

   ```./bin/kibana```

4. Verify that it is running

   Open `localhost:5601` in a browser.

#### 3.4 ProxyPool

1. Clone ProxyPool
    ```
    git clone https://github.com/Python3WebSpider/ProxyPool.git
    cd ProxyPool
    ```

2. Install its dependencies

    ```pip3 install -r requirements.txt```

3. Install Redis (see https://www.runoob.com/redis/redis-install.html) and set the connection variables
    ```
    export REDIS_HOST='localhost'
    export REDIS_PORT=6379
    export REDIS_PASSWORD=''
    export REDIS_DB=0
    ```

4. Start Redis

    ```./redis-server```

5. Start ProxyPool

    ```python3 run.py```

6. Check that a proxy is returned

    Open http://localhost:5555/random

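### 4. Run

- Run the preparation script: ```python prepare.py```

- Enter the crawler directory: ```cd Reptiles```

- Make the launch script executable: ```chmod +x run.sh```

- Start a spider: ```./run.sh ACM```

Replace `ACM` with `Springer` or `ScienceDirect` to run the other spiders. Logs are written to the `Logs` directory.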

## Dependencies
See `./requirements.txt`.
## Architecture

### Overall architecture
![](./extra/overall.png)
![](./extra/structure.png)

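The crawler is built on Scrapy. The Middleware, Pipeline, and Spiders components are customized for this project, and an IP proxy is added.

**Scrapy components**
1. Spiders: process all Responses, extract the data needed for the Item fields, and hand the URLs that need to be followed back to the engine, which passes them to the Scheduler.
2. Engine: handles communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.
3. Scheduler: receives Requests from the engine, queues them, and returns them when the engine asks for them.
4. Downloader: downloads all Requests sent by the Scrapy Engine and returns the Responses to the Scrapy Engine, which hands them to the Spider.
5. Spider Middlewares: components that can extend and customize the communication between the engine and the Spider, such as Responses entering the Spider and Requests leaving it.
6. Item Pipeline: post-processes the Items produced by the Spider, for example analysis, filtering, and storage.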
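### Crawl workflow
![](./extra/pipeline.jpg)
![](./extra/scrapy.png)

Requests are sent through the IP proxy pool, and the pipeline downloads PDFs and stores the extracted items in MongoDB.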


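## Module details

### 1. ACM

#### 1.1 Crawling strategy

- Papers are found through the ACM search endpoint "/dl.acm.org/action/doSearch".
- Query parameters:
  - AfterYear/AfterMonth/AfterDay: start of the date range (year/month/day)
  - BeforeYear/BeforeMonth/BeforeDay: end of the date range (year/month/day)
  - concept: the subject concept to filter by
  - sorted: sort order
  - startPage: page number
  - pageSize: number of results per page
- ACM returns at most 2000 results per search, so the crawl is split into date ranges that stay under this limit:
  - Request "/dl.acm.org/action/doSearch" for each range, incrementing startPage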
  - pdf
  - pdf
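
The paging scheme above can be sketched as follows. This is a minimal illustration, not the project's spider: it uses the parameter names listed above, takes the 2000 figure as a per-query result cap, and assumes an `https` scheme and a `startPage` that starts at 0. The function name `acm_search_urls` is hypothetical.

```python
from urllib.parse import urlencode

ACM_SEARCH = "https://dl.acm.org/action/doSearch"  # scheme is an assumption
MAX_RESULTS = 2000  # assumed per-query cap, taken from the 2000 figure above


def acm_search_urls(after, before, page_size=50):
    """Yield one search URL per result page for the window [after, before].

    `after` and `before` are (year, month, day) tuples.
    """
    pages = -(-MAX_RESULTS // page_size)  # ceiling division
    for start_page in range(pages):  # assumption: startPage is zero-based
        params = {
            "AfterYear": after[0], "AfterMonth": after[1], "AfterDay": after[2],
            "BeforeYear": before[0], "BeforeMonth": before[1], "BeforeDay": before[2],
            "startPage": start_page,
            "pageSize": page_size,
        }
        yield f"{ACM_SEARCH}?{urlencode(params)}"


urls = list(acm_search_urls((2020, 1, 1), (2020, 1, 31)))
print(len(urls), urls[0])
```

A date window that matches more results than the cap would need to be split into narrower windows before paging.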

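#### 1.2 Field extraction

Fields are extracted with XPath:

1. Citations (inCitations)
2. Video (video_url) and thumbnail (thumbnail_url): the thumbnail URL is built as "videodelivery.net/" + source + "/thumbnails/thumbnail.jpg?time=10.0s"
3. Abstract (abstract): pdf
4. PDF link (pdf_url) and saved path (pdf_path):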

![ACM1](./extra/acm1.png)
![ACM1](./extra/acm2.png)
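
The thumbnail rule in item 2 is a plain string concatenation. A minimal sketch, assuming an `https` scheme and that `source` is the video identifier taken from the page; the function name `thumbnail_url` and the sample identifier are hypothetical:

```python
def thumbnail_url(source: str) -> str:
    """Build the thumbnail URL for a video identifier, following the rule above."""
    # The scheme is an assumption; the rest mirrors the concatenation in the text.
    return "https://videodelivery.net/" + source + "/thumbnails/thumbnail.jpg?time=10.0s"


print(thumbnail_url("abc123"))  # "abc123" is a made-up identifier
```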

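### 2. Springer

#### 2.1 Crawling strategy

- Springer search result limits
  - At most 1000 pages of 20 results each, that is 20000 results per query
  - Each query is therefore kept under 20000 results
- Journals and conferences are crawled separately (Journal and Conference)
- Query parameters:
  - facet-content-type: Journal or ConferenceProceedings
  - search-within: Journal or Chapter
  - facet-journal-id
  - query
  - facet-eisbn

#### 2.2 Field extraction

Fields are extracted with XPath:

1. Journal and Conference
2. Year (year) and month (month)
3. References (outCitations)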

![Springer1](./extra/springer1.png)
![Springer2](./extra/springer2.png)
![Springer3](./extra/springer3.png)

### 3. ScienceDirect

#### 3.1 Crawling strategy

- ScienceDirect search results
    - The results are returned as json, so the json response is parsed directly

![sciencedirect1](./extra/sciencedirect1.png)

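- Papers are found through the ScienceDirect search API "www.sciencedirect.com/search/api?"
- Query parameters:

  - date: publication date

  - cid: venue id

  - show: number of results per page

  - offset: index of the first result

  - t: token

- ScienceDirect returns at most 6000 results per query, so each query is kept under 6000 results:

  - Request "www.sciencedirect.com/search/api?" for each combination of date and cid, stepping offset
  - Read the papers from 'searchResults' in the json response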
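
A minimal sketch of the offset stepping and json parsing described above. It uses the parameter names from the list, treats 6000 as a per-query cap, and assumes an `https` scheme; the function names, the sample `cid`, and the token value are hypothetical.

```python
import json
from urllib.parse import urlencode

SEARCH_API = "https://www.sciencedirect.com/search/api"  # scheme is an assumption
MAX_RESULTS = 6000  # assumed per-query cap, taken from the 6000 figure above


def search_api_urls(date, cid, token, show=100):
    """Yield one API URL per page of `show` results for a (date, cid) pair."""
    for offset in range(0, MAX_RESULTS, show):
        params = {"date": date, "cid": cid, "show": show, "offset": offset, "t": token}
        yield f"{SEARCH_API}?{urlencode(params)}"


def extract_results(payload: str) -> list:
    """Return the list stored under 'searchResults' in a json response body."""
    return json.loads(payload).get("searchResults", [])


# "271505" and "TOKEN" are placeholders, not real values.
urls = list(search_api_urls("2020", "271505", "TOKEN"))
print(len(urls), urls[0])
```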

#### 3.2 Field extraction

- Fields are extracted with XPath

![sciencedirect2](./extra/sciencedirect2.png)



- Title (title): read from the meta tag

![sciencedirect3](./extra/sciencedirect3.png)



- Authors (authors): built from given_name and surname

![sciencedirect9](./extra/sciencedirect9.png)

- Year (year) and month (month): read from "Publication date" in scripts

![sciencedirect4](./extra/sciencedirect4.png)

- Type (type): read from publicationType in scripts

![sciencedirect5](./extra/sciencedirect5.png)

- References (outCitations): read from document-references in scripts

![sciencedirect7](./extra/sciencedirect7.png)

- Citations (inCitations): read from hitCount in json

![sciencedirect8](./extra/sciencedirect8.png)


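### 4. IP proxy pool
- To crawl ACM, Springer, and ScienceDirect through changing IP addresses, the project uses the open-source GitHub project ProxyPool: https://github.com/Python3WebSpider/ProxyPool
- ProxyPool keeps the proxy IPs in redis and serves a random IP from redis on request.
- The ProxiesMiddleware below fetches a proxy from the pool and switches to a new one every 1000 requests.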

  ```python
  import logging

  import requests


  class ProxiesMiddleware(object):
      """Downloader middleware that routes requests through a ProxyPool proxy."""

      def __init__(self, settings):
          super(ProxiesMiddleware, self).__init__()
          self.step = 0  # number of requests processed so far
          self.proxypool_url = 'http://127.0.0.1:5555/random'  # ProxyPool endpoint
          self.proxy = self.get_random_proxy()

      @classmethod
      def from_crawler(cls, crawler):
          return cls(crawler.settings)

      def get_random_proxy(self):
          # ProxyPool answers with a plain-text "host:port" string
          proxy = requests.get(self.proxypool_url).text.strip()
          logging.info('---get_random_proxy--- ' + str(proxy))
          return proxy

      def process_request(self, request, spider):
          self.step += 1
          # switch to a fresh proxy every 1000 requests
          if self.step % 1000 == 0:
              self.proxy = self.get_random_proxy()
          request.meta['proxy'] = 'http://' + self.proxy
          request.headers["Connection"] = "close"
  ```

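### 5. Database
Records are stored in MongoDB and can be exported as json.  
#### 5.1 Deduplication
Each record carries a `checksum` field, and records with the same `checksum` are treated as duplicates.
#### 5.2 checksum computation

```python
checksum = re.sub(r'[\W\d\_]', "", info['title']).lower()
```
The `checksum` is the title with all non-word characters, digits, and underscores removed, converted to lowercase.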
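
A minimal sketch of how this `checksum` can drive deduplication. The regular expression is the one shown above; the helper names `title_checksum` and `deduplicate` and the in-memory `set` are hypothetical stand-ins for the project's MongoDB lookup.

```python
import re


def title_checksum(title: str) -> str:
    """Normalize a title with the expression shown above."""
    return re.sub(r'[\W\d\_]', "", title).lower()


def deduplicate(records):
    """Keep the first record for each checksum; later duplicates are dropped."""
    seen = set()
    unique = []
    for info in records:
        checksum = title_checksum(info["title"])
        if checksum in seen:
            continue
        seen.add(checksum)
        unique.append({**info, "checksum": checksum})
    return unique


papers = [
    {"title": "Deep Learning: A Survey (2nd ed.)"},
    {"title": "deep_learning a survey 2ND ED"},
]
print(deduplicate(papers))
```

Because digits are removed, titles that differ only in a number (for example "Part 1" and "Part 2") map to the same `checksum`.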
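#### 5.3 Database API
The database layer is written in `python` with `pymongo`.  
The code is in `./Reptiles/Reptiles/data_manager.py`.  
- Create the manager
```python
Mongo = MongoManager()
```
- Insert a record
```python
Mongo.mongodb_insert(site, info)
```
- Delete records
```python
Mongo.mongodb_delete(site, field, value)
```
- Find records
```python
Mongo.mongodb_find(site, field, value)
```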
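### 6. Elasticsearch

1. Import

   After elasticsearch and Kibana are installed, the items stored in mongodb are imported into Elasticsearch through its api.

2. Search

    With Elasticsearch and Kibana running, open localhost:5601 to view the paper index in the dashboard or query it through the api.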
![elastic](./extra/elastic.jpg)
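
One way to import documents through the Elasticsearch API is the bulk endpoint, which accepts newline-delimited json. The sketch below only builds the request body and is not the project's import code; the function name `build_bulk_body` is hypothetical, and the index name `paper` is an assumption based on the text above.

```python
import json


def build_bulk_body(docs, index="paper"):
    """Build an NDJSON body for the Elasticsearch bulk API from a list of dicts."""
    lines = []
    for doc in docs:
        doc = dict(doc)  # copy so the caller's document is not modified
        # MongoDB's _id is moved into the action line: it is a metadata field
        # in Elasticsearch and must not appear inside the document source.
        doc_id = doc.pop("_id", None)
        action = {"index": {"_index": index}}
        if doc_id is not None:
            action["index"]["_id"] = str(doc_id)
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


body = build_bulk_body([{"_id": 1, "title": "A", "source": "ACM"}])
print(body)
```

The body would be sent with `POST /_bulk` and the `Content-Type: application/x-ndjson` header, for example to `localhost:9200`.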

## Directory structure
```
README.md
requirements.txt
prepare.py // preparation script
Reptiles
    scrapy.cfg
    run.sh // launch script
    Reptiles
        convert_json.py
        items.py // item definitions
        middlewares.py // ip proxy middleware
        data_manager.py  // database manager
        venue_cid  // ScienceDirect venue ids
        pipelines.py // PDF download
        proxy.py // ip proxy
        settings.py // settings
        __init__.py
        configs
            proxylist_big.txt // ip proxy list
        spiders
            ACM.py // ACM spider
            ScienceDirect.py // ScienceDirect spider
            Springer.py // Springer spider
            __init__.py
```

Owner

  • Name: BITCS-Information-Retrieval-2021-2022
  • Login: BITCS-Information-Retrieval-2021-2022
  • Kind: organization
