Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (12.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: CarlosGFiguerola
- Language: Python
- Default Branch: main
- Size: 20.5 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
robotito
robotito is a small light web crawler, that is, a program which navigates across the web in an autonomous way. It's main pourpose is to collect cybermetric data about the web pages, but it can be easily tunned for collect every kind of data.
Install
robotito is a single Python file plus a config file - requires Python 3 with libraries: - sys, os, re, requests, urllib, operator, magic and signal - assuming you already have Python 3, you certainment also have sys, os, re and operator; install the other with pip, example:
pip install requests
- then simply copy robotito.py in a folder of your convenience
NOTE: if you intend use robotito in the Tor network, you will need torsocks running in your computer (see below)
Basic Usage
- need, at less, one URL as starting point (seed).
- admit a number of parameters in order to set rules delimiting web pages to explore, set max of pages to explore, etc.
- most of those parameters can be setted in a config file So, the most basic usage is:
python3 robotito.py -c config_file
Configuration
robotito needs some parameters to do its work. Those parameters are setted in three steps, which are
- some safe defaults, setted inside the code
- parameter values setted by means of a configuration file(they override default values)
- parameter values passed as option in the command line (they override values in config file)
Config file
Config file has directives in the form parametername:parametervalue Lines starting with # are ignored. Available parameters are:
- max_level: integer value, telling the max recursion level
- max_nodes: integer value telling wanted max of visited pages
- maxlistsize: integer value wiht the max size of to_visit list
- seed: either a filepath with a list of seeds; or a single URL as only seed. If filepath, expect to be a textfile wiht an URL per line
- cyberfile: filepath where saving finded links for later analysis; optional, if None no nothing is recorded
- session_mode: value can be 'fresh' (for a new crawl starting from the begining) or 'resume' (resuming a previous interrumpted crawl)
- sessionvisitedfile: filename where keeping visited URLs, preventing a crawl interrumption
- sessiontovisitfile: filename where keeping URLs to yet visit
- proxyhost: if connecting through a proxy, the proxy host address
- proxyport: if connectign thorugh a proxy, the proxy port number
- rule
- can (should) be several rules defining which links the crawler must follow, and which not.
- rules are regular expressions preceded with '+' or '-'.
- '-' rules mean the crawler not follow links matching regular expression
- '+' rules mean the crawler must follow links matching regular expression
- cyberrule: same as rule, but just for save links in cyberfile
- proxyhost: proxy address, if any
- proxyport: proxy port, if any
- mode: queue|stack|freq
- useragent: agent self-identification (defalut='robotito')
Options in command line
Some parameter can be setted also as option in command line
- -c: pathfile of config file
- -o: pathfile to cyberfile
- -l: max level recursion
- -s: single seed or filepath with a list of seeds
- -resume: continue with a prevous interrumpted crawl
- -p: print parameters and settings
robotito and Tor
robotito can to explore and crawl the Tor network. To do that, it needs software torsocks (see https://github.com/dgoulet/torsocks) working on your computer. torsocks also comes in major Linux distros repositories.
torsocks set a kind of proxy, tipically running on your localhost, port 9050 (although you can change those parameters trough the configuration of torsocks). All you need, in addition to torsocks itself, is set proxyhoist y proxyport in your config file; as well as some good Tor addresses as seeds.
Owner
- Name: Carlos G. Figuerola
- Login: CarlosGFiguerola
- Kind: user
- Location: Salamanca, Spain
- Company: University of Salamanca
- Website: http://buho.usal.es/figuerola
- Repositories: 1
- Profile: https://github.com/CarlosGFiguerola
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Figuerola
given-names: Carlos G.
title: "robotito. A light crawler"
version: 2.0
identifiers:
- type: doi
value: 10.5281/zenodo.1234
date-released: 2024-01-06
url: https://github.com/CarlosGFiguerola/robotito