Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.4%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: CarlosGFiguerola
  • Language: Python
  • Default Branch: main
  • Size: 20.5 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 3 years ago · Last pushed over 1 year ago
Metadata Files
  • Readme
  • Citation

README.md

robotito

robotito is a small, light web crawler, that is, a program which navigates across the web autonomously. Its main purpose is to collect cybermetric data about web pages, but it can easily be tuned to collect other kinds of data.

Install

robotito is a single Python file plus a config file. It requires Python 3 with the libraries sys, os, re, requests, urllib, operator, magic and signal. Assuming you already have Python 3, you certainly also have sys, os, re, urllib, operator and signal (they belong to the standard library); install the others with pip, for example:

pip install requests

Then simply copy robotito.py into a folder of your convenience.
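Note that the magic import is usually provided by the python-magic package on PyPI (an assumption about this project; check robotito.py if pip complains), so both external dependencies can be installed at once:

pip install requests python-magic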

NOTE: if you intend to use robotito on the Tor network, you will need torsocks running on your computer (see below)

Basic Usage

  • robotito needs at least one URL as a starting point (seed).
  • It accepts a number of parameters in order to set rules delimiting which web pages to explore, the maximum number of pages to explore, etc.
  • Most of those parameters can be set in a config file.

So, the most basic usage is:

python3 robotito.py -c config_file
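Options can be combined in a single invocation; for instance, this hypothetical run adds a seed and a recursion limit using the -s and -l options documented below:

python3 robotito.py -c config_file -s https://example.org -l 3

Values given on the command line override those in the config file, as explained in the next section.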

Configuration

robotito needs some parameters to do its work. Those parameters are set in three steps (a short sketch in code follows the list):

  • some safe defaults, set inside the code
  • parameter values set by means of a configuration file (they override the default values)
  • parameter values passed as options on the command line (they override the values in the config file)
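As an illustration of this override order only (a minimal sketch, not robotito's actual code; the names and values here are invented), the three steps amount to layering dictionaries, with later layers winning:

# minimal sketch of the three-step parameter resolution (illustrative only)
defaults = {"max_level": 2, "useragent": "robotito"}   # 1. safe defaults inside the code

file_params = {}
with open("config_file") as f:                         # 2. directives from the config file
    for line in f:
        line = line.strip()
        if line and not line.startswith("#"):          # lines starting with # are ignored
            key, _, value = line.partition(":")
            file_params[key.strip()] = value.strip()

cli_params = {"max_level": "5"}                        # 3. options parsed from the command line

params = {**defaults, **file_params, **cli_params}     # later layers override earlier ones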

Config file

The config file has directives in the form parametername:parametervalue. Lines starting with # are ignored. The available parameters are listed below; a sample config file follows the list:

  • max_level: integer value giving the maximum recursion level
  • max_nodes: integer value giving the desired maximum number of visited pages
  • maxlistsize: integer value with the maximum size of the to_visit list
  • seed: either a filepath with a list of seeds, or a single URL as the only seed. If a filepath, it is expected to be a text file with one URL per line
  • cyberfile: filepath where found links are saved for later analysis; optional, if None nothing is recorded
  • session_mode: value can be 'fresh' (for a new crawl starting from the beginning) or 'resume' (resuming a previously interrupted crawl)
  • sessionvisitedfile: filename where visited URLs are kept, so that an interrupted crawl can be resumed
  • sessiontovisitfile: filename where URLs yet to be visited are kept
  • proxyhost: if connecting through a proxy, the proxy host address
  • proxyport: if connecting through a proxy, the proxy port number
  • rule
    • there can (and should) be several rules defining which links the crawler must follow, and which not
    • rules are regular expressions preceded by '+' or '-'
    • '-' rules mean the crawler must not follow links matching the regular expression
    • '+' rules mean the crawler must follow links matching the regular expression
  • cyberrule: same as rule, but only for saving links to cyberfile
  • mode: queue|stack|freq
  • useragent: agent self-identification (default='robotito')
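Putting the directives together, a small config file might look like the following. The values and URLs are illustrative, and the exact rule:+regex / rule:-regex spelling is an assumption derived from the parametername:parametervalue form described above:

# sample robotito config (illustrative values)
seed:https://example.org
max_level:3
max_nodes:1000
maxlistsize:10000
cyberfile:links.txt
session_mode:fresh
sessionvisitedfile:visited.txt
sessiontovisitfile:to_visit.txt
mode:queue
useragent:robotito
rule:+^https://example\.org/
rule:-\.pdf$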

Options in command line

Some parameters can also be set as options on the command line:

  • -c: filepath of the config file
  • -o: filepath of the cyberfile
  • -l: maximum recursion level
  • -s: single seed or filepath with a list of seeds
  • -resume: continue a previously interrupted crawl
  • -p: print parameters and settings
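For example, a previously interrupted crawl can be picked up again with the session files named in the config file:

python3 robotito.py -c config_file -resume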

robotito and Tor

robotito can explore and crawl the Tor network. To do that, it needs the torsocks software (see https://github.com/dgoulet/torsocks) working on your computer. torsocks is also available in the repositories of major Linux distributions.

torsocks sets up a kind of proxy, typically running on your localhost, port 9050 (although you can change those parameters through the configuration of torsocks). All you need, in addition to torsocks itself, is to set proxyhost and proxyport in your config file, as well as some good Tor addresses as seeds.
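For instance, with torsocks at its default address, the corresponding config lines would be (the .onion seed is a placeholder, not a real address):

proxyhost:127.0.0.1
proxyport:9050
seed:http://exampleonionaddress.onion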

Owner

  • Name: Carlos G. Figuerola
  • Login: CarlosGFiguerola
  • Kind: user
  • Location: Salamanca, Spain
  • Company: University of Salamanca

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Figuerola
    given-names: Carlos G.
title: "robotito. A light crawler"
version: 2.0
identifiers:
  - type: doi
    value: 10.5281/zenodo.1234
date-released: 2024-01-06
url: https://github.com/CarlosGFiguerola/robotito
