https://github.com/adbar/courlan

Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters


Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.9%) to scientific vocabulary

Keywords

crawler crawling recon tld uri url url-checker url-normalization url-parser url-parsing url-validation

Keywords from Contributors

lemmatization tokenization text-preprocessing text-extraction text-cleaning tei scraping rss-feed readability rag
Last synced: 6 months ago

Repository

Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters

Basic Info
Statistics
  • Stars: 145
  • Watchers: 1
  • Forks: 9
  • Open Issues: 10
  • Releases: 28
Topics
crawler crawling recon tld uri url url-checker url-normalization url-parser url-parsing url-validation
Created over 10 years ago · Last pushed about 1 year ago
Metadata Files
Readme Changelog Contributing Funding License

README.md

coURLan: Clean, filter, normalize, and sample URLs


Why coURLan?

"It is important for the crawler to visit 'important' pages first, so that the fraction of the Web that is visited (and kept up to date) is more meaningful." (Cho et al. 1998)

"Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained." (Edwards et al. 2001)

This library provides an additional "brain" for web crawling, scraping and document management. It facilitates web navigation through a set of filters, enhancing the quality of resulting document collections:

  • Save bandwidth and processing time by steering clear of pages deemed low-value
  • Identify specific pages based on language or text content
  • Pinpoint pages relevant for efficient link gathering

Additional utilities include URL storage, filtering, and deduplication.

Features

Separate the wheat from the chaff and optimize document discovery and retrieval:

  • URL handling
    • Validation
    • Normalization
    • Sampling
  • Heuristics for link filtering
    • Spam, trackers, and content-types
    • Locales and internationalization
    • Web crawling (frontier, scheduling)
  • Data store specifically designed for URLs
  • Usable with Python or on the command-line

Let the coURLan fish up juicy bits for you!

Courlan bird

Here is a courlan (source: Limpkin at Harn's Marsh by Russ, CC BY 2.0).

Installation

This package is compatible with all common versions of Python and is tested on Linux, macOS and Windows systems.

Courlan is available on the package repository PyPI and can notably be installed with the Python package manager pip:

```bash
$ pip install courlan    # pip3 install on systems where both Python 2 and 3 are installed
$ pip install --upgrade courlan    # to make sure you have the latest version
$ pip install git+https://github.com/adbar/courlan.git    # latest available code (see build status above)
```

The last version to support Python 3.6 and 3.7 is courlan==1.2.0.

Python

Most filters revolve around the strict and language arguments.

check_url()

All useful operations chained in check_url(url):

```python
>>> from courlan import check_url

# return url and domain name
>>> check_url('https://github.com/adbar/courlan')
('https://github.com/adbar/courlan', 'github.com')

# filter out bogus domains
>>> check_url('http://666.0.0.1/')

# tracker removal
>>> check_url('http://test.net/foo.html?utm_source=twitter#gclid=123')
('http://test.net/foo.html', 'test.net')

# use strict for further trimming
>>> my_url = 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org'
>>> check_url(my_url, strict=True)
('https://httpbin.org/redirect-to', 'httpbin.org')

# check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)

# include navigation pages instead of discarding them
>>> check_url('http://www.example.org/page/10/', with_nav=True)

# remove trailing slash
>>> check_url('https://github.com/adbar/courlan/', trailing_slash=False)
```

Language-aware heuristics, notably internationalization in URLs, are available in lang_filter(url, language):

```python
# optional language argument
>>> url = 'https://www.un.org/en/about-us'

# success: returns clean URL and domain name
>>> check_url(url, language='en')
('https://www.un.org/en/about-us', 'un.org')

# failure: doesn't return anything
>>> check_url(url, language='de')

# optional argument: strict
>>> url = 'https://en.wikipedia.org/'
>>> check_url(url, language='de', strict=False)
('https://en.wikipedia.org', 'wikipedia.org')
>>> check_url(url, language='de', strict=True)
```

Define stricter restrictions on the expected content type with strict=True. This also blocks certain platforms and page types where machines get lost.

```python
# strict filtering: blocked as it is a major platform
>>> check_url('https://www.twitch.com/', strict=True)
```

Sampling by domain name

```python
>>> from courlan import sample_urls
>>> my_urls = ['https://example.org/' + str(x) for x in range(100)]
>>> my_sample = sample_urls(my_urls, 10)
# optional: exclude_min=None, exclude_max=None, strict=False, verbose=False
```

Web crawling and URL handling

Link extraction and preprocessing:

```python
>>> from courlan import extract_links
# minimal HTML document containing one link
>>> doc = '<html><body><a href="test/link.html">Link</a></body></html>'
>>> url = "https://example.org"
>>> extract_links(doc, url)
{'https://example.org/test/link.html'}
# other options: external_bool, no_filter, language, strict, redirects, ...
```

The filter_links() function provides additional filters for crawling purposes: use of robots.txt rules and link prioritization. See courlan.core for details.
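The exact call signature of filter_links() is not reproduced here; as a rough, hypothetical illustration of the same idea, robots.txt rules and a simple priority ordering can be combined by hand with the functions documented in this README (the actual API in courlan.core may differ):

```python
from urllib import robotparser

from courlan import extract_links, is_navigation_page

# hypothetical sketch, not the actual filter_links() implementation:
# parse robots.txt rules and rank navigation pages first
rules = robotparser.RobotFileParser()
rules.parse(['User-agent: *', 'Disallow: /login'])

html_doc = '<html><body><a href="/category/news/">News</a> <a href="/login">Login</a></body></html>'
candidates = extract_links(html_doc, 'https://example.org')

# keep only URLs allowed by the robots.txt rules
allowed = [u for u in candidates if rules.can_fetch('*', u)]

# put navigation/overview pages first to expand the crawl frontier faster
prioritized = sorted(allowed, key=is_navigation_page, reverse=True)
print(prioritized)
```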

Determine if a link leads to another host:

```python
>>> from courlan import is_external
>>> is_external('https://github.com/', 'https://www.microsoft.com/')
True

# default
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
False

# taking suffixes into account
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
True
```

Other useful functions dedicated to URL handling:

  • extract_domain(url, fast=True): find domain and subdomain or just domain with fast=False
  • get_base_url(url): strip the URL of some of its parts
  • get_host_and_path(url): decompose URLs in two parts: protocol + host/domain and path
  • get_hostinfo(url): extract domain and host info (protocol + host/domain)
  • fix_relative_urls(baseurl, url): prepend necessary information to relative links

```python
>>> from courlan import *
>>> url = 'https://www.un.org/en/about-us'

>>> get_base_url(url)
'https://www.un.org'

>>> get_host_and_path(url)
('https://www.un.org', '/en/about-us')

>>> get_hostinfo(url)
('un.org', 'https://www.un.org')

>>> fix_relative_urls('https://www.un.org', 'en/about-us')
'https://www.un.org/en/about-us'
```
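The block above covers most of the listed helpers; as an illustration of extract_domain(), here is a minimal call sketch. The expected outputs are not shown, and the comments only paraphrase the parameter description in the list above:

```python
from courlan import extract_domain

# with the default fast=True: domain and subdomain (see the description above)
print(extract_domain('https://www.un.org/en/about-us'))

# with fast=False: just the domain
print(extract_domain('https://www.un.org/en/about-us', fast=False))
```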

Other filters dedicated to crawl frontier management:

  • is_not_crawlable(url): check for deep web or pages generally not usable in a crawling context
  • is_navigation_page(url): check for navigation and overview pages

```python
>>> from courlan import is_navigation_page, is_not_crawlable
>>> is_navigation_page('https://www.randomblog.net/category/myposts')
True
>>> is_not_crawlable('https://www.randomblog.net/login')
True
```

See also the URL management page of the Trafilatura documentation.

Python helpers

Helper function, scrub and normalize:

```python
>>> from courlan import clean_url
>>> clean_url('HTTPS://WWW.DWDS.DE:80/')
'https://www.dwds.de'
```

Basic scrubbing only:

```python
>>> from courlan import scrub_url
```

Basic canonicalization/normalization only, i.e. modifying and standardizing URLs in a consistent manner:

```python
>>> from urllib.parse import urlparse
>>> from courlan import normalize_url
>>> my_url = normalize_url(urlparse(my_url))

# passing URL strings directly also works
>>> my_url = normalize_url(my_url)

# remove unnecessary components and re-order query elements
>>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
'http://test.net/foo.html?page=2&post=abc'
```

Basic URL validation only:

```python
>>> from courlan import validate_url
>>> validate_url('http://1234')
(False, None)
>>> validate_url('http://www.example.org/')
(True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))
```

Troubleshooting

Courlan uses an internal cache to speed up URL parsing. It can be reset as follows:

```python
>>> from courlan.meta import clear_caches
>>> clear_caches()
```

UrlStore class

The UrlStore class allows for storing and retrieving domain-classified URLs, where a URL like https://example.org/path/testpage is stored as the path /path/testpage within the domain https://example.org. It features the following methods (a short usage sketch follows the list):

  • URL management

    • add_urls(urls=[], appendleft=None, visited=False): Add a list of URLs to the (possibly) existing one. Optional: append certain URLs to the left, specify if the URLs have already been visited.
    • add_from_html(htmlstring, url, external=False, lang=None, with_nav=True): Extract and filter links in an HTML string.
    • discard(domains): Declare domains void and prune the store.
    • dump_urls(): Return a list of all known URLs.
    • print_urls(): Print all URLs in store (URL + TAB + visited or not).
    • print_unvisited_urls(): Print all unvisited URLs in store.
    • get_all_counts(): Return all download counts for the hosts in store.
    • get_known_domains(): Return all known domains as a list.
    • get_unvisited_domains(): Find all domains for which there are unvisited URLs.
    • total_url_number(): Find number of all URLs in store.
    • is_known(url): Check if the given URL has already been stored.
    • has_been_visited(url): Check if the given URL has already been visited.
    • filter_unknown_urls(urls): Take a list of URLs and return the currently unknown ones.
    • filter_unvisited_urls(urls): Take a list of URLs and return the currently unvisited ones.
    • find_known_urls(domain): Get all already known URLs for the given domain (e.g. https://example.org).
    • find_unvisited_urls(domain): Get all unvisited URLs for the given domain.
    • reset(): Re-initialize the URL store.
  • Crawling and downloads

    • get_url(domain): Retrieve a single URL and consider it to be visited (with corresponding timestamp).
    • get_rules(domain): Return the stored crawling rules for the given website.
    • store_rules(website, rules=None): Store crawling rules for a given website.
    • get_crawl_delay(): Return the delay as extracted from robots.txt, or a given default.
    • get_download_urls(max_urls=100, time_limit=10): Get a list of immediately downloadable URLs according to the given time limit per domain.
    • establish_download_schedule(max_urls=100, time_limit=10): Get up to the specified number of URLs along with a suitable backoff schedule (in seconds).
    • download_threshold_reached(threshold): Find out if the download limit (in seconds) has been reached for one of the websites in store.
    • unvisited_websites_number(): Return the number of websites for which there are still URLs to visit.
    • is_exhausted_domain(domain): Tell if all known URLs for the website have been visited.
  • Persistence

    • write(filename): Save the store to disk.
    • load_store(filename): Read a UrlStore from disk (separate function, not class method).
  • Optional settings:

    • compressed=True: activate compression of URLs and rules
    • language=XX: focus on a particular target language (two-letter code)
    • strict=True: stricter URL filtering
    • verbose=True: dump URLs if interrupted (requires use of signal)
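The following is a minimal sketch based on the method list above; the URLs and the values noted in the comments are assumptions for demonstration purposes, not guaranteed output:

```python
from courlan import UrlStore

# minimal sketch: store two URLs for one domain and walk through them
url_store = UrlStore(compressed=False, strict=False)
url_store.add_urls(['https://example.org/path/testpage', 'https://example.org/other'])

print(url_store.total_url_number())                     # number of stored URLs
print(url_store.is_known('https://example.org/other'))  # True once stored

# retrieve one URL for download; it is then considered visited
next_url = url_store.get_url('https://example.org')
print(next_url)

print(url_store.is_exhausted_domain('https://example.org'))  # False while URLs remain unvisited
print(url_store.find_unvisited_urls('https://example.org'))  # remaining URLs for this domain
```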

Command-line

The main functions are also available through a command-line utility:

```bash
$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help
usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v]
               [-p PARALLEL] [--strict] [-l LANGUAGE] [-r] [--sample SAMPLE]
               [--exclude-max EXCLUDEMAX] [--exclude-min EXCLUDEMIN]

Command-line interface for Courlan

options:
  -h, --help            show this help message and exit

I/O:
  Manage input and output

  -i INPUTFILE, --inputfile INPUTFILE
                        name of input file (required)
  -o OUTPUTFILE, --outputfile OUTPUTFILE
                        name of output file (required)
  -d DISCARDEDFILE, --discardedfile DISCARDEDFILE
                        name of file to store discarded URLs (optional)
  -v, --verbose         increase output verbosity
  -p PARALLEL, --parallel PARALLEL
                        number of parallel processes (not used for sampling)

Filtering:
  Configure URL filters

  --strict              perform more restrictive tests
  -l LANGUAGE, --language LANGUAGE
                        use language filter (ISO 639-1 code)
  -r, --redirects       check redirects

Sampling:
  Use sampling by host, configure sample size

  --sample SAMPLE       size of sample per domain
  --exclude-max EXCLUDEMAX
                        exclude domains with more than n URLs
  --exclude-min EXCLUDEMIN
                        exclude domains with less than n URLs
```
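For instance, the filtering and sampling options can be combined as follows (file names are placeholders):

```bash
# keep pages passing the English language filter, apply strict rules,
# and sample up to 50 URLs per domain
$ courlan --inputfile raw-urls.txt --outputfile sample.txt --language en --strict --sample 50
```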

License

coURLan is distributed under the Apache 2.0 license.

Versions prior to v1 were released under the GPLv3+ license.

Settings

courlan is optimized for English and German but its generic approach is also usable in other contexts.

Details of strict URL filtering can be reviewed and changed in the file settings.py. To override the default settings, clone the repository and re-install the package locally.
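A possible workflow, assuming the settings file lives at courlan/settings.py in the cloned repository:

```bash
# hypothetical workflow: adjust the filtering settings and re-install locally
$ git clone https://github.com/adbar/courlan.git
$ cd courlan
# edit courlan/settings.py as needed, then re-install the package
$ pip install --no-deps --upgrade .
```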

Author

Initially launched to create text databases for research purposes at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units), this package continues to be maintained but its future development depends on community support.

If you value this software or depend on it for your product, consider sponsoring it and contributing to its codebase. Your support on GitHub or ko-fi.com will help maintain and enhance this package. Visit the Contributing page for more information.

Reach out via the software repository or the contact page for inquiries, collaborations, or feedback.

For more on Courlan's software ecosystem see this graphic.

Similar work

Several other Python libraries perform similar URL handling and normalization tasks, but they do not entail language or content filters, nor do they primarily focus on crawl optimization.

References

  • Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. Computer networks and ISDN systems, 30(1-7), 161–172.
  • Edwards, J., McCurley, K. S., & Tomlin, J. A. (2001). An adaptive model for optimizing performance of an incremental web crawler. In Proceedings of the 10th International Conference on World Wide Web (WWW '01), pp. 106–113.

Owner

  • Name: Adrien Barbaresi
  • Login: adbar
  • Kind: user
  • Location: Berlin
  • Company: Berlin-Brg. Academy of Sciences (BBAW)

Research scientist – natural language processing, web scraping and text analytics. Mostly with Python.

GitHub Events

Total
  • Create event: 7
  • Release event: 1
  • Issues event: 2
  • Watch event: 25
  • Delete event: 5
  • Issue comment event: 5
  • Push event: 8
  • Pull request event: 11
Last Year
  • Create event: 7
  • Release event: 1
  • Issues event: 2
  • Watch event: 25
  • Delete event: 5
  • Issue comment event: 5
  • Push event: 8
  • Pull request event: 11

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 281
  • Total Committers: 3
  • Avg Commits per committer: 93.667
  • Development Distribution Score (DDS): 0.011
Past Year
  • Commits: 68
  • Committers: 2
  • Avg Commits per committer: 34.0
  • Development Distribution Score (DDS): 0.015
Top Committers
Name Email Commits
Adrien Barbaresi b****i@b****e 278
sourcery-ai[bot] 5****] 2
feltcat 5****t 1
Committer Domains (Top 20 + Academic)
bbaw.de: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 32
  • Total pull requests: 93
  • Average time to close issues: 25 days
  • Average time to close pull requests: 8 days
  • Total issue authors: 5
  • Total pull request authors: 5
  • Average comments per issue: 0.72
  • Average comments per pull request: 0.99
  • Merged pull requests: 79
  • Bot issues: 0
  • Bot pull requests: 14
Past Year
  • Issues: 6
  • Pull requests: 13
  • Average time to close issues: 7 days
  • Average time to close pull requests: about 6 hours
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 0.17
  • Average comments per pull request: 1.0
  • Merged pull requests: 12
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
  • adbar (27)
  • mikewolfd (1)
  • sbusso (1)
  • donbowman (1)
  • Ristellise (1)
  • drFerg (1)
Pull Request Authors
  • adbar (104)
  • sourcery-ai[bot] (12)
  • dependabot[bot] (5)
  • naz-theori (2)
  • feltcat (1)
Top Labels
Issue Labels
enhancement (15) maintenance (7) bug (4) documentation (2) question (1)
Pull Request Labels
dependencies (5) github_actions (4)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 1,235,925 last-month
  • Total docker downloads: 548
  • Total dependent packages: 8
  • Total dependent repositories: 31
  • Total versions: 31
  • Total maintainers: 1
pypi.org: courlan

Clean, filter and sample URLs to optimize data collection – includes spam, content type and language filters.

  • Versions: 31
  • Dependent Packages: 8
  • Dependent Repositories: 31
  • Downloads: 1,235,925 Last month
  • Docker Downloads: 548
Rankings
Downloads: 0.7%
Dependent packages count: 2.4%
Docker downloads count: 2.4%
Dependent repos count: 2.6%
Average: 5.4%
Stargazers count: 10.2%
Forks count: 14.2%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • langcodes *
  • tld *
  • urllib3 *
.github/workflows/codeql.yml actions
  • actions/checkout v3 composite
  • github/codeql-action/analyze v2 composite
  • github/codeql-action/autobuild v2 composite
  • github/codeql-action/init v2 composite
.github/workflows/tests.yml actions
  • actions/cache v2 composite
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • codecov/codecov-action v3 composite