Recent Releases of https://github.com/adbar/courlan
https://github.com/adbar/courlan - courlan-1.2.0
- more compact UrlStore: use bytes instead of str for URL paths (#88)
- UrlStore maintenance: deprecate
timelimitargument (#101) - maintenance: simplify code (#103)
- support for Python 3.13
- Python
Published by adbar over 1 year ago
https://github.com/adbar/courlan - courlan-1.1.0
- replace
langcodesbybabeland use its information on locales (#89, #92) - simplified and faster code: domain extraction, cleaning, filters and UrlStore (#90, #93, #94, #95)
- UrlStore: better url batches, replace
timelimitparameter bytime_limit(#91) - maintenance: update readme and convert it to markdown (#97)
- Python
Published by adbar almost 2 years ago
https://github.com/adbar/courlan - courlan-1.0.0
- license change from GPLv3+ to Apache 2.0 (#81)
- UrlStore:
write()method andload_store()function added (#83) - add parameter
trailing_slashto keep of discard slashes at the end of URLs (#52) - maintenance: fix whitespace in
clean_url()(#77), simplify code (#79)
- Python
Published by adbar about 2 years ago
https://github.com/adbar/courlan - courlan-0.9.5
- IRI to URI normalization: encode path, query and fragments (#58, #60)
- normalization: strip common trackers (#65)
- new function
is_valid_url()(#63) - hardening of domain filter (#64)
Full Changelog: https://github.com/adbar/courlan/compare/v0.9.4...v0.9.5
- Python
Published by adbar about 2 years ago
https://github.com/adbar/courlan - courlan-0.9.4
- new UrlStore functions:
add_from_html()(#42),discard()(#44),get_unvisited_domains - CLI: removed
--samplesize, use--samplewith an integer instead (#54) - added plausibility filter for domains/hosts (#48)
- speedups and more efficient processing (#47, #49, #50)
- fixed handling of relative URLs with @feltcat in #46
- fixed bugs and ensured compatibility (#41, #43, #51, #56)
- official support for Python 3.12
Full Changelog: https://github.com/adbar/courlan/compare/v0.9.3...v0.9.4
- Python
Published by adbar over 2 years ago
https://github.com/adbar/courlan - courlan-0.9.3
- more efficient URL parsing (#33)
- refined link extraction and link filters (#30, #36)
- more efficient normalization (#32)
- more efficient sampling strategy (#31, #35)
- added meta function to clear LRU caches (#34)
- added parallel option in command-line interface (#37, #39)
- added
get_unvisited_domains()method toUrlStore(#40)
Full Changelog: https://github.com/adbar/courlan/compare/v0.9.2...v0.9.3
- Python
Published by adbar over 2 years ago
https://github.com/adbar/courlan - courlan-0.9.2
- add blogspot archives to type filter
- maintenance: upgrade urllib3 and review code
- Python
Published by adbar almost 3 years ago
https://github.com/adbar/courlan - courlan-0.9.1
- network tests: larger throughput
- UrlStore: optional compression of rules (#21), added
reset()(#22) andget_all_counts()methods - UrlStore fixes:
signalin #18,total_url_number - updated Readme
Full Changelog: https://github.com/adbar/courlan/compare/v0.9.0...v0.9.1
- Python
Published by adbar almost 3 years ago
https://github.com/adbar/courlan - courlan-0.9.0
- hardening of filters and URL parses (#14)
- normalize punicode to unicode
- methods added to
UrlStore:get_crawl_delay(),print_unvisited_urls() UrlStorenow triggers exit code 1 when interrupted- argument added to
extract_links():no_filter - code refactoring: simplifications
Full Changelog: https://github.com/adbar/courlan/compare/v0.8.3...v0.9.0
- Python
Published by adbar almost 3 years ago
https://github.com/adbar/courlan - courlan-0.8.3
- fixed bug in domain name extraction
- uniform logging parameters
Full Changelog: https://github.com/adbar/courlan/compare/v0.8.2...v0.8.3
- Python
Published by adbar over 3 years ago
https://github.com/adbar/courlan - courlan-0.8.2
- full type hinting
- maintenance: code linted
Full Changelog: https://github.com/adbar/courlan/compare/v0.8.1...v0.8.2
- Python
Published by adbar over 3 years ago
https://github.com/adbar/courlan - courlan-0.8.1
- add type annotations and check with
mypy url_filter()function moved from Trafilatura- code style: use
black
- Python
Published by adbar over 3 years ago
https://github.com/adbar/courlan - courlan-0.8.0
- performance optimizations
- fast track for domain extraction (
extract_domain(url, fast=True)), now taking subdomains into account
Full Changelog: https://github.com/adbar/courlan/compare/v0.7.2...v0.8.0
- Python
Published by adbar over 3 years ago
https://github.com/adbar/courlan - courlan-0.7.2
UrlStore: threading lock and convenience functions added
- Python
Published by adbar almost 4 years ago
https://github.com/adbar/courlan - courlan-0.7.1
- bug in sampling fixed
UrlStore: validation by default
Full Changelog: https://github.com/adbar/courlan/compare/v0.7.0...v0.7.1
- Python
Published by adbar almost 4 years ago
https://github.com/adbar/courlan - courlan-0.7.0
UrlStoreclass added: data store containing URLs with relevant information- code cleaning and maintenance (bugs, simplification)
Full Changelog: https://github.com/adbar/courlan/compare/v0.6.0...v0.7.0
- Python
Published by adbar almost 4 years ago
https://github.com/adbar/courlan - courlan-0.6.0
- reviewed code base: simplicity and execution speed
- dropped support for Python 3.5
- Python
Published by adbar over 4 years ago
https://github.com/adbar/courlan - courlan-0.5.0
- more complex language heuristics, use langcodes
- extended blacklists and whitelists
- more precise filters and more efficient code
- support for Python 3.10
- Python
Published by adbar over 4 years ago
https://github.com/adbar/courlan - courlan-0.4.2
- enhanced cleaning
- fixed language filter
- Python
Published by adbar over 4 years ago
https://github.com/adbar/courlan - courlan-0.4.1
- keep trailing slashes to avoid redirection
- fixes: normalization and crawlable URLs
- Python
Published by adbar over 4 years ago
https://github.com/adbar/courlan - courlan-0.4.0
- URL manipulation tools added: extract parts, fix relative URLs
- filters added: language, navigation and crawls
- more robust link handling and extraction
- removed support for Python 3.4
- Python
Published by adbar over 4 years ago
https://github.com/adbar/courlan - courlan-0.3.1
- improve filter precision
- Python
Published by adbar about 5 years ago
https://github.com/adbar/courlan - courlan-0.3.0
- reduced dependencies: replace
requestswith bareurllib3, andtldextractwithtldfor Python 3.6 upwards - better path and fragment normalization
- Python
Published by adbar about 5 years ago
https://github.com/adbar/courlan - courlan-0.2.3
- Python 3.9 compatibility
- Simplified imports
- Bug fixes
- Python
Published by adbar over 5 years ago
https://github.com/adbar/courlan - courlan-0.2.2
- English and German language filters
- Function to detect external links
- Support for domain blacklisting
- Python
Published by adbar over 5 years ago
https://github.com/adbar/courlan - courlan-0.2.1
- Less aggressive strict filters
- CLI bug fixed
- Python
Published by adbar over 5 years ago
https://github.com/adbar/courlan - courlan-0.2.0
- Cleaner and more efficient filtering
- Helper functions to scrub, clean and normalize
- Removed two dependencies with more extensive usage of
urllib.parse
- Python
Published by adbar over 5 years ago
https://github.com/adbar/courlan - courlan-0.1.0
- Cleaning and filtering targeting non-spam HTML pages with primarily text
- URL validation
- Sampling by domain name
- Command-line interface (CLI) and Python tool
- Python
Published by adbar over 5 years ago