Recent Releases of wacksy

wacksy - v0.0.2

This release involves some refactoring: different parts of the indexer now live in their own modules. That made it easier to write unit tests for each resource, so I've now done that, along with two integration tests. The tests cover only the basics; I expect to expand them in future to check error cases and other things.

The page record indexer now only indexes records that meet a set of conditions meant to establish that the record is a web document. Unfortunately the WACZ spec does not define what a page is in terms we can use here, so I have come up with the following conditions:

  • The WARC record type is either Response, Revisit, or Resource.
  • The HTTP content-type is either text/html, application/xhtml+xml, or text/plain.
  • The HTTP status code is 200 OK.

This is an imperfect best-guess attempt to pick out things which might be pages from a WARC file. I filter for successful status codes because some failed requests return an HTML page in the response along with a 404 error. Those are definitely pages, but I guess they're not what people want out of the pages.jsonl index.
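
The three conditions can be sketched as a single predicate. This is only an illustration and not the library's actual code: the RecordType enum, the RecordSummary struct, and the is_page function are hypothetical names made up for this example.

```rust
/// A subset of WARC record types, enough for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RecordType {
    Response,
    Revisit,
    Resource,
    Request,
}

/// Hypothetical summary of the fields the filter looks at.
struct RecordSummary<'a> {
    record_type: RecordType,
    content_type: &'a str,
    status: u16,
}

/// Returns true when the record satisfies all three page conditions.
fn is_page(record: &RecordSummary) -> bool {
    let type_ok = matches!(
        record.record_type,
        RecordType::Response | RecordType::Revisit | RecordType::Resource
    );

    // Ignore parameters such as "; charset=utf-8" when comparing media types.
    let media_type = record
        .content_type
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    let content_ok = matches!(
        media_type.as_str(),
        "text/html" | "application/xhtml+xml" | "text/plain"
    );

    type_ok && content_ok && record.status == 200
}
```

With this sketch, a Response record with content type "text/html; charset=utf-8" and status 200 passes, while the same record with status 404 does not.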

I made a brief attempt to replace sha256 with the faster blake3 hashing algorithm, but this breaks compatibility with py-wacz. I think this will have to wait until blake3 can be integrated into the Python standard library as part of hashlib.

Dependencies

  • This library now depends on surt-rs to create searchable URL strings. It's a fairly minimal library, and it is more comprehensive than my own attempt at writing a surt-ing function.
  • Bump rawzip to 0.3 (#41), thanks @nickbabcock!
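
To give a rough idea of what a searchable URL string looks like, here is a deliberately simplified sketch. It does not use or reproduce the surt-rs API; simple_surt is a hypothetical function that only lowercases the host, reverses its labels, and appends the path. Ports, userinfo, and query handling are all ignored here.

```rust
/// Simplified SURT-style key: reversed host labels, then ")" and the path.
/// Returns None when the input has no "://" scheme separator or no host.
fn simple_surt(url: &str) -> Option<String> {
    // Drop the scheme ("http://", "https://", ...).
    let (_, rest) = url.split_once("://")?;

    // Split the host from the path; default to "/" when there is no path.
    let (host, path) = match rest.find('/') {
        Some(idx) => (&rest[..idx], &rest[idx..]),
        None => (rest, "/"),
    };
    if host.is_empty() {
        return None;
    }

    // Hosts are case-insensitive, so normalise before reversing the labels.
    let host = host.to_ascii_lowercase();
    let reversed: Vec<&str> = host.split('.').rev().collect();

    Some(format!("{}){}", reversed.join(","), path))
}
```

Under these assumptions, "https://Example.com/about" becomes "com,example)/about", which sorts next to other captures from the same domain.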

- Rust
Published by github-actions[bot] 11 months ago

wacksy - v0.0.1

As of this point, the WACZ writer and indexer can output (almost) everything needed to get from a WARC file to a fully spec-compliant WACZ file. The last thing missing was the pages.jsonl file, which is now produced when reading through the WARC file as part of the indexer. I want to avoid reading through the WARC twice to produce two files, so I have wrapped everything into one indexer. Again, there's probably a better way of doing this.

The other happy change in this release is removing code duplication from the WARC reader for gzipped and non-gzipped files. This is the first time I've tried using type generics in Rust; the code is messy, but it works.
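
As a rough illustration of the idea (not the actual reader code), a function that is generic over any buffered reader can serve both cases: a plain file wrapped in a BufReader, or a gzip decoder wrapped the same way. The count_records function below is hypothetical, and its record detection is naive: it only looks for lines starting with "WARC/1.", which a record body could also contain.

```rust
use std::io::{self, BufRead};

/// Count WARC version lines in any buffered reader.
/// Because `R` is generic, the same loop works for compressed and
/// uncompressed input, as long as the caller supplies a `BufRead`.
fn count_records<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let mut count = 0;
    let mut line = Vec::new();

    loop {
        line.clear();
        // Read raw bytes rather than UTF-8 strings, since bodies may be binary.
        if reader.read_until(b'\n', &mut line)? == 0 {
            break; // end of input
        }
        if line.starts_with(b"WARC/1.") {
            count += 1;
        }
    }

    Ok(count)
}
```

The trait bound is what removes the duplication: there is one loop, and the choice of reader is made once at the call site.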

Added

  • (indexer) Use type generics to eliminate code duplication when iterating through records. This finally gets rid of an awkward situation where I was having to maintain two separate iterators.
  • add pages indexer to wacz writer, with a struct for page records, this is the main thing in this release.

Fixed

  • add newline to page records, needed for pages.jsonl format, closes #37, nice and easy change
  • (indexer) skip serialising null fields in page record
  • (datapackage) pass cdxj_index_bytes through to the datapackage
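
The first two fixes can be illustrated with a small dependency-free sketch. This is not the library's code: PageRecord, its fields, and escape_json are hypothetical, and with serde the same effect is usually achieved with the skip_serializing_if attribute rather than by hand. The point is that each record ends with a newline, and that a missing title is left out instead of being written as null.

```rust
use std::fmt;

/// Hypothetical page record; the field names are illustrative only.
struct PageRecord {
    url: String,
    ts: String,
    title: Option<String>,
}

/// Minimal JSON string escaping: quotes, backslashes, control characters.
fn escape_json(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl fmt::Display for PageRecord {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(
            f,
            "{{\"url\":\"{}\",\"ts\":\"{}\"",
            escape_json(&self.url),
            escape_json(&self.ts)
        )?;
        // Skip the field entirely when there is no title, rather than emitting null.
        if let Some(title) = &self.title {
            write!(f, ",\"title\":\"{}\"", escape_json(title))?;
        }
        // JSON Lines: one object per line, so every record ends with a newline.
        writeln!(f, "}}")
    }
}
```

Writing several records one after another with this Display implementation yields one JSON object per line.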

Other

Lots more little documentation/readme changes and additions. Code refactoring, etc.

  • (indexer) use core instead of standard libraries for error formatting
  • add serde features to dependencies, update cargofile
  • (datapackage) move compose_datapackage into datapackage implementation
  • (datapackage) DataPackageResource::new now returns a result/error rather than panicking
  • (indexer) use httparse to parse http status code from response and remove the happily redundant cut_http_headers_from_record function
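
The change to DataPackageResource::new follows a common Rust pattern: a constructor that reports bad input as an error value instead of panicking. The sketch below shows the pattern only. The Resource struct, its fields, and the ResourceError variants are hypothetical and do not reflect the real DataPackageResource.

```rust
use std::fmt;

/// Hypothetical error type for invalid constructor input.
#[derive(Debug, PartialEq)]
enum ResourceError {
    EmptyPath,
    EmptyContent,
}

impl fmt::Display for ResourceError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            ResourceError::EmptyPath => write!(f, "resource path is empty"),
            ResourceError::EmptyContent => write!(f, "resource has no content"),
        }
    }
}

impl std::error::Error for ResourceError {}

/// Hypothetical resource entry: a path plus the size of its content.
#[derive(Debug)]
struct Resource {
    path: String,
    bytes: usize,
}

impl Resource {
    /// Validate the input and return an error rather than panicking.
    fn new(path: &str, content: &[u8]) -> Result<Self, ResourceError> {
        if path.is_empty() {
            return Err(ResourceError::EmptyPath);
        }
        if content.is_empty() {
            return Err(ResourceError::EmptyContent);
        }
        Ok(Self {
            path: path.to_owned(),
            bytes: content.len(),
        })
    }
}
```

Callers can then decide what to do with a failure, for example by passing it upwards with the ? operator.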

- Rust
Published by github-actions[bot] 12 months ago

wacksy - v0.0.1-beta

Work on this version was mostly refactoring, adding structured types and error handling, and some documentation (only just started).

Still on my todo list is to use the indexer to also create pages.jsonl files.

Fixed

  • replace wrapping_add in loop counter with enumerate, closes #29
  • (indexer) return the same error message for gzipped and non-gzipped files. I have tried to simplify the code for processing both gzipped and non-gzipped files. There's still unnecessary duplication but it's the best I can do for the moment.
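
For the first fix, the difference between a hand-maintained counter and enumerate looks roughly like this. Both functions are hypothetical examples written for these notes, not code from the indexer.

```rust
/// Before: a manual counter advanced with wrapping_add on every iteration.
fn with_manual_counter(lines: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    let mut index: usize = 0;
    for line in lines {
        out.push(format!("{}: {}", index, line));
        index = index.wrapping_add(1);
    }
    out
}

/// After: enumerate yields (index, item) pairs, so there is no counter
/// to declare, update, or get wrong.
fn with_enumerate(lines: &[&str]) -> Vec<String> {
    lines
        .iter()
        .enumerate()
        .map(|(index, line)| format!("{}: {}", index, line))
        .collect()
}
```

Both versions produce the same output; the second simply has less state to maintain.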

Other

  • document some DataPackage structs, better documentation coming once this is properly finished!
  • as a style change, this now uses explicit returns everywhere, and I have set lints in cargo.toml to enforce this
  • (indexer) many of the index functions are now implemented on types. The completed index is returned as a struct, which has a display implementation to write it out to json(l).
  • (datapackage) propagate errors upwards. There are still some panics, but structured error handling is a lot more comprehensive now. Happy and unhappy paths are a little clearer to identify.
  • update README with link to a funny meme :)

- Rust
Published by github-actions[bot] about 1 year ago

wacksy - v0.0.1-alpha

At this stage the library can read a WARC file to produce a CDXJ index, and a datapackage.

Added

  • (indexer) types for DataPackage and DataPackageResource
  • (indexer) various types for CXDJIndexRecord

- Rust
Published by github-actions[bot] about 1 year ago