Created
August 25, 2021 05:39
rvagg created this gist
Aug 25, 2021

This post builds on the discussion at https://github.com/multiformats/multicodec/pull/203 and is posted here because it might be a bit too long and in-the-weeds, and I doubt many will actually read it all anyway! But I'd like this as a record of an ongoing broader discussion about these topics.

---

Working through the process of mapping blockchain block formats to IPLD has made me see this question slightly differently. First it was getting the full bitcoin chain format working, including the awkward segwit hacks, and now it's working with @i-norden through getting the full ethereum format mapped to IPLD ([e.g.](https://github.com/multiformats/multicodec/pull/223)).

The primary goal we're trying to achieve with IPLD codec codes here (via CIDs usually) is describing what glasses we put on to _see_ the data. We want to get something into the data model in memory from the raw binary we've been handed.

An obvious example of having different glasses would be these CIDs:

* `bafyreiblwimnjbqcdoeafiobk6q27jcw64ew7n2fmmhdpldd63edmjecde`
* `bafkreiblwimnjbqcdoeafiobk6q27jcw64ew7n2fmmhdpldd63edmjecde`

They're both valid, but the first says it's `dag-cbor` and the second says it's `raw`. IPLD just switches out its glasses when looking at the bytes. There are even some hacks going on in Filecoin-world to (ab)use `raw` in this way to get around some DAG-completeness problems.

Another example might be the codecs `bitcoin-tx` and `bitcoin-witness-commitment`. They both deal with 64-byte blocks and decode them as pairs of 32-byte values.
However, when we put on the `bitcoin-tx` glasses for a 64-byte block we see `[CID, CID]`, but when we put on the `bitcoin-witness-commitment` glasses we see `[CID, Bytes]`. The 32 bytes being used to _see_ the `CID` aren't even a proper CID; we make a CID emerge from those bytes by bringing knowledge of what the codec should be and what the hash function was.

To further push this "glasses" analogy: in the latest version of the Bitcoin codec I wrote, when putting on the `bitcoin-block` glasses to look at the header bytes, I made it see two CIDs that certainly aren't in the raw bytes in the way that you'd decode them outside of IPLD. The schema for the header has this:

```ipldsch
type BitcoinHeader struct {
  previousblockhash optional Bytes
  merkleroot Bytes
  parent optional &BitcoinHeader
  tx &BitcoinTransaction
  # ... other stuff here
}
```

Those two links emerge out of the two previous `Bytes` fields, which are left intact (mainly because they're useful as bytes for various reasons, and because they're byte-reversed from what we normally need from a hash digest!). The glasses I made just happen to _see_ those things in the raw bytes even though a bitcoin purist would argue that they're not there.

To take another angle, back to our `dag-cbor` and `raw` CIDs above, there's another form that's just as valid:

* `bafireiblwimnjbqcdoeafiobk6q27jcw64ew7n2fmmhdpldd63edmjecde`

This time plain `cbor`. It's not properly defined what a decoder should do to get this into the data model in some of the edge cases, such as what to do with tags. ipld-prime will bork at them, but another decoder could just skip over tags entirely or come up with another novel way of presenting them in the data model (an older JS one would instantiate its own custom objects in this case!). So what we're doing in the case of `dag-cbor` is insisting that you see the bytes through those glasses and make the CIDs emerge out of the sections of bytes that are preceded by the appropriate tag.

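
To make the "switching glasses" point concrete, here is a minimal Python sketch (standard library only) that pulls apart the three CID strings quoted in this post. It assumes they are CIDv1 strings in base32 multibase (the leading `b`), and the helper names `read_varint` and `parse_cid_v1` are made up for this illustration; it is not how any particular IPLD library does it.

```python
import base64

# Multicodec codes for the three codecs named in the text.
CODEC_NAMES = {0x55: "raw", 0x51: "cbor", 0x71: "dag-cbor"}

CIDS = [
    "bafyreiblwimnjbqcdoeafiobk6q27jcw64ew7n2fmmhdpldd63edmjecde",
    "bafkreiblwimnjbqcdoeafiobk6q27jcw64ew7n2fmmhdpldd63edmjecde",
    "bafireiblwimnjbqcdoeafiobk6q27jcw64ew7n2fmmhdpldd63edmjecde",
]


def read_varint(buf: bytes, pos: int) -> tuple:
    """Read an unsigned varint starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def parse_cid_v1(cid: str) -> dict:
    """Split a base32 CIDv1 string into version, codec and multihash parts."""
    if not cid.startswith("b"):
        raise ValueError("this sketch only handles the base32 prefix 'b'")
    body = cid[1:]
    # Multibase base32 drops the '=' padding that b32decode expects.
    padded = body + "=" * (-len(body) % 8)
    raw = base64.b32decode(padded, casefold=True)

    version, pos = read_varint(raw, 0)
    codec, pos = read_varint(raw, pos)
    hash_code, pos = read_varint(raw, pos)
    hash_len, pos = read_varint(raw, pos)
    return {
        "version": version,
        "codec": codec,
        "hash_code": hash_code,
        "digest": raw[pos:pos + hash_len],
    }


if __name__ == "__main__":
    for cid in CIDS:
        parts = parse_cid_v1(cid)
        name = CODEC_NAMES.get(parts["codec"], hex(parts["codec"]))
        print(name, parts["digest"].hex())
```

All three strings carry the same digest over the same bytes; only the codec field differs, and that field is the instruction about which glasses to put on.
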

Back to the Software Heritage (SWH) identifiers question: could it be the case that our glasses when viewing this data insist on seeing things in the data that the plain `git*` codec(s) won't? Maybe it's as simple as seeing a `string` field with the SWH URI for the resource, or a CID that points to a unique SWH object that you wouldn't _see_ if you thought it was just plain `git*`.

I generally still think we should try to avoid using CIDs to signal _where_ to get data from; that's not part of the CID spec in its current form. But I think that we have seen a class of problems around CIDs that suggest that they're not able to do all that people need them to do, and without additional extensions to the spec we either have to reject certain use-cases entirely (which really isn't great for IPLD) or overload other pieces of functionality to make it work. But in this case, perhaps it's as simple as SWH data necessarily _looking_ different to Git data even though the byte format may be the same?
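
As a rough illustration of "same byte format, different view", here is a small Python sketch modelled on the 64-byte `bitcoin-tx` / `bitcoin-witness-commitment` example from earlier in this post. Everything in it is hypothetical: `Link` is a stand-in for a real CID, the function names are invented, and the codec and hash names attached to each link (`bitcoin-tx`, `dbl-sha2-256`) are assumptions for the sake of the example rather than a description of the actual codecs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Stand-in for a CID. The codec and hash names are supplied by the
    decoder (the "glasses"); they are not read from the bytes."""
    codec: str
    hash_fn: str
    digest: bytes


def _split_64(block: bytes) -> tuple:
    """Split a 64-byte block into two 32-byte halves."""
    if len(block) != 64:
        raise ValueError("expected a 64-byte block")
    return block[:32], block[32:]


def decode_as_tx_pair(block: bytes) -> list:
    """First pair of glasses: both halves are seen as links -> [Link, Link]."""
    left, right = _split_64(block)
    return [
        Link("bitcoin-tx", "dbl-sha2-256", left),   # assumed names
        Link("bitcoin-tx", "dbl-sha2-256", right),  # assumed names
    ]


def decode_as_witness_commitment(block: bytes) -> list:
    """Second pair of glasses: only the first half is a link -> [Link, bytes]."""
    left, right = _split_64(block)
    return [Link("bitcoin-tx", "dbl-sha2-256", left), right]


if __name__ == "__main__":
    block = bytes(range(64))  # arbitrary sample bytes, not real chain data
    print(decode_as_tx_pair(block))
    print(decode_as_witness_commitment(block))
```

The input bytes are identical in both calls; the only thing that changes is which decoder is chosen, which is the same kind of distinction a separate SWH codec would draw against plain `git*`.
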