Cross-file micro-chunk exchange - Feature PROPOSAL

desodorante
Site User
Posts: 2
Joined: Sat Nov 22, 2014 12:52 pm

Post by desodorante » Sat Nov 22, 2014 1:27 pm

David, Ekliptor,

This is an idea I have been thinking about for a while, but I am too bad at coding to explore it, let alone implement it.
First of all, I've been a follower of your work since Neomule (not sure if Ekliptor participated in that too), so I am very excited about this project.

The problem I am looking to solve:
Provide new sources for dead files in both eMule and BT (those that have no sources / seeds / peers).

The core idea is as follows:
Find a minimum, likely variable, micro-chunk size (likely measured in bytes) that can be exchanged between files of different size, hash and type.

The solution I am not able to explore:
Take advantage of Neoloader's P2P network interoperability, which can translate eMule parts to BT pieces, and of the eMule chunk hash system (9500-kilobyte chunks hashed with the MD4 algorithm) to identify smaller byte streams within a chunk that match other files, regardless of final hash, size, or type.

Two approaches:
- While the file is "healthy" (above 100% available), analyse random parts against other files' parts to identify matching byte streams. Save this as a "map" for when the file drops below 100%.
- When the file is no longer healthy (below 100%), complete the missing parts with parts of other random files and hash until the file checks out. Also save the "map" once complete.

For instance, say FileA.avi is incomplete and has had no sources for X time on the client side. Let Neoloader request a low-queued file, download a chunk, and perform the analysis on the client side. It could take the missing X bytes from the downloaded file, put them into the incomplete chunk of FileA.avi, and try to hash. If that fails, it could divide the size by 2, take 2 streams, and repeat until a minimum size is reached.
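
A minimal sketch of the first step of that example (one contiguous donor stream, no halving yet) might look like the following. Everything here is illustrative: `fill_gap_from_donor` and `chunk_digest` are made-up names, and SHA-256 stands in for MD4 because many Python builds no longer expose MD4 through `hashlib`. Note that with only a whole-chunk hash, a candidate can only be verified once the entire gap is filled, so the "2 streams" step would have to try pairs of windows rather than check each half on its own.

```python
import hashlib
from typing import Optional


def chunk_digest(data: bytes) -> bytes:
    # Assumption: SHA-256 as a stand-in for the MD4 part hash used by eMule.
    return hashlib.sha256(data).digest()


def fill_gap_from_donor(chunk: bytearray, gap_start: int, gap_len: int,
                        donor: bytes, expected: bytes) -> Optional[int]:
    """Try every gap_len-byte window of donor in the gap.

    Returns the donor offset that makes the chunk hash check out,
    or None if no window works.
    """
    head = bytes(chunk[:gap_start])
    tail = bytes(chunk[gap_start + gap_len:])
    for offset in range(len(donor) - gap_len + 1):
        candidate = head + donor[offset:offset + gap_len] + tail
        if chunk_digest(candidate) == expected:
            return offset
    return None


# Toy data: a 64-byte "chunk" with 8 missing bytes, and a donor that
# happens to contain exactly those bytes at offset 9.
original = bytes(range(64))
expected = chunk_digest(original)
incomplete = bytearray(original)
incomplete[20:28] = b"\x00" * 8
donor = b"unrelated" + original[20:28] + b"more unrelated"

print(fill_gap_from_donor(incomplete, 20, 8, donor, expected))  # 9
```

The toy donor was built to contain the missing bytes; whether a real donor ever would is exactly the question the idea depends on.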

The key to this feature is to find a reliable size to search on, and an algorithm to identify the similarity of 2 files in the network. I am confident there has to be a stream size, measured in bytes, that has an above-chance probability of matching other files, given the enormous amount of information we have.
Of course, this is likely going to be <64 bytes and meant for files that are nearing completion (I am thinking <1024 KB left), so as not to flood the networks with redundant requests.
If this works, Neoloader could have the ultimate killer feature: reviving dead files in P2P networks.
Also, this would mean you did not download one file, but a mesh of several, which has other implications.

Thank you and I hope you don't think I am crazy :D
Keep up the great work!

DavidXanatos
Site Admin
Posts: 769
Joined: Wed Jun 30, 2010 7:54 pm

Post by DavidXanatos » Sat Nov 22, 2014 1:53 pm

What you describe will generally not work for any high-entropy file, such as any video or audio file.

These file formats are compressed, which means that, to a first approximation, any chunk picked randomly from inside the file contains 100% random data.

Of course the data is not really random, but its meaning only becomes apparent in the context of a few hundred KB or MB of other data from the file.

Looking at sufficiently small blocks of such a file gives you meaningless bytes with high entropy.
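
To make the entropy point concrete, here is a small sketch that compares the byte-level Shannon entropy of some plain text with that of the same text after zlib compression. The word list and the helper name `entropy_bits_per_byte` are made up for the example, zlib stands in for the codecs used in real audio/video files, and a byte histogram is only a crude proxy for "randomness".

```python
import math
import random
import zlib
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (max 8)."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Illustrative input: text built from a handful of repeated words.
rng = random.Random(0)
words = ["chunk", "file", "source", "peer", "hash", "share", "network", "client"]
raw = " ".join(rng.choice(words) for _ in range(50000)).encode("ascii")
packed = zlib.compress(raw, 9)

print("plain text:", round(entropy_bits_per_byte(raw), 2), "bits/byte")
print("compressed:", round(entropy_bits_per_byte(packed), 2), "bits/byte")
```

The plain text stays well under 5 bits per byte because it only uses a few distinct characters, while the compressed stream should come out close to the 8-bit maximum, which is what makes small blocks of it look random.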

The chance for a 4-byte block of one particular file to match any given 4-byte block of another random file (of, let's say, 128 MB) is below 1:4 000 000 000 (yes, below 1 in 4 billion).

To be more exact, to find a matching 4-byte block you would need to download on average 2^32 random blocks, that is 16 GB of data.
For a 32-byte block, the chance of finding a match within 2 files created during the entire existence of the universe is, for all intents and purposes, ~0.000 0.....
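
The arithmetic behind those figures can be checked with a few lines. This is only a sketch under the same assumption as above (blocks behave like uniformly random bytes and are compared block against block); `expected_download_bytes` is a name invented for the example.

```python
def expected_download_bytes(block_len: int) -> int:
    """Average number of bytes to fetch before one random block of
    block_len bytes matches a given target block.

    A block of block_len bytes has 2**(8 * block_len) possible values, so
    on average that many blocks (of block_len bytes each) are needed.
    """
    return 2 ** (8 * block_len) * block_len


GIB = 2 ** 30
print(expected_download_bytes(4) // GIB, "GiB for a 4-byte block")
print(f"{expected_download_bytes(32):.3e} bytes for a 32-byte block")
```

For 4 bytes this gives 2^32 blocks of 4 bytes, i.e. 16 GiB. For 32 bytes the block count alone is 2^256, which is why the chance is effectively zero.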


The only type of files where this could work would be files with low entropy, or files where it can be assumed that 3rd party sections have been built in, for example an ISO file of some software installation.
Here you can expect the ISO to contain files that are identical to files contained in other ISOs of an older version of the same software, or some generic libraries used by a lot of software.

Now, for a ZIPed or RARed ISO the same applies as for audio or video files: high entropy, no matches.


So such a feature, if present, would only benefit a small subset of all files shared on P2P.


Also, the obvious problem with such a feature is that finding matches between files without any in any way correlated starting offset is nearly impossible. There are some signature tricks one could explore, but it is not realistic to make them work with any acceptable amount of overhead.
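
The post does not say which signature tricks are meant. One common technique for locating a block at an unknown offset is a rolling hash (Rabin-Karp style), sketched below purely as an illustration; `find_block` and its parameters are invented for the example and are not part of Neoloader.

```python
def find_block(needle: bytes, haystack: bytes) -> int:
    """Return the first offset of needle in haystack, or -1 if absent.

    Uses a polynomial rolling hash so that sliding the window by one byte
    costs O(1) instead of rehashing the whole window.
    """
    n, m = len(needle), len(haystack)
    if n == 0 or n > m:
        return -1
    base, mod = 257, (1 << 61) - 1
    top = pow(base, n - 1, mod)  # weight of the byte leaving the window

    def full_hash(data: bytes) -> int:
        h = 0
        for b in data:
            h = (h * base + b) % mod
        return h

    target = full_hash(needle)
    h = full_hash(haystack[:n])
    for i in range(m - n + 1):
        # On a hash hit, compare the bytes to rule out a collision.
        if h == target and haystack[i:i + n] == needle:
            return i
        if i + n < m:
            # Drop haystack[i], shift, and append haystack[i + n].
            h = ((h - haystack[i] * top) * base + haystack[i + n]) % mod
    return -1


print(find_block(b"abc", b"xxabcxx"))  # 2
print(find_block(b"zzz", b"xxabcxx"))  # -1
```

Even with this, every offset of the other file still has to be scanned locally or have a signature published for it, which is where the overhead comes from.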



Neo's method of identifying ED2K files in BT, or vice versa, relies on Neo clients first downloading a file from ed2k or BT only, calculating all hashes, and publishing a cheat sheet into NeoKad. Any Neo client that later on gets an ed2k hash or a torrent info hash can request the cheat sheet from NeoKad, and bam, it knows which hashes belong together.

There is no way to mathematically calculate an ed2k hash from a torrent info hash.
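
As a rough illustration of the cheat sheet idea, the sketch below uses an in-memory dictionary in place of NeoKad. `CheatSheet`, `ToyKad`, `publish` and `lookup` are hypothetical names and do not reflect NeoKad's real API or record format; the hash strings are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CheatSheet:
    """Hashes of one file as seen by both networks (hypothetical layout)."""
    ed2k_hash: str
    torrent_info_hash: str


class ToyKad:
    """In-memory stand-in for a DHT; not NeoKad's real interface."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, CheatSheet] = {}

    def publish(self, sheet: CheatSheet) -> None:
        # Index the sheet under both hashes so either one can find it.
        self._by_hash[sheet.ed2k_hash] = sheet
        self._by_hash[sheet.torrent_info_hash] = sheet

    def lookup(self, any_hash: str) -> Optional[CheatSheet]:
        return self._by_hash.get(any_hash)


kad = ToyKad()
# A client that downloaded the whole file computes both hashes and publishes.
kad.publish(CheatSheet(ed2k_hash="ed2k:aaaa", torrent_info_hash="btih:bbbb"))

# A later client that only knows the torrent info hash asks for the sheet.
sheet = kad.lookup("btih:bbbb")
print(sheet.ed2k_hash if sheet else "no cheat sheet published yet")
```

The mapping only exists because some client had the complete file and computed both hashes; nothing is derived from one hash to the other.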
Live free or die trying!

desodorante
Site User
Posts: 2
Joined: Sat Nov 22, 2014 12:52 pm

Post by desodorante » Sat Nov 22, 2014 11:19 pm

Well, it seems you've got it covered already :)
Yet, you said you can find 4 bytes per 16 GB of data on average, which means a "mapping" approach would not be so far-fetched, especially if the client uses its own files instead of downloading, in the rare scenario where you are sitting at 99% complete and there are no sources.
Anyway, thanks for the lengthy technical explanation.

