Wednesday, July 21, 2021

Hackem Up! Disassembling Files into Chunks and Recombining Files from Chunks

So... I know I don't add much to my blog, and for that, I apologize. However, I had to do something recently that I thought my two subscribers might like. ;)

I came across a situation where I needed to move many large files between disconnected environments, and despite it being 2021, data transfer speeds can still be atrocious.

I was trying to move multiple 5GB-100GB disk backups from one state to another. Aside from each copy being horrendously slow, hitting a problem with an upload (e.g., a timeout) 45GB into a 100GB file, and losing those 8 hours, is rather frustrating.

However, if I chop the large files up into smaller fragments and move (or sync) those fragments instead, I reduce the likelihood of running into a terminating condition, and even if I do hit one, each fragment is only a small part of the whole, so recovery doesn't take nearly as long.

I tried to find an existing free tool for this, but everything I could locate either cost money or didn't do what I wanted, so I decided to write the code to chop up the files myself.

Correspondingly, I needed to be able to recombine the fragments on the other side into a file identical to the original source. So without further ado, here are Chunk-File and Recombine-File:

Examples:

## Split the file into fragments
Chunk-File -FileName somefile.ext -ChunkSize 1GB
Chunk-File -FileName somefile.ext
## Recombine the file; the recombined file will have '_new' appended to its base name
Recombine-File -PathToChunks 'some-directory-path'
Recombine-File

## Verify the bytes were written back in the correct order
Get-FileHash -Algorithm MD5 sourcefile, sourcefile_new
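
If you'd rather have a scripted pass/fail than eyeball the two hashes, a one-line comparison works as well (the file names here are just placeholders):

## Compare the two hashes programmatically (placeholder file names)
(Get-FileHash -Algorithm MD5 sourcefile).Hash -eq (Get-FileHash -Algorithm MD5 sourcefile_new).Hash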

NOTE:

Recombine-File does *not* delete the chunks. This was intentional. If an exception gets thrown during the recombine, I wanted a non-destructive way to try again (without having to re-copy the fragments).
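
If you do want to clean up after a verified recombine, a quick sketch like the following should do it, assuming the fragments still follow the '_NN' naming convention that Chunk-File produces (adjust the path for your environment, and only run it after the hashes match):

## Delete the fragments only after the hashes have been verified (sketch)
Get-ChildItem 'some-directory-path' | Where-Object { $_.BaseName -match '.*_[0-9]+$' } | Remove-Item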

Chunk-File:

function Chunk-File {
param (
    [Parameter(Mandatory=$true)][System.String]$FileName,
    [Parameter(Mandatory=$false)][uint64]$ChunkSize
)

    try {
        ## Get a file object reference to the passed in filename.
        $File = Get-Item $FileName

        ## Open a filestream handle to the file
        $fs = New-Object System.IO.FileStream($File.FullName, [System.IO.FileMode]::Open)

        ## If a desired-size is not specified, automatically determine an appropriate chunk size
        if (-not($ChunkSize)) {
            if ($fs.length -gt 10GB) {    
                $ChunkSize = 10GB
            } elseif ($fs.length -gt 1GB) { 
                $ChunkSize = 1GB
            } elseif ($fs.length -gt 100MB) {                                
                $ChunkSize = 100MB 
            } elseif ($fs.length -gt 10MB) { 
                $ChunkSize = 10MB
            } elseif ($fs.length -gt 1MB) {                                
                $ChunkSize = 1MB 
            } elseif ($fs.length -gt 100KB) { 
                $ChunkSize = 100KB
            } elseif ($fs.length -gt 10KB) {                                
                $ChunkSize = 10KB 
            } elseif ($fs.length -gt 1KB) { 
                $ChunkSize = 1KB
            } else {
                $ChunkSize = 1
            }
        }
        
        ## Ensure the chunk size isn't larger than the filesize
        if ($ChunkSize -gt $fs.Length) {
            Write-Error "Chunk size should not be larger than the file size."
            return
        }

        ## Determine acceptable buffer size for speed/efficiency
        if ($fs.length -gt 1GB) {    
            $BufferSize = 1MB
        } elseif ($fs.length -gt 1MB) { 
            $BufferSize = 1KB
        } else {                                
            $BufferSize = 1 ## 1B buffer
        }

        #Write-Host "ChunkSize:  $ChunkSize"
        #Write-Host "BufferSize: $BufferSize"

        ## Set the first buffer size
        $buffer = New-Object byte[] ($BufferSize)
        
        ## Set some predefined parameters for use with the chunking
        $FileIncrement = 1

        ## Zero-pad the fragment counter to one digit more than the expected chunk count,
        ## so fragment names sort correctly (e.g., _01 ... _10 rather than _1, _10, _2)
        $ZeroPadSize = ([int]($fs.Length / $ChunkSize)).ToString().Length + 1

        ## Set the auto-increment and auto-decrement values
        $BytesToRead = $fs.Length
        $BytesRead = 0

        ## Open a filestream handle to the first output fragment
        ## (FileMode Create truncates any stale fragment left over from a previous run)
        $cfs = New-Object System.IO.FileStream(("$($File.Directory)\$($File.BaseName)_$("$FileIncrement".PadLeft($ZeroPadSize, '0'))$($File.Extension)"), [System.IO.FileMode]::Create)

        ## Iterate through the source file to completion
        while ($BytesToRead -gt 0) {
            
            ## If the chunk file has reached the desired chunk size, close it and open the next chunk
            if ($BytesRead -gt 0 -and $BytesRead % $ChunkSize -eq 0) {
                $cfs.Dispose()
                $FileIncrement++
                $cfs = New-Object System.IO.FileStream(("$($File.Directory)\$($File.BaseName)_$("$FileIncrement".PadLeft($ZeroPadSize, '0'))$($File.Extension)"), [System.IO.FileMode]::Create)
            }
        
            ## Size each read so it never crosses a chunk boundary or the end of the file.
            ## Without capping the final buffer, the last chunk would be larger than it's supposed to be:
            ## the file would still be intact and functional, but the padding of zeroes at the end
            ## would change the hash of the recombined output file.
            ## Capping at the chunk boundary also keeps the rollover check above correct when
            ## the chunk size is not an exact multiple of the buffer size.
            $ReadSize = [int64]$BufferSize
            $RemainingInChunk = [int64]$ChunkSize - ([int64]$BytesRead % [int64]$ChunkSize)
            if ($RemainingInChunk -lt $ReadSize) { $ReadSize = $RemainingInChunk }
            if ($BytesToRead -lt $ReadSize) { $ReadSize = [int64]$BytesToRead }
            $buffer = New-Object byte[] ($ReadSize)

            ## Read from the source, honoring the byte count the stream actually returns
            $ReadCount = $fs.Read($buffer, 0, $buffer.Length)

            ## Write to the fragment
            $cfs.Write($buffer, 0, $ReadCount)

            ## Increment/Decrement
            $BytesRead += $ReadCount
            $BytesToRead -= $ReadCount
        }
    } catch {
        $_
    } finally {
        ## Guard against nulls in case the streams were never opened
        if ($fs) { $fs.Dispose() }
        if ($cfs) { $cfs.Dispose() }
    }
}
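
One thing I like to do before kicking off the transfer is confirm that the fragment sizes add up to the source file's size. This is just a quick sketch against the same naming convention, with 'somefile.ext' standing in for your actual file:

## Sum the fragment sizes and compare against the source (placeholder file name)
$frags = Get-ChildItem | Where-Object { $_.BaseName -match '.*_[0-9]+$' }
($frags | Measure-Object -Property Length -Sum).Sum -eq (Get-Item somefile.ext).Length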

Recombine-File:


function Recombine-File {
param (
    [Parameter(Mandatory=$false)][System.String]$PathToChunks
)
    try {

        ## Get a collection of file fragments that match the naming convention from 'Chunk-File'
        ## If a path is not provided, the current directory is used
        ## (the zero-padded suffix makes a name sort equal the fragment order, but sort explicitly anyway)
        if (-not($PathToChunks)) {
            $frags = Get-ChildItem | Where-Object { $_.BaseName -match '.*_[0-9]+$' } | Sort-Object Name
        } else {
            $frags = Get-ChildItem $PathToChunks | Where-Object { $_.BaseName -match '.*_[0-9]+$' } | Sort-Object Name
        }

        ## Ensure there are two-or-more fragments to recombine.
        if ($frags.Count -lt 2) {
            Write-Error "Fewer than two chunks were found to recombine."
            return
        }
        
        ## Create a new file to write all of the fragmented data to.
        ## Strip only the trailing '_NN' counter so base names that contain underscores survive,
        ## and use FileMode Create so a stale '_new' file from a previous run gets truncated.
        $BaseName = $frags[0].BaseName -replace '_[0-9]+$', ''
        $tfs = New-Object System.IO.FileStream(("$($frags[0].Directory)\$($BaseName)_new$($frags[0].Extension)"), [System.IO.FileMode]::Create)

        ## Set an initial buffer
        $BufferSize = 1MB

        ## Iterate through each fragment to write to the new consolidated file
        $frags | ForEach-Object {
        
            ## Open a handle to the fragment
            $frag = New-Object System.IO.FileStream(("$($_.FullName)"), [System.IO.FileMode]::Open)

            ## Set the increment/decrement values for each fragment
            $BytesToRead = $frag.Length
            $BytesRead = 0

            ## Iterate over this fragment
            while ($BytesToRead -gt 0) {

                ## To ensure there's no extra data written to the consolidated file, adjust the buffer size for the final read
                if ($BytesToRead -lt $BufferSize) {
                    $buffer = New-Object byte[] ($BytesToRead)
                } else {
                    $buffer = New-Object byte[] ($BufferSize)
                }
                
                ## Read from the fragment, honoring the byte count the stream actually returns
                $ReadCount = $frag.Read($buffer, 0, $buffer.Length)

                ## Write to the consolidated file
                $tfs.Write($buffer, 0, $ReadCount)

                ## Increment/Decrement
                $BytesRead += $ReadCount
                $BytesToRead -= $ReadCount
            }

            $frag.Dispose()
        }

        Write-Output "Recombine successful:  $($tfs.Name)"
    } catch {
        $_
    } finally {
        ## Guard against nulls in case the streams were never opened
        if ($frag) { $frag.Dispose() }
        if ($tfs) { $tfs.Dispose() }
    }
}
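
And to close the loop, here's the round trip I run to convince myself nothing was lost in transit. The file name and chunk size are placeholders; substitute your own:

## Chunk, recombine, and verify in one sitting (placeholder name and size)
Chunk-File -FileName .\somefile.ext -ChunkSize 100MB
Recombine-File -PathToChunks .
(Get-FileHash .\somefile.ext).Hash -eq (Get-FileHash .\somefile_new.ext).Hash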
