Gentoo Forums :: Portage & Programming
how to split a file in not equally large parts?

SarahS93
l33t

Joined: 21 Nov 2013
Posts: 725

Posted: Sat Nov 04, 2023 12:02 am    Post subject: how to split a file in not equally large parts?

How do I split a 75GB file into 7 parts that are not equally large?
For example:
75gb_file.1 = 11GB
75gb_file.2 = 3GB
75gb_file.3 = 26GB
75gb_file.4 = 17GB
75gb_file.5 = 6GB
75gb_file.6 = 8GB
75gb_file.7 = 4GB

with "split" all files have the same size, that does not work.


Last edited by SarahS93 on Wed Nov 08, 2023 10:05 pm; edited 4 times in total

Hu
Administrator

Joined: 06 Mar 2007
Posts: 22450

Posted: Sat Nov 04, 2023 1:01 am

Would this work?
Code:
declare -i counter
counter=1
for i in 11 3 26 17 6 8 4; do head -c${i}G > "part.$counter"; (( ++ counter )); done < 75G
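
The single "< 75G" redirection feeds the whole loop, and each head consumes just its chunk from the shared file descriptor. A quick sanity check with byte-sized chunks (the file name "demo" is made up for the demonstration):
Code:
# 10 bytes, split into 3 + 2 + 5
printf 'abcdefghij' > demo
declare -i c=1
for n in 3 2 5; do head -c$n > "demo.$c"; (( ++c )); done < demo
cat demo.1 demo.2 demo.3    # prints: abcdefghij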

SarahS93
l33t

Joined: 21 Nov 2013
Posts: 725

Posted: Sat Nov 04, 2023 1:31 am

It works, that's great, thanks!

toralf
Developer

Joined: 01 Feb 2004
Posts: 3934
Location: Hamburg

Posted: Sat Nov 04, 2023 9:09 am

Hu wrote:
Would this work?
Code:
declare -i counter
counter=1
for i in 11 3 26 17 6 8 4; do head -c${i}G > "part.$counter"; (( ++ counter )); done < 75G
Wow - TIL this.

SarahS93
l33t

Joined: 21 Nov 2013
Posts: 725

Posted: Wed Nov 08, 2023 10:26 pm

With
shuf -i 1-100 -n 4
I get 4 random numbers.

But what is the way to randomly split one number into multiple numbers, given the number and n groups?
For instance, if the number is 100 and the number of groups is 4, it should give any random list of 4 numbers that add up to 100:
input number = 100
number of groups = 4
...
25 33 12 30
11 19 47 23
8 17 40 35

(I am looking for another way to get the "11 3 26 17 6 8 4" so that I do not have to calculate the sizes myself.)

I found
awk 'BEGIN{srand()} {s=int(rand()*7); print > (FILENAME"."s)}' file75G
but the split files together are 1 byte bigger than the unsplit file, I don't know why.
Also the split files are all around 10.7G each; that is not what I am looking for.

eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9781
Location: almost Mile High in the USA

Posted: Wed Nov 08, 2023 10:52 pm

That's an NP-style problem... hard to create a solution (though there are a lot of them), but easy to check (almost like the knapsack problem).

Lots of heuristics, however... the easiest one is to take your max, pick a random number from 1..max, subtract it from the original, then rinse and repeat on the remainder. The later pieces will tend to be unfairly small, however -- hence this is a heuristic.

Another heuristic is to start with n pieces of total/n each and then randomly shuffle a bit from bin to bin until one is happy with the results... ugly, and it could take a lot of iterations.
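
A rough sketch of that second heuristic (the total, step size, and iteration count here are arbitrary):
Code:
# start with equal bins, then nudge random amounts between random bins
total=100; n=4
for ((k=0; k<n; k++)); do a[k]=$((total/n)); done
for ((t=0; t<50; t++)); do
  from=$((RANDOM%n)); to=$((RANDOM%n))
  amt=$((RANDOM%5))    # "a bit" at a time
  if (( a[from] > amt )); then
    a[from]=$((a[from]-amt)); a[to]=$((a[to]+amt))
  fi
done
echo "${a[@]}"    # still sums to 100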

I hope this is just a theoretical problem; otherwise, could you disclose why one would need this?

szatox
Advocate

Joined: 27 Aug 2013
Posts: 3350

Posted: Wed Nov 08, 2023 11:03 pm

Sure, let's play

Code:
i=4; j=100; while [[ $i -gt 1 ]]; do k=$((RANDOM%${j})); echo $k ; j=$((j-${k})); i=$((i-1)); done; echo $j
This will tend to produce bigger blocks at the beginning; it might be a good idea to modify $k before it's used, to keep it within a reasonable range.

Code:
i=4; j=100; ( while [[ $i -gt 1 ]]; do echo $((RANDOM%${j})); i=$((i-1)); done ; ) | ( sort -hr; echo 0 ; ) | ( k=$j; while read line; do echo $(( $k - line )); k=$line; done )

Pick random cut points, order them, and calculate the distances between them.

Both work in bash.

eccerr0r: probably this answers your "why" question. Just a gut feeling though.
https://forums.gentoo.org/viewtopic-p-8806711.html#8806711

SarahS93
l33t

Joined: 21 Nov 2013
Posts: 725

Posted: Thu Nov 09, 2023 2:02 pm

eccerr0r
I am looking for a way to split a file into equally large parts.
The way from Hu in post #2 is great, but I must calculate the part sizes myself.

szatox
Thanks for your lines, they work!
Sometimes I get a number 0.
Is there a way to exclude the number 0?

Bigger blocks are not what I am looking for.
How do I modify $k?
Can you please give me an example?

I am looking for a way that 100:4 =
27 25 33 15 is good
25 23 35 17 is good too
5 7 6 82 is not good
10 10 10 60 is not good
...

szatox
Advocate

Joined: 27 Aug 2013
Posts: 3350

Posted: Thu Nov 09, 2023 2:40 pm

Quote:
Sometimes I get a number 0.
Is there a way to exclude the number 0?

Bigger blocks are not what I am looking for.
How do I modify $k?
I'd use multiplication, division and addition (in this particular order), a constant, and the value of i. The current loop only holds 3 instructions, so it's not difficult to analyze; try doing it yourself.
Whatever you come up with, it goes right in front of "echo $k", though.
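
For instance, one possible rescaling of $k in the first loop above (just a sketch; the exact constants are a matter of taste):
Code:
# squeeze k into [j/(2*i), 3*j/(2*i)) so chunks stay near the average j/i
# and can never be 0; goes right before "echo $k"
k=$(( k / i + j / (2 * i) ))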

Code:
i=4; j=100; a=( ); while [[ $j -gt 0 ]]; do k=$((RANDOM%${i})); a[$k]=$((a[$k]+1)); j=$((j-1)); done; echo ${a[@]}

Here, chunks of normally distributed sizes, in random order.

I think the first version should result in a Pareto distribution, the second is uniform, and the last one is normal.

eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9781
Location: almost Mile High in the USA

Posted: Thu Nov 09, 2023 11:48 pm

What I still don't get is why

25 25 25 25

is not the best solution if you want "equally" large parts; it is fast, simple, and deterministic.

pjp
Administrator

Joined: 16 Apr 2002
Posts: 20424

Posted: Fri Nov 10, 2023 2:46 am

The title says "not equally large parts".

Hu
Administrator

Joined: 06 Mar 2007
Posts: 22450

Posted: Fri Nov 10, 2023 3:34 am

Yes, but a later post in the thread specifies equally large parts. If we want unequally large parts, my preference would be to divide the file into chunks of sizes 1, 1, 1, ... and N-(num chunks). Using one byte for most chunks makes the algorithm easy, and guarantees the parts are not all of equal size. If instead we need to have every chunk use a unique size, then make the first chunk 1 byte, the second 2 bytes, and so on, until the last chunk is the remainder of the file. Since there is no stated purpose for this exercise, it is difficult to determine whether this solution is acceptable.
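
A sketch of that unique-sizes sequence (illustrative only; a real file's total would be its byte count):
Code:
# chunk sizes 1, 2, 3, ..., with the remainder in the last chunk
total=75; n=7; s=0; sizes=()
for ((k=1; k<n; k++)); do sizes+=("$k"); ((s+=k)); done
sizes+=("$((total - s))")
echo "${sizes[@]}"    # 1 2 3 4 5 6 54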

OP: what are you trying to achieve with this split?

eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9781
Location: almost Mile High in the USA

Posted: Fri Nov 10, 2023 9:01 am

Yes, I was wondering because this seems to have no practical purpose other than as a thought experiment... if there is one, I'd like to know.
Does it even have to be random? 24 26 23 27 works, and this is a precooked heuristic: start with equal numbers, then for alternating chunks subtract one / add one, subtract two / add two... keep going until you run out. It should work fine as long as n chunks stays well below the total divided by n. No randomness at all, every file is a different size, and all are about the size of the average... up until there are a huge number of chunks for a small file.
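
A sketch of that alternating scheme (assuming an even number of chunks and a total that divides evenly, as in this example):
Code:
# 100 into 4 parts: 25 25 25 25 becomes 24 26 23 27 (all different, same sum)
total=100; n=4; base=$((total / n)); d=1
for ((k=0; k<n; k++)); do
  if (( k % 2 == 0 )); then sizes[k]=$((base - d))
  else sizes[k]=$((base + d)); ((d++))
  fi
done
echo "${sizes[@]}"    # 24 26 23 27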

... why....

pjp
Administrator

Joined: 16 Apr 2002
Posts: 20424

Posted: Sat Nov 11, 2023 4:31 pm

Hu wrote:
Yes, but a later post in the thread specifies equally large parts. If we want unequally large parts, my preference would be to divide the file into chunks of sizes 1, 1, 1, ... and N-(num chunks). Using one byte for most chunks makes the algorithm easy, and guarantees the parts are not all of equal size. If instead we need to have every chunk use a unique size, then make the first chunk 1 byte, the second 2 bytes, and so on, until the last chunk is the remainder of the file. Since there is no stated purpose for this exercise, it is difficult to determine whether this solution is acceptable.

OP: what are you trying to achieve with this split?
I think that use of "equally" was a typo / language issue. The rest of the post clarifies with examples what they are looking for. I interpret "100:4" to be 100% into 4 parts. The examples of "good" and "not good" suggest (my words) a "maximum size difference" between each of the 4 chunks. From the examples, 33:15 and 35:17 only differ by 18. Those are "good," whereas 82:7 and 60:10 are "not good." So a maximum difference somewhere below 50 is acceptable, with 18ish clearly fine. Obviously SarahS93 should be more specific, otherwise the correct solution may be impossible to achieve.

As for the big picture, and based partly on unrelated posts, I'm guessing the goal is to scramble a file as a means of "security" to then allow a recipient of the scrambled file to reassemble it when given the reconstruction information, presumably separate from the file.

eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9781
Location: almost Mile High in the USA

Posted: Sat Nov 11, 2023 5:19 pm

Random small pieces are just as difficult/annoying to properly reassemble as larger, similar-sized pieces, so a good reason for these unequal pieces still remains to be seen...

The file is still plaintext anyway, so important data can still be gleaned from it. If you encrypt it, well, the encryption adds entropy anyway, and the piece scrambling is just another step to reconstruct, where a larger key size would have been just as good with fewer reconstruction steps.

Hence this still just looks like an academic exercise.

pjp
Administrator

Joined: 16 Apr 2002
Posts: 20424

Posted: Sat Nov 11, 2023 9:07 pm

Well, if you somehow guess that the first separation is at 10GB and all parts can be rejoined at the same offset, that's easier than figuring out a different offset at each step. I'm not saying it's a good idea or at all secure, but past posts lead me in that direction. Maybe it's academic, maybe it isn't.

SarahS93
l33t

Joined: 21 Nov 2013
Posts: 725

Posted: Sat Nov 11, 2023 11:35 pm

The solutions here are very acceptable, thanks!

It does not have to be random; 24, 26, 23 and 27 is good too.
100:4 is only an example.

There is a practical purpose for me!

I have encrypted container files from 22GB up to 77GB.
I will split them into 5 up to 9 parts of unequal size.
After that, I split each of the parts again into equal-size parts and put them into a new file in the wrong sequence.


pjp wrote:

As for the big picture, and based partly on unrelated posts, I'm guessing the goal is to scramble a file as a means of "security" to then allow a recipient of the scrambled file to reassemble it when given the reconstruction information, presumably separate from the file.

Yes, so it is.
But for every file I use different options:
one file is in 5 parts, another is in 9 parts, and another one is in 7 parts, and so on...
I will not have any pattern.

eccerr0r
No, it is not an academic exercise.

szatox
i=7; j=75; a=( ); while [[ $j -gt 0 ]]; do k=$((RANDOM%${i})); a[$k]=$((a[$k]+1)); j=$((j-1)); done; echo ${a[@]}
works very well for me, thanks!
Any idea how to add a filter for lines like
10 10 5 11 5 19 15
where "10" appears two times?
My thought: if any number appears two or more times, then run the command again?


Code:
i=7; j=77; a=( ); while [[ $j -gt 0 ]]; do k=$((RANDOM%${i})); a[$k]=$((a[$k]+1)); j=$((j-1)); done; echo ${a[@]}
11 17 15 7 9 4 14

It works! I can use it to generate random "gigabyte" numbers.

But if I run it with a megabyte number like 77777 (~77GB)
Code:
i=7; j=77777; a=( ); while [[ $j -gt 0 ]]; do k=$((RANDOM%${i})); a[$k]=$((a[$k]+1)); j=$((j-1)); done; echo ${a[@]}
11127 11267 11064 11214 10972 11005 11128

then all the numbers are much closer together.


I will combine it with
Code:
declare -i counter
counter=1
for i in $(i=7; j=77; a=( ); while [[ $j -gt 0 ]]; do k=$((RANDOM%${i})); a[$k]=$((a[$k]+1)); j=$((j-1)); done; echo ${a[@]} ) ; do head -c${i}G > "part$counter"; (( ++ counter )); done < file

szatox
Advocate

Joined: 27 Aug 2013
Posts: 3350

Posted: Sun Nov 12, 2023 12:41 am

Quote:
Yes, so it is.
But for every file I use different options:
one file is in 5 parts, another is in 9 parts, and another one is in 7 parts, and so on...
I will not have any pattern.

eccerr0r
No, it is not an academic exercise.

Well... Let me just tell you this: anyone can come up with a security scheme that he himself cannot imagine anyone breaking.
E.g. a monkey might want to hide a banana in a hollow.
Anyway, if the file has a structure that can be traced, whether with a dictionary or by magic numbers or whatever, it can probably be stitched back together, making your security measure ineffective.
If you encrypt it first to get rid of any traces of structure, breaking a proper encryption will take literally ages (even allowing for currently unknown computing power boosts from new technologies), making this extra step unnecessary while still inconvenient. Basically, you're doing something stupid, so it would actually be better to keep it in the realm of academic exercises.

Quote:
Any idea how to add a filter for lines like
10 10 5 11 5 19 15
where "10" appears two times?
My thought: if any number appears two or more times, then run the command again?
Sure, you can run it multiple times and discard results you don't like, or make it actually deterministic: do integer division with remainder to determine the average chunk size and the desired bias, then in a loop add and subtract a counter to spread the values, skipping some chunks so the resulting bias actually matches your remainder.
Play with the numbers a bit; there is no "one right way to do that". I mean, if you use it for security, all ways will be equally bad :lol: but it's still a fun little exercise, so whatever.
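
For the rerun-and-discard variant, a minimal sketch (reusing the 7/77 example values from above):
Code:
# rerun until all chunk sizes are unique
while :; do
  out=$(i=7; j=77; a=( ); while [[ $j -gt 0 ]]; do k=$((RANDOM%i)); a[$k]=$((a[$k]+1)); j=$((j-1)); done; echo "${a[@]}")
  [[ -z $(tr ' ' '\n' <<< "$out" | sort -n | uniq -d) ]] && break
done
echo "$out"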

pjp
Administrator

Joined: 16 Apr 2002
Posts: 20424

Posted: Sun Nov 12, 2023 2:16 am

szatox wrote:
Well... Let me just tell you this: anyone can come up with a security scheme that he himself cannot imagine anyone breaking.
E.g. a monkey might want to hide a banana in a hollow.
Anyway, if the file has a structure that can be traced, whether with a dictionary or by magic numbers or whatever, it can probably be stitched back together, making your security measure ineffective.
Given that the information is encrypted, it seems much less likely that it could be stitched back together. Wouldn't it have to be reassembled correctly _before_ they could break the encryption? If you have a file that was encrypted and then had its parts disassembled and randomly reassembled, it would not be possible for the person in possession of the file to be of assistance in either correctly reordering the file or decrypting it.

szatox wrote:
If you encrypt it first to get rid of any traces of structure, breaking a proper encryption will take literally ages (even allowing for currently unknown computing power boosts from new technologies), making this extra step unnecessary while still inconvenient. Basically, you're doing something stupid, so it would actually be better to keep it in the realm of academic exercises.
I disagree, and think that perspective is primarily useful in academic theory. No matter how academically unbreakable an encryption might be, it becomes pointless when wrench-decryption is applied.

szatox
Advocate

Joined: 27 Aug 2013
Posts: 3350

Posted: Sun Nov 12, 2023 1:09 pm

Quote:
No matter how academically unbreakable an encryption might be, it becomes pointless when wrench-decryption is applied.
Sure, but the same thing applies to reordering the pieces, and the attack does not become even a tiny bit more difficult. It won't even require upgrading a $5 wrench to an $8 one.

Hu
Administrator

Joined: 06 Mar 2007
Posts: 22450

Posted: Sun Nov 12, 2023 4:01 pm

Until the most recent post from OP, it looked like OP was trying to use this split+scramble in lieu of, rather than in addition to, proper cryptography. For the very first post, I was willing to consider that this was just an attempt to deal with weird size caps on an upload service, and splitting the file into pieces allowed them to be uploaded to a service that would not take the full file in a single step.

History shows many examples of people inventing self-perceived "great" designs to use to obfuscate data, calling it encryption, and then having it fail horribly when serious attackers (whether cryptography researchers who publish their attack, or malicious attackers who use the attack for direct gain) examine the algorithm and find that an output "ciphertext" can be converted back to the plaintext form in very reasonable time. Therefore, I (and, I think several other posters) have a strong bias toward assuming that anyone asking for this sort of thing is doing it as part of a homemade encryption scheme that will likely fail the first time a serious attacker tries to analyze it. The responsible thing to do in that case is to warn the requester that this is not a secure way to conceal data, and that using a well verified cryptographic protocol will be both easier and safer.

Yes, physical assault can break even the best cryptographic schemes, but one of the goals of good cryptography is that an attacker's only choices are (1) a brute force search of a large key space, which ideally should take so long on average that the attacker chooses not to bother or (2) attack the endpoint via espionage / assault.

If this is being used in addition to a known good cryptography scheme, then I think it is still pointless because it adds complexity with little value. However, if OP wants to overcomplicate a secure system, and does so in a way that does not reduce its security, that is OP's choice.

SarahS93
l33t

Joined: 21 Nov 2013
Posts: 725

Posted: Sun Nov 12, 2023 9:52 pm

If I read it right, in short: scramble and split does not make it worse, but also not more secure?

Create an encrypted file, and inside it another encrypted file, and so on and so on... how deep, 5 times? Different ciphers, no headers on the encrypted files (storage for all 5 separated) - is this more secure than 1 encrypted file with scramble and split?

An encrypted file (header detached) - is it possible to identify this file as an encrypted file, and specifically as an encrypted LUKS file?

szatox
Advocate

Joined: 27 Aug 2013
Posts: 3350

Posted: Sun Nov 12, 2023 10:58 pm

SarahS93, explanation in a nutshell here: https://xkcd.com/538/

The wrench attack does have some disadvantages, like it's not exactly stealthy, but brute force is best applied to the weakest link, and that ain't AES right now.
Encrypting stuff with different algorithms and different (unrelated) passwords could help against not-yet-discovered vulnerabilities which could potentially make decryption trivial in the future, but it won't protect you from the wrench in the hands of a determined attacker, while still piling up inconveniences for the users. And sufficiently inconvenienced users write passwords down on sticky notes next to their monitors.

Quote:
An encrypted file (header detached) - is it possible to identify this file as an encrypted file, and specifically as an encrypted LUKS file?
Hard to say. A LUKS container with a detached header should look just like random noise (unless you made a mistake creating it, like making it a sparse file or using a bad encryption mode).
Do you mind explaining why you have a file with 100GB of noise, though? That's sus.

SarahS93
l33t

Joined: 21 Nov 2013
Posts: 725

Posted: Fri Jun 21, 2024 11:29 pm

With
Code:
declare -i counter ; counter=1 ; for i in 11 3 26 17 6 8 4; do head -c${i}G > "part.$counter"; (( ++ counter )); done < 75G

the file is read and split, OK.

How do I run this through pipe viewer?
Code:
declare -i counter ; counter=1 ; for i in 11 3 26 17 6 8 4; do head -c${i}G | pv > "part.$counter"; (( ++ counter )); done < 75G

The output goes through pv, but with each new file that is created, pv starts a new line; that is not good to watch.
I don't know the way to send the input through pipe viewer.
The input comes from "< 75G"; how do I send this through pv first and then to head?

szatox
Advocate

Joined: 27 Aug 2013
Posts: 3350

Posted: Sat Jun 22, 2024 12:36 am

Maybe put the whole input through a single pv instead of putting each chunk through a different one?

Instead of
for ... pv ... done < input
use this:
pv input | for ... done
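
Spelled out against the loop from earlier in the thread (a sketch; pv writes its progress bar to stderr, so it stays on one line for the whole file):
Code:
declare -i counter=1
# one pv for the whole file, chunks carved off its output
# (assumes GNU head, which reads exactly the requested byte count from a pipe)
pv 75G | for i in 11 3 26 17 6 8 4; do
  head -c${i}G > "part.$counter"
  (( ++counter ))
done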