Sieve of Eratosthenes (Bash)
From LiteratePrograms
- Other implementations: Alice ML | Bash | C++ | Forth | Haskell | Java | Python | Python, arrays | Scala
The Sieve of Eratosthenes is an algorithm for rapidly locating all the prime numbers in a certain range of integers. It operates by marking as composite all nontrivial multiples of each prime in sequence until only primes remain unmarked. It is most well-suited to implementation using arrays, which are compact and feature random access.
Contents |
theory
Die ganzen Zahlen hat der liebe Gott gemacht, alles andere ist Menschenwerk. — L Kronecker
By adapting a vectorized algorithm for the sieve, we minimize the work done in bash and maximize the work done by the utilities it calls.
cf. the λtU discussion of sieve complexities and Oleg's minimal implementation (which, per Kronecker's statement, relies on neither multiplication nor division, preferring successor and zero comparison).
practice
We start with a standard sieve: to find the primes less than a given number ($1) we take all (proper) multiples of 1 — i.e. all of them — and filter — sift out — all the composites with comm. The ones which remain are, by definition, the primes.
<<sieve primes>>= function primes() { [[ $1 -gt 2 ]] && comm -23 <(mults 1 $1) <(comps $1); } #alternative function primes() { [[ $1 -gt 2 ]] && comm -23 <(seq 2 1 $1) <(comps $1) }
Generating the proper multiples of a number ($1) less than a given number ($2) is an ancillary problem that can be delegated to jot.
<<generate multiples>>= function mults() { jot - $((2*$1)) $2 $1; } #alternative function mults() { seq 2 1 $1 }
The crux of the problem is generating the composites; but we simply generate all multiples of each of the potential prime factors. Because a prime greater than can't be a factor of any composite less than n, the recursion proceeds as loglogn. As the built-in bash arithmetic doesn't provide a √ function, we resort to the v command in dc.
<<generate composites>>= function comps() { n=$1; shift for p in $(primes $(dc -e "$n vp")); do mults $p $n done | sort -n | uniq } #alternative function comps() { for p in $(primes $(dc -e "$1 vp")); do seq $((2*$p)) $p $1 done | sort -nu }
Question: why does comm require that we use sort and uniq?
Question: what other representations are possible?
wrapping up
Finally, we add a small application wrapper allowing the user to specify an upper bound
<<sieve.bash>>= #!/bin/bash generate multiples generate composites sieve primes primes ${1:-1000} | fmt
but which defaults to producing the primes less than 1000:
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 101 103 107 109 113 127 131 137 139 149 151 157 163 167 173 179 181 191 193 197 199 211 223 227 229 233 239 241 251 257 263 269 271 277 281 283 293 307 311 313 317 331 337 347 349 353 359 367 373 379 383 389 397 401 409 419 421 431 433 439 443 449 457 461 463 467 479 487 491 499 503 509 521 523 541 547 557 563 569 571 577 587 593 599 601 607 613 617 619 631 641 643 647 653 659 661 673 677 683 691 701 709 719 727 733 739 743 751 757 761 769 773 787 797 809 811 821 823 827 829 839 853 857 859 863 877 881 883 887 907 911 919 929 937 941 947 953 967 971 977 983 991 997
επίλογος
Given a language like J with built-in number theoretical primitives, finding primes becomes trivial:
<<primes in J using primitives>>= primes=.p:@i.@(_1&p:)
Howeer, if one tries to take the direct translation of the algorithm above into python
<<direct translation to python>>= primes = lambda n: (n > 2) and [ i for i in range(2,n) if i not in sum([range(2*p,n,p) for p in primes(int(n**.5))],[]) ] or [] if __name__ == "__main__": import sys n = (len(sys.argv) > 1) and int(sys.argv[1]) or 100 print primes(n)
it runs over 100 times more slowly to generate the first 5000 primes (22 sec vs. 0,17 sec). By minimizing the work done in bash and maximizing the work done by the C utilities, even with the process ond I/O overheads we easily beat a poorly optimized purely interpreted implementation.
Note: the reason, why this python implementation is >100 times slower, is that range() is used, instead of xrange(). range() returns a complete copy of a list, which also grows with every iteration step and allocates more and more memory. if the python programm was to be rewritten to use xrange() instead, the performace impacted would be MUCH smaller.
Download code |