Sieve of Eratosthenes (Bash)

From LiteratePrograms

Other implementations: Alice ML | Bash | C++ | Forth | Haskell | Java | Python | Python, arrays | Scala

The Sieve of Eratosthenes is an algorithm for rapidly locating all the prime numbers in a certain range of integers. It operates by marking as composite all nontrivial multiples of each prime in sequence until only primes remain unmarked. It is most well-suited to implementation using arrays, which are compact and feature random access.

theory

God made the integers, all else is the work of man. — L. Kronecker

By adapting a vectorized algorithm for the sieve, we minimize the work done in bash and maximize the work done by the utilities it calls.

cf. the λtU discussion of sieve complexities and Oleg's minimal implementation (which, per Kronecker's statement, relies on neither multiplication nor division, preferring successor and zero comparison).

practice

We start with a standard sieve: to find the primes up to a given number ($1) we take all proper multiples of 1 (that is, every integer from 2 on) and sift out all the composites with comm. The numbers which remain are, by definition, the primes.
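
A miniature of this sifting step, with the candidate and composite lists written out by hand rather than generated by the chunks defined below:

```shell
# comm -23 prints the lines found only in its first input, so the
# composites 4 6 8 9 are sifted from the candidates 2..9, leaving
# the primes (comm expects both inputs in its sort order).
comm -23 <(printf '%s\n' 2 3 4 5 6 7 8 9) <(printf '%s\n' 4 6 8 9)
```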

<<sieve primes>>=
function primes() { [[ $1 -ge 2 ]] && comm -23 <(mults 1 $1) <(comps $1); }
#alternative
function primes() {
	[[ $1 -ge 2 ]] &&
	comm -23 <(seq 2 1 $1) <(comps $1)
}

Generating the proper multiples of a number ($1) up to a given bound ($2) is an ancillary problem that can be delegated to jot (or, on GNU systems, to seq, as in the alternative).

<<generate multiples>>=
function mults() { jot - $((2*$1)) $2 $1; }
#alternative
function mults() {
	seq $((2*$1)) $1 $2
}
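
As a quick check of the seq form (start at twice the number, step by it, stop at the bound), here are the proper multiples of 3 up to 20:

```shell
# Proper multiples of 3 that do not exceed 20: start at 2*3, step by 3.
function mults() { seq $((2*$1)) $1 $2; }

mults 3 20   # prints 6 9 12 15 18, one per line
```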

The crux of the problem is generating the composites, but we simply generate all multiples of each of the potential prime factors. Because every composite less than n has a prime factor no greater than √n, only the primes up to √n are needed, and the recursion, taking a square root at each level, bottoms out in about log log n steps. As the built-in bash arithmetic doesn't provide a √ function, we resort to the v command in dc.
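
Where dc is unavailable, an integer square root can also be had from pure bash arithmetic; a minimal sketch (isqrt is a hypothetical helper name, not part of the chunks below):

```shell
# Integer square root by counting up until the next square would exceed n;
# for nonnegative integers this matches dc's v command at its default scale.
function isqrt() {
	local n=$1 r=0
	while (( (r + 1) * (r + 1) <= n )); do
		(( r++ ))
	done
	echo $r
}

isqrt 1000   # prints 31, as dc -e "1000 vp" would
```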

<<generate composites>>=
function comps() {
	local n=$1
	for p in $(primes $(dc -e "$n vp")); do
		mults $p $n
	done | sort -n | uniq
}
#alternative
function comps() {
	for p in $(primes $(dc -e "$1 vp")); do
		seq $((2*$p)) $p $1
	done | sort -nu
}
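
The chunks can be assembled and exercised as follows; as portability assumptions, seq stands in for the BSD-only jot and a small pure-bash integer square root (isqrt, a hypothetical helper) stands in for dc, and the bound stays in single digits so that comm's lexicographic order agrees with numeric order:

```shell
# Assembled sieve: the composites are every proper multiple of each prime
# up to the integer square root of n; comm sifts them from the candidates.
function isqrt() { local r=0; while (( (r + 1) * (r + 1) <= $1 )); do (( r++ )); done; echo $r; }
function mults() { seq $((2*$1)) $1 $2; }
function comps() {
	local n=$1
	for p in $(primes $(isqrt $n)); do
		mults $p $n
	done | sort -n | uniq
}
function primes() { [[ $1 -ge 2 ]] && comm -23 <(seq 2 1 $1) <(comps $1); }

primes 9   # prints 2 3 5 7, one per line
```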

Question: why does comm require that we use sort and uniq?

Question: what other representations are possible?

wrapping up

Finally, we add a small application wrapper allowing the user to specify an upper bound

<<sieve.bash>>=
#!/bin/bash
<<generate multiples>>
<<generate composites>>
<<sieve primes>>
primes ${1:-1000} | fmt

but which defaults to producing the primes less than 1000:

2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83
89 97 101 103 107 109 113 127 131 137 139 149 151 157 163 167 173
179 181 191 193 197 199 211 223 227 229 233 239 241 251 257 263 269
271 277 281 283 293 307 311 313 317 331 337 347 349 353 359 367 373
379 383 389 397 401 409 419 421 431 433 439 443 449 457 461 463 467
479 487 491 499 503 509 521 523 541 547 557 563 569 571 577 587 593
599 601 607 613 617 619 631 641 643 647 653 659 661 673 677 683 691
701 709 719 727 733 739 743 751 757 761 769 773 787 797 809 811 821
823 827 829 839 853 857 859 863 877 881 883 887 907 911 919 929 937
941 947 953 967 971 977 983 991 997

epilogue

Given a language like J with built-in number-theoretic primitives, finding primes becomes trivial:

<<primes in J using primitives>>=
primes=.p:@i.@(_1&p:)

However, if one takes the direct translation of the algorithm above into Python

<<direct translation to python>>=
primes = lambda n: (n > 2) and [
		i for i in range(2,n)
		if i not in sum([range(2*p,n,p)
				for p in primes(int(n**.5))],[])
		] or []
if __name__ == "__main__":
	import sys
	n = (len(sys.argv) > 1) and int(sys.argv[1]) or 100
	print primes(n)

it runs over 100 times more slowly to generate the first 5000 primes (22 sec vs. 0.17 sec). By minimizing the work done in bash and maximizing the work done by the C utilities, even with the process and I/O overheads we easily beat a poorly optimized, purely interpreted implementation.

Note: the reason this Python implementation is over 100 times slower is its use of range() instead of xrange(). range() builds a complete list in memory, and here the summed list of composites also grows with every iteration step, allocating more and more memory. If the Python program were rewritten to use xrange(), the performance impact would be much smaller.
