After about a year of idleness, here comes another entry. And this one's about my little project of late: CryptoFrame.
Lacking
a better name, and still in an early stage, it is meant to be a
workbench for generic math/crypto related activities. Of course, you
can do whatever you want with it, in your own plugin. I included 3
initial plugins to play with. One is the result of my experiments with
some elliptic curve stuff. Hasher is a revamp of something I did a
while ago. The other is a specialized calculator that I figure will be
kind of handy (at least for me).
Anyway, you can download and give it a try here.
18-05-2005 - Factoring software reviewed.
From what I've been able to see, people trying to
factor numbers often use slow (and old) tools to factor rather large
numbers. While this has the only disadvantage of slowing down the
process (as we'll see later, in some cases is a quite large slowdown),
I figured a summary of some good factoring tools would be appropriate.
Well, let's start with factor.exe, bundled with the MIRACL
multi-precision package. First off, it tries to factor using
consecutive divisions by small primes. Then, it uses Pollard's rho,
William's and Elliptic Curve methods, and finally the MPQS (Multiple
Polynomial Quadratic Sieve). While this is acceptable for rather small
numbers, for larger numbers this gets quite slow (factor.exe only
resorts to the MPQS after trying all other methods). Let's find some
alternatives, then.
Based only on the Quadratic Sieve algorithms (both the MPQS and SIQS), there are two options: ppsiqs (there's also ppmpqs, but it's always slower than ppsiqs, so I won't bother with it) and msieve.
ppsiqs already gives us a great speed-up over factor.exe (a 256 bit
number passes from 3 hours of factoring time to ~40 minutes). Problem
is, it is old. As in 2001 old. Being compiled with the Borland compiler
doesn't help either (I'm not a big fan of Borland, for those who don't
know already), especially since trying to compile it with other C++
compilers fails. Oh well, let's check msieve. Now, msieve looks
interesting. It is coded in C, so is it easier to build. And it is
faster than ppsiqs (again with a 256 bit number (the same), ~25
minutes). Still, it uses only its own MP functions, in pure C. I wonder
how much faster would it be using some highly optimized functions for
that...
On the NFS side, there are two implementations I'm aware of: snfs and ggnfs.
snfs is OLD. As in 1999 old. Besides, there's no source, we must have
already a polynomial for the number we want factored, and all the
documentation is in japanese. Compared to that, ggnfs is much better.
It's developed by Chris Monico (yes, the same guy who lead ECDLP-109
calculation a while ago), and it has impressive results when factoring
big numbers (much better than the Quadratic Sieve aproaches). However,
not all is good. This tool is harder to build and to work with.
Fortunately there's a Perl script bundled (factlat.pl) that helps the
process. And keep in mind that the GNFS only really compensates with
larger numbers (from what I've seen, >280 bit. Maybe less).
I'm also aware of mpqs4linux, but I haven't been able to test it yet. I hope this helps any of you interested in factoring numbers.
10-05-2005 - RSA-663 factored.
Seems the same team that factored RSA-576 has done it
again. This time a 663 bit (200 digit) number was factored. The
composite number was:
2799783391122132787082946763872260162107044678695542853756000992932612840010760934567105295536085606
1822351910951365788637105954482006576775098580557613579098734950144178863178946295187237869221823983
3532461934402770121272604978198464368671197400197625023649303468776121253679423200058547956528088349
7925869954478333033347085841480059687737975857364219960734330341455767872818152135381409304740185467
Yeah, that's it. I've been doing some fairly serious
optimization in Pentium 4 and have to say it is weird, at the very
least. I mean, instructions like ADC and SBB have a huge latency (6 clocks against 0.5 of regular ADD/SUB). INC/DEC is slower than ADD/SUB reg, 1. Why? From the Intel Architecture Optimization Manual:
"inc
and
dec instructions should be replaced with an add or sub instruction,
because add and sub overwrite all flags, whereas inc and dec do not,
therefore creating false dependencies on earlier instructions that set
the flags."
Ridiculous. Also, SHL/SHR has a 4 clock latency, 8 times more than regular arithmetic instructions. This means you get faster code doing sequencial ADDs, if the shift is relatively low (<= 3). Using LEA in this case wouldn't help either, since it maps internally into ADDs, MOVs and SHLs. BSWAP has a latency of 7 clocks in P4 Northwood, but looks like it gone down to 1 clock in Prescott. Same goes for SHL/SHR. Oh well.
For SIMD, we have the classic MOVQ/MOVDQA/MOVDQU example. Any of these attains a 6 clock latency. Using PXOR/POR sequency attains 2 clocks latency. But then, the MOVQs are done in the FP_MOVE unit, sparing the MMX_ALU for other uses. The PSHUFW trick used in MMX doesn't work too well with XMM registers (4 clock latency). It's a bit tricky to make optimal code for this.
Pentium 4 has a tiny L1 cache, and only 6 (using EBP
gives you 7) general purpose registers is very little, really (this is
not Pentium 4 related only, now). At least in x86-64 there are twice as
much general purpose registers. Let's wait and see.
By the way, for those wondering what latency means. Latency is the
number of clocks a given instruction takes to complete internally.
I've tried several multi-precision libraries in the
past, such as FreeLIP and MIRACL. So this time, I decided to give GMP
(stands for GNU Multi-Precision) a try.
The first problem was trying to compile it in Windows. MSVC or any
other Windows-only compiler are out of question, and the AT&T
syntax of all the assembly optimized stuff doesn't help at all. So, I
managed to get it compiled in GCC (with MingW32). Having a working lib,
I was happier than before, but when trying to build something using my
newly created lib I kept having a missing dependency, called
"__alloca". Linking libgcc.a along with my little test app seemed to
help. Later, I found the --disable-alloca switch in the configure script.
But the thing that annoyed me most with GMP was that there was no documentation bundled with it at all.
So, if I want to implement something, I have to be checking all the
source files, and try to find what I want. This is actually one of the
things I love about MIRACL: it comes with a detailed manual, where you
have everything you need. But, after overcoming those disconforts, and
actually making something that works, GMP is not that bad. It's pretty
fast, at least as much as MIRACL, and the API is pretty clean and
organized. Besides, it is multiplatform and optimized for most of the
platforms it supports. We can easily make something to run on a Cray or
on a PDA.
Here is a sample RSA and ElGamal implementation I made while trying to dig up GMP's integer functions. It's pretty easy to understand, so have fun.
Many symmetric encryption algorithms and hash
functions (e.g. SHA-1, HAS160, AES, etc) use bit rotations, among other
transformations. This kind of algorithms is specially known for
demanding speed. Yet, it is very common to see inefficient
implementations of some of these functions in software. For example, it
is very common to see a rotate n bits left operation implemented like that:
#define ROTL(x, n) ( (unsigned long)((x) << (n)) | (unsigned long)((x) >> (32-(n))) )
Hell, for code that has to run on every damn platform in the world this
actually makes sense, since we can't possibly know what instructions
the machine has. But when we do code for a specific platform, this is
not the best solution. In x86 we have a ROL
instruction that does the job of several logical instructions, and
faster. How much faster? Well, I've made a little benchmark to check
that out. Let's rotate an arbitrary number by 5 bits 2^32 times, using
both the regular aproach, and the optimized one. This is what I get on
my box (Pentium 4 1.7 GHz):
Regular - 18500 ms
I guess the results speak for themselves. But beware. Some compilers
(like GCC) actually recognize the rotation done and will use ROL. But
are stupid enough to use ROL with memory, thus making the loop much
slower than it would be even with logical operations (the version
compiled by GCC took 32704 ms). Obviously everything I'm saying here
for rotating left also applies for rotating right.
Optimized - 10578 ms
This is one example of stuff that could be easily optimized and often
isn't. Another interesting example is byte swap (big endian <->
little endian). Funny, rotating left is also used in implementations of
it. It could be substituted by BSWAP instruction easily in x86 (it's not like anyone still uses <486). Byte swap for a 32-bit value is often defined as:
#define BSWAP(x) ((ROTL((x), 8) & 0x00FF00FF) | (ROTL((x), 24) & 0xFF00FF00))
Well, I've done yet another benchmark to check the speed of this. These are the results:
Regular - 38468 ms
This one's even more impressive. So, the point is, if you see you can
optimize some often used code, do it. Specially in speed-demanding apps.
Optimized - 18563 ms
01-03-2005 - RSA low encryption exponent attack
As everyone knows, RSA encryption (and public key
cryptography generally) is much slower than symmetric algorithms.
Therefore, efforts are often done to optimize performance on these
algorithms. One way of doing it is by using a small encryption
exponent. In case you don't know, RSA encryption is basically the
following:
C = M^e (mod n)
, in which M is the data to be encrypted, e is the encryption exponent, and n is the public key (the '^' character here means "power of", not "xor"). C is obviously the ciphertext. So, using a smaller e results in faster encryption (or at least an e
which has not much 1's in binary, such as 2^k+1, k being a positive
integer. 65537 is often used too. Even though, 3 needs 2
multiplications, whereas 65537 needs 17). But there's a catch. If the
same message is sent e times, with different public keys, it is possible to recover the plaintext without knowing the private key. How so?
Let's imagine e = 3. It is very likely all three n used are co-primes, and therefore knowing both ciphertexts and public keys, it is a question of solving the expression:
There's 0 ≤ x < n1*n2*n3 forwhich
x ≡ c1 (mod n1)
x ≡ c2 (mod n2)
x ≡ c3 (mod n3)
This system can be solved using Gauss's or Garner's method (I won't
describe them here, at least for now). After finding a solution, you
must take into account that the x you got is still M^3. Therefore, you must compute the cube root of x, and you will have the decrypted message, with no need of d.
Even worse, if the plaintext is smaller than the eth root of n, it is only needed to compute the eth root of the ciphertext to recover the plaintext! For instance, encrypting the string "Dcoder" with a 2048-bit strong public key, using 3 as encryption exponent, allows one to recover the plaintext in less than a second. Scary.
How can one prevent this kind of attack? There are two main options: either use a large encryption exponent or add random garbage to the end of each message. The latter allow you to preserve the high performance that justifies the low exponent.
CMOV is an interesting instruction. It has been there since Pentium Pro, and yet it is hardly used, even in pure assembly written code. This instruction is as fast as a single MOV, and prevents branching, since no jumps and other stuff is needed. The code gets more readable too! So, why one wouldn't use it? Because 486's don't have it? Seriously, those processors aren't used anymore for serious work. So, if you're to write optimized code, do use CMOV, for it's good.
25-02-2005 - Hash functions and symmetric encryption algorithms
This bothers me. I have often people asking me how to decrypt MD5, or how to reverse SHA-1. I can only guess they are confusing those hash algorithms with common symmetric algorithms, like DES or Blowfish. Well, once and for all, let's find out what is a hash function, and how it differs from a symmetric cipher. In order not to be just talking out of my ass, i'll quote a known book on the field, Handbook of Applied Cryptography, by A. Menezes, P. van Oorschot and S. Vanstone.
Definition - A hash function (in the unrestricted sense) is a function h which has, as a minimum, the following two properties:
So, what these functions do is take an amount of data of arbitrary (but finite) size, and shrink it down to its unique n bits value. Anything else? In fact, yes. These are three basic properties of hash functions:
What do we get from these properties? Well, some things for sure. However, my point lies on the first property, which is also denominated one-way. Once you have some unknown data's hash value, it is impossible, except by brute-force (i.e. hashing every possible combination of bits until the wanted one is found), to know the original data. Therefore, you can't decrypt it! It's as simple as this.
Well, this is true in a perfect hashing algorithm. Most of them have known flaws though. Many of you probably heard of SHA-1 being broken, not long ago. Well, you still need 2^69 calculations, instead of the 2^80 calculations expected for a birthday attack. Even though, it is 2048 times easier to break SHA-1 like that. I guess SHA-1 is pretty much finished for serious cryptographic work, yet it can be of many uses, just like MD5 (and even MD4) still is today.