Random Thoughts

 

06-05-2006 - Yearly update.

After about a year of idleness, here comes another entry. And this one's about my little project of late: CryptoFrame.

Lacking a better name, and still in an early stage, it is meant to be a workbench for generic math/crypto related activities. Of course, you can do whatever you want with it, in your own plugin. I included 3 initial plugins to play with. One is the result of my experiments with some elliptic curve stuff. Hasher is a revamp of something I did a while ago. The other is a specialized calculator that I figure will be kind of handy (at least for me).

Anyway, you can download and give it a try here.

18-05-2005 - Factoring software reviewed.

From what I've been able to see, people trying to factor numbers often use slow (and old) tools to factor rather large numbers. While this has the only disadvantage of slowing down the process (as we'll see later, in some cases is a quite large slowdown), I figured a summary of some good factoring tools would be appropriate.

Well, let's start with factor.exe, bundled with the MIRACL multi-precision package. First off, it tries to factor using consecutive divisions by small primes. Then, it uses Pollard's rho, William's and Elliptic Curve methods, and finally the MPQS (Multiple Polynomial Quadratic Sieve). While this is acceptable for rather small numbers, for larger numbers this gets quite slow (factor.exe only resorts to the MPQS after trying all other methods). Let's find some alternatives, then.

Based only on the Quadratic Sieve algorithms (both the MPQS and SIQS), there are two options: ppsiqs (there's also ppmpqs, but it's always slower than ppsiqs, so I won't bother with it) and msieve. ppsiqs already gives us a great speed-up over factor.exe (a 256 bit number passes from 3 hours of factoring time to ~40 minutes). Problem is, it is old. As in 2001 old. Being compiled with the Borland compiler doesn't help either (I'm not a big fan of Borland, for those who don't know already), especially since trying to compile it with other C++ compilers fails. Oh well, let's check msieve. Now, msieve looks interesting. It is coded in C, so is it easier to build. And it is faster than ppsiqs (again with a 256 bit number (the same), ~25 minutes). Still, it uses only its own MP functions, in pure C. I wonder how much faster would it be using some highly optimized functions for that...

On the NFS side, there are two implementations I'm aware of: snfs and ggnfs. snfs is OLD. As in 1999 old. Besides, there's no source, we must have already a polynomial for the number we want factored, and all the documentation is in japanese. Compared to that, ggnfs is much better. It's developed by Chris Monico (yes, the same guy who lead ECDLP-109 calculation a while ago), and it has impressive results when factoring big numbers (much better than the Quadratic Sieve aproaches). However, not all is good. This tool is harder to build and to work with. Fortunately there's a Perl script bundled (factlat.pl) that helps the process. And keep in mind that the GNFS only really compensates with larger numbers (from what I've seen, >280 bit. Maybe less).

I'm also aware of mpqs4linux, but I haven't been able to test it yet. I hope this helps any of you interested in factoring numbers.

10-05-2005 - RSA-663 factored.

Seems the same team that factored RSA-576 has done it again. This time a 663 bit (200 digit) number was factored. The composite number was:

2799783391122132787082946763872260162107044678695542853756000992932612840010760934567105295536085606
1822351910951365788637105954482006576775098580557613579098734950144178863178946295187237869221823983

The factors were:

3532461934402770121272604978198464368671197400197625023649303468776121253679423200058547956528088349

7925869954478333033347085841480059687737975857364219960734330341455767872818152135381409304740185467

As expected, it was factored using the general field sieve. Read more about it here.

17-04-2005 - Pentium 4 sucks.

Yeah, that's it. I've been doing some fairly serious optimization in Pentium 4 and have to say it is weird, at the very least. I mean, instructions like ADC and SBB have a huge latency (6 clocks against 0.5 of regular ADD/SUB). INC/DEC is slower than ADD/SUB reg, 1. Why? From the Intel Architecture Optimization Manual:

"inc and dec instructions should be replaced with an add or sub instruction, because add and sub overwrite all flags, whereas inc and dec do not, therefore creating false dependencies on earlier instructions that set the flags."

Ridiculous. Also, SHL/SHR has a 4 clock latency, 8 times more than regular arithmetic instructions. This means you get faster code doing sequencial ADDs, if the shift is relatively low (<= 3). Using LEA in this case wouldn't help either, since it maps internally into ADDs, MOVs and SHLs. BSWAP has a latency of 7 clocks in P4 Northwood, but looks like it gone down to 1 clock in Prescott. Same goes for SHL/SHR. Oh well.

For SIMD, we have the classic MOVQ/MOVDQA/MOVDQU example. Any of these attains a 6 clock latency. Using PXOR/POR sequency attains 2 clocks latency. But then, the MOVQs are done in the FP_MOVE unit, sparing the MMX_ALU for other uses. The PSHUFW trick used in MMX doesn't work too well with XMM registers (4 clock latency). It's a bit tricky to make optimal code for this.

Pentium 4 has a tiny L1 cache, and only 6 (using EBP gives you 7) general purpose registers is very little, really (this is not Pentium 4 related only, now). At least in x86-64 there are twice as much general purpose registers. Let's wait and see.

By the way, for those wondering what latency means. Latency is the number of clocks a given instruction takes to complete internally.

17-03-2005 - GMP

I've tried several multi-precision libraries in the past, such as FreeLIP and MIRACL. So this time, I decided to give GMP (stands for GNU Multi-Precision) a try.

The first problem was trying to compile it in Windows. MSVC or any other Windows-only compiler are out of question, and the AT&T syntax of all the assembly optimized stuff doesn't help at all. So, I managed to get it compiled in GCC (with MingW32). Having a working lib, I was happier than before, but when trying to build something using my newly created lib I kept having a missing dependency, called "__alloca". Linking libgcc.a along with my little test app seemed to help. Later, I found the --disable-alloca switch in the configure script.

But the thing that annoyed me most with GMP was that there was no documentation bundled with it at all. So, if I want to implement something, I have to be checking all the source files, and try to find what I want. This is actually one of the things I love about MIRACL: it comes with a detailed manual, where you have everything you need. But, after overcoming those disconforts, and actually making something that works, GMP is not that bad. It's pretty fast, at least as much as MIRACL, and the API is pretty clean and organized. Besides, it is multiplatform and optimized for most of the platforms it supports. We can easily make something to run on a Cray or on a PDA.

Here is a sample RSA and ElGamal implementation I made while trying to dig up GMP's integer functions. It's pretty easy to understand, so have fun.

09-03-2005 - Bit rotations

Many symmetric encryption algorithms and hash functions (e.g. SHA-1, HAS160, AES, etc) use bit rotations, among other transformations. This kind of algorithms is specially known for demanding speed. Yet, it is very common to see inefficient implementations of some of these functions in software. For example, it is very common to see a rotate n bits left operation implemented like that:

#define ROTL(x, n) ( (unsigned long)((x) << (n)) | (unsigned long)((x) >> (32-(n))) )

Hell, for code that has to run on every damn platform in the world this actually makes sense, since we can't possibly know what instructions the machine has. But when we do code for a specific platform, this is not the best solution. In x86 we have a ROL instruction that does the job of several logical instructions, and faster. How much faster? Well, I've made a little benchmark to check that out. Let's rotate an arbitrary number by 5 bits 2^32 times, using both the regular aproach, and the optimized one. This is what I get on my box (Pentium 4 1.7 GHz):

Regular   - 18500 ms
Optimized - 10578 ms

I guess the results speak for themselves. But beware. Some compilers (like GCC) actually recognize the rotation done and will use ROL. But are stupid enough to use ROL with memory, thus making the loop much slower than it would be even with logical operations (the version compiled by GCC took 32704 ms). Obviously everything I'm saying here for rotating left also applies for rotating right.

This is one example of stuff that could be easily optimized and often isn't. Another interesting example is byte swap (big endian <-> little endian). Funny, rotating left is also used in implementations of it. It could be substituted by BSWAP instruction easily in x86 (it's not like anyone still uses <486). Byte swap for a 32-bit value is often defined as:

#define BSWAP(x) ((ROTL((x), 8) & 0x00FF00FF) | (ROTL((x), 24) & 0xFF00FF00))

Well, I've done yet another benchmark to check the speed of this. These are the results:

Regular   - 38468 ms
Optimized - 18563 ms

This one's even more impressive. So, the point is, if you see you can optimize some often used code, do it. Specially in speed-demanding apps.

01-03-2005 - RSA low encryption exponent attack

As everyone knows, RSA encryption (and public key cryptography generally) is much slower than symmetric algorithms. Therefore, efforts are often done to optimize performance on these algorithms. One way of doing it is by using a small encryption exponent. In case you don't know, RSA encryption is basically the following:

C = M^e (mod n)

, in which M is the data to be encrypted, e is the encryption exponent, and n is the public key (the '^' character here means "power of", not "xor"). C is obviously the ciphertext. So, using a smaller e results in faster encryption (or at least an e which has not much 1's in binary, such as 2^k+1, k being a positive integer. 65537 is often used too. Even though, 3 needs 2 multiplications, whereas 65537 needs 17). But there's a catch. If the same message is sent e times, with different public keys, it is possible to recover the plaintext without knowing the private key. How so?

Let's imagine e = 3. It is very likely all three n used are co-primes, and therefore knowing both ciphertexts and public keys, it is a question of solving the expression:

There's 0 ≤ x < n1*n2*n3 forwhich

    x ≡ c1 (mod n1)
    x ≡ c2 (mod n2)
    x ≡ c3 (mod n3)

This system can be solved using Gauss's or Garner's method (I won't describe them here, at least for now). After finding a solution, you must take into account that the x you got is still M^3. Therefore, you must compute the cube root of x, and you will have the decrypted message, with no need of d.

Even worse, if the plaintext is smaller than the eth root of n, it is only needed to compute the eth root of the ciphertext to recover the plaintext! For instance, encrypting the string "Dcoder" with a 2048-bit strong public key, using 3 as encryption exponent, allows one to recover the plaintext in less than a second. Scary.

How can one prevent this kind of attack? There are two main options: either use a large encryption exponent or add random garbage to the end of each message. The latter allow you to preserve the high performance that justifies the low exponent.

26-02-2005 - Do use CMOV

CMOV is an interesting instruction. It has been there since Pentium Pro, and yet it is hardly used, even in pure assembly written code. This instruction is as fast as a single MOV, and prevents branching, since no jumps and other stuff is needed. The code gets more readable too! So, why one wouldn't use it? Because 486's don't have it? Seriously, those processors aren't used anymore for serious work. So, if you're to write optimized code, do use CMOV, for it's good.

25-02-2005 - Hash functions and symmetric encryption algorithms

This bothers me. I have often people asking me how to decrypt MD5, or how to reverse SHA-1. I can only guess they are confusing those hash algorithms with common symmetric algorithms, like DES or Blowfish. Well, once and for all, let's find out what is a hash function, and how it differs from a symmetric cipher. In order not to be just talking out of my ass, i'll quote a known book on the field, Handbook of Applied Cryptography, by A. Menezes, P. van Oorschot and S. Vanstone.

Definition - A hash function (in the unrestricted sense) is a function h which has, as a minimum, the following two properties:

  1. compressionh maps an input x of arbitrary finite bitlength, to an output h(x) of fixed bitlength n.
  2. ease of computationgiven h and an input x, h(x) is easy to compute.

So, what these functions do is take an amount of data of arbitrary (but finite) size, and shrink it down to its unique n bits value. Anything else? In fact, yes. These are three basic properties of hash functions:

  1. preimage resistance - for essentially all pre-specified outputs, it is computationally infeasible to find any input which hashes to that output, i.e., to find any preimage x' such that h(x') = y when given any y forwhich a corresponding input is not known.
  2. 2nd-preimage resistance - it is computationally infeasible to find any second input which has the same output as any specified input, i.e., given x, to find a 2nd-preimage x' ≠ x such that h(x) = h(x').
  3. collision resistance - it is computationally infeasible to find any two distinct inputs x, x' which hash to the same output, i.e., such that h(x) = h(x'). (Note that here there is free choice of both inputs.)

What do we get from these properties? Well, some things for sure. However, my point lies on the first property, which is also denominated one-way. Once you have some unknown data's hash value, it is impossible, except by brute-force (i.e. hashing every possible combination of bits until the wanted one is found), to know the original data. Therefore, you can't decrypt it! It's as simple as this.

Well, this is true in a perfect hashing algorithm. Most of them have known flaws though. Many of you probably heard of SHA-1 being broken, not long ago. Well, you still need 2^69 calculations, instead of the 2^80 calculations expected for a birthday attack. Even though, it is 2048 times easier to break SHA-1 like that. I guess SHA-1 is pretty much finished for serious cryptographic work, yet it can be of many uses, just like MD5 (and even MD4) still is today.