从一道笔试题谈算法优化（上）
作者：赖勇浩（http://blog.csdn.net/lanphaday）

引子
每年十一月各大IT公司都不约而同、争后恐后地到各大高校进行全国巡回招聘。与此同时，网上也开始出现大量笔试面试题；网上流传的题目往往都很精巧，既能让考查基础知识，又在平淡中隐含了广阔的天地供优秀学生驰骋。

这两天在网上淘到一道笔试题目（注1），虽然真假未知，但的确是道好题，题目如下：

从10亿个浮点数中找出最大的1万个。

这是一道似易实难的题目，一般同学最容易中的陷阱就是没有重视这个"亿"字。因为有10亿个单精度浮点数元素的数组在32位平台上已经达到3.7GB之巨，在常见计算机平台（如Win32）上声明一个这样的数组将导致堆栈溢出。正确的解决方法是分治法，比如每次处理100万个数，然后再综合起来。不过这不是本文要讨论的主旨，所以本文把上题的10亿改为1亿，把浮点数改为整数，这样可以直接地完成这个问题，有利于清晰地讨论相关算法的优化（注2）。

不假思索
拿到这道题，马上就会想到的方法是建立一个数组把1亿个数装起来，然后用for循环遍历这个数组，找出最大的1万个数来。原因很简单，因为如果要找出最大的那个数，就是这样解决的；而找最大的1万个数，只是重复1万遍而已。

template< class T >
void solution_1( T BigArr[], T ResArr[] )
{
for( int i = 0; i < RES_ARR_SIZE; ++i )
{
int idx = i;
for( int j = i+1; j < BIG_ARR_SIZE; ++j )
{
if( BigArr[j] > BigArr[idx] )
idx = j;
}
ResArr[i] = BigArr[idx];
std::swap( BigArr[idx], BigArr[i] );
}
}

设BIG_ARR_SIZE ＝ 1亿，RES_ARR_SIZE = 1万，运行以上算法已经超过40分钟（注3），远远超过我们的可接受范围。

稍作思考
从上面的代码可以看出跟SelectSort算法的核心代码是一样的。因为SelectSort是一个O(n^2)的算法（solution_1的时间复杂度为O(nm)，因为solution_1 没有将整个大数组全部排序），而我们又知道排序算法可以优化到O(nlogn)，那们是否可以从这方面入手使用更快的排序算法如MergeSor、 QuickSort呢？但这些算法都不具备从大至小选择最大的N个数的功能，因此只有将1亿个数按从大到小用QuickSort排序，然后提取最前面的1 万个。

template< class T, class I >
void solution_2( T BigArr[], T ResArr[] )
{
std::sort( BigArr, BigArr + BIG_ARR_SIZE, std::greater_equal() );
memcpy( ResArr, BigArr, sizeof(T) RES_ARR_SIZE );
}

因为STL里的sort算法使用的是QuickSort，在这里直接拿来用了，是因为不想写一个写一个众人皆知的QuickSort代码来占篇幅（而且STL的sort高度优化、速度快）。

对solution_2进行测试，运行时间是32秒，约为solution_1的1.5%的时间，已经取得了几何数量级的进展。

深入思考
压抑住兴奋回头再仔细看看solution_2，你将发现一个大问题，那就是在solution_2里所有的元素都排序了！而事实上只需找出最大的1万个即可，我们不是做了很多无用功吗？应该怎么样来消除这些无用功？

如果你一时没有头绪，那就让我慢慢引导你。首先，发掘一个事实：如果这个大数组本身已经按从大到小有序，那么数组的前1万个元素就是结果；然后，可以假设这个大数组已经从大到小有序，并将前1万个元素放到结果数组；再次，事实上这结果数组里放的未必是最大的一万个，因此需要将前1万个数字后续的元素跟结果数组的最小的元素比较，如果所有后续的元素都比结果数组的最小元素还小，那结果数组就是想要的结果，如果某一后续的元素比结果数组的最小元素大，那就用它替换结果数组里最小的数字；最后，遍历完大数组，得到的结果数组就是想要的结果了。

template< class T >
void solution_3( T BigArr[], T ResArr[] )
{
//取最前面的一万个
memcpy( ResArr, BigArr, sizeof(T) * RES_ARR_SIZE );
//标记是否发生过交换
bool bExchanged = true;
//遍历后续的元素
for( int i = RES_ARR_SIZE; i < BIG_ARR_SIZE; ++i )
{
int idx;
//如果上一轮发生过交换
if( bExchanged )
{
//找出ResArr中最小的元素
int j;
for( idx = 0, j = 1; j < RES_ARR_SIZE; ++j )
{
if( ResArr[idx] > ResArr[j] )
idx = j;
}
}
//这个后续元素比ResArr中最小的元素大，则替换。
if( BigArr[i] > ResArr[idx] )
{
bExchanged = true;
ResArr[idx] = BigArr[i];
}
else
bExchanged = false;
}
}

上面的代码使用了一个布尔变量bExchanged标记是否发生过交换，这是一个前文没有谈到的优化手段――用以标记元素交换的状态，可以大大减少查找 ResArr中最小元素的次数。也对solution_3进行测试一下，结果用时2.0秒左右（不使用bExchanged则高达32分钟），远小于 solution_2的用时。

深思熟虑
在进入下一步优化之前，分析一下solution_3的成功之处。第一、 solution_3的算法只遍历大数组一次，即它是一个O(n)的算法，而solution_1是O(nm)的算法，solution_2是 O(nlogn)的算法，可见它在本质上有着天然的优越性；第二、在solution_3中引入了bExchanged这一标志变量，从测试数据可见引入 bExchanged减少了约99.99%的时间，这是一个非常大的成功。

上面这段话绝非仅仅说明了solution_3的优点，更重要的是把solution_3的主要矛盾摆上了桌面――为什么一个O(n)的算法效率会跟O(nm)的算法差不多（不使用bExchanged）？为什么使用了bExchanged能够减少99.99%的时间？带着这两个问题再次审视solution_3的代码，发现bExchanged的引入实际上减少了如下代码段的执行次数：

for( idx = 0, j = 1; j < RES_ARR_SIZE; ++j )
{
if( ResArr[idx] > ResArr[j] )
idx = j;
}

上面的代码段即是查找ResArr中最小元素的算法，分析它可知这是一个O(n)的算法，到此时就水落石出了！原来虽然solution_3是一个O(n) 的算法，但因为内部使用的查找最小元素的算法也是O(n)的算法，所以就退化为O(nm)的算法了。难怪不使用bExchanged使用的时间跟 solution_1差不多；这也从反面证明了solution_3被上面的这一代码段导致性能退化。使用了bExchanged之后因为减少了很多查找最小元素的代码段执行，所以能够节省99.99%的时间！

至此可知元凶就是查找最小元素的代码段，但查找最小元素是必不可少的操作，在这个两难的情况下该怎么去优化呢？答案就是保持结果数组（即ResArr）有序，那样的话最小的元素总是最后一个，从而省去查找最小元素的时间，解决上面的问题。但这也引入了一个新的问题：保持数组有序的插入算法的时间复杂度是O(n)的，虽然在这个问题里插入的数次比例较小，但因为基数太大（1亿），这一开销仍然会令本方案得不偿失。

难道就没有办法了吗？记得小学解应用题时老师教导过我们如果解题没有思路，那就多读几遍题目。再次审题，注意到题目并没有要求找到的最大的1万个数要有序（注4），这意味着可以通过如下算法来解决：

1) 将BigArr的前1万个元素复制到ResArr并用QuickSort使ResArr有序，并定义变量MinElemIdx保存最小元素的索引，并定义变量ZoneBeginIdx保存可能发生交换的区域的最小索引；

2) 遍历BigArr其它的元素，如果某一元素比ResArr最小元素小，则将ResArr中MinElemIdx指向的元素替换，如果ZoneBeginIdx == MinElemIdx则扩展ZoneBeginIdx；

3) 重新在ZoneBeginIdx至RES_ARR_SIZE元素段中寻找最小元素，并用MinElemIdx保存其它索引；

4) 重复2)直至遍历完所有BigArr的元素。

依上算法，写代码如下：

template< class T, class I >
void solution_4( T BigArr[], T ResArr[] )
{
//取最前面的一万个
memcpy( ResArr, BigArr, sizeof(T) RES_ARR_SIZE );
//排序
std::sort( ResArr, ResArr + RES_ARR_SIZE, std::greater_equal() );
//最小元素索引
unsigned int MinElemIdx = RES_ARR_SIZE - 1;
//可能产生交换的区域的最小索引
unsigned int ZoneBeginIdx = MinElemIdx;
//遍历后续的元素
for( unsigned int i = RES_ARR_SIZE; i < BIG_ARR_SIZE; ++i )
{
//这个后续元素比ResArr中最小的元素大，则替换。
if( BigArr[i] > ResArr[MinElemIdx] )
{
ResArr[MinElemIdx] = BigArr[i];
if( MinElemIdx == ZoneBeginIdx )
--ZoneBeginIdx;
//查找最小元素
unsigned int idx = ZoneBeginIdx;
unsigned int j = idx + 1;
for( ; j < RES_ARR_SIZE; ++j )
{
if( ResArr[idx] > ResArr[j] )
idx = j;
}
MinElemIdx = idx;
}
}
}
经过测试，同样情况下solution_4用时约1.8秒，较solution_3效率略高，总算不负一番努力。
待续……

fft - Discrete Fourier transform

Syntax

Y = fft(X) Y = fft(X,n) Y = fft(X,[],dim) Y = fft(X,n,dim)

Definition

The functions Y=fft(x) and y=ifft(X) implement the transform and inverse transform pair given for vectors of length by:

where

is an th root of unity.

Description

Y = fft(X) returns the discrete Fourier transform (DFT) of vector X, computed with a fast Fourier transform (FFT) algorithm.

If X is a matrix, fft returns the Fourier transform of each column of the matrix.

If X is a multidimensional array, fft operates on the first nonsingleton dimension.

Y = fft(X,n) returns the n-point DFT. If the length of X is less than n, X is padded with trailing zeros to length n. If the length of X is greater than n, the sequence X is truncated. When X is a matrix, the length of the columns are adjusted in the same manner.

Y = fft(X,[],dim) and Y = fft(X,n,dim) applies the FFT operation across the dimension dim.

Examples

A common use of Fourier transforms is to find the frequency components of a signal buried in a noisy time domain signal. Consider data sampled at 1000 Hz. Form a signal containing a 50 Hz sinusoid of amplitude 0.7 and 120 Hz sinusoid of amplitude 1 and corrupt it with some zero-mean random noise:

Fs = 1000;                    % Sampling frequency T = 1/Fs;                     % Sample time L = 1000;                     % Length of signal t = (0:L-1)*T;                % Time vector % Sum of a 50 Hz sinusoid and a 120 Hz sinusoid x = 0.7*sin(2*pi*50*t) + sin(2*pi*120*t);  y = x + 2*randn(size(t));     % Sinusoids plus noise plot(Fs*t(1:50),y(1:50)) title('Signal Corrupted with Zero-Mean Random Noise') xlabel('time (milliseconds)')

It is difficult to identify the frequency components by looking at the original signal. Converting to the frequency domain, the discrete Fourier transform of the noisy signal y is found by taking the fast Fourier transform (FFT):

NFFT = 2^nextpow2(L); % Next power of 2 from length of y Y = fft(y,NFFT)/L; f = Fs/2*linspace(0,1,NFFT/2+1);  % Plot single-sided amplitude spectrum. plot(f,2*abs(Y(1:NFFT/2+1)))  title('Single-Sided Amplitude Spectrum of y(t)') xlabel('Frequency (Hz)') ylabel('|Y(f)|')

The main reason the amplitudes are not exactly at 0.7 and 1 is because of the noise. Several executions of this code (including recomputation of y) will produce different approximations to 0.7 and 1. The other reason is that you have a finite length signal. Increasing L from 1000 to 10000 in the example above will produce much better approximations on average.

Algorithm

The FFT functions (fft, fft2, fftn, ifft, ifft2, ifftn) are based on a library called FFTW [3],[4]. To compute an -point DFT when is composite (that is, when ), the FFTW library decomposes the problem using the Cooley-Tukey algorithm [1], which first computes transforms of size , and then computes transforms of size . The decomposition is applied recursively to both the - and -point DFTs until the problem can be solved using one of several machine-generated fixed-size "codelets." The codelets in turn use several algorithms in combination, including a variation of Cooley-Tukey [5], a prime factor algorithm [6], and a split-radix algorithm [2]. The particular factorization of is chosen heuristically.

When is a prime number, the FFTW library first decomposes an -point problem into three ( )-point problems using Rader's algorithm [7]. It then uses the Cooley-Tukey decomposition described above to compute the ( )-point DFTs.

For most , real-input DFTs require roughly half the computation time of complex-input DFTs. However, when has large prime factors, there is little or no speed difference.

The execution time for fft depends on the length of the transform. It is fastest for powers of two. It is almost as fast for lengths that have only small prime factors. It is typically several times slower for lengths that are prime or which have large prime factors.

Flying with bugs...

Monday, October 19, 2009

赖勇浩：从一道笔试题谈算法优化

Thursday, October 15, 2009

FFT in Matlab, from WIKI