Vision-based motion estimation for structural systems has attracted significant interest in recent years. Because designing robust algorithms that estimate motion accurately remains a challenge, a multi-step framework is proposed to handle both large and small motion amplitudes. The framework combines a stochastic search method for coarse-level measurements with a deterministic method for fine-level measurements. A population-based block matching approach, featuring adaptive search limit selection for robust estimation and a subsampled block strategy, is implemented to reduce the computational burden of integer-pixel motion estimation. A Reduced-Error Gradient-based method is then applied to achieve subpixel accuracy. The resulting hybrid Smart Block Matching with Reduced-Error Gradient (SBM-REG) approach thus estimates motion at both the integer-pixel and subpixel levels. Finally, structural mode shapes and vibration frequencies are extracted from video data using Complexity Pursuit, a blind source separation method for output-only modal analysis. The method’s efficiency and accuracy are assessed against synthetic shifted patterns, a cantilever beam, and laboratory tests of a six-story structure.
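The coarse-to-fine idea can be illustrated with a minimal sketch on a synthetic shifted pattern. This is not the SBM-REG implementation: an exhaustive sum-of-absolute-differences search stands in for the population-based stochastic search, and a single plain least-squares gradient step stands in for the Reduced-Error Gradient method. All function names, the test pattern, the block size, and the search range are assumptions made for illustration only.

```python
import numpy as np

def integer_block_match(ref, cur, top, left, size, search):
    """Coarse step: exhaustive SAD search for the integer shift (dy, dx)."""
    block = ref[top:top + size, left:left + size]
    best, best_cost = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the current frame.
            if y < 0 or x < 0 or y + size > cur.shape[0] or x + size > cur.shape[1]:
                continue
            cost = np.abs(cur[y:y + size, x:x + size] - block).sum()
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best

def gradient_subpixel(ref_block, cur_block):
    """Fine step: one least-squares gradient step for the residual shift.

    First-order model: cur - ref = -(gy * dy + gx * dx).
    """
    gy, gx = np.gradient(ref_block)  # derivatives along rows, then columns
    A = np.stack([gy.ravel(), gx.ravel()], axis=1)
    b = -(cur_block - ref_block).ravel()
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d  # residual (dy, dx), expected to be below one pixel

def estimate_shift(ref, cur, top, left, size, search):
    """Combine the integer-pixel search with the subpixel refinement."""
    dy, dx = integer_block_match(ref, cur, top, left, size, search)
    ref_block = ref[top:top + size, left:left + size]
    cur_block = cur[top + dy:top + dy + size, left + dx:left + dx + size]
    sy, sx = gradient_subpixel(ref_block, cur_block)
    return dy + sy, dx + sx

def pattern(y, x):
    """Smooth synthetic intensity pattern (an illustrative assumption)."""
    return np.sin(0.3 * x) + np.cos(0.25 * y)

# Synthetic shifted pattern: the second frame is the first moved by (3.2, -1.9) px.
yy, xx = np.mgrid[0:80, 0:80].astype(float)
true_dy, true_dx = 3.2, -1.9
ref = pattern(yy, xx)
cur = pattern(yy - true_dy, xx - true_dx)

est_dy, est_dx = estimate_shift(ref, cur, top=20, left=20, size=32, search=6)
print(f"estimated shift: ({est_dy:.3f}, {est_dx:.3f})")
```

In this sketch the integer search only has to land within one pixel of the true shift, after which the gradient step resolves the remaining fraction. The linearized gradient model is valid only for small residuals, which is why the coarse step is needed for large motion amplitudes.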