x86/vp9lpf: simplify 2nd transpose in 44/48/88/84.
For non-avx optims, this saves 8 movs.
before:
1785 decicycles in ff_vp9_loop_filter_h_44_16_ssse3, 524129 runs, 159 skips
3327 decicycles in ff_vp9_loop_filter_h_48_16_ssse3, 262116 runs, 28 skips
2712 decicycles in ff_vp9_loop_filter_h_88_16_ssse3, 4193729 runs, 575 skips
3237 decicycles in ff_vp9_loop_filter_h_84_16_ssse3, 524061 runs, 227 skips
after:
1768 decicycles in ff_vp9_loop_filter_h_44_16_ssse3, 524062 runs, 226 skips
3310 decicycles in ff_vp9_loop_filter_h_48_16_ssse3, 262107 runs, 37 skips
2719 decicycles in ff_vp9_loop_filter_h_88_16_ssse3, 4193954 runs, 350 skips
3184 decicycles in ff_vp9_loop_filter_h_84_16_ssse3, 524236 runs, 52 skips