Why unaligned accesses suck (especially without cache)

We've all heard the advice from someone: "align your data structures", or something along those lines. However, unless you're working with hardware that specifically prevents unaligned accesses (see: many RISC architectures, and my f&($#_% SPARC T5120), the performance benefit of aligning can be very small thanks to large cache lines.

However, I've been working on my own CPU, and I wanted to write about the design compromises I've made so that I can make sense of the code when I'm done. The first one is that the memory controller can't handle unaligned accesses. Why? Let me give you a look at how memory is arranged on the CPU.

Here's what the arrangement looks like in memory itself.
|00000000|
And here's what it looks like to the rest of the CPU.
|00|00|00|00|

However, that's not the full picture. If the CPU could only read or write 1 byte at a time from "memory", there would be no such thing as an unaligned access. That isn't the case: the CPU can read or write up to 4 bytes in "memory" at a time. The good news is that the CPU only needs to read one word from memory in every case except one: an unaligned access.
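To make "except one" concrete, here's a rough sketch in C of the check that decides whether an access fits inside one word or spills into the next. The helper name `words_needed` and the `WORD_BYTES` constant are made up for illustration, and the only assumptions are 4-byte words and byte addressing. This is not code from my memory controller.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4-byte words, byte-addressed memory. */
#define WORD_BYTES 4u

/* Hypothetical helper: how many word fetches does an access of `size`
 * bytes (1 to 4) starting at byte address `addr` need? */
static unsigned words_needed(uint32_t addr, unsigned size)
{
    assert(size >= 1u && size <= WORD_BYTES);

    /* Position of the first byte inside its word (0 to 3). */
    unsigned offset = addr & (WORD_BYTES - 1u);

    /* If the last byte lands past the end of the word, a second fetch is needed. */
    return (offset + size > WORD_BYTES) ? 2u : 1u;
}
```

A single byte never crosses a word boundary, a 2-byte access only does so at offset 3, and a 4-byte access does so at any offset other than 0.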

Let's read 4 bytes.
[00|00|00|00]
^ From here.
Okay, not an issue, we just read 1 word.
[00000000]
What about if we read from
[00|00|00|00]
       ^ here?
Oh no.

We fetch the upper half of the first word in memory, then have to fetch a second word, take the lower half of that, and put the two together. This is bad. Not only does this require a second access to memory (which is glacial with DRAM, not so much with my SRAM), it massively complicates the memory controller. It's no longer as easy as asserting an address on the memory, waiting one cycle (on SRAM) for the result, and then masking and shifting that to produce the result. You would have to wait 2 cycles, and generate more complex information on how to mask and shift the bits to create the result. This is unnecessary, since my instructions are always 4 bytes long and aligned. Yay for aligned memory accesses.
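Here's a small software model in C of the two cases, as a sketch only. It assumes little-endian byte lanes (byte 0 of a word sits in bits 7:0), which matches the "upper half of the first word, lower half of the second" description but isn't something I've spelled out above. `mem`, `size_mask`, `read_one_word` and `read_any` are hypothetical names, not pieces of the real controller.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions: 4-byte words, little-endian byte lanes, and `mem` is a
 * hypothetical array of 32-bit words standing in for the SRAM. */

/* Mask covering the low `size` bytes (size is 1 to 4). */
static uint32_t size_mask(unsigned size)
{
    return (size == 4u) ? 0xFFFFFFFFu : ((1u << (8u * size)) - 1u);
}

/* The simple path: one fetch, then shift and mask.
 * Only valid when the access stays inside a single word. */
static uint32_t read_one_word(const uint32_t *mem, uint32_t addr, unsigned size)
{
    unsigned offset = addr & 3u;
    assert(size >= 1u && size <= 4u && offset + size <= 4u);
    return (mem[addr >> 2] >> (8u * offset)) & size_mask(size);
}

/* The path that tolerates word-crossing accesses: up to two fetches,
 * plus extra shift logic to splice the two pieces together. */
static uint32_t read_any(const uint32_t *mem, uint32_t addr, unsigned size)
{
    unsigned offset = addr & 3u;
    assert(size >= 1u && size <= 4u);

    /* First fetch: the wanted bytes end up in the low bits. */
    uint32_t value = mem[addr >> 2] >> (8u * offset);

    if (offset + size > 4u) {
        /* Second fetch: the low bytes of the next word fill the gap. */
        uint32_t next = mem[(addr >> 2) + 1u];
        value |= next << (8u * (4u - offset));
    }
    return value & size_mask(size);
}
```

In hardware, the `if` branch is the expensive part: it means a second memory cycle, somewhere to hold the first word while waiting, and a second set of shift amounts. Requiring aligned accesses lets the controller implement only `read_one_word`.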