From the information you’ve provided so far, it sounds like you may be saturating the disks on your storage servers. Do you have any external evidence (e.g. disk utilization or queue depth metrics) that would support or contradict that idea?
Based on the numbers you’ve provided (30K PUTs/s of 7050 bytes each, triple redundancy), you would be writing something like 211 MB/s logical and at least 633 MB/s physical. It sounds like you have all 32 processes on a host sharing the same striped volume, which may be introducing some inefficiencies. For example, we recommend that the logs not share disks with the storage servers, as the two have rather different write patterns, with the logs fsyncing frequently. Also, with only 20 disks, a 211 MB/s logical write rate is higher than I would have expected to be sustainable.
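For reference, here is the arithmetic behind those figures as a small Python sketch. It only covers the storage writes, so it ignores log traffic and any write amplification in the storage engine; the real per-disk load would be somewhat higher:

```python
# Back-of-the-envelope check of the write rates quoted above,
# using the numbers from your description.
puts_per_sec = 30_000   # 30K PUTs/s
value_bytes = 7_050     # bytes per PUT
replication = 3         # triple redundancy
disks = 20              # disks per host, from your setup

logical_mb_s = puts_per_sec * value_bytes / 1e6   # ~211 MB/s
physical_mb_s = logical_mb_s * replication        # ~634 MB/s
per_disk_mb_s = physical_mb_s / disks             # ~32 MB/s per disk, before
                                                  # log traffic and write amplification

print(f"logical:  {logical_mb_s:.1f} MB/s")
print(f"physical: {physical_mb_s:.1f} MB/s")
print(f"per disk: {per_disk_mb_s:.1f} MB/s")
```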
Besides trying to rearrange things a little more efficiently (e.g. by separating the logs, as described above), I think the only real recourse if you are disk bound is to either reduce your write rate or add more disks.
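If it helps with the "add more disks" option, here is a rough sizing sketch. The sustainable per-disk write rate below is purely a placeholder assumption; you'd want to measure your actual disks (e.g. with fio) under a realistic write pattern before drawing any conclusions from it:

```python
# Rough sizing sketch: how many disks the current physical write rate
# would require, given an assumed sustainable per-disk write rate.
import math

physical_mb_s = 30_000 * 7_050 * 3 / 1e6   # ~634 MB/s, from the numbers above
SUSTAINABLE_MB_S_PER_DISK = 30             # placeholder assumption, not a measured value

disks_needed = math.ceil(physical_mb_s / SUSTAINABLE_MB_S_PER_DISK)
print(f"~{disks_needed} disks at {SUSTAINABLE_MB_S_PER_DISK} MB/s sustained each")
```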
As a side note, you mentioned that you were able to achieve a high rate for a while before eventually slowing down. We’ve seen similar behavior from SSDs before, where write rates degrade after long periods of sustained writes or as the drives fill up. How full are your disks now?
Another side note: it looks like the i3.8xlarge instances have 32 vCPUs, or 16 physical cores. Although it sounds like you aren’t currently CPU bound, we also recommend that each process in a cluster gets its own physical core (or at least something close to that). If you run 1 process per logical core, you may find that as the cluster gets busier CPU-wise (say around 50% on average), processes start getting starved, which can significantly affect cluster stability; depending on the severity of the starvation, the cluster may not handle the situation particularly gracefully.
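For illustration, the sizing rule of thumb above works out as follows (the vCPU count is from the i3.8xlarge description; the 2-threads-per-core figure is the usual hyperthreading assumption):

```python
# Size the process count by physical cores, not vCPUs.
vcpus = 32              # i3.8xlarge, per the instance description
threads_per_core = 2    # assumes hyperthreading (2 vCPUs per physical core)

physical_cores = vcpus // threads_per_core   # 16
recommended_processes = physical_cores       # roughly 1 process per physical core

print(f"{physical_cores} physical cores -> about {recommended_processes} processes per host")
```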