Expansion factor choices

Thank you for the clear and well-executed implementation.

Following up on issue #11:

May I ask why you chose to expand the token-mixing MLP while bottlenecking the channel-mixing MLP? Is there a principled reason behind this design, or was it chosen empirically because this setup gives the best performance?