Chemotherapy is an important treatment for cancer patients, but it carries risks, so accurate prediction of chemotherapy response is crucial. Whole slide images provide high-resolution insight into the tumour environment, but existing weakly supervised learning frameworks struggle to integrate molecular data such as gene expression, which limits their predictive power for complex chemotherapy response and in small-sample scenarios. We present BiChemoCLAM, a novel bimodal deep learning framework that combines attention-driven multiple instance learning with multimodal compact bilinear pooling for interpretable and data-efficient chemotherapy response prediction. It achieves an area under the curve (AUC) of 80.91%, 71.68%, and 75.80% on the ovarian serous cystadenocarcinoma, colorectal adenocarcinoma, and bladder urothelial carcinoma datasets, respectively. These results indicate that BiChemoCLAM is an effective model for predicting response to chemotherapy.
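To make the two named building blocks concrete, the following is a minimal sketch, not the BiChemoCLAM implementation. It assumes a simplified attention pooling step (a single linear scoring vector followed by a softmax over patches) and a compact bilinear pooling step that uses count sketches combined by circular convolution. All function names, dimensions, and the random inputs are hypothetical and chosen only for illustration. A practical implementation would likely learn the attention parameters and compute the convolution with an FFT.

```python
import math
import random


def attention_pool(instances, w):
    """Softmax-attention pooling over a bag of instance feature vectors.

    `w` is a hypothetical scoring vector; a real model would learn it.
    Returns the pooled vector and the per-instance attention weights.
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in instances]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instances[0])
    pooled = [sum(a * x[j] for a, x in zip(weights, instances))
              for j in range(dim)]
    return pooled, weights


def make_sketch_params(in_dim, out_dim, rng):
    """Draw a random bucket index and a random sign for every input dimension."""
    h = [rng.randrange(out_dim) for _ in range(in_dim)]
    s = [rng.choice((-1.0, 1.0)) for _ in range(in_dim)]
    return h, s


def count_sketch(x, h, s, out_dim):
    """Project x to out_dim by adding signed entries into hashed buckets."""
    y = [0.0] * out_dim
    for xi, hi, si in zip(x, h, s):
        y[hi] += si * xi
    return y


def circular_convolve(a, b):
    """Direct O(d^2) circular convolution; an FFT would be used at scale."""
    d = len(a)
    return [sum(a[k] * b[(i - k) % d] for k in range(d)) for i in range(d)]


def compact_bilinear(x, y, params_x, params_y, out_dim):
    """Approximate the outer product of x and y in out_dim dimensions."""
    sx = count_sketch(x, *params_x, out_dim)
    sy = count_sketch(y, *params_y, out_dim)
    return circular_convolve(sx, sy)


# Illustrative usage with made-up sizes: 5 patches of 8 features, 6 gene features.
rng = random.Random(0)
patches = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
genes = [rng.gauss(0, 1) for _ in range(6)]
w = [rng.gauss(0, 1) for _ in range(8)]

slide_vec, attn = attention_pool(patches, w)
params_slide = make_sketch_params(8, 16, rng)
params_gene = make_sketch_params(6, 16, rng)
fused = compact_bilinear(slide_vec, genes, params_slide, params_gene, 16)
```

In this sketch, `attn` shows how much each patch contributes to the slide-level vector, which is where the interpretability of attention-based multiple instance learning comes from. `fused` is a fixed-size joint representation of the image and gene expression features that a downstream classifier could consume.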