This paper evaluates how the fusion of multimodal features (audio, RGB, and depth) can improve gait recognition, as well as gender and shoe recognition. While most previous research has focused on visual descriptors such as binary silhouettes, little attention has been paid to the audio or depth data associated with walking. The proposed multimodal system is evaluated on the TUM GAID dataset, which provides audio, depth, and image sequences. Results show that combining features from these modalities with early or late fusion techniques improves on the state of the art in gait, gender, and shoe recognition. Additional experiments on CASIA-B, which contains only visual data, further support the advantages of feature fusion.
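
The abstract refers to early and late fusion of per-modality features. The following is a minimal illustrative sketch of the two schemes, assuming pre-extracted descriptors for each modality; the feature dimensions, the helper names `fuse_early` and `fuse_late`, and the averaging of class scores are assumptions for illustration, not the paper's actual pipeline.

```python
import numpy as np

# Hypothetical pre-extracted per-modality descriptors for one walking sequence
# (dimensions chosen arbitrarily for the example).
audio_feat = np.random.rand(128)   # audio descriptor
rgb_feat   = np.random.rand(256)   # RGB appearance/motion descriptor
depth_feat = np.random.rand(256)   # depth-based descriptor

def fuse_early(features):
    """Early fusion: concatenate modality descriptors into one joint vector,
    which is then fed to a single classifier."""
    return np.concatenate(features)

def fuse_late(scores, weights=None):
    """Late fusion: combine the class-score vectors produced by separate
    per-modality classifiers (here, a simple weighted average)."""
    scores = np.stack(scores)                      # shape: (n_modalities, n_classes)
    if weights is None:
        weights = np.ones(len(scores)) / len(scores)
    return np.average(scores, axis=0, weights=weights)

# Early fusion: build one joint representation for a single classifier.
joint_vec = fuse_early([audio_feat, rgb_feat, depth_feat])

# Late fusion: each modality's classifier outputs scores over 10 hypothetical
# subjects; the scores are combined afterwards to pick the identity.
per_modality_scores = [np.random.rand(10) for _ in range(3)]
fused_scores = fuse_late(per_modality_scores)
predicted_subject = int(np.argmax(fused_scores))
```

The design difference is where the modalities meet: early fusion trains one model on the concatenated features, while late fusion keeps one classifier per modality and merges only their output scores.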