Towards Integrated Audio-Visual Learning: From Vision-to-Audio Generation to a Unified Audio-Visual Framework