AWS DeepComposer uses generative AI, specifically Generative Adversarial Networks (GANs), to generate music. GANs pit two networks, a generator and a discriminator, against each other to create new content.
AWS DeepComposer Workflow
- Use the AWS DeepComposer keyboard or play the virtual keyboard in the AWS DeepComposer console to input a melody.
- Use a model in the AWS DeepComposer console to generate an original musical composition. You can choose from the jazz, rock, pop, symphony or Jonathan Coulton pre-trained models, or build your own custom genre model in Amazon SageMaker.
- Publish your tracks to SoundCloud or export MIDI files to your favourite Digital Audio Workstation (like GarageBand) and get even more creative.
Generate an Interface
To follow this article, you will need access to AWS DeepComposer, which means you need an AWS account to sign in to the console. You don’t need to worry about paying anything: if you are new to AWS, you get up to 500 inference jobs over 12 months for the AWS DeepComposer service, so you can easily follow this tutorial.
Step 1. Open the link for AWS DeepComposer and log in to your AWS account with your credentials.
Step 2. After a successful login, select the N. Virginia (us-east-1) AWS Region and click ‘Get Started’.
Step 3. Select Music Studio from the left navigation menu. This opens the Music Studio in the browser.
This tab shows a default melody and shows options for Generating a new melody.
Step 4. On the left side of the music composer, there are options for selecting model parameters. Select your model from the dropdown and click Generate Composition.
Click play to listen to the composition. Great, you just used a pre-trained model to generate a composition.
DeepComposer uses something known as a GAN to generate this music for us. So what is a GAN?
To explain GANs, I have taken the following explanation from this article. Please read the complete article if you wish to know more about GANs.
What are GANs?
GANs, a generative AI technique, pit two networks against each other to generate new content. The algorithm consists of two competing networks: a generator and a discriminator.
A generator is a convolutional neural network (CNN) that learns to create new data resembling the source data it was trained on. The generator network used in AWS DeepComposer is adapted from the U-Net architecture.
The discriminator is another convolutional neural network (CNN) that is trained to differentiate between real and synthetic data.
The generator and the discriminator are trained in alternating cycles such that the generator learns to produce more and more realistic data while the discriminator iteratively gets better at learning to differentiate real data from the newly created data.
A GAN trains two different networks, one against the other, hence they are adversarial. One network generates an image (or any other sample, such as text or speech) by taking a real image and modifying it as much as it can. The other network tries to predict whether the image is “fake” or “real.” The first network, called the G network, learns to generate better images. The second network, the D network, learns to discriminate between the fake and real images.
For example, the G network might add sunglasses to a person’s face. The D network gets a set of images, some of real people wearing sunglasses and some of people that the G network added sunglasses to. If the D network can tell which are fake and which are real, the G network updates its parameters to generate even better fake sunglasses. If the D network is fooled by the fake images, it updates its parameters to better discriminate between fake and real sunglasses.
Please read this link for detailed information: Udacity Link.
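The alternating training described above can be sketched on a toy problem. This is a minimal illustration, not the AWS DeepComposer model: the "real data" is a one-dimensional Gaussian, the generator is assumed to be G(z) = z + b, and the discriminator is assumed to be a logistic function D(x) = sigmoid(w*x + c). All names and hyperparameter values here are hypothetical.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, lr=0.01, real_mean=4.0, seed=0):
    """Alternate one discriminator update and one generator update per step."""
    rng = random.Random(seed)
    b = 0.0          # generator parameter: G(z) = z + b
    w, c = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(w*x + c)

    for _ in range(steps):
        # Discriminator step: gradient ascent on log D(real) + log(1 - D(fake)).
        real = rng.gauss(real_mean, 1.0)
        fake = rng.gauss(0.0, 1.0) + b
        d_real = sigmoid(w * real + c)
        d_fake = sigmoid(w * fake + c)
        w += lr * ((1.0 - d_real) * real - d_fake * fake)
        c += lr * ((1.0 - d_real) - d_fake)

        # Generator step: gradient ascent on log D(fake), so that the
        # generator's samples look more "real" to the discriminator.
        fake = rng.gauss(0.0, 1.0) + b
        d_fake = sigmoid(w * fake + c)
        b += lr * (1.0 - d_fake) * w

    return b, w, c
```

Calling `train_toy_gan()` returns the learned generator offset `b` and the discriminator parameters. In this toy setup the generator's offset is pushed toward the mean of the real data, although, as with real GANs, it may oscillate rather than settle exactly.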
One of the hyperparameters you can tune is the update ratio: the number of times the discriminator is updated per generator training epoch. Updating the discriminator multiple times per generator training epoch is useful because it can improve the discriminator’s accuracy. Changing this ratio might allow the generator to learn more quickly early on, but will increase the overall training time.
While we provide sensible defaults for these hyperparameters in the AWS DeepComposer console, you are encouraged to explore other settings to see how each changes your model’s performance and time required to finish training your model.
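To make the effect of the discriminator-to-generator update ratio concrete, here is a small sketch of a training loop that takes the ratio as a parameter. The function name and the two step callbacks are hypothetical placeholders for real update routines, not part of any AWS API.

```python
def run_training(epochs, update_ratio, discriminator_step, generator_step):
    """Run `update_ratio` discriminator updates before each generator update."""
    for epoch in range(epochs):
        for _ in range(update_ratio):
            discriminator_step(epoch)  # placeholder: one discriminator update
        generator_step(epoch)          # placeholder: one generator update


# Usage: count how many updates each network receives.
d_calls, g_calls = [], []
run_training(epochs=3, update_ratio=5,
             discriminator_step=d_calls.append,
             generator_step=g_calls.append)
print(len(d_calls), len(g_calls))
```

With a ratio of 5, every generator epoch costs five discriminator updates, which is why a higher ratio increases the overall training time.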
In machine learning, the goal of iterating and completing epochs is to improve the output or prediction of the model. Any output that deviates from the ground truth is referred to as an error. The measure of an error, given a set of weights, is called a loss function. Weights represent how important an associated feature is to determining the accuracy of a prediction, and loss functions are used to update the weights after every iteration. Ideally, as the weights update, the model improves, making fewer and fewer errors. Convergence happens once the loss functions stabilize.
We use loss functions to measure how closely the output from the GAN models matches the desired outcome or, in the case of DeepComposer, how well DeepComposer’s output music matches the training music. Once the loss functions of the generator and the discriminator converge, the GAN model is no longer learning, and we can stop its training.
AWS DeepComposer uses AWS Lambda, Amazon DynamoDB and Amazon SageMaker.
How does AWS DeepComposer work with GANs?
- Input melody captured on the AWS DeepComposer console
- Console makes a backend call to AWS DeepComposer APIs that triggers an execution Lambda.
- Book-keeping is recorded in DynamoDB.
- The execution Lambda performs an inference query to SageMaker which hosts the model and the training inference container.
- The query is run on the Generative AI model.
- The model generates a composition.
- The generated composition is returned.
- The user can hear the composition in the console.
- The user can share the composition to SoundCloud.
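The request flow above can be sketched as a single handler function. This is only an illustration of the sequence of steps, not the actual AWS DeepComposer backend: the handler name, the event fields, the status strings, and the two injected callables (`record_job` standing in for the DynamoDB book-keeping and `invoke_model` standing in for the SageMaker inference call) are all assumptions.

```python
import json
import uuid


def handle_generation_request(event, record_job, invoke_model):
    """Hypothetical execution handler: record the job, run inference, return."""
    job_id = str(uuid.uuid4())

    # Book-keeping before inference (stand-in for a DynamoDB write).
    record_job(job_id, "IN_PROGRESS")

    # Inference query against the hosted model (stand-in for SageMaker).
    payload = json.dumps({"melody": event["melody"], "model": event["model"]})
    composition = invoke_model(payload)

    # Book-keeping after inference, then return the generated composition.
    record_job(job_id, "COMPLETED")
    return {"job_id": job_id, "composition": composition}
```

Passing the storage and inference steps in as callables keeps the sketch runnable without any AWS resources; in a real deployment they would wrap the corresponding service clients.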
Challenges with GANs
- Clean datasets are hard to obtain
- Not all melodies sound good in all genres
- Convergence in GAN is tricky – it can be fleeting rather than being a stable state
- Complexity in defining meaningful quantitative metrics to measure the quality of music created
How to measure the quality of generated Music?
- We can monitor the loss function to make sure the model is converging
- We can check the similarity index to see how close the model is to mimicking the style of the data. When the graph of the similarity index smooths out and becomes less spiky, we can be confident that the model is converging
- The two points above are a good way of measuring the quality of generated music, but the best way is to listen to the generated music. The musical quality of the model should improve as the number of training epochs increases.
When training any sort of model, it is standard practice to monitor the value of the loss function throughout training. The discriminator loss has been found to correlate well with sample quality. You should expect the discriminator loss to converge to zero and the generator loss to converge to some number which need not be zero. When the loss function plateaus, it is an indicator that the model is no longer learning. At this point, you can stop training the model. You can view these loss function graphs in the AWS DeepComposer console.
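One simple way to decide that a loss curve has plateaued is to compare the average loss over the most recent window of epochs with the average over the window before it. The sketch below assumes this heuristic; the function name, window size, and tolerance are illustrative choices, not values used by AWS DeepComposer.

```python
def has_plateaued(losses, window=50, tolerance=1e-3):
    """Return True when the mean loss has stopped changing between windows."""
    if len(losses) < 2 * window:
        return False  # not enough history to compare two windows
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    return abs(recent - previous) < tolerance
```

In a GAN you would typically apply a check like this to both the generator and the discriminator loss histories and stop training only when both have flattened out.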
Sample output quality improves with more training
After 400 epochs of training, the discriminator loss approaches zero and the generator loss converges to a steady-state value. Loss is useful as an evaluation metric because once it plateaus, the model improves only slightly or stops improving entirely.
While standard mechanisms exist for evaluating the accuracy of more traditional models like classification or regression, evaluating generative models is an active area of research. Within the domain of music generation, this hard problem is even less well-understood.
To address this, we take high-level measurements of our data and show how well our model produces music that aligns with those measurements. If our model produces music that is close to the mean value of these measurements for our training dataset, our music should match the general “shape”. You’ll see graphs of these measurements within the AWS DeepComposer console.
Here are a few such measurements:
- Empty bar rate: The ratio of empty bars to total number of bars.
- Number of pitches used: A metric that captures the distribution and position of pitches.
- In Scale Ratio: Ratio of the number of notes that are in the key of C, which is a common key found in music, to the total number of notes.
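As a rough illustration, the measurements above can be computed on a very simple representation of a track: a list of bars, where each bar is a list of MIDI pitch numbers. That representation and the function names are assumptions made for this sketch, and the pitch metric is simplified to a count of distinct pitches.

```python
# Pitch classes of the C major scale: C, D, E, F, G, A, B.
C_MAJOR_PITCH_CLASSES = {0, 2, 4, 5, 7, 9, 11}


def empty_bar_rate(bars):
    """Ratio of empty bars to the total number of bars."""
    if not bars:
        return 0.0
    return sum(1 for bar in bars if not bar) / len(bars)


def pitches_used(bars):
    """Number of distinct pitches in the track (a simplified pitch metric)."""
    return len({pitch for bar in bars for pitch in bar})


def in_scale_ratio(bars):
    """Ratio of notes in the key of C to the total number of notes."""
    notes = [pitch for bar in bars for pitch in bar]
    if not notes:
        return 0.0
    in_scale = sum(1 for pitch in notes if pitch % 12 in C_MAJOR_PITCH_CLASSES)
    return in_scale / len(notes)


# Usage: four bars, two of them empty, with one out-of-scale note (61 = C#).
track = [[60, 62, 64], [], [61, 65], []]
print(empty_bar_rate(track), pitches_used(track), in_scale_ratio(track))
```

Comparing these values for generated music against their mean over the training dataset gives a quick numeric check of whether the output has the same general shape.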
Music to your ears
Of course, music is much more complex than a few measurements. It is often important to listen directly to the generated music to better understand changes in model performance. You’ll find this final mechanism available as well, allowing you to listen to the model outputs as it learns.
Once training has completed, you may use the model created by the generator network to create new musical compositions.
Once this model is trained, the generator network alone can be run to generate new accompaniments for a given input melody. If you recall, the model took as input a single-track piano roll representing melody and a noise vector to help generate varied output.
The final process for music generation then is as follows:
- Transform single-track music input into piano roll format.
- Create a series of random numbers to represent the random noise vector.
- Pass these as input to our trained generator model, producing a series of output piano rolls. Each output piano roll represents some instrument in the composition.
- Transform the series of piano rolls back into a common music format (MIDI), assigning an instrument for each track.
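The steps above can be sketched end to end in plain Python. This is a simplified illustration under several assumptions: notes are `(pitch, start_step, end_step)` tuples, a piano roll is a list of time steps with one 0/1 entry per pitch, and `generator` is a hypothetical callable standing in for the trained generator model. Writing the result to an actual MIDI file is left out, since it would require a MIDI library.

```python
import random


def notes_to_piano_roll(notes, num_steps, num_pitches=128):
    """Mark each (pitch, start, end) note as active in a steps-by-pitches grid."""
    roll = [[0] * num_pitches for _ in range(num_steps)]
    for pitch, start, end in notes:
        for step in range(max(start, 0), min(end, num_steps)):
            roll[step][pitch] = 1
    return roll


def piano_roll_to_notes(roll):
    """Recover notes from a piano roll (adjacent same-pitch notes are merged)."""
    notes = []
    num_steps = len(roll)
    num_pitches = len(roll[0]) if roll else 0
    for pitch in range(num_pitches):
        start = None
        for step in range(num_steps):
            active = roll[step][pitch]
            if active and start is None:
                start = step
            elif not active and start is not None:
                notes.append((pitch, start, step))
                start = None
        if start is not None:
            notes.append((pitch, start, num_steps))
    return sorted(notes, key=lambda note: (note[1], note[0]))


def generate_composition(melody_notes, generator, num_steps=32,
                         noise_dim=8, seed=None):
    """Melody -> piano roll -> generator(roll, noise) -> one note list per track."""
    rng = random.Random(seed)
    melody_roll = notes_to_piano_roll(melody_notes, num_steps)
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    track_rolls = generator(melody_roll, noise)  # one piano roll per instrument
    return [piano_roll_to_notes(track_roll) for track_roll in track_rolls]
```

For experimentation, any function that accepts a piano roll and a noise vector and returns a list of piano rolls can be passed as `generator`, for example one that simply echoes the melody on every track.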
To explore this process firsthand, try loading a model in the music studio, using a sample model if you have not trained your own. After selecting from a prepared list of input melodies or recording your own, you may choose “Generate a composition” to generate accompaniments.