A couple of days ago I encountered this article: How Shazam Works

This got me interested in how a program like Shazam works… And more importantly, how hard is it to program something similar in Java?

About Shazam

Shazam is an application which you can use to analyse/match music. When you install it on your phone, and hold the microphone to some music for about 20 to 30 seconds, it will tell you which song it is.

When I first used it, it gave me a magical feeling: “How did it do that!?” And even today, after using it a lot, it still feels a bit magical.
Wouldn’t it be great if we could program something of our own that gives that same feeling? That was my goal for the past weekend.

Listen up..!

First things first: to get the music sample to analyse, we need to listen to the microphone in our Java application…! This is something I hadn’t done yet in Java, so I had no idea how hard this was going to be.

But it turned out it was very easy:

import javax.sound.sampled.*;

final AudioFormat format = getFormat(); // Fill AudioFormat with the wanted settings
DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
// getLine() throws LineUnavailableException if the line cannot be obtained
final TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
line.open(format);
line.start();

Now we can read the data from the TargetDataLine just like a normal InputStream:

// In another thread I start:
// (buffer is a byte[] allocated elsewhere; its size is not shown here)

ByteArrayOutputStream out = new ByteArrayOutputStream();
running = true;

try {
    while (running) {
        // read() blocks until the requested amount of data has been read
        int count = line.read(buffer, 0, buffer.length);
        if (count > 0) {
            out.write(buffer, 0, count);
        }
    }
    out.close();
} catch (IOException e) {
    System.err.println("I/O problems: " + e);
    System.exit(-1);
}

Using this method it is easy to open the microphone and record all the sounds! The AudioFormat I’m currently using is:

private AudioFormat getFormat() {
    float sampleRate = 44100;
    int sampleSizeInBits = 8;
    int channels = 1; // mono
    boolean signed = true;
    boolean bigEndian = true;
    return new AudioFormat(sampleRate, sampleSizeInBits, channels, signed, bigEndian);
}

So, now we have the recorded data in a ByteArrayOutputStream, great! Step 1 complete.

Microphone data

The next challenge is analyzing the data. When I printed the data I received in my byte array, I got a long list of numbers, like this:

0
0
1
2
4
7
6
3
-1
-2
-4
-2
-5
-7
-8
(etc)

Erhm… yes? This is sound?

To see if the data could be visualized I took the output and placed it in Open Office to generate a line graph:

Ah yes! This kind of looks like ‘sound’. It looks like what you see when using for example Windows Sound Recorder.

This data is known as time-domain data. But as they are, these numbers are basically useless to us… If you read the article above on how Shazam works, you’ll see that they use a spectrum analysis instead of direct time-domain data.
So the next big question is: how do we transform the current data into a spectrum analysis?

Discrete Fourier transform

To turn our data into usable data we need to apply the so-called Discrete Fourier Transform. This turns the data from the time domain into the frequency domain.
There is just one problem: if you transform the data into the frequency domain, you lose every bit of information regarding time. So you’ll know the magnitude of every frequency, but you have no idea when it appears.
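
To make the time-to-frequency step concrete, here is a minimal sketch of a naive Discrete Fourier Transform on real-valued samples. It is not the code used in this post (that comes later, as an FFT); the class name NaiveDft and the test signal are made up for illustration.

```java
final class NaiveDft {
    /** Returns the magnitude of every frequency bin k = 0..n-1 for real-valued samples. */
    static double[] magnitudes(double[] samples) {
        int n = samples.length;
        double[] mags = new double[n];
        for (int k = 0; k < n; k++) {
            double re = 0;
            double im = 0;
            for (int t = 0; t < n; t++) {
                // Correlate the signal with a cosine and a sine of frequency k
                double angle = 2 * Math.PI * k * t / n;
                re += samples[t] * Math.cos(angle);
                im -= samples[t] * Math.sin(angle);
            }
            mags[k] = Math.hypot(re, im);
        }
        return mags;
    }

    public static void main(String[] args) {
        // Hypothetical test signal: a sine wave with exactly 5 cycles in 64 samples
        int n = 64;
        double[] samples = new double[n];
        for (int t = 0; t < n; t++) {
            samples[t] = Math.sin(2 * Math.PI * 5 * t / n);
        }
        double[] mags = magnitudes(samples);
        int peak = 0;
        for (int k = 1; k < n / 2; k++) {
            if (mags[k] > mags[peak]) peak = k;
        }
        System.out.println("Strongest bin: " + peak); // prints "Strongest bin: 5"
    }
}
```

The output tells you which frequency is present, but nothing in it says when in the 64 samples it occurred, which is exactly the problem described above.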

To solve this we need a sliding window. We take chunks of data (in my case 4096 bytes of data) and transform just this bit of information. Then we know the magnitude of all frequencies that occur during just these 4096 bytes.
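
As a rough worked example of what one chunk means with the format above (44100 Hz, 8-bit mono, so one byte per sample), the sketch below computes how much time a 4096-byte chunk covers and how wide each frequency bin of its transform is. The class and method names are made up for illustration.

```java
import java.util.Locale;

final class ChunkMath {
    static final double SAMPLE_RATE = 44100.0; // samples per second, as in getFormat()
    static final int CHUNK_SIZE = 4096;        // samples per chunk (8-bit mono: 1 byte each)

    /** Duration of one chunk in milliseconds. */
    static double chunkMillis() {
        return 1000.0 * CHUNK_SIZE / SAMPLE_RATE;
    }

    /** Width of one frequency bin in Hz after transforming one chunk. */
    static double binWidthHz() {
        return SAMPLE_RATE / CHUNK_SIZE;
    }

    /** Center frequency in Hz of a given bin index. */
    static double binToHz(int bin) {
        return bin * binWidthHz();
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT,
                "chunk: %.1f ms, bin width: %.2f Hz", chunkMillis(), binWidthHz()));
        System.out.println(String.format(Locale.ROOT,
                "bins 40..300 cover %.0f..%.0f Hz", binToHz(40), binToHz(300)));
    }
}
```

So each chunk is roughly 93 ms of sound and each bin is about 10.8 Hz wide. If the numbers 40 to 300 used for the key points further on are read as bin indices, as the code there suggests, they span roughly 430 Hz to 3.2 kHz.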

Implementing this

Instead of worrying about the Fourier transform itself, I googled a bit and found code for the so-called FFT (Fast Fourier Transform). I’m calling this code with the chunks:

byte[] audio = out.toByteArray();

final int totalSize = audio.length;

int amountPossible = totalSize / Harvester.CHUNK_SIZE;

// When turning into frequency domain we'll need complex numbers:
Complex[][] results = new Complex[amountPossible][];

// For all the chunks:
for (int times = 0; times < amountPossible; times++) {
    Complex[] complex = new Complex[Harvester.CHUNK_SIZE];
    for (int i = 0; i < Harvester.CHUNK_SIZE; i++) {
        // Put the time domain data into a complex number with imaginary part as 0:
        complex[i] = new Complex(audio[(times * Harvester.CHUNK_SIZE) + i], 0);
    }
    // Perform FFT analysis on the chunk:
    results[times] = FFT.fft(complex);
}

// Done!
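
The Complex and FFT classes used above are not shown in this post. For readers who want something self-contained to experiment with, here is a minimal sketch of a recursive radix-2 FFT that works on plain double arrays instead of a Complex type. The class name SimpleFft and its signature are made up for illustration, and this is not the code that was actually used.

```java
final class SimpleFft {
    /**
     * Recursive radix-2 FFT. re and im hold the real and imaginary parts of the
     * input and must have the same power-of-two length.
     * Returns a two-row array: row 0 is the real part, row 1 the imaginary part.
     */
    static double[][] fft(double[] re, double[] im) {
        int n = re.length;
        if (n == 0 || im.length != n) {
            throw new IllegalArgumentException("re and im must be non-empty and equally long");
        }
        if (n == 1) {
            return new double[][] { { re[0] }, { im[0] } };
        }
        if (n % 2 != 0) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        int half = n / 2;
        double[] evenRe = new double[half], evenIm = new double[half];
        double[] oddRe = new double[half], oddIm = new double[half];
        for (int i = 0; i < half; i++) {
            evenRe[i] = re[2 * i];
            evenIm[i] = im[2 * i];
            oddRe[i] = re[2 * i + 1];
            oddIm[i] = im[2 * i + 1];
        }
        double[][] even = fft(evenRe, evenIm);
        double[][] odd = fft(oddRe, oddIm);

        double[] outRe = new double[n], outIm = new double[n];
        for (int k = 0; k < half; k++) {
            // Twiddle factor e^(-2*pi*i*k/n) multiplied with the odd half
            double angle = -2 * Math.PI * k / n;
            double wr = Math.cos(angle), wi = Math.sin(angle);
            double tr = wr * odd[0][k] - wi * odd[1][k];
            double ti = wr * odd[1][k] + wi * odd[0][k];
            outRe[k] = even[0][k] + tr;
            outIm[k] = even[1][k] + ti;
            outRe[k + half] = even[0][k] - tr;
            outIm[k + half] = even[1][k] - ti;
        }
        return new double[][] { outRe, outIm };
    }

    public static void main(String[] args) {
        // Hypothetical input: a cosine with 3 cycles in 16 samples
        int n = 16;
        double[] re = new double[n];
        for (int t = 0; t < n; t++) {
            re[t] = Math.cos(2 * Math.PI * 3 * t / n);
        }
        double[][] out = fft(re, new double[n]);
        System.out.println("Magnitude of bin 3: " + Math.hypot(out[0][3], out[1][3]));
    }
}
```

The magnitude of a bin is Math.hypot(real, imaginary), which plays the same role as the abs() call on the Complex objects in the code that follows.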

Now we have a two-dimensional array containing all chunks as Complex[]. This array contains data about all frequencies. To visualize this data I decided to implement a full spectrum analyzer (just to make sure I got the math right).
To show the data I hacked this together:

for (int i = 0; i < results.length; i++) {
    int freq = 1;
    for (int line = 1; line < size; line++) {
        // To get the magnitude of the sound at a given frequency slice,
        // take the abs() of the complex number.
        // Math.log gives a more manageable number (used for the color).
        double magnitude = Math.log(results[i][freq].abs() + 1);

        // The more blue in the color, the more intensity for a given frequency point.
        // Cast after multiplying, and clamp to 255 so Color never gets an out-of-range value:
        g2d.setColor(new Color(0, Math.min(255, (int) (magnitude * 10)), Math.min(255, (int) (magnitude * 20))));
        // Fill:
        g2d.fillRect(i * blockSizeX, (size - line) * blockSizeY, blockSizeX, blockSizeY);

        // I used an improvised logarithmic scale and a normal scale:
        if (logModeEnabled && (Math.log10(line) * Math.log10(line)) > 1) {
            freq += (int) (Math.log10(line) * Math.log10(line));
        } else {
            freq++;
        }
    }
}

Introducing, Aphex Twin

This may seem a bit off-topic, but I’d like to tell you about an electronic musician called Aphex Twin (Richard David James). He makes crazy electronic music… but some songs have an interesting feature. His biggest hit, for example, Windowlicker, has a spectrogram image in it.
If you look at the song as a spectral image, it shows a nice spiral. Another song, called ‘Mathematical Equation’, shows the face of Twin! More information can be found here: Bastwood – Aphex Twin’s face.

When running this song against my spectral analyzer I get the following result:

Not perfect, but it seems to be Twin’s face!

Determining the key music points

The next step in Shazam’s algorithm is to determine some key points in the song, save those points as a hash and then try to match them against their database of over 8 million songs. This is done for speed: looking up a hash takes O(1) time on average. That explains a lot of the awesome performance of Shazam!

Because I wanted to have everything working in one weekend (this is my maximum attention span sadly enough, then I need a new project to work on) I kept my algorithm as simple as possible. And to my surprise it worked.

For each line in the spectrum analysis I take the points with the highest magnitude from certain ranges. In my case: 40-80, 80-120, 120-180, 180-300.

// For every line of data (results is the Complex[] of one chunk here):

for (int freq = LOWER_LIMIT; freq < UPPER_LIMIT - 1; freq++) {
    // Get the magnitude:
    double mag = Math.log(results[freq].abs() + 1);

    // Find out which range we are in:
    int index = getIndex(freq);

    // Save the highest magnitude and corresponding frequency:
    if (mag > highscores[index]) {
        highscores[index] = mag;
        recordPoints[index] = freq;
    }
}

// Write the points to a file:
for (int i = 0; i < AMOUNT_OF_POINTS; i++) {
    fw.append(recordPoints[i] + "\t");
}
fw.append("\n");

// ... snip ...


public static final int[] RANGE = new int[] {40, 80, 120, 180, UPPER_LIMIT + 1};

// Find out which range a frequency falls in
public static int getIndex(int freq) {
    int i = 0;
    while (RANGE[i] < freq) i++;
    return i;
}
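
The snippet above leans on fields that are not shown (highscores, recordPoints, the limits). As a self-contained sketch of the same idea, the following picks the strongest bin in each of the four ranges named in the text from one chunk’s magnitudes. The class name KeyPoints, the EDGES array and the test data are assumptions made for illustration, not the actual code.

```java
import java.util.Arrays;

final class KeyPoints {
    // Range edges taken from the text: 40-80, 80-120, 120-180, 180-300.
    // Each range includes its lower edge and excludes its upper edge.
    static final int[] EDGES = {40, 80, 120, 180, 300};

    /** For one chunk's magnitudes, returns the index of the strongest bin in each range. */
    static int[] strongestBins(double[] magnitudes) {
        if (magnitudes.length < EDGES[EDGES.length - 1]) {
            throw new IllegalArgumentException("need at least " + EDGES[EDGES.length - 1] + " bins");
        }
        int[] best = new int[EDGES.length - 1];
        for (int r = 0; r < best.length; r++) {
            best[r] = EDGES[r];
            for (int bin = EDGES[r]; bin < EDGES[r + 1]; bin++) {
                if (magnitudes[bin] > magnitudes[best[r]]) {
                    best[r] = bin;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical chunk: silence except for one peak per range
        double[] magnitudes = new double[300];
        magnitudes[47] = 5.0;
        magnitudes[94] = 3.0;
        magnitudes[137] = 9.0;
        magnitudes[195] = 2.0;
        System.out.println(Arrays.toString(strongestBins(magnitudes))); // prints [47, 94, 137, 195]
    }
}
```

Picking one peak per range, rather than the overall loudest bins, keeps a loud bass line from crowding out everything else in the fingerprint.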

When we record a song now, we get a list of numbers such as:

33  56  99  121 195
30  41  84  146 199
33  51  99  133 183
33  47  94  137 193
32  41  106 161 191
33  76  95  123 185
40  68  110 134 232
30  62  88  125 194
34  57  83  121 182
34  42  89  123 182
33  56  99  121 195
30  41  84  146 199
33  51  99  133 183
33  47  94  137 193
32  41  106 161 191
33  76  95  123 185

If I record a song and look at it visually it looks like this:


(all the red dots are ‘important points’)

Indexing my own music

With this algorithm in place I decided to index all my 3000 songs. Instead of using the microphone you can just open mp3 files, convert them to the correct format, and read them the same way we did with the microphone, using an AudioInputStream. Converting stereo music into mono-channel audio was a bit trickier than I hoped. Examples can be found online (it requires a bit too much code to paste here); you also have to change the sampling a bit.
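
The conversion code itself is not shown in this post. As a rough illustration of the stereo-to-mono step only, the sketch below averages the two channels of already decoded PCM audio. It assumes the input is interleaved 16-bit signed little-endian stereo and produces 8-bit signed mono; both the input format and the class name Downmix are assumptions, and it does no resampling or mp3 decoding.

```java
final class Downmix {
    /**
     * Converts interleaved 16-bit signed little-endian stereo PCM
     * (left low byte, left high byte, right low byte, right high byte, ...)
     * into 8-bit signed mono by averaging both channels.
     */
    static byte[] stereo16ToMono8(byte[] pcm) {
        int frames = pcm.length / 4; // 2 channels * 2 bytes per sample
        byte[] mono = new byte[frames];
        for (int f = 0; f < frames; f++) {
            int i = f * 4;
            // Rebuild the signed 16-bit samples from their two bytes
            int left = (short) ((pcm[i] & 0xFF) | (pcm[i + 1] << 8));
            int right = (short) ((pcm[i + 2] & 0xFF) | (pcm[i + 3] << 8));
            int average = (left + right) / 2;
            // Keep only the top 8 bits to go from 16-bit to 8-bit
            mono[f] = (byte) (average >> 8);
        }
        return mono;
    }

    public static void main(String[] args) {
        // Hypothetical frame: left = 4096 (0x1000), right = 12288 (0x3000)
        byte[] stereo = {0x00, 0x10, 0x00, 0x30};
        byte[] mono = stereo16ToMono8(stereo);
        System.out.println(mono[0]); // prints 32
    }
}
```

Averaging is the simplest possible downmix; the sample rate stays whatever the source file had, so a separate conversion is still needed if it is not 44100 Hz.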

Matching!

The most important part of the program is the matching process. According to Shazam’s paper, they use hashing to find matches and then decide which song is the best match.

Instead of using difficult point-groupings in time I decided to use a line of our data (for example “33, 47, 94, 137”) as one hash: 1370944733
(in my tests using 3 or 4 points works best, but tweaking is difficult: I need to re-index my mp3s every time!)

Example hash-code using 4 points per line:

// Using a little bit of error-correction, damping
private static final int FUZ_FACTOR = 2;

private long hash(String line) {
    String[] p = line.split("\t");
    long p1 = Long.parseLong(p[0]);
    long p2 = Long.parseLong(p[1]);
    long p3 = Long.parseLong(p[2]);
    long p4 = Long.parseLong(p[3]);
    return (p4 - (p4 % FUZ_FACTOR)) * 100000000 + (p3 - (p3 % FUZ_FACTOR)) * 100000 + (p2 - (p2 % FUZ_FACTOR)) * 100 + (p1 - (p1 % FUZ_FACTOR));
}
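
To see what the fuzz factor does, here is a small runnable sketch that wraps the same formula in a class (the class name HashDemo and the helper round are made up for illustration) and hashes two slightly different lines. With this exact formula the example line gives 13609404632 rather than the 1370944733 quoted earlier; that number appears to be the unrounded points packed with smaller multipliers.

```java
final class HashDemo {
    // Using a little bit of error-correction, damping
    private static final int FUZ_FACTOR = 2;

    /** Rounds a point down to a multiple of FUZ_FACTOR. */
    static long round(long p) {
        return p - (p % FUZ_FACTOR);
    }

    /** Same formula as above: the first four tab-separated points packed into one long. */
    static long hash(String line) {
        String[] p = line.split("\t");
        long p1 = Long.parseLong(p[0]);
        long p2 = Long.parseLong(p[1]);
        long p3 = Long.parseLong(p[2]);
        long p4 = Long.parseLong(p[3]);
        return round(p4) * 100000000L + round(p3) * 100000L + round(p2) * 100L + round(p1);
    }

    public static void main(String[] args) {
        System.out.println(hash("33\t47\t94\t137")); // prints 13609404632
        System.out.println(hash("32\t46\t95\t136")); // prints 13609404632 as well
    }
}
```

Points that differ by one bin can land on the same hash, which makes the lookup a little more tolerant of noise in the recording.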

Now I create two data sets:

– A list of songs, List<String> (List index is Song-ID, String is songname)
– Database of hashes: Map<Long, List<DataPoint>>

The long in the database of hashes represents the hash itself, and it has a bucket of DataPoints.

A DataPoint looks like:

private class DataPoint {

    private int time;
    private int songId;

    public DataPoint(int songId, int time) {
        this.songId = songId;
        this.time = time;
    }

    public int getTime() {
        return time;
    }
    public int getSongId() {
        return songId;
    }
}

Now we already have everything in place to do a lookup. First I read all the songs and generate hashes for each point of data. This is put into the hash-database.
The second step is reading the data of the song we need to match. These hashes are retrieved and we look at the matching datapoints.
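
The first step, filling the hash database, could look roughly like the sketch below. It assumes each song has already been reduced to one hash per line of data; the class name HashIndex and the method addSong are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HashIndex {
    /** One occurrence of a hash: which song, and at which line (time) in that song. */
    static final class DataPoint {
        final int songId;
        final int time;

        DataPoint(int songId, int time) {
            this.songId = songId;
            this.time = time;
        }
    }

    // List index is the song ID, the String is the song name
    final List<String> songs = new ArrayList<>();
    // Hash -> bucket of all places where that hash occurs
    final Map<Long, List<DataPoint>> db = new HashMap<>();

    /** Registers a song and stores one DataPoint per line of hashes. Returns the song ID. */
    int addSong(String name, long[] hashPerLine) {
        int songId = songs.size();
        songs.add(name);
        for (int time = 0; time < hashPerLine.length; time++) {
            db.computeIfAbsent(hashPerLine[time], h -> new ArrayList<>())
              .add(new DataPoint(songId, time));
        }
        return songId;
    }

    public static void main(String[] args) {
        HashIndex index = new HashIndex();
        // Hypothetical hashes, just to show the shape of the data
        index.addSong("song-a.mp3", new long[] {10L, 20L, 10L});
        index.addSong("song-b.mp3", new long[] {20L});
        System.out.println("Bucket size for hash 20: " + index.db.get(20L).size());
    }
}
```

Because the same hash can occur in many songs and many times within one song, every map value is a bucket rather than a single entry.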

There is just one problem: for each hash there are some hits, but how do we determine which song is the correct one? By looking at the number of matches? No, this doesn’t work…
The most important thing is timing. We must overlap the timing…! But how can we do this if we don’t know where we are in the song? After all, we could just as easily have recorded the final chords of the song.

By looking at the data I discovered something interesting, because we have the following data:

– A hash of the recording
– A matching hash of the possible match
– A song ID of the possible match
– The current time in our own recording
– The time of the hash in the possible match

Now we can subtract the current time in our recording (for example, line 34) from the time of the hash-match (for example, line 1352). This difference is stored together with the song ID, because this offset tells us where we could possibly be in the song.
When we have gone through all the hashes from our recording we are left with a lot of song IDs and offsets. The cool thing is, if you have a lot of hashes with matching offsets, you’ve found your song.
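
The offset counting described above can be sketched as follows. This is a minimal illustration under the assumption that the recording has already been turned into one hash per line; the class name OffsetMatcher, the method bestMatch and the sample data are made up, and the real program also prints a ranked top 20.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OffsetMatcher {
    static final class DataPoint {
        final int songId;
        final int time;

        DataPoint(int songId, int time) {
            this.songId = songId;
            this.time = time;
        }
    }

    /** Returns the song ID with the most hashes agreeing on one offset, or -1 if nothing matched. */
    static int bestMatch(Map<Long, List<DataPoint>> db, long[] recordingHashes) {
        // songId -> (offset -> number of hashes that agree on that offset)
        Map<Integer, Map<Integer, Integer>> counts = new HashMap<>();
        int bestSong = -1;
        int bestCount = 0;
        for (int t = 0; t < recordingHashes.length; t++) {
            List<DataPoint> hits = db.get(recordingHashes[t]);
            if (hits == null) continue;
            for (DataPoint hit : hits) {
                // Time in the song minus time in the recording
                int offset = hit.time - t;
                int count = counts.computeIfAbsent(hit.songId, id -> new HashMap<>())
                                  .merge(offset, 1, Integer::sum);
                if (count > bestCount) {
                    bestCount = count;
                    bestSong = hit.songId;
                }
            }
        }
        return bestSong;
    }

    public static void main(String[] args) {
        // Hypothetical database: song 0 has hashes 1..6 in order, song 1 shares two of them
        Map<Long, List<DataPoint>> db = new HashMap<>();
        long[] song0 = {1, 2, 3, 4, 5, 6};
        long[] song1 = {9, 3, 9, 4, 9};
        for (int t = 0; t < song0.length; t++) {
            db.computeIfAbsent(song0[t], h -> new ArrayList<>()).add(new DataPoint(0, t));
        }
        for (int t = 0; t < song1.length; t++) {
            db.computeIfAbsent(song1[t], h -> new ArrayList<>()).add(new DataPoint(1, t));
        }
        // The recording matches lines 2..4 of song 0, so three hashes agree on offset 2
        System.out.println(bestMatch(db, new long[] {3, 4, 5})); // prints 0
    }
}
```

Song 1 also contains two of the recorded hashes, but at offsets that disagree with each other, so it never collects more than one vote per offset.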

The results

For example, when listening to The Kooks – Match Box for just 20 seconds, this is the output of my program:

Done loading: 2921 songs

Start matching song...

Top 20 matches:

01: 08_the_kooks_-_match_box.mp3 with 16 matches.
02: 04 Racoon - Smoothly.mp3 with 8 matches.
03: 05 Röyksopp - Poor Leno.mp3 with 7 matches.
04: 07_athlete_-_yesterday_threw_everyting_a_me.mp3 with 7 matches.
05: Flogging Molly - WMH - Dont Let Me Dia Still Wonderin.mp3 with 7 matches.
06: coldplay - 04 - sparks.mp3 with 7 matches.
07: Coldplay - Help Is Round The Corner (yellow b-side).mp3 with 7 matches.
08: the arcade fire - 09 - rebellion (lies).mp3 with 7 matches.
09: 01-coldplay-_clocks.mp3 with 6 matches.
10: 02 Scared Tonight.mp3 with 6 matches.
11: 02-radiohead-pyramid_song-ksi.mp3 with 6 matches.
12: 03 Shadows Fall.mp3 with 6 matches.
13: 04 Röyksopp - In Space.mp3 with 6 matches.
14: 04 Track04.mp3 with 6 matches.
15: 05 - Dress Up In You.mp3 with 6 matches.
16: 05 Supergrass - Can't Get Up.mp3 with 6 matches.
17: 05 Track05.mp3 with 6 matches.
18: 05The Fox In The Snow.mp3 with 6 matches.
19: 05_athlete_-_wires.mp3 with 6 matches.
20: 06 Racoon - Feel Like Flying.mp3 with 6 matches.

Matching took: 259 ms

Final prediction: 08_the_kooks_-_match_box.mp3.song with 16 matches.

It works!!

After listening for 20 seconds it can match almost all the songs I have. And even this live recording of the Editors could be matched to the correct song after listening for 40 seconds!

Again it feels like magic! :-)

Currently, the code isn’t in a releasable state and it doesn’t work perfectly. It has been a pure weekend-hack, more like a proof-of-concept / algorithm exploration.

Maybe, if enough people ask about it, I’ll clean it up and release it somewhere.

Update:

The Shazam patent holder’s lawyers are sending me emails to stop me from releasing the code and to get me to remove this blog post; read the story here.

Reposted from: https://www.cnblogs.com/zoucaitou/p/4626300.html
