The goal of this project is to create AV-Detect, a real-time program that implements the Hershey and Movellan (2000; hereafter “HM”) algorithm to detect synchrony between dynamic streams of audio and visual data.
Background
HM originally employed their algorithm to analyze video clips of people speaking in front of a camera, with the goal of determining the source of vocalization. Their algorithm was able to achieve this with good success by determining the region with the highest degree of synchrony between the audio and visual data (most commonly the lips) and taking this to be the location of the speaker.
In our research (see http://www.cprince.com/Projects/KidCause), the HM algorithm has been implemented in two other contexts: SenseStream (Mislivec, 2004) and Detect (Helder, 2003).
SenseStream is a Linux-based (C++) program that takes MPEG-1 video files as input and employs the HM algorithm to examine the degree of synchrony between the audio and visual data streams within the file. Thus the AV-Detect program will be functionally quite similar to the SenseStream program, as each program detects synchrony between audio and visual data streams. However, instead of working with offline video files, AV-Detect will operate in real-time, using data from a webcam and a microphone as input. Additionally, while SenseStream is implemented in C++, AV-Detect will be implemented in Java.
Detect is a Java-based, real-time program that uses the HM algorithm to look for synchrony between a command stream that is actively animating a shape on a computer screen and visual data from a webcam that is viewing that animated shape. One issue of note is that the current version of Detect (2.2.2) makes no effort to enforce synchronization between these two input streams: the command received as input to the HM algorithm at a specific time (say, t) may not correspond to the image frame received from the camera at the same time (t). Because both Detect and AV-Detect are Java-based, real-time applications, it may be possible to reuse much of the code from Detect in the current project. However, in addition to using audio data rather than command data, AV-Detect will need to ensure that the incoming audio and visual data are synchronized: at time t, the image frame being processed by the HM algorithm needs to correspond to the audio data for time t.
The output of the AV-Detect program will be similar to that of the two programs above and will consist of a succession of mixelgram displays that are generated and presented at a display rate identical to the frame rate being used by the webcam. The HM algorithm results in a mutual information value for each pixel of a particular video frame; we refer to these values as mixels (mutual information pixels). Mixels represent the mutual information between the audio and visual data during the particular time window (see below), and are interpreted as measures of audio-visual synchrony. Taken together these mixels effectively form a spectrogram, hence the label mixelgram. The lower left image in Figure 1 gives an example mixelgram.
Figure 1. An example mixelgram obtained from processing audio-video data of someone talking using the SenseStream program. Perceptual relevance (e.g., shapes) generally indicates synchrony between the audio and visual streams (Vuppla, in preparation).
Additionally, the AV-Detect program will need to be capable of displaying a visual representation of both the current webcam data (i.e., current video frame) and current audio data (i.e., amplitude waveform of the current audio samples). However, these features are to be used primarily for testing and debugging and thus don’t necessarily have to be displayed in real-time. Further, these features will need to be disabled during the actual running of AV-Detect, so that as much of the CPU as possible is dedicated to the mixelgram computation and display. (Again, as discussed above, the mixelgram output will need to operate in “real-time”: the mixelgrams are to be displayed at the same rate as the frame rate in use by the webcam, and the user should be notified if the current processing drops below this rate.)
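The real-time check described above (notify the user when mixelgram display falls behind the webcam frame rate) could be sketched as follows. This is only an illustration under stated assumptions: the class name FrameRateMonitor, its method names, and the 10% timing tolerance are hypothetical and are not part of Detect or SenseStream. The caller is assumed to supply the current time in nanoseconds from whatever clock is available.

```java
/** Sketch: flags mixelgram updates that arrive later than the webcam frame interval. */
final class FrameRateMonitor {
    // Frame interval plus an assumed 10% tolerance (the tolerance is our choice).
    private final long allowedGapNanos;
    private long previousNanos;
    private boolean started = false;
    private int lateFrames = 0;

    FrameRateMonitor(double framesPerSecond) {
        long frameIntervalNanos = (long) (1e9 / framesPerSecond);
        allowedGapNanos = frameIntervalNanos + frameIntervalNanos / 10;
    }

    /**
     * Call once per displayed mixelgram with the current time in nanoseconds.
     * Returns true when the gap since the previous call exceeded the allowed
     * interval, i.e. when the user should be warned.
     */
    boolean mixelgramDisplayed(long nowNanos) {
        boolean late = started && (nowNanos - previousNanos) > allowedGapNanos;
        if (late) {
            lateFrames++;
        }
        previousNanos = nowNanos;
        started = true;
        return late;
    }

    /** Number of late mixelgrams seen so far. */
    int lateFrameCount() {
        return lateFrames;
    }
}
```

A display loop would create one monitor with the frame rate selected during setup, call mixelgramDisplayed after painting each mixelgram, and show a warning whenever it returns true.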
This project is part of a larger research effort investigating robotic models of infant learning and cognitive development. See also Prince (2001), Prince, Helder, Mislivec, Ang, Lim, and Hollich (2003), Prince, Hollich, Helder, Mislivec, Reddy, Salunke, and Memon (in preparation).
Formal Requirements
AV-Detect needs to:
0. Run under Windows 2000 or Windows XP;
1. Acquire audio data from a connected microphone;
1b. Enable the user to select the audio data capture rate for the microphone (i.e., samples per second) during AV-Detect program setup;
1c. Enable the user to display (and turn off) a visual representation of this audio data (i.e., a waveform display);
2. Acquire visual data from a connected USB webcam;
2b. Enable the user to select the frame rate for the USB webcam during AV-Detect program setup;
2c. Enable the user to display (and turn off) a visual representation of this video data;
3. Synchronize the data from 1 and 2;
4. Implement the HM algorithm to calculate and display the mixelgram output at the same frame rate as the USB webcam;
4b. In processing using the HM algorithm, RMS audio and grayscale pixels are a minimum requirement (m = n = 1; see Hershey & Movellan, 2000), though it is best if the HM algorithm is implemented in its general form to allow processing more complex audio and video features (HM algorithm coded to allow: m > 1, n > 1);
4c. As part of the setup of the AV-Detect program, the user must be able to select S, the processing window-length (see Hershey & Movellan, 2000).
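For the minimum case in requirement 4b (m = n = 1), the mixel for one pixel can be sketched as below. Under the Gaussian assumption used by Hershey and Movellan (2000), the mutual information between a scalar audio feature and a scalar pixel intensity reduces to -1/2 log(1 - rho^2), where rho is the Pearson correlation over the window of S frames. The class and method names are hypothetical, the natural logarithm is our choice of units, and the clamp on rho^2 is an added safeguard so that perfectly correlated windows give a finite value.

```java
/** Sketch: one mixel per pixel from a window of S frames (m = n = 1 case). */
final class MixelSketch {

    /**
     * Gaussian mutual information, in nats, between an audio feature and one
     * pixel's intensity over the same S time steps: -0.5 * ln(1 - rho^2).
     */
    static double mixel(double[] audio, double[] pixel) {
        if (audio.length != pixel.length || audio.length < 2) {
            throw new IllegalArgumentException("windows must share a length S >= 2");
        }
        int s = audio.length;
        double meanA = 0.0;
        double meanV = 0.0;
        for (int t = 0; t < s; t++) {
            meanA += audio[t];
            meanV += pixel[t];
        }
        meanA /= s;
        meanV /= s;

        // Sums of squared deviations and cross-deviations over the window.
        double varA = 0.0;
        double varV = 0.0;
        double cov = 0.0;
        for (int t = 0; t < s; t++) {
            double da = audio[t] - meanA;
            double dv = pixel[t] - meanV;
            varA += da * da;
            varV += dv * dv;
            cov += da * dv;
        }
        // A constant signal carries no information; avoid dividing by zero.
        if (varA == 0.0 || varV == 0.0) {
            return 0.0;
        }
        double rhoSquared = (cov * cov) / (varA * varV);
        // Clamp (our safeguard) so perfect correlation stays finite.
        rhoSquared = Math.min(rhoSquared, 1.0 - 1e-12);
        return -0.5 * Math.log(1.0 - rhoSquared);
    }

    /** frames[t][p] is the grayscale value of pixel p at time step t. */
    static double[] mixelgram(double[] audio, double[][] frames) {
        int pixels = frames[0].length;
        double[] result = new double[pixels];
        double[] series = new double[frames.length];
        for (int p = 0; p < pixels; p++) {
            for (int t = 0; t < frames.length; t++) {
                series[t] = frames[t][p];
            }
            result[p] = mixel(audio, series);
        }
        return result;
    }
}
```

Here audio would hold one RMS value per video frame and frames the S most recent grayscale frames, with S chosen by the user as in requirement 4c. The general form (m > 1, n > 1) replaces the correlation with determinants of covariance matrices and is not shown.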
Testing the AV-Detect Program
To fully test the AV-Detect program, two kinds of audio-visual input will be needed: (a) synchronized and (b) not synchronized. A cheap and effective way to generate synchronized audio-visual inputs is to simply talk while facing the webcam and microphone from which AV-Detect is receiving its data. Perceptually relevant mixelgrams are expected in this scenario. That is, the HM algorithm typically generates perceptually relevant displays (e.g., mixelgrams containing shapes) when the audio-visual inputs are synchronized and non-perceptually relevant displays (e.g., noise) otherwise (see Vuppla, in preparation). Figure 1 above shows an example of a perceptually relevant mixelgram. Generating audio-visual inputs that are not synchronized, on the other hand, may be trickier. Example MPEG video data available at the Prince et al. (in preparation) web site may be of use (http://www.cprince.com/PubRes/EpiRob04). These MPEG videos could be used to test AV-Detect by playing them on a secondary computer (an LCD monitor is best on the secondary computer to avoid refresh rate issues), and aiming the AV-Detect webcam at that video display. Additionally, multimedia authoring tools (e.g., Macromedia Director) may be useful for generating controlled input data for testing AV-Detect.
Additional Information and Resources
To accomplish Formal Requirements 1 and 2 in a Java environment, it will be necessary to make use of the Java Media Framework (JMF). The JMF gives Java applications the means necessary to access various hardware media devices connected to the user’s computer. Further details can be found at:
http://nerp.net/~nhelder/Research/References/jmf2.0api/
http://nerp.net/~nhelder/Research/References/jmf2.0guide/
Example code for Formal Requirement 2 can be obtained from:
http://nerp.net/~nhelder/Research/WebcamExample/
Formal Requirement 3 can be implemented in several ways. Two reasonable options are as follows. First, synchronization can be done within the JMF by use of the DataSink and related objects (see the “jmf2.0guide” link above). Second, synchronization can be performed outside the JMF by means of timestamps or frame numbers. Either implementation would be sufficient, but an exploration of both (and/or others) would be ideal.
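The second option (timestamps outside the JMF) might look roughly like the sketch below, which selects the audio samples captured during one video frame and reduces them to a single RMS value. It assumes that audio and video timestamps are in nanoseconds on a shared clock, that samples are 16-bit, and that the audio buffer's start time is known; the class name, method name, and parameters are hypothetical and are not JMF calls.

```java
/** Sketch: pair audio samples with a video frame by timestamp, then take their RMS. */
final class AvSyncSketch {

    /**
     * RMS of the samples captured in the half-open interval
     * [frameStartNanos, frameEndNanos). Sample i is assumed to be captured at
     * audioStartNanos + i * 1e9 / sampleRate. Returns NaN when no sample in
     * the buffer overlaps the frame.
     */
    static double rmsForFrame(short[] samples, double sampleRate,
                              long audioStartNanos,
                              long frameStartNanos, long frameEndNanos) {
        // Convert the frame's time span into sample indexes.
        long first = (long) Math.ceil(
                (frameStartNanos - audioStartNanos) * sampleRate / 1e9);
        long end = (long) Math.ceil(
                (frameEndNanos - audioStartNanos) * sampleRate / 1e9);
        // Clip the index range to the samples actually in the buffer.
        int from = (int) Math.max(0L, first);
        int to = (int) Math.min((long) samples.length, end);
        if (to <= from) {
            return Double.NaN;
        }
        double sumOfSquares = 0.0;
        for (int i = from; i < to; i++) {
            double x = samples[i];
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares / (to - from));
    }
}
```

With this approach each video frame's timestamp and the frame interval define the window, so the audio feature passed to the HM algorithm at time t comes from the same interval as the image frame at time t. Frame numbers could be used instead by multiplying the frame index by the frame interval.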
It is highly recommended that program coding be done using the open-source NetBeans IDE, which can be downloaded from:
http://www.netbeans.org/downloads/ide/index.html
Necessary Software
Java 2 Platform Software Development Kit
Any version over 1.4 will suffice (1.4.2_03 preferred). http://nerp.net/~nhelder/Research/Software/j2sdk-1_4_2_03.exe
Java Media Framework
Version 2.1.1e is required. Please note that once your webcam and microphone are connected, you will need to run the jmfregistry.exe application to make these resources available within the JMF.
http://nerp.net/~nhelder/Research/Software/jmf-2_1_1e.exe
Webcam Drivers
Drivers for our webcams can be found at:
http://nerp.net/~nhelder/Research/Software/
Final Comments
One of the beauties of working with Java, and a situation that is oftentimes found in the workplace, is that there is a lot of program code available. Typically, being able to find and adapt code that has already been written is as important a skill as writing new code.
Some places to begin your search for the current project:
Sun Developer Forums, JMF-specific url:
Sun-provided programs implementing aspects of the JMF:
http://java.sun.com/products/java-media/jmf/2.1.1/solutions/index.html
For an example of modifying audio data with a gain effect:
http://nerp.net/~nhelder/Research/References/jmf2.0guide/JMFExtending.html#104796
Google Groups Forums, Java-specific search page:
http://www.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&newwindow=1&safe=off&group=comp.lang.java
Contact Information
Christopher G. Prince, chris@cprince.com
Nathan A. Helder, nhelder@nerp.net
Project Web Page
http://www.cprince.com/PubRes/AV-Detect
References
Helder, N. A. (2003). A real-time, computational model of perceptually-based contingent behavior detection. Honors project, University of Minnesota Duluth, Department of Computer Science. Internet: http://www.cprince.com/projects/KidCause/Detect/
Hershey, J. & Movellan, J. (2000). Audio-vision: Using audio-visual synchrony to locate sounds. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in Neural Information Processing Systems 12 (pp. 813-819). Cambridge, MA: MIT Press.
Internet: http://www.cprince.com/Projects/KidCause/contingency/AudioVision.pdf
http://www.cprince.com/Projects/KidCause/contingency/HersheyAndMovellan.rm (Real Media demonstration).
Mislivec, E. J. (2004). Audio-visual synchrony for face location and segmentation. Undergraduate research opportunity project,
Internet: http://www.cprince.com/PubRes/SenseStream
Prince, C. G. (2001). Theory Grounding in Embodied Artificially Intelligent Systems. Paper presented at The First International Workshop on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems, held Sept 17-18, 2001, in Lund, Sweden. Lund, Sweden: Lund University Cognitive Studies, Volume 85.
Internet: http://www.cprince.com/PubRes/EpigeneticRobotics2001/
Prince, C. G., Helder, N. A., Mislivec, E. J., Ang, B. J., Lim, M. S., & Hollich, G. J. (2003). Taking contingency seriously in sensory-based models of learning in infants. Poster presented at the 2003 Meeting of the Cognitive Development Society, held at Park City, Utah, USA, October 24-25, 2003.
Internet: http://www.cprince.com/PubRes/CogDevSoc03/
Prince, C. G., Hollich, G. J., Helder, N. A., Mislivec, E. J., Reddy, A., Salunke, S., & Memon, N. (in preparation). Taking synchrony seriously: Comparing infants with a perceptual-level model. For submission to The Fourth Annual Workshop on Epigenetic Robotics, to be held at
Vuppla, K. (in preparation). Evaluation of two synchrony detection implementations. Master's Thesis,