GROUP NAME: Deep Thought
GROUP MEMBERS: Vivian Qu, Neil Chatterjee, Harvest Zhang, Alan Thorne

All four group members conducted contextual interviews. All four worked together on the task analysis questions and interface design, organizing a meeting and collaborating on a Google Doc.

Harvest, Alan, and Vivian drew sketches and final storyboards for the 3 defined tasks. Neil and Vivian compiled the blog post information to publish on the website.


VAHN (pronounced "vain") is a Microsoft Kinect project that allows on-the-fly, gesture-controlled audio editing for musical performers and recorders. Skeleton data drives recording software with features such as overlaying audio, playback, EQ modulation, and editing. Suppose you want to know what your a cappella performance is going to sound like, or you're creating a one-man a cappella song: stand in one location, use gestures, sing, record, and edit. Physically move to the next location (simulating multiple singers), then use gestures, sing, record, and edit again. Move to a third location, realize you don't want the middle of a snippet, and use gestures to remove it. Finish the third section, and finally overlay the parts together. VAHN provides a cheap solution for gesture-controlled audio recording and editing.
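The location-based workflow described above could be modeled as one track per physical location the performer stands in. The sketch below is purely illustrative Python; the class and method names (`Session`, `record`, `cut`, `overlay`) are our own hypothetical stand-ins, not part of any Kinect SDK or a real implementation:

```python
# Hypothetical sketch of VAHN's location-based multi-track workflow.
# All names here are illustrative, not a real Kinect or audio API.

class Session:
    """Collects one audio track per physical location the performer stands in."""

    def __init__(self):
        self.tracks = {}  # location label -> list of recorded snippets

    def record(self, location, snippet):
        """Append a recorded snippet to the track for this location."""
        self.tracks.setdefault(location, []).append(snippet)

    def cut(self, location, index):
        """Gesture-triggered edit: remove an unwanted snippet from a track."""
        del self.tracks[location][index]

    def overlay(self):
        """Mix all non-empty tracks into one recording (here: just collect them)."""
        return {loc: snippets for loc, snippets in self.tracks.items() if snippets}

session = Session()
session.record("left", "verse-take-1")      # first singer position
session.record("center", "harmony-take-1")  # move to a new spot: new track
session.record("center", "harmony-take-2")
session.cut("center", 0)                    # gesture away the unwanted first take
mix = session.overlay()
```

In a real system `snippet` would be audio data and `overlay` would sum waveforms; the point of the sketch is only that moving to a new location opens a new track, and gestures edit the track for the current location.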


Our target users are amateur musicians with a high level of musical skill, who therefore need more music recording functionality, but who may not have much technical ability and may lack familiarity with or access to complicated audio recording systems. This is a reasonable target user group because we imagine our system being a lightweight, simple tool which lowers the barrier to creating and enjoying music.

Interviewees were students who are deeply involved with playing and creating music: mostly members of a cappella groups, plus a member of the Princeton University Orchestra (PUO). Their priorities included arranging music, improving their musical skills, and the ability to experiment. In terms of technical skills, one was a tech-savvy singer, two were non-technical singers, and one was a non-technical instrumentalist. All agreed they wanted a simpler way to record music and disliked the complicated interfaces of typical digital audio workstations, where users are often limited by not even knowing that many functionalities exist.

CI Interview Descriptions

When conducting our interviews we followed this process:

  • We shared the general idea of our project and asked why the interviewee would or wouldn't use it.
  • Asked what the users would like to see in an interface (ideas without prompting).
  • Posed our ideas for interfaces to get their feedback, likes and dislikes.

We interviewed the users in their rooms or in public spaces such as coffee shops and dining halls. Music recording is generally done in isolation in a quiet space (usually with easily accessible software, such as the programs that come by default on computers), so the environment itself was not important for understanding the music-making and recording process.

Common suggestions included ease of operation, gesture control while recording, and a general expansion of features to incorporate as much recording-studio functionality as possible. "Undo" functionality is very important for any music editing software and needs to be intuitively integrated into the system. Editing should happen in real time and be fine-grained, so users can be precise about how they edit tracks and cut out segments on the fly. Overall, the system should include the features of high-end recording software while maintaining the simplicity of typical Kinect gesture software.
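One way to make undo fit naturally with gesture-driven, fine-grained editing is to store each edit together with its inverse. The sketch below is a minimal illustration of that idea; every name in it is hypothetical, not taken from our implementation:

```python
# Hypothetical sketch of fine-grained undo: each gesture-triggered edit
# is recorded alongside a function that reverses it.

class EditHistory:
    def __init__(self):
        self._undo_stack = []

    def apply(self, track, edit, inverse):
        """Run an edit on a track and remember how to reverse it."""
        edit(track)
        self._undo_stack.append(inverse)

    def undo(self, track):
        """Reverse the most recent edit, if any."""
        if self._undo_stack:
            self._undo_stack.pop()(track)

track = ["intro", "bad-take", "outro"]
history = EditHistory()
removed = track[1]
history.apply(track,
              lambda t: t.remove("bad-take"),   # the edit: cut a snippet
              lambda t: t.insert(1, removed))   # its inverse: restore it
history.undo(track)  # an "undo" gesture restores the cut snippet
```

Because each edit carries its own inverse, undo stays cheap and precise no matter how small the edited segment is, which matches what interviewees asked for.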

Suggestions from singers included the need for a visual indication of loop position (like the start and finish of tracks), visual indications of beats/pulse, a recommended auto-tune mode, and various sound effects such as compressors and EQs. This product concept is extremely appealing to singers because other available software (MuseScore, Sibelius, Finale, etc.) produces electronically generated MIDI playback, which doesn't correspond to how parts will sound together in real life.

Feedback from an instrumentalist included the need to address how gestures are handled: it's awkward to make gestures while playing an instrument. It would be useful to have different "modes" (switching the Kinect from skeleton data input to depth data), allowing the musician to add notes. A timer countdown before recording starts would also be useful. The interviewee really liked the concept of using spatial location for different tracks.

Users also suggested that a video component would be valuable to the music-making process, especially if users could see multiple visual recordings at the same time.


1. What are unrealized design opportunities?

We would like people who can sing and play well (serious singers and instrumentalists) to have an affordable pick-up-and-play recording and mixing system. Available technology is either too simplistic or too expensive, with complicated interfaces. Gesture controls via the Kinect are a new approach that can be integrated into performances or recordings, allowing people of all technical skill levels to do a lot with the product without touching the computer.

2. What tasks do they now perform?

For music recording, our target user group uses freely available, bare-bones recording equipment instead of professional systems, which are too expensive and complicated. For example, GarageBand is relatively easy to use but still complicated for non-technical people; many don't recognize the power it has and attribute limitations in sound quality to the software rather than to their own lack of knowledge. Additionally, the editing and recording process takes many hours: at least four hours to produce a do-it-yourself-quality recording. Results are similar for other digital audio workstations like Pro Tools, Logic Pro, etc. Users may also use tools not intended for music recording (such as Photo Booth on Mac computers) simply because they are easy and quick to use, though the quality is poor. There seems to be a trade-off between ease of use on one hand and time investment and quality on the other.

3. What tasks are desired?

Users would like a simple recording process that lets them quickly and dynamically create music alone or collaboratively. This allows them to improve their skills and quickly share the music, which is often useful when others need to learn it (such as new arrangements, where people would like to hear the balance of the overall blend). In particular, users want an easy interface that allows musical people to take advantage of sound manipulation techniques and mixing/layering tracks without a technical background.

4. How are the tasks learned?

Trial by fire: users muddle through and figure out the functionalities as they go. Our interviewees said they learned by asking people who already knew how to do things; they could not have figured it out on their own. There are no formal classes they can take. Documentation for these systems exists online, but it can be hard to understand; some said they wouldn't even know where to start.

5. Where are the tasks performed?

Tasks are performed in quiet environments (usually a home, dorm room, practice room, or recording space) and open areas. The environment has no influence other than its impact on sound quality. It is important to pay attention to how recording equipment is positioned: for example, a cappella groups sing in circular formations, which is hard to capture with a single one-directional mic, so a 360-degree mic is invaluable for capturing a recording that best matches a live performance. Bystanders have no effect; users usually record alone or with people they are comfortable working with musically.

6. What’s the relationship between user & data?

Handling of data should be local, because recording music often results in catastrophic failure (the song sounds bad, the singer is off-key, etc.), and users don't want bad recordings broadcast over the internet. Recordings should be private by default, with the option to share with others. There is not much of a privacy issue because the system is offline.

7. What other tools does the user have?

Laptop, home PC, cell phone, musical instruments, microphones, speakers. Microphones are extremely important for guaranteeing sound quality, so it might be useful to integrate them into our system. Mobile recording and mixing applications exist, but they are intended for casual use and offer no editing ability.

As mentioned before, currently available digital audio workstations include GarageBand, Pro Tools, and Logic Pro, and musicians even use Photo Booth to record music.

8. How do users communicate with each other?

Communication between users is not central to the music recording tasks themselves. Users often upload their recordings to websites such as YouTube to share them with a larger audience.

9. How often are the tasks performed?

The tasks are performed whenever the urge to record and mix music occurs, which could be daily or once in a long span of time. Music recording is estimated to take a minimum of four hours, which becomes longer once significant editing time is taken into account. However, the work can be broken into chunks across an extended period, so it is necessary to let users save the workspace and later pick up where they left off.

10. What are the time constraints on the tasks?

There are no hard time constraints; the tasks are performed whenever users choose. However, if our device is used for live performance and collaborative music making, it needs to take minimal time to set up.

11. What happens when things go wrong?

Delete and start over. Editing should be available at a very fine-grained scale, and undo is very important.


  1. Single recording — Simple gesture-based recording and editing of one voice part. The performer records with the Kinect; afterwards or during recording, they can edit different sections using gestures. The performer spends minimal time interfacing with the computer and more time focusing on the actual performance/recording.
  2. A cappella recording ("intro version") — Combining multiple tracks into one recording. This is the "intro version" because the performer uses default settings for everything (auto-tune on, equalizers preset, beatbox helper).
  3. Real-time mixing and recording — The performer manipulates sound properties while recording, including dubstep-style effects, EQs, and other sound processing features. The performer has full control over every aspect of recording, much like the functionality of complex recording systems but with an easier, simpler interface.


Text Description: 

The device focuses on real-time music recording with smooth playback and a simple, intuitive interface that still allows many different kinds of sound manipulation. It will allow easier and more intuitive recording than digital audio workstations, and serve as a higher-quality alternative to Rock Band. Users can record multiple tracks separately and combine them into one recording, adjusting sound levels on each track individually, adding filters, and changing other sound properties. Body gestures facilitate the recording and editing process. The main innovation is using the spatial location of the person's body to detect which "part" they are recording (for a cappella, locations would correspond to the soprano, alto, tenor, and bass voice parts). This makes multi-track layering easier for the user to understand, because they must physically move to start a new track.
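The spatial-location idea could be implemented by dividing the floor in front of the Kinect into zones, one per voice part, keyed on the performer's horizontal body position. The sketch below is a rough illustration only: the zone boundaries are made-up values, and the x-coordinate is simply assumed to be something like a skeleton joint's horizontal position in meters, not a value from a specific SDK:

```python
# Hypothetical sketch: map the performer's horizontal position to a voice part.
# Zone boundaries (in meters from the sensor's center line) are invented
# for illustration; a real system would calibrate these.

VOICE_ZONES = [
    (-float("inf"), -0.75, "soprano"),
    (-0.75, 0.0, "alto"),
    (0.0, 0.75, "tenor"),
    (0.75, float("inf"), "bass"),
]

def voice_part(x):
    """Return the voice part whose zone contains horizontal position x."""
    for lo, hi, part in VOICE_ZONES:
        if lo <= x < hi:
            return part

print(voice_part(-1.2))  # soprano: performer standing far left
print(voice_part(0.3))   # tenor: slightly right of center
```

Because each zone maps to exactly one track, stepping sideways is all the user has to do to start recording a new part, which is the physical metaphor the design relies on.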