In recent years, computers have gotten remarkably good at recognizing speech and images: Think of the dictation software on most cellphones, or the algorithms that automatically identify people in photos posted to Facebook.
But recognition of natural sounds — such as crowds cheering or waves crashing — has lagged behind. That’s because most automated recognition systems, whether they process audio or visual information, are the result of machine learning, in which computers search for patterns in huge compendia of training data. Usually, the training data has to be first annotated by hand, which is prohibitively expensive for all but the highest-demand applications.

Sound recognition may be catching up, however, thanks to researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL). At the Neural Information Processing Systems conference next week, they will present a sound-recognition system that outperforms its predecessors but didn’t require hand-annotated data during training.

Instead, the researchers trained the system on video. First, existing computer vision systems that recognize scenes and objects categorized the images in the video. The new system then found correlations between those visual categories and natural sounds.
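
The idea of learning sound categories from a vision system's output can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' implementation: the article does not say which loss function or architecture was used, so the sketch assumes a cross-entropy loss against the vision system's category probabilities and substitutes a single linear layer for the sound network. The feature vectors, category names, and function names are all hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, audio_features):
    """Hypothetical sound model: one linear layer followed by softmax."""
    scores = [sum(w * x for w, x in zip(row, audio_features)) for row in weights]
    return softmax(scores)

def cross_entropy(target, predicted):
    """Penalty for disagreeing with the vision system's probabilities."""
    return -sum(t * math.log(p + 1e-12) for t, p in zip(target, predicted))

def mean_loss(weights, pairs):
    return sum(cross_entropy(t, predict(weights, a)) for a, t in pairs) / len(pairs)

def train_sound_model(pairs, num_features, num_categories, lr=0.5, epochs=200):
    """Fit the sound model to the vision system's outputs by gradient descent."""
    weights = [[0.0] * num_features for _ in range(num_categories)]
    for _ in range(epochs):
        for audio_features, vision_probs in pairs:
            predicted = predict(weights, audio_features)
            for k in range(num_categories):
                # Gradient of softmax cross-entropy with respect to score k.
                error = predicted[k] - vision_probs[k]
                for j in range(num_features):
                    weights[k][j] -= lr * error * audio_features[j]
    return weights

# Made-up data. Each pair holds audio features from one video clip and the
# probabilities a vision system assigned to two visual categories
# (say, "beach" and "restaurant") for the same clip. No human labels appear.
pairs = [
    ([1.0, 0.1, 0.0], [0.9, 0.1]),
    ([0.9, 0.2, 0.1], [0.8, 0.2]),
    ([0.1, 1.0, 0.8], [0.1, 0.9]),
    ([0.0, 0.9, 1.0], [0.2, 0.8]),
]

untrained = [[0.0] * 3 for _ in range(2)]
loss_before = mean_loss(untrained, pairs)
weights = train_sound_model(pairs, num_features=3, num_categories=2)
loss_after = mean_loss(weights, pairs)
print(loss_before, loss_after)
```

The point of the sketch is that the only supervision comes from the vision system's output on the same clip, which is what lets unlabeled video stand in for hand-annotated sound recordings.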

“Computer vision has gotten so good that we can transfer it to other domains,” says Carl Vondrick, an MIT graduate student in electrical engineering and computer science and one of the paper’s two first authors. “We’re capitalizing on the natural synchronization between vision and sound. We scale up with tons of unlabeled video to learn to understand sound.”

The researchers tested their system on two standard databases of annotated sound recordings, and it was between 13 and 15 percent more accurate than the best-performing previous system. On a data set with 10 different sound categories, it could categorize sounds with 92 percent accuracy, and on a data set with 50 categories it performed with 74 percent accuracy. On those same data sets, humans are 96 percent and 81 percent accurate, respectively.

“Even humans are ambiguous,” says Yusuf Aytar, the paper’s other first author and a postdoc in the lab of MIT professor of electrical engineering and computer science Antonio Torralba. Torralba is the final co-author on the paper.

“We did an experiment with Carl,” Aytar says. “Carl was looking at the computer monitor, and I couldn’t see it. He would play a recording and I would try to guess what it was. It turns out this is really, really hard. I could tell indoor from outdoor, basic guesses, but when it comes to the details — ‘Is it a restaurant?’ — those details are missing. Even for annotation purposes, the task is really hard.”

Complementary modalities

Because it takes far less power to collect and process audio data than it does to collect and process visual data, the researchers envision that a sound-recognition system could be used to improve the context sensitivity of mobile devices.

When coupled with GPS data, for instance, a sound-recognition system could determine that a cellphone user is in a movie theater and that the movie has started, and the phone could automatically route calls to a prerecorded outgoing message. Similarly, sound recognition could improve the situational awareness of autonomous robots.

“For instance, think of a self-driving car,” Aytar says. “There’s an ambulance coming, and the car doesn’t see it. If it hears it, it can make future predictions for the ambulance — which path it’s going to take — just purely based on sound.”

Visual language

The researchers’ machine-learning system is a neural network, so called because its architecture loosely resembles that of the human brain. A neural net consists of processing nodes that, like individual neurons, can perform only rudimentary computations but are densely interconnected. Information — say, the pixel values of a digital image — is fed to the bottom layer of nodes, which processes it and feeds it to the next layer, which processes it and feeds it to the next layer, and so on. The training process continually modifies the settings of the individual nodes, until the output of the final layer reliably performs some classification of the data — say, identifying the objects in the image.
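
The layer-by-layer flow described above can be made concrete with a toy example. The following is a minimal sketch, assuming a tiny fully connected network with made-up weights; it is not the researchers' network, and every name and number in it is hypothetical. Each node computes only a weighted sum and a simple nonlinearity, and each layer hands its output to the next.

```python
import math

def relu(value):
    """A rudimentary per-node computation: pass positives, zero out negatives."""
    return max(0.0, value)

def identity(value):
    return value

def layer_forward(inputs, weights, biases, activation):
    """Every node sees all inputs (dense connections) and emits one number."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]

def softmax(scores):
    """Convert final-layer scores into class probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(pixels, layers):
    """Feed pixel values to the bottom layer and pass results upward."""
    activations = pixels
    for weights, biases in layers[:-1]:
        activations = layer_forward(activations, weights, biases, relu)
    weights, biases = layers[-1]
    return softmax(layer_forward(activations, weights, biases, identity))

# Toy network: 4 pixel values -> 3 hidden nodes -> 2 output classes.
# These weights are arbitrary; training would adjust them.
layers = [
    ([[0.5, -0.2, 0.1, 0.0],
      [0.3, 0.8, -0.5, 0.2],
      [-0.4, 0.1, 0.9, 0.6]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5],
      [-0.5, 0.7, 0.2]], [0.0, 0.0]),
]

probs = forward([0.2, 0.9, 0.4, 0.1], layers)
print(probs)
```

Training, in these terms, means repeatedly adjusting the numbers in `layers` until the probabilities from the final layer match the correct classification for the training examples.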

Vondrick, Aytar, and Torralba first trained a neural net on two large, annotated sets of images: one, the ImageNet data set, contains labeled examples of images of 1,000 different objects; the other, the Places data set created by Torralba’s group, contains labeled images of 401 different scene types, such as a playground, bedroom, or conference room.

Once the network was trained, the researchers fed it the video from 26 terabytes of video data downloaded from the photo-sharing site Flickr. “It’s about 2 million unique videos,” Vondrick says. “If you were to watch all of them back to back, it would take you about two years.” Then they trained a second neural network on the audio from the

