We first pre-train weights with a traditional CNN for general feature learning. The convolutional neural network takes a video frame as its input and produces a feature map of the whole image. The convolutional weights are learned on the 1000-class ImageNet dataset, so that the network acquires a generalized understanding of almost arbitrary visual objects. During pre-training, the output of the first fully connected layer is a feature vector of size 4096, a dense representation of the mid-level visual features. In principle, this feature vector can be fed into any classifier (such as an SVM or a CNN) to achieve good classification results with proper training.

Once the pre-trained weights are able to generate visual features, we adopt the YOLO architecture as the detection module. On top of the convolutional layers, YOLO uses fully connected layers to regress the feature representation into region predictions. These predictions are encoded as an $S \times S \times (B \times 5 + C)$ tensor: the image is divided into an $S \times S$ grid, and each grid cell predicts $B$ bounding boxes, each represented by 5 parameters, namely its location $(x, y, w, h)$ and its confidence $c$. A one-hot vector of length $C$ is also predicted, indicating the class label of each bounding box. In our framework, we follow the YOLO architecture and set $S = 7$, $B = 2$, $C = 20$. Each bounding box originally carries 6 predictions: $x$, $y$, $w$, $h$, class label, and confidence, but we nullify the class label and confidence for visual tracking, since the evaluation involves locations only.

Here, $(x, y)$ denotes the coordinates of the bounding box center, relative to the width and the height of the image, respectively. The width and height of the bounding box are likewise relative to those of the image. Consequently, $(x, y, w, h) \in [0, 1]^4$, which makes regression easier when these values are concatenated with the 4096-dimensional visual features and fed into the tracking module.
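To make the pre-training output concrete, the following PyTorch sketch shows how a 4096-dimensional feature vector can be read off the first fully connected layer of a convolutional backbone. This is a minimal toy model, not the paper's actual architecture: the layer sizes, the `FeatureExtractor` name, and the use of a 448×448 input (YOLO's input resolution) are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch, not the paper's exact backbone: a toy convolutional stack
# followed by a first fully connected layer producing the 4096-d feature
# vector described above. Layer sizes here are illustrative assumptions.
class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)),   # collapse to a fixed 7x7 map
        )
        self.fc1 = nn.Linear(16 * 7 * 7, 4096)  # first fully connected layer

    def forward(self, x):
        x = self.conv(x)
        x = torch.flatten(x, 1)
        return self.fc1(x)                  # 4096-d visual feature vector

frame = torch.randn(1, 3, 448, 448)         # one 448x448 RGB video frame
features = FeatureExtractor()(frame)
print(features.shape)                        # torch.Size([1, 4096])
```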
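The $S \times S \times (B \times 5 + C)$ encoding can likewise be illustrated in a few lines of NumPy. With $S = 7$, $B = 2$, $C = 20$, the output tensor has shape $7 \times 7 \times 30$. Note that the exact channel ordering (boxes first versus class scores first) varies across YOLO implementations; the layout below is an assumption for illustration, with random values standing in for real network output.

```python
import numpy as np

S, B, C = 7, 2, 20                          # grid size, boxes per cell, classes
pred = np.random.rand(S, S, B * 5 + C)      # stand-in for the network output
print(pred.shape)                            # (7, 7, 30)

cell = pred[3, 4]                            # predictions of one grid cell
boxes = cell[:B * 5].reshape(B, 5)           # B boxes: x, y, w, h, confidence
class_scores = cell[B * 5:]                  # length-C class label vector
```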
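Finally, the input to the tracking module can be sketched as follows: the pixel-space box is normalized by the image dimensions so that $(x, y, w, h) \in [0, 1]^4$, then appended to the 4096-dimensional visual feature vector, giving a 4100-dimensional input. The pixel values here are arbitrary placeholders, and the feature vector is random rather than an actual CNN output.

```python
import numpy as np

img_w, img_h = 640, 480                      # example frame size in pixels
x_px, y_px, w_px, h_px = 320, 240, 128, 96   # example detected box in pixels

# Normalize the box center and size by the image width/height so that
# each value falls in [0, 1], as described above.
location = np.array([x_px / img_w, y_px / img_h,
                     w_px / img_w, h_px / img_h])

visual_features = np.random.rand(4096)       # stand-in for the CNN features
tracking_input = np.concatenate([visual_features, location])
print(tracking_input.shape)                   # (4100,)
```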