### Top 10 arXiv Papers Today in Computer Vision and Pattern Recognition

##### #1. A Partially Reversible U-Net for Memory-Efficient Volumetric Image Segmentation
###### Robin Brügger, Christian F. Baumgartner, Ender Konukoglu
One of the key drawbacks of 3D convolutional neural networks for segmentation is their memory footprint, which necessitates compromises in the network architecture in order to fit into a given memory budget. Motivated by the RevNet for image classification, we propose a partially reversible U-Net architecture that reduces memory consumption substantially. The reversible architecture allows us to exactly recover each layer's outputs from the subsequent layer's ones, eliminating the need to store activations for backpropagation. This alleviates the biggest memory bottleneck and enables very deep (theoretically infinitely deep) 3D architectures. On the BraTS challenge dataset, we demonstrate substantial memory savings. We further show that the freed memory can be used for processing the whole field-of-view (FOV) instead of patches. Increasing network depth led to higher segmentation accuracy while growing the memory footprint only by a very small fraction, thanks to the partially reversible architecture.
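The recovery of a layer's inputs from its outputs can be illustrated with the additive coupling used in RevNet-style blocks. This is a minimal sketch, not the authors' code: the functions `f` and `g` are hypothetical stand-ins for learned sub-networks, and plain Python lists stand in for feature maps.

```python
import math

def f(x):
    # Hypothetical residual function F (stand-in for a learned sub-network).
    return [math.tanh(2.0 * v) for v in x]

def g(x):
    # Hypothetical residual function G (stand-in for a learned sub-network).
    return [0.5 * v * v for v in x]

def forward(x1, x2):
    """Additive coupling: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def inverse(y1, y2):
    """Recompute the inputs from the outputs, so they need not be stored."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

# The input is split into two halves, as a channel split would do.
x1, x2 = [0.1, -0.4, 2.0], [1.5, 0.0, -0.7]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)

# Largest reconstruction error over all elements (floating-point rounding only).
max_error = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
```

Because `inverse` needs only the block's outputs, a training loop built this way could discard intermediate activations and recompute them during the backward pass, trading extra computation for memory.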
###### Tweets
c_f_baumgartner: More exciting work from @bmic_eth accepted to @miccai2019! Master student Robin Bruegger used reversible units to develop a class of extremely memory-efficient 3D segmentation networks of almost unlimited depth. Pre-print: https://t.co/YrOAYVF8UP Code: https://t.co/6ThhPGHnNK https://t.co/Nz9gQSDfiP
c_f_baumgartner: The non-PDF link is here https://t.co/n9cZe5h6qO for those who prefer this.
arxiv_cscv: A Partially Reversible U-Net for Memory-Efficient Volumetric Image Segmentation https://t.co/OZkxjfL2Se
###### Github

Framework for creating (partially) reversible neural networks with PyTorch

Repository: RevTorch
User: RobinBruegger
Language: Python
Stargazers: 0
Subscribers: 0
Forks: 0
Open Issues: 0
###### Other stats
Sample Sizes : None.
Authors: 3
Total Words: 3227
Unique Words: 1230

##### #2. Connecting Touch and Vision via Cross-Modal Prediction
###### Yunzhu Li, Jun-Yan Zhu, Russ Tedrake, Antonio Torralba
Humans perceive the world using multi-modal sensory inputs such as vision, audition, and touch. In this work, we investigate the cross-modal connection between vision and touch. The main challenge in this cross-domain modeling task lies in the significant scale discrepancy between the two: while our eyes perceive an entire visual scene at once, humans can only feel a small region of an object at any given moment. To connect vision and touch, we introduce new tasks of synthesizing plausible tactile signals from visual inputs as well as imagining how we interact with objects given tactile data as input. To accomplish our goals, we first equip robots with both visual and tactile sensors and collect a large-scale dataset of corresponding vision and tactile image sequences. To close the scale gap, we present a new conditional adversarial model that incorporates the scale and location information of the touch. Human perceptual studies demonstrate that our model can produce realistic visual images from tactile data and vice versa....
###### Tweets
BrundageBot: Connecting Touch and Vision via Cross-Modal Prediction. Yunzhu Li, Jun-Yan Zhu, Russ Tedrake, and Antonio Torralba https://t.co/hurcJL4Crx
arxivml: "Connecting Touch and Vision via Cross-Modal Prediction", Yunzhu Li, Jun-Yan Zhu, Russ Tedrake, Antonio Torralba https://t.co/wo3dmxpkTA
Memoirs: Connecting Touch and Vision via Cross-Modal Prediction. https://t.co/yfcSJJrZHa
###### Other stats
Sample Sizes : None.
Authors: 4
Total Words: 7032
Unique Words: 2213

##### #3. Copy and Paste: A Simple But Effective Initialization Method for Black-Box Adversarial Attacks
###### Thomas Brunner, Frederik Diehl, Alois Knoll
Many optimization methods for generating black-box adversarial examples have been proposed, but the aspect of initializing said optimizers has not been considered in much detail. We show that the choice of starting points is indeed crucial, and that the performance of state-of-the-art attacks depends on it. First, we discuss desirable properties of starting points for attacking image classifiers, and how they can be chosen to increase query efficiency. Notably, we find that simply copying small patches from other images is a valid strategy. In an evaluation on ImageNet, we show that this initialization reduces the number of queries required for a state-of-the-art Boundary Attack by 81%, significantly outperforming previous results reported for targeted black-box adversarial examples.
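The idea of copying a small patch from another image to build a starting point can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: `paste_patch`, the patch position, and the patch size are hypothetical, images are nested lists of pixel values, and the model query that would check whether the starting point is actually adversarial is omitted.

```python
def paste_patch(original, donor, top, left, size):
    """Return a copy of `original` with a size x size patch taken from `donor`."""
    start = [row[:] for row in original]  # copy so the original stays untouched
    for r in range(top, top + size):
        for c in range(left, left + size):
            start[r][c] = donor[r][c]
    return start

# Toy 4x4 single-channel images: the original is all zeros, the donor all nines.
original = [[0] * 4 for _ in range(4)]
donor = [[9] * 4 for _ in range(4)]

start = paste_patch(original, donor, top=1, left=1, size=2)

# Count how many pixels differ from the original image.
changed = sum(
    1 for r in range(4) for c in range(4) if start[r][c] != original[r][c]
)
```

A starting point that differs from the original in only a few pixels begins closer to it than an unrelated image would, which is one plausible reason such an initialization could save queries.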
###### Tweets
BrundageBot: Copy and Paste: A Simple But Effective Initialization Method for Black-Box Adversarial Attacks. Thomas Brunner, Frederik Diehl, and Alois Knoll https://t.co/51H2HnJvZf
StatsPapers: Copy and Paste: A Simple But Effective Initialization Method for Black-Box Adversarial Attacks. https://t.co/fioqZehSyF
ballforest: RT @StatsPapers: Copy and Paste: A Simple But Effective Initialization Method for Black-Box Adversarial Attacks. https://t.co/fioqZehSyF
SythonUK: RT @StatsPapers: Copy and Paste: A Simple But Effective Initialization Method for Black-Box Adversarial Attacks. https://t.co/fioqZehSyF
###### Other stats
Sample Sizes : None.
Authors: 3
Total Words: 0
Unique Words: 0

##### #4. Towards End-to-End Text Spotting in Natural Scenes
###### Hui Li, Peng Wang, Chunhua Shen
Text spotting in natural scene images is of great importance for many image understanding tasks. It includes two sub-tasks: text detection and recognition. In this work, we propose a unified network that simultaneously localizes and recognizes text with a single forward pass, avoiding intermediate processes such as image cropping and feature re-calculation, word separation, and character grouping. In contrast to existing approaches that consider text detection and recognition as two distinct tasks and tackle them one by one, the proposed framework settles these two tasks concurrently. The whole framework can be trained end-to-end and is able to handle text of arbitrary shapes. The convolutional features are calculated only once and shared by both detection and recognition modules. Through multi-task training, the learned features become more discriminative and improve the overall performance. By employing the 2D attention model in word recognition, the irregularity of text can be robustly addressed. It provides the spatial...
###### Tweets
BrundageBot: Towards End-to-End Text Spotting in Natural Scenes. Hui Li, Peng Wang, and Chunhua Shen https://t.co/cAxObY3VUc
arxiv_cscv: Towards End-to-End Text Spotting in Natural Scenes https://t.co/aQRujtQDj5
keylinker: RT @arxiv_cscv: Towards End-to-End Text Spotting in Natural Scenes https://t.co/aQRujtQDj5
###### Other stats
Sample Sizes : None.
Authors: 3
Total Words: 11862
Unique Words: 3047

##### #5. Cross-View Policy Learning for Street Navigation
###### Ang Li, Huiyi Hu, Piotr Mirowski, Mehrdad Farajtabar
The ability to navigate from visual observations in unfamiliar environments is a core component of intelligent agents and an ongoing challenge for Deep Reinforcement Learning (RL). Street View can be a sensible testbed for such RL agents, because it provides real-world photographic imagery at ground level, with diverse street appearances; it has been made into an interactive environment called StreetLearn and used for research on navigation. However, goal-driven street navigation agents have not so far been able to transfer to unseen areas without extensive retraining, and relying on simulation is not a scalable solution. Since aerial images are easily and globally accessible, we propose instead to train a multi-modal policy on ground and aerial views, then transfer the ground view policy to unseen (target) parts of the city by utilizing aerial view observations. Our core idea is to pair the ground view with an aerial view and to learn a joint policy that is transferable across views. We achieve this by learning a similar...
###### Tweets
BrundageBot: Cross-View Policy Learning for Street Navigation. Ang Li, Huiyi Hu, Piotr Mirowski, and Mehrdad Farajtabar https://t.co/02ZgOpupOd
Memoirs: Cross-View Policy Learning for Street Navigation. https://t.co/LZNIru1AwL
###### Other stats
Sample Sizes : None.
Authors: 4
Total Words: 7297
Unique Words: 2266

##### #6. Divide and Conquer the Embedding Space for Metric Learning
###### Artsiom Sanakoyeu, Vadim Tschernezki, Uta Büchler, Björn Ommer
Learning the embedding space, where semantically similar objects are located close together and dissimilar objects far apart, is a cornerstone of many computer vision applications. Existing approaches usually learn a single metric in the embedding space for all available data points, which may have a very complex non-uniform distribution with different notions of similarity between objects, e.g. appearance, shape, color or semantic meaning. Approaches for learning a single distance metric often struggle to encode all different types of relationships and do not generalize well. In this work, we propose a novel easy-to-implement divide and conquer approach for deep metric learning, which significantly improves the state-of-the-art performance of metric learning. Our approach utilizes the embedding space more efficiently by jointly splitting the embedding space and data into $K$ smaller sub-problems. It divides both the data and the embedding space into $K$ subsets and learns $K$ separate distance metrics in the non-overlapping...
###### Tweets
BrundageBot: Divide and Conquer the Embedding Space for Metric Learning. Artsiom Sanakoyeu, Vadim Tschernezki, Uta Büchler, and Björn Ommer https://t.co/a6LAmDDtaw
Memoirs: Divide and Conquer the Embedding Space for Metric Learning. https://t.co/lT4u1STYvM
###### Other stats
Sample Sizes : None.
Authors: 4
Total Words: 0
Unique Words: 0

##### #7. Utilizing the Instability in Weakly Supervised Object Detection
###### Yan Gao, Boxiao Liu, Nan Guo, Xiaochun Ye, Fang Wan, Haihang You, Dongrui Fan
Weakly supervised object detection (WSOD) focuses on training object detectors with only image-level annotations, and is challenging due to the gap between the supervision and the objective. Most existing approaches model WSOD as a multiple instance learning (MIL) problem. However, we observe that the result of an MIL-based detector is unstable, i.e., the most confident bounding boxes change significantly when using different initializations. We quantitatively demonstrate the instability by introducing a metric to measure it, and empirically analyze the reasons for the instability. Although the instability seems harmful for the detection task, we argue that it can be utilized to improve the performance by fusing the results of differently initialized detectors. To implement this idea, we propose an end-to-end framework with multiple detection branches, and introduce a simple fusion strategy. We further propose an orthogonal initialization method to increase the difference between detection branches. By utilizing the instability, we achieve...
###### Tweets
BrundageBot: Utilizing the Instability in Weakly Supervised Object Detection. Yan Gao, Boxiao Liu, Nan Guo, Xiaochun Ye, Fang Wan, Haihang You, and Dongrui Fan https://t.co/CeDKTHRVve
arxiv_cscv: Utilizing the Instability in Weakly Supervised Object Detection https://t.co/CC3JOmBANR
###### Other stats
Sample Sizes : None.
Authors: 7
Total Words: 0
Unique Words: 0

##### #8. Semantics to Space(S2S): Embedding semantics into spatial space for zero-shot verb-object query inferencing
###### Sungmin Eum, Heesung Kwon
We present a novel deep zero-shot learning (ZSL) model for inferencing human-object-interaction with verb-object (VO) query. While the previous ZSL approaches only use the semantic/textual information to be fed into the query stream, we seek to incorporate and embed the semantics into the visual representation stream as well. Our approach is powered by Semantics-to-Space (S2S) architecture where semantics derived from the residing objects are embedded into a spatial space. This architecture allows the co-capturing of the semantic attributes of the human and the objects along with their location/size/silhouette information. As this is the first attempt to address the zero-shot human-object-interaction inferencing with VO query, we have constructed a new dataset, Verb-Transferability 60 (VT60). VT60 provides 60 different VO pairs with overlapping verbs tailored for testing ZSL approaches with VO query. Experimental evaluations show that our approach not only outperforms the state-of-the-art, but also shows the capability of...
###### Tweets
BrundageBot: Semantics to Space(S2S): Embedding semantics into spatial space for zero-shot verb-object query inferencing. Sungmin Eum and Heesung Kwon https://t.co/H7dbeBuxo9
arxiv_cscv: Semantics to Space(S2S): Embedding semantics into spatial space for zero-shot verb-object query inferencing https://t.co/gssktKnZTJ
###### Other stats
Sample Sizes : None.
Authors: 2
Total Words: 0
Unique Words: 0

##### #9. Universal Barcode Detector via Semantic Segmentation
###### Andrey Zharkov, Ivan Zagaynov
###### Tweets
arxivml: "Universal Barcode Detector via Semantic Segmentation", Andrey Zharkov, Ivan Zagaynov https://t.co/O5L1bnGU57
Memoirs: Universal Barcode Detector via Semantic Segmentation. https://t.co/IeXpA9WQpu
###### Other stats
Sample Sizes : None.
Authors: 2
Total Words: 0
Unique Words: 0

##### #10. MonoLoco: Monocular 3D Pedestrian Localization and Uncertainty Estimation
###### Lorenzo Bertoni, Sven Kreiss, Alexandre Alahi
We tackle the fundamentally ill-posed problem of 3D human localization from monocular RGB images. Driven by the limitation of neural networks outputting point estimates, we address the ambiguity in the task with a new neural network predicting confidence intervals through a loss function based on the Laplace distribution. Our architecture is a lightweight feed-forward neural network which predicts the 3D coordinates given 2D human pose. The design is particularly well suited for small training data and cross-dataset generalization. Our experiments show that (i) we outperform state-of-the-art results on KITTI and nuScenes datasets, (ii) even outperform stereo for far-away pedestrians, and (iii) estimate meaningful confidence intervals. We further share insights on our model of uncertainty in case of limited observation and out-of-distribution samples.
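How a Laplace-based loss can yield confidence intervals may be easier to see in a small example. The sketch below uses the generic negative log-likelihood of a Laplace distribution with location `mu` and scale `b`; it is not necessarily the exact loss used in the paper, and the function names and sample values are assumptions made for illustration.

```python
import math

def laplace_nll(target, mu, b):
    """Negative log-likelihood of `target` under Laplace(mu, b), with b > 0."""
    return math.log(2.0 * b) + abs(target - mu) / b

def laplace_interval(mu, b, coverage=0.95):
    """Central interval around mu that contains `coverage` of the probability mass."""
    # For a Laplace distribution, P(|x - mu| <= t) = 1 - exp(-t / b).
    half_width = -b * math.log(1.0 - coverage)
    return mu - half_width, mu + half_width

# Hypothetical prediction: a pedestrian at 20 m with a predicted scale of 1 m.
low, high = laplace_interval(mu=20.0, b=1.0, coverage=0.95)

# A confident but wrong prediction is penalized more than an uncertain one.
confident_wrong = laplace_nll(target=25.0, mu=20.0, b=0.5)
uncertain_wrong = laplace_nll(target=25.0, mu=20.0, b=5.0)
```

A network trained with such a loss outputs both `mu` and `b`, so a larger predicted scale widens the interval, and the `log(2 * b)` term keeps the model from inflating the scale everywhere.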
###### Github
Repository: monoloco
User: vita-epfl
Language: Python
Stargazers: 1
Subscribers: 5
Forks: 0
Open Issues: 0
###### Other stats
Sample Sizes : None.
Authors: 3
Total Words: 6457
Unique Words: 2084

Assert is a website where the best academic papers on arXiv (computer science, math, physics), bioRxiv (biology), BITSS (reproducibility), EarthArXiv (earth science), engrXiv (engineering), LawArXiv (law), PsyArXiv (psychology), SocArXiv (social science), and SportRxiv (sport research) bubble to the top each day.

Papers are scored (in real-time) based on how verifiable they are (as determined by their Github repos) and how interesting they are (based on Twitter).

Tracking 143,632 papers.
