First published: 2016/10/07

Abstract: We present an interpretation of Inception modules in convolutional neural
networks as being an intermediate step in-between regular convolution and the
recently introduced "separable convolution" operation. In this light, a
separable convolution can be understood as an Inception module with a maximally
large number of towers. This observation leads us to propose a novel deep
convolutional neural network architecture inspired by Inception, where
Inception modules have been replaced with separable convolutions. We show that
this architecture, dubbed Xception, slightly outperforms Inception V3 on the
ImageNet dataset (which Inception V3 was designed for), and significantly
outperforms Inception V3 on a larger image classification dataset comprising
350 million images and 17,000 classes. Since the Xception architecture has the
same number of parameters as Inception V3, the performance gains are not due to
increased capacity but rather to a more efficient use of model parameters.

Xception Net, or Extreme Inception Net, offers a new way of looking at Inception Nets. Inception Nets, as first published (as GoogLeNet), consisted of Network-in-Network-style modules.

The idea behind Inception modules is to handle cross-channel correlations (via 1x1 convolutions) and spatial correlations (via 3x3 convolutions) separately. The underlying hypothesis is that cross-channel and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly. This idea is the genesis of Xception Net, which uses depthwise separable convolutions: a spatial convolution applied to each channel independently, followed by a pointwise (1x1) convolution that projects the result into the required channel space and so captures inter-channel correlations. Chollet does a wonderful job of explaining how regular convolution (which handles channel and spatial correlations simultaneously) and depthwise separable convolution (which handles them independently in successive steps) are the two end points of a spectrum, with the original Inception modules lying in between.

*For Xception Net, though, Chollet uses depthwise separable layers that perform a 3x3 convolution on each channel first and then a 1x1 convolution on the output (the opposite order of operations from the Inception-style module, where the 1x1 convolution comes first).*
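
To make the two steps concrete, here is a minimal pure-Python sketch of a depthwise separable convolution in the order Xception uses (per-channel spatial convolution first, then a 1x1 pointwise convolution). It assumes stride 1, no padding, no bias and no nonlinearity, and it stores tensors as nested lists; the function names are hypothetical and chosen only for illustration, not taken from the paper or any library.

```python
def depthwise_conv(x, kernels):
    """Apply one k x k spatial filter per channel, with no channel mixing.

    x: nested list shaped [C][H][W]; kernels: nested list shaped [C][k][k].
    Assumes stride 1 and no padding, so the output is [C][H-k+1][W-k+1].
    """
    k = len(kernels[0])
    h_out = len(x[0]) - k + 1
    w_out = len(x[0][0]) - k + 1
    return [
        [
            [
                sum(
                    x[c][i + di][j + dj] * kernels[c][di][dj]
                    for di in range(k)
                    for dj in range(k)
                )
                for j in range(w_out)
            ]
            for i in range(h_out)
        ]
        for c in range(len(x))
    ]


def pointwise_conv(x, weights):
    """Apply a 1x1 convolution: mix channels independently at every pixel.

    x: nested list shaped [C_in][H][W]; weights: nested list shaped [C_out][C_in].
    """
    h, w = len(x[0]), len(x[0][0])
    return [
        [
            [sum(weights[o][c] * x[c][i][j] for c in range(len(x))) for j in range(w)]
            for i in range(h)
        ]
        for o in range(len(weights))
    ]


def separable_conv(x, depthwise_kernels, pointwise_weights):
    """Xception order: spatial step per channel, then cross-channel projection."""
    return pointwise_conv(depthwise_conv(x, depthwise_kernels), pointwise_weights)


# Two 3x3 input channels, one 3x3 filter per channel, two output channels.
x = [[[1] * 3 for _ in range(3)], [[2] * 3 for _ in range(3)]]
depthwise_kernels = [[[1] * 3 for _ in range(3)] for _ in range(2)]
pointwise_weights = [[1, 1], [1, -1]]
y = separable_conv(x, depthwise_kernels, pointwise_weights)
```

Note that the spatial step never looks across channels and the pointwise step never looks across pixels, which is exactly the decoupling described above.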
##### Input
The input is a set of images for classification, along with their corresponding labels.
##### Architecture
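The architecture of Xception Net follows the layout of VGG-16, with the convolution-maxpool blocks replaced by residual blocks of depthwise separable convolution layers. In the paper, the network is organized into an entry flow, a middle flow (repeated eight times), and an exit flow.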
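
As a rough illustration of why such blocks are parameter-efficient, the sketch below counts the weights in a block of three 3x3 layers, once with regular convolutions and once with depthwise separable ones. The channel width of 728 is an assumption taken from the paper's middle flow, biases and batch-normalization parameters are ignored, and the identity residual connection is assumed to add no weights; the helper names are hypothetical.

```python
def regular_conv_params(k, c_in, c_out):
    """Weights in a regular k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise separable convolution (bias ignored):
    one k x k filter per input channel, plus a 1x1 projection."""
    return k * k * c_in + c_in * c_out


# Assumed block: three 3x3 layers at 728 channels with an identity residual.
channels, kernel_size, layers_per_block = 728, 3, 3

regular_block = layers_per_block * regular_conv_params(kernel_size, channels, channels)
separable_block = layers_per_block * separable_conv_params(kernel_size, channels, channels)

print(f"regular block:   {regular_block:,} weights")
print(f"separable block: {separable_block:,} weights")
print(f"ratio:           {regular_block / separable_block:.2f}x")
```

Under these assumptions the separable block needs roughly a ninth of the weights of the regular one, which leaves room for more or wider layers within a fixed parameter budget.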

##### Results
Xception Net was trained using hyperparameters tuned for the best performance of Inception V3. Even so, Xception outperformed Inception V3 on both the larger internal dataset and ImageNet. Points to note:
- Both Xception and Inception V3 have roughly the same number of parameters (~24 M), so the improvement in performance can't be attributed to network size.
- Xception trains slightly slower than Inception V3 (fewer gradient steps per second). The paper expects this gap to close, or reverse, with future optimization of the depthwise convolution operations.