Abstract

Network compression techniques that combine tensor decompositions and pruning have shown promise in leveraging the advantages of both strategies. In this work, we propose enhanced Network cOmpRession through TensOr decompositions and pruNing (NORTON), a novel method for network compression. NORTON introduces the concept of filter decomposition, enabling a finer-grained decomposition of the network while preserving the weights' multidimensional properties. Our method also incorporates a novel structured pruning approach that operates directly on the decomposed model. Through extensive experiments on various architectures, benchmark datasets, and representative vision tasks, we demonstrate the effectiveness of our method. NORTON achieves superior results compared to state-of-the-art techniques in terms of both complexity and accuracy.

💪 Motivation

Tensor decomposition has emerged as an effective tool for compressing convolutional neural networks by exploiting the low-rank structure of weight tensors. Existing approaches typically focus on which decomposition to use and how to select the rank, while implicitly assuming that the convolutional weights should be decomposed at the level of the entire layer.

However, a convolutional layer's weight is inherently a fourth-order tensor, and the way this tensor is handled prior to decomposition has a major impact on both compression efficiency and approximation quality. Prior works either decompose the whole layer directly or reshape the tensor into a lower-order form before applying decomposition. These strategies operate at a coarse granularity and may either limit flexibility or compromise the original multidimensional structure.
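
For intuition, here is a minimal PyTorch sketch (illustrative shapes, not code from this repository) of the reshape-based route, which matricizes the weight before factorizing it and thereby discards the spatial structure:

```python
import torch

# Hypothetical conv weight: 64 output filters, 32 input channels, 3x3 kernels.
W = torch.randn(64, 32, 3, 3)

# Reshape-based strategy: matricize the 4th-order tensor, then factorize.
# The 3x3 spatial structure disappears inside the (64, 288) matrix.
M = W.reshape(64, -1)
U, S, Vh = torch.linalg.svd(M, full_matrices=False)

# Layer-wise strategy (by contrast) would decompose W as a whole, e.g. with a
# Tucker or CP model over all four modes, keeping the multidimensional
# structure but forcing a single global rank choice for the entire layer.
```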

Comparison of decomposition strategies
Figure 1: Different ways of decomposing convolutional weights: layer-wise decomposition, reshape-based decomposition, and the proposed filter-wise decomposition.

In contrast, we observe that convolution operates on a filter-by-filter basis: each output feature map is produced independently by convolving the input with a single filter. This motivates a more fine-grained perspective, in which the fourth-order weight tensor is naturally interpreted as a collection of third-order filter tensors.
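
This equivalence is easy to check numerically. The following minimal PyTorch sketch (with hypothetical shapes) confirms that the layer-wise convolution equals the stack of per-filter convolutions:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: 8 filters, 3 input channels, 3x3 kernels.
W = torch.randn(8, 3, 3, 3)                 # 4th-order layer weight
x = torch.randn(1, 3, 32, 32)

y_full = F.conv2d(x, W, padding=1)          # the usual layer-wise convolution

# Each output channel comes from convolving x with a single 3rd-order filter.
y_stacked = torch.cat(
    [F.conv2d(x, W[c:c + 1], padding=1) for c in range(W.shape[0])], dim=1)

assert torch.allclose(y_full, y_stacked, atol=1e-5)
```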

Building on this observation, NORTON adopts a filter decomposition strategy, in which each filter is decomposed individually. This preserves the full multidimensional structure, avoids unnecessary tensor reshaping, and yields a narrower and more stable range of ranks. Moreover, filter decomposition replaces a convolutional layer with fewer sublayers than layer-wise decomposition, reducing architectural complexity and facilitating subsequent compression steps.
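
As a rough illustration of this step, the sketch below decomposes each filter with TensorLy's parafac; the rank value here is arbitrary, and NORTON's actual rank selection follows the procedure described in the paper:

```python
import torch
import tensorly as tl
from tensorly.decomposition import parafac

tl.set_backend('pytorch')

# Hypothetical layer weight: 64 filters, each a 3rd-order tensor of shape (32, 3, 3).
W = torch.randn(64, 32, 3, 3)
rank = 2  # arbitrary illustrative CP rank

cpd_blocks = []
for c in range(W.shape[0]):
    # CPD of one filter yields three factor matrices of shapes
    # (32, rank), (3, rank), and (3, rank): the CPDBlock for this filter.
    weights, factors = parafac(W[c], rank=rank)
    cpd_blocks.append(factors)
```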

🚀 Why Combine Decomposition and Pruning?

Among existing model compression paradigms, tensor decomposition and structured pruning have proven to be both effective and practical. Decomposition reduces redundancy by exploiting low-rank structure in weights, while pruning removes redundant filters or channels directly. Importantly, both approaches produce standard architectures that can be deployed on resource-constrained devices without specialized hardware.

Despite their shared goal, these two techniques have largely been studied in isolation. Only limited efforts have explored how to combine them in a truly complementary manner. This separation overlooks a key property of convolutional networks: their weights simultaneously exhibit low-rank and sparse characteristics.

Complementarity of decomposition and pruning
Figure 2: Decomposition and pruning are complementary: combining them enables significantly higher compression than either method alone.

Decomposition methods alone cannot eliminate all redundant filters, while aggressive pruning often leads to severe accuracy degradation when low-rank structure is ignored. As a result, state-of-the-art methods based on either paradigm typically saturate well below extreme compression regimes.

NORTON is motivated by the idea that these two strategies should be integrated rather than stacked mechanically. By operating on a shared representation (the factor matrices produced by filter-wise decomposition), NORTON enables decomposition and pruning to reinforce each other. This principled integration allows NORTON to achieve ultra-high compression rates, pushing network compression far beyond what either approach can reach independently.

🎯 Approach

NORTON is a unified network compression framework that tightly integrates filter-wise tensor decomposition and structured filter pruning in a sequential and mutually reinforcing manner. The key insight is that convolutional filters exhibit both low-rank and redundant structures, which can be exploited more effectively when these two paradigms are combined through a shared representation.

As illustrated in Figure 3, NORTON follows a two-phase pipeline applied simultaneously to all convolutional layers:

  1. Filter decomposition: Each convolutional filter is independently decomposed using Canonical Polyadic Decomposition (CPD), preserving the original multidimensional structure while yielding compact CPDBlocks composed of factor matrices.
  2. CPDBlock pruning: Redundant filters are identified and removed directly in the decomposed space using a distance metric based on Principal Angles Between Subspaces (PABS), which robustly handles the scaling and permutation ambiguities of CPD; a simplified sketch follows this list.
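
The sketch below gives a simplified, illustrative version of the pruning criterion using SciPy's subspace_angles; the exact scoring and selection rules used by NORTON are those described in the paper:

```python
import numpy as np
from scipy.linalg import subspace_angles

def pabs_distance(block_a, block_b):
    """Sum of principal angles between the column spaces of corresponding
    factor matrices (NumPy arrays), which makes the comparison insensitive
    to CPD's scaling and permutation ambiguities."""
    return sum(np.sum(subspace_angles(A, B)) for A, B in zip(block_a, block_b))

def redundancy_scores(cpd_blocks):
    """Score each CPDBlock by its mean PABS distance to all other blocks.
    Under this simplified criterion, a low score marks a filter that is
    close to the rest and hence a pruning candidate."""
    n = len(cpd_blocks)
    return np.array([
        np.mean([pabs_distance(cpd_blocks[i], cpd_blocks[j])
                 for j in range(n) if j != i])
        for i in range(n)
    ])
```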

By sharing the same factor matrices for both decomposition and pruning, NORTON achieves extreme compression rates while maintaining accuracy. A single fine-tuning stage is then sufficient to recover performance, resulting in highly compact models suitable for deployment on resource-constrained devices.

NORTON framework diagram
Figure 3: The NORTON framework, with filter-wise CP decomposition followed by CPDBlock pruning in the decomposed space.

🚀 Throughput acceleration

To evaluate NORTON's effectiveness on downstream tasks, we used our compressed ResNet-50/ImageNet model as the backbone for training Faster/Mask/Keypoint R-CNN on COCO. Our method shows promising results in terms of precision and recall while reaching higher compression levels than competing approaches. Remarkably, NORTON significantly improves inference throughput, yielding over a \(2\times\) gain in frames per second (FPS) compared to the baseline models. For instance, Faster R-CNN's end-to-end latency drops from 100 ms to 42 ms, a real-time framerate of roughly 24 FPS. These evaluations were conducted on an RTX 3060 GPU, providing evidence of the real-world applicability of our approach. The results highlight NORTON's potential as a valuable tool for enhancing neural network efficiency in demanding tasks such as real-world object detection, instance segmentation, and keypoint detection.
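
For readers who want to reproduce the throughput comparison on their own hardware, a rough timing sketch follows (PyTorch, illustrative, and not the paper's exact benchmarking protocol):

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, input_shape=(1, 3, 640, 640), n_iters=100, device='cuda'):
    """Rough end-to-end throughput estimate for a vision model."""
    model = model.eval().to(device)
    x = torch.randn(*input_shape, device=device)
    for _ in range(10):            # warm-up iterations
        model(x)
    if device == 'cuda':
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    if device == 'cuda':
        torch.cuda.synchronize()
    return n_iters / (time.perf_counter() - start)
```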

Figure 4: Baseline (left) vs. compressed (right) model inference.

🌈 Visualizing feature preservation

We complement the numerical efficiency results with a qualitative evaluation of feature preservation. Our analysis uses 5 randomly selected images from the ImageNet validation set and three compression levels applied to the original ResNet-50 model: 44%, 63%, and 79%. Using Grad-CAM for interpretation, we visually compare feature maps of the original and compressed models. The visualization underscores our framework's ability to retain crucial features across a diverse range of classes, and its consistency in capturing essential information at different compression ratios (CRs). This resilience suggests sustained effectiveness and reliability across varying scenarios and compression levels, making our framework a versatile choice for network compression across diverse applications and datasets.
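
For reference, Grad-CAM heatmaps like those in Figure 5 can be reproduced with a minimal hook-based sketch such as the one below (plain PyTorch, not the exact visualization code used here):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer, class_idx=None):
    """Minimal Grad-CAM: weight the target layer's activations by the
    spatially averaged gradient of the class score, then apply ReLU."""
    store = {}
    fwd = target_layer.register_forward_hook(
        lambda mod, inp, out: store.__setitem__('act', out))
    bwd = target_layer.register_full_backward_hook(
        lambda mod, gin, gout: store.__setitem__('grad', gout[0]))
    try:
        model.zero_grad()
        logits = model(x)                    # x: (1, 3, H, W)
        idx = class_idx if class_idx is not None else int(logits[0].argmax())
        logits[0, idx].backward()
    finally:
        fwd.remove()
        bwd.remove()
    weights = store['grad'].mean(dim=(2, 3), keepdim=True)   # (1, C, 1, 1)
    cam = F.relu((weights * store['act']).sum(dim=1))        # (1, h, w)
    return cam / cam.max().clamp(min=1e-8)

# Hypothetical usage for a torchvision ResNet-50:
# model = torchvision.models.resnet50(weights='IMAGENET1K_V1').eval()
# heatmap = grad_cam(model, img_tensor, model.layer4[-1])
```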

Figure 5: Qualitative assessment of feature preservation in compressed models (columns: input image, CR=0%, CR=50%, CR=64%, CR=78%).

🔖 Citation

If the code and paper help your research, please cite:

    @article{pham2025enhanced,
      title={Enhanced Network Compression Through Tensor Decompositions and Pruning},
      author={Pham, Van Tien and Zniyed, Yassine and Nguyen, Thanh Phuong},
      journal={IEEE Transactions on Neural Networks and Learning Systems},
      year={2025},
      volume={36},
      number={3},
      pages={4358--4370},
      doi={10.1109/TNNLS.2024.3370294}
    }

👍 Acknowledgements

This work was granted access to the high-performance computing resources of IDRIS under allocation 2023-103147 made by GENCI. Specifically, our experiments were conducted on the Jean Zay supercomputer, located at IDRIS, the national computing centre of the French National Centre for Scientific Research (CNRS).

We thank the Agence Nationale de la Recherche (ANR) for partially supporting our work through the ANR ASTRID ROV-Chasseur project (ANR-21-ASRO-0003).
