This paper explores the use of transformer-coupled (TC) technique for the 2:1 MUX and the 1:2 DEMUX to serialize-and-deserialize (SerDes) high-speed data sequence. The widely used current-mode logic (CML) designs of latch and multiplexer/demultiplexer (MUX/DEMUX) are replaced by the proposed TC approach to allow the more headroom and to lower the power consumption. Through the stacked transformer, the input clock pulls down the differential source voltage of the TC latch and the TC multiplexer core while alternating between the two-phase operations. With the enhanced drain-source voltage, the TC design attracts more drain current with less width-to-length ratio of NMOS than that of the CML counterpart. The source-offset voltage is decreased so that the supply voltage can be reduced. The lower supply voltage improves the power consumption and facilitates the integration with low voltage supply SerDes interface. The MUX and the DEMUX chips are fabricated in 65-nm standard CMOS process and operate at 0.7-V supply voltage. The chips are measured up to 40-Gb/s with sub-hundred milliwatts power consumption.
List of the following materials will be included with the Downloaded Backup:
A duty-cycle correction technique using a novel pulse width modification cell is demonstrated across a frequency range of 100 MHz–3.5 GHz. The technique works at frequencies where most digital techniques implemented in the same technology node fail. An alternative method of making time domain measurements such as duty cycle and rise/fall times from the frequency domain data is introduced. The data are obtained from the equipment that has significantly lower bandwidth than required for measurements in the time domain. An algorithm for the same has been developed and experimentally verified. The correction circuit is implemented in a 0.13-µm CMOS technology and occupies an area of 0.011 mm2. It corrects to a residual error of less than 1%. The extent of correction is limited by the technology at higher frequencies. The proposed architecture of this paper area and power consumption analysis using tanner tool.
List of the following materials will be included with the Downloaded Backup:
Abstract:
A new solution for an ultralow-voltage bulk driven (BD) asynchronous delta–sigma modulator is described in this paper. While implemented in a standard 0.18-µm CMOS process from the Taiwan Semiconductor Manufacturing Company and supplied with VDD = 0.3 V, the circuit offers a 53.3-dB signal-to-noise and distortion ratio, which corresponds to 8.56-bit resolution. In addition, the total power consumption is 37 nW, the signal bandwidth is 62 Hz, and the resulting power efficiency is 0.79 pJ/conversion. The above-mentioned features have been achieved employing a highly linear transconductor and a hysteretic comparator based on nontailed BD differential pair.
List of the following materials will be included with the Downloaded Backup:Abstract:
A low-phase-noise relaxation oscillator uses a digital compensation loop to reduce its temperature coefficient (TC). This relaxation oscillator is fabricated in the 0.18-µm CMOS process. The measured average oscillation frequency is 13.4 MHz. The whole oscillator consumes 157.8 µW under a 1.2-V supply. The measured average TCs of the oscillation frequency with and without compensation are 193.15 and 1098.7 ppm/◦C, respectively. The TC achieves an improvement of 5.7 times. The measured frequency variation is within ±2% from −20 ◦C to 100 ◦C by using the digital compensation loop. The measured phase noise at 100-kHz offset frequency is −104.82 dBc/Hz, and the measured figure of merit (FOM) is −154.4 dBc/Hz
List of the following materials will be included with the Downloaded Backup:Abstract:
This paper proposes a time-to-digital converter (TDC) that achieves wide input range and fine time resolution at the same time. The proposed TDC utilizes pulse-shrinking (PS) scheme in the second stage for a fine resolution and two-step (TS) architecture for a wide range. The proposed PS TDC prevents an undesirable non-uniform shrinking rate issue in the conventional PS TDCs by utilizing a built-in offset pulse and an offset pulse width detection schemes. With several techniques, including a built-in coarse gain calibration mechanism, the proposed TS architecture overcomes a nonlinearity due to the signal propagation and gain mismatch between coarse and fine stages. The simulation results of the TDC implemented in a 0.18-µm standard CMOS technology demonstrate 2.0-ps resolution and 16-bit range that corresponds to ∼130-ns input time interval with 0.08-mm2 area. It operates at 3.3 MS/s with 18.0 mW from 1.8-V supply and achieves 1.44-ps single-shot precision. Index Terms— Built-in calibration, pulse shrinking (PS), time-to-digital conversion, two step (TS).
List of the following materials will be included with the Downloaded Backup:Abstract:
A 2.5-V 8-bit low force and efficient Successive Approximation Register Analog-to-Digital converter (SAR-ADC) utilizing a Principled Open Loop Comparator (POLC) and Switched Multi-Threshold Complementary Metal Oxide Semiconductor (SMTCMOS) D-FF shift Register. In light of high proficiency and low force applications SAR-ADC is increasingly well known, yet it experience the ill effects of resolution and speed confinements. To defeat the above issue proposed a systematic methodology uses low force POLC based SAR-ADC is structured. Considering about the resolution, speed and compact design of 8- bit SAR-ADC, the proposed POLC strategy reasonably diminishes the propagation delay by 37% and decreases the force utilization by 62% appeared differently in relation to the standard system. A D-flip flop is planned to employ SMTCMOS procedure which has low force utilization and productively decline the leakage power. All the above circuits are simulated by using TANNER-EDA tool in 0.25μm CMOS technology produces 97% Efficiency.
List of the following materials will be included with the Downloaded Backup:Abstract:
Power analysis (PA) attacks have become a serious threat to security systems by enabling secret data extraction through the analysis of the current consumed by the power supply of the system. Embedded memories, often implemented with six-transistor (6T) static random access memory (SRAM) cells, serve as a key component in many of these systems. However, conventional SRAM cells are prone to side-channel power analysis attacks due to the correlation between their current characteristics and written data. To provide resiliency to these types of attacks, we propose a security-oriented 7T SRAM cell, which incorporates an additional transistor to the original 6T SRAM implementation and a two-phase write operation, which significantly reduces the correlation between the stored data and the power consumption during write operations. The proposed 7T SRAM cell was implemented in a 28 nm technology and demonstrates over 1000× lower write energy standard deviation between write ‘1’ and ‘0’ operations compared to a conventional 6T SRAM. In addition, the proposed cell has a 39%–53% write energy reduction and a 19%–38% reduced write delay compared to other power analysis resistant SRAM cells.
List of the following materials will be included with the Downloaded Backup:Abstract:
The latest video coding standard high-efficiency video coding (HEVC) provides 50% improvement in coding efficiency compared to H.264/AVC to meet the rising demands for video streaming, better video quality, and higher resolution. The deblocking filter (DF) and sample adaptive offset (SAO) play an important role in the HEVC encoder, and the SAO is newly adopted in HEVC. Due to the high throughput requirement in the video encoder, design challenges such as data dependence, external memory traffic, and on-chip memory area become even more critical. To solve these problems, we first propose an interlacing memory organization on the basis of quarter-LCU to resolve the data dependence between vertical and horizontal filtering of DF. The on-chip SRAM area is also reduced to about 25% on the basis of quarter-LCU scheme without throughput loss. We also propose a simplified bitrate estimation method of rate-distortion cost calculation to reduce the computational complexity in the mode decision of SAO. Our proposed hardware architecture of combined DF and SAO is designed for the HEVC intraencoder, and the proposed simplified bitrate estimation method of SAO can be applied to both intra- and intercoding. As a result, our design can support ultrahigh definition 7680 × 4320 at 40 f/s applications at merely 182 MHz working frequency. Total logic gate count is 103.3 K in 65 nm CMOS process.
List of the following materials will be included with the Downloaded Backup:Abstract:
In this article, a new solution for an ultralow-voltage (ULV) ultralow-power (ULP) operational transconductance amplifier (OTA) is presented. Thanks to the combination of a low-voltage bulk-driven nontailed differential stage with the multipath Miller zero compensation technique, a simple class AB power-efficient ULV structure has been obtained, which can operate from supply voltages less than the threshold voltages of the employed MOS transistors, while offering rail-to-rail input common-mode range at the same time. The proposed OTA was fabricated using the 180-nm CMOS process from Taiwan Semiconductor Manufacturing Company (TSMC) and can operate from VDD ranging from 0.3 to 0.5 V. The 0.3-V version dissipates only 12.6 nW of power while showing a 64.7-dB voltage gain at 1-Hz, 2.96-kHz gain-bandwidth product, and a 4.15-V/ms average slew-rate at 30-pF load capacitance. The measured results agree well with simulations.
List of the following materials will be included with the Downloaded Backup:Abstract:
There is an emerging need to design configurable accelerators for the high-performance computing (HPC) and artificial intelligence (AI) applications in different precisions. Thus, the floating-point (FP) processing element (PE), which is the key basic unit of the accelerators, is necessary to meet multiple-precision requirements with energy-efficient operations. However, the existing structures by using high-precision-split (HPS) and low-precision-combination (LPC) methods result in low utilization rate of the multiplication array and long multi term processing period, respectively. In this article, a configurable FP multiple-precision PE design is proposed with the LPC structure. Half precision, single precision, and double precision are supported. The 100% multiplier utilization rate of the multiplication array for all precisions is achieved with improved speed in the comparison and summation process. The proposed design is realized in a 28-nm process with 1.429-GHz clock frequency. Compared with the existing multiple-precision FP methods, the proposed structure achieves 63% and 88% areasaving performance for FP16 and FP32 operations, respectively. The 4× and 20× maximum throughput rates are obtained when compared with fixed FP32 and FP64 operations. Compared with the previous multiple-precision PEs, the proposed one achieves the best energy-efficiency performance with 975.13 GFLOPS/W.
List of the following materials will be included with the Downloaded Backup:Abstract:
In this paper, a double-error-correcting and triple error-detecting (DEC-TED) Bose–Chaudhuri–Hocquenghem (BCH) code decoder with high decoding efficiency and low power for error correction in emerging memories is presented. To increase the decoding efficiency, we propose an adaptive error correction technique for the DEC-TED BCH code that detects the number of errors in a codeword immediately after syndrome generation and applies a different error correction algorithm depending on the error conditions. With the adaptive error correction technique, the average decoding latency and power consumption are significantly reduced owing to the increased decoding efficiency. To further reduce the power consumption, an invalid-transition-inhibition technique is proposed to remove the invalid transitions caused by glitches of syndrome vectors in the error-finding block. Synthesis results with an industry-compatible 65-nm technology library show that the proposed decoders for the (79, 64, 6) BCH code take only 37%–48% average decoding latency and achieve more than 70% power reduction compared to the conventional fully parallel decoder under the 10−4–10−2 raw bit-error rate.
List of the following materials will be included with the Downloaded Backup:Abstract:
As the technology is getting advanced continuously the problem for the security of data is also increasing. The hackers are equipped with new advanced tools and techniques to break any security system. Therefore people are getting more concern about data security. The data security is achieved by either software or hardware implementations. In this work Field Programmable Gate Arrays (FPGA) device is used for hardware implementation since these devices are less complex, more flexible and provide more efficiency. This work focuses on the hardware execution of one of the security algorithms that is the Advanced Encryption Standard (AES) algorithm. The AES algorithm is executed on Vivado 2014.2 ISE Design Suite and the results are observed on 28 nanometers (nm) Artix-7 FPGA. This work discusses the design implementation of the AES algorithm and the resources consumed in implementing the AES design on Artix-7 FPGA. The resources which are consumed are as follows- Slice Register (SR), Look-Up Tables (LUTs), Input/Output (I/O) and Global Buffer (BUFG).
List of the following materials will be included with the Downloaded Backup:Abstract:
A floating-point fused dot-product unit is presented that performs single-precision floating-point multiplication and addition operations on two pairs of data in a time that is only 150% the time required for a conventional floating-point multiplication. When placed and routed in a 45nm process, the fused dot-product unit occupied about 70% of the area needed to implement a parallel dot-product unit using conventional floating-point adders and multipliers. The speed of the fused dot-product is 27% faster than the speed of the conventional parallel approach. The numerical result of the fused unit is more accurate because one rounding operation is needed versus at least three for other approaches.
List of the following materials will be included with the Downloaded Backup:Abstract:
As a traditional digital platform, Field Programmable Gate Array (FPGA) is seldom used for analog applications. Since there is no way to fine tune the gate property or circuit structure, the performance of FPGA analog application is usually inferior to its counterparts based on full-custom or even cell-based design. Nevertheless, a high performance FPGA time-to-digital Converter (TDC) is proposed in this paper to expand the FPGA territory into high-end analog applications. The test time signal is sampled by a serious timing references generated by feeding the original clock into a tapped delay line. According to periodicity, the delays among those timing references are wrapped into a single reference period and the effective TDC resolution can be made much smaller than the clock period to compete even with the state-of the art full-custom TDCs in performance. After measurement, the effective resolution is as fine as 2.5 ps. The corresponding differential nonlinearity (DNL) is -1.90~1.66 LSB and the integral nonlinearity (INL) is -3.79~6.53 LSB only.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
Cyclic Redundancy Check (CRC) is widely used for transmission error detection in various communication interfaces. As the transmission rate increases, accelerating CRC with lower resource consumption for high-speed interfaces becomes significant. This paper analyzes and implements a typical CRC algorithm (Stride-x) and designs a padding-zero strategy to support the input data length with multiples of byte. Besides, experiments are conducted to validate the proposed algorithm on Xilinx FPGA platforms. When stride is 1, the proposed algorithm outperforms a typical parallel CRC algorithm in throughput and resource consumption with various input bus widths (32/128/256 bits).
List of the following materials will be included with the Downloaded Backup:Abstract:
In this paper, we propose a low-power high-speed pipeline multiply-accumulate (MAC) architecture. In a conventional MAC, carry propagations of additions (including additions in multiplications and additions in accumulations) often lead to large power consumption and large path delay. To resolve this problem, we integrate a part of additions into the pa rtial product reduction (PPR) process. In the proposed MAC architecture, the addition and accumulation of higher significance bits are not performed until the PPR process of the next multiplication. To correctly deal with the overflow in the PPR process, a small-size adder is designed to accumulate the total number of carries. Compared with previous works, experimental results show that the proposed MAC architecture can greatly reduce both power consumption and circuit area under the same timing constraint.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
True random number generators (TRNGs) are fundamentals in many important security applications. Though they exploit randomness sources that are typical of the analog domain, digital-based solutions are strongly required especially when they have to be implemented on Field Programmable Gate Array (FPGA)-based digital systems. This paper describes a novel methodology to easily design a TRNG on FPGA devices. It exploits the runtime capability of the Digital Clock Manager (DCM) hardware primitives to tune the phase shift between two clock signals. The presented auto-tuning strategy automatically sets the phase difference of two clock signals in order to force on one or more flip-flops (FFs) to enter the metastability region, used as a randomness source. Moreover, a novel use of the fast carry-chain hardware primitive is proposed to further increase the randomness of the generated bits. Finally, an effective on-chip post-processing scheme that does not reduce the TRNG throughput is described. The proposed TRNG architecture has been implemented on the Xilinx Zynq XC7Z020 System on Chip (SoC). It passed all the National Institute of Standards and Technology (NIST) SP 800-22 statistical tests with a maximum throughput of 300×106 bit per second. The latter is considerably higher than the throughput of other previously published DCMbased TRNGs.
List of the following materials will be included with the Downloaded Backup:Abstract:
In a memory system, understanding how the host is stressing the memory is important to improve memory performance. Accordingly, the need for the analysis of memory command trace, which the memory controller sends to the dynamic random access memory, has increased. However, the size of this trace is very large; consequently, a high-throughput hardware (HW) accelerator that can efficiently compress these data in real time is required. This paper proposes a high throughput HW accelerator for lossless compression of the command trace. The proposed HW is designed in a pipeline structure to process Huffman tree generation, encoding, and stream merge. To avoid the HW cost increase owing to high throughput processing, a Huffman tree is efficiently implemented by utilizing static random access memory-based queues and bitmaps. In addition, variable length stream merge is performed at a very low cost by reducing the HW wire width using the mathematical properties of Huffman coding and processing the metadata and the Huffman codeword using FIFO separately. Furthermore, to improve the compression efficiency of the DDR4 memory command, the proposed design includes two preprocessing operations, the “don’t care bits override” and the “bits arrange,” which utilize the operating characteristics of DDR4 memory. The proposed compression architecture with such preprocessing operations achieves a high throughput of 8 GB/s with a compression ratio of 40.13% on average. Moreover, the total HW resource per throughput of the proposed architecture is superior to the previous implementations.
List of the following materials will be included with the Downloaded Backup:Abstract:
In this brief, a high-throughput Huffman encoder VLSI architecture based on the Canonical Huffman method is proposed to improve the encoding throughput and decrease the encoding time required by the Huffman code word table construction process. We proposed parallel computing architectures for frequency-statistical sorting and code-size computational sorting. This architecture results in a process of building a tree and assigning symbols that can be completed by scanning the data only once. This solves the problem of the low efficiency of the traditional algorithm, which needs to scan the data twice. Consequently, in addition to the advantages of the high compression ratio inherited from the Canonical Huffman, the proposed architecture has overridden advantages for a high parallelism processing capacity. The experimental results showed that the proposed architecture decreased the encoding time by 26.30% compared to the available Huffman encoder using the standard algorithm when encoding 256 8-bit symbols. Furthermore, the VLSI architecture could further decrease the encoding time when encoding more 8-bit symbols. In particular, when encoding 212,642 8-bit symbols, the proposed VLSI architecture could reduce the encoding time by 87.40%. Thus, compared with the traditional Huffman encoders, this brief achieved the improvement of coding efficiency.
List of the following materials will be included with the Downloaded Backup:Abstract:
A novel low-complexity multiple-input multiple-output (MIMO) detector tailored for single-carrier frequency division-multiple access (SC-FDMA) systems, suitable for efficient hardware implementations. The proposed detector starts with an initial estimate of the transmitted signal based on a minimum mean square error (MMSE) detector. Subsequently, it recognizes less reliable symbols for which more candidates in the constellation are browsed to improve the initial estimate. The proposed architecture of this paper analysis the logic size, area and power consumption using Xilinx 14.2.
List of the following materials will be included with the Downloaded Backup:Abstract:
A novel type of highly efficient conditional feed through pulse-triggered flip-flop (P-FF) is proposed and demonstrated. The data-to-output (D-to-Q) delay in this circuit was highly optimized using pre discharging and conditional signal feed through schemes. Power consumption was also reduced using a shared pulse generator and an output feedback-controlled conditional keeper, which diminished the floating status of the internal node. The driving strength of this design was further enhanced by including an additional pull-down path at the output node. Various post layout simulation results applied to 16-nm Fin FET technology demonstrated a higher energy efficiency (at all input data toggle rates) for the proposed topology than comparable P-FF devices. Notably, the proposed model achieved a 62% D-to-Q delay reduction, compared to a transmission gate FF, outperforming the device by more than 66% in terms of power efficiency and 87% in energy efficiency (at a 50% input data toggle rate). Improvements were even more significant in comparison with other conventional P-FFs. These results suggest the proposed design to be a viable new option for high-efficiency sequential elements in high-speed applications.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
There are many schemes proposed to protect integrated circuits (ICs) against an unauthorized access and usage, or at least to mitigate security risks. They lay foundations for hardware roots of trust whose crucial security primitives are generators of truly random numbers. In particular, such generators are used to yield one-time challenges (nonces) supporting the IC authentication protocols employed to counteract potential threats such as untrusted users accessing ICs. However, IC vendors raise several concerns regarding the complexity of these solutions, both in terms of area overhead, the impact on the design flow, and testability. These concerns have motivated this work presenting a simple, yet effective, all-digital lightweight and self-testable random number generator to produce a nonce. It builds on a generic ring generator architecture, i.e., an area and time optimized version of a linear feedback shift register, driven by a multiple-output ring oscillator. A comprehensive evaluation, based on three statistical test suits from NIST and BSI, show feasibility and efficiency of the proposed scheme and are reported herein.
List of the following materials will be included with the Downloaded Backup:Abstract:
This brief presents a low-complexity I/Q (in-phase and quadrature components) imbalance calibration method for the transmitter using quadrature modulation. Impairments in analog quadrature modulator have a deleterious effect on the signal fidelity. Among the critical impairments, I/Q imbalance (gain and phase mismatches) deteriorates the residual sideband performance of the analog quadrature modulator degrading the error vector magnitude. Based on the theoretical mismatch analysis of the quadrature modulator, we propose a low-complexity I/Q imbalance extraction algorithm. After the parameter extraction, the transmitter is calibrated by imposing the counter imbalanced mismatch of the transmitter through the digital baseband. In comparison with existing I/Q imbalance calibration methods, the novelty of the proposed method lies in that: 1) only three spectrum measurements of the device-under-test are needed for extraction and calibration of gain and phase mismatches; 2) due to the blind nature of the calibration algorithm, the proposed approach can be readily applicable to an existing I/Q transmitter; 3) no extra hardware that degrades the calibration accuracy is required; and 4) due to the non-iterative nature, the proposed method is faster and computationally more efficient than previously published methods.
List of the following materials will be included with the Downloaded Backup:Abstract:
For video applications in a special environment such as medical imaging, space exploration, and underwater exploration, the video captured by an image sensor is often deteriorated because of low lighting conditions. Therefore, it is necessary to enhance the part of the image that is too dark to distinguish details while maintaining the remaining part with the same brightness. The retinex algorithm is widely used to restore naturalness of a video, especially exhibiting outstanding performance in the enhancement of a dark area. However, it demands large computational complexity because of its intricate structure, such as the Gaussian filter and exponentiation operations, and consequently, it is difficult to process in real time. This article presents a low-cost and high-throughput design of the retinex video enhancement algorithm. The hardware (HW) design is implemented using a field-programmable gate array (FPGA), and it supports a throughput of 60 frames/s for a 1920 × 1080 image with negligible latency. The proposed FPGA design minimizes HW resources while maintaining the quality and the performance by using a small line buffer instead of a frame buffer, by applying the concept of approximate computing for the complex Gaussian filter, and by designing a new and nontrivial exponentiation operation. The proposed design makes it possible to significantly reduce HW resources (up to 79.22% of total resources) compared to existing systems and is compatible with commercialized devices through the standard HDMI/DVI video ports.
List of the following materials will be included with the Downloaded Backup:Abstract:
In this brief, a fast and very low power voltage level shifter (LS) is presented. By using a new regulated cross-coupled (RCC) pull-up network, the switching speed is boosted and the dynamic power consumption is highly reduced. The proposed (LS) has the ability to convert input signals with voltage levels much lower than the threshold voltage of a MOS device to higher nominal supply voltage levels. The presented LS occupies a small silicon area owing to its very low number of elements and is ultra-low-power, making it suitable for low-power applications such as implantable medical devices and wireless sensor networks. Results of the post-layout simulation in a standard 0.18-μm CMOS technology show that the proposed circuit can convert up input voltage levels as low as 80 mV. The power dissipation and propagation delay of the proposed level shifter for a low/high supply voltages of 0.4/1.8 V and input frequency of 1 MHz are 123.1 nW and 23.7 ns, respectively.
List of the following materials will be included with the Downloaded Backup:Abstract:
A nanopower CMOS 4th-order lowpass filter suitable for biomedical applications is presented. The filter is formed by cascading two types of subthreshold current-reuse biquadratic cell. Each proposed cell is capable of neutralizing the bulk effect that induces the passband attenuation. The nearly 0-dB passband gain can thus be maintained, while the entire filter circuit remains compact and power-efficient. Designed for electrocardiogram detection as an example of application, the filter prototype has been fabricated in a 0.35 µm CMOS process occupying 269 µm × 383 µm chip area. Measurements verify that the filter can operate from a 1.5-V single supply and consumes 5.25 nW, while providing a cutoff frequency of 100 Hz and input-referred noise of 39.38 µVrms. The intermodulation-free dynamic range of 51.48 dB is obtained from a two-tone test of 50 and 60 Hz input frequencies. Compared with state-of-the-art nanopower lowpass filters using the most relevant and reasonable figure of merit, the proposed filter ranks the best.
List of the following materials will be included with the Downloaded Backup:Abstract:
As the device dimension is shrinking day by day the conventional transistor based CMOS technology encounters serious hindrances due to the physical barriers of the technology such as ultra-thin gate oxides, short channel effects, leakage currents & excessive power dissipation at nano scale regimes. Quantum Dot Cellular Automata is an alternate challenging quantum phenomenon that provides a completely different computational platform to design digital logic circuits using quantum dots confined in the potential well to effectively process and transfer information at nano level as a competitor of traditional CMOS based technology. This paper has demonstrated the implementation of circuits like D, T and JK flip flops using a derived expression from SR flip-flop. The kink energy and energy dissipations has been calculated to determine the robustness of the designed flip-flops. The simulation results have been verified using QCA Designer simulation tool.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
Electronic devices are necessary in small spaces in order to provide fast speed and low power consumption. Arithmetic operations determine how quickly electronics operate. In many applications involving VLSI signal processing, multiplication is a necessary arithmetic operation. Thus, to create any kind of signal processing module, a high-speed multiplier is a prerequisite. Every individual has different needs and goals, which has led to the development of different multipliers according to the need of application. In this paper, a Hybrid multiplier is proposed and designed using hybrid adders which is a mixture of Brent Kung adder and Kogge Stone adder which results in less delay i.e. 4.062ns compared to other multipliers existed.
List of the following materials will be included with the Downloaded Backup:Abstract:
In-memory computing using emerging technologies such as resistive random-access memory (ReRAM) addresses the ‘von Neumann bottleneck’ and strengthens the present research impetus to overcome the memory wall. While many methods have been recently proposed to implement Boolean logic in memory, the latency of arithmetic circuits (adders and consequently multipliers) implemented as a sequence of such Boolean operations increases greatly with bit-width. Existing in-memory multipliers require O(n2) cycles which is inefficient both in terms of latency and energy. In this work, we tackle this exorbitant latency by adopting Wallace Tree multiplier architecture and optimizing the addition operation in each phase of the Wallace Tree. Majority logic primitive was used for addition since it is better than NAND/NOR/IMPLY primitives. Furthermore, high degree of gate-level parallelism is employed at the array level by executing multiple majority gates in the columns of the array. In this manner, an in-memory multiplier of O(n.log(n)) latency is achieved which outperforms all reported in-memory multipliers. Furthermore, the proposed multiplier can be implemented in a regular transistor-accessed memory array without any major modifications to its peripheral circuitry and is also energy-efficient.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
In this paper, we propose a reduced complexity parallel least mean square structure (RC-pLMS) for adaptive beamforming and its pipelined hardware implementation. RC-pLMS is formed by two least mean square (LMS) stages operating in parallel (pLMS), where the overall error signal is derived as a combination of individual stage errors. The pLMS is further simplified to remove the second independent set of weights resulting in a reduced complexity pLMS (RC-pLMS) design. In order to obtain a pipelined hardware architecture of our proposed RC-pLMS algorithm, we applied the delay and sum relaxation technique (DRC-pLMS). Convergence, stability and quantization effect analysis are performed to determine the upper bound of the step size and assess the behavior of the system. Computer simulations demonstrate the outstanding performance of the proposed RC-pLMS in providing accelerated convergence and reduced error floor while preserving a LMS identical O(N) complexity, for an antenna array of N elements. Synthesis and implementation results show that the proposed design achieves a significant increase in the maximum operating frequency over other variants with minimal resource usage. Additionally, the resulting beam radiation pattern show that the finite precision DRC-pLMS implementation presents similar behavior of the infinite precision theoretical results.
List of the following materials will be included with the Downloaded Backup:Abstract:
The main aim of the Single image (SR) super-resolution is to generate (HR) high-resolution images from (LR) low-resolution images. This paper briefly presents a concept of real time super resolution method of FHD based image extended and scaling processor. The super resolution system includes three blocks of operations. The first is a low-frequency interpolation stage, where bicubic interpolation is used for reconstructing the low-frequency parts of HR images. The second stage generates high-frequency patches by choosing the highest related pre-trained regression function according to each HR low frequency patch. In the third stage, with the high-frequency information, the low-frequency image patches are enhanced and overlapped to construct the SR result. These operations for gaining a high-frequency result are applied to the Y-luminance channel only, while the high-resolution Cb and Cr channels are generated by bicubic interpolation. The proposed system generates the output image resolution of 1920 X 1080 (FHD) by the input of 800 X 800 image size. The proposed architecture performs an anchored neighborhood regression algorithm that generates a high-resolution image from a low-resolution image input using only numbers of line buffers. Finally, super resolution technique is implemented in VHDL and Synthesized in the XILINX VERTEX-5 FPGA and shown the comparison for power, area and delay reports.
List of the following materials will be included with the Downloaded Backup:Abstract:
This paper explores a low standby power 10T (LP10T) SRAM cell with high read stability and write-ability (RSNM/WSNM/WM). The proposed LP10T SRAM cell uses a strong cross-coupled structure consisting standard inverter with a stacked transistor and Schmitt-trigger inverter with a double-length pull-up transistor. This along with the read path separated from true internal storage nodes eliminates the read-disturbance. Furthermore, it performs its write operation in pseudo differential form through write bit line and control signal with a write-assist technique. To estimate the proposed LP10T SRAM cell’s performance, it is compared with some state-of-the-art SRAM cells using HSPICE in 16-nm CMOS predictive technology model at 0.7 V supply voltage under harsh manufacturing process, voltage, and temperature variations. The proposed SRAM cell offers 4.65X/1.57X/1.46X improvement in RSNM/WSNM/WM and 4.40X/1.69X narrower spread in RSNM/WM compared to the conventional 6T SRAM cell. Furthermore, it shows 1.26X/1.08X/1.01X higher RSNM/WSNM/WM and 1.71X/1.25X tighter/wider spread in RSNM/WM compared to the best studied SRAM cells. The proposed SRAM cell indicates 74.48%/1.41% higher/lower read/write delay compared to the 6T SRAM cell. Moreover, it exhibits the third-(second-) best read (write) dynamic power, consuming 29.69% (26.87%) lower than the 6T SRAM cell. The leakage power is minimized by the proposed design, which is 37.35% and 12.08% lower than that of the 6T and best studied cells, respectively. Nonetheless, the proposed LP10T SRAM cell occupies 1.313X higher area compared to the 6T SRAM cell.
List of the following materials will be included with the Downloaded Backup:Proposed Abstract:
This paper presents the design and FPGA implementation of a 16-bit reversible processor architecture employing Fredkin, Feynman, and PERES gate architectures for reversible logic design. Reversible computing offers promising advantages in terms of energy efficiency and information loss prevention, making it suitable for various emerging computing paradigms. The proposed processor architecture encompasses a carefully crafted instruction set, data path, and control logic, all realized using reversible logic gates. Key components such as the ALU, register file, and memory elements are designed with an emphasis on reversibility. The design is implemented using Hardware Description Languages (HDLs), targeting a specific FPGA platform. The paper outlines the design methodology, gate-level implementation details, memory design considerations, FPGA synthesis, and testing procedures. Furthermore, it discusses optimization strategies and presents simulation results to validate the functionality and efficiency of the proposed reversible processor architecture. This work contributes to the advancement of reversible computing and provides insights into the practical realization of reversible processor architectures on FPGA platforms.
List of the following materials will be included with the Downloaded Backup:Abstract:
One of the main motivations for using ternary logic systems is the amount of information per circuit line is higher as compared to the corresponding binary logic representation, thereby leading to more compact circuit realizations. This is particularly attractive for quantum computing as quarts are expensive resources and minimizing their number is one of the main objectives during synthesis. Therefore, ternary reversible logic synthesis has drawn significant attention among researchers. It deals with fundamental unit of information called quarts that can exist in one of the three states |0, |1 and |2. Hence, the aim of this paper is to bridge the knowledge gap for the beginners in this domain than searching the entire space. Therefore, the present work discusses the basic concepts of ternary reversible logic and ternary reversible gates. The detailed discussion of the various ternary reversible logic synthesis will enable the beginners in this domain to understand the ternary reversible logic in a better way.
List of the following materials will be included with the Downloaded Backup:Abstract:
This brief presents a low-power and high-precision bandgap voltage and current reference (BGVCR) in one simple circuit for battery-powered applications. All the amplifiers have been eliminated in the proposed circuit. The voltage reference is derived from the bandgap topology, and the current reference is obtained by summing a proportional-to-absolute-temperature (PTAT) current and a complementary-to-absolute-temperature (CTAT) current. Therefore, the temperature coefficient of the current reference can be optimized. Besides, a pseudo-cascode structure and a simple line sensitivity enhancement circuit are adopted to improve the current mirror accuracy and line sensitivity. The proposed circuit is fabricated in a 0.18-μm deep N-well CMOS process with an active area of 0.063 mm2. The measured VREF and IREF are 1.2 V and 51 nA, respectively. The VREF and IREF show measured average temperature coefficients of 32.7 ppm/℃ and 89 ppm/℃ at a temperature of -45 to 125 ℃ and standard deviations of 0.17 % and 1.15 %, respectively. In the supply voltage range of 2 to 5 V, the line sensitivities of voltage and current are 0.058%/V and 1.76%/V, respectively. The minimum supply voltage is 2 V with a total power consumption of 192 nW at room temperature.
List of the following materials will be included with the Downloaded Backup:Abstract:
We present a novel generalization of quadrature oscillators (QVCO) which we call “arbitrary phase oscillator” or APO for short. In contrast to a QVCO which generates only quadrature phases, the APO is capable of continuously generating any desired phase at its output. The proposed structure employs a novel coupling mechanism to generate arbitrary phase shifts between two coupled oscillators without the need for an explicit phase shifter. A rigorous nonlinear dynamic analysis is presented to give a closed-form formula for the generated phase shifts, and the theory is verified by numerical simulation as well as measurement results of a prototype chip fabricated in 130-nm CMOS technology. The prototype APO has a frequency tuning range of 4.90–5.65 GHz and is continuously phase tunable from 0◦ to 360◦ across the entire frequency range. The APO structure can be used in designing novel coupled-oscillator-based phased arrays for 5G wireless communications.
List of the following materials will be included with the Downloaded Backup:Abstract:
This brief presents a three-stage comparator and its modified version to improve the speed and reduce the kickback noise. Compared to the traditional two-stage comparators, the three-stage comparator in this work has an extra amplification stage, which enlarges the voltage gain and increases the speed. Unlike the traditional two-stage structure that uses pMOS input pair in the regeneration stage, the three-stage comparator makes it possible to use nMOS input pairs in both the regeneration stage and the amplification stage, further increasing the speed. Furthermore, in the proposed modified version of three-stage comparator, a CMOS input pair is adopted at the amplification stage. This greatly reduces the kickback noise by canceling out the nMOS kickback through the pMOS kickback. It also adds an extra signal path in the regeneration stage, which helps increase the speed further. For easy comparison, both the conventional two-stage and the proposed three-stage comparators are implemented in the same 130-nm CMOS process. Measured results show that the modified version of three-stage comparator improves the speed by 32%, and decreases the kickback noise by ten times. This improvement is not at the cost of increased input referred offset or noise.
List of the following materials will be included with the Downloaded Backup:Abstract:
In this paper, we present a two-speed, radix-4, serial-parallel multiplier for accelerating applications such as digital filters, artificial neural networks, and other machine learning algorithms. Our multiplier is a variant of the serial–parallel (SP) modified radix-4 Booth multiplier that adds only the nonzero Booth encodings and skips over the zero operations, making the latency dependent on the multiplier value. Two sub circuits with different critical paths are utilized so that throughput and latency are improved for a subset of multiplier values. The multiplier is evaluated on an Intel Cyclone V field-programmable gate array against standard parallel–parallel and SP multipliers across four different process–voltage–temperature corners. We show that for bit widths of 32 and 64, our optimizations can result in a 1.42×–3.36× improvement over the standard parallel Booth multiplier in terms of area–time depending on the input set.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
During smart long-term monitoring of any biomedical signal in wireless body area networks, wearable sensor nodes generate and transmit a large amount of data, increasing transmission power consumption. In order to reduce data storage and power consumption, a lossless data compression technique for an electrocardiogram signal monitoring system is presented in this letter. For this, a hybrid lossless compression algorithm based on Run-length coding and Golomb–Rice coding is proposed to enhance the bit compressing rate. The lossless encoding scheme is implemented on the MIT-BIH arrhythmia database, achieving a compression ratio of 2.91. A VLSI-based architecture of the data compression algorithm is implemented in 90nm CMOS technology that consumes power of 18.78 µW at 100 MHz operating frequency and 1.2 V supply voltage, occupying an area of 0.0051 mm2.
List of the following materials will be included with the Downloaded Backup:Abstract:
Low-precision arithmetic operations to accelerate deep-learning applications on field-programmable gate arrays (FPGAs) have been studied extensively, because they offer the potential to save silicon area or increase throughput. However, these benefits come at the cost of a decrease in accuracy. In this article, we demonstrate that reconfigurable constant coefficient multipliers (RCCMs) offer a better alternative for saving the silicon area than utilizing low-precision arithmetic. RCCMs multiply input values by a restricted choice of coefficients using only adders, subtractors, bit shifts, and multiplexers (MUXes), meaning that they can be heavily optimized for FPGAs. We propose a family of RCCMs tailored to FPGA logic elements to ensure their efficient utilization. To minimize information loss from quantization, we then develop novel training techniques that map the possible coefficient representations of the RCCMs to neural network weight parameter distributions. This enables the usage of the RCCMs in hardware, while maintaining high accuracy. We demonstrate the benefits of these techniques using AlexNet, ResNet-18, and ResNet-50 networks. The resulting implementations achieve up to 50% resource savings over traditional 8-bit quantized networks, translating to significant speedups and power savings. Our RCCM with the lowest resource requirements exceeds 6-bit fixed point accuracy, while all other implementations with RCCMs achieve at least similar accuracy to an 8-bit uniformly quantized design, while achieving significant resource savings.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
Advanced Encryption Standard (AES) algorithm plays an important role in a data security application. In general S-box module in AES will give maximum confusion and diffusion measures during AES encryption and cause significant path delay overhead. In most cases, either LUTs or embedded memories are used for S- box computations which are vulnerable to attacks that pose a serious risk to real-world applications. In this paper, implementation of the composite field arithmetic-based Sub-bytes and inverse Sub-bytes operations in AES is done. The proposed work includes an efficient multiple round AES cryptosystem with higher-order transformation and composite field s-box formulation with some possible inner stage pipelining schemes which can be used for throughput rate enhancement along with path delay optimization. Finally, input biometric-driven key generation schemes are used for formulating the cipher key dynamically, which provides a higher degree of security for the computing devices.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
In this brief an approach is proposed to achieve energy savings from reduced voltage operation. The solution detects timing-errors by integrating Algorithm Based Fault Tolerance (ABFT) into a digital architecture. The approach has been studied with a systolic array matrix multiplier operating at reduced voltages, detecting errors on-the-fly to avoid energy demanding memory round-trips. The analysis of the solution has been done using analog-digital co-simulation to extract the transient behavior under different voltages and clock frequencies. HSPICE simulations using 90nm CMOS transistor models, and experiments by reducing operation voltage of an FPGA device were carried out. HSPICE simulations, showed possibility of 10x increase in energy-efficiency by approaching near-threshold region.
List of the following materials will be included with the Downloaded Backup:Abstract:
A low-complexity analog technique to suppress the local oscillator (LO) harmonics in software-defined radios is presented. Accurate mathematical analyses show that an effective attenuation of the LO harmonics is achieved by modulating the transconductance of the low-noise transconductance amplifier (LNTA) with a raised-cosine signal. This modulation is performed through the bias network of a cascode device with a negligible increase in the LNTA noise figure. The proposed technique results in a notch at the third harmonic and at least 36 dB of attenuation at the fifth and the seventh harmonics. Experimental results in 130-nm CMOS and post layout simulation results in 65-nm CMOS verify the proper functionality of the proposed technique and the accuracy of the proposed analyses
List of the following materials will be included with the Downloaded Backup:Abstract:
Cryptography systems have become inseparable parts of almost every communication device. Among cryptography algorithms, public-key cryptography, and in particular elliptic curve cryptography (ECC), has become the most dominant protocol at this time. In ECC systems, polynomial multiplication is considered to be the most slow and area consuming operation. This article proposes a novel hardware architecture for efficient field-programmable gate array (FPGA) implementation of Finite field multipliers for ECC. Proposed hardware was implemented on different FPGA devices for various operand sizes, and performance parameters were determined. Comparing to state-of-the art works, the proposed method resulted in a lower combinational delay and area–delay product indicating the efficiency of design.
List of the following materials will be included with the Downloaded Backup:Abstract:
Today, reversible logic can be used for designing low-power CMOS circuits, optical data processing, DNA computations, biological researches, quantum circuits and nanotechnology. Sometimes using of reversible logic is inevitable such as build quantum computers. Reversible logic circuits structure is much more complicated than irreversible logic circuits. Multiplication operation is considered as one of the most important operations in the ALU unit. In this paper, we have proposed two 4×4 reversible unsigned multiplier circuits in which Wallace tree method is used to reduce the depth of circuits. In first design, the partial products circuit is designed using TG and FG gates so that TG is used to produce the partial products and FG for fan-out. In the second design, TG and PG gates are used to produce the partial products and no fan-out is required. Moreover, we have used PG gate and Feynman' block as reversible half-adder (HA) and full-adder (FA) in the summation network, respectively. In the first design, the main purpose is to decrease the depth of the circuit and increase the circuit speed. In the second design we would attempt to improve quantum parameters the number of garbage outputs, constant inputs and quantum cost. The evaluation results show that the first design, in terms of delay, is the fastest circuit. Also, the second design in terms of the number of constant inputs, garbage outputs and quantum cost is better than other designs.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
In fact, as a traditional encryption method, DES has been certified as an unsuitable tool for ciphering due to its smaller key space. Further, in concern of the real-time encryption in the current fast communication era, such as 5G, long-time as well as large computational level processes are not gotten into the consideration. As a result, an innovative encryption structure with hyperchaotic keys for efficient encryption is constructed, where the frame of DES structure is applied, the plain image is shuffled through row and column directions in the first round, and then rearranged to be 64 blocks to fit into the frame of DES structure for 4 rounds ciphering with hyperchaotic subkeys. Also, in order to encrypt the content of the image at the block level, a set of alternative S-box has been produced in this article as well. The simulation results indicate that the proposed scheme is feasible and reliable for digital image encrypting, not only a large key space can be obtained, but also the low correlation of the adjacent contents can be achieved, and further, in comparison of several existing approaches, less-computational resource can be proven as well. In particular, due to the innovative DES structure, the computational speed is significantly faster than the original DES algorithm and many other chaos-based image ciphering schemes.
List of the following materials will be included with the Downloaded Backup:Abstract:
In this paper we describe an efficient implementation of an IEEE 754 single precision floating point multiplier targeted for Xilinx Virtex-5 FPGA. VHDL is used to implement a technology-independent pipelined design. The multiplier implementation handles the overflow and underflow cases. Rounding is not implemented to give more precision when using the multiplier in a Multiply and Accumulate (MAC) unit. With latency of three clock cycles the design achieves 301 MFLOPs. The multiplier was verified against Xilinx floating point multiplier core.
List of the following materials will be included with the Downloaded Backup:Abstract:
The modern real time applications related to image processing and etc., demand high performance discrete wavelet transform (DWT). This paper proposes the floating point multiply accumulate circuit (MAC) based 1D/2D-DWT, where the MAC is used to find the outputs of high/low pass FIR filters. The proposed technique is implemented with 45 nm CMOS technology and the results are compared with various existing techniques. The proposed 8 × 8-point floating point 2-levels 2D-DWT achieves 27.6% and 83.7% of reduction in total area and net power respectively as compared with existing DWT.
List of the following materials will be included with the Downloaded Backup:Proposed Abstract:
In image processing and computer vision, pixel shuffling is a method used to increase an image's resolution without adding more parameters or network complexity. With this technique, a low-quality image's pixels are rearranged to produce an output with a better resolution. Pixel shuffling has proven successful in a number of applications, such as image synthesis, super-resolution, and style transfer. Its simplicity and efficiency make it an attractive option for tasks where increasing image resolution is essential, while avoiding the computational overhead associated with more complex architectures. The image line buffer based pixel shuffling technique presented in this study is an alternative to the classic method, which takes up more logic space in VLSI implementations. This proposed method splits and reconstructs the source photos using a 5x5 image line buffer. With the use of block interleave techniques, this pixel shuffling approach handled row and column sequence using this 5x5 picture line buffer. In conclusion, this study was compared with the PSNR and SSIM value; comparisons of logic sizes for area, latency, and power were also examined.
List of the following materials will be included with the Downloaded Backup:Abstract:
In this work, two approaches to realize a look up table (LUT) based finite impulse response (FIR) filter using Residue Number System (RNS) are proposed. The proposed implementations take advantage of shift and add approach offered by the chosen module set. The two proposed filter architecture are compared with an earlier proposed version of reconfigurable RNS FIR filter. The filters are synthesized using Cadence RTL compiler in UMC 90 nm technology. The performance of the filters are compared in terms of Area (A), Power (P), and Delay (T). The results show that one of the proposed architecture offers significant improvement in terms of delay, while the second approach is well suited for applications that require minimal power and area. Both implementations offer advantage in area delay and power-delay-product. Proposed approaches are also verified functionally using Altera DSP Builder.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
FPGA is familiar with prototyping and implementing simple to complex DSP systems. The FPGA based design may be highly affected by factors that include selection of an FPGA board, Electronic Design Automation Tool and the Programming Techniques to optimize the algorithm. The algorithm optimization results in a more compact design regarding the area and achieved frequency. In DSP algorithms optimization, the major bottleneck is the multiplier complexity evident in, for example - FIR, IIR, FFT, and others. Research shows much work on multiplier optimization. Despite all possible optimization techniques, the multiplier consumes tremendous resources when translated on hardware, with more power consumption and observed delay. The proposed work is novel in that it brings resources optimization in a familiar shift and add multiplier algorithm by implementing the design in FPGA and comparing the results with the existing shift, and add a multiplier. In the implementation of the design, Xilinx Vertex -7 FPGA is used along with ISE 14.2 simulators. The parameters to compare are the Lookup tables (Logic element of FPGA), adder/subtractors and the multiplexers, along with performance characters, like the operating frequency, delay and total levels of logic (path travelled by the signal in register transfer level). The output shows that the anticipated design is an excellent alternative to the conventional shift and add algorithm.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
Approximate computing is a promising paradigm for trading off accuracy to improve hardware efficiency in error-resilient applications such as neural networks and image processing. This brief presents an ultra-efficient approximate multiplier with error compensation capability. The proposed multiplier considers the least significant half of the product a constant compensation term. The other half is calculated precisely to provide an ultra-efficient hardware-accuracy tradeoff. Furthermore, a low-complexity but effective error compensation module (ECM) is presented, significantly improving accuracy. The proposed multiplier is simulated using HSPICE with 7nm tri-gate Fin FET technology. The proposed design significantly improves the energy-delay product, on average, by 77% and 54% compared to the exact and existing approximate designs. Moreover, the proposed multiplier’s accuracy and effectiveness in neural networks and image multiplication are evaluated using MATLAB simulations. The results indicate that the proposed multiplier offers high accuracy comparable to the exact multiplier in NNs and provides an average PSNR of more than 51dB in image multiplication. Accordingly, it can be an effective alternative for exact multipliers in practical error-resilient applications.
List of the following materials will be included with the Downloaded Backup:Abstract:
This paper describes a bandwidth (BW)- and slew rate (SR)-enhanced class AB voltage follower (VF). A thorough small signal analysis of the proposed and a state-of-the-art AB-enhanced VF is presented to compare their performance. The proposed circuit has 50-MHz BW, 19.5-V/µs SR, and a BW figure of merit of 41.6 (MHz × pF/µW) for CL = 50 pF. It provides 13 times higher current efficiency and 15 times higher BW than the conventional VF with equal 60-µW static power dissipation. The experimental and simulation results of a fabricated test chip in the 130-nm CMOS technology validate the proposed circuit.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
High speed multimedia applications have paved way for a whole new area in high speed error-tolerant circuits with approximate computing. These applications deliver high performance at the cost of reduction in accuracy. Furthermore, such implementations reduce the complexity of the system architecture, delay and power consumption. This paper explores and proposes the design and analysis of two approximate compressors with reduced area, delay and power with comparable accuracy when compared with the existing architectures. The proposed designs are implemented using 45 nm CMOS technology and efficiency of the proposed designs have been extensively verified and projected on scales of area, delay, power, Power Delay Product (PDP), Error Rate (ER), Error Distance (ED), and Accurate Output Count (AOC). The proposed approximate 4 : 2 compressor shows 56.80% reduction in area, 57.20% reduction in power, and 73.30% reduction in delay compared to an accurate 4 : 2 compressor. The proposed compressors are utilised to implement 8 × 8 and 16 × 16 Dadda multipliers. These multipliers have comparable accuracy when compared with state-of-the-art approximate multipliers. The analysis is further extended to project the application of the proposed design in error resilient applications like image smoothing and multiplication.
List of the following materials will be included with the Downloaded Backup:Abstract:
The approximate computing paradigm emerged as a key alternative for trading off accuracy and energy efficiency. Error-tolerant applications, such as multimedia and signal processing, can process the information with lower-than-standard accuracy at the circuit level while still fulfilling a good and acceptable service quality at the application level. The automatic detection of R-peaks in an electrocardiogram (ECG) signal is the essential step preceding ECG processing and analysis. The Haar discrete wavelet transform (HDWT) is a low-complexity pre-processing filter suitable to detect ECG R-peaks in embedded systems like wearable devices, which are incredibly energy constrained. This work presents an approximate HDWT hardware architecture for ECG processing at very high energy efficiency. Our best-proposal employing pruning within the approximate HDWT hardware architecture requires just seven additions. The use of a truncation technique to improve energy efficiency is also investigated herein by observing the evolution of the signal-to-noise ratio and the ultimate impact in the ECG peak-detection application. This research finds that our HDWT approximate hardware architecture proposal accepts higher truncation levels than the original HDWT. In summary: Our results show about 9 times energy reduction when combining our HDWT matrix approximation proposal with the pruning and the highest acceptable level of truncation while still maintaining the R-peak detection performance accuracy of 99.68% on average.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
Approximate computing is a promising technique to elevate the performance of digital circuits at the cost of reduced accuracy in numerous error-resilient applications. Multipliers play a key role in many of these applications. In this brief, we propose a truncation based Booth multiplier with a compensation circuit generated by selective modifications in k-map to circumvent the carry appearing from the truncated part. By judicious mapping, hardware pruning and output error reduction is achieved simultaneously. In the quest of power and accuracy trade-off, Truncated and Approximate Carry based Booth Multipliers (TACBM) are proposed with a range of designs based on truncation factor w. When compared with the state-of-the-art multipliers, TACBM outperforms in terms of accuracy and Area Power savings. TACBM (w = 10) provides with 0.02% MRED and 23% reduction in Area-Power product compared to exact Booth multiplier. The multipliers are evaluated using image blending and Multilayer perceptron (MLP) neural network and a high value of accuracy (95.63%) for MLP is achieved.
List of the following materials will be included with the Downloaded Backup:Abstract:
Here, the critical path of ripple carry adder (RCA)-based binary tree adder (BTA) is analyzed to find the possibilities for delay minimization. Based on the findings of the analysis, the new logic formulation and the corresponding design of RCA are proposed for the BTA. The comparison result shows that the proposed RCA design offers better efficiency in terms of area, delay and energy than the existing RCA. Using this RCA design, the BTA structure is proposed. The synthesis result reveals that the proposed 32-operand BTA provides the saving of 22.5% in area–delay product and 28.7% in energy–delay product over the recent Wallace tree adder which is the best among available multi-operand adders. The authors have also applied the proposed BTA in the recent multiplier designs to evaluate its performance. The synthesis result shows that the performance of multiplier designs improved significantly due to the use of proposed BTA. Therefore, the proposed BTA design can be a better choice to develop the area, delay and energy efficient digital systems for signal and image processing applications.
List of the following materials will be included with the Downloaded Backup:Abstract:
This paper proposes an area-efficient bidirectional shift-register using bidirectional pulsed-latches. The proposed bidirectional shift-register reduces the area and power consumption by replacing master-slave flip-flops and 2-to-1 multiplexers with the proposed bidirectional pulsed-latches and non-overlap delayed pulsed clock signals, and by using sub shift-registers and extra temporary storage latches. A 256-bit bidirectional shift-register was fabricated using a 65nm CMOS process. Its area was 1,943μm2 and its power consumption is 200μW at a 100MHz clock frequency with VDD=1.2V. It reduces area by 39.2% and power consumption by 19.4% compared to the conventional bidirectional shift-register, length in most cases.
List of the following materials will be included with the Downloaded Backup:Abstract:
The combination of FAST corners and BRIEF descriptors provide highly robust image features. We present a novel detector for computing the FAST-BRIEF features from streaming images. To reduce the complexity of the BRIEF descriptor, we employ an optimized adder tree to perform summation by accumulation on streaming pixels for the smoothing operation. Since the window buffer used in existing designs for computing the BRIEF point-pairs are often poorly utilized, we propose an efficient sampling scheme that exploits register reuse to minimize the number of registers. Synthesis results based on 65- nm CMOS technology show that the proposed FAST-BRIEF core achieves over 40% reduction in area-delay product compared to the baseline design. In addition, we show that the proposed architecture can achieve 1.4x higher throughput than the baseline architecture with slightly lower energy consumption.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
Addition units are widely used in many computational kernels of several error-tolerant applications such as machine learning and signal, image, and video processing. Besides their use as stand-alone, additions are essential building blocks for other math operations such as subtraction, comparison, multiplication, squaring, and division. The parallel prefix adders (PPAs) is among the fastest adders. It represents a parallel prefix graph consisting of the carry operator nodes, called prefix operators (POs). The PPAs, in particular, are among the fastest adders because they optimize the parallelization of the carry generation (G) and propagation (P). In this work, we introduce approximate PPAs (AxPPAs) by exploiting approximations in the POs. To evaluate our proposal for approximate POs (AxPOs), we generate the following AxPPAs, consisting of a set of four PPAs: approximate Brent–Kung (AxPPA-BK), approximate Kogge–Stone (AxPPAKS), Ladner-Fischer (AxPPA-LF), and Sklansky (AxPPA-SK). We compare four AxPPA architectures with energy-efficient approximate adders (AxAs) [i.e., Copy, error-tolerant adder I (ETAI), lower-part OR adder (LOA), and Truncation (trunc)]. We tested them generically in stand-alone cases and embedded them in two important signal processing application kernels: a sum of squared differences (SSDs) video accelerator and a finite impulse response (FIR) filter kernel. The AxPPA-LF provides a new Pareto front in both energy-quality and area-quality results compared to state-of-the-art energy-efficient AxAs.
List of the following materials will be included with the Downloaded Backup:Abstract:
In this paper, a new pseudorandom number generator (PRNG) based on the logistic map has been proposed. To prevent the system to fall into short period orbits as well as increasing the randomness of the generated sequences, the proposed algorithm dynamically changes the parameters of the chaotic system. This PRNG has been implemented in a vertex 7 field-programmable gate array (FPGA) with a 32-bit fixed point precision, using a total of 510 lookup tables (LUTs) and 120 registers. The sequences generated by the proposed algorithm have been subjected to the National Institute of Standards and Technology (NIST) randomness tests, passing all of them. By comparing the randomness with the sequences generated by a raw 32-bit logistic map, it is shown that, by using only an additional 16% of LUTs, the proposed PRNG obtains a much better performance in terms of randomness, increasing the NIST passing rate from 0.252 to 0.989. Finally, the proposed bitwise dynamical PRNG is compared with other chaos-based realizations previously proposed, showing great improvement in terms of resources and randomness.
List of the following materials will be included with the Downloaded Backup:Abstract:
A CMOS fully integrated all-pass filter with an extremely low pole frequency of 2 Hz is introduced in this paper. It has 0.08-dB passband ripple and 0.029-mm2Si area. It has 0.38-mW power consumption in strong inversion with ±0.6-V power supplies. In subthreshold, it has 0.64-µW quiescent power and operates with ±200-mV dc supplies. Miller multiplication is used to obtain a large equivalent capacitor without excessive Si area. By varying the gain of the Miller amplifier, the pole frequency can be varied from 2 to 48 Hz. Experimental and simulation results of a test chip prototype in 130-nm CMOS technology validate the proposed circuit.
List of the following materials will be included with the Downloaded Backup:Abstract:
A non-destructive column-selection-enabled 10T SRAM for aggressive power reduction is presented in this brief. It frees a half-selected behavior by exploiting the bit line-shared data-aware write scheme. The differential-VDD (Diff-VDD) technique is adopted to improve the write ability of the design. In addition, its decoupled read bit lines are given permission to be charged and discharged depending on the stored data bits. In combination with the proposed dropped-VDD biasing, it achieves the significant power reduction. The experimental results show that the proposed design provides the 3.3× improvement in the write margin compared with the standard Diff-10T SRAM. A 5.5-kb 10T SRAM in a 65-nm CMOS process has a total power of 51.25 µW and a leakage power of 41.8 µW when operating at 6.25 MHz at 0.5 V, achieving 56.3% reduction in dynamic power and 32.1% reduction in leakage power compared with the previous single-ended 10T SRAM.
List of the following materials will be included with the Downloaded Backup:Abstract:
This study represents designing and implementation of a low power and high speed 16 order FIR filter. To optimize filter area, delay and power, different multiplication techniques such as Vedic multiplier, add and shift method and Wallace tree (WT) multiplier are used for the multiplication of filter coefficient with filter input. Various adders such as ripple carry adder, Kogge Stone adder, Brent Kung adder, Ladner Fischer adder and Han Carlson adder are analyzed for optimum performance study for further use in various multiplication techniques along with barrel shifter. Secondly optimization of filter area and delay is done by using add and shift method for multiplication, although it increases power dissipation of the filter. To reduce the complexity of filter, coefficients are represented in canonical signed digit representation as it is more efficient than traditional binary representation. The finite impulse-response (FIR) filter is designed in MATLAB using equiripple method and the same filter is synthesized on Xilinx Spartan 3E XC3S500E target field-programmable gate array device using Very High Speed Integrated Circuit Hardware Description Language (VHDL) subsequently the total on-chip power is calculated in Vivado2014.4. The comparison of simulation results of all the filters show that FIR filter with WT multiplier is the best optimized filter.
List of the following materials will be included with the Downloaded Backup:Abstract:
Approximate multipliers attract a large interest in the scientific literature that proposes several circuits built with approximate 4-2 compressors. Due to the large number of proposed solutions, the designer who wishes to use an approximate 4-2 compressor is faced with the problem of selecting the right topology. In this paper, we present a comprehensive survey and comparison of approximate 4-2 compressors previously proposed in literature. We present also a novel approximate compressor, so that a total of twelve different approximate 4-2 compressors are analyzed. The investigated circuits are employed to design 8 × 8 and 16 × 16 multipliers, implemented in 28nm CMOS technology. For each operand size we analyze two multiplier configurations, with different levels of approximations, both signed and unsigned. Our study highlights that there is no unique winning approximate compressor topology since the best solution depends on the required precision, on the signedness of the multiplier and on the considered error metric.
List of the following materials will be included with the Downloaded Backup:Abstract:
In this paper a new and highly efficient hardware architecture for a bit-serial implementation of a 3*3 filter on FPGA is developed and presented. The concept is implemented on a Gaussian blur spatial filter and it can be extended to other filters with similar characteristics. The proposed Single Instruction Multiple Data (SIMD) architecture provides a constant operating time independent of the size of the given image while the arithmetic operations are limited to the operations of addition. The Multiple Instruction Multiple Data (MIMD) performance is achieved in a near fraction of the cost. Thus, the hardware’s utilization is optimized. The total time needed to perform the filter of interest on the given image is solely dependent on the working clock frequency. The proposed design is evaluated using a small image and is implemented on two FPGA families with various sizes of an image. Also, it is compared with other architectures.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
Approximate computing is an emerging paradigm for trading off computing accuracy to reduce energy consumption and design complexity in a variety of applications, for which exact computation is not a critical requirement. Different from conventional designs using AND-OR and XOR gates, the majority gate is widely used in many emerging nanotechnologies. An ultra-efficient 6-2 compressor is proposed in this paper. It is composed of two majority gates that lead to low energy consumption and high hardware efficiency. The proposed compressor is utilized in the approximate partial product reduction of a modified 8×8 Dadda multiplier with a truncated structure. Experimental results show that this multiplier realizes a significant reduction in hardware cost, especially in terms of power and area, on average by up to 40% and 31% respectively, compared to exact and state-of-the-art designs. The application of image multiplication is also presented to assess the practicability of the multiplier. The results show that the proposed multiplier results in images with higher quality in peak signal to noise ratio (PSNR) and mean structural similarity index metric (MSSIM) compared to other designs.
List of the following materials will be included with the Downloaded Backup:Abstract:
Major operation block in any processing unit is a multiplier. There are many multiplication algorithms are proposed, by using which multiplier structure can be designed. Among various multiplication algorithms, Wallace tree multiplication algorithm is beneficial in terms of speed of operation. With the advancement of technology, demand for circuits with high speed and low area is increasing. In order to improve the speed of Wallace tree multiplier without degrading its area parameter, a new structure of Wallace tree multiplier is proposed in this paper. In the proposed structure, the final addition stage of partial products is performed by parallel prefix adders (PPAs). In this paper, five Wallace tree multiplier structures are proposed using Kogge stone adder, Sklansky adder, Brent Kung adder, Ladner Fischer adder and Han carlson adder. All the multiplier structures are designed using Verilog HDL in Xilinix 13.2 design suite. The proposed structures are simulated using ISIM simulator and synthesized using XST synthesizer. The proposed designs are analyzed with respect to traditional multiplier design in terms of area (No. of LUTs) and delay (ns).
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
Systolic Array (SA) architecture is a unique computation architecture where the inputs are continuously flowing, and the processing elements perform the desired computations in parallel. SA’s are prominently investigated due to the emergence of heavy and large processing elements for modern-day Convolution Neural Network (CNN) applications. Taking this cue, SA architectures of the order of kernel size and configured with approximate multipliers are investigated for image processing applications. The approximate array multiplier derived from approximate 4-2 compressors were employed to achieve hardware benefits without losing on the image quality metrics. The SA architecture is configured to the same size as filter kernels in a view to achieve maximum utilization, and the same is compared with other existing SA architectures for hardware metrics. The computational time for processing an image of size 256 × 256 was evaluated for approximated SA. This work investigates approximate SA for Gaussian smoothing and image outline feature extraction applications to showcase the reliability of the design. The novel approximate SA architecture is a step toward designing compact sized SoC designs for real-time image and video processing applications.
List of the following materials will be included with the Downloaded Backup:Abstract:
In this study, the design and field-programmable gate array (FPGA) implementation of the digital notch filter with the lattice wave digital filter (LWDF) structure is presented. For reducing the initial signal transient, the variable notch bandwidth filter is designed. During the initial samples, the notch filter has a wide bandwidth in order to diminish signal transient. As time moves forward, the notch bandwidth reduces to attain the possible minimum width. This results in minimized transient duration notch filter with a sufficiently high-quality factor. Previously, the IIR structure has been used for implementing the time varying bandwidth notch filter. Such a filter requires two variable coefficients for varying the notch width with time. The advantage of using a LWDF structure is that only one coefficient has variable values to vary the notch width with time. Therefore, the number of memory locations required to implement the proposed design is reduced by half. Moreover, the LWDF is less sensitive to the word-length effects. Thus, the proposed lattice wave digital notch filter (LWDNF) produces better results compared to the existing literature in terms of error analysis. The suggested LWDNF is then implemented on a field-programmable gate array using a Xilinx system generator for the DSP design suite.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
Random Number Generators (RNGs) are substantially used in many security domains, providing a fundamental source of unpredictability essential for tasks such as cryptography, simulations, and statistical analyses. The efficiency and quality of an RNG directly impact the reliability and security of diverse applications, making advancements in RNG design, as explored in this study, of significant importance for enhancing computational processes. This paper presents an innovative Pseudo-Random Number Generator (PRNG) that leverages the efficiency of two carefully selected Linear Feedback Shift Registers (LFSRs) and a connecting XOR gate. The investigation of five polynomials identified an optimal pair, resulting in a notable improvement of over 200X in the length of random bit sequences compared to a single LFSR-based PRNG. The Basys3 FPGA board with the xc7a35tcpg236-1 FPGA chip was used to implement and synthesize the proposed design. Two significant findings emerge from this research. Firstly, using variable polynomials demonstrates a huge enhancement in the duration of randomness, outperforming the impact of variable seeds. A noteworthy observation is that employing the same polynomials in different branches does not result in optimal results. Secondly, managing more seeds is associated with an increased area cost, underscoring the efficiency of handling two polynomials.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
Approximate arithmetic computing circuits and architectures have been proven to be energy efficient designs for Deep Neural Networks (DNNs) which are error resilient. In this paper, an approximate 8-bit Wallace Multiplier has been proposed and designed in 90nm CMOS technology for energy efficiency. The proposed 8-bit approximate multiplier design consumes ~32% less energy in comparison to an accurate 8-bit Wallace Tree multiplier with less than 20% Mean Relative Error (MRE).
List of the following materials will be included with the Downloaded Backup:Proposed Abstract:
The Arithmetic Logic Unit (ALU) is a fundamental component in digital systems, particularly in the central processing units (CPUs) of microprocessors, where it executes essential arithmetic and logical functions. This paper presents the design and implementation of an 8-bit Arithmetic Logic Unit (ALU) using CMOS technology, developed and simulated in DSCH3 and Microwind environments. The primary goal of this research is to design an efficient and compact ALU optimized for performance and area efficiency. The 8-bit ALU performs eight operations: ripple carry addition, ripple borrow subtraction, multiplication, XOR, left shift, right shift, NAND, and NOR. Each logic gate within the ALU is constructed using CMOS logic to enhance power efficiency and integration density. This paper provides a detailed description of the ALU's CMOS-based architecture, its key components, and the control mechanism for operation selection. Performance metrics, including speed, area efficiency, and power consumption, are analyzed to assess the ALU’s effectiveness in CMOS technology.
List of the following materials will be included with the Downloaded Backup:Proposed Abstract:
Arithmetic logic unit (ALU) is an important part of all digital gadgets and applications. This paper presents the design and implementation of an 8-bit Arithmetic Logic Unit (ALU) with a capability to perform eight distinct operations. ALUs are fundamental components in the central processing units (CPUs) of microprocessors and are responsible for executing arithmetic and logical operations. The primary objective of this research is to design an efficient and versatile 8-bit ALU that can execute a wide range of operations while optimizing for performance and area efficiency. The proposed 8-bit ALU is designed to perform the following eight operations: Ripple carry addition, Ripple borrow subtraction, Array multiplication, XOR operation, left shift, right shift, NAND operation and a logical NOR operation. The research presents a detailed description of the ALU's architecture, its constituent components, and the control mechanism for selecting operations. Performance metrics, such as speed, area efficiency, and power consumption, are analyzed and compared with Xilinx FPGA.
List of the following materials will be included with the Downloaded Backup:We have also Code for 720 x 576 Image Resolution using 64 x 64 Block Size of HEVC. Cost of this Update work in High Resolution Rs. 45,000/- ( Rs. 45,000/- + Rs. 35,000/- ) : Total Cost : Rs. 80,000/-
Abstract:
This paper aims to design an efficient mixed serial five-stage pipeline processing hardware architecture of deblocking filter (DBF) and sample adaptive offset (SAO) filter for high efficiency video coding decoder. The proposed hardware is designed to increase the throughput and reduce the number of clock cycles by processing the pixels in a stream of 4 × 36 samples in which edge filters are applied vertically in a parallel fashion for processing of luma/chroma samples. Subsequently these filtered pixels are transposed and reprocessed through vertical filter for horizontal filtering in a pipeline fashion. Finally, the filtered block transposed back to the original orientation and forwarded to a three-stage pipeline SAO filter. The proposed architecture is implemented in field programmable gate array and application specific integrated circuit platform using 90-nm library. Experimental results illustrate that the proposed DBF and SAO architecture decreases the processing cycles (172) required for processing each 64 × 64 or large coding unit compared with the state-of-the-art literature with the increase of gate count (593.32K) including memory. The results show that the throughput of the proposed filter can successfully decode ultrahigh definition video sequences at 200 frames/s at 341 MHz.
List of the following materials will be included with the Downloaded Backup:Abstract: Iterative methods are basic building blocks of communication systems and often represent a dominating part of the system, and therefore, they necessitate careful design and implementation for optimal performance. In this brief, we propose a novel field programmable gate arrays design of matrix–vector multiplier that can be used to efficiently implement widely adopted iterative methods. The proposed design exploits the sparse structure of the matrix as well as the fact that spreading code matrices have equal magnitude entries. Implementation details and timing analysis results are promising and are shown to satisfy most modern communication system requirements.
List of the following materials will be included with the Downloaded Backup:Abstract:
A novel design of a hybrid Full Adder (FA) using Pass Transistors (PTs), Transmission Gates (TGs) and Conventional Complementary Metal Oxide Semiconductor (CCMOS) logic is presented. Performance analysis of the circuit has been conducted using Cadence toolset. For comparative analysis, the performance parameters have been compared with twenty existing FA circuits. The proposed FA has also been extended up to a word length of 64 bits in order to test its scalability. Only the proposed FA and five of the existing designs have the ability to operate without utilizing buffer in intermediate stages while extended to 64 bits. According to simulation results, the proposed design demonstrates notable performance in power consumption and delay which accounted for low power delay product. Based on the simulation results, it can be stated that the proposed hybrid FA circuit is an attractive alternative in the data path design of modern high-speed Central Processing Units.
List of the following materials will be included with the Downloaded Backup:Proposed Abstract:
Approximate computing is an emerging paradigm in error-tolerant applications that leads to power-efficient designs without significant loss in quality. The divider in these applications have complex hardware and more latency among the computational blocks resulting in power consumption. Hence approximating the division module would lead to designs with vastly improved power efficiency. A new approximate subtractor (AxSUB) is proposed in this paper with the intent to reduce the hardware complexity while achieving accuracy within permissible limits. The proposed AxSUB and existing approximate subtractor units are used in the restoring array division (RAD) architecture to prove the efficacy of the AxSUB. This proposed architecture design with 8/4 approximate divider using Verilog HDL and synthesized using Xilinx Spartan 6 FPGA, and proved the performance of area, delay and power.
List of the following materials will be included with the Downloaded Backup:Abstract:
In this brief, based on upset physical mechanism together with reasonable transistor size, a robust 10T memory cell is first proposed to enhance the reliability level in aerospace radiation environment, while keeping the main advantages of small area, low power, and high stability. Using Taiwan Semiconductor Manufacturing Company 65-nmCMOS commercial standard process, simulations performed in Cadence Spectre demonstrate the ability of the proposed radiation-hardened-by-design 10T cell to tolerate both 0 →1and1→0 single node upsets, with the increased read/write access time.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
One of the primary purposes of a digital signal processing system is multiplication. The multiplier’s performance affects the DSP system’s overall performance. Therefore, it is crucial to create an effective and quick multiplier implementation design. Vedic mathematics can be used to simplify complex computations so that they are easier to perform verbally. Urdhva Triyambakam is the multiplication algorithm used in Vedic math. In this paper, we employing Brent Kung adder to enhance the Vedic multiplier’s performance. The Urdhva Tiryagbhyam sutra is being used in place of other multiplication strategies since it applies to all instances of algorithms for N x N bit numbers and produces the least amount of latency. Four 4-bit vedic multipliers, two 8-bit Brent Kung adders, one 4-bit Brent Kung adder, and an OR gate are used to create an 8-bit vedic multiplier. A 4-bit vedic multiplier is created similarly by combining four 2-bit vedic multipliers, two 4-bit Brent Kung Adders, one 2-bit Brent Kung Adder, and one OR gate. These four-bit vedic multipliers are then combined to form an eight-bit vedic multiplier. After that, Xilinx Vivado Software is used to simulate and synthesis the 8 x 8 Vedic Multiplier, which was coded in Verilog HDL. The proposed Vedic Multiplier is outperformed in terms of speed when compared to related works.
List of the following materials will be included with the Downloaded Backup:Abstract:
This paper introduces a mixed-logic design method for line decoders, combining transmission gate logic, pass transistor dual-value logic and static CMOS. Two novel topologies are presented for the 2-4 decoders: a 14-transistor topology aiming on minimizing transistor count and power dissipation and a 15-transistor topology aiming on high power delay performance. Both a normal and an inverting decoder are implemented in each case, yielding a total of four new designs. Furthermore, four new 4-16 decoders are designed, by using mixed-logic 2-4 pre decoders combined with standard CMOS post-decoder. All proposed decoders have full swinging capability and reduced transistor count compared to their conventional CMOS counterparts. Finally, a variety of comparative spice simulations at the 32 nm shows that the proposed circuits present a significant improvement in power and delay, outperforming CMOS in almost all cases.
List of the following materials will be included with the Downloaded Backup:Abstract:
Due to limited frequency resources, new services are being applied to the existing frequencies, and service providers are allocating some of the existing frequencies for newly enhanced mobile communications. Because of this frequency environment, repeater and base station systems for mobile communications are becoming more complicated, and frequency interference caused by multiple bands and services is getting worse. Therefore, a heterodyne receiver using IF filters with high selectivity has been used to minimize the interference between frequencies. However, repeater and base station systems in mobile communications employing fixed IF filters cannot actively cope with the usage of multiple frequency bands, the application of various services, and frequency recycling. Therefore, this brief proposes a reconfigurable digital IF filter with variable center frequency and bandwidth while achieving high selectivity as existing IF filters. The center frequency of filter can vary from 10MHz to 62.5MHz, and the filter bandwidth can be selective to one of 10MHz, 15MHz, and 20MHz. The proposed digital filter also reduces the complexity of adders and multipliers by 38.81% and 41.57%, respectively, compared to an existing digital filter by using a filter bank and a multi stage structure. This digital IF filter is fabricated on a 130-nm CMOS process and occupies 5.90 mm2.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
2-Dimensional fast Fourier transform (FFT) has been widely used in radar signal process. Due to the need for high performance, field programmable gate array (FPGA) is an ideal hardware device for this application. For space-borne radar platform such as synthetic aperture radar (SAR), single-event upsets (SEUs) can cause lots of soft errors in static random access memory (SRAM) based FPGA. As to this, protecting the 2D-FFT implemented in FPGA from SEUs is very important. In this article, we analyze the critical weakness induced by SEUs in the 2D-FFT process, and then a 2D-FFT design with high SEU resilience is presented. The design utilizes the advantage of several anti-SEU methods. For butterfly control in FFT, partially triple modular redundancy (TMR) is used. For data buffers, error correction code (ECC) is applied to read and write operation. Furthermore, safe finite state machine (FSM) is adopted by important control registers. Fault injection results show that all these reinforcement technologies contribute to enhance the ability to mitigate the SEU effects.
List of the following materials will be included with the Downloaded Backup:Abstract:
In this paper, an exchange algorithm is proposed to design sparse linear phase finite impulse response (FIR) filters with reduced effective length. The sparse FIR filter design problem is formally an l0-norm minimization problem. This original design problem is re-formulated by encoding the filter coefficients using a binary encoding vector, which represents the locations of the zero and non-zero filter coefficients. An iterative 0-1 exchange process with proper direction control is proposed to propel the minimax approximation error toward the specified upper bound of error for sparsity maximization. The effective length is optimized with a lower priority than sparsity in the proposed algorithm. Simulation results show that the proposed algorithm is superior to the existing algorithms in terms of both sparsity and/or effective length in most cases.
List of the following materials will be included with the Downloaded Backup:Abstract: In this paper, we propose the design of two vectors testable sequential circuits based on conservative logic gates. The proposed sequential circuits based on conservative logic gates outperform the sequential circuits implemented in classical gates in terms of testability. Any sequential circuit based on conservative logic gates can be tested for classical unidirectional stuck-at faults using only two test vectors. The two test vectors are all 1s, and all 0s. The designs of two vectors testable latches, master-slave flip-flops and double edge triggered (DET) flip-flops are presented. The importance of the proposed work lies in the fact that it provides the design of reversible sequential circuits completely testable for any stuck-at fault by only two test vectors, thereby eliminating the need for any type of scan-path access to internal memory cells. The reversible design of the DET flip-flop is proposed for the first time in the literature. We also showed the application of the proposed approach toward 100% fault coverage for single missing/additional cell defect in the quantum dot cellular automata (QCA) layout of the Fredkin gate. We are also presenting a new conservative logic gate called multiplexer conservative QCA gate (MX-cqca) that is not reversible in nature but has similar properties as the Fredkin gate of working as 2:1 multiplexer. The proposed MX-cqca gate surpasses the Fredkin gate in terms of complexity (the number of majority voters), speed, and area.
List of the following materials will be included with the Downloaded Backup:Abstract:
Approximate computing is tentatively applied in some digital signal processing applications which have an inherent tolerance for erroneous computing results. The approximate arithmetic blocks are utilized in them to improve the electrical performance of these circuits. Multiplier is one of the fundamental units in computer arithmetic blocks. Moreover, the 4-2 compressors are widely employed in the parallel multipliers to accelerate the compression process of partial products. In this paper, three novel approximate 4-2 compressors are proposed and utilized in 8-bit multipliers. Meanwhile, an error-correcting module (ECM) is presented to promote the error performance of approximate multiplier with the proposed 4-2 compressors. In this paper, the number of the approximate 4-2 compressor’s outputs is innovatively reduced to one, which brings further improvements in the energy efficiency. Compared with the exact 4-2 compressors, the simulation results indicate that the proposed approximate compressors UCAC1, UCAC2, UCAC3 achieve 24.76%, 51.43%, and 66.67% reduction in delay, 71.76%, 83.06%, and 93.28% reduction in power and 54.02%, 79.32%, and 93.10% reduction in area, respectively. And the utilization of these proposed compressors in 8-bit multipliers brings 49.29% reduction of power consumption on average.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
Due to their shrinking feature sizes as well as environmental influences, such as high-energy radiation, electrical noise, and particle strikes, integrated circuits are getting more vulnerable to transient faults. Accordingly, how to make those circuits more robust has become an essential step in today’s design flows. Methods increasing the robustness of circuits against these faults already exist for a long period of time but either introduce huge additional logic, change the timing behavior of the circuit, or are applicable for dedicated circuits such as microprocessors only. In this paper, we propose an alternative method, which overcomes these drawbacks by determining application specific knowledge of the circuit, namely the relations of flip-flops and when they assume the same value. By this, we exploit partial redundancies, which are inherent in most circuits anyway (even the optimized ones), to frequently compare the circuit signals for their correctness—eventually leading to an increased robustness. Since determining the correspondingly needed information is a computationally hard task, formal methods, such as bounded model checking, satisfiability-based automatic test pattern generation, and binary decision diagrams, are utilized for this purpose. The resulting methodology requires only a slight increase in additional hardware, does only influence the timing behavior of the circuit negligibly, and is automatically applicable to arbitrary circuits. Experimental evaluations confirm these benefits.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
Deep Neural Networks (DNNs) perform intensive matrix multiplications but can tolerate inaccurate intermediate results to some degree. This makes them a perfect target for energy reduction by approximate computing. However, current research in this direction requires DNNs redesign and does not provide the flexibility for users to trade accuracy for energy saving. In this brief, we propose a runtime reconfigurable approximate floating-point multiplier and present details of its hardware implementation. The flexible computation precision is provided by our error correction module, which is controlled by reconfigurable clock signals. The circuit design solves the glitch and metastability problems. The proposed approximate multiplier with three precision levels is evaluated on Synopsys design compiler and Xilinx FPGA platforms. Experimental results demonstrate the advantages of our approach in terms of speed, hardware overhead, and power consumption, while ensuring a controllable accuracy loss for DNNs inferences.
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
GPS uses ECCs to see if an error occurs when the data sent from the satellite reaches the user. Each message structure uses ECCs such as Hamming Code, CRC, BCH Code, and LDPC Code. If the satellite contains all of the encoders, it has a negative impact to the area and power consumption. Therefore, in this paper, we propose a CRC-BCH unified encoder for GPS, which is efficient in terms of space and power consumption. Since both the CRC and BCH encoders use shift registers, the design was made using this part. To replace the existing encoder, the CRC-BCH encoder must have the same output. To validate this, we used individual CRC and BCH encoders and confirmed that the generated output was identical to the output of the proposed encoder. The proposed CRC-BCH unified encoder was synthesized at an operating frequency of 400 MHz using the CMOS 28nm process. The synthesis results showed that it used 16.67% less area and consumed 19.68% less power than the existing encoder. Therefore, the proposed CRC-BCH unified encoder offers advantages in terms of satellite weight and energy efficiency.
List of the following materials will be included with the Downloaded Backup:Abstract:
Conventionally, fixed-width adder-tree (AT) design is obtained from the full-width AT design by employing direct or post-truncation. In direct-truncation, one lower order bit of each adder output of full-width AT is post-truncated, and in case of post-truncation, {p} lower order-bits of final-stage adder output are truncated, where p = dlog2 Ne and N is the input-vector size. Both these methods do not provide an efficient design. In this paper, a novel scheme is presented to obtain fixed-width AT design using truncated input. A bias estimation formula based on probabilistic approach is presented to compensate the truncation error. The proposed fixed-width AT design for input-vector sizes 8 and 16 offers (37%, 23%, 22%) and (51%, 30%, 27%) area delay product (ADP) saving for word-length sizes (8, 12, 16), respectively, and calculates the output almost with the same accuracy as the post-truncated fixed-width AT which has the highest accuracy among the existing fixed-width AT. Further, we observed that Walsh-Hadamard transform based on the proposed fixed-width AT design reconstruct higher-texture images with higher peak signal to noise ratio (PSNR) and moderate-texture images with almost the same PSNR compared to those obtained using the existing AT designs. Besides, the proposed design creates an additional advantage to optimize other blocks appear at the upstream of the AT in a complex design.
List of the following materials will be included with the Downloaded Backup:Proposed Abstract:
Frequency dividers are of utmost importance in frequency synthesizers that are based on phase locked loops. The use of dual modulus presales enhances the versatility of the design in both integer and Fractional-N frequency synthesizers. The selection of an acceptable division ratio is dependent upon the channel spacing and frequency range of the synthesizer. There are several techniques for division in electronic systems, including the injection locked frequency divider (ILFD), complementary ILFDs, flip flop based dividers, dual modulus dividers, and modular dividers. Therefore, these approaches possess some advantages and disadvantages, such as reduced jitter, a restricted frequency tuning range, increased circuit size due to the addition of an LC tank circuit, increased power consumption, and lower quality factor. This work aims at addressing certain issues pertaining to clock dividers and proposes a unique design that utilizes a multiple digital frequency divider based on D flip flops. The architectural design is predicated on the use of a phase shifting mechanism using a D flip flop, which effectively controls the division ratio. The present study involves the use of a preliminary phase shifting melody in conjunction with the Digital Clock Manager (DCM). The auto tuning strategy described in this study aims to adjust the phase difference between two differential clock signals. By intentionally inducing metastability in one or more flip flops, the proposed approach utilizes a digital clock manager in a clock divider to mitigate the effects of metastability and reduce jitter across multiple tuning frequencies. Furthermore, it is worth noting that the logic size and power consumption required for its operation are significantly reduced.
List of the following materials will be included with the Downloaded Backup:Abstract:
The proposed work aims to facilitate the conversion of images into a hexadecimal format for efficient storage and manipulation, and subsequently restore them to their original form. This conversion is beneficial for reducing storage space and simplifying data transmission. The system supports multiple color spaces, including grayscale, RGB, and YCbCr, enhancing its versatility in image processing tasks. Users select an image file, which the system processes according to the selected mode: converting the image or its channels to a hexadecimal format and saving the data to files. During restoration, the system reads the hexadecimal files, reconstructs the image, and displays it. To ensure the fidelity of the restored images, the system computes and displays quality metrics such as Peak Signal-to-Noise Ratio (PSNR), Mean Squared Error (MSE), and Structural Similarity Index (SSIM). This comprehensive solution provides an efficient method for image data handling and quality assessment, ensuring accurate and reliable image restoration.
Proposed System:The proposed system aims to facilitate the conversion of images into a hexadecimal format and subsequently restore them to their original form. This system supports multiple color spaces, including grayscale, RGB, and YCbCr, and evaluates the quality of the restored images using metrics such as Peak Signal-to-Noise Ratio (PSNR), Mean Squared Error (MSE), and Structural Similarity Index (SSIM).
List of the following materials will be included with the Downloaded Backup:Proposed Abstract:
Random Number Generators (RNGs) are substantially used in many security domains, providing a fundamental source of unpredictability essential for tasks such as cryptography, simulations, and statistical analyses. The efficiency and quality of an RNG directly impact the reliability and security of diverse applications, making advancements in RNG design, as explored in this study, of significant importance for enhancing computational processes. This paper presents an innovative Pseudo-Random Number Generator (PRNG) that leverages the efficiency of two carefully selected Linear Feedback Shift Registers (LFSRs) and a connecting XOR gate. The investigation of five polynomials identified an optimal pair, resulting in a notable improvement of over 200X in the length of random bit sequences compared to a single LFSR-based PRNG. The Basys3 FPGA board with the xc7a35tcpg236-1 FPGA chip was used to implement and synthesize the proposed design. Two significant findings emerge from this research. Firstly, using variable polynomials demonstrates a huge enhancement in the duration of randomness, outperforming the impact of variable seeds. A noteworthy observation is that employing the same polynomials in different branches does not result in optimal results. Secondly, managing more seeds is associated with an increased area cost, underscoring the efficiency of handling two polynomials.
List of the following materials will be included with the Downloaded Backup:Abstract:
In practical CCTV applications, there are problems of the camera with low resolution, camera fields of view, and lighting environments. These could degrade the image quality and it is difficult to extract useful information for further processing. Super-resolution techniques have been proposed widely by the researchers. However, many approaches are complex and are difficult to use in practical scenarios. In this paper, we propose an efficient Super-resolution algorithm using overlapping bi-cubic for hardware implementation. Experimental results are verified using processing time and reconstructed images that can be used in real time applications.
List of the following materials will be included with the Downloaded Backup:Abstract:
Ternary content-addressable memory (TCAM)-based search engines play an important role in networking routers. The search space demands of TCAM applications are constantly rising. However, existing realizations of TCAM on field-programmable gate arrays (FPGAs) suffer from storage inefficiency. This paper presents a multipumping-enabled multiported SRAM-based TCAM design on FPGA, to achieve an efficient utilization of SRAM memory. Existing SRAM-based solutions for TCAM reduce the impact of the increase in the traditional TCAM pattern width from an exponential growth in memory usage to a linear one using cascaded block RAMs (BRAMs) on FPGA. However, BRAMs on state-of-the-art FPGAs have a minimum depth limitation, which limits the storage efficiency for TCAM bits. Our proposed solution avoids this limitation by mapping the traditional TCAM table divisions to shallow sub-blocks of the configured BRAMs, thus achieving a memory-efficient TCAM memory design. The proposed solution operates the configured simple dual-port BRAMs of the design as multiported SRAM using the multipumping technique, by clocking them with a higher internal clock frequency to access the sub-blocks of the BRAM in one system cycle. We implemented our proposed design on a Virtex-6 xc6vlx760 FPGA device. Compared with existing FPGA-based TCAM designs, our proposed method achieves up to 2.85 times better performance per memory.
List of the following materials will be included with the Downloaded Backup:Abstract:
Approximate arithmetic has recently emerged as a promising paradigm for many imprecision-tolerant applications. It can offer substantial reductions in circuit complexity, delay and energy consumption by relaxing accuracy requirements. In this paper, we propose a novel energy-efficient approximate multiplier design using a significance-driven logic compression (SDLC) approach. Fundamental to this approach is an algorithmic and configurable lossy compression of the partial product rows based on their progressive bit significance. This is followed by the commutative remapping of the resulting product terms to reduce the number of product rows. As such, the complexity of the multiplier in terms of logic cell counts and lengths of critical paths is drastically reduced. A number of multipliers with different bit-widths (4-bit to 128-bit) are designed in System Verilog and synthesized using Synopsys Design Compiler. Post-synthesis experiments showed that up to an order of magnitude energy savings, and reductions of 65% in critical delay and almost 45% in silicon area can be achieved for a 128-bit multiplier compared to an accurate equivalent. These gains are achieved with low accuracy losses estimated at less than 0.00071 mean relative error. Additionally, we demonstrate the energy-accuracy trade-offs for different degrees of compression, achieved through configurable logic clustering. In evaluating the effectiveness of our approach, a case study image processing application showed up to 68.3% energy reduction with negligible losses in image quality expressed as peak signal-to-noise ratio (PSNR).
List of the following materials will be included with the Downloaded Backup:Base Paper Abstract:
VLSI realizations of digit-recurrence binary division usually use redundant representation of partial remainders and quotient digits. The former allows for fast carry-free computation of the next partial remainder, and the latter leads to less number of the required divisor multiples. In studying the previous relevant works, we have noted that the binary carry save (CS) number system is prevalent in the representation of partial remainders, and redundant high radix representation of quotient digits is popular in order to reduce the cycle count. In this paper, we explore a design space containing four division architectures. These are based on binary CS or radix-16 signed digit (SD) representations of partial remainders. On the other hand, they use full or partial pre computation of divisor multiples. The latter uses smaller multiplexer at the cost two extra adders, where one of the operands is constant within all cycles. The quotient digits are represented by radix-16 [−9,9]SDs. Our synthesis-based evaluation of VLSI realizations of the best previous relevant work and the four proposed designs show reduced power and energy figures in the proposed designs at the cost of more silicon area and delay measures. However, our energy-delay product is 26%–35% less than that of the reference work.
List of the following materials will be included with the Downloaded Backup:Abstract:
Approximate addition is a technique to trade off energy consumption and output quality in error-tolerant applications. In prior art, bit truncation has been explored as a lever to dynamically trade off energy and quality. In this brief, an innovative bit truncation strategy is proposed to achieve more graceful quality degradation compared to state-of-the-art truncation schemes. This translates into energy reduction at a given quality target. When applied to a ripple-carry adder, the proposed bit truncation approach improves quality by up to 8.5 dB in terms of peak signal-to-noise ratio, compared to traditional bit truncation. As a case study, the proposed approach was applied to a discrete cosine transform engine. In comparison with prior art, the proposed approach reduces energy by 20%, at insignificant delay and silicon area overhead.
List of the following materials will be included with the Downloaded Backup:Abstract:
Static random access memory (SRAM)-based ternary content-addressable memory (TCAM) on field-programmable gate arrays (FPGAs) is used for packet classification in software-defined networking (SDN) and Open Flow applications. SRAMs implementing TCAM contents constitute the major part of a TCAM design on FPGAs, which are vulnerable to soft errors. The protection of SRAM-based TCAMs against soft errors is challenging without compromising critical path delay and maintaining a high search performance. This brief presents a low cost and low-response-time technique for the protection of SRAM-based TCAMs. This technique uses simple, single-bit parity for fault detection which has a minimal critical path overhead. This technique exploits the binary-encoded TCAM table maintained in SRAM-based TCAMs for update purposes to implement a low-response-time error-correction mechanism at low cost. The error-correction process is carried out in the background, allowing lookup operations to be performed simultaneously, thus maintaining a high search performance. The proposed technique provides protection against soft errors with a response time of 293 ns, whereas maintaining a search rate of 222 million searches per second on a 1024 × 40 size TCAM on Artix-7 FPGA.
List of the following materials will be included with the Downloaded Backup:Abstract:
Ternary content addressable memories (TCAMs) are widely used in network devices to implement packet classification. They are used, for example, for packet forwarding, for security, and to implement software-defined networks (SDNs). TCAMs are commonly implemented as standalone devices or as an intellectual property block that is integrated on networking application-specific integrated circuits. On the other hand, field-programmable gate arrays (FPGAs) do not include TCAM blocks. However, the flexibility of FPGAs makes them attractive for SDN implementations, and most FPGA vendors provide development kits for SDN. Those need to support TCAM functionality and, therefore, there is a need to emulate TCAMs using the logic blocks available in the FPGA. In recent years, a number of schemes to emulate TCAMs on FPGAs have been proposed. Some of them take advantage of the large number of memory blocks available inside modern FPGAs to use them to implement TCAMs. A problem when using memories is that they can be affected by soft errors that corrupt the stored bits. The memories can be protected with a parity check to detect errors or with an error correction code to correct them, but this requires additional memory bits per word. In this brief, the protection of the memories used to emulate TCAMs is considered. In particular, it is shown that by exploiting the fact that only a subset of the possible memory contents are valid, most single-bit errors can be corrected when the memories are protected with a parity bit.
List of the following materials will be included with the Downloaded Backup:We can provide Online Support Wordlwide, with proper execution, explanation and additionally provide explanation video file for execution and explanations.
NXFEE, will Provide on 24x7 Online Support, You can call or text at +91 9789443203, or email us nxfee.innovation@gmail.com
Customer are advice to watch the project video file output, and before the payment to test the requirement, correction will be applicable.
After payment, if any correction in the Project is accepted, but requirement changes is applicable with updated charges based upon the requirement.
After payment the student having doubts, correction, software error, hardware errors, coding doubts are accepted.
Online support will not be given more than 3 times.
On first time explanation we can provide completely with video file support, other 2 we can provide doubt clarifications only.
If any Issue on Software license / System Error we can support and rectify that within end of day.
Extra Charges For duplicate bill copy. Bill must be paid in full, No part payment will be accepted.
After payment, to must send the payment receipt to our email id.
Powered by NXFEE INNOVATION, Pondicherry.
Copyright © 2024 Nxfee Innovation.