Category Archives: Processors

HPC (High Performance Compute) Cluster with MPI and ArchLinux

The following is a simple guide to setting up a cluster server and nodes using Arch Linux. The advantage of this approach is the flexibility of setting up a computer capable of high-speed parallel computation using commodity hardware.
The procedure will be broadly similar for most Unix-based systems. The preference for Arch is driven by its philosophy of keeping it simple. 'Simple' is defined from a technical standpoint, not a usability standpoint: it is better to be technically elegant with a higher learning curve than to be easy to use and technically poor. Thus, for a base system that should be as lean and fast as possible, the minimalist Arch base install is perfect for the task at hand.

Open MPI

The Open MPI Project is an open source MPI-2 implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to combine the expertise, technologies, and resources from all across the High Performance Computing community in order to build the best MPI library available.

Machine setup

This guide assumes:

  • all the machines have been formatted and Arch base system installed according to the guide
  • the machines are connected via a TCP/IP network, with the IP addresses and hostnames noted down, as they will be required in later steps
  • each machine has a common login account (in this case baloo)
  • all machines are using the same processor architecture i686 or x86_64

It's always a good idea to have the latest, up-to-date Arch system, so a quick: pacman -Syu

SSH setup

Open MPI communicates between the nodes and the server over a secure connection provided by the OpenSSH secure shell. The full details of OpenSSH options can be found on the Arch wiki or the main OpenSSH site. Here the bare minimum is given to get a cluster up and running.

Installing openssh

Accomplished by calling: pacman -S openssh
The default configuration for sshd (the server daemon) is enough for our needs. Inspect /etc/ssh/sshd_config, making sure all options are sane, then continue.

Generating ssh-keys

To allow the server to communicate with the nodes without a password being requested at every instance, we shall use SSH keys to enable seamless logins. Accept the defaults as given. No passphrase is selected; although inherently less secure than with one, this precludes the need to set up key management via a keyring.
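The key generation described above can be sketched as follows (a minimal sketch assuming the default RSA key type; the key path and the empty passphrase match the no-passphrase choice described):

```shell
# Generate a keypair for the common user (run as baloo on each machine).
# -N "" sets an empty passphrase; -f gives the default file location
# explicitly instead of accepting it at the prompt.
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
```

This leaves the private key in ~/.ssh/id_rsa and the public key in ~/.ssh/id_rsa.pub, ready to be copied to the server in the next step.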

Copying Keys to the server

Start the SSH daemon (rc.d start sshd) on both the server and the slave nodes, and copy the public key from each node to the server. These will all end up in the home directory of our common user, /home/baloo/.ssh/
The server's public key and each of the public keys copied over from the nodes are then appended to the authorized_keys file at ~/.ssh/authorized_keys on the server. To enable two-way communication, it is then possible to copy this file back to all the nodes afterwards.
IMPORTANT: make sure the permissions for the following allow reading and writing only by the owner:
chmod 700 ~/
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
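The copy-and-append steps above can be sketched as follows (a sketch, not a definitive recipe; "server" and "node1" are placeholder hostnames):

```shell
# On each slave node: copy the node's public key to the server.
scp ~/.ssh/id_rsa.pub baloo@server:/home/baloo/node_key.pub

# On the server: append each copied key (and the server's own key)
# to the authorized_keys file, then push the combined file back to
# every node so logins work in both directions.
cat ~/.ssh/id_rsa.pub ~/node_key.pub >> ~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys baloo@node1:/home/baloo/.ssh/

# Tighten permissions so only the owner can read and write.
chmod 700 ~/ ~/.ssh
chmod 600 ~/.ssh/authorized_keys
```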

Logging into the remote machines via SSH should no longer require a password.

NFS setup

Open MPI requires the programs that are to be run to be in a common location. Instead of copying the program executable over and over to the slave nodes, we set up a simple NFS shared directory, with the actual folder living on the server, from which all the nodes will mirror the contents.

Server Configuration

Create the directory that will be shared (/parallel in this instance) and edit /etc/exports to have the directory exported to the remote nodes:
/parallel    *(rw,sync)
Then change the ownership of the shared directory to nobody:
chown -R nobody:nobody /parallel
Finally, edit /etc/conf.d/nfs-common.conf as needed.
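The server-side steps above can be sketched in one go (a sketch, run as root on the server):

```shell
# Create the directory that will be shared across the cluster.
mkdir -p /parallel
chown -R nobody:nobody /parallel

# Export /parallel read-write to all hosts, with synchronous writes.
echo '/parallel *(rw,sync)' >> /etc/exports

# Re-read /etc/exports without restarting the NFS daemons.
exportfs -ra
```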

Client Configuration

Edit /etc/fstab to include the following line so the clients can mount the shared /parallel directory (substitute the server's hostname or IP address for "server"):
server:/parallel /parallel nfs defaults 0 0
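On each node, the client-side setup can be sketched as follows (a sketch, run as root; "192.168.1.10" is a placeholder for the server's address):

```shell
# The mount point must exist before anything can be mounted on it.
mkdir -p /parallel

# Add the NFS mount to /etc/fstab so it survives reboots.
echo '192.168.1.10:/parallel /parallel nfs defaults 0 0' >> /etc/fstab

# Mount everything listed in fstab now, rather than waiting for a reboot.
mount -a
```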

Daemons Configuration

Setting the appropriate daemons to launch on start-up simply requires modifying /etc/rc.conf and adding the appropriate entries.

On the server:

DAEMONS=(……. sshd rpcbind nfs-common nfs-server ……)

On the client nodes:

DAEMONS=(……. sshd rpcbind nfs-common ……)

OpenMPI setup

With the preliminary setup out of the way, we can now install the OpenMPI package. It comes with built-in wrappers for C, C++ and Fortran; additionally, the Python wrappers can also be installed. It should be installed on both the server and the nodes:
pacman -S openmpi python-mpi4py python2-mpi4py
*The Python wrappers are there if you want to implement the parallel programs in MPI for Python.

OpenMPI Configuration

To let Open MPI know on which machines to run your programs, create a hostfile in the default user home directory. If /etc/hosts was set up you can use the host names here; otherwise the IP addresses of the machines work just as well. ~/mhosts:

# The master node is a dual-processor machine, hence slots=2
localhost slots=2
# The slave node is a quad-core machine, hence slots=4
Or1oN slots=4

Running Programs on the cluster

To run myprogram on the cluster, issue the following commands from the /parallel directory:

$ mpirun -n 4 --hostfile ~/mhosts ./myprogram
$ mpirun -n 4 --hostfile ~/mhosts python

or

$ mpiexec -n 4 --hostfile ~/mhosts ./myprogram
$ mpiexec -n 4 --hostfile ~/mhosts python
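As a quick sanity check of the whole cluster, a minimal mpi4py program can be dropped into the shared directory and launched with mpirun (a sketch; hello.py is a hypothetical example name, placed under /parallel so every node sees it via NFS):

```shell
# Write a tiny mpi4py test program into the shared directory.
cat > /parallel/hello.py <<'EOF'
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Each rank reports its id, the world size and the host it runs on.
print("rank %d of %d on %s" % (comm.Get_rank(), comm.Get_size(),
                               MPI.Get_processor_name()))
EOF

# Launch 4 copies across the machines listed in the hostfile.
mpirun -n 4 --hostfile ~/mhosts python /parallel/hello.py
```

If the SSH, NFS and hostfile setup is correct, one line per rank should be printed, with the hostnames spread across the server and nodes.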


Posted by on April 18, 2012 in code, Hardware, Processors, Uncategorized



Futures in Neuromorphic Computing

Which chip will emerge the victor in the new race to beat Moore's law and finally give us the intelligent machines we've been told are in our future? Whether by one being out-competed or by a marriage of convenience, it is still far too early to tell. Briefly, some background, as I may be running off with the premise of this piece even before the starting gun.


The transistor technology that has sustained the electronics and computer industry for the past twenty-plus years has grown by leaps and bounds (thank you, Moore's Law), enabling massive computational devices to proliferate at a fraction of the cost they would have commanded in the preceding years. Even from the earliest times, when a PC took up an entire room and drew as much power as a small town, the dream of AI has been slowly gaining traction. However, it was realised early on that the positronic brains we so desire for our robots would not be realised by the hardware at hand. Fast forward to the present, where the problem still persists: no matter how many processor cores one throws at it, the crop of supercomputers built to simulate an artificial intelligence still holds to that same principle of a large roomful of boxes drawing enough power to run a small town (the more things change). However, a fundamental difference from the early efforts in AI research is that, with advances in neuroscience, we now know better how the functioning of the brain might be simulated by artificial means.

The hardware side of AI research has revealed a fundamental flaw in the model: the von Neumann architecture.
“The von Neumann architecture is a design model for a stored-program digital computer that uses a central processing unit (CPU) and a single separate storage structure (“memory”) to hold both instructions and data. The separation between the CPU and memory leads to the von Neumann bottleneck, the limited throughput (data transfer rate) between the CPU and memory compared to the amount of memory. In most modern computers, throughput is much smaller than the rate at which the CPU can work. This seriously limits the effective processing speed when the CPU is required to perform minimal processing on large amounts of data. The CPU is continuously forced to wait for needed data to be transferred to or from memory.” – Wikipedia

This is functionally different from the way a brain organises its information, let alone processes it:

“A biological brain is able to quickly execute this massive simultaneous information orgy—and do it in a small package—because it has evolved a number of stupendous shortcuts. Here’s what happens in a brain: Neuron 1 spits out an impulse, and the resultant information is sent down the axon to the synapse of its target, Neuron 2. The synapse of Neuron 2, having stored its own state locally, evaluates the importance of the information coming from Neuron 1 by integrating it with its own previous state and the strength of its connection to Neuron 1. Then, these two pieces of information—the information from Neuron 1 and the state of Neuron 2’s synapse—flow toward the body of Neuron 2 over the dendrites. And here is the important part: By the time that information reaches the body of Neuron 2, there is only a single value—all processing has already taken place during the information transfer. There is never any need for the brain to take information out of one neuron, spend time processing it, and then return it to a different set of neurons. Instead, in the mammalian brain, storage and processing happen at the same time and in the same place.” – IEEE Spectrum

This brings us to the first of the next generation processing elements based on memristor technology.


From the ground up, a memristor, whose existence was theorised in the ’70s and actualised by HP Labs in —, is in application like an FPGA: realising functions that need several transistors in a CMOS circuit, with the added advantages of non-volatile memory (no power required for state refreshing) and a structure that is remarkably defect-tolerant.

The memristor layer interacts with the CMOS logic layer of the hybrid chip and, according to the circuit configuration, is able to realise any number of logic gate structures. The process of creating the hybrid chip leaves the underlying CMOS layer untouched, and the redundant data paths of the crossbar architecture allow routing around defective areas. In neuromorphic computing applications, with the memristors as synapses and the transistors as the neurones, unsupervised learning becomes an actual possibility. A current work in progress at Boston University, MoNETA, aims to realise a general-purpose AI able to adapt to solving a problem without prior training, prior training being essentially a brute-force technique with little room for creative problem solving. It uses hundreds of normal PE cores sandwiched in a memristor layer, where memory is localised to a super-cache, immediately accessible and relying on very little power to maintain the information.

The software in this case, for modelling the neurological topology, is being handled by Cog Ex Machina, a special-purpose OS.


The next contender to step up to the plate of a neuromorphic chip is the chaogate. I must confess I'm particularly attached to this one, and not just because of butterflies. Partial differential equations and the way their solutions arise bring out some of the most beautiful patterns, and I like to think brains work similarly, if we could only see. As far as chip construction is concerned, a new type of gate has been developed recently, able to reconfigure itself to provide different logic gates – hence, chaogates. Different from FPGAs, where reconfiguration is achieved by switching between RCLGs, chaogates morph via the patterns inherent in their constitutive nonlinear element. Modern computers depend on Boolean logic, in which any logical operation can be realised by NOR and NAND gates. The chaotic processor is taken as a 1D system whose state is represented by x and whose dynamics are given by a nonlinear map f(x); if the necessary and sufficient conditions are simultaneously satisfied by f(x), it is able to implement the full set of logical operations.

It also becomes possible to implement combinational logic directly: case in point, the half adder, involving two AND gates (for the carry) and an XOR (which sums the first digit), is implementable with one 1D chaotic element. A full adder requires three iterations of the single chaotic element, giving us efficient computational modules without cascading.

Development by ChaoLogix using standard CMOS techniques has led to an array with a morphing ALU giving higher functions (multiplier and adder) in less than one clock cycle, and communications morphing between two different protocols (synchronous serial data link or serial computer bus) in less than one clock cycle. Arrays can conceivably be programmed on the run, with threshold values being sent from an external program to optimise for the task at hand.

Current efforts are aimed at optimising the design of a chaogate to sizes similar to or smaller than NAND gates. As a caveat, the developers add that programming chaogates will require the development of a new hardware description language; in its current absence, ideas from evolutionary algorithms are being considered as viable alternatives for achieving optimal array configurations.


While this piece focuses on the hardware advances of recent months, on the software side of things Numenta deserves a nod for its work on recreating a workable model of the human neocortex using its HTM approach. On the open-source side, dust seems to be gathering, with the last activity on projects like OpenAI being about four years ago.

With recent advancements tackling the whole problem of AI from a new perspective, it is high time a proper open stack was available to enable the faintest vestiges of consciousness to be breathed into our computers. So say we all.


Image credits: “Positronic Brain” by Fernando Laub
“Optical Micrograph of CMOS chip with memristor” [Nano Lett., 2009, 9 (10), pp 3640–3645, DOI: 10.1021/nl901874j]
“Chaogate Element” – American Institute of Physics [DOI: 10.1063/1.3489889]

Posted by on December 26, 2010 in code, Hardware, Processors


SBC (Single Board Computer) using the 8086

The aim is to design and model an 8086-based computer and add several interfaced peripherals to it. The system has been designed to meet the following requirements:

  • Total 32Kx16 SRAM
  • Total 64Kx16 EPROM
  • Parallel I/O ports
  • Analog-to-digital I/O ports

A block diagram of the system shows the functional units' relationships to each other. The descriptions that follow are based on this model and will thus be referred to as per the module each currently appears in. Below is a preliminary sketch of the completed system. CAVEAT: this is how the completed system will look; however, the pin-outs may change depending on the availability of ICs.

I will be going through the different modules step-by-step in the upcoming posts.

ARES schematic

ARES schematic of SBC


Posted by on December 17, 2010 in Hardware, Processors, Uncategorized