Ian Howson on Ian Howson

Why the MacBook Pro 15" has a discrete GPU

ian@mutexlabs.com (Ian Howson) — Thu, 19 Jan 2017 00:00:00 +0000

For a long time, I couldn’t figure out why the 2016 MBPr15 had a discrete GPU. It’s not significantly faster than the integrated GPU, it’s no good for games, there aren’t many games on Mac anyway, and it doesn’t provide any significant improvement for creatives (Photoshop, Final Cut, etc).

It comes at a large cost. It takes a huge amount of space, reducing battery capacity and increasing weight. It requires extra cooling, consuming space, reducing battery capacity and increasing weight. The extra power consumption… you guessed it… costs battery capacity and increases weight. dGPUs on the previous 15” line have been a nuisance (massive, unexplained power consumption, lower customer satisfaction). And because it’s AMD, you can’t do any significant GPGPU workloads (NVIDIA’s CUDA is miles ahead of AMD’s stuff).

I think I’ve worked out why Apple had to put a dGPU on the MBP15.

5K is Apple’s future

Windows machines are going to standardise on 4K. It will be cheap, thanks to consumer TVs, and it’s well supported already by hardware.

Apple is pushing 5K.

Technical challenges will keep the PC manufacturers out
27” is about as big as you can make a display before it’s impractically large
2560px wide (unscaled) is the right pixel density for a 27” display
For video editors, 5K lets you run an unscaled 4K video and have some space for software controls and chrome
Photo and video work needs to be pixel-perfect (no scaling) so you can’t just scale 4K to 2560px wide

1920px wide is too coarse on 27” and 3840px is too fine. Windows still lags with HiDPI support (it’s basically there, but there are a lot of rough edges). With the sole exception of the MacBook Air 13”, Apple now has HiDPI on every single device.

When Apple released the 5K iMac, they had to design and manufacture custom chips. No PC manufacturer wants to do that. They ship whatever they can buy off the shelf.

5K requires multistream

Current 5K tech requires two display streams. It’s not transmitted over the wire as one super-high-resolution display. It’s split into two smaller ones and reconstructed on the display. From the PC’s point of view (but hidden by the OS), a single 5K display is two small displays arranged a certain way.

Intel’s integrated GPUs have a maximum of three outputs

You can see where I’m going with this. The iGPUs have three outputs. You need two to run a 5K display. So with the iGPU, you can run the internal display and one external 5K display.

That’s kind of shoddy for a top-of-the-line machine; people expect to run two external displays. So the only thing Apple can do is add more outputs. How do you add more outputs? You need to add a whole GPU!

AMD was probably chosen because NVIDIA’s parts usually have the same three output limitation. (Possibly one could run the iGPU and dGPU together, but I imagine there would be technical headaches and the performance gap would be difficult to explain). AMD’s GPUs typically support six outputs. That’s exactly three 5K displays.
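The output arithmetic above can be sketched in a few lines. This is only an illustration: the function name and its parameters are made up here, and it assumes the internal panel takes one output and each 5K display takes two, as described above.

```python
def max_external_5k(gpu_outputs, internal_displays=1, streams_per_5k=2):
    """Number of external 5K displays that fit in the outputs left over.

    Assumes each internal panel uses one output and each 5K display
    needs two streams (hypothetical model of the argument in the post).
    """
    remaining = max(gpu_outputs - internal_displays, 0)
    return remaining // streams_per_5k


print(max_external_5k(3))                       # three-output iGPU -> 1
print(max_external_5k(6))                       # six-output dGPU -> 2
print(max_external_5k(6, internal_displays=0))  # six outputs, no internal panel -> 3
```

With three outputs there is room for the internal display plus one external 5K display; with six there is room for the two external displays people expect.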

Also, you don’t want to add a big GPU on a laptop (unless it’s gaming specific). GPU peak power consumption is way higher than for CPUs. You need to design the machine to remove the consequent heat output. Some software runs the GPU hard unnecessarily, just as it does the CPU. This isn’t obvious to users, but they complain that random software (like Flash) kills their battery life and makes the machine hot. For these users (the vast majority!) the best thing you can do is limit their power consumption by giving them underpowered hardware. They probably won’t notice and can’t do much damage.

The future?

Intel’s iGPUs will probably eventually support more outputs, single-stream 5K, or both. The dGPU will then be totally redundant and can be removed for weight reduction or battery life improvement. This won’t be a reality for a few more years, at least. In the meantime, Apple has a unique feature and competitive advantage.

Intel is putting more and more GPU power onto their regular CPUs all the time. There’s not really a middle ground any more. Current iGPUs are good enough for all non-gaming tasks. Gaming requires a high-power dGPU, and Apple’s not catering to that market.

GRAPE: the Generic Risk Assessment Process Explained

ian@mutexlabs.com (Ian Howson) — Tue, 10 Jan 2017 00:00:00 +0000

Practically every risk assessment process is the same. Rather than reading boring stuff, use my handy checklist. I will then authorise you an additional fifteen (15) minutes of Facebook time to be used outside of working hours.

1. Write down all of the things that can go wrong

All of them.

Come on, use your imagination.

2. For each thing that can go wrong, tell me:

What is it?

How likely is it to happen? (Likelihood)

e.g. Trump gets elected President

You eat a second dessert with dinner tonight

Does the pope shit in the woods?

Supposing it happens, how bad will it be? (Impact)

Somebody ate the last bagel

Facebook is down

I’d lose my job. Oh, and people might die or something.

Will you go to prom with me? (Seriously)

3. How bad is it? (Risk)

How should I know? Just check some boxes!

(Risk matrix: Impact on one axis, Likelihood on the other.)
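If you would rather not check boxes by hand, the likelihood/impact lookup can be sketched as below. The three-level scale, the multiplication and the thresholds are all assumptions made for illustration; real processes define their own matrix.

```python
LEVELS = ("low", "medium", "high")  # hypothetical three-point scale


def risk_rating(likelihood, impact):
    """Combine likelihood and impact into a single risk rating.

    Scores each level 1-3 and multiplies them; the cut-offs below
    are illustrative, not taken from any standard.
    """
    score = (LEVELS.index(likelihood) + 1) * (LEVELS.index(impact) + 1)
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    return "low"


print(risk_rating("high", "low"))     # likely but harmless
print(risk_rating("medium", "high"))  # plausible and nasty
```

Re-running the function with lower likelihood or impact figures is the code equivalent of applying a control in step 4.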

4. Bonus points: can you do anything to control the risk?

Go back and change the likelihood and impact figures yourself. What am I, your mother?

5. Are you happy with your current level of risk?

You might want to tell an adult. Your manager, your CEO or your dog are good candidates. Really, tell as many people as you can so that when they come looking for a scapegoat, you have someone else to point to.

Disclaimer

This is meant to be funny. It’s still a better process than no process. Risk management is serious business and you should take all due care. I am not a lawyer, get professional advice, take two aspirin and call me in the morning, etc.

Attacks on embedded systems

ian@mutexlabs.com (Ian Howson) — Fri, 06 Jan 2017 00:00:00 +0000

System

I’m not going into too much detail here as they’re well covered in other material. Consider attacks on:

The physical device – what physical controls does the device have to prevent tampering?
- Tamper switches
- Locks
- Armor (e.g. safes)
Network interfaces
- Particularly those with no encryption – Nordic, BLE, Zigbee
Web interfaces
Software interfaces
Social engineering (your staff)
Cryptographic protocols. Many embedded systems use cryptosystems with known attacks, very short key lengths, or simply incorrect implementations. Because you have control over the host hardware, timing attacks are also much easier to execute.
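To illustrate the timing attack point, here is a minimal sketch of why an early-exit comparison leaks information. The function `naive_equal` and the token values are hypothetical; `hmac.compare_digest` is the standard library's comparison intended to resist this kind of leak.

```python
import hmac


def naive_equal(a, b):
    """Early-exit comparison. Returns (equal, bytes_compared).

    The work done depends on how many leading bytes match, which is
    what a timing attack measures.
    """
    if len(a) != len(b):
        return False, 0
    compared = 0
    for x, y in zip(a, b):
        compared += 1
        if x != y:
            return False, compared
    return True, compared


secret = b"s3cret-token"  # made-up value for the demo
print(naive_equal(secret, b"x3cret-token"))  # (False, 1): wrong first byte
print(naive_equal(secret, b"s3cret-tokeX"))  # (False, 12): wrong last byte
print(hmac.compare_digest(secret, b"s3cret-tokeX"))  # False
```

An attacker who can time the comparison precisely, which is easier with the device on their own bench, can guess a secret one byte at a time.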

Extracting firmware

The chief weakness of an embedded device is that it’s physically not in your control. The attacker has total control of a single device, and if they learn enough about the software stack, they can develop exploits that work across many devices.

Bypassing microcontroller code locks
RAM/parallel bus sniffing
Read firmware through the bootloader
Reading a serial flash chip without removing it from the PCB
Reading code from microcontrollers
Remove a flash chip from the PCB
SPI bus sniffing

Extracting keys

Many embedded devices carry valuable private crypto keys. Methods to extract these include:

Acoustic key extraction from chips
Differential power analysis
Glitching

Attacks on chips

Very few ICs are designed with security in mind. They contain valuable firmware and crypto keys. Methods to attack them include:

Decapping
Microprobing
Optical key extraction from chip backside
Optical ROM extraction
Partial flash reprogramming through light exposure

Other

The hardware of embedded devices can be manipulated in interesting ways to expose security problems.

Connecting debuggers
Device cloning (by manufacturer or third party)
Finding JTAG ports
Finding serial ports
Identity cloning by connecting multiple devices to the same legitimate identity chip
Inhibiting clocks
Inhibiting reset
Manipulating RTCs
Modifying serial numbers and identity chips

Firmware analysis

Firmware can be obtained from a running device or as an update package from the vendor. The challenge, then, is to make sense of it and find security flaws.

Binary analysis is well covered in the existing reverse engineering literature, but there are some embedded-specific tools available.

binwalk
decompilers

Is responsible disclosure appropriate for IoT devices?

ian@mutexlabs.com (Ian Howson) — Tue, 03 Jan 2017 00:00:00 +0000

Story time.

When I was an undergraduate, we did a research project. I looked at how difficult it was to break symmetric crypto using FPGAs. Nothing super novel, but it added some data points to the literature.

We presented these projects to industry. An IBM employee asked, “why are you breaking crypto? Are you a terrorist?” I did a double take. She was serious.

I’d forgotten that regular people on the street haven’t internalised why we intentionally break things. So let’s lay it out there and see if it still works.

Why we break things

We don’t know how secure our {software|devices|systems} are. We can’t prove the absence of security problems. We can prove the presence of security problems.

We don’t know what other people know, either. If we have reports of what is broken, we conservatively assume that attackers already know this and assess our risk accordingly. Absence of reports gives us weak confidence that a system has not been broken (but negative reporting is not commonplace at this time, so it doesn’t tell us much).

Let’s take a concrete example: breaking DES. I want to use DES in an application. My one-line threat model is “I want it to withstand attack by opposing militaries”. Should I use DES?

Well, the literature (my paper!) says that back in 2003, you could break a DES-encrypted message with a $100 FPGA board in about three weeks. My 2016 estimate is that you can break DES in about 6 seconds, if you can muster the computing power of the Bitcoin network.

A large proportion of the Bitcoin network is controlled by one government. If I needed to keep something secret from an opposing military for more than 6 seconds, I would not use DES.

If my threat model says “I want it to withstand attack by college students”, then yeah… you can probably use DES! Maybe not against Engineering students, but most of them, sure.
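For a feel of the numbers in the DES example, the sketch below converts a time-to-break figure into the key-testing rate it implies. It assumes a brute-force search of the full 56-bit keyspace; the post does not say whether its figures are worst case or average, so treat the output as an order-of-magnitude guide only.

```python
DES_KEY_BITS = 56  # DES has a 56-bit effective key


def implied_key_rate(seconds, key_bits=DES_KEY_BITS):
    """Keys per second needed to try every key within `seconds`."""
    return 2 ** key_bits / seconds


def seconds_to_search(keys_per_second, key_bits=DES_KEY_BITS):
    """Time to try every key at a given rate."""
    return 2 ** key_bits / keys_per_second


three_weeks = 21 * 24 * 3600
print(f"three weeks -> {implied_key_rate(three_weeks):.2e} keys/s")
print(f"six seconds -> {implied_key_rate(6):.2e} keys/s")
```

Plugging your adversary's plausible key rate into `seconds_to_search` gives a rough answer to “how long does my secret stay secret?”.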

Crypto and attacks by foreign powers are a clear-cut case. We know that computing and attacks improve constantly. Defenders (people, governments, militaries, companies) have a legitimate case to use cryptography. Attacking cryptography and publishing results of breaks is therefore valuable.

Responsible disclosure

Another common case is where we attack software – say, a public-facing web application. Let’s call it BookFace. You attack BookFace wanting to know if your private data is safe there. You find problems. What do you do with this information?

Keep it to yourself?

Don’t tell anyone. This does the public a disservice; an attack is possible but they don’t know about it. Most government infosec organisations (e.g. the NSA) take this approach. They want to keep attacks to themselves so they can be used later.

Sell it?

People will buy information on how to perform an attack. This is where malware black markets come from. This puts a lower bound on the value of a discovered attack. If you (the vendor) want some other form of disclosure, you’d better make sure that what you’re offering is better than the market price of the attack. Not all attackers are motivated by the mad props they get by publishing papers.

Tell the world about it?

You could just tell the world. You get your mad props immediately. Like in the academic world, there is no risk that someone else will publish your discovery and steal your mad props.

Crypto research takes the ‘tell the world’ approach. Crypto is widely deployed and there’s no single entity that can do anything about a break. Adversaries are usually governments. You might as well just publish the result and flee to a country with no extradition laws.

If you’ve just breached BookFace, anyone can now breach BookFace and steal people’s private data. You might have good intentions, but not everyone does.

Tell the vendor about it?

You could tell the vendor first. For a while, this would result in lawsuits and gag orders from the vendor trying to force you to never reveal details of the attack.

Most vendors nowadays are more enlightened and will work with you. This is the path that Responsible Disclosure takes.

You tell the vendor in private.
They get a reasonable amount of time to fix the problem.
You can then publish details of the attack.
You get mad props.
The vendor gets a more secure system.
Nobody’s nudes get leaked.

Let’s make a checklist of Desirable Outcomes For Responsible Disclosure:

Vendor can fix the system
Researcher gets mad props
Users remain secure in their warm beddy-bies

But for IoT…

IoT vendors usually can’t fix their problems.

The vendor does not physically control the device
The vendor usually cannot push remote updates to a device
Users might not know that the device exists
Users might not care
Updating the device might require downtime, and not every IoT device is permitted downtime
Infrastructure devices take a long time to update
Regulatory restrictions will make approval to distribute an update very slow and with tremendous cost to the vendor

So let’s review the checklist:

Researcher gets mad props: check
Vendor can fix the system: not really
Users remain secure: no

By publishing the attack, we didn’t really improve the state of the world. Miscreants can reuse the attack but vendors and users remain insecure.

Who’s the vendor?

Crypto algorithms are used by everyone. We don’t bother to disclose attacks because there is no single authority.

Websites and software have a single controlling entity. Websites are the ideal case for responsible disclosure because they are completely in control of a single entity. They update, everyone updates.

Desktop software and apps are controlled by the user, but most users will update most of the time. Not always. Windows updates are forced; this comes at a cost to the user but is a responsible thing for the broader community.

IoT devices have a single vendor, but that vendor has very little control over the deployed device. Disclosing an attack to an IoT vendor doesn’t really help them.

Are you on the right side?

Responsible disclosure is about improving vendor and user security while making sure researchers are recognised for their research efforts.

Consider attacks on content control systems – DVD, Blu-ray, XBox, Playstation, iOS, Pay TV. The security system is designed to:

prevent anyone other than the manufacturer from distributing content (i.e. stop piracy)
for games, ensure a secure and fair computing platform (i.e. stop cheaters)

So let’s say you publish an attack on one of these. The vendor will improve their future systems, but they can’t do anything about the existing ones (these are all embedded devices in people’s homes, of course). Depending on the severity of the breach, that content might be open forever and cheating rampant.

So, let’s look at our Responsible Disclosure checklist again:

Vendor can fix the system: sort of. Maybe the attack can be mitigated by a software update. Maybe not.
Researcher gets mad props: Oh yeah. These are high-value systems with significant effort put into defence. These are gold.
Users remain secure: Well…

The user might not want to be secure. Users generally don’t like content protection. The security system protects the vendor, not the user.

So by breaking content protection, did you do good? Did you improve the world? That’s a tricky question which I’m not touching; “content wants to be free”, “commercial interests need to be protected”, “artists won’t produce art if they won’t be compensated” and so on. I’m not getting into that.

So what to do?

Let’s think:

If you (the researcher) publish, jerkasses will use your attack. Vendors and users can’t do much about it.
If you don’t publish, you don’t get recognition, and users/vendors remain ignorant of the vulnerabilities in a product.

One major problem with academic science research is that papers often lack enough detail to reproduce the result. Specifically, software and datasets are not released. The researcher gets props but does not advance human knowledge. We don’t even know if they did the research or just made it up.

Recently, there is movement in the ‘full disclosure’ direction: everything is released. People can repeat the analyses. This can be bad for the publishing researcher (the new information can be used to dispute the result) but is good for scientific progress as we have a fuller, more confident ‘truth’ than before.

Likewise, with security research, you could publish everything. Method, firmware images, exploit. Make it easy for someone to verify your result. This poses a particular risk for IoT systems where it is difficult for the vendor to react to the publication.

You could publish just enough to prove that an attack is possible, in line with traditional scientific research. The recent Pay TV hacks are a good example; the methods are shown along with some pretty compelling evidence. We don’t know for sure that it works; a demonstration device is great evidence but not absolute proof. There’s not enough information that you could do it yourself. The existence of the paper and talks makes it much easier to do, but the attack is technically challenging enough that the most common attackers (people at home) probably can’t execute it.

This is a good position. The vendor and users are not significantly hurt. The researcher shows evidence of a successful attack on a challenging system.

There’s no universal answer. For IoT, the fundamental problem is that devices aren’t updated. It’s not even clear that “vendors must do remote OTA updates” is a good strategy. Not all devices can afford downtime.

Why extract firmware?

ian@mutexlabs.com (Ian Howson) — Tue, 20 Dec 2016 00:00:00 +0000

In a black-box penetration test – say, for a web application – the attacker has very limited knowledge of how the software works. All that you know is what you can gather from the outside. This makes it difficult to detect vulnerabilities. At the other extreme, copy protection of desktop software has been completely unsuccessful. The attacker controls the hardware and therefore can manipulate the software, bypassing security controls.

For an attacker, an IoT device starts somewhere in the middle. The attacker does not have the firmware and must probe from the outside. If the attacker can obtain or manipulate the firmware, their job becomes much easier. Common ways that an attacker can obtain the firmware are:

download it from the Internet (e.g. as a device update that your company releases)
convince the device’s firmware to send it
extract it from the device hardware

Once the attacker has a copy of the firmware, it’s usually easy to figure out what’s inside using standard software security techniques. Binwalk is a firmware analysis tool which can tell you what’s inside a firmware image.
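As a rough idea of what a tool like Binwalk looks for, the sketch below scans an image for a few well-known magic numbers. It is a toy signature scan, not Binwalk's implementation: the signature table is deliberately tiny and the demo image is synthetic.

```python
# A few well-known magic numbers; real tools carry far larger tables.
SIGNATURES = {
    b"\x1f\x8b\x08": "gzip stream",
    b"hsqs": "SquashFS filesystem (little-endian)",
    b"\x27\x05\x19\x56": "U-Boot uImage header",
}


def scan(image):
    """Return a sorted list of (offset, description) for every match."""
    hits = []
    for magic, label in SIGNATURES.items():
        pos = image.find(magic)
        while pos != -1:
            hits.append((pos, label))
            pos = image.find(magic, pos + 1)
    return sorted(hits)


# Synthetic "firmware": padding, a uImage magic, padding, a SquashFS magic.
demo = bytes(16) + b"\x27\x05\x19\x56" + b"\xff" * 60 + b"hsqs" + b"\xff" * 32
for offset, label in scan(demo):
    print(f"0x{offset:08x}  {label}")
```

Short signatures produce false positives on real images, so treat each hit as a place to look rather than a confirmed finding.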

External threats

The attacker might be able to find out:

operating system type and version (and can then learn any known vulnerabilities for same)
any third party software (and documented bundled vulnerabilities)
any hidden services, especially those used for manufacturing and testing
password hashes and – surprisingly frequent – plaintext passwords
web routes that are not visible from the public web service (again, often leftovers from manufacturing and testing)

Given a firmware image, the attacker may be able to run it on their own hardware (with greater privileges) or obtain the source code from the Internet. This all gives the attacker more information to work with and more opportunities to find a vulnerability.

An attacker that can modify firmware and have your device run it has some new opportunities:

They can use their hardware to run their own software.
They can bypass software controls. For example, a door lock could be modified to unlock in response to a special card. A multiplayer video game could show players that are hidden behind walls.
They can mount impersonation attacks. For example, an attacker could remove a device from a target site, modify the firmware (removing or modifying controls) and reinstall it in the target site.

Protection of secrets

Some device firmware contains secrets – valuable IP, maintenance passwords, company secrets or content decryption keys.

Device cloning

A common fear of vendors is that their manufacturing partners will manufacture and sell devices without the involvement of the vendor. Pirate SD cards are a real problem, and there have been stories of Kickstarter projects being cloned and sold in China. Low-quality clone devices cost both profits and reputation for the business.

In response, some vendors only have their manufacturing partners write basic bringup firmware to the device – just enough to test that the hardware is working correctly. This probably helps, but there are a multitude of other ways to obtain firmware – assuming that the pirate manufacturer doesn’t just write their own.

How to extract firmware from a device

ian@mutexlabs.com (Ian Howson) — Tue, 20 Dec 2016 00:00:00 +0000

Through the application

If you (the attacker) can get a shell on the device, usually it is trivial to SCP the filesystem contents out of the device. This shell might come from any of the traditional software or network vulnerabilities. It is also common for serial ports on the device to expose a login shell.

Standard software and network security countermeasures apply. In particular, try to disable debug ports in production firmware.

Through the bootloader

The bootloader is in a particularly weak position in an IoT device:

It is usually unencrypted
There are minimal security controls
It is accessible through multiple ports – usually an onboard serial port and often through network interfaces
It must be able to access main device Flash in order to boot the system

It is common for just the bootloader to be written to the device during manufacturing. The final firmware image is written to the device through the bootloader at a later stage. It is unusual that the bootloader is configured to disable console and Flash access after device programming is complete.

Conveniently for old folks like me, bootloaders usually still use RS232 serial ports for access. The old XModem/ZModem/Kermit protocols are often shipped and can be used to copy files off device Flash.

Countermeasures:

Customise the bootloader to allow the bare minimum facilities required to bring up the device (booting, Flash writes and network drivers). Disable Flash/memory reads and filesystem access.
Disable debug ports after device configuration is complete
Use your platform’s secure boot facilities (if any)
Once device firmware has been written, reconfigure the bootloader to only allow booting, not memory/Flash reads and writes

Through JTAG, programming and debug headers

It’s extremely unlikely that you can remove these facilities altogether; they’re important for manufacturing. You can make small gains by making them difficult or inconvenient to use:

Don’t connect main Flash to the JTAG chain
Don’t label the ports
Use test pads instead of pins, connectors or holes in the PCB. Bonus points if you can cover the pads (e.g. conformal coating) after firmware has been written.
Groups of pads or pins are especially suspicious to the attacker. The number of pads and some voltage measurements will usually reveal the purpose of the whole group. Spread the pads around the PCB.
Put traces on inner layers on the PCB where they’re more difficult to access. Consider a security mesh over the top to disable the device if broken.
Disable CPU debug facilities (both in software and by not connecting the pads to the PCB)

By accessing the memory bus

This is unusual and difficult. Notably, this method was used to extract encryption keys from the original XBox (page 125). Many of the same countermeasures apply:

Put the traces on inner layers of the PCB and add a security mesh
Memory encryption is a possibility, but is generally ineffective and costly
iPhone and many Android phones stack the RAM onto the same package as the CPU/SoC, making it physically challenging to access the memory bus without destroying the whole package

By removing the Flash chip

I have never encountered an IoT device which did not give me a useful firmware image once its main Flash was removed and dumped. There must be someone shipping a device with encrypted Flash; please get in contact with me if you find one!

Some devices ship with an SD card as their boot media. The solution here is obvious: remove the card and dump it using a regular computer. Once, I had to get the soldering iron out as the device didn’t use a socket. That’s as tough as it gets.

(I enjoy the irony that an 8-bit CPU can boot from an SD card that contains a 32-bit CPU – and that this is the most cost-effective way to do things in some applications.)

Most devices will use a soldered-on Flash device. There are a few common variations:

Serial SPI NOR (usually small – kilobytes to a few megabytes)
Serial NAND (larger, bigger CPUs)
Parallel NOR/NAND

Many designs will allow the Flash chip to be read while still on the PCB. This is great for the attacker; less risk of damage to the device and less work. For a serial Flash device, this is achieved by lifting the TX/RX/MISO/MOSI lines and connecting them to an off-the-shelf device programmer. You can do the same with a parallel part, but it’s usually easiest to just remove the whole chip and put it in a socket.

Another option is to solder to the exposed Flash pins and disable the main CPU – either by holding it in reset or disabling the clocks.

Removing a Flash chip is easy. The exact method varies depending on the board, but the simplest thing for most devices is to heat it with a hot air rework station and lift the chip off the board with tweezers. Higher density boards make this more challenging. Sometimes as an attacker you don’t care about destroying the host PCB – anything you learn from one device will be applicable to another anyway.

Countermeasures

By far the best countermeasure to physical Flash attacks is to encrypt the firmware and use a trusted boot facility. Then, an attacker has nothing of value to extract from the Flash. Even without trusted boot, encrypted firmware is a lot better than nothing.

Failing that, you’re limited to physical controls:

Glob or epoxy the Flash to the board
Use a parallel Flash chip; it raises the cost of attack
Use a BGA part. This prevents in-place access to the Flash chip, raises the cost of inserting the part into a socket and makes reassembly more challenging. Make sure you don’t expose the vias to the outer layers of the PCB, or you’ve just given attackers a convenient place to solder to!
Hide data traces on inner layers of the PCB

Xilinx FPGAs have an interesting solution to this. Their firmware (bitstream) is stored encrypted in an external Flash chip. The decryption keys are stored in write-only SRAM within the FPGA package. An external battery keeps the key storage SRAM alive. The keys never need to leave the chip.

There are obvious downsides to this – a battery is expensive and large – but if you’re using an FPGA already, cost, space and power consumption are probably not big constraints.
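The trusted boot idea mentioned in the countermeasures above boils down to “refuse to run firmware you can’t verify”. The sketch below shows only that decision, simplified to a SHA-256 digest comparison. Real platforms typically verify a signature in boot ROM or dedicated hardware; the function name and the sample images are made up for illustration.

```python
import hashlib
import hmac


def firmware_is_trusted(image, expected_digest):
    """Return True only if the image hashes to the trusted digest.

    `expected_digest` stands in for a value held somewhere an attacker
    cannot rewrite (an assumption of this sketch).
    """
    actual = hashlib.sha256(image).digest()
    return hmac.compare_digest(actual, expected_digest)


good_image = b"\x7fFIRMWARE v1.0" + bytes(64)       # synthetic image
trusted = hashlib.sha256(good_image).digest()       # recorded at manufacture
tampered = good_image.replace(b"v1.0", b"v6.6")     # attacker-modified copy

print(firmware_is_trusted(good_image, trusted))  # True
print(firmware_is_trusted(tampered, trusted))    # False
```

Note that this check stops modified firmware from booting; it does nothing to stop the Flash contents being read, which is what the encryption is for.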

By removing the microcontroller

If you’re looking at a small device (usually 8-bit), it’s likely that the CPU, RAM and Flash are integrated onto the same die.

These parts all have code locks that prevent the program Flash from being read externally. There are plenty of examples of people defeating those locks. Many of these attacks require the microcontroller to be removed and the package melted off. This is beyond the ability of most at-home attackers.

Enable the code lock after production firmware has been written
Consider globbing the part to the PCB
Many microcontrollers ship with code security features such as self-destruction and encryption

Miscellaneous

ian@mutexlabs.com (Ian Howson) — Mon, 19 Dec 2016 00:00:00 +0000

Cryptography demands that your CPU and memory subsystems be perfect

In normal operation, your device can, surprisingly, tolerate a lot of small errors. The occasional bitflip in RAM won’t hurt anything and a slightly out-of-spec CPU (e.g. low voltage or noisy power supply) will work well enough. Most applications don’t do enough CPU/RAM work that errors are a big problem.

Cryptography is different; it will stress your CPU and RAM for a long period. It is also completely intolerant of errors. A single flipped bit will completely ruin the result of a crypto operation.

A quick-and-dirty test, for Linux systems at least, is to repeatedly hash a file in RAM. If you’ve got a tmpfs mounted on /tmp, for example, you can:

# Fill most of RAM with random data (urandom will not block waiting for entropy)
dd if=/dev/urandom of=/tmp/junk bs=1M count=<most of RAM - e.g. '12' for a 16MB machine>
md5sum /tmp/junk

Repeatedly run the md5sum. If you ever get different results, you’re seeing memory corruption.
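If the device has a Python interpreter (an assumption; many small devices won’t), the repeated hashing can be automated along these lines. The function names and the temporary-file demo are illustrative only; on a real device you would point it at the large file on tmpfs.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so it doesn't need extra RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def count_mismatches(path, passes=100):
    """Hash the file `passes` times; count results that differ from the first."""
    reference = file_digest(path)
    return sum(1 for _ in range(passes - 1) if file_digest(path) != reference)


if __name__ == "__main__":
    # Small self-contained demo using a temporary file of random bytes.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(1 << 20))
    try:
        print("mismatches:", count_mismatches(tmp.name, passes=10))
    finally:
        os.remove(tmp.name)
```

Any non-zero count means the same bytes hashed differently on different passes, which points at the CPU or memory rather than the software.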

Pressures on highly regulated industries

ian@mutexlabs.com (Ian Howson) — Mon, 19 Dec 2016 00:00:00 +0000

Change is expensive

Highly regulated industries like automotive, aerospace and medical have a common pressure: change is expensive.

These industries need to comply with regulations before their product is allowed to be sold. On top of designing, developing, testing, marketing and selling a product, meeting regulatory requirements takes massive amounts of resources.

Every market has different regulatory requirements, so products can’t be sold in a region until that regulator is satisfied.

Regulators typically take months, occasionally years, to receive a submission, process it and respond. While the regulator is busy, the company will move its efforts to the next product. This works against security efforts; while regulators are doing their thing, engineers have forgotten about the product or left the company. By the time you come back to produce an update, nobody will remember why or how the old device worked.

Should a regulator reject your submission, it will take a long time (weeks to months) before you can resubmit and try again. This delay kills startup companies.

Most of the time, firmware is an item that regulators want to control. You’ll need to prove that the firmware is safe and has been well tested. Sometimes there are mandatory design standards that must be met. This goes beyond just the firmware that your company writes: any software that is included in your product must meet some standard. This is a huge problem for any Linux-based system as you can’t realistically prove the safety of a large body of third-party code. You can test and submit it as a black box, sometimes.

Complexity is your enemy, more so than usual. Simpler designs mean fewer parts and less external software. This means simpler and faster regulatory submissions. The tradeoff? Security, of course! It’s tough to justify additional work on a device when that work disables functionality to guard against a possible future event. You need to get this thing shipped now!

Regulators pay attention to cryptography. Some markets restrict its use, some limit its strength, and some outlaw it entirely. If your product uses cryptography, you’ve limited the markets that you can sell to and made your interactions with regulators slower.

Regulators are paying more attention to security, though there’s nothing concrete right now. Their concerns are mostly around user safety. DDoS from virus-infected pacemakers, not so much.

So what?

There’s a massive cost to changing the design. You’re going to choose an old CPU that is boring, trusted and well documented. It needs to be available for a long time in the future.

Despite tremendous advances in mobile CPUs, you probably can’t use them because they’ll only be manufactured for a small number of years. If your CPU becomes obsolete, you have to redo the hardware design and resubmit it to the regulators; that’s expensive.

You’re going to emphasise simplicity. Fewer hardware components means better reliability, and less documentation where you have to prove that they’re safe. Smaller firmware means less testing and less documentation. Less can go wrong.

You’re going to develop more in-house rather than buy solutions in. Because you’re manufacturing and supporting for decades, your toolchains and any third-party software need to be stable for a long time. Stability is a problem for security, though – a small security fix might need you to pull in newer, untested code. What works for web development (constant updates and change) works poorly for IoT.

You’re going to make as few changes to the design and the firmware as possible because changes might need to be approved by the regulators again. Even minor changes can have unexpected implications.

For many devices, the firmware that they receive at manufacture time is the most recent firmware they’ll ever receive. It needs to be solid! You probably can’t just patch it remotely later. If your simple bugfix has a surprise bug, that bug will be out there forever.

But… pacemakers can be remotely hacked!

Sort of. Not really.

In this light, it is easy to understand the decisions made by pacemaker manufacturers:

The CPU won’t support much, if any, cryptography – it needs to be boring, low power and reliable.
Their device has to run for decades.
There’s a reasonable control against “attacker reprograms pacemaker”: they have to be physically close to the device. You could just stab the victim with a knife!
The surgery is highly invasive. Any reduction in reliability is really bad.
IEC 62304 makes it difficult (but not impossible) to use externally sourced software. An SSL stack is a large, complex piece of software. Even symmetric cryptography raises new regulatory hurdles.
People receiving pacemakers already have a serious health issue. You probably don’t want to delay any medical treatment by adding extra security controls (e.g. authentication) to the process.
Medical device companies tend to protect their IP through lawyers and patents, not technology.

Software developers shouldn't build threat models

ian@mutexlabs.com (Ian Howson) — Mon, 19 Dec 2016 00:00:00 +0000

Often, the software or firmware developers end up building threat models. This is a terrible idea for two reasons:

1. It conflicts with their goals.

Software developers have one priority: ship the product faster. Faster faster faster. You’re asking them to produce a report (taking time) which will require them to implement controls (taking more time). Some of the controls will be burdensome and require major changes. So they will silently omit threats which are difficult to control, downgrade their probability/impact, or select an ineffective but easy-to-implement control.

2. The software developers in your company probably have little or no security training.

Wait, let me back up. Do you even have software developers? A large proportion of IoT vendors don’t employ software developers at all – the hardware/electrical engineers write the firmware. Sometimes the firmware is outsourced, so the firmware vendors aren’t interested in satisfying long-term needs.

So, IF you have software people and IF they have security training and IF they ever use it, they might be in a position to discuss security issues. But given point (1) – effective threat modelling conflicts with their goals – it’s not a great idea.

So who builds the threat model?

Someone who isn’t interested in the ship date. Preferably, someone who has no relationship with your firmware team.

If you use an external consultant, they can deliver the bad news and leave without causing conflict between staff.

What do I do with the threat model?

It becomes software requirements. (You have requirements, right?)

Because they’re requirements, the developers are more free to select an appropriate control. It fits in with their existing workflow and gives them something to test against. (You have tests, right?)

I know, I know: most orgs don’t have requirements or tests. Hopefully your threat model does not show any significant risks to your business. If it does, you’re now in a much better position to advocate for proper testing. You can clearly show the risk and cost of NOT testing.

Why is IoT security different?

ian@mutexlabs.com (Ian Howson) — Mon, 19 Dec 2016 00:00:00 +0000

Cost

Most IoT devices sold nowadays are sold as a piece of hardware. You pay once for the hardware and it is expected to work for a long time. As a result, the vendor is under tremendous pressure to keep the manufacture price of the hardware low; it entirely dictates their profit margin.

Because there is pressure to reduce manufacturing costs, every component comes under scrutiny. Do we need 64MB of RAM, or can we get by with 32? A less capable CPU will shave 50 cents off the BOM cost. The hardware capabilities are reduced to the bare minimum, and unfortunately for security, features like cryptography tend to be demanding of hardware resources. The vast majority of IoT devices in the wild are simply not capable of strong crypto as it is presently used.

Business models

Traditional software costs nothing to duplicate. There are two common business models:

Buy once, use forever
Ongoing subscription

The software industry is moving to subscription plans because consumers expect regular updates and support. They also expect to get new features, but hate to pay for them. Software vendors incur large costs to support and update a software product after it has already been sold.

Much of the support and update burden is security patches.

IoT devices mostly use the “buy once, use forever” model. Unfortunately, this means that the vendor has little incentive to update their device once it has been released. Updates cost money. They would prefer that customers buy a new device and throw away the old one.

There are some businesses which discount hardware costs by selling an ongoing subscription (e.g. Internet or cell phone service), but in these cases the service is valuable and the hardware is a necessary cost. You wouldn’t pay $5/month to use an IoT lightbulb, for instance. Where a service must be provided for a long time, it is usually priced into the up-front purchase price. No IoT lightbulb costs $200 to manufacture.

As a result, IoT vendors rarely release security patches for their products.

Hardware capabilities

IoT devices use different CPUs to those found in a modern laptop or desktop. They are always less powerful – sometimes dramatically so. Where a typical (2016) laptop will have 8GB of RAM, there are CPUs in IoT devices which have less than 20 bytes (yes, bytes!) of RAM. There are several reasons for this:

Cost, obviously. Smaller CPUs are cheaper.
Smaller CPUs use less power. Less power means less heat, longer battery life, less cooling, smaller size and lower manufacturing cost.
IoT devices typically integrate most or all of their peripherals onto the CPU package, further reducing size/cost/power.
Many IoT devices handle real-time tasks, and these are often easier to develop on a non-desktop-class operating system.
Some IoT devices are manufactured for an extended period (over five years) with minimal hardware changes (aerospace, automotive, medical). The parts must be manufactured for at least that time. Typically, older parts with fewer capabilities must be selected.

There are thousands of architectures in common use. I discuss a few common classifications in hardware classes.

The result of all of this is that not all IoT devices are capable of strong cryptography. At the time of writing (2016), IoT devices with the same crypto capabilities as a desktop PC are rare.

This isn’t as simple as “Moore’s Law will fix it”. The non-cost benefits (power, development effort, predictability) of smaller CPUs are enormous. It’ll be a long time before we can fit something with the power of a Raspberry Pi (400MHz 32-bit ARM, 50-5000mW) into the space and power envelope of a TinyAVR (4MHz 8-bit AVR, 5mW) – and the AVR would still boot faster.

Software development practices

Software (firmware!) is usually developed alongside the hardware device. It’s very common for firmware to be developed by someone who isn’t a specialised software developer (often they’re electrical engineers first and learn software development on the job). They are therefore less likely to be educated in security practice than a software developer.

Software/firmware is also seen as a cost and usually takes longer than the hardware development. It therefore delays release to market and is abbreviated as much as possible.

Physical environment

IoT devices operate in all imaginable physical environments – underwater, inside human bodies, in space.

As a result:

The vendor usually doesn’t control the physical environment
The hardware can be damaged or operate incorrectly out of its designed physical environment
The hardware itself is an avenue that attackers can use to learn more about the device
Physical and temporal proximity are often used as security controls

Remember, a Fitbit is a $100 computer whose purpose in life is to be shaken. You would never do this to a regular computer!

Unattended operation

Despite mainstream media reporting, IoT devices already surround you and control much of your life. You don’t know that they exist. You’re certainly not aware that they need to be maintained.

Many classes of IoT devices – building controls, SCADA devices, medical implants – need to operate for a long time with no human intervention. For the security practitioner, that means that they need to operate securely for a long time with nobody patching them or monitoring them.

An unpatched Windows XP machine on the Internet will be compromised within a few minutes, but at least someone will notice that it has been compromised. The IoT devices in tomorrow’s news story have already been deployed somewhere and forgotten.

Huge variability in architectures

On the desktop, practically all machines run Windows on an Intel CPU. On servers, Linux or Windows on Intel. On phones, iOS or Android on ARM.

On IoT devices, there is no dominant platform. There are hundreds of different CPUs and dozens of different operating systems. Many devices use a custom operating system or no operating system at all. Even stock operating systems are heavily customised.

If an attacker compromises iOS or Windows, they can reuse the same method over a massive install base. Because they’re constantly attacked and have strong corporate backing, they’re very robust at this point in time.

IoT devices are all different. They’re generally very easy to compromise, but the same exploit isn’t usable against many devices. Given that attackers have a finite amount of time to spend attacking and exploiting devices, they’ll spend their effort on more lucrative (effort * impact) targets.

How do we fix IoT security?

ian@mutexlabs.com (Ian Howson) — Tue, 13 Dec 2016 00:00:00 +0000

So, given the constraints that I’ve ranted on about over and over again:

The business can’t easily switch to a subscription model
There is downward pressure on hardware costs
Cheap devices will probably never have good security controls
Devices in the field will never be updated
Developers are only interested in the ship date, not security
Users don’t understand security, won’t learn it and won’t pay more for it

What are we to do? How do we improve the current state of affairs?

Produce a threat model and publish it

It’s a big dream, but I would like to see every device have a threat model produced, and I would like the vendors to publish those threat models.

This serves two purposes. One, the vendor thinks about security, if only briefly. And two, consumers use the threat model to judge if it’s appropriate to their situation (or, more likely, judge based on whether it has been published at all).

Don’t rely on the user to make a security decision

Users, generally, do not take the time to learn about security. They also don’t necessarily make the right security decision. Where possible, you need to make it for them.

This will sometimes conflict with your goals. If you’re selling a WiFi access point, you’ll get fewer support calls if you leave it open by default. Adding reasonable security (unique WPA2 passwords on each device) costs you money in manufacturing time and documentation, but especially in support.

Apple doesn’t require passcodes and Touch ID, but it strongly encourages users to use them. It also enables disk encryption by default. These are good decisions. They’re well tolerated by most users.

Many IP cameras use a default password and open UPnP ports by default. These are bad decisions. If the user does not explicitly intervene (they won’t!) then the camera is exposed to the world.

Forcing software updates is a good step in this direction, though users tend to hate it.

Government regulation

Probably nothing will come of regulation. Design and manufacturing occur across several countries; regulation would need to apply at the point of sale, as the EU does with RoHS. Unlike RoHS, which is easy to define (you can’t use this list of materials), ‘adequate security’ is completely different depending on the type of device and the context in which it is used.

For many product categories, adding regulatory overhead would completely kill the category as a business.

It’s difficult regulation to write. You can’t write something blanket like “all comms must be encrypted”; this would make many devices impossible to design. Security needs to be tailored to the environment. As a result, a conventional risk management process is more appropriate (identify the risks, implement appropriate controls, then the regulator signs off to say that you’ve done that well enough.)

Product categories where there is significant risk of personal or property damage (medical devices, cars, industrial equipment, aircraft) might need specific security regulation. Chances are, they already have regulation by virtue of being high risk.

Regulation can also be politically motivated. You don’t want bogus media coverage of ‘pacemakers can be remotely hacked’ or ‘terrorists can explode your laptop’ to impact your business.

Positive and negative reporting

I’ve advocated for reports of negative results – i.e. “we pentested this device and did not find any problems”. I believe that these are more useful than positive (“dis shit be busted”) reports.

If negative reports became more commonplace, they might give vendors a reason to actually think about security. They want people to publish good things about them!

Right now, researchers have to speculate as to the sort of security threats a device is designed to handle. If researchers publish a report claiming vulnerabilities that a device was never intended to handle, the vendor loses both ways: they paid the cost of implementing security and they still got bad press. If vendors explained their threat model up-front, it primes the conversation; all future discussion will be from that reference point, and vendors get to choose that reference point.

Right now, ‘security researcher publishes bad report’ is the most plausible security threat that many vendors face.

We’ve got the ‘stick’ side of incentives right – vendors that ship bad security sometimes get bad press. We should have a ‘carrot’ side too – vendors that take the time to document and openly discuss their security decisions get good press.

Make the network resilient

Mirai and botnets are not an IoT phenomenon. They’ve been around for decades, often using unpatched desktop/laptop machines. There are still Windows XP machines out there, and they’re not receiving updates any more!

We need to get more vendors producing updates, and we need to get more end-users installing updates, but we can never patch everything. I think a more pragmatic way to proceed is to make the network tolerate and/or prevent malicious behaviour.

Broadly, botnets are used for:

DDoS
Sending spam
Bitcoin mining (probably not any more)
Relays or proxies for other nefarious activities

We can’t do much about relaying, but DDoS and spam are easy to detect: lots of outbound traffic. DDoS is usually a lot of IP/UDP to a small number of IPs. Spam is particularly easy to detect because it’s identifiable by destination port number!
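Here’s a minimal sketch of what those two checks might look like on a router. The `Flow` record, the threshold and the sample addresses are all hypothetical; a real router would get its flow data from wherever its firmware exposes it, and the threshold would need tuning. The spam check relies on the fact that consumer devices rarely have a reason to talk to TCP port 25 directly.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Flow(NamedTuple):
    """One outbound flow summary per time window (hypothetical format)."""
    dst_ip: str
    dst_port: int
    proto: str      # "tcp" or "udp"
    packets: int


SMTP_PORT = 25               # server-to-server mail delivery
UDP_FLOOD_PACKETS = 10_000   # hypothetical per-destination limit per window


def looks_like_spam(flows: Iterable[Flow]) -> bool:
    """Flag any outbound TCP traffic to the SMTP port."""
    return any(f.proto == "tcp" and f.dst_port == SMTP_PORT for f in flows)


def looks_like_ddos(flows: Iterable[Flow]) -> bool:
    """Flag a large volume of UDP packets aimed at a single destination."""
    per_target = Counter()
    for f in flows:
        if f.proto == "udp":
            per_target[f.dst_ip] += f.packets
    return any(count > UDP_FLOOD_PACKETS for count in per_target.values())


# Sample windows for three imaginary devices (documentation-range addresses).
normal = [Flow("192.0.2.10", 443, "tcp", 200), Flow("192.0.2.53", 53, "udp", 40)]
flood = [Flow("203.0.113.7", 80, "udp", 50_000)]
spam = [Flow("198.51.100.9", 25, "tcp", 12)]

for name, window in [("normal", normal), ("flood", flood), ("spam", spam)]:
    print(name, looks_like_ddos(window), looks_like_spam(window))
```

This is deliberately crude. It says nothing about what to do once a device is flagged, which is the harder problem.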

Consumer routers could potentially rate limit this traffic and/or warn the user (though communicating with the user is an unsolved problem). Internet service providers, likewise, could detect this on the consumer side. Both, however, would incur costs to do so.

Network-enforced killswitches

A more radical proposition would be to globally share information on compromised devices, like what we do with spam blacklists. Routers could automatically take corrective action (blocking UPnP or all network traffic) to bad devices.

At a basic level, you’d want to be able to block ranges of MAC addresses. Some sort of software version detection would also be needed. This could potentially cover XP machines.
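Blocking a range of MAC addresses is mostly prefix matching. As a rough sketch, assuming a shared blocklist that is distributed as hex prefixes (the prefixes and function names below are made up for illustration):

```python
HEX_DIGITS = set("0123456789abcdef")

# Hypothetical blocklist entries: a whole 24-bit vendor prefix, and a
# narrower 28-bit range inside another vendor's allocation.
BLOCKED_PREFIXES = {"001122", "aabbcc0"}


def normalise_mac(mac: str) -> str:
    """Return the 12 lowercase hex digits of a MAC address."""
    digits = mac.replace(":", "").replace("-", "").lower()
    if len(digits) != 12 or not set(digits) <= HEX_DIGITS:
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


def is_blocked(mac: str, blocked_prefixes=BLOCKED_PREFIXES) -> bool:
    """True if the address falls inside any blocked range."""
    digits = normalise_mac(mac)
    return any(digits.startswith(prefix) for prefix in blocked_prefixes)


print(is_blocked("00:11:22:33:44:55"))   # inside the first range
print(is_blocked("AA-BB-CC-10-00-01"))   # same vendor, outside the blocked range
```

A prefix only identifies the vendor and a block of devices, not a firmware version, which is why the version detection mentioned above would still be needed.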

Consumer Internet routers are unfortunately very price-sensitive. It’s unlikely that they would add the software to do this, especially given it might harm the vendor’s other products.

Done properly, this might incentivise vendors to properly address security in their products rather than risk being blacklisted.

Consumer education

For the entire history of computing, consumer education on security has been a total failure. Users just don’t care about security and they don’t want to learn about it.

The most effective security is either enforced on users (which they hate) or is built-in and convenient – e.g. Touch ID.

Better frameworks

Ehhh.

So here’s the thing. The popular dev boards from Intel, Google, Raspberry Pi and so on – they’re all Linux machines. Which is fine. But:

Because they run Linux, only expensive devices can use them.
Because they run Linux, you don’t need to think that hard about the software stack. We’ve got decades of great network software stacks for desktop-class machines! You can afford to go crazy:
- all comms go over OpenSSL
- all flash is encrypted
- use the trusted boot features
They don’t do anything about the IoT-specific issues like hardware security, your business being uninterested in shipping updates, or your electrical engineers having no security training.

So go and build your product on one – you can find out quickly if it’s going to work. Building on cheap, small hardware is a premature optimisation. Just don’t be surprised if manufacturing and/or management ask you to save $20 BOM cost by switching away from Linux and your favourite framework.

Negative reporting and security research

ian@mutexlabs.com (Ian Howson) — Tue, 13 Dec 2016 00:00:00 +0000

In academia, there’s an entrenched problem where negative results are not published. Researchers found something interesting and published a paper; that’s great! Now we need other teams to confirm or deny whether this happened. This does not happen. Reproductions (either positive or negative) are rarely published, and negative results especially (we ran the test but did not observe the same outcome) do not get published.

This is bad, because a single claim of a result is not very strong evidence. If multiple independent sources achieve the same result, we can be confident that it is true.

Likewise, in security, we only report positive results. We only report on things which are broken. All day, every day, every IoT device that people look at has security problems.

Every. Single. One.

So you might as well assume that everything is insecure. Security almost demands that you take that approach – it’s the safe, conservative way to build secure systems! Don’t know how secure something is? Assume that it is insecure and plan accordingly.

What is useful in this environment is… negative results! “We tested this device and were not able to break into it.” That’s extremely useful information! Yeah, you want someone to double-check it, to try more things. You want to know what things the researchers found that smelled funny but didn’t constitute a vulnerability. You want to know what the researchers tried. You can then refer back to your threat model, adjust your ‘probability’ figures appropriately, and have a better idea of what risks your business is exposed to.

Please publish negative results. Please tell us attacks you tried but which failed. You’ll stand out as doing something odd. You might look silly if someone contradicts you later. But it’s the best information we can get right now.

Design assuming your security controls will fail

ian@mutexlabs.com (Ian Howson) — Sat, 10 Dec 2016 00:00:00 +0000

When you produce your threat model, you’re multiplying the probability of a threat by the impact of that threat.
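As a small illustration of that multiplication, here’s a sketch that scores and ranks a few threats. The 1 to 5 scales, the threat names and the numbers are all invented for the example; your own threat model will have its own scales and entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    name: str
    probability: int  # hypothetical scale: 1 (rare) to 5 (almost certain)
    impact: int       # hypothetical scale: 1 (nuisance) to 5 (catastrophic)

    @property
    def risk(self) -> int:
        """Risk score: probability multiplied by impact."""
        return self.probability * self.impact


def rank(threats):
    """Highest risk first."""
    return sorted(threats, key=lambda t: t.risk, reverse=True)


# Invented entries, purely for illustration.
threats = [
    Threat("default password left unchanged", probability=5, impact=3),
    Threat("firmware signing key extracted", probability=1, impact=5),
    Threat("debug port left enabled", probability=3, impact=4),
]

ranked = rank(threats)
for t in ranked:
    print(f"{t.risk:>2}  {t.name}")
```

The point of the rest of this post is that the probability column drifts upward over a device’s lifetime, so the controls you pick should also shrink the impact column.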

IoT devices have three characteristics that work against you, as a device designer:

They probably won’t get updates. It’s not in the financial interests of the business.
Even if an update is available, users aren’t likely to install it. Users often don’t realise that the IoT device exists!
Some IoT devices operate for a long time – decades, in some cases. Think SCADA hardware, HVAC controls or implantable medical devices.

As a general rule, attacks only get better with time. As a designer, you’re not just defending against the attacks that exist today – you need to defend against future attacks that haven’t been invented yet!

For these reasons, I believe that you should design your device assuming that it will be breached. Once you adopt the mindset that some or all of your security controls will fail, you’re in a better position to design them to minimise the impact of a breach.

Key revocation

Key revocation isn’t a great strategy given the constraints of IoT devices, but there’s plenty of prior work that you can draw from.

For example, the designers for DVD and Blu-ray assumed that some of the encryption keys would be leaked. They’re distributed in every single device (or software instance), they’re difficult to control (DVD/Blu-ray players are cost sensitive) and attackers have a strong incentive to extract the keys (high quality duplication and distribution of content). Knowing this, they designed the system so that leaked keys could be revoked and new content could not be played on compromised hardware. This also incentivises the hardware designers slightly – if one of their players lost a key, they would be ‘punished’ by having their players unable to play new content. (One might debate if this is a punishment or a blessing; they would have to explain to customers why their player won’t play new content, but many customers would just buy a new player.)

Defence in depth

If you use the model that individual controls will fail, the obvious solution is to have multiple controls for a particular risk.

The controls need to be as independent as possible. Having two separate software checks for unauthorised access doesn’t help you if the attacker uses a debugger to bypass them both. Having a software check and an external hardware check would help.

Partitioning

Keep high-risk areas of the system separate from low-risk areas. Where possible, separate independent risky systems. You don’t want a breach in one subsystem to impact another.

SSH (the software) does this through “privilege separation”. Parts of it need to run as the root (administrative) user, but most of it can operate as a less-privileged user. To reduce the attack surface, SSH is split into different sections with different privilege requirements.

Cars are another great example. Modern cars want integration between all car systems – the user wants to be able to control everything from one interface. On one hand, you could reduce manufacturing costs by sharing CPUs and networks between the media and safety-critical systems. On the other hand, you don’t want a vulnerability in your media system to affect safety-critical systems. Best to keep them separate and control the interfaces between them carefully.

Canaries and tamper detection

If you’re storing sensitive data (e.g. content decryption keys) you might be better off destroying the keys and/or device if an intrusion is detected.

This can be done in software. What an intrusion looks like varies tremendously between applications, but you might look for changes in memory that should not change, commands on debug ports, or ‘kill’ commands on interfaces that follow the same pattern as regular commands but are designed to catch fuzzing.
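The ‘kill command’ idea can be sketched as a command handler with a few trap opcodes mixed in among the real ones. Everything here is hypothetical (the opcodes, the class, the responses); the point is only that a trap looks like a plausible command, no legitimate client ever sends it, and receiving one destroys the secret.

```python
class Device:
    # Hypothetical command set. The real commands are 0x01 and 0x02.
    VALID = {0x01: "read_status", 0x02: "set_level"}
    # Trap opcodes: plausible-looking neighbours that only a fuzzer or a
    # curious attacker would send.
    TRAPS = {0x03, 0x7F}

    def __init__(self, key: bytes):
        self._key = bytearray(key)
        self.tampered = False

    def has_key(self) -> bool:
        return any(self._key)

    def _wipe(self) -> None:
        """Overwrite the key in place and latch the tamper flag."""
        for i in range(len(self._key)):
            self._key[i] = 0
        self.tampered = True

    def handle(self, opcode: int) -> str:
        if self.tampered:
            return "locked"
        if opcode in self.TRAPS:
            self._wipe()
            return "locked"
        return self.VALID.get(opcode, "unknown")


device = Device(key=b"\x13\x37" * 8)
print(device.handle(0x01), device.has_key())  # normal command, key intact
print(device.handle(0x03), device.has_key())  # trap hit, key wiped
print(device.handle(0x01), device.has_key())  # stays locked afterwards
```

On real hardware the tamper flag would need to survive a reboot, and you’d have to decide whether a bricked device is acceptable for your product.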

If you’ve got a Linux system, a great strategy would be to build AppArmor profiles for your application. Any AppArmor violations could trip a self-destruct. (Just limiting your application is a great start, even if you do nothing about violations!)

If you (the vendor) have remote telemetry coming from devices in the field, it might alert you to attacks in progress.

You can also use hardware modules that store secrets in a tamperproof manner. They cost money, of course.

Remote killswitch

The device can be remotely deactivated – perhaps by network control, perhaps by a physical button or radio signal.

This is useful for devices which can cause personal injury if something goes wrong – autonomous aircraft, industrial equipment, surgery robots, perhaps even cars.

Ideally, you want the ‘killswitch computer’ to be separate from the ‘application computer’ – a compromised or damaged application computer might not execute the kill command correctly.

A recent example of this is the Galaxy Note 7 recall, where an OTA update disables charging to reduce the risk of fires.

Frequently asked questions

ian@mutexlabs.com (Ian Howson) — Thu, 08 Dec 2016 00:00:00 +0000

TL;DR: what is going wrong?

There is tremendous pressure for IoT devices to be cheap to manufacture. Cheap means that hardware capabilities are the bare minimum, firmware and security design quality suffer, and there will be no support after the device is sold. As a result, at release time, most devices have obvious vulnerabilities. The older a design gets, the more we learn about breaking it and the weaker the device gets. As there is no support, none of these vulnerabilities will be resolved.

TL;DR: how do we fix it?

As a designer, the OWASP IoT recommendations are a good place to start. Produce a threat model and a risk analysis.

Globally, the device side can’t be fixed. Most new devices will be full of vulnerabilities forever. People need to know that this is the case and trust devices accordingly. Our networks need to be resilient against potentially malicious devices.

Why is IoT special?

It isn’t. The vast majority of software, systems and cloud security applies to IoT with no modification.

IoT varies because:

hardware and firmware are bundled together, causing severe cost constraints
hardware and firmware are often developed by the same team; they usually don’t have software security training
the business usually gets revenue through hardware sales, not subscription services, and so there’s no incentive to produce security updates after the sale is made
often hardware capabilities are tiny, so you don’t have the same range of security controls available
you don’t physically control the hardware, opening up a range of physical and electronic attacks

Why don’t we just put regulations on device manufacturers?

Because all device manufacturing is done in China. How does the U.S. or EU mandate something as vague as ‘secure device’ in China?

Besides, as a consumer, would you pay another $40 for your Internet router? Or an ongoing fee to keep the firmware up-to-date? Of course not! You love cheap stuff and unless you’re in the narrow subset of the population that actually knows that infosec is a thing, you’re going to buy the cheapest device that does what you want.

Who are you and why should I listen to you?

I’ve been shipping embedded systems for over a decade. Most of them connected to the Internet. Some of them have been implanted into people’s bodies. Some of them have been hacked. I’ve also spent a lot of time attacking them, both as a penetration tester and as part of my own test procedures.

The same keys on every device

ian@mutexlabs.com (Ian Howson) — Wed, 07 Dec 2016 00:00:00 +0000

In “How does firmware get onto the device?”, we learned that every single IoT device is identical after leaving the manufacturing floor. This is different to traditional network security where every computer has unique private keys and unique passwords.

Common variations between devices

Anything with an Ethernet, WiFi or Bluetooth interface gets a unique MAC address. Theoretically, these could be programmed at test/bringup time. In practice, this is hard to do (manufacturing floors don’t have reliable network access and can’t afford downtime). It turns out that EEPROM vendors will sell you tiny EEPROMs preprogrammed with a range of guaranteed unique MAC addresses, so they just populate the board with one of those. The production process is thus consistent! Many modern CPUs/SoCs/radio interfaces also provide a built-in unique device ID or MAC address, and so we use those.

(There are plenty of Ethernet devices out there that don’t have unique MAC addresses at all; they just generate one randomly and hope for the best. Thirty cents saved. Just sayin’.)

You can script your device to generate its own private keys at first boot. I know that Raspberry Pi firmware images do this. I don’t know of any production IoT devices that do this, probably because it delays bringup by a minute, and on a manufacturing floor, more time means more floor space means higher manufacturing cost.

So, in practice, we have thousands to millions of devices being shipped which have exactly the same private keys and exactly the same hidden passwords.

Why is this a problem?

Remember risk analysis: likelihood of breach times impact of a breach.

For something like an SSL private key, the likelihood is very low. Consider the XBox public key attacks. We’ve got unlimited access to the public keys and hardware and known plaintexts. But without being able to generate that private key, we’re stuck.

If that private key is leaked or reconstructed somehow, the whole system falls apart. People can write their own software and run pirate games. There’s massive impact that would end that product line and most future development for it.

Consider a DVD player. It contains a symmetric encryption key which is shared across that class of players. If one player’s key leaks, the key is leaked for all players using that key. There was consideration given to this in the DVD CSS scheme (new content will not be decryptable using a revoked key), but the impact is still huge. All past content is now decryptable because one device out of billions was compromised. So the likelihood and impact are both fairly high; this is (was) a high-risk system.

Impersonation becomes a real issue. If you’re trying to attack a particular device on a target network (say, a router or a camera), you don’t have to gain access to that exact device. You can buy a device of the same type and attack that instead, in the comfort of your own home/office/dungeon, taking as much time as you like. You can buy a hundred of them and subject them to a range of attacks, including destructive attacks. You can backdoor one of your devices and physically swap it with the target device. And in extreme examples – say the device carries a private key – you can extract the private key from your device and use it to perform crypto-level attacks on the target device.

None of this is possible if each device has its own unique private keys.

OK, so I’ll generate private keys on first boot

Did you put in a reasonable source of randomness? Most embedded devices have no random number source; many don’t even have a clock (which is a weak but tolerable substitute). There’s no point in generating keys on first boot if every device ‘randomly’ generates the same keys because they’re starting from the same state. Manufacturing processes encourage every device to be identical!

Your unique identifier can help a lot. It’s predictable, but at least every device will be different.
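Here’s a minimal sketch of the problem and the partial fix. It uses Python’s `random.Random` purely to make the behaviour reproducible; it is not a cryptographic generator and must never be used for real keys. The `first_boot_key` function and the serial numbers are hypothetical.

```python
import hashlib
import random


def first_boot_key(initial_state: int, unique_id: bytes = b"") -> bytes:
    """Derive a 16-byte 'key' from a PRNG seeded with boot state plus an optional unique ID.

    Demonstration only: random.Random is deterministic and not suitable for key generation.
    """
    seed_material = initial_state.to_bytes(8, "big") + unique_id
    seed = int.from_bytes(hashlib.sha256(seed_material).digest(), "big")
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(16))


# Two devices off the same line boot from the same state: identical keys.
same_a = first_boot_key(initial_state=0)
same_b = first_boot_key(initial_state=0)

# Mixing in a per-device ID (hypothetical serial numbers) makes them differ,
# although anyone who knows the ID can still reproduce the key.
diff_a = first_boot_key(initial_state=0, unique_id=b"SN-0001")
diff_b = first_boot_key(initial_state=0, unique_id=b"SN-0002")
```

The unique ID buys you difference between devices, not secrecy. You still want a real entropy source mixed in wherever the hardware can provide one.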

How does firmware get onto the device?

ian@mutexlabs.com (Ian Howson) — Wed, 07 Dec 2016 00:00:00 +0000

One of the big differences between IoT devices and software is that IoT devices are manufactured. Manufacturing processes focus on consistency and reproducibility; all variations must be eliminated.

Once the hardware is assembled, the firmware must be written to the device. Usually, this is done in one of three ways: during testing, using pre-programmed chips, or during chip manufacturing.

During testing

During manufacturing, a hardware assembly will undergo basic electrical testing. Usually this is achieved by putting the assembly into a test jig, connecting to test points on the assembly with pogo pins, and checking that various signals and voltages on the board are within tolerance. Assemblies that fail these checks are returned for rework or discarded.

This is a convenient time to write firmware to the device. This happens automatically if the device passes electrical testing. It costs a few seconds to a minute, depending on the design of the device.

Often, a small bootstrap firmware image is all that is written. Test jig time is expensive and it’s slow to write large amounts of data. It’s also difficult to change the firmware image after the manufacturing line is set up. From a project management point of view, a late firmware project doesn’t delay setup of the manufacturing line. So a minimal bootstrap image is written through the jig, and final production firmware (including any updates) is written at a later stage.

This is the most common way to get initial firmware onto a device.

Pre-programmed chips

Most devices use blank off-the-shelf chips and write their own image to them. Sometimes, it’s easier to program the chips before assembly onto the PCB. This might be done as an in-house manufacturing step using a dedicated device programmer, or it might be done by a third party (usually the chip manufacturer). The same guidelines as above apply:

the image is difficult to change, so you either write a bootstrap image or be very confident that you’re not going to change the image later
obviously, there’s no permissible variation between the images
incorrectly written firmware can write off an entire assembly, so your cost will increase slightly (assembly writeoffs, plus your chip is going to cost a little more)
you have to test the assembly anyway, so there’s not a big argument to be made for reducing the number of test points on the PCB

This is mostly used when you have a stable, high-volume product.

Built-in CPU support

Many modern CPUs, SoCs and microcontrollers have bootloaders built in. Some of these are remarkably sophisticated, able to interface with MicroSD storage, parse FAT filesystems or communicate over network interfaces.

If you’re custom-building your SoC (say, for hidden crypto/trust modules) then you’re in a good position to bake in a bootloader that works for your hardware platform and thus save some manufacturing time.

You might not need to write firmware to the device at all until just before it’s shipped.

All devices are the same

Manufacturing processes are driven by two concerns:

Minimise cost. Cost is driven by manufacturing time, equipment cost, floor space and human interaction.
Minimise variability. A consistent, predictable process can be optimised for lower cost.

At the end of the manufacturing process, every single device is identical. Any variation – except in very specific, controlled ways, such as a unique ID chip – is considered waste and must be eliminated.

Commentary on the Sony IPELA IP Camera backdoor

ian@mutexlabs.com (Ian Howson) — Wed, 07 Dec 2016 00:00:00 +0000

It turns out that a range of Sony IP cameras had a hidden telnet/SSH server: http://blog.sec-consult.com/2016/12/backdoor-in-sony-ipela-engine-ip-cameras.html?m=1

What’s good about the design?

The servers weren’t wide open to the world. Getting access required:

a firmware dump (easy; available on the public Internet)
analysis of the dump (usually hard, but reasonably automated in this case)
reversing password hashes
- Difficulty depends on the password, but unfortunately one of them was ‘admin’. The other is as-yet unknown, so… hard?
disassembly of the firmware to figure out how to access the servers (hard)

So while this looks bad on the surface, this hack actually required a lot of effort. While obviously critically flawed, in the ecosystem of IoT devices, this one is better than most.

Sony responded appropriately and released an update for the cameras.

What’s bad about the design?

The servers weren’t disabled before the cameras were shipped out. This, in my mind, is the critical problem. The manufacturing line needs to have privileged access to the device; this is where firmware gets uploaded, hardware gets calibrated and the device is tested. The servers need to be present. They must not be enabled after device shipment.
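One way to catch this is an end-of-line check that refuses to pass a device while its debug servers still answer. The sketch below is an assumption about how such a check might look, not a description of Sony’s process; `ready_to_ship` and the port list are hypothetical.

```python
import socket

DEBUG_PORTS = (22, 23)  # SSH and telnet


def open_ports(host: str, ports, timeout: float = 0.5) -> list:
    """Return the subset of `ports` that accept a TCP connection on `host`."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success and an error code otherwise.
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
    return found


def ready_to_ship(host: str, ports=DEBUG_PORTS) -> bool:
    """A device passes this (hypothetical) final check only if no debug port answers."""
    return not open_ports(host, ports)
```

A port scan only proves that the servers aren’t listening right now. It says nothing about a hidden trigger that re-enables them later, so it complements a firmware review rather than replacing it.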

I can understand each device having the same passwords. This is a manufacturing convenience which saves money and time. Every device gets the same passwords, has the same public keys and the same binary firmware image. If you’re coming from a desktop/mobile security perspective this is problematic, but in the IoT space, sorry, cost concerns override the impurity of having a million devices with the same keys.

Some devices – especially Internet routers – will assign a different password either at manufacture time or based off a unique device ID embedded in the hardware. The cameras certainly have a unique MAC address, so the hardware is present.
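Deriving a per-device password from the MAC address might look like the sketch below. The `factory_secret` and the function name are hypothetical; the assumption is that the secret lives only on the manufacturing line and is never embedded in the firmware image.

```python
import base64
import hashlib
import hmac


def device_password(factory_secret: bytes, mac_address: str, length: int = 12) -> str:
    """Derive a per-device password from a line-wide secret and the device's MAC address."""
    normalised = mac_address.lower().replace("-", ":").encode("ascii")
    digest = hmac.new(factory_secret, normalised, hashlib.sha256).digest()
    # Base32 keeps the result easy to print on a label and type by hand.
    return base64.b32encode(digest).decode("ascii")[:length].lower()
```

If the factory secret ever ships inside the firmware, an attacker with a firmware dump can compute every device’s password from its MAC address, and you’re back to one shared secret.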

What did we learn?

If you have hidden secrets on your device, they will be discovered given enough time.

Once any secrets are out, they’re out for a whole class of devices. All cameras with this firmware are now vulnerable. One key mitigation for this type of attack is that each device (singular) gets different keys and passwords; at least then a breach of one device only affects that device.

Speculation

I don’t think this was intentional (in the sense of “hey let’s run telnet and SSH nobody will notice”). Certainly Sony have enough smart engineers to consider the security ramifications of onboard web, telnet and SSH servers; it’s even likely that they have a threat model and risk analysis. They also have enough history manufacturing this sort of device that they know that manufacturing wants certain access to the device. And this same mistake hasn’t been found in other Sony devices to date.

My bet is that the firmware engineers have handed this off to manufacturing saying, “hey, we enable telnet and SSH so you can test and calibrate on the line. Make sure you turn it off.” And manufacturing, being a totally different sort of engineer, have written their scripts that run over SSH, set up the production lines and forgotten the warning. Or maybe they left it on not understanding the security impact; disabling the servers makes their life difficult if they want to retest or service a device. It’s an easy mistake to make in a big company.

What’s the impact?

Usual vulnerability disclosure ethics require that you give the vendor some time to correct the vulnerability before publishing it. This has been done here; all credit to the SEC Consult team. But I can’t help but feel that this is bad policy for IoT devices. Sony have produced new firmware, distributed it, and yet… the vast majority of cameras in the wild will not get the update. They will be vulnerable and had someone not gone looking (using the specialist knowledge and tools above) it’s unlikely that the vulnerability would have been discovered. Certainly, there are more profitable places for miscreants to search for vulnerabilities on their own.

So while I advocate openness and disclosure, I think the usual disclosure policy might need some adjustment for devices which can’t be easily updated.

As a result, any Internet-connected devices using this firmware will now be easily harvested for botnets.

What's the least I can know?

ian@mutexlabs.com (Ian Howson) — Tue, 29 Nov 2016 00:00:00 +0000

IoT devices have security issues because they’re built to be as cheap as possible. The hardware required to provide adequate security is expensive, large, and consumes a lot of power. IoT devices stay ‘in the field’ for a long time and the business model of most vendors does not incentivise them to produce security updates.

Produce a threat model and a risk analysis before you do anything else. Most devices do not need strong security.

Figure out what sort of hardware you have. This will dictate what security controls are available to you. The vast majority of IoT devices in the field are not capable of strong security.

The pinnacle of IoT security is the modern iPhone. By learning about its security measures, you will learn a lot about what is required to produce a secure IoT device. It’s difficult and expensive.

If your device controls something of value, you should assume that your device will be compromised. Plan accordingly. Design the device to minimise the impact of a breach.

What is IoT?

ian@mutexlabs.com (Ian Howson) — Sat, 26 Nov 2016 00:00:00 +0000

The Internet of Things blah blah revolutionise fifty billions blah blah everything connects to the Internet blah change your life.

There’s a simple, boring, accurate definition: Internet of Things is a new name for ‘embedded systems’. There are already billions of invisible networked devices controlling parts of your life, and they’ve been running for decades. The major change is that more of them are connecting to the Internet.

OK, what’s an embedded system?

A computer that is part of a device. Perhaps an electronic device that contains a computer.

Where a general-purpose computer can be adapted by the user to suit many applications, an embedded system is pre-programmed for a single, specialised purpose. Usually you will buy the device hardware and software as a single unit for a single purpose.

You said that these have been around for decades. How is that possible?

Here are a few examples. You will own many of them.

Internet routers
HVAC controls
Car ECUs
Fitbit
Electronic children’s toys
Battery chargers
DVD players
Hearing aids
That box on the street that connects your house to the fibre network
Electronic door locks

These are all embedded systems. They all contain a programmable computer. Most of them connect to networks, some of them to the Internet.

Every single one of these has a slew of security vulnerabilities. They were all designed to be cheap to manufacture. Practically none of them receive security updates.

Some IoT/embedded devices that give particular attention to security features are:

DVD players (copy protection of discs)
Game consoles (copy protection of discs)
Pay TV/cable boxes (ensuring that customers have paid for their service)
Smartphones (sandboxed execution of downloaded apps)

We’ll come back to these, as they provide great examples of the cost tradeoffs that we need to make to achieve good security.

Blockchains on IoT devices

ian@mutexlabs.com (Ian Howson) — Fri, 25 Nov 2016 00:00:00 +0000

Assuming you have an application that warrants building a blockchain and, further, that it needs to run on an IoT device, there are a few major implications for your device’s design and cost.

1. You need the full suite of cryptographic capabilities

The smallest device that you’re reasonably going to fit a blockchain application into will:

Be capable of running Linux
Have flexible storage (i.e. a filesystem with integrity guarantees)
Have decent Internet connectivity

This excludes most of the cheap/low power hardware platforms and sets a minimum hardware cost starting at around $15.

You could squeeze things harder – Blockchain crypto only requires kilobytes of RAM – but in 2016, your effort needs to be in getting the blockchain side of things right, not in trying to reimplement everything to save RAM.

2. You need flexible storage

You’re going to store a lot of data and update it regularly. A filesystem isn’t a strict requirement, but it’s going to make your life a lot easier.

Most devices have unreliable power and so your life will be a lot easier if you use a filesystem with atomicity and integrity guarantees. Better to re-download and re-verify the last few blocks than the whole chain.
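If your filesystem gives you atomic renames, the usual pattern is to write to a temporary file, flush it to storage, then rename it over the original. This is a minimal sketch of that pattern, assuming a POSIX-style filesystem; `atomic_write` is a hypothetical helper name.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so a reader sees either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # push the data to storage before renaming
        os.replace(tmp_path, path)   # atomic rename on POSIX filesystems
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If power fails mid-write, the worst case is a stray temporary file; the previous copy of the data is still intact.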

3. You need robust updates

Blockchain is new and largely untested
We’re regularly finding vulnerabilities in old, well-reviewed codebases
Attackers will have a good reason to attack your system

So you will need to be able to roll out updates to your devices quickly and securely.

How you do this in a decentralised manner is an interesting problem. Again, if you’re centralising updates, you might as well centralise the whole database. If you don’t centralise updates, who provides updates, and how do you trust them?

4. Think about what happens as your blockchain grows

Blockchains only grow in length. As transactions are added, the chain gets longer, without bound.

This length is burdensome for Bitcoin right now – we’re at about 80GB, which is about $40 worth of flash memory.

If your device is going to run for say, five years, you need to provide enough storage to last the life of the device, not just the current size of the blockchain.
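As a rough sizing sketch: the 80GB and $40 figures come from above, while the growth rate of 40GB per year and the assumption of linear growth are mine, purely for illustration.

```python
def storage_needed_gb(current_gb: float, growth_gb_per_year: float, years: float) -> float:
    """Storage required at end of life, assuming the chain grows linearly."""
    return current_gb + growth_gb_per_year * years


def flash_cost_usd(size_gb: float, usd_per_gb: float = 40 / 80) -> float:
    """Flash cost at the per-GB price implied by '80GB is about $40'."""
    return size_gb * usd_per_gb


# Assumed growth rate of 40 GB/year over a five-year device life.
needed = storage_needed_gb(current_gb=80, growth_gb_per_year=40, years=5)
cost = flash_cost_usd(needed)
```

With those assumptions you’d need 280GB, or about $140 of flash at today’s price, which is far more than the rest of the device costs.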

To address this, many Bitcoin clients only track recent transactions – say, the most recent gigabyte. This reduces the size of the storage required, and importantly bounds it to something predictable. It has two problems:

You don’t have the whole state of the system available. If you need data which is not in your ‘recent transactions’ list, you need to retrieve it from somewhere. If you’re doing a property ownership blockchain, for instance, the last transaction on a property in question may have been 25 years ago.
You need to have a trusted third party who stores the whole transaction history. The difficulty of finding a ‘trusted third party’ is much of why blockchains are interesting right now!

Both of these problems can be solved (hashes over un-stored parts of the blockchain, DHT/torrents for retrieval), but impose further restrictions on your application. We don’t have concrete, reliable solutions to them right now.
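The ‘hashes over un-stored parts’ idea can be sketched as a chain that keeps only the newest blocks plus a single checkpoint hash covering everything it has dropped. This is a toy model under my own assumptions (the `PrunedChain` class and its block format are hypothetical), not how any particular Bitcoin client does it.

```python
import hashlib
from collections import deque


def block_hash(prev_hash: bytes, payload: bytes) -> bytes:
    """Each block's hash covers its payload and the previous block's hash."""
    return hashlib.sha256(prev_hash + payload).digest()


class PrunedChain:
    """Keeps only the newest `max_blocks` blocks plus a hash covering everything dropped."""

    def __init__(self, max_blocks: int):
        self.max_blocks = max_blocks
        self.checkpoint = b"\x00" * 32   # hash of the last pruned block (genesis placeholder)
        self.blocks = deque()            # (payload, hash) pairs, oldest first
        self.tip = self.checkpoint

    def append(self, payload: bytes) -> None:
        self.tip = block_hash(self.tip, payload)
        self.blocks.append((payload, self.tip))
        while len(self.blocks) > self.max_blocks:
            _, dropped_hash = self.blocks.popleft()
            self.checkpoint = dropped_hash

    def verify(self) -> bool:
        """Recompute the stored blocks from the checkpoint and compare with the tip."""
        current = self.checkpoint
        for payload, stored_hash in self.blocks:
            current = block_hash(current, payload)
            if current != stored_hash:
                return False
        return current == self.tip
```

Storage is bounded by `max_blocks`, and the device can still check that what it holds is consistent. It can’t answer questions about pruned blocks, though, which is exactly the first problem listed above.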

If you’re considering running servers to store more transaction history, the same problems as with IoT updates exist: your company may not exist in a few years and so your blockchain will stop working. Of course, this defeats the whole point of a blockchain!

5. Consider battery life

Devices running the above stack need moderate amounts of power, as far as IoT devices go. At minimum, you’ll need to be fixed to an external power source or have a Li-ion battery and charger. You can’t run a Linux machine on a primary coin cell for any length of time.

Most modern devices achieve good battery life by turning off the CPU as much as possible. Blockchains, by their design, require a decent amount of computation just to track the active state of the chain.

6. Mining on IoT devices?

You could, if you really wanted to, but your device will then consume so much power that it will need to be plugged into a wall socket. If you (as a miner) wanted a disproportionate amount of mining power, you could run the same software on a desktop machine (or GPU, or ASIC, or whatever). Using proof-of-work mining on battery-powered devices (e.g. cell phones) is a bad idea.

If you can find a proof-of-stake algorithm that you trust (in 2016, still an open problem) then that would probably be feasible to run on a battery-powered device.

7. Reliable Internet connectivity

Lots of the applications for IoT blockchains require that the device be occasionally offline. This is fine if you’re reading data out of the blockchain, but:

you might miss out on recent transactions
you can’t add transactions to the blockchain without being online

8. So we just use cellphones, then?

Pretty much. A modern cellphone with a good Internet connection ticks all of the boxes. Thanks to economies of scale, you can’t build custom hardware cheaper.

Hardware classes of embedded/IoT devices

ian@mutexlabs.com (Ian Howson) — Wed, 23 Nov 2016 00:00:00 +0000

Every IoT device has a different hardware design. Each has different capabilities and makes different security tradeoffs. It’s helpful to describe broad ‘hardware classes’ or ‘technology levels’ of devices.

These classes let us classify the capabilities and cost of devices and help us to understand why particular tradeoffs are made in a device’s design.

Huge

These are 32 or 64-bit CPUs, usually ARM, with an MMU and external RAM. They can run Linux. Implicit in this is that they can also run OpenSSL and perform cryptographic operations (particularly public key operations) quickly.

Power envelope is the largest of anything discussed here, with minimum continuous power draw typically in the 10mW range and very large peak power consumption (watts). Raspberry Pi and BeagleBone are common development boards that fall into this class.

These are easy to develop for – you can use an off-the-shelf Linux distribution in most cases. Your chip vendor will probably supply one for you.

Because you’ve got lots of RAM and can afford dynamic allocation, you can use higher-level languages than C or C++. Your development will be faster and cheaper if you use something like Python or Go for non-real-time parts of the application.

The hardware requires a CPU, external RAM (sometimes on-package, but always separate die), external flash and relatively complex power supplies. The increase in COGS is typically $15 or more, which translates to a $30-$100 difference in the final retail price of the device.

Running Linux?

Linux is a big, heavy operating system for an embedded system. It has longer boot times, unpredictable real-time behaviour and a large, complex software stack that is difficult to reason about. You gain fast firmware development, easy access to third-party software and flexible filesystems.

Linux can take a long time to boot – often longer than users will tolerate. Typical boot time is 30-60 seconds. Sub-20 seconds is achievable without much work. There are research efforts to bring this down to under five seconds. Five seconds is still too long for many applications (control systems – medical, industrial, drones) but they will usually use separate microcontrollers for the time-critical functions.

The possibility that a Linux system might not boot reliably or in a predictable amount of time (e.g. failed fsck, down network) can be enough to eliminate it from many applications. Without a UI or human to power-cycle it, you may not be able to bring it up again.

Lots of Linux systems ship with an extra microcontroller (like IPMI) that can reset the machine if it doesn’t respond in a fixed period of time. Often this micro will need to talk over the network – with all of the associated security risks. Now you have two embedded systems to worry about!
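The logic of that supervising microcontroller is a watchdog timer. Here’s a small software model of it; the class name and the timings are hypothetical, and a real one runs on separate hardware so that it survives a hung host.

```python
class Watchdog:
    """Software model of a watchdog timer: reset the host unless it checks in on time."""

    def __init__(self, timeout_s: float, now: float = 0.0):
        self.timeout_s = timeout_s
        self.last_kick = now

    def kick(self, now: float) -> None:
        """Called by the healthy host to say 'still alive'."""
        self.last_kick = now

    def should_reset(self, now: float) -> bool:
        """True once the host has been silent for longer than the timeout."""
        return (now - self.last_kick) > self.timeout_s
```

For example, with `Watchdog(timeout_s=30)` and a last check-in at t=10, the host is left alone at t=35 and reset at t=41.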

Embedded Linux devices are vulnerable to all of the same problems that a Linux machine on the Internet has. You’ll need to ship regular security updates, which will probably require a longish period of downtime to apply.

Large

You can get the same CPU power as a Huge device (32-bit ARM clocked at anything you like) but bundled with on-chip RAM and flash. This is a great compromise for many devices. COGS is significantly lower due to the integrated storage and the device will have more built-in peripherals.

These can have megabytes of flash and RAM or as little as kilobytes. Power consumption can be very low, but you must go smaller for the lowest-power devices (microwatt and lower).

You can’t (as of 2016) run Linux on these as they do not have enough RAM or an MMU. You need to select an RTOS or uClinux. You can’t just pull software components from the Internet; you must consider how they will be integrated into your firmware. As a result, firmware development time will be longer.

You might be able to run a higher-level language on here (perhaps Go or JavaScript), but this is rare in practice. Even though the hardware is capable, most developers elect to use C or C++. You do have the luxury of dynamic memory allocation, should you choose to use it.

As you’re using an RTOS from a third-party vendor, you probably have all of the vulnerabilities that a Linux system does, but without the intense scrutiny that Linux has. In other words, you’re just as vulnerable, but you don’t know it.

These devices are capable of the whole range of cryptographic operations, but because you’re using a custom software stack, you can’t just drop in OpenSSL. You need to be very careful to ensure that any cryptographic libraries you include are correct, secure, and legal for you to bundle with your product (in the export controls sense). Supporting the full range of certificate operations (expirations, revocations, updates) requires flexibility in your use of flash, and that is challenging as you don’t have a general-purpose filesystem to rely on.

Increase in COGS is $1-$5.

Sometimes these are embedded into other SoCs, such as WiFi modules.

Medium

16-bit with integrated RAM and flash. There is no dominant architecture at this time. You’ve probably got 16-1024KB of onboard RAM. Clock rates range from 4 to 80MHz.

You can use C++ but the storage and runtime overheads may be burdensome. You might be restricted to plain C. You could use dynamic memory allocation, but you probably don’t have enough RAM to do so safely.

To save power there are provisions for switching the CPU and peripherals off. You usually won’t see sub-1MHz clock rates.

Public key crypto is doable but takes significant development effort. The CPU time required might be noticeable by the user. You should consider using an external cryptoprocessor.

These are often embedded in BLE and Bluetooth chipsets.

Small

8-bit CPUs e.g. AVR8, 8-bit PIC, 8051.

Often Harvard architecture (split instruction/data memories). Often the instruction memory is mapped directly to flash. This is interesting for security as buffer overflows will trash data RAM but not affect instructions.

These are often embedded in smaller RF chips such as the Nordic nRF series.

They can have tiny power consumption if programmed appropriately; they can run for years on a coin cell.

You technically can run C++ code on these, but there’s no point. You can’t fit much code on them in the first place, C++ gets poor compiled code density, and you’d only use C++ if you have complex software anyway. So just stick with C.

Because quiescent power consumption is so low, some applications stop thinking in terms of continuous power consumption (e.g. 10µA continuous) and start thinking in terms of number of power-consuming operations (e.g. 400 door strike activations on a single primary cell). You don’t use rechargeable cells in these applications because self-discharge is too high. This saves further on BOM cost.
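Both ways of budgeting are simple arithmetic. In the sketch below, the 200mAh cell capacity and the 0.5mAh charge per door strike are assumed figures chosen for illustration; check your cell’s datasheet and measure your own load.

```python
HOURS_PER_YEAR = 24 * 365


def runtime_years(capacity_mah: float, continuous_ua: float) -> float:
    """Years of runtime for a constant current draw, ignoring self-discharge."""
    hours = capacity_mah * 1000.0 / continuous_ua
    return hours / HOURS_PER_YEAR


def operations_per_cell(capacity_mah: float, charge_per_op_mah: float) -> int:
    """Number of operations a cell supports if each one uses a fixed charge."""
    return int(capacity_mah // charge_per_op_mah)


# Assumed figures: a 200 mAh coin cell and 0.5 mAh per door strike activation.
years = runtime_years(capacity_mah=200, continuous_ua=10)
strikes = operations_per_cell(capacity_mah=200, charge_per_op_mah=0.5)
```

With those numbers, a 10µA continuous draw lasts a bit over two years, while the same cell spent entirely on door strikes gives 400 activations. A real budget has to cover both.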

Symmetric crypto is usually OK on these, but key setup time can be noticeable. Public key crypto is generally impossible as they don’t have enough RAM to hold the key. Of course, any cryptography is going to run the CPU hard for a long time, and this is going to hurt your battery life.

If you’re using coin cells to run your device, the peak power consumption can pull their voltage low enough that the CPU will brown out (or worse, latch up). This is problematic for boot-time firmware verification; a device will be fine if you leave it on, but if you turn it off it won’t be able to start again.

Tiny

8-bit, under 128 bytes of static RAM and maybe a kilobyte of flash – tinyAVR, for example.

There’s probably not enough RAM to store a symmetric key, so any crypto is challenging here. Some algorithms have smaller working set sizes, but you’re in desperate territory. The vast majority of designs that need cryptography will select a large CPU or have a dedicated cryptoprocessor to do the heavy lifting.

Typical clock rates are 128kHz or 4-16MHz. If you’re in the MHz range and have enough RAM, you can run some symmetric algorithms. kHz-range designs will incur noticeable delays.

You can use C. You might use assembler if you’re counting pennies.

CPU clock rates are meaningless now

ian@mutexlabs.com (Ian Howson) — Tue, 15 Nov 2016 00:00:00 +0000

History

Way back when, you bought a CPU, and it had a marked clock rate. You ran it at that rate. The end.

Later you got a turbo button, but that was more for application compatibility. We were spinning for a fixed number of cycles to mark time!

Around the Athlon time, we started throttling CPUs. It turned out that they could be damaged if run too hot for too long, and laptops were having trouble getting the heat out. So Intel (and later AMD) parts started to slow themselves down if they got to a dangerous temperature.

Turbo Boost

Later still – within the last five years – we got ‘Turbo Boost’. Originally, this was to reflect that the CPU could run faster for a very brief time, but eventually we would be unable to remove the heat fast enough and the CPU could reach dangerous temperatures again. In some ways, this reflected the thermal mass of the CPU, its heatspreader and the immediate heatsink. Heatpipes were now in common use, and while they could remove a lot of heat from a small area, they couldn’t change the rate of heat conductance rapidly. While desktops were usually designed to remove all of the heat that the CPU could produce at maximum power, laptops couldn’t afford this – the space and weight required was just too great.

Recently, “a brief time” has become “a really long time”. My wife’s Macbook Air, for instance, runs at a ‘base clock’ of 1.6GHz. If you watch the actual CPU speed, however, it never runs at 1.6GHz. If it’s idle, it will run at less than 1GHz (and it’ll actually be asleep for much of that). If you work it hard, it’ll increase to 2.4GHz and stay there for as long as the workload lasts. So there’s no thermal mass effect here – it’s just 1GHz/sleeping for low load, 2.4GHz for high load, and somewhere in the middle for a mixed load.

Under high load, the clock rate is determined by the cooling capacity of the laptop. But – importantly! – there are no circumstances under which the laptop will ‘prefer’ to run at its rated ‘base clock’ of 1.6GHz. There’s no point. The CPU can adjust its clock rate anywhere from about 1GHz to 2.4GHz in fine-grained steps, and it chooses the exact clock rate that it needs to balance performance and energy efficiency.

So what is the base clock?

Intel Ark has this to say about “Processor Base Frequency”:

The processor base frequency is the operating point where TDP is defined.

Nowhere does it say “this is the preferred frequency” or “this is the maximum” or “this is the most efficient point”. It’s just where the processor runs at its TDP. The TDP is chosen by Intel! The exact same CPU can be sold at two different TDPs, at two different clock rates, to two different markets (e.g. laptop and desktop).

TDP is ‘thermal design power’ – typically between 5W and 150W for modern Intel chips. Importantly, though, it’s an arbitrarily chosen number. For a laptop part, TDP is chosen to be smaller – say, 15W. For a desktop or server part, TDP is larger – 35-135W. TDP is important for manufacturers because it dictates how big a cooling solution is needed. If they have to move a ‘nominal’ 15W from a laptop CPU instead of 135W for a many-core server CPU, they can use a smaller and lighter cooler.

Higher clock speeds and core counts require higher output power. TDP is arbitrarily selected to suit the end-user, but it doesn’t imply that the CPU is more or less capable than another. We know that our ‘1.6GHz’ CPU can run over 2.4GHz! It’s just that at the TDP, this is how fast we can run in steady state. The same CPU could run faster forever if you have a big enough cooler!

So, ‘base clock’ is a pointless figure now. Intel and the machine manufacturers publish it, but it’s more like “under these circumstances (workload, ambient temperature and heatsink efficiency), we can run this CPU at this clock rate indefinitely”.

Cooling

The computer manufacturer thus has a big impact in how fast the CPU will run, because they design the cooling system. A too-small cooling system (e.g. Macbook Air 11” or 2015 Macbook) will constrain CPU performance simply because under load, the CPU will heat up and the clock speed will need to be reduced. A too-small cooling system is great for the manufacturer (less weight and volume leads to a smaller, lighter laptop) but you’re trading off CPU performance. Cooling efficiency is never reported!

For CPUs in the same series and with the same nominal TDP, there might be advantages to the faster ones. They’re sold as faster for the same rated TDP, and conversely they might run slightly cooler at the same clock rate. Given that the difference in clock rate is usually tiny (10%) and the price difference can be huge (hundreds of dollars) there’s rarely any point in buying the faster parts.

All of this is wrapped up in the GHz figure – the one the consumer looks at – but it’s no guarantee that performance is actually better. A laptop with a high clock rate, high TDP CPU might perform worse than one with a lower clock rate if the cooling is inadequate.

Case study: the 2016 Macbook Pros

There’s an interesting comparison to be made between the 2016 Macbook Pros. The ‘Escape Edition’ has a 2.0GHz CPU, while the ‘Touch Bar’ model has a 2.9GHz CPU. On the outside, the machines look identical (except for the Touch Bar). Inside, the differences are tremendous. It’s a completely different design. Notably, the Escape Edition has a single CPU fan, while Touch Bar has two fans and bigger heatsinks.

The Escape Edition’s CPU is an i5-6360U, while the Touch Bar’s is an i5-6267U. Other than the TDP and Base Frequency, the parts are identical!

At the time of writing, the Geekbench single-core benchmarks show:

Escape Edition (2.0GHz): 3608
Touch Bar (2.9GHz): 3769

The Touch Bar model has a 45% faster base clock. We’re testing a CPU-bound workload. We would expect it to get close to a 45% increase in performance. In reality, it only gets a 4.5% increase. The clock rate does not tell the full story!

The Escape Edition CPU has a maximum Turbo speed of 3.1GHz, while the Touch Bar CPU has a maximum Turbo speed of 3.3GHz – a 6.5% increase. This more closely explains the difference in benchmark results!
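The percentages above are easy to reproduce from the published clock rates and the quoted Geekbench scores. `percent_change` is just a helper for this sketch.

```python
def percent_change(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


base_clock_gain = percent_change(2.9, 2.0)   # Touch Bar vs Escape Edition base clock (GHz)
score_gain = percent_change(3769, 3608)      # Geekbench single-core scores
turbo_gain = percent_change(3.3, 3.1)        # maximum Turbo speed (GHz)

print(f"base clock: +{base_clock_gain:.1f}%")
print(f"score:      +{score_gain:.1f}%")
print(f"turbo:      +{turbo_gain:.1f}%")
```

The score gain sits much closer to the Turbo gain than to the base clock gain, which is the point: the base clock is the wrong number to compare.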

Better yet, the 1.2GHz Macbook scores 3003. That’s 80% of the performance of the Touch Bar model with 41% of the base clock rate.

How to enable the oplog on Ubuntu MongoDB for Meteor

ian@mutexlabs.com (Ian Howson) — Mon, 15 Aug 2016 00:00:00 +0000

If you install MongoDB from Ubuntu 14.04 LTS, there are a few steps that you need to take to enable the oplog for use with Meteor.

To enable the oplog, we need to enable replication. We’re not actually going to replicate to any other servers.

1. Enable oplog

In /etc/mongodb.conf, add

replSet = rs0

Restart mongodb:

service mongodb restart

As root, run mongo to get a shell. Run:

rs.initiate({_id:"rs0", members: [{"_id":1, "host":"127.0.0.1:27017"}]})

You should see something like:

{
        "info" : "Config now saved locally.  Should come online in about a minute.",
        "ok" : 1
}

You can run rs.conf() and rs.status() for more information.

2. Add a user to access the oplog

Switch to the admin database with:

rs0:PRIMARY> use admin

then create the user:

rs0:PRIMARY> db.addUser({user: "oplogger", pwd: "password", roles: [], otherDBRoles: {local: ["read"]}})

3. Tweak your Meteor config to use the oplog

In your Meteor environment settings, add:

MONGO_OPLOG_URL=mongodb://oplogger:password@127.0.0.1/local?authSource=admin

If you’re using my setup with Meteor running in Docker containers on AWS machines, you need to use the host IP like so:

MONGO_OPLOG_URL=mongodb://oplogger:password@172.17.0.1/local?authSource=admin

Note that you must use the local database, not whatever your application is configured for. Also note that this has security implications if you intend to run separate applications on the same database server.
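
If your password contains characters such as ‘@’ or ‘:’, the URL needs percent-encoding. Here is a minimal sketch of building the value; `buildOplogUrl` is a hypothetical helper of my own, not part of Meteor or the MongoDB driver.

```javascript
// Build a MONGO_OPLOG_URL value. `buildOplogUrl` is a hypothetical helper.
function buildOplogUrl({ user, password, host, authSource = "admin" }) {
  // Percent-encode the credentials so '@' or ':' cannot break the URL.
  const credentials = `${encodeURIComponent(user)}:${encodeURIComponent(password)}`;
  // The oplog lives in the `local` database, whatever database the app uses.
  return `mongodb://${credentials}@${host}/local?authSource=${encodeURIComponent(authSource)}`;
}

const oplogUrl = buildOplogUrl({
  user: "oplogger",
  password: "password",
  host: "172.17.0.1",
});
console.log(`MONGO_OPLOG_URL=${oplogUrl}`);
```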

4. Restart your Meteor application

If it comes up without errors, that’s a really good sign!

5. Confirm that the oplog is being used

There’s some advice at https://github.com/meteor/docs/blob/version-NEXT/long-form/oplog-observe-driver.md but it’s pretty old and requires you to change your application. To be continued!

Etymotic ER4XR review

ian@mutexlabs.com (Ian Howson) — Tue, 05 Jul 2016 00:00:00 +0000

How do they sound?

Pretty much as you’d expect ER-4s to sound. Flat, with very well defined treble. Compared with the original ER-4P, the ER4XR has much more ‘present’ bass. It’s not overwhelming or boomy. It’s just there. You don’t have to search for it like with the ER-4P.

I got the ER4XRs to replace some broken UM3xs. The ER4XR seemed to have very harsh treble initially, but this settled down after a few minutes of listening. This isn’t surprising as the UM3x has fairly muted treble. It’s probably a psychological effect.

The UM3x has much more powerful bass than the ER4XR. Whether it’s better would be a matter of personal taste. I did enjoy the bass of the UM3x, but I also enjoy the treble of the ER4XR.

The ER4XR bass can surprise on occasion. The bassline of Red Hot Chili Peppers’ ‘The Getaway’ gave some unexpected thrills. On the other hand, I found the drums in Tool’s Forty Six & 2 to be a little underwhelming; they’re pretty fantastic on the UM3x.

iPhone has a ‘Bass Booster’ EQ option. It’s a little too much, but it helps some tracks. Forty Six & 2 goes back to punching me right in the eardrums without losing too much midrange. On my Mac, I boost the bass with AU Lab and they respond extremely well.

Input level

No complaints. About ¹⁄₃ on my iPhone is comfortable for regular listening. The UM3x was extremely sensitive, and this was a nuisance.

Isolation

I fly a lot and have noisy children, so isolation is very important to me, even more than sound quality.

The ER4XR has the same great isolation that you’d expect from the ER-4, of course. It’s far superior to what you get with the UM3x. I could hold a conversation with my UM3xs inserted; that’s not possible with the ER4XR.

It’s also easier to get good insertion depth with the ER4XR. I found that the body of the UM3x got in the way.

I use foam pads, and they fit securely on the barrel of the earphones. I feel comfortable inserting the pads right inside my ear canal. With the UM3x, pads would sometimes slip off the barrel and get lodged in my ear canal. (I eventually painted some nail polish around the barrel to thicken it, which helped a lot.)

Microphonics and cabling

The top segment of the cable (beyond the splitter) will induce a lot of noise. If I pull the splitter tight up under my chin that eliminates most of it, but that looks silly and gets in the way.

What does work well for me is running the cable over my ears, behind my neck, then over my left shoulder and clipping it to my shirt. I get no microphonics and it’s out of the way.

I do miss the over-the-ear cable from the UM3x. It was also possible to lie on my side with those; it’s impossible with the ER4XR.

One major plus to having a straight (not over-the-ear) design is that there’s a lot less tangling of the earphones themselves.

The cable is nice and long. I had an aftermarket cable for UM3x which hardly tangled at all, but the ER4XR one is pretty good. It’s not braided all the way, which helps.

After about two years the cable has become damaged at the earphone strain relief points. This is a bit disappointing, especially as the cable is expensive to replace (USD50/EUR50). I was able to repair and reinforce my old cable, color-coding the left and right in the process. I tried some aftermarket MMCX cables but none work as the ER4XR’s MMCX connector is recessed into the earphones.

The price

I paid AUD$200 for my old (used) ER-4Ps; I paid probably AUD$450 for the UM3x. The ER4XRs were AUD$539 landed due to the exchange rate and UPS international shipping.

This sounds like a lot, but:

Historically, I get about five years out of a set of IEMs
Etymotic are relatively popular and so there’s a decent market for them used. If I choose to upgrade I’ll get some cash back.
Etymotic use their own pads, and they’re cheap to replace. This actually matters! Comply pads are about $5/pair, Etymotic are $1, and I figure a pair lasts a month, so that’s 60x$4=$240 saved over the life of the earphones.

The pads

Comply pads don’t last very long. They’re quite soft and tear easily under normal use. I can wash a pair once (just soak in boiling water) but they don’t survive a second wash.

Etymotic pads are a bit rougher, but once inserted there’s no comfort difference.

Comply do offer multiple colours. I use the audiologist convention of blue pads for the left ear and red for the right. The ER4XR markings are difficult to see and you can’t get different colours.

Apparently soaking the pads in hydrogen peroxide will remove the earwax without drying out the pads. I haven’t tried it yet.

Ergonomics

The ER4XR plug tip is narrow, so it’ll fit your iPhone while it’s in the case.

The cable clip is too loose – it slips off and gets lost. I wrap tape around the clip to stop it coming apart.

Conclusion

These are great earphones. There’s very little that I could suggest to improve them.

Turbo Boost and MPI

ian@mutexlabs.com (Ian Howson) — Wed, 04 May 2016 00:00:00 +0000

There’s this attitude when optimising that if you’re not maxing out all of your processing resources, you’re wasting them.

Utilisation is a good guideline, but it’s missing the wood for the trees. You actually want your task to run faster! Using more resources doesn’t guarantee that your job will run faster. If you have idle resources, you will usually get gains by using them, but it’s not a guarantee.

MPI programs are often written in the form (A):

across all nodes:
    do the same task

rather than (B):

do a task on node 0
broadcast the result to all nodes

Form A is ridiculously wasteful of hardware resources – it uses N times as many CPU cycles as form B – but is often slightly faster. Why? In form B, the total time is <time to run your task> plus <time to broadcast>. In form A, under the assumption that CPU cores are independent, the time is <time to run your task>. You’ve saved <time to broadcast>, which matters if it’s a measurable percentage of <time to run your task>.
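
To make the trade-off concrete, here is a sketch of that idealised cost model. The function names and the numbers (16 nodes, a task of 100 time units, a broadcast of 5) are illustrative assumptions, and the model assumes independent, identical cores.

```javascript
// Form A: every node runs the task. Wall time is one task; CPU time is N tasks.
function formACost(nodes, taskTime) {
  return { wallTime: taskTime, cpuTime: nodes * taskTime };
}

// Form B: node 0 runs the task once, then broadcasts the result.
function formBCost(taskTime, broadcastTime) {
  return { wallTime: taskTime + broadcastTime, cpuTime: taskTime };
}

const costA = formACost(16, 100);
const costB = formBCost(100, 5);

// Form A finishes slightly sooner but burns 16 times the CPU cycles.
console.log({ costA, costB });
```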

Are CPU cores independent? They’re less independent now than they used to be. Traditionally, the memory bus was the primary shared resource, and form A gets pretty good cache utilisation. All cores are doing the same job and will have the same memory access patterns, so caches mostly cover up the increased memory traffic.

Since 2011, Intel CPUs have supported Turbo Boost. This feature lets a single core run at higher than the nominal clock rate for a short period of time. This is motivated largely by thermal considerations. Obviously, the silicon is capable of running at a higher clock rate – otherwise it wouldn’t work at all, ever. The nominal clock speed is a self-imposed restriction that reflects that the heat cannot be removed from such a small area (maybe 1x1mm?) at a high enough rate to keep the core at a safe temperature. For a multicore package running at a lower clock rate, there’s more total heat but it’s spread across the package better. The individual cores do not reach a dangerous temperature.

So now, you can choose between multiple cores doing the same task at a lower clock rate versus single-core task+broadcast at a higher clock rate. Does this matter? It depends a lot on your hardware environment:

are you using a turbo-capable CPU?
are you virtualised?
are there other jobs running that are using cores on your CPU?
is cooling on the machine sufficient that you can keep the machine in turbo for any length of time?

If you’re in a cloud environment (EC2, DigitalOcean, etc) then there’s almost no chance that you can turbo a core as other people will be running on the same machine. Because you don’t have exclusive access to the CPUs, your cores might complete the same job in different amounts of time. A synchronised task will complete in the worst-case execution time, so your final time might be better if you reduce the number of cores you use.

On something like CUDA, there are good reasons to run the same task on many cores even if many of them are idle. The hardware architecture rewards you for orderly memory access and you usually have a scarcity of memory bandwidth, not cores. If you can do exactly the same task across many cores (where ‘exactly’ means ‘the same CPU instructions in lockstep’) then you can use all of the cores. There’s no way to fit spare tasks or other users into the spare cores like in a CPU-based environment, so if you don’t use them, they get wasted. Even taking different branches breaks the ‘exactly’ requirement, so you’re usually better off wasting cycles on some cores than having the size of your thread group drop from 16 to 1.

What’s the take home lesson? Test, test, test. Don’t assume. CPU cores since 2011 are less independent than they used to be, and the common practice of running identical tasks across many cores often doesn’t hold any more. Test it again.

More MPI performance optimisation

ian@mutexlabs.com (Ian Howson) — Wed, 04 May 2016 00:00:00 +0000

Previously, I compared the following forms commonly used for MPI programs; (A):

across all nodes:
    do the same task

or (B):

do a task on node 0
broadcast the result to all nodes

Under the assumptions that CPUs are independent and identical, for form A, your total execution time is T, the time taken to run one instance of the task. In form B, it’s T+B, where B is the amount of time taken to broadcast the result. Since B cannot be less than zero, T < T+B, so form A is better.

Right? Wrong! Maybe.

T is not constant. Even given a node full of identical unloaded CPUs, T will vary for no particular reason. In my previous post I covered some of the reasons why CPUs are not independent and thus T is not going to be the same for all CPUs.

Form A cannot complete until all CPUs have finished running the task. If T is different for different CPUs, the final execution time is the worst-case execution time for all CPUs. Form A has execution time max(T(all CPUs)); form B has execution time T0+B.
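
The two expressions are easy to compare numerically. In this sketch the per-CPU times and the broadcast time are made-up numbers, with one CPU slowed down as if it were shared with another job.

```javascript
// Form A: the job finishes only when the slowest CPU finishes.
function formATime(taskTimes) {
  return Math.max(...taskTimes);
}

// Form B: node 0 runs the task, then broadcasts the result.
function formBTime(t0, broadcastTime) {
  return t0 + broadcastTime;
}

// Hypothetical per-CPU task times in ms; the fifth CPU is contended.
const taskTimes = [100, 102, 98, 101, 140, 99, 100, 103];
const broadcastTime = 10;

const timeA = formATime(taskTimes);
const timeB = formBTime(taskTimes[0], broadcastTime);

// With one straggler, form B wins despite paying for the broadcast.
console.log({ timeA, timeB });
```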

Note that any sensible OS will assign your form B (single-threaded) task to the CPU with the lowest load and hence indirectly attain the lowest T0.

This is particularly relevant for cloud environments which are constantly oversold and so you are always sharing CPUs with someone else. If you use all of your assigned CPUs in a form A program, you’re going to experience a lot of variability in the execution time, and your final execution time is going to suffer. Using fewer CPUs than you’ve paid for may actually reduce overall execution time.

Form A is better if time B is relatively high – you have a lot of data to move around or your nodes are not sharing a memory bus.

So we should use form B always? No! Test, test, test!

How I build Meteor apps

ian@mutexlabs.com (Ian Howson) — Mon, 28 Mar 2016 00:00:00 +0000

Policies

Publications and subscriptions

I disagree with the official advice to put the subscription as close to use as possible. Most of the time, I put all subscriptions in the global namespace. This works perfectly for me most of the time, with a few obvious exceptions (huge and/or rapidly changing collections). Those ones get special effort to ensure that things remain performant – remember – premature optimisation is still the root of all evil. Global subscriptions save so much development effort and are almost always the right thing to do.

I view publications more like ‘file permissions’ than ‘send this data to the client’. That is, the client can see anything that is published at any time. It’s the server’s job to make sure that that is a sensible (appropriate and safe) subset of the data, and all that the client has to worry about is how best to present it. Remember that from a security point of view, you can’t effectively control data once it’s on the client.

Security

I almost always disable client-side updates and inserts on collections, preferring to use server-side Methods (with paranoid checking) wherever possible. The development overhead of doing this is minimal.
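
As a rough sketch of what ‘paranoid checking’ might look like, the function below validates untrusted arguments before anything touches the database. It is plain JavaScript so that it runs anywhere; `validateNewTodo`, the field names and the length limit are all hypothetical. In a real app this logic would sit at the top of a server-side Method, and Meteor’s check package covers much of the same ground.

```javascript
// Validate untrusted input from a client. Throws on anything unexpected.
function validateNewTodo(input) {
  if (input === null || typeof input !== "object" || Array.isArray(input)) {
    throw new TypeError("expected an object");
  }
  // Reject unknown keys rather than silently storing them.
  const allowedKeys = ["title", "done"];
  for (const key of Object.keys(input)) {
    if (!allowedKeys.includes(key)) {
      throw new TypeError(`unexpected field: ${key}`);
    }
  }
  if (typeof input.title !== "string" || input.title.length === 0 || input.title.length > 200) {
    throw new TypeError("title must be a string of 1-200 characters");
  }
  if (typeof input.done !== "boolean") {
    throw new TypeError("done must be a boolean");
  }
  // Return a fresh object so that only validated fields go any further.
  return { title: input.title, done: input.done };
}

console.log(validateNewTodo({ title: "Buy milk", done: false }));
```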

Overly permissive publications and subscriptions are by far the most common security issue. Check that your client-side caches are cleared when the user logs out. Manually query the client-side collections to ensure that only the exact data the client requires is actually present – it’s easy to mess up the publication side and overpublish.
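
One way to make that manual check less tedious is a small helper that lists any fields you did not expect to see. This is a sketch: `findUnexpectedFields` and the sample documents are hypothetical, and it only looks at top-level keys.

```javascript
// Report top-level fields that appear in documents but are not on the allow-list.
function findUnexpectedFields(docs, allowedFields) {
  const allowed = new Set(allowedFields);
  const unexpected = new Set();
  for (const doc of docs) {
    for (const key of Object.keys(doc)) {
      if (!allowed.has(key)) unexpected.add(key);
    }
  }
  return [...unexpected].sort();
}

// Hypothetical documents as they might appear in a client-side cache.
const cachedUsers = [
  { _id: "a1", username: "alice" },
  { _id: "b2", username: "bob", emails: ["bob@example.com"], services: {} },
];

const leakedFields = findUnexpectedFields(cachedUsers, ["_id", "username"]);
console.log(leakedFields);
```

In the browser console, something like Meteor.users.find().fetch() would supply the documents to check.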

Packages are a huge security risk. They’re not well audited at this stage and Meteor is small enough that many useful packages only have a small number of users.

Packages sometimes publish data automatically. It’s rare that this is mentioned in the documentation. You need to check that any publications fit with your access control model and (again) do not publish any more than is necessary.

Often, you’ll need to build an admin interface. Every user will receive a copy of this, and you need to think about whether it’s a risk. I usually leave admin interfaces in the same application. Occasionally it’s worth building admin code in a separate application pointing to the same MongoDB instance. This will create overheads for development. There are packages to automate this but I haven’t evaluated any yet.

The official Meteor Security Guide is excellent and worth reading carefully.

Package updates

Package updates and Meteor updates will cause breaking changes. You will need to retest everything.

So far, there is no mechanism to tell you which updates are security-related and which are merely bug/feature fixes. I hope that this is remedied soon. Individual package authors pay practically no attention to security; you’re on your own there.

Schemas

I don’t bother. Most data models are simple enough that it’s not necessary.

Error tracking

I install Raven/Sentry on every app that I deploy. It’s almost no effort and it will show you amazing debug information if anything goes wrong at runtime.

Google Analytics

There’s probably something better. This is fine for now.

Standard stuff

Deployments

I deploy small apps to Docker instances with a shared Mongo instance on a cheap 1GB VPS (BinaryLane, because they provide great performance at a great price and are in Sydney). This fits about a dozen low-traffic apps. Anything larger gets migrated to its own instance – usually to Amazon or DigitalOcean where it can programmatically scale.

Standard packages

raix:handlebar-helpers

Iron Router. I haven’t taken the time to learn Flow Router or decide if it’s going to be an improvement.

Semantic-UI. I’m no designer, but with Semantic-UI, nobody can tell the difference. I used to use Bootstrap but find Semantic much easier and prettier.

I use Blaze simply because I haven’t learned React or Angular yet. I should probably learn React at some point.

Structure

I put this last as it’s almost entirely personal preference and completely unimportant.

app/
    client/
        subscriptions.js
        templates and JS go here
    lib/
        collections.js
        shared JS goes here
    server/
        publications.js
        methods.js
        other server-side JS goes here
    public/
deployment/
    deploy.yaml - Ansible playbook
    hosts - the name of my server
doc/
    documentation in Markdown format
README.md

Connecting Meteor to Sentry

ian@mutexlabs.com (Ian Howson) — Fri, 11 Mar 2016 00:00:00 +0000

Sentry answers a critical question: are my users experiencing errors while using my application? You can test all day, but your users will do different things using different software, and they’ll find bugs that you won’t.

Sentry collects error logs from your application and aggregates them for later resolution. And if you’re proactive, you can contact users who have problems directly!

I assume that you already have a Sentry instance set up, either paid through getsentry.com, or self-hosted (which I use).

Set up a new project in Sentry

Both the client and the server side will log to the same project. You probably want one project for your production deployment and one for any staging or development deployments. There’s no reason not to use it in development (even on your local machine) – just keep the clutter separate from your production logs!

I use ‘Other’ for the Platform, as Meteor isn’t common enough yet to have its own integration helpers. You can also use Node.js; it makes little difference.

You also need to whitelist the domain that your application is running on. This is controlled from the ‘Client Security’ section of the Settings tab. The easiest thing to do is to just allow errors to be submitted from anywhere by adding ‘*’ to the whitelist.

Configure your Meteor project

First, add the logging plugin:

meteor add deepwell:raven

Sentry uses strings called “DSNs” to identify clients that are sending it events. You need to provide these to your Meteor project.

To set up the server, create server/raven.js with the following:

RavenLogger.initialize({
    server: '<long DSN>'
});

where <long DSN> is the first (longer) DSN value.

Similarly, to set up the client, create client/raven.js with:

RavenLogger.initialize({
    client: '<short DSN>'
});

<short DSN> is the second (shorter) DSN value.

Why the difference between client and server DSN?

The execution environment for the client is totally untrusted (it’s random people on the Internet) so there’s no point in authenticating them strongly. Random people can and might push garbage into your logs. You just need to be aware of that when you analyse them.

Your server is (hopefully!) trustworthy, so you can trust it with a longer authentication string, which prevents random or malicious users from pushing useless log entries.

The short DSN is just like a username. The long DSN is like a username/password pair. You don’t need to give the clients the password because you don’t trust them anyway.

Using a settings file

I strongly recommend that you use a settings.json file to store the DSN keys. This lets you easily switch between production and development configurations. This looks something like:

{
  "public" : {
    "ravenClientDSN": "<short DSN>"
  },
  "private" : {
    "ravenServerDSN": "<long DSN>"
  }
}

Then, your server init code looks like:

RavenLogger.initialize({
    server: Meteor.settings.private.ravenServerDSN
});

and the client:

RavenLogger.initialize({
    client: Meteor.settings.public.ravenClientDSN
});

Test

Start your app. On the client console, run:

RavenLogger.log('This is a test message');

Sentry should show your message.

Similarly, somewhere in server code (even in a new file), temporarily insert the line:

RavenLogger.log("This is a message sent from the server");

Using this in practice

Server-side exceptions should be caught and logged automatically. No extra work is required there.

On the client, exceptions are not automatically caught and logged. There’s probably an easy way to automatically wrap Meteor code, but I haven’t worked it out. Right now, you need to either:

wrap relevant chunks of code to catch and manually log exceptions
place RavenLogger.log() statements at relevant points (e.g. at assertion failures, places where you want telemetry)
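
The first option could be sketched as a small wrapper. `withErrorLogging` is a hypothetical helper of my own; the `log` argument stands in for whatever reporting function you use (RavenLogger.log, in this setup), and here it is replaced with a stub so that the example runs without Sentry. It only catches synchronous exceptions.

```javascript
// Wrap a function so that any exception is passed to `log` and then rethrown.
function withErrorLogging(fn, log) {
  return function (...args) {
    try {
      return fn.apply(this, args);
    } catch (err) {
      log(err instanceof Error ? err.message : String(err));
      throw err; // Rethrow so the failure is still visible to the caller.
    }
  };
}

// Stub logger that records messages instead of sending them anywhere.
const logged = [];
const parsePrice = withErrorLogging(
  (text) => {
    const value = Number(text);
    if (Number.isNaN(value)) throw new Error(`not a number: ${text}`);
    return value;
  },
  (message) => logged.push(message)
);

console.log(parsePrice("4.50"));
try {
  parsePrice("abc");
} catch (err) {
  // Already logged by the wrapper.
}
console.log(logged);
```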

How to deploy a Meteor project on your VPS using Docker

ian@mutexlabs.com (Ian Howson) — Tue, 19 Jan 2016 00:00:00 +0000

I have a lot of little Meteor projects. Here are the exact steps that I take to deploy them on a VPS in a resource-efficient manner. A cheap VPS with 1GB of RAM costs about $10/month and can support dozens of small Meteor deployments on its own.

I use a shared MongoDB instance and run the Meteor projects in Docker containers. nginx sits at the front end (port 80) and directs traffic to the appropriate Meteor project.

For the whole server

I’m starting with an Ubuntu 14.04 LTS 64-bit server running on DigitalOcean. The same should work for any recent Debian-like distribution and any VPS host (e.g. Amazon EC2). I assume that you are running as root.

Install packages

aptitude update
aptitude upgrade -y
aptitude install nginx mongodb

Follow the instructions to install Docker. If you don’t want to read all of that, paste the following into your terminal:

apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
echo "deb https://apt.dockerproject.org/repo ubuntu-trusty main" | sudo tee /etc/apt/sources.list.d/docker.list
aptitude update
aptitude install docker-engine

Download the relevant Docker images

docker pull meteorhacks/meteord:base

Expose MongoDB to Docker containers

In /etc/mongodb.conf, change:

bind_ip = 127.0.0.1

to

bind_ip = 127.0.0.1,10.0.3.1

For each Meteor application you want to deploy

Add a database and user to Mongo

Run mongo and enter the following.

use <databasename>
db.addUser( { user: "<username>",
              pwd: "<password>",
              roles: [ "readWrite", "dbAdmin" ]
            } )

Replace <databasename>, <username> and <password> as appropriate.

Note that this uses Mongo polling, not the oplog.

But… the oplog! Doesn’t polling suck?

Yes, polling sucks. Remember that this setup is meant for small deployments and multiple deployments where you have many applications sharing the same MongoDB. I haven’t figured out (yet!) how to securely share the oplog between applications – you don’t want one insecure application to compromise the others. If this installation started to grow, you might notice that your machine load was high (maybe due to polling, maybe something else) and you’d be best moving to a dedicated MongoDB server with oplog access. But you’re small for now, so don’t sweat it. Premature optimisation is the root of all evil, etc.

Set up your nginx frontend proxy

In /etc/nginx/sites-enabled/<sitename>:

server {
        server_name <hostname>;

        access_log /var/log/nginx/<sitename>.access.log;

        location / {
                proxy_pass         http://localhost:9001;
                proxy_redirect     off;

                proxy_set_header   Host              $host;
                proxy_set_header   X-Real-IP         $remote_addr;
                proxy_set_header   X-Forwarded-For   $proxy_add_x_forwarded_for;
                proxy_set_header   X-Forwarded-Proto $scheme;
        }
}

Replace <sitename> with a shortname for your project (e.g. todolist) and <hostname> with the URL that you want to access your project at (e.g. todolist.example.com).

Every time you deploy a new version of the application (and the first time)

Build a Meteor bundle

Within the Meteor project directory:

meteor build --architecture=os.linux.x86_64 ./

This will create a new .tar.gz file in your project directory.

(The minification process is a little stricter than the standard meteor run, so you might run into new syntax errors and the like.)

Copy it to the Docker host

You need a directory to store the bundles; all of the .tar.gz files in the directory will be decompressed by the Docker image.

rsync --inplace -vP <bundlename>.tar.gz root@vpshost.example.com:/opt/whatever/whatever.tar.gz

Each bundle either needs to go into a unique directory, or you need to clear out old ones when you upload new ones.

Shut down the old container (if necessary)

Run the new container

docker run -d \
    -e ROOT_URL=http://<app-url> \
    -e MONGO_URL=mongodb://<user>:<password>@10.0.3.1/<database-name> \
    -v /<bundle-dir>:/bundle \
    -p 9001:80 \
    meteorhacks/meteord:base

Note that 10.0.3.1 is the default IP address of the host where we are running the MongoDB server.
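
With several apps on one host, each container needs its own host port, and the nginx proxy_pass target has to match the -p flag given to docker run. Here is a sketch of keeping the two in step; `assignPorts` and the hostnames are hypothetical, and 9001 is just the port used in the example above.

```javascript
// Give each app a distinct host port and derive the matching nginx and Docker settings.
function assignPorts(hostnames, firstPort = 9001) {
  return hostnames.map((hostname, index) => {
    const port = firstPort + index;
    return {
      hostname,
      port,
      proxyPass: `http://localhost:${port}`, // goes in the nginx location block
      dockerPortFlag: `-p ${port}:80`, // goes in the docker run command
    };
  });
}

const apps = assignPorts(["todolist.example.com", "notes.example.com"]);
for (const app of apps) {
  console.log(`${app.hostname}: proxy_pass ${app.proxyPass}; docker run ${app.dockerPortFlag}`);
}
```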

Why Apple would make a phone with no headphone jack

ian@mutexlabs.com (Ian Howson) — Mon, 18 Jan 2016 00:00:00 +0000

TL;DR: Apple’s BLE audio streaming protocol cannot be easily copied and allows them to make a thinner phone than its competitors without fear of copycats.

In order for Apple’s phones to remain competitive and continue commanding a premium price, they need to be differentiated from Android phones. One of Apple’s strategies for this is to integrate technologies which the Android manufacturers can’t. A prime example is their purchase of AuthenTec to produce Touch ID sensors - AuthenTec’s sensors are the best on the market, and now no other phone manufacturer can use them. They can’t redevelop the technology thanks to high technical difficulty and extensive patent protection.

Apple has Bluetooth Low Energy audio streaming technology. BLE doesn’t normally support audio streaming; there’s no standard protocol, and conventional Bluetooth audio streaming has a lot of problems (high energy consumption, inconsistent codec support, poor audio quality, difficult pairing). So an Apple-controlled BLE audio streaming protocol is important: they can solve all of the above problems. Apple has enough marketshare - and owns Beats, a headphone manufacturer - that it can make such a protocol commonplace. Apple thus has a high quality product that other manufacturers cannot easily duplicate.

So a headphone-jack-free phone is technically feasible. Without the headphone jack, Apple can make the phone even thinner than before. The diameter of the headphone jack dictates the thickness of the iPhone 6 and 6S. Because no other manufacturer has good audio streaming tech, they can’t remove the headphone jack from their phones, and the jack continues to dictate the thickness of their phones. iPhone 7 can be thinner than the competitors - and Apple establishes another difficult-to-duplicate point of differentiation.

Predictions:

Beats headphones will have a built-in receiver and will not look different at all
Apple will sell chips to device manufacturers who want to make wireless headphones under MFi, like they do with Lightning ports. They will not open the protocol so as to lock out non-Apple devices.
iPhone 7 will ship with an external headphone dongle that connects to the phone’s Lightning port and provides a 3.5mm jack
Apple (or an MFi licensee) will sell an external headphone adaptor that connects to the iPhone over BLE. It will have a small internal battery, achieve 12 hours of listening time, and charge using a built-in Lightning connector that plugs directly into the bottom of the iPhone (much like the Apple Pencil does with the iPad Pro). It will have a small lapel clip to attach to clothing.
There will be an iPhone 7 with no headphone jack - consider it an iPhone Air. The iPhone 7 Plus might have a headphone jack. The cheap iPhone (iPhone 5S, maybe iPhone 6) will still have a headphone jack.
Android manufacturers will come up with a low-quality alternative protocol - but strict adherence to the BLE spec will restrict audio quality to very low levels. Latency will be poor.

How to fix your Skull Shaver Bald Eagle if it turns itself off

ian@mutexlabs.com (Ian Howson) — Mon, 18 Jan 2016 00:00:00 +0000

I used my Bald Eagle successfully three times. On the fourth, I was travelling. I got halfway through cutting my hair and it turned itself off. I pressed the button, it ran for a few seconds and turned itself off again. I charged it, I rinsed it, I soaked it, but it still would not stay on for more than a few seconds. I looked around for a hat so I could go and buy a disposable razor.

Skull Shaver customer service didn’t respond to my email. I saw a lot of similar complaints on Amazon. I started to worry a bit – being in Australia, I had little hope of getting a warranty replacement, and shipping was expensive and slow.

The motor unit seemed to run alright – was there something wrong with the heads? I pulled them apart, and surprise, surprise:

HAIR.

So much hair.

I cleaned it out and haven’t had a problem since. Here’s how you can do the same.

Step 1: Detach the heads.

Grasp the heads and pull directly away from the battery.

Step 2: Open up one of the heads

There’s a little notch between the silver and black part of the head. I stick a thumbnail in there and twist to separate the two halves.

Step 3: Clean out the hair

I used the little nylon brush that came with the unit.

Step 4: Reassemble the head

The two halves just click together.

Step 5: Clean the other three heads

Step 6: Clean the centre head

It just twists off. There are some little arrows that show the direction. Mine is usually pretty clean, though.

Step 7: Wash the hair down the sink before your wife sees it

This is what was in my shaver heads. Yes, I was doing the recommended “immerse in water and turn it on” routine.

So why does this happen? I clean mine thoroughly after every shave!

In my case, I think there were two causes. One, I only shave my head once a week – so it’s longer than the recommended 0.4mm. The hair didn’t seem to escape the heads during the cleaning routine.

Two, the hair that came out was clumped together and thick with grease. I use sunscreen a lot. It didn’t occur to me that that would get inside the shaver and gum it up.

I thought initially that the grease was part of the head mechanism, intended to lubricate it, but given that it’s all stainless steel and nylon, it probably doesn’t need lubrication.

So, the gears got gummed up. The load on the motor and the batteries increased. The lithium batteries probably have a polyswitch (resettable fuse) for safety – you don’t want them catching fire if there’s a short circuit. The polyswitch tripped and the motor stopped. It cooled and reset, and I pressed the power button again. The polyswitch tripped again. Repeat.

Epilogue

For me, regular cleaning helped, but did not solve the problem. Even with a completely clean head, I wasn’t able to finish a shave.

I pulled mine apart and figured out how to add a mechanical switch that would bypass the electronics. I ran out of time and never got any love from customer service, so I wound up throwing it in the bin. Man, that was expensive.

I’m using Mach 3’s and am very happy.

Persistent computer hardware myths

ian@mutexlabs.com (Ian Howson) — Sun, 17 Jan 2016 00:00:00 +0000

1. Don’t buy SSDs because they’ll wear out if you write too much

OK, yeah. They’ll wear out. But you’ll probably throw them out first.

SSDs have a finite write lifespan, sure. But unless you’re running a database server with a heavy write workload, it’s not going to wear out for at least 10 years.

All SSDs track their lifespan. They are perfectly capable of warning you when they’re wearing out through the SMART system. They won’t just forget everything; they can degrade gracefully.

Sudden death is definitely a common way for SSDs to fail – but this is true of spinning disks as well. The cause is not the flash media wearing out; it’s the usual problems with any electronic device failing suddenly. Spinning disks certainly fail suddenly, and for a litany of mechanical reasons that SSDs aren’t vulnerable to.

I’ve got 10-year-old SSDs (which have been running for all 10 years) which still show > 90% lifetime remaining.

It’s just not an issue.

The worst instance of this that I’ve heard is “don’t buy an SSD because it will wear out. Buy a regular hard drive instead.” Talk about throwing the baby out with the bathwater. You’re going to have a slow computer just so it won’t wear out in what, 10 years? Just buy another one!

(A spinning disk won’t last that long, anyway.)

2. Disk encryption is slow

This wasn’t true even when encryption was done in software, and now that it’s 100% hardware accelerated (on both phones and regular computers), it’s a complete non-issue. Hard drives (even SSDs) are really slow compared with CPUs. The bottleneck is the drive, not the encryption software.

Practically everything written to an iPhone is encrypted – even the really old ones – and you don’t see complaints about slow writes on them.

3. Faster CPUs are worth paying for

At best, you’ll get 10% performance increase from a 10% clock speed bump. But computer performance is a ‘weakest link’ type of affair; it’s no good having a 100GHz CPU if you’re bottlenecking on memory or I/O. This is why I’m so gung-ho about buying SSDs; a spinning hard drive is practically always the bottleneck these days.

You’re paying, what, $300 for a 5% clock speed increase, which won’t amount to anything in reality? Buy an SSD first and more RAM second. Once you’ve got a Samsung 950 SSD, at least 16GB of RAM and a massive GPU, then maybe consider a CPU bump. But probably just save your money for the next revision of hardware in a year.

Design

ian@mutexlabs.com (Ian Howson) — Thu, 18 Jun 2015 00:00:00 +0000

The EM algorithm for mixtures of inverse Gaussian distributions

A literature review did not find any existing implementations of the EM algorithm for inverse Gaussian mixture models. Therefore, we must derive one from scratch.

As a model, the method given in Bilmes (1998, p. 3-7) is used. It shows the derivation of the EM equations for Normal mixture models.

The following variables are used:

$ \Theta $: the set of parameter estimates for the mixture model. $ \Theta^{g} $ refers to the ‘guessed’ parameter set.

$ \lambda_{\ell} $: the shape parameter for the $ \ell $th mixture component

$ \mu_{\ell} $: the mean parameter for the $ \ell $th mixture component

$ \alpha_{\ell} $: the mixture weight parameter for the $ \ell $th mixture component

E-step

From Bilmes (1998, p. 2), we define:

$$ Q(\Theta,\Theta^{(i-1)})=E\left[\log p(X,Y|\Theta)|X,\Theta^{(i-1)}\right] $$

For the inverse Gaussian distribution, the density of an observation $ x $ under mixture component $ \ell $, given parameter guesses $ \lambda_{\ell} $ and $ \mu_{\ell} $, is (Wikipedia, 2015a):

$$ p_{\ell}(x|\lambda_{\ell},\mu_{\ell})=\left[\frac{\lambda_{\ell}}{2\pi x^{3}}\right]^{1/2}\exp\frac{-\lambda_{\ell}(x-\mu_{\ell})^{2}}{2\mu_{\ell}^{2}x}\label{eq:probfunc}\tag{1} $$

We denote the mixing proportion of component $ \ell $ by $ \alpha_{\ell} $ in order to avoid confusion with the constant $ \pi $. We have the constraint that

$$ \sum_{\ell=1}^{M}\alpha_{\ell}=1 $$

The $ Q $ function is given by Bilmes (1998, p. 4):

$$ Q(\Theta,\Theta^{g})=\sum_{\ell=1}^{M}\sum_{i=1}^{N}\log(\alpha_{\ell})p(\ell|x_{i},\Theta^{g})+\sum_{\ell=1}^{M}\sum_{i=1}^{N}\log(p_{\ell}(x_{i}|\theta_{\ell}))p(\ell|x_{i},\Theta^{g})\label{eq:bilmes-eqn-5}\tag{2} $$
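The weights $ p(\ell|x_{i},\Theta^{g}) $ that appear in both terms of $ Q $ follow from Bayes' rule (Bilmes, 1998): each is the weighted component density at $ x_{i} $ divided by the mixture density at $ x_{i} $. A minimal sketch in plain Python follows. The function names `ig_pdf` and `responsibilities` and the example values are illustrative assumptions, not taken from bulkem.

```python
import math

def ig_pdf(x, mu, lam):
    """Inverse Gaussian density, as in equation (1)."""
    return math.sqrt(lam / (2 * math.pi * x**3)) * math.exp(
        -lam * (x - mu)**2 / (2 * mu**2 * x))

def responsibilities(xs, alpha, mu, lam):
    """Return resp[l][i] = p(l | x_i, Theta^g) for each component l and observation i."""
    M = len(alpha)
    resp = [[0.0] * len(xs) for _ in range(M)]
    for i, x in enumerate(xs):
        # alpha_l * p_l(x_i | theta_l) for every component
        weighted = [alpha[l] * ig_pdf(x, mu[l], lam[l]) for l in range(M)]
        total = sum(weighted)  # mixture density at x_i
        for l in range(M):
            resp[l][i] = weighted[l] / total
    return resp

# Illustrative values only.
xs = [0.5, 1.0, 2.0, 4.0]
resp = responsibilities(xs, alpha=[0.5, 0.5], mu=[1.0, 3.0], lam=[2.0, 6.0])
```

For every observation the weights sum to one across components, which is a quick sanity check on any implementation.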

M-step

On each iteration, we need to improve the parameter estimates based on the previous estimates (Bilmes, 1998, p. 2):

$$ \Theta^{(i)}=argmax_{\Theta}Q(\Theta,\Theta^{(i-1)}) $$

The left-hand term of $ \eqref{eq:bilmes-eqn-5} $ is independent of the inverse Gaussian parameters and the right-hand term is independent of $ \alpha $. Therefore, we can reuse the result from Bilmes (1998, p. 5):

$$ \alpha_{\ell}^{new}=\frac{1}{N}\sum_{i=1}^{N}p(\ell|x_{i},\Theta^{g}) $$

We now need to maximise the following expression with respect to each of $ \lambda_{\ell} $ and $ \mu_{\ell} $

$$ \sum_{\ell=1}^{M}\sum_{i=1}^{N}\log(p_{\ell}(x_{i}|\theta_{\ell}))p(\ell|x_{i},\Theta^{g})\label{eq:sumsum_logp}\tag{3} $$

Taking the log of $ \eqref{eq:probfunc} $, we get:

$$ \log p_{\ell}(x|\lambda_{\ell},\mu_{\ell}) = \log\left(\left[\frac{\lambda_{\ell}}{2\pi x^{3}}\right]^{1/2}\exp\frac{-\lambda_{\ell}(x-\mu_{\ell})^{2}}{2\mu_{\ell}^{2}x}\right) $$

$$ = \frac{1}{2}\log\left(\frac{\lambda_{\ell}}{2\pi x^{3}}\right)-\frac{\lambda_{\ell}(x-\mu_{\ell})^{2}}{2\mu_{\ell}^{2}x} $$

$$ = \frac{1}{2}\log(\lambda_{\ell})-\frac{1}{2}\log(2\pi x^{3})-\frac{\lambda_{\ell}(x-\mu_{\ell})^{2}}{2\mu_{\ell}^{2}x} $$

Substituting this into $ \eqref{eq:sumsum_logp} $, we get:

$$ \sum_{\ell=1}^{M}\sum_{i=1}^{N}\left(\frac{1}{2}\log(\lambda_{\ell})-\frac{1}{2}\log(2\pi x_{i}^{3})-\frac{\lambda_{\ell}(x_{i}-\mu_{\ell})^{2}}{2\mu_{\ell}^{2}x_{i}}\right)p(\ell|x_{i},\Theta^{g})\label{eq:func-to-diff}\tag{4} $$

We wish to maximise $ \eqref{eq:func-to-diff} $ for $ \mu_{\ell} $, so we take its partial derivative with respect to $ \mu_{\ell} $ and set it equal to 0. Note that the first two terms inside the summations are independent of $ \mu_{\ell} $, so we can ignore them for the purpose of maximisation; we seek

$$ \frac{\partial}{\partial\mu_{\ell}}\sum_{i=1}^{N}\left[-\frac{\lambda_{\ell}(x_{i}-\mu_{\ell})^{2}}{2\mu_{\ell}^{2}x_{i}}\right]p(\ell|x_{i},\Theta^{g}) $$

$$ = -\lambda_{\ell}\frac{\partial}{\partial\mu_{\ell}}\sum_{i=1}^{N}\left[\frac{(x_{i}-\mu_{\ell})^{2}p(\ell|x_{i},\Theta^{g})}{2\mu_{\ell}^{2}x_{i}}\right]\label{eq:partial}\tag{5} $$

Concentrating just on the inner term of the summation, we need

$$ \frac{\partial}{\partial\mu_{\ell}}\frac{(x_{i}-\mu_{\ell})^{2}p(\ell|x_{i},\Theta^{g})}{2\mu_{\ell}^{2}x_{i}} $$

Apply the quotient rule:

$$ = \frac{2\mu_{\ell}^{2}x_{i}\left(-2x_{i}p(\ell|x_{i},\Theta^{g})+2\mu_{\ell}p(\ell|x_{i},\Theta^{g})\right)-4(x_{i}-\mu_{\ell})^{2}p(\ell|x_{i},\Theta^{g})\mu_{\ell}x_{i}}{4\mu_{\ell}^{4}x_{i}^{2}} $$

$$ = \frac{p(\ell|x_{i},\Theta^{g})(\mu_{\ell}-x_{i})}{\mu_{\ell}^{3}} $$

Substituting this back into $ \eqref{eq:partial} $ and setting it to zero, we get:

$$ -\lambda_{\ell}\sum_{i=1}^{N}\frac{p(\ell|x_{i},\Theta^{g})(\mu_{\ell}-x_{i})}{\mu_{\ell}^{3}} = 0 $$

$$ \frac{1}{\mu_{\ell}^{3}}\sum_{i=1}^{N}p(\ell|x_{i},\Theta^{g})(\mu_{\ell}-x_{i}) = 0 $$

Since $ \mu_{\ell}>0 $ and $ \lambda_{\ell}>0 $ for the inverse Gaussian distribution, the factor $ 1/\mu_{\ell}^{3} $ is always defined and non-zero, so we can divide it out:

$$ \sum_{i=1}^{N}p(\ell|x_{i},\Theta^{g})(\mu_{\ell}-x_{i}) = 0 $$

$$ \mu_{\ell}\sum_{i=1}^{N}p(\ell|x_{i},\Theta^{g})-\sum_{i=1}^{N}p(\ell|x_{i},\Theta^{g})x_{i} = 0 $$

$$ \mu_{\ell} = \frac{\sum_{i=1}^{N}p(\ell|x_{i},\Theta^{g})x_{i}}{\sum_{i=1}^{N}p(\ell|x_{i},\Theta^{g})} $$

This is identical to the $ \mu_{\ell} $ maximiser for the Normal distribution (Bilmes, 1998, p. 7).

In the same way, we maximise $ \eqref{eq:func-to-diff} $ for $ \lambda_{\ell} $:

$$ \frac{\partial}{\partial\lambda_{\ell}}\sum_{i=1}^{N}(\frac{1}{2}\log(\lambda_{\ell})-\frac{\lambda_{\ell}(x_{i}-\mu_{\ell})^{2}}{2\mu_{\ell}^{2}x_{i}})p(\ell|x_{i},\Theta^{g}) = 0 $$

$$ \frac{\partial}{\partial\lambda_{\ell}}\sum_{i=1}^{N}\frac{1}{2}\log(\lambda_{\ell})p(\ell|x_{i},\Theta^{g})-\frac{\partial}{\partial\lambda_{\ell}}\sum_{i=1}^{N}\frac{\lambda_{\ell}(x_{i}-\mu_{\ell})^{2}p(\ell|x_{i},\Theta^{g})}{2\mu_{\ell}^{2}x_{i}} = 0 $$

$$ \sum_{i=1}^{N}\frac{p(\ell|x_{i},\Theta^{g})}{2\lambda_{\ell}}-\sum_{i=1}^{N}\frac{(x_{i}-\mu_{\ell})^{2}p(\ell|x_{i},\Theta^{g})}{2\mu_{\ell}^{2}x_{i}} = 0 $$

$$ \sum_{i=1}^{N}\frac{p(\ell|x_{i},\Theta^{g})}{\lambda_{\ell}} = \sum_{i=1}^{N}\frac{(x_{i}-\mu_{\ell})^{2}p(\ell|x_{i},\Theta^{g})}{\mu_{\ell}^{2}x_{i}} $$

$$ \lambda_{\ell} = \frac{\sum_{i=1}^{N}p(\ell|x_{i},\Theta^{g})}{\sum_{i=1}^{N}\frac{(x_{i}-\mu_{\ell})^{2}p(\ell|x_{i},\Theta^{g})}{\mu_{\ell}^{2}x_{i}}} $$

In summary, the update equations are:

$$ \alpha_{\ell}^{new} = \frac{1}{N}\sum_{i=1}^{N}p(\ell|x_{i},\Theta^{g}) $$

$$ \mu_{\ell}^{new} = \frac{\sum_{i=1}^{N}x_{i}p(\ell|x_{i},\Theta^{g})}{\sum_{i=1}^{N}p(\ell|x_{i},\Theta^{g})} $$

$$ \lambda_{\ell}^{new} = \frac{\sum_{i=1}^{N}p(\ell|x_{i},\Theta^{g})}{\sum_{i=1}^{N}\frac{(x_{i}-\mu_{\ell}^{old})^{2}p(\ell|x_{i},\Theta^{g})}{(\mu_{\ell}^{old})^{2}x_{i}}} $$
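As a rough illustration, the three update equations for a single component reduce to three weighted sums over the observations. The sketch below assumes the membership probabilities have already been computed; the function name `m_step` and its argument names are hypothetical.

```python
def m_step(xs, resp_l, mu_old):
    """Update (alpha, mu, lambda) for one mixture component.

    xs      : observations x_1..x_N
    resp_l  : resp_l[i] = p(l | x_i, Theta^g) for this component
    mu_old  : previous mean estimate, used by the lambda update as in
              the summary equations
    """
    n = len(xs)
    sum_p = sum(resp_l)                                   # sum of p
    sum_px = sum(p * x for p, x in zip(resp_l, xs))       # sum of x * p
    sum_dev = sum(p * (x - mu_old)**2 / (mu_old**2 * x)   # sum of scaled squared deviations
                  for p, x in zip(resp_l, xs))
    alpha_new = sum_p / n
    mu_new = sum_px / sum_p
    lam_new = sum_p / sum_dev
    return alpha_new, mu_new, lam_new

# Illustrative call: every observation fully assigned to the component.
alpha_new, mu_new, lam_new = m_step([1.0, 2.0, 4.0], [1.0, 1.0, 1.0], mu_old=2.0)
```

This sketch does not guard against `sum_dev` being zero, which happens only if every observation with non-zero weight equals `mu_old`.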

We have converged when the log-likelihood changes by less than $ \epsilon $ between iterations, where the log-likelihood is given by:

$$ \log L(\Theta;\boldsymbol{x})=\sum_{n=1}^{N}\log\left(\sum_{m=1}^{M}\alpha_{m}p_{m}(x_{n}|\theta_{m})\right) $$

Initialisation of the EM algorithm

Originally, we attempted to set initialisation parameters the same on every run (i.e. $ \mu_{1}=0.99 $, $ \mu_{2}=1.01 $, $ \lambda_{1}=1 $, $ \lambda_{2}=1 $, $ \alpha_{1}=0.5 $, $ \alpha_{2}=0.5 $). This yielded poor quality fits on many datasets. For example, on the BMI dataset included with the mixsmsn package (Prates et al., 2013), the following model was generated:

This is clearly unsatisfactory as it does not adequately model the bimodal nature of the data.

Additional research was conducted to find effective ways to initialise the EM algorithm. Many papers suggested clustering approaches such as k-means, but they require well-separated data.

We elected to use a random initialisation strategy to improve the fit, roughly following the ‘alternative method’ from McLachlan et al. (2000, p. 55) or the ‘subset approach’ algorithm described in Schepers (2015). We chose this method as it is easy to implement and produces higher-quality fits than other methods at the expense of computation time (Schepers, 2015, p. 142). The algorithm proceeds as follows:

for each random start (e.g. 100 times):
    for each mixture component (e.g. two):
        sample p+1 observations from the dataset
        use the ML equations to estimate distribution parameters for those observations
        use those parameters as initial values for the component
    EM fit using those initial parameters and record the log-likelihood of the solution
choose the fit that achieves the highest log-likelihood

$ p $ is the number of parameters that characterise the distribution (2 for the inverse Gaussian distribution). We always set the mixing weight ($ \alpha $) components to have equal weighting (i.e. 0.5 for a two-component mixture) as we have no information on the true weighting of the components.
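The procedure above can be sketched in Python as follows. The closed-form ML estimates used here are the standard ones for the inverse Gaussian distribution ($ \hat{\mu}=\bar{x} $ and $ \hat{\lambda}=n/\sum_{i}(1/x_{i}-1/\hat{\mu}) $). The function names and the `em_fit` callback are assumptions made for illustration and are not bulkem's interface.

```python
import random

def ig_mle(sample):
    """Closed-form ML estimates (mu, lambda) for an inverse Gaussian sample.

    Fails with ZeroDivisionError if all sampled values are identical.
    """
    n = len(sample)
    mu = sum(sample) / n
    lam = n / sum(1.0 / x - 1.0 / mu for x in sample)
    return mu, lam

def random_start(xs, n_components=2, p=2, rng=random):
    """Draw initial parameters for one random start."""
    mu, lam = [], []
    for _ in range(n_components):
        sample = rng.sample(xs, p + 1)   # p+1 observations, without replacement
        m, l = ig_mle(sample)
        mu.append(m)
        lam.append(l)
    alpha = [1.0 / n_components] * n_components  # equal mixing weights
    return alpha, mu, lam

def best_of_random_starts(xs, em_fit, n_starts=100, rng=random):
    """Run em_fit from each random start and keep the highest log-likelihood.

    em_fit is a caller-supplied (hypothetical) function taking
    (xs, alpha, mu, lam) and returning (log_likelihood, fitted_params).
    """
    best = None
    for _ in range(n_starts):
        result = em_fit(xs, *random_start(xs, rng=rng))
        if best is None or result[0] > best[0]:
            best = result
    return best
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is useful when comparing fits.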

Using this algorithm on the BMI dataset, we achieve the following fit:

The new model better reflects the bimodal structure of the data.

System design

Traditionally, CUDA is used to make a single large task run very quickly. Each observation in the dataset is assigned to a separate CUDA thread; this is termed data parallelism (Wikipedia, 2014). Larger CUDA hardware can process more observations simultaneously. Woolley (2013) states that “To get good performance … You want to have 14K or more threads running concurrently.” To maximise performance, we must make each step run as quickly as possible, even if it is wasteful of machine resources. Most tasks proceed as follows:

bulkem is intended to work efficiently on relatively small datasets with fewer than 14,000 observations. It expects to see a large number of datasets (thousands) and/or random starts. This suggests that we will need to have multiple tasks running simultaneously on the GPU in order to achieve good performance; task parallelism (Wikipedia, 2015c) is more appropriate. This is a relatively uncommon usage of CUDA hardware; Tzeng et al. (2012) give a brief overview.

Recent CUDA hardware supports a feature called “streams” (Rennich, 2011) which allows the GPU to perform a number of tasks simultaneously. Recent hardware can simultaneously execute up to 16 CUDA kernels while copying data back and forth between host RAM and GPU memory. We can use streams to:

Overlap memory copies to and from the GPU
Execute multiple kernels simultaneously. As our datasets are relatively small, this ensures that the GPU is not sitting idle.
Use extra CPUs to queue more work for the GPU to perform, again ensuring that the GPU is kept busy.

Assuming sufficient GPU resources are available, the execution flow might look like:

The high-level strategy for bulkem’s CUDA path is therefore:

Retrieve the list of datasets from R
Assign each dataset to a different CPU thread. Each CPU thread has an associated CUDA stream.
Each thread generates a number of initial parameters for the dataset. It uses the GPU to execute the EM algorithm for each set of initial parameters.
The best fit is stored in a list
When all threads have finished (i.e. all datasets have been fit) the list is transferred back to R
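The host-side structure of this strategy can be illustrated with a small thread pool. This is only a sketch of the control flow under stated assumptions: it uses Python threads rather than bulkem's own threading code, no GPU work is issued, a fixed-size pool stands in for the per-dataset threads, and `fit_one_start` is a hypothetical placeholder for ‘generate initial parameters and run EM’.

```python
from concurrent.futures import ThreadPoolExecutor

def fit_dataset(dataset, fit_one_start, n_starts):
    """Work done by one CPU thread: try several starts, keep the best fit.

    fit_one_start(dataset, start_index) must return a dict with a
    'loglik' entry. In the design described above, this is where the
    thread would issue EM work to the GPU on its own CUDA stream.
    """
    fits = (fit_one_start(dataset, s) for s in range(n_starts))
    return max(fits, key=lambda fit: fit["loglik"])

def fit_all(datasets, fit_one_start, n_starts=100, n_threads=4):
    """Fit every dataset; results are returned in input order."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(fit_dataset, d, fit_one_start, n_starts)
                   for d in datasets]
        # Collecting the results blocks until all threads have finished.
        return [f.result() for f in futures]
```

Collecting results in submission order keeps the returned list aligned with the input list of datasets, which simplifies handing it back to the caller.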

CPU thread design

Each CPU thread controls a single CUDA stream. It chooses a dataset to fit, generates many sets of initial parameters and fits each using EM. The bulk of the work of EM fitting is performed on the GPU. After fitting, the best fit is selected and stored.

EM kernel design

The final kernel design is guided by the need to minimise the number of kernel launches. The reasons for this are explored in the section on failed kernel designs below.

Recall the equations that we need to evaluate to perform a single iteration of the EM algorithm:

$$ \alpha_{\ell}^{new} = \frac{1}{N}\sum_{i=1}^{N}p(\ell|x_{i},\Theta^{g}) $$

$$ \mu_{\ell}^{new} = \frac{\sum_{i=1}^{N}x_{i}p(\ell|x_{i},\Theta^{g})}{\sum_{i=1}^{N}p(\ell|x_{i},\Theta^{g})} $$

$$ \lambda_{\ell}^{new} = \frac{\sum_{i=1}^{N}p(\ell|x_{i},\Theta^{g})}{\sum_{i=1}^{N}\frac{(x_{i}-\mu_{\ell}^{old})^{2}p(\ell|x_{i},\Theta^{g})}{(\mu_{\ell}^{old})^{2}x_{i}}} $$

Each operation within the equations can be classified as either:

performing an operation on each observation (e.g. evaluating $ p(\ell|x_{i},\Theta^{g}) $ or $ (x_{i}-\mu_{\ell}^{old})^{2} $ ), or
summing combinations of these operations (the $ \sum_{i=1}^{N} $ operation appears multiple times)

Each iteration of the EM algorithm requires $ 2+M $ kernel launches (where $ M $ is the number of mixture components being fit). The first performs the per-observation calculations. At the end of this launch, the operands to every summation are available. This stage is referred to as member_prob_kernel in the source code.

The second launch performs the summation required to evaluate the current log-likelihood. This is called lp_sum_kernel and is described in more detail in the section on the fused single launch sum kernel. After this launch, we stop if convergence has been achieved.

The remaining launches perform the summations required to evaluate the new mixture parameter values, again using lp_sum_kernel.

The process to perform a single iteration of the EM algorithm is therefore:

Perform per-observation operations using member_prob_kernel
Sum over the per-observation likelihoods to calculate the solution log-likelihood using lp_sum_kernel
If we have converged (if the new log-likelihood is within $ \epsilon $ of the old log-likelihood) stop the process
Otherwise, perform the remaining summations using lp_sum_kernel. The update equations can then be evaluated to generate the next iteration’s initial parameter estimates.
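To make the ordering of these stages concrete, here is a CPU-only Python sketch of the same loop. It mirrors the staging (per-observation work, log-likelihood sum, convergence test, remaining sums) but is not the CUDA implementation; the function names and default values are assumptions.

```python
import math

def ig_pdf(x, mu, lam):
    """Inverse Gaussian density."""
    return math.sqrt(lam / (2 * math.pi * x**3)) * math.exp(
        -lam * (x - mu)**2 / (2 * mu**2 * x))

def em_fit(xs, alpha, mu, lam, eps=1e-6, max_iter=100):
    """Run EM until the log-likelihood changes by less than eps."""
    alpha, mu, lam = list(alpha), list(mu), list(lam)  # do not modify the caller's lists
    M, N = len(alpha), len(xs)
    old_ll = -math.inf
    ll = old_ll
    for _ in range(max_iter):
        # Stage 1 (per-observation work): weighted densities, mixture
        # density and membership probabilities.
        dens = [[alpha[l] * ig_pdf(x, mu[l], lam[l]) for x in xs] for l in range(M)]
        mix = [sum(dens[l][i] for l in range(M)) for i in range(N)]
        resp = [[dens[l][i] / mix[i] for i in range(N)] for l in range(M)]
        # Stage 2 (one summation): log-likelihood of the current parameters.
        ll = sum(math.log(m) for m in mix)
        # Stage 3: stop if converged.
        if abs(ll - old_ll) < eps:
            break
        old_ll = ll
        # Stage 4 (remaining summations): three sums per component, then
        # the update equations. The lambda update uses the old mean.
        for l in range(M):
            s_p = sum(resp[l])
            s_px = sum(p * x for p, x in zip(resp[l], xs))
            s_dev = sum(p * (x - mu[l])**2 / (mu[l]**2 * x)
                        for p, x in zip(resp[l], xs))
            alpha[l], mu[l], lam[l] = s_p / N, s_px / s_p, s_p / s_dev
    return ll, (alpha, mu, lam)
```

The convergence test sits between the log-likelihood sum and the remaining sums, so a converged fit skips the parameter summations entirely.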

Learning from failed kernel designs

It took a number of attempts to design a CUDA kernel that performed well. The following table summarises the failed attempts².

For reference, the pure R implementation took around 80ms to produce each fit using a single CPU core. Each fit runs on the same problem with the same initial conditions, running 100 iterations of the EM algorithm to produce a result.

Description	Time to fit	Problems
Basic CUDA C	2.3ms	Summations were implemented incorrectly, so results were incorrect
Using the Thrust API	80ms	Slow; roughly the same performance as R on CPU
Thrust with Streams	80-290ms	Multiple threads interacted poorly; often slower than before
CUB single-threaded	21ms	Not fast enough to justify use of GPU
CUB multi-threaded	100ms	cudaFuncGetAttributes call using a large amount of time
Modify CUB	15ms	Not fast enough to justify use of GPU
Replace CUB with single-launch sum kernel	3.5ms	Kernel launch time is now the chief bottleneck
Fuse multiple summations	2.5ms	Kernel launch time is now the chief bottleneck

Broadly, we learned the following:

With such small datasets, kernel launch overhead takes the vast majority of the time. This is explored further below.
Libraries such as Thrust (Hoberock et al., 2015) and CUB (NVIDIA Corporation, 2015), while making it relatively easy to develop code that runs on CUDA hardware, assume that kernel launch overhead is relatively small. They perform a lot of independent kernel launches. This makes them unsuitable in this application. We must write CUDA C/C++ by hand.
Support for CUDA streams is still relatively new to Thrust. A lot of operations – particularly memory copies – are performed without streams in mind, which costs performance.
Both CUB and Thrust run poorly in multithreaded environments such as this one. The different threads interact, costing performance³.
Performing a summation in CUB or Thrust launches many kernels.

Kernel launch time

Kernel launch time is the time that it takes the CPU to launch a kernel on the GPU. Boyer gives measurements showing that calling a CPU function takes about 3.3ns, but launching an asynchronous CUDA kernel takes between 3.0 and 3.9$\mu$s – a thousand times longer. Lee et al. (2010) notes that “For GPUs, we found that global inter-thread synchronization is very costly, because it involves a kernel termination and new kernel call overhead from the host.”

This implies that:

the time that the GPU spends executing the kernel must be greater than the time taken to launch the kernel or the GPU will be idle for some time
if the CPU can execute the task in less time than the kernel takes to launch, we do not benefit from using the GPU at all

Using CUDA streams or CPU threads does not impact this restriction. If many threads are trying to launch a kernel at once, they enter a queue. Only one CPU-GPU operation (a memory copy or kernel launch) can be started at any time, even though many can be in-progress simultaneously.

With small datasets, the kernels do not take long to execute. Kernel launch time then becomes the main determinant of performance. The only way we can reduce launch time is to minimise the number of kernel launches.
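A back-of-envelope calculation shows the scale of this effect. The sketch below multiplies out figures quoted in this report: 100 EM iterations per fit, Boyer's 3.0 to 3.9 microseconds per asynchronous launch, and the nine launches per iteration that the library-based summations need for a two-component mixture. It ignores queueing, memory copies and kernel execution time, and Boyer's measurements were taken on different hardware, so treat the result as indicative only.

```python
def launch_overhead_ms(launches_per_iteration, iterations, launch_us):
    """Time spent purely on launching kernels for one fit, in milliseconds."""
    return launches_per_iteration * iterations * launch_us / 1000.0

low = launch_overhead_ms(9, 100, 3.0)    # optimistic launch cost
high = launch_overhead_ms(9, 100, 3.9)   # pessimistic launch cost
print(f"{low:.2f} ms to {high:.2f} ms per fit spent launching kernels")
```

Under these assumptions, launch overhead alone is of the order of milliseconds per fit, which is comparable to the fit times reported for the faster kernel designs in the table above.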

Fused single launch sum kernel

The algorithm requires $ 2+M $ summations per iteration. Using the standard CUB or Thrust libraries, two kernel launches are required per summation. For a two-component mixture that is eight launches for the summations which, together with the per-observation kernel, gives a total of nine kernel launches for each iteration.

To reduce the number of kernel launches required, a new kernel was developed with two important features: it completes an entire summation within a single kernel launch, and it can perform multiple summations in that same launch.

Sum reductions

Most sum kernels use a reduction tree, demonstrated on page 3 of Harris (2010). Rather than having a single thread step through each item and keeping a running total, the many cores of CUDA hardware are used. A large number of threads are launched, proportional to the number of items to be summed⁴. Each thread sums two adjacent items, a task which can be performed extremely quickly. Then, the adjacent items of those summations are summed, and so on until the last two items are summed. The effect of this is that the summation is performed in roughly $ O(\frac{N}{T}\log(N)) $ time, where $ T $ is the number of threads. The traditional ‘running sum’ algorithm operates in $ O(N) $ time, which is far slower on a GPU where $ T $ is large (hundreds or thousands).

For 2 million items, the kernel might be launched across 1 million threads. No existing GPU hardware has this many hardware execution units available, so the launch is split into blocks. Each block runs on the hardware in turn. The blocks are not resident in the GPU at the same time and cannot synchronise with each other. Therefore, multiple kernel launches are needed to achieve synchronisation (Harris, 2010, p. 4).

The downside to this algorithm is that each stage must run completely before the next starts, and there is no way to do this except for waiting for the kernel launch to complete. This implies that many kernel launches are required.

The new kernel solves this problem by adding a stage before the reduction tree. If $ T $ threads are launched, the $ t $th thread calculates

$$ \sum_{i=0}^{\left\lceil N/T\right\rceil }\begin{cases} x_{t+iT} & \text{if } t+iT<N\\ 0 & \text{otherwise} \end{cases} $$

In other words, it sums every $ T $th element into a $ T $-wide array. Then, the sum reduction can proceed as per normal. This is slower than the traditional algorithm, but it has the key advantage that global synchronisation is not required. The entire summation can execute with a single kernel launch.

$ T $ can be selected to be any number no larger than the maximum thread block size supported by the CUDA hardware. It is critical that all threads are executed simultaneously (i.e. the kernel must be launched with a grid size of 1).
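The two stages can be emulated sequentially in Python to show the data movement. This is a sketch of the idea rather than the CUDA kernel: each loop index stands in for a thread, and `T` is restricted to a power of two purely to keep the reduction tree simple (an assumption of this sketch). On the GPU, the passes of the tree would be separated by a block-level barrier (`__syncthreads()`), which is available because all threads belong to a single block.

```python
def single_block_sum(values, T=8):
    """Emulate the single-launch sum: strided pre-sum, then a reduction tree."""
    assert T > 0 and T & (T - 1) == 0, "T must be a power of two in this sketch"
    # Stage 1: 'thread' t sums elements t, t+T, t+2T, ... into partial[t].
    # Slices that start beyond the end of the data are empty and sum to 0.
    partial = [sum(values[t::T]) for t in range(T)]
    # Stage 2: reduction tree over the T partial sums.
    stride = T // 2
    while stride > 0:
        for t in range(stride):          # threads 0..stride-1 are active in this pass
            partial[t] += partial[t + stride]
        stride //= 2
    return partial[0]

total = single_block_sum(list(range(1, 101)), T=8)
```

Stage 1 does $ \lceil N/T\rceil $ sequential additions per thread, which is the price paid for avoiding global synchronisation.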

Kernel fusion

The M-step of EM requires three summations across $ N $-sized arrays. These three summations can be performed in a single kernel launch by providing the sum kernel with the details of all three arrays in a single launch, rather than performing three launches. This technique is called Vectored I/O in other contexts (Wikipedia, 2015d).
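A sequential Python sketch of the fused form follows. Each emulated thread walks its strided slice once and accumulates every array as it goes, so one pass (one launch, on the GPU) yields all the totals. The function name and the use of a plain `sum` for the final reduction are simplifications made for this illustration.

```python
def fused_sums(arrays, T=8):
    """Sum several equal-length arrays in a single pass; returns one total per array."""
    K = len(arrays)
    N = len(arrays[0])
    assert all(len(a) == N for a in arrays), "arrays must have equal length"
    partial = [[0.0] * T for _ in range(K)]
    for t in range(T):                  # each emulated thread...
        for i in range(t, N, T):        # ...walks its strided slice once...
            for k in range(K):          # ...accumulating every array as it goes
                partial[k][t] += arrays[k][i]
    # Final reduction over the T partial sums of each array (a reduction
    # tree on the GPU; a plain sum here for brevity).
    return [sum(p) for p in partial]

totals = fused_sums([[1, 2, 3, 4], [10, 20, 30, 40], [0.5] * 4], T=2)
```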

Footnotes

2 Source code for these can be browsed at https://github.com/ihowson/CUDA-Task-Pipeline

3 Profiling revealed a significant bottleneck in CUB when used in multithreaded applications. This bottleneck has been patched in https://github.com/ihowson/cub/commit/0c90360c9b9c397398a646d689ddd980aa5da811

4 This is a simplified explanation; the curious reader is encouraged to read Harris (2010)

References

JA Bilmes. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report TR-97-021, International Computer Science Institute, 1998.

M Boyer. CUDA kernel overhead. URL http://www.cs.virginia.edu/~mwb7w/cuda_support/kernel_overhead.html.

M Harris. Optimizing parallel reduction in CUDA, Mar 2010. URL http://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf.

G McLachlan and D Peel. Finite mixture models. John Wiley & Sons, Inc, 2000.

J Hoberock and N Bell. Thrust - parallel algorithms library, Mar 2015. URL http://thrust.github.io/.

MO Prates, CRB Cabral, and VH Lachos. mixsmsn: Fitting finite mixture of scale mixture of skew-normal distributions. Journal of Statistical Software, 54(12), August 2013.

NVIDIA Corporation. CUB, Apr 2015. URL http://nvlabs.github.io/cub/.

S Rennich. CUDA C/C++ streams and concurrency, 2011. URL http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf.

J Schepers. Improved random-starting method for the EM algorithm for finite mixtures of regressions. Behavior Research Methods, 47(1):134–146, Mar 2015.

S Tzeng, A Patney, and JD Owens. GPU task-parallelism: primitives and applications, 2012. URL http://on-demand.gputechconf.com/gtc/2012/presentations/S0138-GPU-Task-Parallelism-Primitives-and-Apps.pdf.

Wikipedia. Data parallelism, Dec 2014. URL http://en.wikipedia.org/wiki/Data_parallelism.

Wikipedia. Inverse gaussian distribution, February 2015a. URL http://en.wikipedia.org/wiki/Inverse_Gaussian_distribution.

Wikipedia. Task parallelism, Mar 2015c. URL http://en.wikipedia.org/wiki/Task_parallelism.

Wikipedia. Vectored I/O, February 2015d. URL http://en.wikipedia.org/wiki/Vectored_I/O.

C Woolley. GPU optimization fundamentals, 2013. URL https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf.

Results

ian@mutexlabs.com (Ian Howson) — Thu, 18 Jun 2015 00:00:00 +0000

Time to fit a single dataset

To test performance across varying dataset sizes, we sample from a two-component inverse Gaussian mixture model with known parameters. Only a single dataset is fit.

Dataset size	CPU time (seconds)	GPU time (seconds)	GPU speedup
100	0.00620	0.01576	0.39
1,000	0.06032	0.01572	3.84
10,000	0.67876	0.03924	17.30
100,000	6.35048	0.19740	32.17
1,000,000	67.98868	1.87952	36.17

On the test hardware, we see that the GPU is slower for small dataset sizes (100 samples) but outperforms the CPU for larger dataset sizes. For datasets with 1 million samples, the GPU runs around 36 times faster than the CPU.
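For readers who want to generate comparable test data, the sketch below draws samples from an inverse Gaussian mixture using the transformation method of Michael, Schucany and Haas. This is an assumption about how such data could be produced, not a description of the code used for these results, and the parameter values in the usage example are arbitrary rather than those used in the tests.

```python
import math
import random

def sample_ig(mu, lam, rng):
    """Draw one inverse Gaussian variate (Michael, Schucany and Haas method)."""
    y = rng.gauss(0.0, 1.0) ** 2
    x = (mu + (mu * mu * y) / (2 * lam)
         - (mu / (2 * lam)) * math.sqrt(4 * mu * lam * y + mu * mu * y * y))
    # Keep the smaller root with probability mu / (mu + x),
    # otherwise return the other root, mu^2 / x.
    if rng.random() <= mu / (mu + x):
        return x
    return mu * mu / x

def sample_mixture(n, alpha, mu, lam, seed=0):
    """Draw n observations from an inverse Gaussian mixture."""
    rng = random.Random(seed)
    components = list(range(len(alpha)))
    out = []
    for _ in range(n):
        l = rng.choices(components, weights=alpha)[0]  # choose a component
        out.append(sample_ig(mu[l], lam[l], rng))
    return out

# Arbitrary example parameters.
xs = sample_mixture(20000, alpha=[0.5, 0.5], mu=[1.0, 3.0], lam=[2.0, 6.0])
```

The sample mean should be close to $ \sum_{\ell}\alpha_{\ell}\mu_{\ell} $, which gives a simple check that the sampler behaves as expected.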

Time to fit many datasets

In this case, the dataset size is held constant (2000 samples) and we fit many datasets simultaneously, generating them in the same way as for the single dataset case.

Number of datasets	CPU time (seconds)	GPU time (seconds)	GPU speedup
1	0.10452	0.02092	5.00
10	1.12048	0.05364	20.89
100	10.12788	0.35904	28.21
1000	102.40840	3.42036	29.94

We see similar results – the ratio of GPU-CPU performance increases as the number of datasets increases. When 1000 datasets of 2000 samples are being fit simultaneously, the GPU runs around 30 times as fast as the CPU.

Multiple datasets on EC2

Comparing performance of CPUs vs. GPUs is somewhat unsound; there is no obvious way to say “this CPU is equivalent to this GPU”. Most papers, including this one, compare performance using whatever hardware the author had available at the time (Gillespie, 2011). No effort was made to optimise the CPU implementation, while significant time was spent optimising the GPU implementation, an issue discussed in depth by Lee et al. (2010).

Fortunately, services such as Amazon EC2 (Amazon Web Services) provide an alternative way to compare the CPU and GPU approaches: cost of rental. For a given price, one will be able to rent a certain amount of hardware which will perform the desired computations in an amount of time. Both CPU and GPU time can be rented. A fairer way to compare the two technologies is the cost to perform your computation.

A summary of the machine configurations is available at Amazon Web Services (2015). For reference, ECUs are a measure of allocated CPU capacity. The US East region was selected as it is generally the lowest priced.

For the CPU implementation, we selected a c4.large instance as they provide the best price-performance ratio at the time of writing (eight ECUs and two CPU cores at USD$0.116/hour as of 2015-06-09). t2 instances are not suitable as they provide ‘burstable’ CPU performance; they are not intended for long-running jobs. This machine has two CPU cores, but the R implementation will only use one. As there are no dependencies between datasets, we will assume that additional CPU cores will provide a linear speedup (that is, with appropriate software, we could obtain double the performance with double the CPU cores). The rationale for this is explored further in the linear speedup assumption. Also note that pricing for c4 instances is close to constant per CPU core and ECU allocation; the cost-to-fit ought to remain constant regardless of instance choice.

Name	Number of CPU cores	ECU allocation	Price per hour (USD)
c4.large	2	8	0.116
c4.xlarge	4	16	0.232
c4.2xlarge	8	31	0.464
c4.4xlarge	16	62	0.928
c4.8xlarge	36	132	1.856

For the GPU implementation, we chose a g2.2xlarge instance at USD$0.650/hour. Rephrasing this in terms of speedup ratios, the GPU implementation must achieve a $ 0.65/(0.116/2)\approx 11.2 $x speedup over a single CPU core in order to break even on cost.

As before, all datasets contain 2000 randomly generated samples.

Datasets (D)	CPU time (seconds)	GPU time (seconds)	CPU cost (USD x 10^-6)	GPU cost (USD x 10^-6)
1	0.08912	0.01708	1.44	3.08
10	0.87360	0.06940	14.07	12.53
100	9.17784	0.63868	147.86	115.32
1000	84.44992	6.39072	1360.58	1153.88

From this, we can see that the GPU implementation is slightly more cost-effective than the CPU implementation for larger problems. The difference is not large and could probably be eliminated altogether with some optimisation work on the CPU implementation.

These price differences may seem trivial (who cares about microcents?) but recall that use cases may include many more datasets (tens of thousands of datasets is the intended use case) and require random initialisation to achieve a good fit (100 random initialisations means 100 times as much work, and therefore cost). For 40,000 datasets and 100 random initialisations, the cost is around USD$5.44 using the CPU implementation and USD$4.62 using the GPU implementation.
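The cost figures can be reproduced with a few lines of arithmetic. The sketch below uses the prices and the 1000-dataset timings quoted above, charges the CPU implementation for one of the two c4.large cores (following the linear speedup assumption), and assumes that cost scales linearly with the amount of work.

```python
SECONDS_PER_HOUR = 3600

def cost_usd(seconds, price_per_hour):
    """Rental cost of running for `seconds` at a given hourly price."""
    return seconds * price_per_hour / SECONDS_PER_HOUR

CPU_PRICE = 0.116 / 2   # c4.large price per core; the R code uses one core
GPU_PRICE = 0.650       # GPU instance price per hour

break_even = GPU_PRICE / CPU_PRICE

# Timings for 1000 datasets, taken from the table above.
cpu_1000 = cost_usd(84.44992, CPU_PRICE)
gpu_1000 = cost_usd(6.39072, GPU_PRICE)

# Scale to 40,000 datasets with 100 random initialisations each.
scale = (40_000 / 1000) * 100

print(f"break-even speedup: {break_even:.1f}x")
print(f"CPU: USD${cpu_1000 * scale:.2f}, GPU: USD${gpu_1000 * scale:.2f}")
```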

For the large dataset test, we obtain the following results:

Samples (N)	CPU time (seconds)	GPU time (seconds)	CPU cost (USD x 10^-6)	GPU cost (USD x 10^-6)
100	0.00724	0.01616	0.12	2.92
1,000	0.05264	0.02032	0.85	3.67
10,000	0.57628	0.03264	9.28	5.89
100,000	6.03700	0.22568	97.26	40.75
1,000,000	50.11764	2.25400	807.45	406.97

For sufficiently large problems, the GPU instances can perform the model fits at roughly half the price.

References

Amazon Web Services. Amazon EC2. URL http://aws.amazon.com/ec2/.

Amazon Web Services. Amazon EC2 pricing, 2015. URL http://aws.amazon.com/ec2/pricing/.

C Gillespie. Reviewing a paper that uses GPUs, July 2011. URL https://csgillespie.wordpress.com/2011/07/12/how-to-review-a-gpu-statistics-paper/.

VW Lee, C Kim, J Chhugani, M Deisher, D Kim, AD Nguyen, N Satish, M Smelyanskiy, S Chennupaty, P Hammarlund, R Singhal, and P Dubey. Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In ISCA ’10 Proceedings of the 37th Annual International Symposium on Computer Architecture, pages 451–460. ACM, 2010.

Background

ian@mutexlabs.com (Ian Howson) — Thu, 18 Jun 2015 00:00:00 +0000

The inverse Gaussian distribution

The inverse Gaussian distribution is an exponential-family probability distribution with the density function:

$$ f(x)=\left[\frac{\lambda}{2\pi x^{3}}\right]^{1/2}\exp\frac{-\lambda(x-\mu)^{2}}{2\mu^{2}x} $$

for $ x>0 $, mean $ \mu>0 $ and shape $ \lambda>0 $ (Seshadri, 1993, p. 1).

The expectation-maximisation algorithm

The EM algorithm (Dempster et al., 1977) iteratively refines a maximum likelihood estimate in the presence of missing data. Here, we use it to fit mixture models, as described in Bilmes (1998). Two characteristics are of note:

As it is an iterative algorithm, the execution time can be quite long. Hence our desire to find a faster way to perform model fitting.
EM is sensitive to the choice of initial parameters (Aitkin et al., 1980, p. 327). Therefore, we must think about how best to select them.

GPUs and CUDA

A GPU is a component of a computer that accelerates graphics operations. All modern computers and mobile devices include a GPU. GPUs can be programmed to perform general-purpose computations in the same way as a CPU; this is referred to as ‘general purpose GPU computing’.

GPUs are similar to CPUs in that they run user-defined software. The key difference is that while a CPU generally has a small number of execution units (two or four are common for consumer hardware), a GPU may have thousands of execution units operating in parallel. The CPU is optimised for serial operations – performing a sequence of instructions as quickly as possible. The GPU is optimised for parallel operations where the same instructions are performed many times on different data. Each execution unit of the GPU is simple and more restricted, but there are many more of them. The peak computational output (measured in instructions per second) is far greater than for a CPU. Redesigning the problem to take advantage of this structure is the major challenge of the programmer working on a GPU computing problem.

There are two major standards for GPU computing: OpenCL and CUDA. OpenCL is supported by most GPU vendors. CUDA is only supported by NVIDIA hardware but it is a mature standard with excellent tools and documentation. We will only consider CUDA from this point on.

There are significant barriers to widespread adoption of GPU computing.

Not everyone has access to ‘large’ GPU hardware. Most CPUs now ship with on-board GPUs which are sufficient for graphics, but do not provide significant performance advantages in the GPU computing context.
The effort required to port a given algorithm to a GPU is large. Programming GPUs is significantly more difficult than programming CPUs.
CPUs will always be the first to get any new algorithm.
Most problems do not require a large amount of computing power. There is no incentive to speed up something that is already fast.
Not all algorithms run faster when executed on a GPU. For an algorithm to be a good candidate for GPU execution, it must generally:
- require many iterations (thousands), each of which is independent of the others
- require a large amount of computation time relative to the amount of memory access
For many problems and institutions, a cluster of general purpose PCs is a better fit. It has the advantages of running all available software, requiring minimal rework and being ‘familiar’ (programming a cluster is very similar to programming a single PC).

References

M Aitkin and GT Wilson. Mixture models, outliers, and the EM algorithm. Technometrics, 22(3): 325–331, Aug 1980.

AP Dempster, NM Laird, and DB Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.

V Seshadri. The inverse gaussian distribution: a case study in exponential families. Oxford Science Publications, 1993.

Introduction

ian@mutexlabs.com (Ian Howson) — Thu, 18 Jun 2015 00:00:00 +0000

We wish to fit mixture models to a large number of datasets. We assume that an appropriate model is a two-component mixture of inverse Gaussian distributions. The components of the data are not well separated. The use case is roughly 40,000 datasets of 2,000 observations each.

A natural way to estimate the mixture parameters is to use the Expectation Maximisation algorithm. However, two problems need to be overcome:

Because of the large amount of data, most software implementations of the EM algorithm take a long time to execute
Because the EM algorithm is sensitive to the selection of initial parameters, many attempts must be made to fit any given dataset. This makes the fitting process even slower.
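As a rough illustration of the approach (not the bulkem implementation, which runs on CUDA hardware), the following pure-Python sketch fits a two-component inverse Gaussian mixture with EM and repeats the fit from several random starting points, keeping the result with the highest log-likelihood. The function names, the number of restarts and the ranges used to draw initial parameters are all assumptions made for this example.

```python
import math
import random


def ig_pdf(x, mu, lam):
    """Inverse Gaussian density with mean mu and shape lam, for x > 0."""
    return math.sqrt(lam / (2.0 * math.pi * x ** 3)) * math.exp(
        -lam * (x - mu) ** 2 / (2.0 * mu ** 2 * x))


def log_likelihood(xs, params):
    """Log-likelihood of xs under a mixture given as (weight, mu, lam) tuples."""
    return sum(
        math.log(sum(w * ig_pdf(x, mu, lam) for w, mu, lam in params))
        for x in xs)


def em_fit(xs, params, iterations=30):
    """Run a fixed number of EM iterations from one starting point."""
    n = len(xs)
    for _ in range(iterations):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in xs:
            dens = [w * ig_pdf(x, mu, lam) for w, mu, lam in params]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: weighted maximum-likelihood estimates for each component.
        updated = []
        for m in range(len(params)):
            r = [row[m] for row in resp]
            r_sum = sum(r)
            mu = sum(ri * x for ri, x in zip(r, xs)) / r_sum
            inv_lam = sum(ri * (1.0 / x - 1.0 / mu)
                          for ri, x in zip(r, xs)) / r_sum
            updated.append((r_sum / n, mu, 1.0 / inv_lam))
        params = updated
    return params


def fit_with_restarts(xs, restarts=5, seed=0):
    """Fit from several random initial parameter sets and keep the best."""
    rng = random.Random(seed)
    mean = sum(xs) / len(xs)
    best = None
    for _ in range(restarts):
        # Hypothetical initialisation: means scattered around the sample
        # mean, shapes drawn from an arbitrary range.
        init = [(0.5, mean * rng.uniform(0.5, 1.5), rng.uniform(0.5, 5.0))
                for _ in range(2)]
        params = em_fit(xs, init)
        ll = log_likelihood(xs, params)
        if best is None or ll > best[1]:
            best = (params, ll)
    return best
```

Every restart repeats the full E-step and M-step over all observations, which is why the cost of both problems above multiplies: the time per fit scales with the size of the dataset, and the number of fits scales with the number of starting points tried.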

One way to reduce the time needed to generate the models is to use CUDA hardware. CUDA uses graphics processing units (GPUs) found in many computers to perform general-purpose computation. The software used to perform model fitting must be customised to suit CUDA.

This report describes the design and development of bulkem¹, an R package which fits mixture models using CUDA hardware. Using CUDA hardware, bulkem can fit a large number of small datasets around thirty times faster than a conventional CPU. It can fit very large datasets around 36 times faster than a conventional CPU.

Conventions

The following variables are used throughout this report:

$N$ : the number of elements in an array or the number of observations in a dataset
$M$ : the number of components in the mixture model being fit
$T$ : the number of threads being launched
$D$ : the number of datasets

A number of performance measurements are quoted. Unless otherwise specified, those measurements were performed on an Intel i5-4460 (quad-core 3.2GHz) CPU with an NVIDIA GeForce GTX 660. The machine is running OS X Yosemite, CUDA Toolkit 6.5 and R 3.1.2.

Footnotes

1. The bulkem source code is available at https://github.com/ihowson/bulkem

Discussion and conclusion

ian@mutexlabs.com (Ian Howson) — Thu, 18 Jun 2015 00:00:00 +0000

The linear speedup assumption

Fitting models to independent datasets is an embarrassingly parallel (Wikipedia, 2015) problem. The datasets have no dependence on each other and can be fitted separately.

This implies that, in an ideal world:

If we have enough datasets, we can expect the performance figures quoted here to extrapolate linearly (i.e. if fitting 1000 datasets takes 10 seconds, we can expect that fitting 2000 datasets will take around 20 seconds)
If we have more hardware to parallelise across (e.g. more CPU cores, more computers with or without GPUs) we can expect them to reduce the computation time proportionally to the amount of resources added
Assuming that EC2 has an unlimited supply of hardware for us to rent, we can perform very large model fits in an arbitrarily small amount of time with the same total cost. Renting twice as much hardware will halve our model fit time, so the total expenditure remains the same.

Of course, in practice things are not so ideal:

EC2 bills per-hour, so we cannot reduce execution time for extremely large jobs below an hour without incurring additional costs
EC2 instances take some time to boot, so there is a cost to using large numbers of instances
Large clusters of machines incur some overhead for communication
Very small tasks have fixed overheads. We saw this in the GPU results, where there was almost no time difference between a 100-sample dataset and a 1000-sample dataset.
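The trade-off between the ideal case and per-hour billing can be sketched with a little arithmetic. This is a minimal example that assumes perfect linear speedup and billing rounded up to whole hours per instance; the function name and the price used in the usage note are hypothetical, and boot time and communication overhead are ignored.

```python
import math


def rental_plan(machine_hours, instances, price_per_hour):
    """Return (wall_clock_hours, total_cost) for an evenly divided job.

    machine_hours is the total work measured on a single instance.
    Each instance is billed for a whole number of hours, rounded up.
    """
    wall_clock = machine_hours / instances
    billed_hours = math.ceil(wall_clock) * instances
    return wall_clock, billed_hours * price_per_hour
```

For a job needing 8 machine-hours at a hypothetical price of 1.0 per hour, 1, 2 or 8 instances all cost 8.0 while the wall-clock time drops from 8 hours to 1 hour. Going to 16 instances halves the time again to 0.5 hours but doubles the cost to 16.0, because each instance is still billed for a full hour.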

Using R, the foreach package (Revolution Analytics and Weston, 2014) makes it easy to parallelise code across cores within a computer. The snow package (Tierney et al., 2013) is suitable for use across networked clusters of computers.

Future work

Improving GPU performance

Kernel launch time is still the limiting factor, so further reducing the number of kernel launches is the natural way to improve performance. Ideally, the entire EM algorithm (across multiple iterations) could be moved to the GPU using similar techniques to lp_sum to handle datasets larger than the thread block size.

On the author’s hardware, compute occupancy is around 60%, so we could expect, at most, another $\frac{1}{0.6} - 1 \approx 67\%$ performance gain.

At that stage, GPU performance would likely be the limiting factor. The current implementation of the lp_sum summation is not very efficient, but it would be prudent to check for hotspots with a profiler before investing further development time.

Finally, while running the GPU software, the CPUs have significant idle time. With additional software support, one could perform additional model fits using that idle CPU capacity, improving performance-per-dollar further.

Improving CPU performance

The obvious way to improve performance of the CPU implementation is to take advantage of additional CPU cores. The easiest way to achieve this is with the foreach R package, which can run arbitrary R code across any number of CPU threads. Running the R code under a profiler ought to reveal hotspots which can guide optimisation of the R code. Finally, rewriting the R implementation in C might provide further improvement.

New functionality

Modifying bulkem to fit Normal mixture models would be fairly straightforward and very useful; Normal models are far more common than inverse Gaussian.

Conclusion

This report describes bulkem, an R package which fits inverse Gaussian mixture models using the EM algorithm. It has demonstrated that GPUs can provide significant performance and cost advantages over CPUs in this application. Unlike most GPU computing packages, bulkem offers significant performance improvement even on small datasets. Several directions for further improving the bulkem algorithms have been identified.

References

Revolution Analytics and S Weston. foreach: foreach looping construct for R, Apr 2014. URL http://cran.r-project.org/web/packages/foreach/index.html.

Wikipedia. Embarrassingly parallel, Mar 2015. URL http://en.wikipedia.org/wiki/Embarrassingly_parallel.

L Tierney, AJ Rossini, N Li, and H Sevcikova. snow: simple network of workstations, Sep 2013. URL http://cran.r-project.org/web/packages/snow/index.html.

Dell P2715Q review

ian@mutexlabs.com (Ian Howson) — Mon, 05 Jan 2015 00:00:00 +0000

Summary

Buy it.

Buy it and a new camera and the biggest graphics card you can find, because you’ll need them to take advantage of the beautiful panel.

Do your research to make sure it will work with your computer. My MacBook didn’t work at full resolution, despite being on Apple’s 4K compatibility list. (Wait, no, they’ve removed it. MacBook 13” Retina before 2015 definitely doesn’t work at 60Hz.)

The Good

It’s an affordable 4K IPS monitor. What’s not to like?

I paid AUD$747 with a 15% discount coupon – easy to find if you Google a little bit. It’s cheaper again in the US.
It’s 4K (3840x2160) and looks absolutely stunning. Once you go Retina, you never go back.
- Now you can see all of the flaws in your photos. Time to buy a new camera.
- Practically all of the downloadable wallpaper labelled 4K is actually resized from something lower – it’s noticeably blurry.
IPS panel means that you don’t get colour or brightness shift as you change viewing angle. It always looks accurate.
The display rotates 90° onto its side so you can easily adjust the cables.
Built-in USB3 hub.
Ships with a mini-DP to DisplayPort cable. This is a great choice. Laptop users can use mini-DP on the laptop to DisplayPort on the display; desktop users can use DisplayPort on the computer to mini-DP on the display.
- Out of the box, this worked perfectly with OS X Mavericks and a Geforce GTX 660 at 60Hz.
- It didn’t work with a 2014 Retina MacBook Pro.
OS X’s displays thing lets you choose the native 3840x2160, which is very usable if you have good eyesight. I’m mostly running at scaled 2560x1440 equivalent. Scaling doesn’t seem to hurt performance at all on the GTX 660.
Power consumption is about 30W, which is half of the monitor it’s replacing (a Dell 2407WFP).
The panel has a matte finish, but it doesn’t get in the way – some high-resolution matte displays look shimmery.
Games look amazing, but the GTX 660 isn’t really up to the task.

The Bad

Sometimes the display wakes up at 2560x1440 instead of 3840x2160. Turning it off and on again fixes this, but it also turns off any connected USB devices, so don’t do that if you have a hard drive connected.
There’s no more speaker power connector, which I used to power my DAC.
The display locks hard occasionally and needs to be power-cycled (unplug power cable).

The MacBook

I have a mid-2014 13” Retina MacBook Pro. This machine was on Apple’s 4K compatibility list but has since been removed.

Using the DisplayPort interface, it syncs at up to 2560x1440.

Using the HDMI interface, it syncs at 3840x2160, but only at 30Hz. This is mostly OK.

Oddly, using HDMI, the list of offered scaling options is different to my Mac Pro using DisplayPort. Using the Mac Pro, I get a ‘looks like 2560x1440’ option, which is my preference. Using the MacBook, the only options are 3840x2160 (native), 1920x1080 (native HiDPI), 1504x846 (scaled HiDPI) and 1152x648 (scaled HiDPI).

Fortunately I didn’t buy this monitor to use with the MacBook, so I’m not at all bothered.

The imperfect, but inconsequential

The rotating stand needs a little fiddling to sit level.
The OSD is still HD, not QHD (it pixel doubles)
16:9 ratio, but we lost that battle a long time ago
There are two DisplayPort ports but only one appears in the OSD and only one seems to work (the one near the power cable). No idea why.

Common Production Tasks

ian@mutexlabs.com (Ian Howson) — Mon, 10 Nov 2014 00:00:00 +0000

Also see https://github.com/edx/configuration/wiki/edX-Managing-the-Production-Stack#updating-versions-using-edx-repos

Install updates

As root:

/edx/bin/update edx-platform release

Upgrade procedure

put up a ‘down for maintenance’ message
make sure the original server is accessible to you (public must not be able to make changes); might need to close any existing connections
take a snapshot using LXC (have to take down the server to do this); also verify that the snapshots can be restored if things go badly
perform the upgrades
verify that upgrades are working correctly
remove the ‘down for maintenance’ message
LATER: remove any snapshots once you’re sure that they’re not needed
- List LXC snapshots with lxc-snapshot -L -n <container name>

Update database tables

On Devstack:

paver update_db -s devstack

In production:

ubuntu@edxprod:/edx/app/edxapp/edx-platform$ sudo -u www-data /edx/bin/python.edxapp ./manage.py lms migrate --settings=aws

Configuration

ian@mutexlabs.com (Ian Howson) — Mon, 10 Nov 2014 00:00:00 +0000

After installation, there are a lot of settings that you’ll need to tweak to suit your situation. There’s a fast way to do this and a ‘correct’ way to do this.

Underneath, there’s one master config file (server-vars.yml) which generates config files for each of the components.

There’s a list of potential config variables here: http://iambusychangingtheworld.blogspot.com.au/2014/05/edx-platform-server-varsyaml-variables.html

The quick way

Modify the config files in /edx/app/edxapp/ directly. lms.env.json and cms.env.json contain the most useful variables.

The issue with this method is that your changes could be overwritten during an upgrade, so you’ll need to reapply them manually. The upside is that you can try things out relatively quickly, which is nice when you’re experimenting.

After you modify a config file, you’ll need to restart the relevant service using supervisorctl. Usually, this means:

/edx/bin/supervisorctl restart edxapp:  # Run as root. Note the trailing colon.
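If you make the same quick edits repeatedly (say, after every upgrade overwrites them), a small script can reapply them. This is only a sketch: it assumes the settings you care about are top-level keys in the JSON file, the function name is my own, and the values shown in the usage comment are placeholders.

```python
import json


def update_env_json(path, changes):
    """Apply `changes` (a dict) to the JSON config at `path`.

    Returns the previous values of the changed keys (None if a key was
    absent) so you can see what was overwritten.
    """
    with open(path) as f:
        config = json.load(f)
    previous = {key: config.get(key) for key in changes}
    config.update(changes)
    with open(path, "w") as f:
        json.dump(config, f, indent=4, sort_keys=True)
        f.write("\n")
    return previous


# Example (placeholder values; run as a user that can write the file):
# update_env_json("/edx/app/edxapp/lms.env.json",
#                 {"SITE_NAME": "courses.example.com",
#                  "TIME_ZONE": "Australia/Sydney"})
```

You still need to restart the service afterwards for the change to take effect.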

The right way

This method is more tolerant of server upgrades. Everything is stored in source control so it can be quickly deployed later.

Create /edx/app/edx_ansible/server-vars.yml
- But which script creates this file from source control? Is it from the configuration repo?
Run ansible to generate the service config files

Suggested settings

You should almost certainly change the following settings:

CODE_JAIL/python_bin: I make this something that isn’t real. I don’t run programming MOOCs and do use LXC (which doesn’t have AppArmor support) and so want to hobble the sandbox as much as possible for security purposes.

Various email addresses

The Facebook address

Twitter address

SITE_NAME (most stuff works without it, but occasionally you’ll get broken links/IP addresses showing through)

TIME_ZONE

BitTorrent Sync

ian@mutexlabs.com (Ian Howson) — Wed, 01 Oct 2014 00:00:00 +0000

Once upon a time, before Dropbox was a thing, I had a big desktop and a small laptop. I thought, wouldn’t it be wonderful if my files could be the same on both sides? And if it could do that automatically, without me asking. So I started writing a program called SyncDroid (no relation to the Android app) to do file-sync-over-LAN. I wrote some blog posts to explain my thinking.

Time passed, and I got busy with other things. I ceased having a consistent desk and became completely dependent on my laptop, which got much bigger to compensate for having to do everything and at the same time only slightly physically heavier thanks to the wonders of technology.

Time passed some more, and I find myself at the same desk for three days of the week, with a really nice desktop and another very nice but just-not-quite-as-beefy laptop. Dropbox is not a good option due to the proud tradition of Crap Australian Internet, and besides, security and cloud services do not mix. (Yes, I am aware of SpiderOak. No, I will not use it until I can audit it and compile it myself.)

So BitTorrent Sync is a thing, which is basically what I dreamed of when I started SyncDroid. Zero-interaction LAN file sync between machines. No dependency on Internet services. Free. Sold.

File sync is a really tricky problem. It cannot be fully and correctly solved without massively overhauling how applications deal with data. (The Cloud helps a lot. It is not the complete solution.) Therefore, I have some advice on how to make BitTorrent Sync work without too much pain or unexpected data loss.

When you’re starting out with say, your Documents folder, don’t try to sync two complete versions of the folder. You’ll end up with all files from both sides on both sides, and/or a bunch of conflicts, where you just expected nothing to happen. Much better to delete all of one side and sync it across. It is slow, sadly, but you only have to do it once.

BitTorrent Sync has a ‘relay service’. If two machines are not on the same LAN but do have Internet access, they can talk through the relay service.

I live in the Land Down Under with Slow Internet, so relaying through servers in the US is too slow to be useful. In each folder’s preferences (on every single peer) you need to unclick ‘Use tracker server’, ‘Search DHT network’ and ‘Use relay server when required’. (Try version 1.4.83 if the setting isn’t working.)

Checksumming and transferring large files (like VM images) takes a long time. There is still room for someone to make an efficient VM synchronisation system. It might be impossible to make it ‘nice’, but you could at least provide snapshots or something rather than leaving one side corrupted most of the time. Parallels might do this by accident, but I’m not willing to risk my data to find out.

When you’re setting up to start with, I found it easiest to copy the Share URLs into a file in Dropbox and copy-paste them into BTS on the receiving end.

DO NOT copy the keys into Dropbox if you worry about the NSA reading your data. Those keys don’t expire and give access to your data. Dropbox keeps snapshots of everything and the NSA works with Dropbox. Of course, we can’t audit BTS anyway, so probably best to keep government secrets locked away a little more securely.

Pay attention to the ‘Store deleted files in folder archive’ setting. Definitely turn it off for VM folders.

Deleted or modified files go into an archive folder (.sync/Archive under the synced folder root). I’ve seen references to a 30 day cleanup period, but am yet to confirm this.

Don’t use BTS to sync your Dropbox folder between machines unless the Dropbox client is only running on one of them. They’ll confuse each other. Dropbox already does LAN sync.

Rather than having many shares, you can store the canonical copy of each folder in a synced folder and then symlink it to where you want it to appear.

Sync isn’t necessarily between two machines. You can sync three or more machines to the same folder.

This would be awesome if you had, say, office workers who need disconnected access to a shared folder. You can then disconnect a machine, keep the local copy, modify it and have your changes sync when you reconnect. This might cut down your need for corporate fileservers and VPNs.

If you’re using a Mac, you might want to prevent BTS from syncing .FinderInfo and .ResourceFork files. As of October 2014 (version 1.4.83) they fail to sync but BTS can’t figure out why, causing your folders to perpetually be out of sync. Add the following to the end of .sync/IgnoreList in your folder:

*.FinderInfo
*.ResourceFork

Hopefully you don’t actually need the resource fork for any of your files. Does OS X actually use the resource fork these days?

I had to disconnect the folder from each peer and reconnect it. The peers remember all of the FinderInfo files that they’re meant to be ignoring. Disconnecting forces BTS to start over without the FinderInfo files. Sometimes you can just disconnect and reconnect one peer (usually the one that started with all of the data).

This also highlights a nuisance in the system: configuration is not synced. You need to do this on every single machine. Did I mention that file sync is nontrivial?
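Since the ignore list has to be edited by hand on every peer, a script that you can run on each machine takes some of the tedium out. Here is a minimal sketch that appends the two patterns above to .sync/IgnoreList if they are not already there. It assumes the IgnoreList file already exists in the synced folder; the function name is my own invention.

```python
import os

PATTERNS = ["*.FinderInfo", "*.ResourceFork"]


def add_ignore_patterns(folder, patterns=PATTERNS):
    """Append any missing patterns to <folder>/.sync/IgnoreList.

    Returns the list of patterns that were actually added, so running
    it a second time returns an empty list and changes nothing.
    """
    path = os.path.join(folder, ".sync", "IgnoreList")
    with open(path) as f:
        text = f.read()
    present = {line.strip() for line in text.splitlines()}
    missing = [p for p in patterns if p not in present]
    if missing:
        with open(path, "a") as f:
            # Start on a fresh line if the file lacks a trailing newline.
            if text and not text.endswith("\n"):
                f.write("\n")
            f.write("\n".join(missing) + "\n")
    return missing
```

This only edits the file. You may still have to disconnect and reconnect the folder on each peer, as described above.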

.SyncIgnore is not a thing any more. It is now called .sync/IgnoreList. The format and use of the file is the same, but .SyncIgnore no longer works.

This is a bit of a shame, really, because (I never tested this, but…) .SyncIgnore could get synced automatically, saving you from manually making the same config changes on each host. Perhaps it caused flapping and conflicts.

There is useful logging in ~/Library/Application Support/BitTorrent Sync/sync.log. You don’t need to turn on debug logging in the menu (it’s extremely verbose).

Sleep mode might be affected on Macs? My MacBook seems to run down the battery while it’s supposed to be sleeping. And it’s definitely awake some of the time – sync continues while it’s asleep. Hopefully it doesn’t do this while it’s disconnected from the network (i.e. in my bag, away from home).

Parallels virtual machines take forever to sync after modification and burn a lot of CPU power.

Hopefully you realised this already, but never boot the same VM image on two different machines at once. BTS will faithfully propagate the changes to the other machines, which will be unaware that their disk images are changing underneath them, and you’ll probably end up with corrupt, unusable VMs everywhere.

I still can’t recommend BTS for synchronising virtual machines. Hashing a 60GB image just takes too long. To reduce the time, you can:

Split the VM image into 2GB chunks
- These should take under 20 seconds to hash, though it’s still a long time for one file; many might change
Take snapshots when the VM is stable (i.e. will not change too much more)
- The snapshots will take a while to sync, but they shouldn’t change much afterwards. The diff-since-snapshot should be relatively small and easy to synchronise
Put data files in the host filesystem (i.e. outside the VM) wherever possible
- This is a good strategy if you use Time Machine or other backups, too. An entire VM image is difficult to back up efficiently for the same reason that it’s difficult to synchronise. Should the VM be corrupted, your data is still intact if you kept it outside the VM. The data is also relatively small and easy to synchronise.

I also see relatively slow sync speeds (3-6MB/sec, even though the network will easily do ten times that). That’s a different avenue that I should explore.

I am tempted to write a VM-specific sync application that solves these problems, but it’s very likely that I can do no better anyway. If you have some spare time and want to try it, I suggest:

When scanning for changes in a large image, skim across the whole at say, 1k or 4k intervals and just check a byte at a time. I’m hoping that this will let you compute a hash while still detecting the area that changes occur in fairly quickly. (This might not actually help, as your disk will still need to retrieve all of the data anyway; perhaps try larger intervals like 1M or 20M.)
Transfer data faster. I don’t know why BTS is so slow for me. Perhaps it’s to keep the machine load low, or perhaps it’s a bug.
Changes should be in-place as there’s usually not enough disk space (or time) to create a duplicate file and copy it atomically.
Put in some application-specific knowledge, such as
- taking advantage of Parallels’ snapshots to transfer less data and maintain correctness of data if the sync is not complete
- lock the VM image so it cannot be modified if it is inconsistent

Underneath it all, most filesystems (a) do not track changes within a file, and (b) do not checksum files. If you were to put your VM images on say, a ZFS volume, changes to them could be synchronised very quickly and efficiently (seconds, instead of hours) simply because the filesystem already keeps the hashes and diffs that are needed for the synchronisation app to do its job. Without that information the app must scan through the (extremely large) VM images to find the (relatively small) changes that it should propagate.
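To make the scanning cost concrete, here is a minimal sketch of what a sync app has to do without filesystem help: hash the image in fixed-size chunks and compare the digests against the previous scan to find the regions that changed. This is my own illustration, not how BTS works internally. The chunk size and function names are assumptions.

```python
import hashlib


def chunk_digests(path, chunk_size=4 * 1024 * 1024):
    """Return one SHA-256 hex digest per fixed-size chunk of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            digests.append(hashlib.sha256(block).hexdigest())
    return digests


def changed_chunks(old, new):
    """Indices of chunks that differ, including any added or removed at the end."""
    longest = max(len(old), len(new))
    return [i for i in range(longest)
            if i >= len(old) or i >= len(new) or old[i] != new[i]]
```

Only the chunks listed by changed_chunks need to be transferred, but note that chunk_digests still reads every byte of the image on every scan. That full read is the slow part, and it is exactly the work that a checksumming filesystem has already done for you.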

If you restore a peer from Time Machine, things seem to go screwy. By ‘screwy’, I mean:

Files on all peers reverted to the time of the Time Machine backup (very bad!)
Failure to resync (FinderInfo issues resurface)

If you’re going to restore a peer from Time Machine, I would suggest removing any synced folders from it altogether and resyncing them from the other peers.

Notes on specific applications

Office for Mac 2011

Outlook for Mac 2011 stores its data files in Documents, so if you sync that, rename each machine’s identity so they don’t conflict. (You’ll get duplicate emails and error messages.)
Shut down all Office applications
Go to Documents/Microsoft User Data/Office 2011 Identities and rename ‘Main Identity’ (or whatever you use) to something else; I use ‘Main Identity ‘
Open Microsoft Database Utility, click your renamed identity, click the gear icon and click ‘Set as Default’

Custom Theme

ian@mutexlabs.com (Ian Howson) — Thu, 25 Sep 2014 00:00:00 +0000

Theming

See here: https://github.com/edx/edx-platform/wiki/Developing-on-the-edX-Developer-Stack#configuring-themes-in-devstack

When you go to rename the .scss file, leave the leading underscore in place or you’ll get:

error /edx/app/edxapp/themes/usbs/static/sass/example.scss (Line 47: Undefined variable: "$sans-serif".)

i.e. rename it to _example.scss, not example.scss.

I also had issues if the LMS was running at the same time I was trying to update (Line 47: Undefined variable: "$sans-serif".). Things worked much better if I shut it down first.

To run the custom theme, you need to start the LMS with

paver devstack lms

This will recompile everything at startup. The easiest way I’ve found to do theme development is to just Ctrl-C the paver process and restart it when I change something.

Then, to deploy the theme on your production server:

TODO

/edx/bin/supervisorctl -c /edx/etc/supervisord.conf restart edxapp:
# Note the colon at the end!

Deploying themes

The default reference is at https://github.com/edx/edx-platform/wiki/Custom-Theming. It mostly worked, but I had issues with the SSH key; /edx/app/edxapp/tmp_id_rsa was being zeroed.

I could not find any reference to EDXAPP_LOCAL_GIT_IDENTITY in the edX code. The relevant Ansible playbook uses the content directive, so… I guess we put the private key directly in server-vars.yml. Don’t do this for a key with read-write access to anything!

Here’s my working server-vars.yml, sans key. Note the indentation before the contents of the key.

edxapp_use_custom_theme: true
edxapp_theme_name: 'themename'
edxapp_theme_source_repo: 'git@bitbucket.org:username/themename-edx-theme.git'
edxapp_theme_version: 'HEAD'
edxapp_git_identity: '/edx/app/edxapp/tmp_id_rsa'
EDXAPP_GIT_IDENTITY: |
  -----BEGIN RSA PRIVATE KEY-----
  MII...
  ...
  ...Wg
  -----END RSA PRIVATE KEY-----
EDXAPP_USE_GIT_IDENTITY: true

Getting Started

ian@mutexlabs.com (Ian Howson) — Thu, 25 Sep 2014 00:00:00 +0000

I know nothing about edX and want an instance to start playing with right away. What’s the easiest thing to do?

Set up the Amazon AMI. You can have an instance to work with inside an hour.

Amazon’s hosting is pretty expensive, but you’ll be up and running fast.

OK, so USD$1500/year for Amazon hosting is pretty insane. What do I do?

Install edX using the Ubuntu 12.04 instructions below. You have a few options when looking for a machine to run it on:

Find a VPS host. In the US, I’ve had great experiences with RamNode. I rent a VPS from them with similar specs to the Amazon AMI and (as of Sep 2014) it’s costing USD$120/year.
In Australia, I’m happy with Ransom IT, though you either have to buy their largest services or negotiate something with the owner. That comes to $480/year.
- You can trim down edX’s memory usage without too much trouble. This lets you fit into smaller and cheaper VPSes.
You could also buy a really nice desktop computer for $500-$1000 and run edX directly off it. This assumes that you have a fast Internet connection and your admins don’t mind or don’t know that you have a public-facing server within your office.

(Obviously, these prices will change. USD$1500/year is what Australia-region Amazon hosting would cost us once bandwidth and storage are factored in. Both Amazon and VPS hosts are continually dropping their prices, so do your own research, slacker!)

You can run the Production edX stack on your dev machine/laptop, but it uses 4GB of RAM and isn’t at all convenient for development. It runs tolerably with 2GB of RAM. You probably want the Developer Stack instead.

Hosting Options

ian@mutexlabs.com (Ian Howson) — Thu, 25 Sep 2014 00:00:00 +0000

Assuming that you’re setting up a small instance (less than 1000 users), the most difficult hosting requirement is enough RAM. Open edX is composed of many software components, all of which use RAM even when they’re idle.

Put simply, you need 4GB of RAM. More won’t hurt. You can get away with less, but you need to fiddle around; see my post on reducing edX memory consumption for more details.

For a single machine instance, you’ve got a few options for hosting:

Amazon Web Services
A VPS
Your own hardware

Amazon Web Services (AWS)

Pros:

Extremely easy to set up (minutes)
Scales up as much as you like
- Amazon’s tools make scaling extremely easy
- Open edX is designed to run on AWS and automatically scale, so you’ll save a lot of effort if your instance is big enough
If you’ve got a large/busy setup you might end up saving money by scaling up/down with demand

Cons:

Expensive
- Amazon will bill you for instance usage, disk space, I/O and network usage separately.
- In Sep 2014 the instance costs $0.098/hour (in Sydney), but my actual running costs for an empty edX setup were around $120/month due to the extras.
Low performance disk and CPU
- You have to worry about scaling issues sooner
- You have to buy more hardware to compensate for your low-performance hardware
Amazon doesn’t make any guarantees about uptime. If you only have one machine and it goes down, your whole edX instance will be down until you start up a new machine.
- Amazon machines used to be quite unreliable and go down every few weeks. In the last year or two they’re much better and will usually last for months at a time.
- Amazon’s storage infrastructure (EBS) means that if a machine goes down, you can just start a new one; your data should remain intact.
- Starting up a new machine is pretty easy, but you’ll be out for a few minutes, assuming you find out about the outage quickly.
- Amazon encourage you to design your service to tolerate machine outages.
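To see how much of the bill comes from the extras, you can work backwards from the two figures in the Cons list above (the $0.098/hour instance price and the roughly $120/month actual cost). This is a back-of-envelope sketch that assumes the instance runs all day, every day; the variable and function names are mine.

```python
HOURS_PER_YEAR = 24 * 365


def monthly_instance_cost(hourly_rate):
    """Average monthly cost of an instance that runs around the clock."""
    return hourly_rate * HOURS_PER_YEAR / 12


instance_only = monthly_instance_cost(0.098)  # about 71.54 per month
extras = 120 - instance_only                  # about 48.46 per month
```

So roughly 40% of the monthly bill was disk, I/O and network charges rather than the instance itself.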

Moving an AMI to your region

The Open edX AMIs are only available in a few regions. I wanted one in Sydney. To transfer the AMI to ap-southeast-2 (Sydney):

Switch to the region that you’re going to copy the AMI from (e.g. us-east-1)
Ideally, you’d find it in the directory, go to Actions, hit Copy AMI and choose Sydney. For whatever reason, Copy AMI is disabled, so…
Create a new Micro instance in us-east-1 using the openedx AMI
Boot the instance
In the instance’s Actions menu, click Create Image
Wait a few minutes. Don’t shut down the instance yet.
In the AMIs page, the new AMI will list as ‘pending’. Wait until it’s ‘available’.
You can now shut down the new instance
In the AMIs page, now you can go to Actions->Copy AMI to Sydney.
Wait about half an hour

EBS vs. Instance Storage

EBS is network-attached storage that is effectively unlimited in size and highly reliable. It persists independently of the instance until you delete it.
Instance Store is temporary, high-performance storage. Physically, it’s disks on the same VM server that your instance is running on. Once you shut down the instance, anything that you put in here is lost.
- Obvious uses for this are /tmp and swap

A Virtual Private Server (VPS)

Pros:

Affordable
- Quality hosting in the US is about USD$120/year.
- In Australia, AUD$480/year.
Per-machine, performance is higher than AWS (assuming you’ve chosen a reputable VPS host that doesn’t overprovision their machines). Fewer machines means less maintenance.
You might get an uptime guarantee
- This is probably not worth anything; your compensation if a machine fails will be an email apology, if anything
- You can check reviews and uptime reports to see if your host has a good record
You could rent a managed VPS and thereby outsource some of the administration hassle
- Administering a Linux VPS is trivially easy compared with Open edX, so I wouldn’t bother spending the money

Cons:

Scaling up/down with demand is your problem.
- For a small enough instance (a few thousand users), you’ll fit on one machine, so this isn’t an issue
- You could build your own OpenStack or CloudStack cluster, but this involves significant engineering effort; AWS probably works out cheaper once engineering cost is considered.

OpenVZ or KVM?

Roughly, OpenVZ isolates multiple users of the same hardware and kernel from each other. KVM is a virtual machine, so it will behave more like a real computer.

Both have some advantages:

OpenVZ can change RAM, disk and CPU allocations instantly
- There’s no rebooting or repartitioning if you need to scale up/down
- This is nice if you’re not sure how much hardware you need or you get unexpected traffic
- With KVM, you need to reboot and resize the partitions by hand; downtime could be significant
KVM can use swapspace (file-based swap) or zram (compressed RAM-backed swap) to squeeze more RAM out of the same hardware
- OpenVZ usually blocks you from changing the swap config; you will need to pay for more RAM if you run out
- If you run out of RAM, the OOM killer will kill a process. Usually nothing bad will happen, but occasionally it’ll hit something important, like Postgres.
- You can’t load kernel modules in OpenVZ, so you can’t use zram unless the host explicitly supports it (unlikely)
KVM can use LXC or OpenVZ containers within your VM
- You can nest virtual machines inside your virtual machines. Confused yet?
- This is really handy for running production and staging systems on the same (virtual) hardware without paying for a second VPS

My preference is KVM, but if I had more users, I’d be leaning toward OpenVZ or Amazon.

There’s some excellent discussion of different virtualisation options over here.

Your own hardware

Pros:

Cheap! A $400 desktop is plenty to get started and there are no ongoing rental fees.
Fast
You might have suitable hardware lying around already

Cons:

You need a really fast Internet connection
Any hardware failures are your problem
You need to know how to build and maintain the hardware

Monitoring

ian@mutexlabs.com (Ian Howson) — Thu, 25 Sep 2014 00:00:00 +0000

Setting up Sentry crash reporting

If you want to provide a reliable service, it’s extremely important to be aware of when things are going wrong on the website.

Sentry is a wonderful free system to catch Python exceptions.

For Django, we use raven to catch and report the errors back to Sentry.

Set up a Sentry server

This is left as an exercise to the reader. Commercial hosting is available if you don’t want to administer another service (or email me and I’ll do it for a fixed fee).
Set up two new services (LMS and Studio) in Sentry and get their DSN strings

The DSN strings are inside your Sentry instance under the Settings->Python->Django page.

Modify /edx/app/edxapp/edx-platform/lms/envs/common.py

Add to the end of the file:

# Sentry integration
INSTALLED_APPS += ('raven.contrib.django.raven_compat',)

RAVEN_CONFIG = {
    'dsn': '<your-DSN-string>',
}

Install the raven module

Ensure that you’re using the edxapp virtualenv. One easy way to do this is by typing which python. If you get back /usr/bin/python, you’re NOT in the virtualenv. If you get back /edx/app/edxapp/venvs/edxapp/bin/python, you are.

Then,
```
pip install raven
```
If you’re using an edx-platform fork, you might want to add raven to edx-platform/requirements/edx/base.txt so it gets installed automatically (e.g. when you bring up Devstack).
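The virtualenv check described above can also be done from inside Python. This is only a sketch: the helper name `in_edxapp_venv` is made up for illustration, and the only thing taken from the text is the virtualenv path.

```python
import sys

# Path of the edxapp virtualenv, as given in the text above.
EDXAPP_VENV = "/edx/app/edxapp/venvs/edxapp"


def in_edxapp_venv(executable=None):
    """Return True if the interpreter path lives inside the edxapp virtualenv.

    `executable` defaults to the running interpreter (sys.executable).
    Hypothetical helper; not part of edx-platform.
    """
    path = executable if executable is not None else (sys.executable or "")
    return path.startswith(EDXAPP_VENV + "/")


if __name__ == "__main__":
    status = "edxapp venv" if in_edxapp_venv() else "NOT the edxapp venv"
    print("%s -> %s" % (sys.executable, status))
```

If this reports that you are not in the virtualenv, `pip install raven` would install into the wrong Python.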
Repeat for Studio

This time, modify /edx/app/edxapp/edx-platform/cms/envs/common.py and use the DSN string for Studio.
Restart everything
```
/edx/bin/supervisorctl restart edxapp:
```
Test

Hopefully, your Open edX instance doesn’t regularly give 500 errors. If you want to verify that things are working, we need to induce some.

TODO: describe how to modify some edx-platform code to break.

Server/uptime monitoring

I strongly recommend setting up something for server monitoring; it will alert you when the server goes down, and it’ll warn you if you’re running out of memory.

I’m using Observium, primarily because it has a TurnKey Linux image and modern web interface.

Go to RamNode, set up their cheapest VM (right now, $5/quarter) and load the Observium image.

Cheap, reliable and will help you sleep at night.

Setting Up Devstack

ian@mutexlabs.com (Ian Howson) — Thu, 25 Sep 2014 00:00:00 +0000

Development is much easier using Devstack instead of modifying a production instance. Significantly, it’s configured to only use 2GB of RAM, which makes it fit much better on your dev machine.

The instructions on this are reasonable, but they’re scattered around a bit, so:

Download the base VM
- There’s no resuming on the download, so I used the torrent (4GB is nontrivial in Australia).
Set up VirtualBox
- Get the exact version mentioned on the edX wiki. Newer ones will cause headaches. I’ve had success with 4.3.12 and 4.3.20.

“mount.nfs: requested NFS version or transport protocol is not supported”

I never figured out exactly what went wrong here. The nfsd port was being held open by something, but lsof would not list a process. A reboot fixed it.

Also, if you are prompted for a password, it means the admin/root password on the host machine, not the VM. This is not entirely clear.

You can log in to Devstack with

vagrant ssh

You will need a particular user/virtualenv to do anything, so almost always do

sudo su edxapp

on the VM after you ssh in.

Devstack doesn’t start the services automatically. To start the LMS, for instance, run the following on the VM:

cd /edx/app/edxapp/edx-platform
paver devstack lms

This will take a little while the first time around.

You can then access the LMS at http://localhost:9000 (on the host).

MongoDB won’t start

I see this error a lot:

pymongo.errors.ConnectionFailure: could not connect to localhost:27017: [Errno 111] Connection refused

If this is the first time installing/running Devstack, running vagrant provision again is probably the right thing to do. It reinstalls everything, so don’t use it after you’ve made changes.

Usually the error is caused by MongoDB shutting down unexpectedly. I put the following into a shell script and run it whenever I see the error:

sudo rm /edx/var/mongo/mongodb/mongod.lock
sudo -u mongodb mongod --dbpath /edx/var/mongo/mongodb --repair --repairpath /edx/var/mongo/mongodb
sudo start mongodb

If that doesn’t fix it for you, there is more information on repairing the Mongo database at http://docs.mongodb.org/manual/tutorial/recover-data-following-unexpected-shutdown/.
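To confirm whether MongoDB is actually listening before and after the repair, a small port check can help. This is a rough sketch using only the standard library; `port_is_open` is a made-up helper name, and the port number is the one from the error message above.

```python
import socket


def port_is_open(host="localhost", port=27017, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Something is accepting connections on 27017")
    else:
        print("Nothing is listening on 27017; try the repair steps above")
```

An open port only shows that a process is listening; it does not prove that the database itself is healthy.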

More errors

ImportError at /
No module named exceptions

Per https://groups.google.com/forum/#!topic/openedx-ops/bk4dvZRH1dk:

pip uninstall edx-analytics-api-client
pip install -e git+https://github.com/edx/edx-analytics-data-api-client.git@0.1.0#egg=edx-analytics-data-api-client

Unsorted notes

This uses VirtualBox, which is really a lowest-common-denominator type of decision. My Mac freezes hard if I start up another VM (either Parallels or HAXM for Android), which is somewhat inconvenient, because all of my VMs are in Parallels and Android dev is very slow without HAXM.

There are hooks in the Vagrantfile to use VMware Fusion, but I haven’t tried it.

There is a Parallels target for Vagrant, but I haven’t tried it.

The Developer Stack is configured with 2GB of RAM by default, but you can reduce it to 1GB through the VirtualBox GUI. Performance at 1GB is fine. This helps a lot if your dev machine only has 4GB of RAM.

Using LXC Containers

ian@mutexlabs.com (Ian Howson) — Thu, 25 Sep 2014 00:00:00 +0000

This page is extremely rough. It’s just my notes with very little editing or checking. Be warned!

Instead of running Open edX directly on your VPS, consider running it within an LXC container.

Pros:

Run production and staging systems on the same hardware
- It’s usually cheaper to rent one big VPS instead of many small ones
Run other services on the same hardware. Open edX requires specific versions of everything which will conflict with your other services.
Test security patches or custom code on a production-like system

Cons:

Some increase in complexity
You have more machines to deploy security updates onto
- You could automatically deploy updates
- Better yet, you could use your staging container to test the updates before you deploy them onto the production container
A trivial decrease in performance
A very small increase in disk space usage
- The LXC host needs its own copy of the Ubuntu base packages
You need enough RAM to run both systems simultaneously
- A VM would use double the RAM, as you’d make a fixed allocation. With LXC, you’re just running two copies of the applications under the same kernel, so you’ll use the space much more efficiently.
- Therefore, swap is shared. If you have applications which aren’t used often – like for a staging site – they might get swapped out and thus not impact your production site too much.
Open edX uses AppArmor for a number of security features, and it’s not clear to me that that works within an LXC container
- LXC is probably not a good idea if you’re running untrusted code (e.g. for a programming MOOC).
- LXC containers, right now, do not provide strong protection of the host against malicious clients (such as your students).

What is LXC?

LXC is containers for the Linux kernel. What that means is that you can run multiple Linux userspace instances under one kernel. They’re isolated but can still share resources efficiently.

As LXC isn’t full machine virtualisation (like VMware, Parallels or VirtualBox), you can run it underneath another VM instance, such as one rented from a VPS host.

How do I set up the LXC host?

LXC features are built into most modern kernels. The tools are available back to Ubuntu 12.04 LTS (and probably further). I’ve had significantly better results under Ubuntu 14.04 than 12.04.

If you’re running Ubuntu, just install the lxc package.

How do I set up LXC guests?

If you want to set up (say) an Ubuntu guest, you would run:

lxc-create -n <container name> -t ubuntu

<container name> can be anything; I use ‘edxstaging’ and ‘edxprod’.

The files for the guest will appear under /var/lib/lxc/<container name>/rootfs

The first time you set up a particular template/release combination, packages will probably be downloaded. Be prepared to wait a little. Subsequent creations of the same template/release should be very fast.

It’s useful to be able to specify exactly which release of Ubuntu will be installed. For edX, you probably want the 12.04 AMD64 release, so run:

lxc-create -n <container name> -t ubuntu -- -r precise -a amd64

(I don’t know what happens if you try to run amd64 on a 32-bit host; it probably won’t work.)

Start the guest with:

lxc-start -d -n <container name>

The -d starts the container in the background. If you leave this off, the container will run in your terminal like a program and it’ll die when you close the terminal (unless you were already in a tmux session).

You can then get a console on the container with:

lxc-console -n <container name>

For the ubuntu template, you can log in at the console with username ubuntu and password ubuntu. You can then set up Open edX as you would a normal VM, per the deployment checklist.

Later on, you might like to stop the guest with:

lxc-stop -n <container name>

or delete the guest with:

lxc-destroy -n <container name>

Let the guest access the network

Depending on your release and configuration, the guest might not be able to access the network by default. Assuming that you’re running Ubuntu 12.04, you’ll need to set up an IP address on the guest and then use iptables to share the host’s network with the guest.

On the guest, as root, edit /etc/network/interfaces to look like:

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
    address <guest IP address>
    netmask 255.255.255.0
    network 10.0.3.0
    gateway 10.0.3.1
    dns-nameservers 8.8.8.8 8.8.4.4

Change <guest IP address> to something unique for that guest. The default uses 10.0.3.1 for the host, so use something like 10.0.3.100 for the guest. Each guest must have its own IP, obviously.

On the guest, run /etc/init.d/networking restart to apply these changes. You should then be able to access the Internet from the guest.

I’m not sure why the default config uses a 10.x.x.x IP but only a /24 subnet; doesn’t hurt anything, though.
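If you are juggling several guests, it is easy to reuse an address or pick one outside the subnet. The sketch below checks a proposed guest IP against the network and gateway values from the interfaces file above. The function name and the idea of passing in a list of already-used addresses are assumptions for illustration.

```python
import ipaddress

# Values from the interfaces file above: network 10.0.3.0, netmask 255.255.255.0.
LXC_NETWORK = ipaddress.ip_network("10.0.3.0/255.255.255.0")
LXC_HOST = ipaddress.ip_address("10.0.3.1")


def check_guest_ip(candidate, already_used=()):
    """Return a list of problems with a proposed guest IP (empty means OK)."""
    ip = ipaddress.ip_address(candidate)
    problems = []
    if ip not in LXC_NETWORK:
        problems.append("not inside %s" % LXC_NETWORK)
    if ip == LXC_HOST:
        problems.append("clashes with the host/gateway address")
    if ip in (LXC_NETWORK.network_address, LXC_NETWORK.broadcast_address):
        problems.append("network or broadcast address")
    if ip in {ipaddress.ip_address(a) for a in already_used}:
        problems.append("already assigned to another guest")
    return problems


print(check_guest_ip("10.0.3.100"))
print(check_guest_ip("10.0.3.101", already_used=["10.0.3.101"]))
```

On the /24 question: 10.0.0.0/8 is a private block that can be subnetted freely, and a /24 carved out of it gives 254 usable guest addresses, which is plenty for a handful of containers.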

On the host, run:

/sbin/iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
/sbin/iptables -I FORWARD 1 -i eth0 -o lxcbr0 -m state --state RELATED,ESTABLISHED -j ACCEPT
/sbin/iptables -I FORWARD 1 -i lxcbr0 -o eth0 -j ACCEPT

Check that the updated iptables rules make sense with:

/sbin/iptables -L -v

To commit the rules for the next boot:

iptables-save > /etc/iptables.up-rules

Let the Internet see your edX instance

You probably want to make the edX instances accessible to the Internet. If you want all of them (or other websites) accessible on port 80 but with different hostnames, use nginx on the host per http://nginx.org/en/docs/beginners_guide.html#proxy.

You’ll end up with stacked nginx proxies (the host nginx talks to the guest nginx, which talks to the application server) but this isn’t a big deal.

On Ubuntu-like hosts, you can just stick a file in /etc/nginx/sites-enabled/edx that looks like:

server {
    server_name edx.example.com;

    access_log on;

    location / {
      proxy_pass         http://10.0.3.101:80;
      proxy_redirect     default;

      # These fix the headers for the guest's server. Without these, you'll get broken redirects and less useful logging.
      proxy_set_header   X-Real-IP  $remote_addr;
      proxy_set_header   X-Forwarded-For $remote_addr;
      proxy_set_header   Host $host;
      #proxy_set_header   X-Forwarded-Proto $scheme;
    }
}

At this point, the guest will log the IP address of the LXC host instead of the actual IP that requested the page. You can fix this by modifying the nginx config on the guest. For the LMS, edit /edx/app/nginx/sites-available/lms. Where it says:

location @proxy_to_lms_app {
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Forwarded-Port $server_port;
    proxy_set_header X-Forwarded-For $remote_addr;

modify the X-Forwarded-For line like so:

# forward the correct IP from our upstream nginx
proxy_set_header X-Forwarded-For $http_X_Real_IP;

Starting LXC containers automatically

You probably want your edX LXC containers to start automatically when you boot the machine.

There’s some conflicting information about how to do this. The way I’m doing it (on Ubuntu 14.04 LTS) is:

Edit /var/lib/lxc/<container name>/config
Add lxc.start.auto = 1 somewhere
Verify that it has taken effect with lxc-ls -f

Also check that /etc/default/lxc has LXC_AUTO="true".

`lxc-*` commands very slow, or `lxc-console` takes minutes to return

Your guest network probably isn’t configured correctly.

Unsorted below this line. Beware!

You might also want to open up the SSH ports (depending on how paranoid you are about security). You can use iptables again to forward ports on the host to the LXC guests:

# TODO: this doesn't work yet
/sbin/iptables -t nat -A PREROUTING -p tcp --dport <new ssh port> -j DNAT --to-destination <guest ip>:22
# e.g. /sbin/iptables -t nat -I PREROUTING 1 -p tcp --dport 2222 -j DNAT --to-destination 10.0.3.102:22
/sbin/iptables -I INPUT 1 -p tcp -m state --state NEW -m tcp --dport <new ssh port> -j ACCEPT

Then, you’d access SSH on the guest with a command line like:

ssh -p <port number> <hostname>

Deployment Checklist

ian@mutexlabs.com (Ian Howson) — Thu, 25 Sep 2014 00:00:00 +0000

This page is extremely rough. It’s just my notes with very little editing or checking. Be warned!

Deploying a production service

You probably want to use the release branch of edx-platform. This is slightly more stable than the main branch; apparently it is what runs on edx.org.

Before updating your production server, it’s probably a good idea to run any updates against a staging server just to make sure things are sane. (A batch of unit testing wouldn’t hurt, either.)

Initial user and security setup

If you’re using a machine that is directly exposed to the Internet, the first thing to do is get basic account and network security in place. You can skip this if you’re in an LXC container on an isolated network.

Create a new user account, if the default install has you log in as root.
Fix /etc/sudoers so that the new account can sudo.
Copy your SSH public key to the new account’s ~/.ssh/authorized_keys.
Configure SSH: change the default port (Port <new port number>), only allow the new user account (AllowUsers <username>) and disable password login (PasswordAuthentication no).
For some reason, service ssh restart or service ssh reload don’t actually do anything, so inside a tmux session I do /etc/init.d/ssh stop ; killall sshd ; /etc/init.d/ssh start. Obviously, this will kick you out of the SSH session, which is why we do it inside tmux. If you mistype this or have done something wrong in the config, you will be locked out. Be warned. (This is also why we do this config right at the start, so we can nuke from orbit if necessary.)
For paranoia’s sake, I set up the ufw firewall. Yes, it’s fiddly and annoying. But remember, this is a public-facing web service. Randoms will poke and prod it. You probably want (as root):

ufw allow <ssh port>/tcp  # permit SSH
ufw allow 80/tcp          # permit edX LMS
ufw allow 18010/tcp       # permit edX Studio
ufw default deny          # drop anything else

Again, setting these rules might lock you out of the system. Be careful.
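Since a mistake in the SSH config can lock you out, it may be worth sanity-checking the file before restarting sshd. The sketch below looks only at the three directives discussed above (Port, AllowUsers, PasswordAuthentication). `check_sshd_config` is a hypothetical helper; it ignores Match blocks and included files, so treat its output as a hint, not a verdict.

```python
def check_sshd_config(text):
    """Return a list of warnings for an sshd_config given as a string."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            # Keywords are case-insensitive and sshd uses the first value
            # it sees for a keyword, so keep the first occurrence only.
            settings.setdefault(parts[0].lower(), parts[1].strip())

    warnings = []
    if settings.get("port", "22") == "22":
        warnings.append("SSH is still on the default port 22")
    if "allowusers" not in settings:
        warnings.append("no AllowUsers line; any account may log in")
    if settings.get("passwordauthentication", "yes").lower() != "no":
        warnings.append("password login is still enabled")
    return warnings


example = "Port 2222\nAllowUsers ian\nPasswordAuthentication no\n"
print(check_sshd_config(example))
```

You can pair this with `sshd -t`, which checks the configuration file for syntax errors without restarting the daemon.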

Side note 1: I don’t believe that firewalls actually achieve much in reality, but it’s cheap insurance. Notably, if you forget to make a service internal-only and accidentally bind it to a public IP, the firewall will still protect you.

Side note 2: The edX codebase is huge and undoubtedly contains security problems. The firewall will not protect you against these. You will need to stay up-to-date with security alerts and patch your edX instance regularly.

Set up the host machine

There are a few tweaks that I like to make to all new Ubuntu machines.

Install the following packages on all machines:
- wget: downloads files through the command line. Needed for edX installation and not always installed by default.
- aptitude: like apt-get, but better
- tmux: detach and resume terminal sessions
Install the following packages on anything that isn’t an LXC or OpenVZ guest:
- swapspace: automatically scaling swap files
- zram-config: automatically compresses memory (like swap)
- You can see how your swap is allocated with cat /proc/swaps
- Obviously, it’s best for performance if you don’t need swap at all, but running out of memory and invoking the OOM killer can be dangerous. You don’t control which process is killed (usually, it’s the largest one). This works for a while but eventually something important (like a database) will be killed and Bad Things will happen.
- iotop: tells you which processes are hammering the disk
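If you want to watch swap usage from a script instead of eyeballing cat /proc/swaps, the file is easy to parse. A minimal sketch follows. The sample text is made up for illustration (the device and file names are not from a real machine), and the sizes are in KiB, as the kernel reports them.

```python
def parse_proc_swaps(text):
    """Parse the contents of /proc/swaps into a list of dicts (sizes in KiB)."""
    rows = []
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue
        rows.append({
            "name": fields[0],
            "type": fields[1],
            "size_kib": int(fields[2]),
            "used_kib": int(fields[3]),
            "priority": int(fields[4]),
        })
    return rows


# Made-up sample in the same column layout as /proc/swaps.
SAMPLE = """Filename Type Size Used Priority
/dev/zram0 partition 1048572 20480 5
/var/lib/swapspace/1 file 524284 0 -1
"""

for row in parse_proc_swaps(SAMPLE):
    pct = 100.0 * row["used_kib"] / row["size_kib"] if row["size_kib"] else 0.0
    print("%-24s %-10s %5.1f%% used" % (row["name"], row["type"], pct))
```

On a real machine you would pass in `open("/proc/swaps").read()` instead of the sample string.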

Setting up edX

Follow the instructions at [https://github.com/edx/configuration/wiki/edX-Ubuntu-12.04-64-bit-Installation]. If you’re in a rush, you can skip to ‘One step installation’, which I find works pretty well.

LXC: apparmor issues

While running vagrant.sh, you’ll get an error like:

stderr: apparmor_parser: Unable to replace "/edx/app/edxapp/venvs/edxapp-sandbox/bin/python".  Permission denied; attempted to load a profile while confined?

I spent a while trying to get this to work correctly but was not successful. It’s related to a Python sandbox, used for programming MOOCs (to ensure that students can’t run malicious code on the server). I’m not running a programming MOOC, so I disabled it.

Edit /var/tmp/configuration/playbooks/roles/edxapp/defaults/main.yml. Change:

EDXAPP_PYTHON_SANDBOX: true

to:

EDXAPP_PYTHON_SANDBOX: false

Re-run the deployment script with

cd /var/tmp/configuration/playbooks && sudo ansible-playbook -c local ./edx_sandbox.yml -i "localhost,"

This is the same as the last line of vagrant.sh. Ideally, you would check that config change into a local branch of the edX Configuration repository.

The slightly nicer way to do this is to add the EDXAPP_PYTHON_SANDBOX line to your server-vars.yml, as described here.

LXC: rabbitmq issues

TASK: [rabbitmq | remove guest user]
stderr: Error: unable to connect to node rabbit@localhost: nodedown

I didn’t solve this completely, but a functional (if horrible) workaround is to edit /etc/hosts:

127.0.0.1 <hostname>
127.0.0.1 localhost

Installing your theme

Setting up user accounts

Setting configuration variables

In configuration repo, modify /playbooks/roles/edxapp/defaults/main.yml


Tasks to complete before live deployment

Set up SSH access with public keys (preferably not on the default port 22)
Disable the default accounts:

https://github.com/edx/edx-platform/wiki/Frequently-Asked-Questions
- User: honor Password: edx
- User: audit Password: edx
- User: verified Password: edx
- User: staff Password: edx
Verify that only your LMS, CMS and SSH ports are visible through the firewall. There are a lot of TCP-enabled services running; while they are probably configured to allow connections to localhost only, why take the chance?
- Run netstat -al to check
Review the settings in /edx/app/edxapp/.json, especially things like cms.env.json which define contact details and titles for your instance.
- Or maybe you’re not supposed to touch those – https://groups.google.com/d/msg/edx-code/VjVFT4-Etjw/UrpzDbpazo0J says that they get overwritten during ansible update
Add Google Analytics API key
Set up your DNS to point to your instance.
- Talk about how to use different DNS names to give Studio vs. LMS instead of different port numbers
Think about backups and disaster recovery
Set up authentication (Shibboleth, LDAP)
Important URLs
Configure the instance
Adding users
Creating a course
Setting start and end dates
Uploading SCORM zip files

Reducing Memory Consumption

ian@mutexlabs.com (Ian Howson) — Thu, 25 Sep 2014 00:00:00 +0000

This page is extremely rough. It’s just my rough notes with very little editing or checking. Be warned!

If you’re just developing on your laptop/desktop, run Devstack. It is a lot easier to develop on than the full production server and “only” uses 2GB of RAM.

The stock edX Ubuntu deployment is set up to give you good performance, but it assumes that you have a lot of hardware available.

The recommended config for the edX Ubuntu deployment is an Amazon instance with 4GB of RAM. There are two problems with this:

4GB of RAM is a lot to allocate to a VM in development. If you want to demo or develop on your laptop… not a lot of laptops have 16GB of RAM yet, and spending half of your RAM on a VM is annoying.
- For development, 2GB is sufficient, but it’s still chugworthy
4GB Amazon instances are expensive. As of August 2014, they’re about $100/month, so $1200/year just in hosting.
- To put that in perspective, you could buy a really nice desktop computer or server, put it under your desk and use your university’s Internet connection. And have no ongoing costs.
- If you’re in the sort of institution whose IT department charges $15k/year for a server or even $100k (hello, Australian banking sector), then… sucks to be you. I guess Amazon works out cheaper, then.

If you’re running a big edX instance (tens of thousands of students), then yeah, you probably need some bigger hardware and a lot of RAM. If you’re just doing a closed course, 4GB instances are vast overkill.

You can reduce the memory usage to something sane by:

Reduce the number of workers for a bunch of services (lms, cms, xqueue). By default, lms uses 8, and each uses ~80MB of RAM (so 640MB just for the LMS). I use 3 both in dev and production.
- The optimal number here is highly debated. If you’re CPU-bound then is good, but unless you’re using SSDs, you’re probably not CPU bound. Best to test it and adjust accordingly. Keep in mind that if you’re using regular spinning disks, you’ll probably never peg the CPUs even with many workers; adding workers will just make the disks thrash more. I/O is pretty much always the bottleneck.
- Also, timeout=300? That seems crazy. Who has a 5 minute request? Better to kill it early rather than block all the new requests coming in. Make it 30 seconds, tops (and even then, you’re still boned).
Restart the LMS with /edx/bin/supervisorctl restart edxapp:lms
Restart the CMS with /edx/bin/supervisorctl restart edxapp:cms
Restart xqueue with /edx/bin/supervisorctl restart xqueue
Restart ora with /edx/bin/supervisorctl restart ora
Restart ora_celery with /edx/bin/supervisorctl restart ora_celery
- You might not use it at all, so you could just turn it off
Restarting through supervisor doesn’t seem to change the number of workers; as a stop-gap solution, just reboot the machine (yuck!)
You could also turn off some services, like the grader, forums or java
You could also use zram
Try KSM. Only for KVM VMs right now, but there are some attempts to make it work for all processes, which would be excellent with LXC:
- https://plus.google.com/+MaksimMelnikau/posts/QfhAchyzYva
- http://vleu.net/ksm_preload/
- http://kerneldedup.org/en
- http://kerneldedup.org/en/projects/uksm/introduction/
- https://github.com/prashmohan/lxc-fork/blob/master/Documentation/vm/ksm.txt
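The worker arithmetic from the list above, as a quick sketch. The ~80MB per worker figure is the rough number quoted in the text; real usage will vary by service and load, so treat the results as estimates.

```python
MB_PER_WORKER = 80  # rough per-worker figure quoted in the text


def worker_memory_mb(workers, mb_per_worker=MB_PER_WORKER):
    """Approximate RAM used by a service's worker processes, in MB."""
    return workers * mb_per_worker


default = worker_memory_mb(8)  # stock LMS setting
reduced = worker_memory_mb(3)  # the setting used in the text
print("default: %d MB, reduced: %d MB, saved: %d MB"
      % (default, reduced, default - reduced))
# prints "default: 640 MB, reduced: 240 MB, saved: 400 MB"
```

The same calculation applies to cms and xqueue, though their per-worker footprint is not given above and would need measuring.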

Common Errors

ian@mutexlabs.com (Ian Howson) — Thu, 25 Sep 2014 00:00:00 +0000

I’ve got Internal Server Error when accessing the production server (LMS)

Check /edx/var/log/lms/edx.log for the reason.

Fixing common error messages

(c/o [https://groups.google.com/forum/#!topic/openedx-ops/bk4dvZRH1dk])

from analyticsclient.exceptions import ClientError
ImportError: No module named exceptions

To fix, roll back to an old release version (in this case, v0.1.0):

sudo -u edxapp bash
source /edx/app/edxapp/venvs/edxapp/bin/activate
pip uninstall edx-analytics-api-client
pip install -e git+https://github.com/edx/edx-analytics-data-api-client.git@0.1.0#egg=edx-analytics-data-api-client

Implementing XKCD-style passwords on a real website: lessons learned

ian@mutexlabs.com (Ian Howson) — Wed, 09 Jul 2014 00:00:00 +0000

I recently completed a project requiring a few thousand pre-set-up user accounts. For their passwords, I decided to implement XKCD-style passwords instead of the usual collection of random characters.

I specifically wanted to avoid letting the users choose their own passwords. They would probably use the same password as a related, sensitive system. Keeping the two systems isolated was desirable.

Getting the dictionary file right is difficult

I used a fairly complete dictionary that I pulled from a mailing list (the exact URL eludes me, unfortunately).

Dictionaries contain a lot of offensive words; we’d prefer not to use them for passwords. There’s a fuzzy line for what dictates ‘offensive’, though. The plural of ‘ball’, ‘balls’, can be offensive when combined with the right (or wrong) modifiers.

What is offensive varies across cultures. For example, the Brits have a lot of words which are sort of funny and inoffensive if you’re a native English speaker (e.g. ‘spatchcock’). A large proportion of my users were not native English speakers. Many of the funny British-isms had to go.

Even after screening for offensive individual words, it’s possible to get weird combinations of words that have meaning. After stripping out the obvious swear words and other dangerous (but non-sweary) words, I generated a bunch of random passwords and skimmed through them by hand. This turned up some other dangerous combinations, like ‘hate indian’. Not good.

Once, I encountered this problem with a random character password; somehow the string ‘cute’ snuck into a female user’s password.

Dictionaries contain lots of obscure words and non-words, like ’re’ or ‘b’. They have no meaning to me and thus no recall value.

Performance

Dictionary files are big. They take a long time to read from disk and use a lot of memory. I elected to read the dictionary every time I created a user, which took a good fraction of a second each time.

It would have been smarter to keep it in memory and put some effort into freeing that memory when done. Better yet, generate the passwords offline.
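A minimal sketch of the "load once, then generate" approach, with the filtering described earlier folded in. This is not the code used on the project: the function names, the minimum word length and the tiny inline word list are all assumptions for illustration. A real run would read the dictionary file once and pass its lines in.

```python
import secrets


def load_words(lines, blocklist=(), min_length=3):
    """Filter dictionary lines into a usable word list.

    Drops very short non-words ('b', 're'), anything non-alphabetic,
    and anything in the hand-maintained blocklist.
    """
    blocked = {w.lower() for w in blocklist}
    words = set()
    for line in lines:
        word = line.strip().lower()
        if len(word) >= min_length and word.isalpha() and word not in blocked:
            words.add(word)
    return sorted(words)


def make_passphrase(words, count=4):
    """Pick `count` words uniformly at random using a CSPRNG."""
    return " ".join(secrets.choice(words) for _ in range(count))


# Load once, then reuse the list for every account.
WORDS = load_words(
    ["correct", "horse", "battery", "staple", "b", "re", "balls"],
    blocklist=["balls"],
)
passphrases = [make_passphrase(WORDS) for _ in range(3)]
print(passphrases)
```

Filtering individual words does not catch unfortunate combinations, so skimming the generated passphrases by hand is still needed.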

User acceptance

Random word passwords look different to normal passwords, and most users have never encountered a passphrase before.

Many users didn’t recognise the string as a password at all; they emailed saying that they hadn’t received a password, or thought that it was part of a sentence that had been mistyped, or asked what the words meant. One thought that it was a cryptic word puzzle that they had to solve and that once solved, that answer would be the password.

Other users didn’t know where to put it. I had to draw a diagram showing that the passphrase should be typed in just like a regular password. It seems almost comical, in hindsight, but this genuinely reduced the number of support emails that I received.

Usability

Long passwords are easy to mistype. I elected not to show the user’s password as they typed, but I think that might have been a mistake.

A lot of users don’t type the spaces. Some users will type in all caps.

If the password is long and difficult to enter, users will just copy and paste it. This somewhat defeats the purpose of providing a memorable password.

Even copy and paste has its problems. A lot of users will select too many or too few characters at the start and end of the password string, and if they can’t see the password when they paste it, they can’t see the error.

I added a new Django auth backend to strip spaces and lowercase everything. It almost eliminated the “my password isn’t working” emails. I strongly recommend it.
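The backend itself is not shown here, but the normalisation step is small. A sketch, assuming the stored password was normalised the same way when the account was created; `normalise_passphrase` is a made-up name and this is plain Python, not Django API.

```python
def normalise_passphrase(raw):
    """Lowercase a typed passphrase and remove all whitespace."""
    return "".join(raw.split()).lower()


# Spaces, caps lock and stray copy-and-paste whitespace all collapse
# to the same string.
print(normalise_passphrase("  Correct HORSE battery\tstaple \n"))
# prints "correcthorsebatterystaple"
```

A custom auth backend would apply this to the submitted password before handing it to the normal password check. The function has to run both when setting and when checking the password, otherwise nothing will match.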

Logging your failed password attempts (securely!) will help a lot with diagnosing these problems.

Was it worth it?

Quantifying security differences is tricky at the best of times.

In this case, probably not. It wasn’t a system that required a high level of security. Once everyone had logged in at least once, there were no more complaints – but a lot of people (0.5%?) had trouble entering that password correctly once.

In hindsight, perhaps this is a policy best restricted to your own personal password security and not enforced on other people.

Links

Jeff Preshing’s xkcd Password Generator. I recommend that you go and mash the ‘Generate Another’ button. The wordlist is dangerously small (a few hundred, I’m guessing), but it’s still easy to generate offensive passwords.

Correct Horse Battery Staple: Another, slightly more paranoid option.

xkpasswd: More paranoia again.

I find the addition of numbers and punctuation to be a bit odd; the whole point of using words is that you get sufficient entropy for your password without having to resort to difficult-to-memorise features such as numbers and punctuation.
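The entropy argument is easy to put numbers on. The sketch below assumes words are drawn uniformly from a 2048-word list (the figure implied by the XKCD comic's 11 bits per word). It compares that with an 8-character password drawn uniformly from 62 letters and digits. Both sizes are assumptions; plug in your own.

```python
import math


def passphrase_bits(wordlist_size, words):
    """Entropy in bits of `words` words drawn uniformly from a list."""
    return words * math.log2(wordlist_size)


def random_string_bits(alphabet_size, length):
    """Entropy in bits of a uniformly random string."""
    return length * math.log2(alphabet_size)


print("4 words from 2048:       %.1f bits" % passphrase_bits(2048, 4))
print("8 chars from 62 symbols: %.1f bits" % random_string_bits(62, 8))
```

With these assumptions the two come out in the same ballpark (44 bits against roughly 47.6), which is the point: a larger word list or a fifth word closes the gap without adding punctuation.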

Choosing A Secure Password by Bruce Schneier. Long, but an excellent read.

Afterword

For a later group of users, I used standard Django random passwords (a sequence of 8 random numbers and letters).

Not one user complained that their password was not being accepted. A few couldn’t figure out where to type it in (even with the helpful image!) but they could all readily identify the password.

Do you really need ECC RAM with ZFS?

ian@mutexlabs.com (Ian Howson) — Thu, 27 Feb 2014 00:00:00 +0000

In short, not really. But your life will be better if you do.

The thing to remember is that ZFS will absolutely refuse to give you data that it thinks is incorrect. If it detects an error, it will give you an error. It will never ever (to a vanishingly tiny probability) give you wrong data.

So, if your non-ECC RAM is already perfect, great! You won’t gain anything by getting ECC RAM.

The thing is, all hardware is imperfect. Modern hard drives are specced with an error rate of about 1 bit in 10^15 or so. And while this number should be taken with a large grain of salt, it’s worth mentioning that the capacity of hard drives is approaching this number. That is, if you merely fill a modern hard drive with data, you should expect that the drive itself has introduced an error into your data.

Most filesystems trust the data that the hardware gives them, and they in turn will pass that data to you, the user. And if there’s an imperfection, you’ll get that imperfection. You almost certainly won’t notice; nowadays, most data is highly-compressed video or audio or pictures, and humans are mostly forgiving of small flaws.

The thing that makes ZFS difficult to use with non-ECC RAM is that it won’t give you flawed data; it’ll give you no data. If you have a 20GB VM image on a ZFS volume and it develops a single uncorrectable bit out of place, the whole thing is marked ‘broken’ and ZFS won’t give it to you. Over a single bit. Which probably wasn’t important anyway.

Note that I said ‘uncorrectable’. If your data hits the disk intact, that error will almost certainly be correctable by one of the other volume members.

If your data hits the disk incorrectly, such as if you have not-quite-perfect non-ECC RAM and it was written to all of the mirrors incorrectly, you’re in trouble. You now have redundantly incorrect data that ZFS won’t serve to you. Hope you have a backup.
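Here’s a toy model of the two cases above. It is not how ZFS is implemented: SHA-256 stands in for ZFS’s own checksums, a Python list stands in for a mirror, and I’m assuming the RAM error happens after the checksum is computed but before the data is written out. All of the names are made up for the sketch.

```python
import hashlib


class ChecksumError(Exception):
    """Raised when no copy of a block matches its recorded checksum."""


def checksum(data):
    return hashlib.sha256(data).hexdigest()


def read_block(mirrors, expected_checksum):
    """Return the first mirror copy that matches the checksum, or refuse."""
    for copy in mirrors:
        if checksum(copy) == expected_checksum:
            return copy
    raise ChecksumError("every copy is damaged; no data returned")


def flip_bit(data, bit_index):
    """Return a copy of data with a single bit inverted."""
    damaged = bytearray(data)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


block = b"pretend this is a 20GB VM image"
recorded = checksum(block)

# Case 1: the data hit the disks intact, then one disk rots a single bit.
mirrors = [flip_bit(block, 5), block]
print(read_block(mirrors, recorded) == block)  # served from the good copy

# Case 2: the bit flipped in RAM, so both mirrors hold the same wrong data.
bad = flip_bit(block, 5)
try:
    read_block([bad, bad], recorded)
except ChecksumError as error:
    print("refused:", error)
```

The second case is the annoying one: the data is 99.99…% fine, and you still get nothing.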

You don’t need ECC memory for ZFS. It won’t run any better or faster or clean your bathroom. What it will do is reduce the chance that your data becomes inaccessible because it’s slightly wrong; something which you didn’t know happened before, but which ZFS makes obvious.

How to set up a private IPython parallel cluster

ian@mutexlabs.com (Ian Howson) — Mon, 03 Jun 2013 00:00:00 +0000

IPython Notebook (now Jupyter Notebooks) is frickin’ awesome. With the parallel extensions, it’s even awesomer.

I want to use spare desktops around my house to speed up my parallel jobs. There is lots of documentation on how to do this. It is very long.

My setup is:

MacBook running OS X 10.8
Two desktops running Ubuntu

Step 1: Install the software

On the MacBook, you need ipython+notebook+parallel. I use MacPorts, so you can install this with:

sudo port install py-ipython +notebook +parallel +scientific

On the Ubuntu machines, you just need the ipython-notebook package, and the rest of the dependencies will install automatically:

sudo aptitude install ipython-notebook

Step 2: Test it out

To start some workers (‘engines’) on your local machine, run:

ipcluster start --n=4

To start a Notebook instance, run:

ipython notebook

You should get a Notebook instance in your web browser. Fire it up and run:

from IPython.parallel import Client

c = Client()
c.ids
c[:].apply_sync(lambda: "Hello, world!")

You should get back:

[0, 1, 2, 3]
['Hello, world!', 'Hello, world!', 'Hello, world!', 'Hello, world!']

You get one for each engine.

Shut down the cluster on your local computer by hitting Ctrl-C on the terminal window running it.

Step 3: Connect more cluster nodes

Set up your laptop

Cluster configuration is described in a ‘profile’. On your local machine, run:

ipython profile create --parallel --profile=home

This creates a profile called ‘home’. Modify ~/.ipython/profile_home/ipcluster_config.py:

c = get_config()

c.IPClusterEngines.engine_launcher_class = 'SSH'
c.LocalControllerLauncher.controller_args = ["--ip='*'"]

c.SSHEngineSetLauncher.engines = {
    'localhost': 4,
    'tyler': 4,
    'par': 4,
}

# FIXME NASTY HACK We need to use non-system-default Python on the Mac
# (i.e. the /opt path in the default config below) but we want default
# Python on the Linux machine. I couldn't figure out a way to specify it
# on a per-host basis, and profile/bashrc/whatever are not executed for
# ssh login, and the ~ alias doesn't seem to work, so... I created a
# symlink in / for ipengine (i.e. `ln -s /opt/local/bin/ipengine
# /ipengine`). Horrible, but it works!
c.SSHEngineSetLauncher.engine_cmd = ['/ipengine'] # works on Linux (thought nothing necessary)
#c.SSHEngineSetLauncher.engine_cmd = ['/opt/local/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python', '-c', 'from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance()'] # works on Mac

Note that this will open up these ports on your MacBook to the whole local network. If you’re on an untrusted network segment, don’t do this! A future revision of this guide might deal with doing everything across SSH port forwarding.

Set up the cluster nodes

You need to be able to log in to the remote Linux machines automatically via SSH, ideally using your username on the frontend machine. Check here if you’re not sure how.

Per the above dodgy hack, you also need to link /ipengine to the relevant ipengine binary on that machine. For the Macs:

sudo ln -s /opt/local/bin/ipengine /ipengine

Also make sure you enable SSH (System Preferences -> Sharing -> Remote Login). You also need passwordless login on your local machine; test with ssh localhost.

For the Ubuntu machines:

sudo ln -s /usr/bin/ipengine /ipengine

ipcluster will try to share config files with the engines. By default, the directories do not exist on the Linux hosts. On each of them, run:

mkdir -p .ipython/profile_home/security/

Running the thing

To start the cluster, on your local machine, run:

ipcluster start --profile=home

In the Notebook, connect with Client(profile='home') instead of the bare Client() used earlier. If you add new machines to the cluster, you need to re-run the Client(profile='home') line to get access to them. This also fixes things if you break the cluster (e.g. by exhausting RAM).

Attacks on Proximity Card Systems

ian@mutexlabs.com (Ian Howson) — Tue, 28 May 2013 00:00:00 +0000

DRAFT

Introduction

Many attacks have been described against low-frequency (125kHz) proximity card systems. Nobody should be surprised to learn that they’re not considered very secure – and yet, most new installations use these cards.

I want to raise awareness of the limitations of these systems, demonstrate the many simple ways in which they can be defeated, and discourage new installations. This article lists the most practical ways in which to attack these systems. As I become aware of new attacks, I will update the article.

I will mention specific product names, but these attacks can be applied to many different products which use the same operating principles.

System design

We will be dealing with the ‘common case’ when proximity cards are used for access control: HID Proximity cards, an HID reader such as the ProxPoint, transmitting Wiegand back to the controller. Other designs are possible (and desirable), but in my experience, the vast majority use this setup.

An access control system using proximity cards is usually laid out like so:

For a user to authenticate against the system, the following steps take place:

The user brings the card within range of the reader.
The field around the reader powers up the card.
The card transmits its pre-programmed code to the reader.
The reader transmits the code to the door controller.
The door controller may decide to allow, powering the door strike (and unlocking the door), or it can defer to the management backend for an allow/deny decision.
If the door strike is powered (the solenoid activates and unlocks the door), the user can push the door open and enter.

Card

All proximity cards contain a coil and some circuitry. Some long-range types contain a battery, but we will not be dealing with them here. The coil picks up an RF field from the reader and uses it to provide power to the internal circuitry.

The most common coils are tuned to 125kHz or 13.56MHz. We will refer to cards running at about 125kHz as Low Frequency (LF) cards. We will refer to cards running at 13.56MHz as High Frequency (HF) cards.

There are many different types of card on the market. The most common cards run at 125kHz and are made by HID (for example, the ProxCard II). There are very similar cards running at 128kHz or 134kHz from manufacturers such as Indala.

These cards all behave in the same way. When the card is placed in the reader’s field, the circuitry derives a power supply and clock. It then transmits a preprogrammed code to the reader and powers down.

The method of transmission is interesting. Instead of the card actively transmitting RF energy, it manipulates its own power draw in a way that can be detected by the reader. Because the card and reader are inductively coupled together, an increase in load causes a decrease in the output voltage on the reader side. This change in voltage can be measured by the reader and is used to transfer information from card to reader. (For more information, see here.)

Card reader

The reader generates an RF carrier to power and clock the card. When it receives a valid transmission from a card, it will transmit the card number out of its Wiegand interface.

The reader does not make the access control decision (allow or deny). This is good design; the reader is physically accessible to the insecure side of the door and can easily be tampered with. Instead, it transmits the card’s code to the door controller using its Wiegand output.

The Wiegand interface

The Wiegand interface uses three wires: GND, D0 and D1. Both data lines idle at 5V. To transmit a ‘0’ bit, the D0 line is pulsed low; to transmit a ‘1’ bit, the D1 line is pulsed low. There are no formal timing requirements, but most devices transmit and receive pulses around 50µs wide and with a gap of 5000µs between pulses.
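As a rough illustration of the signalling, this sketch turns a bit string into a list of timed pulses and back again. It only models which line pulses and when, using the typical timings quoted above; the function names are mine, and there’s no real hardware access here.

```python
PULSE_WIDTH_US = 50    # typical pulse width quoted above
PULSE_GAP_US = 5000    # typical gap between pulses quoted above


def wiegand_pulses(bits):
    """Yield (start_time_us, line) per bit: '0' pulses D0, '1' pulses D1."""
    time_us = 0
    for bit in bits:
        yield (time_us, "D1" if bit == "1" else "D0")
        time_us += PULSE_WIDTH_US + PULSE_GAP_US


def bits_from_pulses(pulses):
    """Recover the bit string from a sequence of (start_time_us, line) pulses."""
    return "".join("1" if line == "D1" else "0" for _, line in pulses)


pulses = list(wiegand_pulses("1011"))
print(pulses)
print(bits_from_pulses(pulses))
```

Note that nothing in the stream identifies the sender. Anything that can pulse the two lines looks like a reader to the controller.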

Wiegand formats

The most common format for card numbers is as follows:

(image from HID’s Understanding Card Data Formats document)

This is often referred to as Wiegand 26.

Cards almost always support other formats, but Wiegand 26 is the de facto standard. Almost all system components default to Wiegand 26 without further configuration.
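For the curious, here’s a sketch of packing and unpacking a Wiegand 26 frame. It assumes the commonly documented layout: one even parity bit, 8 facility code bits, 16 user code bits and one odd parity bit, with each parity bit covering the nearest 12 data bits. Check your own system’s documentation before relying on it; the function names are mine.

```python
def encode_wiegand26(facility_code, user_code):
    """Build the 26-bit string: even parity, 8-bit facility, 16-bit user, odd parity."""
    if not 0 <= facility_code < 2**8 or not 0 <= user_code < 2**16:
        raise ValueError("code out of range for Wiegand 26")
    data = format(facility_code, "08b") + format(user_code, "016b")
    # Leading bit makes the count of 1s in the first 12 data bits even.
    even_parity = str(data[:12].count("1") % 2)
    # Trailing bit makes the count of 1s in the last 12 data bits odd.
    odd_parity = str((data[12:].count("1") + 1) % 2)
    return even_parity + data + odd_parity


def decode_wiegand26(bits):
    """Return (facility_code, user_code); raise ValueError on bad length or parity."""
    if len(bits) != 26:
        raise ValueError("expected 26 bits")
    facility_code = int(bits[1:9], 2)
    user_code = int(bits[9:25], 2)
    if encode_wiegand26(facility_code, user_code) != bits:
        raise ValueError("parity check failed")
    return facility_code, user_code


frame = encode_wiegand26(42, 12345)
print(frame)
print(decode_wiegand26(frame))
```

The parity bits catch transmission glitches and nothing more. They are not a secret, so anyone can compute them for any facility and user code.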

User codes should be different for every card. They’re often printed on the card itself:

Facility codes are the same for an entire ‘facility’, which is usually the domain controlled by a single access control system. This can be as small as a single office or can span multiple buildings. It has the obvious purpose that if two people from different companies have the same user code, they can’t open doors at each other’s buildings. It partitions the code space to prevent collisions.

Door controller

The door controller receives and decodes the Wiegand signal from the reader. Depending on how the controller is configured, it can make a decision (allow or deny) based on that signal, or it can forward it on to the management backend. Usually, the door controller communicates with the backend through an RS485 bus. This is a multidrop bus, so many door controllers can communicate using a single set of cabling.

If the door controller or management backend elect to unlock the door, a relay on the door controller is energised.

Door strike

The door strike is wired to the relay on the door controller. Its purpose is to physically lock or unlock the door. Normally-open and normally-closed types are available, so:

With an NC type, the strike is locked by default. When powered, the strike is unlocked. This is also called fail-secure.
With an NO type, the strike is unlocked by default. When powered, the strike is locked. This is also called fail-safe.

The strike can also be wired to the relay in normally-open or normally-closed configurations. In this way, default behaviour for the system (when unpowered) can be specified. For example:

If you want the door to unlock when the power fails (improving safety but reducing security), use an NO strike wired to the relay in NC configuration.
If you want the door to absolutely positively not unlock unless the management backend asks for it, use an NC strike wired to the relay in NO configuration.

Usually, there is a fire safety requirement that says that there must be an exit path in the event of a fire. This may require you to choose fail-safe strikes and have them continuously energised in some cases.

Door strikes usually require 12V at a few hundred milliamps to trigger.

Weaknesses and attacks

Most attacks occur in two stages:

Obtaining a card number (facility and user code)
Using the card number to gain access to the facility

Wiring

The wiring between components must be physically secure. Communication between card reader, door controller and door strike is completely unencrypted and unauthenticated.

If an attacker can access the wiring carrying the Wiegand codes between reader and door controller, they can sniff the Wiegand codes. They can then replay the codes directly into the wiring or use them to create a new card.

The same cards may be used for unrelated systems (e.g. printers, vending machines) which may not have secured cabling. The same Wiegand codes can be sniffed from there and used to clone a card.

If an attacker can access the door strike wiring, they can manually energise the strike (by applying a voltage across the wiring) or de-energise it (by cutting one of the wires).

If an attacker can access the door controller itself, the preceding attacks are also possible.

Most RS485 wiring between door controllers and the management backend can also be sniffed and replayed, though the exact formats are not standardised. Some door controllers and backends do use encrypted communication over the RS485 lines.

One big problem is that the card reader itself must be able to withstand physical attacks. The card reader usually has a ‘pigtail’ collection of wires coming out of the back, including the critical Wiegand lines. If an attacker removes the reader from the wall, the Wiegand lines are exposed. This is usually easy to do. Some readers have their mounting screws well secured, such as these:

Some have their mounting screws exposed, so a few moments with an electric drill is all that’s necessary:

Some are physically quite robust. Potting of the electronics is common, and is a good precaution.

One factor that makes physically securing a reader difficult is that any metals in or around the reader will affect the read range. As a result, almost all readers are made of plastic and can be broken off the wall with a hammer.

The same is true of door strikes, but even low-end door strikes are solidly built; they just need to be at least as strong as the door that they lock.

Attacks on the RF transmission

Use an off-the-shelf reader

One obvious way to read people’s cards is with an off-the-shelf reader, like what might be installed on a building. They can be powered from batteries and the beep can be disabled. The attacker might get near someone on a lift and swipe the reader over their pockets. Some companies mandate visible security passes, making it even easier as you can see the pass and know where to swipe the reader.

The reader will output the Wiegand code of the card, so you need a sniffer/replayer and some way to use that code (either replay it into the Wiegand wiring or clone a card). Slightly simpler might be to use a reader with RS232 output and connect that to a laptop.

This is a design flaw in the system – there is nothing the card can do to know that it’s talking to a legitimate reader. As soon as it’s powered, it transmits its code, not knowing if the receiver is friendly or malicious.

A less practical attack, but using the same design flaw, is to put a fake reader on a wall near an entry point. Users will swipe their cards on it thinking that it will open the door. The card numbers can be collected and exploited as above.

RF sniffing

The reader and cards transmit on a known frequency. If you can get close to a reader while it’s communicating with a card, you can capture the card’s transmission. It may be possible to do this at long range, since the reader operates at a very high power level (necessary to power the card). The attacker only needs to observe the card’s transmission, not power the card.

Again, the card does not make any attempt to hide its transmission. Every transmission is identical, permitting replay attacks.

Once you’ve captured the card’s transmission, you can either replay it directly at a reader or extract the user/site code.

With the user/site code, you can:

produce your own card
program a programmable card with the code
purchase a card from a vendor
conduct Wiegand wiring-level attacks

Attacks on the card numbers

The numbers themselves provide opportunities for attacks.

Brute force

The total number of codes available is relatively small (by cryptographic standards). With Wiegand 26, there are 256 site (facility) codes and 65536 user codes, for a total of 16,777,216 card numbers.

Assume a random distribution of card numbers and that you can make one swipe attempt every three seconds. If there is exactly one valid card in the system, it will take (on average) 291 days to find it; not very useful.

If you know the site code (say, you cozy up to someone at their security vendor), only the 65536 user codes remain. You can guess a single valid card in about 1.1 days, on average.

If there are more cards set up in the system, guessing a valid user code becomes correspondingly easier. If you have 100 cards set up and know the site code, a valid card can be guessed in about half an hour, on average.

If you know someone’s user code (not many people know that the number printed on their card is important!) you can brute-force the site code in about 6 minutes on average, or 13 minutes at worst.
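If you want to redo these estimates with your own assumptions, the arithmetic is short. This sketch uses the standard result that when m valid codes are hidden among N and you try each code once in random order, the first hit takes (N + 1) / (m + 1) attempts on average. The three-second attempt rate is the assumption from above; the figures printed are this sketch’s own.

```python
SECONDS_PER_ATTEMPT = 3
SITE_CODES = 2**8
USER_CODES = 2**16


def expected_attempts(search_space, valid_cards=1):
    """Mean attempts until the first hit, trying each code once in random order."""
    return (search_space + 1) / (valid_cards + 1)


def describe(label, attempts):
    hours = attempts * SECONDS_PER_ATTEMPT / 3600
    print("%-36s %10.0f attempts, %8.2f hours" % (label, attempts, hours))


describe("nothing known, 1 valid card", expected_attempts(SITE_CODES * USER_CODES))
describe("site code known, 1 valid card", expected_attempts(USER_CODES))
describe("site code known, 100 valid cards", expected_attempts(USER_CODES, 100))
describe("user code known, site code unknown", expected_attempts(SITE_CODES))
```

The takeaway is the shape, not the decimals: every piece of the card number an attacker already knows knocks orders of magnitude off the search.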

Guessing more user numbers

Usually, cards are sold in sequential order. If you know one card number (perhaps your own, if you’re an inside attacker or corporate spy) it’s very likely that:

there are other card numbers near your own
lower-numbered cards belong to longer-serving employees, potentially with more access rights
fewer lower-numbered cards will work, due to employees moving on

It also means that you really shouldn’t reissue old cards to new employees, as the old employee may be able to produce a new card with the same site/user code. They’re cheap; destroy them (securely!)

Some companies require that people display their security passes on their body (often with additional conditions like “it must be above your waist” and “you must challenge anyone who isn’t wearing a pass”). Some subset of those companies also print the user number on the card. Obtaining a user code then becomes a simple matter of reading the number off people outside the building.

Attack tools

Wiegand sniffer/replayer

This is a device which connects to the D0, D1 and GND lines of the Wiegand interface. When it detects a Wiegand transmission, it captures it. Under user control, it can replay that same transmission onto the Wiegand lines. Each transmission is identical, so that will appear the same as a legitimate card swipe.

Examples of these devices are:

RF sniffer/replayer

The RF sniffer/replayer works identically to the Wiegand sniffer/replayer, but looking at the RF transmissions instead. An antenna or coil picks up transmissions from legitimate cards, stores them and later replays them, impersonating the original card.

Examples of these devices include:

Or just purchase cards online

If you get someone’s card number and facility code, the low-tech approach is to simply order an identical card from the manufacturer.

Mitigations

In a perfect world, you’d use a different access control system. If you must use 125kHz prox cards, there are some things you can do to make life more difficult for attackers.

Security context

Does any of this matter?

It depends entirely on your situation.

What are your security requirements? Do you just need to keep out random passers-by? Do you have valuables or secrets? What is the impact of a successful attack?
Is it worthwhile to use a more expensive but secure system?
Can you add additional authenticators (biometrics or PINs/passwords)?
Can you mitigate the above threats through other means (cameras, security checkpoints)?

Most physical access control systems are subject to the following attacks:

Bash the door down
Smash a window and climb in
Steal an access card
Coerce someone with access into letting you in (bribes or ‘rubber hose attack’)
Tailgate in after someone

Like all security systems, you must weigh up the costs and risks for your own situation.

Future work

For me, the most interesting future work here is the ability to clone a person’s card from a distance. Physical attacks on the reader and wiring make your entry obvious and traceable. Trying to excite a person’s card while it’s on their body carries risk that you may later be identified as “that guy who stood too close to me”. RF capture and replay offers a low-risk option.

To this end, I intend to replicate the work on proxclone.org, focusing on long-range capture of card transmissions. I will document the work and provide costings to make it easier to evaluate the costs of an attack on these systems.

A quick guide to using MySQL in Python

ian@mutexlabs.com (Ian Howson) — Sun, 03 Jul 2011 00:00:00 +0000

Need to access some MySQL databases in Python right now? As in now, really, I don’t have time to read stuff, and please stop rambling because you’re wasting my time now? Read on!

Getting started

Access to MySQL databases is through the MySQLdb module. It’s available in the python-mysqldb package for Debian/Ubuntu users.

Your first step in any Python code is:

import MySQLdb

Python database access modules all have similar interfaces, described by the Python DB-API. Most database modules use the same interface, thus maintaining the illusion that you can substitute your database at any time without changing your code. I suspect that anyone doing this in reality has failed with hilarious consequences, but nonetheless…

Create the connection with:

db = MySQLdb.connect(host="localhost", port=3306, user="foo", passwd="bar", db="qoz")

substituting appropriate local values for each argument.

db is now a handle to the database. Normally, you’ll create a cursor on this handle like so:

cursor = db.cursor()

MySQL doesn’t really support cursors in any sense that’s useful to us here, but the DB-API requires that you interface to them that way. So just copy and paste the line into your code.

Queries

To execute queries:

cursor.execute("SELECT name, phone_number FROM coworkers WHERE name=%s AND clue > %s LIMIT 5", (name, clue_threshold))

String interpolation is a bit different here. You can still use Python’s built-in interpolation and write something like:

cursor.execute("SELECT name, phone_number FROM coworkers WHERE name='%s' AND clue > %d LIMIT 5" % (name, clue_threshold))

but the DB-API interpolation will automatically quote things and guard you from SQL injection attacks, to some extent. If you had a name value of "'; DELETE FROM coworkers;" in the first case, you’d be fine (as the single-quote character would be auto-quoted), but you might run into some slight data loss in the second case.
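If you want to see the difference without risking a real database, here’s a sketch using the sqlite3 module from the standard library as a stand-in. It follows the same DB-API, but note that sqlite3 spells its placeholder ? where MySQLdb uses %s. The table contents are invented, and I’ve used a tamer hostile string than the DELETE above, since sqlite3 refuses to run two statements in one execute() call.

```python
import sqlite3

db = sqlite3.connect(":memory:")
cursor = db.cursor()
cursor.execute("CREATE TABLE coworkers (name TEXT, phone_number TEXT, clue INTEGER)")
cursor.executemany("INSERT INTO coworkers VALUES (?, ?, ?)",
                   [("Bob", "9123 4567", 7), ("Janet", "8888 8888", 9)])

name = "' OR '1'='1"   # hostile input
clue_threshold = 5

# Parameters passed separately: the whole string is treated as a name.
cursor.execute("SELECT name FROM coworkers WHERE name=? AND clue > ?",
               (name, clue_threshold))
safe_rows = cursor.fetchall()

# Python interpolation: the quote in the input rewrites the query itself.
cursor.execute("SELECT name FROM coworkers WHERE name='%s' AND clue > %d"
               % (name, clue_threshold))
unsafe_rows = cursor.fetchall()

print("parameterised:", safe_rows)
print("interpolated: ", unsafe_rows)
db.close()
```

The parameterised query finds nobody by that name. The interpolated one hands back every coworker over the clue threshold.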

SQL queries are a good place to use Python’s multi-line strings, so you can write something like:

cursor.execute("""SELECT name, phone_number 
                  FROM coworkers 
                  WHERE name=%s 
                  AND clue > %s 
                  LIMIT 5""",
               (name, clue_threshold))

if you want to get fancy about it.

The DB-API quoting seems to work best when using %s quoting exclusively (even for numbers). I’m not exactly sure why.

cursor.execute() will return the number of rows modified or retrieved, just like in PHP.

When performing a SELECT query, each row is represented in Python by a tuple. For the above SELECT query with columns ‘name’ and ‘phone_number’, you’ll end up with something like:

('Bob', '9123 4567')

cursor.fetchall() will return you a tuple containing each row in your query results. That is, you get a tuple of tuples. So the above SELECT query might give you:

(('Bob', '9123 4567'), ('Janet', '8888 8888'))

The easiest thing to do with this is to iterate with something like:

data = cursor.fetchall()
for row in data:
    name, phone_number = row  # columns arrive in SELECT order
    print("%s: %s" % (name, phone_number))

You can also use cursor.fetchone() if you want to retrieve one row at a time. This is handy when you’re doing queries like "SELECT COUNT(*) ..." which only return a single row.

Cleanup

Finally, db.close() will close a database handle. I only mention this because some versions of MySQLdb don’t garbage collect correctly, so you can run out of database connections if you’re not careful.

My own experience has been that exceptions make it extremely difficult to clean up fully by hand; you always end up leaking a connection here or there. I get around this by manually invoking the Python garbage collector:

import gc 
gc.collect()

which will close off any old MySQL connections. You could do it just before creating a new connection.
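A tidier alternative, if you can restructure your code, is to let a with block close the connection for you, exceptions or not. This is a sketch rather than what I actually run: it uses contextlib.closing, which simply calls close() on whatever you give it, and sqlite3 as a stand-in so it runs anywhere. The count_rows function is made up for the example.

```python
import sqlite3
from contextlib import closing


def count_rows(database_path):
    """Run one query and always close the connection, even on exceptions."""
    with closing(sqlite3.connect(database_path)) as db:
        with closing(db.cursor()) as cursor:
            cursor.execute("SELECT COUNT(*) FROM sqlite_master")
            return cursor.fetchone()[0]


print(count_rows(":memory:"))
```

With MySQLdb you would swap the sqlite3.connect() call for MySQLdb.connect(...); the connection and cursor objects both have a close() method, which is all closing() needs.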

Getting your results as a dictionary

The Python DB-API doesn’t have a mysql_fetch_assoc() function like PHP. mysql_fetch_assoc() would return an associative array/dictionary containing the results of a SELECT query, like so:

[name: 'Bob', phone_number: '9123 4567']

The nice thing about this is that you can write code like if row['name'] == 'blah':, instead of being dependent on the row ordering in the query.

I wrote this little function to do the same in Python. It’s MySQL-specific, which is why there’s no mysql_fetch_assoc() equivalent in the DB-API already:

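def FetchOneAssoc(cursor):
    """Fetch the next row as a dict keyed by column name, or None when exhausted."""
    data = cursor.fetchone()
    if data is None:
        return None

    # cursor.description has one entry per column; item 0 of each is the name
    names = [column[0] for column in cursor.description]
    return dict(zip(names, data))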

A few notes on the Lenovo X220

ian@mutexlabs.com (Ian Howson) — Mon, 13 Jun 2011 00:00:00 +0000

I ordered off an eBay seller in the US. Lenovo Australia doesn’t even list the X220 yet (and they charge almost double what eBay sellers do.) So far, I’ve ordered one laptop from Lenovo directly and two from eBay sellers. So far, eBay is much cheaper and a little faster, despite this one getting stuck in Customs for about a month.

It’s damn fast, and I can’t explain why. My T410 had a first-gen i5 and NVIDIA graphics. This has a second-gen i5 and Intel 3000, but once I stick in my LUKS password, it takes about a second to reach the login screen.

I’m running Ubuntu Natty. All of the hardware just works. Suspend doesn’t, despite what’s listed on the Ubuntu Wiki, but installing PPA kernel 2.6.39rc4 fixes things. VMware doesn’t work with this kernel, but the patch on this page fixes that.

I still get the occasional hard lock or failure to wake from suspend. The graphics driver seems to be the cause of most problems. There are occasional glitches: sometimes, on opening the screen, you get random patterns (though the mouse pointer looks sane.) Switching to the console and back sometimes fixes it; closing the lid and opening sometimes fixes it; suspending and resuming sometimes fixes it, but just occasionally, I have to reboot. I’ve never had random patterns on the DisplayPort, only the internal screen.

Update, 04 April 2012: I compiled a 3.0.22 kernel for Ubuntu Oneiric which has the RC6 fix. It’s been rock-solid and power consumption is 7-12W most of the time. Very happy. Ubuntu Precise should have the same fix, but I haven’t tested it yet.

The IPS screen (“HD Premium”) looks amazing, even better than my MacBook Pro thanks to the matte filter. There isn’t great mechanical isolation between the frame and the screen, so you get shimmering effects if you twist the screen or press the edges. The 16:9 ratio isn’t ideal, but it fits side-by-side 80 column terminal/gvim with Terminus 12, and that covers 90% of my usage.

The backlight’s LED PWM controller runs at a sometimes-visible frequency. Seriously, people, 500Hz plus. There’s no good reason for LEDs to visibly flicker, EVER. Protip: if you run them at a constant current, you’ll achieve even better efficiency, and that means free battery life.

The keyboard feels a bit better than the T410; less mushy. They must be changing the keyswitches or something between models, because it looks almost identical physically.

There’s no eSATA port on the machine, but it does work through the dock.

DisplayPort works through the dock, too, unlike the T410. DisplayPort works happily with the 2560x1440 monitor at work.

When you plug in a DisplayPort monitor, it shows up instantly (and potentially switches it on.) This is a massive improvement over mouse clicking through the NVIDIA control panel. xrandr --auto works, as it should.

The VGA output is reported to not work, but I had no issues. I finally have full-screen Flash videos. They didn’t work on NVIDIA. I don’t have 3D acceleration in VMware, apparently, but who cares?

The ThinkLight is brighter than before, but again, one must ask the question: who cares? You have a screen illuminating the keyboard or reading material or whatever. I suppose you could turn off the screen and use the ThinkLight while reading a book, but a $5 book light will achieve the same function and not run down your laptop battery. Remove it and add an ambient light sensor; they’re useful and save battery power.

I’m hitting the touchpad a bit with my palm. Will probably disable it.

The touchpad-with-integrated-buttons thing doesn’t really work. I mean, yeah, there’s clearly not enough room there for buttons. But the moment you touch the button like you’re going to press it, the pointer jitters all over the place. This makes it tough to actually click anything, which is sort of the point of having buttons. I recommend just using the upper buttons and ignoring the integrated buttons. Having the integrated buttons there is no worse than leaving them off, but this shouldn’t have made it into the product. The HP Mini puts the buttons on the sides of the touchpad, and that works pretty well; I think that Lenovo should do the same thing on the X220++.

One baffling bug that I’ve experienced is that my VMware virtual machines won’t start (‘Unable to change virtual machine power state: Cannot find a valid peer process to connect to’) if the machine is in the dock. I have to undock, start the VMs and plug it back in. Virtualbox is fine.

I had a lot of trouble getting the thing to boot. For a few weeks, I carried around a USB stick with the System Rescue CD on it, just in case I needed to reboot. (Combined with not working out the suspend problem for a while, I was just leaving the machine running in my backpack for extended periods.) The X220 uses the newer EFI firmware standard, and it appears that Lenovo’s implementation won’t legacy boot from a GPT-partitioned disk. No idea why – the T410 is perfectly happy with this arrangement. Once I worked that out, I converted back to MBR (gdisk makes this fairly safe), reinstalled GRUB, and things started working.

I did spend a lot of time trying to make it boot in EFI mode. I could get both GRUB2 and ELILO to start up, but the moment they tried to execute the kernel, nothing. ELILO would reboot and GRUB2 would just hang. Natty is not really set up for EFI booting. There’s no consistent mount point for the EFI System Partition, so kernel updates are likely to fail, and building a startup disk yields an EFI-style startup disk that doesn’t work, either.

Battery life is fantastic – with the 90Wh battery, 11 hours is quite achievable, more if you dim down the screen and turn off WiFi. Of course, with the 90Wh battery, it feels like you’re carrying just the battery and there happens to be a screen hanging off it. Following the suggestions in Powertop helps a lot. There’s a nasty bug in Chrome which ruins battery life – it increases power consumption from about 8W to 16W (i.e. just running Chrome doubles the machine’s power consumption and halves your battery life.) Firefox doesn’t have this problem, but Firefox renders a lot of stuff strangely on Linux, so I guess I’ll live with it for now.

I had a bit of trouble finding out what the wireless card was from the eBay seller – he just said ‘wireless N’ when I asked (repeatedly.) It is a:

03:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8188CE 802.11b/g/n WiFi Adapter (rev 01)

It continues Realtek’s proud tradition of making really fucking awful network cards. When copying a large file, it reads less than 1MB/sec and jumps around a lot. Sending a file gets peaks of 3MB/sec then drops to near-zero for a while. This is noticeable during web browsing, where if you load a bunch of pages at once the whole lot will stop.

I swapped in the wireless card (Intel Ultimate-6300) from my T410 and things are much better – 5MB/sec down, no pauses. I realised that I don’t have the white MIMO antenna cable – I didn’t think to order a 3x3 WiFi antenna. I’m not sure why it’s an optional extra, especially as the WWAN antennae are always installed. I connected one of the WWAN antennae to the MIMO socket and it bumped throughput to about 7MB/sec, which I’m happy with. Lesson learned: don’t cheap out on the wireless card or antennae. I’m pretty sure that the eBay seller removed the original card and sold it separately – the same happened with my X61s, and I’m doing the same with the T410 when I sell it.

When I reassembled the machine after messing with the WiFi cards, I had trouble getting the right-hand edge of the palmrest to sit flat. It turns out that the antenna cables taped just underneath had shifted. There’s no hook or plastic to hold them in position; you just have to tape them in the right spot to line up with the channel in the palmrest.

Interestingly, the WLAN LED works with the Intel card, where it didn’t on the Realtek. It is a bit distracting when watching movies in the dark. It can be disabled with the led_mode parameter to the iwlagn module.

The 7mm high drive bay (instead of the usual 9mm) could be a problem for some. It doesn’t bother me so much as I use Intel SSDs. With the 80GB MicroSSD option, though, I can see the sense in installing a huge spinning disk and using the 80GB to boot from. Except that there aren’t really any huge 7mm drives. So an SSD is your best bet, both on capacity and performance grounds (the 300 and 600GB G3 models both cost less than my 160GB G2!) The main place where I expect it to bite me is if I have a drive failure and need to buy/install something fast; I can’t just buy a drive from any old computer store, whack it in and restore from backups. I need to have the drive that shipped with the machine.

To handle this, I installed the shipping drive in an eSATA drive box. I have my root partition (LUKS-encrypted, including /home) set up as a RAID1. When I get to work, I hot-add the external drive to the RAID. It background syncs; when it’s complete, I can (theoretically) remove the drive from the box and plug it directly into my laptop to replace the failed SSD. The external drive is a lot slower than the SSD (due to the drive itself, not the interface) so I use the --write-mostly parameter to mdadm. It still slows things down a little, but it’s rarely an issue.

File synchronisation algorithms

ian@mutexlabs.com (Ian Howson) — Wed, 18 Jun 2008 00:00:00 +0000

You have two filesystem trees, A and B. You want the files on both sides to be the same.

Cases that you need to handle:

File exists on A but not on B (and vice-versa)
File exists on both and is identical
File exists on both and is different

Right about this point in time, you’re in trouble. (That was fast!) Only one of those situations can be handled automatically, and that’s if the file is identical on both sides. You need a lot of user input to figure out what the directories should look like, and users tend to say “too hard!” Unison assumes that if a file is present on one side and not on the other, it has just been created. So it copies it across. Already we’re in dangerous territory because this is frequently not what you want to do.

If the file exists and is different, you have to ask the user how to merge them or which one to pick. Asking regular users how to merge files is a bad idea. (Asking developers how to merge files is usually a bad idea.)
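
To make the three cases concrete, here is a minimal sketch of the history-free comparison in Python. The function names (list_files, classify) are mine, made up for illustration; this is not how Unison or any other tool does it, just the bare classification step.

```python
import filecmp
import os


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def classify(root_a, root_b):
    """Sort every path into one of the history-free cases."""
    files_a, files_b = list_files(root_a), list_files(root_b)
    result = {
        "only_a": sorted(files_a - files_b),
        "only_b": sorted(files_b - files_a),
        "identical": [],
        "different": [],
    }
    for rel in sorted(files_a & files_b):
        # shallow=False forces a byte-by-byte comparison of the contents.
        same = filecmp.cmp(os.path.join(root_a, rel),
                           os.path.join(root_b, rel), shallow=False)
        result["identical" if same else "different"].append(rel)
    return result
```

Only the "identical" bucket can be handled without guessing. Everything in "only_a", "only_b" and "different" needs a decision that the data alone cannot supply.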

Sigh.

This algorithm is not going to work very well. It doesn’t handle any common cases, makes a lot of mistakes in its assumptions, and asks users for too much information (which will probably be wrong anyway). Anyone using this algorithm in their synchronization product (*cough* Microsoft *cough*) is going to have a lousy product.

(Don’t get me wrong. I like Office. I like many Microsoft games. I’m not anti-Microsoft at all. It’s just Sturgeon’s Law: 90% of everything is crap.)

Unfortunately, this case is unavoidable on the very first synchronization of a pair of trees. We have no history data -- even disconnected history data -- and so cannot make informed decisions about what’s new, deleted or changed. The files just are or they are not and we can’t say which of the two trees is correct.

The next refinement is to store history data when you look at the file trees. Every time you perform a synchronization you record some metadata for each file. You want to store the filename and the modification time. That way, when you do the next synchronization, you look at what changed between time X and time Y and apply those changes to the remote file tree, somewhat like generating a diff and then patching a tree. You do this twice -- once for each direction (A to B and B to A). You can get conflicts, of course.

Conceptually, this looks like:

Compare this with the first algorithm, which looks like this:

Note that if you have no history data, Algorithm 2 works exactly like Algorithm 1. Badly.

This all operates much like a version control system and has similar problems and implications. A VCS usually can’t detect renames of files or directories -- you have to explicitly tell the VCS what you’ve done. When you want to perform a synchronization you have to traverse the entire directory tree to find out what’s changed -- and this can be very time-consuming. The metadata has to be stored somewhere. Merges almost always require manual intervention and will often be unresolvable (either the user won’t know what to do and will just overwrite one side, or the file format won’t support lines-of-text style merging).

Also note the similar distinction between traditional client-server VCS (e.g. CVS, Perforce) and modern distributed VCS (Mercurial, git). Client-server VCS propagates the nodes (or the actual files being worked on). Distributed VCS propagates the edges (or diffs). Algorithm 1 is looking purely at the file data and attempting to match it on both sides; algorithm 2 is looking at the changes between the ‘sync points’ (or nodes) and propagating the changes.

The actions table for each file looks something like:

File A change	File B change	Action

Created (checksum P)	Created (checksum P)	Nothing
Created (checksum P)	Created (checksum Q)	Merge
Deleted	No change	Delete
Deleted	Deleted	Nothing
No change	No change	Nothing
Modified	No change	Use file A
Modified	Modified	Merge

(The actions for File A and File B can be interchanged -- I didn’t feel like writing out those cases twice.)
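
The table translates fairly directly into code. This is a sketch only: the function name and the string labels are made up, and two cases that the table doesn't list are my assumptions (a file created on one side only is copied across, and any other mismatch such as deleted-versus-modified is handed to the user).

```python
def action(change_a, change_b, checksum_a=None, checksum_b=None):
    """Look up the sync action for one file.

    Each change is "created", "deleted", "modified" or "none".
    Checksums only matter when both sides created the file.
    """
    if change_a == "none" and change_b == "none":
        return "nothing"
    if change_a == "deleted" and change_b == "deleted":
        return "nothing"
    if change_a == "created" and change_b == "created":
        return "nothing" if checksum_a == checksum_b else "merge"
    if change_b == "none":
        # Only A changed. (Created-on-A-only is an assumed case.)
        return "delete on B" if change_a == "deleted" else "use file A"
    if change_a == "none":
        # Mirror image: only B changed.
        return "delete on A" if change_b == "deleted" else "use file B"
    if change_a == "modified" and change_b == "modified":
        return "merge"
    # Not in the table (e.g. deleted on one side, modified on the other).
    return "ask user"
```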

If you include the possibility of renames (and horror of horrors, renames with modifies) then you can get a whole lot more combinations and it gets really nasty. I must give kudos to SourceGear for Vault for this: it does handle all of those nasty cases, a headache which I can do without.

Detecting what’s happened between time X and time Y is similarly mechanical. For a given file:

Time X	Time Y	Change
Does not exist	Exists	Created
Exists	Does not exist	Deleted
Checksum P	Checksum P	Nothing
Checksum P	Checksum Q	Modified
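
As a sketch, assuming each sync records a snapshot as a plain dict of path to checksum (the names detect_change and diff_snapshots are hypothetical), the detection table becomes:

```python
def detect_change(checksum_x, checksum_y):
    """Classify one file from its checksum at time X and time Y.

    None means the file did not exist at that time.
    """
    if checksum_x is None and checksum_y is None:
        return "nothing"
    if checksum_x is None:
        return "created"
    if checksum_y is None:
        return "deleted"
    return "nothing" if checksum_x == checksum_y else "modified"


def diff_snapshots(snapshot_x, snapshot_y):
    """Compare two {path: checksum} snapshots and keep only real changes."""
    changes = {}
    for path in sorted(set(snapshot_x) | set(snapshot_y)):
        change = detect_change(snapshot_x.get(path), snapshot_y.get(path))
        if change != "nothing":
            changes[path] = change
    return changes
```

Run it once per side (A's old snapshot against A's new one, then the same for B) and you have the two change lists that feed the actions table.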

Without having looked at the source code, I’d say this is the algorithm that Unison uses. I’d also guess that most ‘proper’ synchronization programs use this. It’s the simplest thing that works in most cases.

Note that you also need to be able to reliably detect a change in a file. The (almost) infallible way to do this is to hash the file. I say almost because hash collisions do happen -- they’re just extremely rare. ‘Extremely rare’ becomes a lot more common when you’re talking about a million files (32 bits of hash is not enough).
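
One way to put a number on this is the birthday approximation, which estimates the chance that any two files in a set share a hash. It assumes hash values are uniformly distributed, which real checksums only approximate, so treat this as a rough sketch rather than a guarantee.

```python
import math


def collision_probability(num_files, hash_bits):
    """Approximate chance that at least two of num_files share a hash.

    Birthday approximation: 1 - exp(-n(n-1)/2 / 2^bits).
    """
    pairs = num_files * (num_files - 1) / 2
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-pairs / 2 ** hash_bits)


for bits in (32, 64, 128):
    print(bits, "bits:", collision_probability(10 ** 6, bits))
```

With a million files, a 32-bit hash is all but certain to produce at least one colliding pair, while a 128-bit hash makes it vanishingly unlikely.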

The other option is to look at the modification time of the file. Software can and does manipulate the modtime, however, and you might miss changes. Users might change the system time and confuse your sync program (if a change was made a long time ago). You might not be syncing to a device that has a real-time clock (some mobile phones, notably). You also have to sync the times between the two systems, but that’s not too hard.

Aaaanyway, the gist of it is:

Checksums: reliable, slow (you have to read the entire contents of every file)
Modification time: less reliable, much faster
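
The two approaches look something like this in Python. SHA-256 is my choice for the sketch (any wide hash would do), and the function names are hypothetical.

```python
import hashlib
import os


def file_checksum(path, chunk_size=1 << 20):
    """Hash the whole file. Reliable, but reads every byte."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def quick_signature(path):
    """Size and modification time. Fast, but software can fake the time."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)
```

A reasonable compromise, which I haven't tested at scale, is to compare quick_signature first and only call file_checksum when it differs or when the machine is idle.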

Some filesystems such as JFFS2 keep a revision number on each block (roughly). If the revision number goes up, you can be assured that a write has happened regardless of what the modification time says. This is not a common feature, however, and probably not accessible to userspace programs anyway. There’s no easy solution here.

It still sucks. How to make it usable

Algorithm 2 (a.k.a. ‘what everyone is using’) has some shortcomings:

Detecting changes takes a long time
It won’t detect renames or directory moves
There are still some cases where you need to resolve conflicts and/or merge files

There are also some usability issues:

You need to manually initiate a sync. You can’t just pick up your laptop and go anytime.
Performance sucks. I may have mentioned that a dozen or so times.
There’s nothing to stop you modifying a file on both sides; you have to remember which is the most recent and remember to sync before working on the other machine.

Here’s how I’ll fix these problems.

Constantly monitor for changes

The existing tools require you to manually initiate a sync, at which point you’ll have a few minutes of disk grinding. I’d rather have the program running constantly and being notified of changes as they happen. The common case is that only a few files will change between syncs -- reading all of the files is inefficient.

What I want is an API that notifies me when files change (or are created or deleted). I think inotify will do this, perhaps FAM. I have no idea what to use on Windows or OSX yet. On a technical level, this is an unsolved problem.

There is a risk here that if files are modified while the application is not running (and hence not receiving notifications) the modifications could be lost.

The fallback option is to scan the file trees while the machine is idle. If you’re checksumming files to detect changes, this can happen during idle time as well.

I think idle time is a grossly underutilized resource right now -- we could be doing virus scanning, file indexing, backups and the like constantly instead of at intervals (3am cronjob) or while the user is trying to use the system (like most on-demand virus scanners).

Constantly synchronize changes

If you’re going to scan all of the time, you might as well copy files straight away rather than waiting until the user requests a sync. This will cut down the odds of a merge conflict somewhat, since the files are less likely to be modified simultaneously on both sides. This introduces the idea of a pair of machines being connected; while they are connected, their files are always synchronized. Since you’re probably modifying small amounts of data at a time, this will work reasonably well over a slow network connection.

Lock in-use files

Another way to prevent merge conflicts is to lock a file on machine A if it’s being written to on machine B. This prevents an application on machine A from modifying it at the same time.

Identify machines by a UUID rather than IP address

A common situation is to have a laptop and a desktop that you want synchronized together. You have the laptop at home and sync the files. You take the laptop to work, but because you’re on a different IP the sync program thinks it’s a different machine. If you give each machine a UUID or name, you can be (reasonably) sure of its identity and hence use the right indexes or file trees.

Checksum files or their metadata in order to detect renames

If you’ve got a checksum of each file (or just the modification time and size) and you detect a file deletion, you can look through any new files and see if they’re actually the same file. You can then infer that a file was moved or renamed rather than deleted and a new file created, saving time and bandwidth during the synchronization. It may be possible to optimize this further by looking at inode numbers or their equivalent on whatever filesystem is in use.
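
A minimal sketch of that inference, assuming the scanner has already produced two dicts of path to checksum (one for files that vanished, one for files that appeared). The function name is made up, and a real version would need a policy for several new files sharing one checksum; this one just takes the first match.

```python
def detect_renames(deleted, created):
    """Pair deleted paths with created paths that have the same checksum.

    Returns (renames, still_deleted, still_created), where renames maps
    old path -> new path.
    """
    by_checksum = {}
    for path, checksum in sorted(created.items()):
        by_checksum.setdefault(checksum, []).append(path)

    renames = {}
    for old_path, checksum in sorted(deleted.items()):
        candidates = by_checksum.get(checksum)
        if candidates:
            renames[old_path] = candidates.pop(0)

    moved_to = set(renames.values())
    still_deleted = {p for p in deleted if p not in renames}
    still_created = {p for p in created if p not in moved_to}
    return renames, still_deleted, still_created
```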

Later reflections

In my classic inability to actually focus on a single task for any length of time, I’ve been working on SyncDroid.

I’ve been attacking the tricky areas of data storage and what I refer to as the ‘datapath’ -- the chain of events that takes place between a change occurring on a computer and it propagating (across physical space and time) to another computer. I can partly explain why nobody has done this before: it’s really tricky.

Unison (and most other synchronizers) make some simplifying assumptions:

There is always a master computer and a slave computer
We only care about what is happening at this exact moment in time
We can synchronize the times on the two computers when the synchronization occurs
We can suck up as much CPU and IO time as we like while synchronization takes place

Unfortunately, none of these are true for SyncDroid. They have interesting consequences.

There is always a master and a slave

This makes configuration management really easy: you always look to the master computer. In network-connected hosts, the master is (by definition) contactable, so you can just tell it to update its configuration with any changes made on the slave end.

SyncDroid doesn’t have this luxury. In the case of USB-drive synchronization, the two computers cannot just tell each other about changes. So there’s an interesting sub-synchronization problem: in order to know what data we need to synchronize, we need to synchronize the configuration first.

We only care about a single moment in time

There’s really only one trap if you use this assumption: files might change between the time you detect a change and when you actually synchronize it. This is easy to solve if you take out an exclusive lock on the file-being-synchronized and ensure that it still looks like it did when you scanned it.

SyncDroid cares about lots of points in time. Because it syncs constantly, we have to be very careful about what state we think a file is in versus what state it actually is in. If you’re doing syncs to multiple partners, you have to keep track of all relevant metadata for all partners. If a partner goes away -- say the user loses the USB drive -- we shouldn’t waste time and resources tracking data that will never be used. And we can’t just rescan things constantly or lock files because that would hurt performance (or make it impossible for users to actually do work). I’m a user of this thing, too, and if it doesn’t perform acceptably, I won’t use it!

We can synchronize computer times easily

On a network-connected synchronizer, this is easy. You run some variation of the NTP protocol between the two hosts and calculate an offset so that you don’t disturb the user’s clock. You can then work out relative change timings and the best course of action.

Because this version of SyncDroid works over USB drives, it can’t synchronize times easily. I get around that with a ‘mountcount’ -- it’s just a number that is incremented every time the metadata on a drive is loaded. RAID arrays use the same idea to detect drives that were unplugged from an array and are now out-of-sync with the rest of the array. Each computer using a USB drive can then use the mountcount to determine relative change times without being dependent on the computer’s clock, which will probably be wrong.

The consequence of the mountcount is that multiple access to the metadata is strictly forbidden. This is reasonably easy to ensure and shouldn’t be visible to the user.
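
To illustrate the idea (and only the idea), here is a toy in-memory model. The class, its fields and the per-host bookkeeping are all assumptions made for this sketch; it is not SyncDroid's actual on-disk format.

```python
class DriveMetadata:
    """In-memory stand-in for the metadata stored on a USB drive."""

    def __init__(self):
        self.mountcount = 0
        self.changes = []    # (mountcount when recorded, path)
        self.last_seen = {}  # host id -> mountcount at that host's last load

    def load(self, host_id):
        """A host loads the metadata; return changes it hasn't seen yet."""
        previous = self.last_seen.get(host_id, 0)
        self.mountcount += 1
        self.last_seen[host_id] = self.mountcount
        return [path for stamp, path in self.changes if stamp > previous]

    def record_change(self, path):
        """Stamp a change with the current mountcount, not a wall clock."""
        self.changes.append((self.mountcount, path))
```

Ordering comes entirely from the counter, so it doesn't matter if one computer thinks it's 1970. It only works if exactly one host has the metadata loaded at a time, which is why multiple access has to be forbidden.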

We can suck up as much CPU and IO time as we like

This is a big one, and it’s one of the major reasons I started this project. None of the current synchronizers are sensitive to the user. Perhaps I’m a dreamer, but I would like my files to be synchronized without taking a massive hit in PC performance (or battery life).

Unison (as well as most synchronizers) will do exactly what you tell them to. If you say ’scan for changes’, they will scan right now. If you say propagate changes, they will propagate right now. While they are working, the computer is struggling under massive IO load, and if you have large amounts of data (like I do) that could lead to several minutes where the disk is spinning and you can’t use the computer and you have to sync right now because your plane is leaving but it’s still running and argh I’m going to be late.

SyncDroid has a fairly involved set of priorities to determine under what circumstances it should scan and sync and bookkeep. For example, it has two scanner types: a notification scanner (which uses the OS to determine when files have changed) and a comprehensive scanner (in case SyncDroid wasn’t running and you changed a file). The notification scanner runs all of the time, but if you’re on battery or using the computer, it just remembers the changes in RAM and gets out of the way as quickly as possible. The comprehensive scanner only runs when the computer is connected to power and you’re not using it. In this way, you get the effect of non-stop change scanning without any perceptible difference to your computer’s responsiveness.
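
Boiled right down, the rule for the two scanners could be written like this. The function and the labels are hypothetical, and the real set of priorities is more involved than two booleans.

```python
def scan_policy(on_battery, user_active):
    """Decide what each scanner should do right now."""
    busy = on_battery or user_active
    return {
        # Always listening; when busy it only queues changes in RAM.
        "notification_scanner": "buffer in RAM" if busy else "process now",
        # Full tree walk only on mains power with an idle user.
        "comprehensive_scanner": "paused" if busy else "running",
    }
```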

There is a big ‘but’ here, and it’s one of those annoying engineering tradeoffs: if you are not aggressive enough about scanning, you will miss changes (say, the user disconnects their laptop without warning). If you are too aggressive, you’ll slow down the computer. The trick is to find a set of tradeoffs that works well in most circumstances. In those cases that it doesn’t work, you can warn the user and give them an opportunity to fix the problem (by plugging the laptop back into the network for a minute, for example).

Data Storage

And then, there’s the hairy issue of where to put all of this data that we’re collecting. What we have is roughly a parallel filesystem to the one on the disk: for a file, we want to store some metadata. The best way to store this, from a design point of view, would be to store it in the filesystem itself, but this is impractical for a number of reasons (don’t want to change the user-visible view of their data, no filesystem support, differing semantics between systems, and so on).

So we have to create a filesystem within a filesystem. It’s another meta-problem like the sub-synchronization problem in configuration management. I considered doing this in the literal fashion -- creating an image on disk with a virtual ext2 filesystem. Instead of files, there would be structs of metadata that I had collected. Licensing issues were, well, issues here, and it would require me to maintain a fairly complicated data access layer. The big technical problem is that contemporary filesystems assume a constant-sized disk, while I wanted to be able to expand and shrink the image size dynamically.

My stopgap solution (while this is all stubbed out in my code) is to use a YAML file. I adore YAML. It is not a high-performance data access layer, however, and it was not designed as such. It’s just very easy to use.

Another option was a custom C data type -- or, phrased another way, ‘write my own filesystem’. Lots of effort. Transaction management is a big hairy problem that I don’t want to get into.

Finally, SQLite. I love SQLite -- it’s very easy to use and gives you very powerful query functionality. It handles on-disk consistency well and -- used sensibly -- can be very high-performance.

Many applications, sadly, do not use SQLite in a sensible fashion. (I’m looking at you, Meta-Tracker). Like any SQL database, you can do silly things to it that will absolutely destroy its performance characteristics. A classic in this situation is if you want a directory listing and your rows look like { filename | data }; the database needs to do a ’starts-with’ check on each row in the database because there’s no easy way to index efficiently by filename and retain simplistic tree-searching operations. This is Really Really Slow.

My current plan is to solve this by implementing a more traditional inode/parent structure within my database schema. I have the big advantage of knowing exactly which operations are necessary (read record by path+name, write record by id, create record by path+name, list children by path) and so can optimise specifically for them.
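
A rough sketch of what such a schema could look like, using Python's sqlite3 module. The table and column names are placeholders, not the real SyncDroid schema. The point is that the UNIQUE (parent_id, name) constraint gives SQLite an index that makes both 'find child by name' and 'list children' cheap, with no starts-with scan.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS node (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES node(id),
    name      TEXT NOT NULL,
    checksum  TEXT,
    UNIQUE (parent_id, name)
);
"""
ROOT_ID = 1


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    db.execute("INSERT OR IGNORE INTO node (id, parent_id, name) "
               "VALUES (?, NULL, '')", (ROOT_ID,))
    return db


def _child(db, parent_id, name):
    row = db.execute("SELECT id FROM node WHERE parent_id = ? AND name = ?",
                     (parent_id, name)).fetchone()
    return None if row is None else row[0]


def lookup(db, path):
    """Read record by path: walk down from the root one component at a time."""
    node_id = ROOT_ID
    for part in [p for p in path.split("/") if p]:
        node_id = _child(db, node_id, part)
        if node_id is None:
            return None
    return node_id


def create(db, path, checksum=None):
    """Create record by path, adding any missing parent directories."""
    node_id = ROOT_ID
    for part in [p for p in path.split("/") if p]:
        child = _child(db, node_id, part)
        if child is None:
            cur = db.execute("INSERT INTO node (parent_id, name) VALUES (?, ?)",
                             (node_id, part))
            child = cur.lastrowid
        node_id = child
    db.execute("UPDATE node SET checksum = ? WHERE id = ?", (checksum, node_id))
    return node_id


def list_children(db, path):
    """List children by path: a single indexed query on parent_id."""
    rows = db.execute("SELECT name FROM node WHERE parent_id = ? ORDER BY name",
                      (lookup(db, path),))
    return [name for (name,) in rows]
```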

Getting Started with the BlueSMiRF Silver V2 Bluetooth Module

ian@mutexlabs.com (Ian Howson) — Fri, 09 May 2008 00:00:00 +0000

Quick! Make it do something!

The board has the pinout printed on it. Connect 3.3V to the VCC pin and GND to the GND pin. The red LED should turn on. Hooray!

At this point, you won’t be able to see the module over Bluetooth. It starts up with the Bluetooth interface disabled. You need to send it some commands to get it started.

I connected my module to my PC with a SparkFun RS232 Shifter board. You can then use HyperTerminal (Windows) or minicom (Linux) to type commands directly to the module. Link the RTS/CTS pins on the module together while you’re at it. I connected all of this up with little IC test clips.

The terminal settings are 9600bps, 8N1, no flow control. Type ‘+++’ quickly to get to command mode on the module.

As a quick sanity check, type

ATI

The module should reply with:

1SPP - Ver: 1.2.5
OK

Send:

AT+BTSRV=1

and you should be able to pair with the module and send serial data through it.

Read on for more advanced usage.

AT Commands

The wonderful thing about standards is that there are so many to choose from.

In an interesting tip of the hat to history, the module uses AT commands for control, just like a serial modem. After you send an AT command to a modem, it will always end its reply with OK or ERROR. You can send an empty command (‘AT’) to confirm that the modem is responding as expected.

Every Bluetooth serial modem has a different command set. The BlueSMiRF Silver V2 uses the one for the Philips/NXP BGB203. NXP appears to dislike money and won’t give you programming info for their chips, even if you beg for it. To save us the hassle, SparkFun has dug up the programming info.

You can also get the BGB203 ‘datasheet’ from NXP, but it’s useless. Even after a few hours on the phone and offers to buy a lot of silicon, they wouldn’t give me anything better.

You want to read the User Guide backwards. It has:

the Table of Contents on the LAST PAGE
a tutorial in chapter 9, near the end
all of the parameters you need to get started at the back in ‘Default Configuration Parameters’
all of the AT commands in the middle
some useless stuff at the front, where it’s easy to find

The interesting commands are:

Get information on the firmware in the Bluetooth module:

ATI
1SPP - Ver: 1.2.5
OK

Get the Bluetooth display name:

AT+BTLNM
+BTLNM: "SparkFun-BT"
OK

Get the Bluetooth MAC address:

AT+BTBDA
+BTBDA: 031F08071729
OK

Get the UART parameters:

AT+BTURT
+BTURT: 9600, 8, 0, 1, 0
OK

Start the Bluetooth server on channel 1:

AT+BTSRV=1

Automating startup

Ultimately, you’re probably going to want to write a program to set all of this up automatically. The script that I use goes something like:

+++
AT&F
AT+BTLNM="somename"
AT+BTAUT=1, 0
AT+BTURT=115200, 8, 0, 1, 0
AT+BTSEC=0
AT+BTFLS
AT+BTSRV=1

with appropriate checks to make sure commands are actually executing properly.

This script:

resets to factory settings (so we know what state we’re in)
changes the Bluetooth display name to ‘somename’
allows automatic Bluetooth connections to the module
sets the module to 115200bps (at which point you will probably have to change the bit rate on your UART as well)
disables security so you don’t need a complex pairing process (naughty, but it makes prototyping a whole lot easier)
writes all of this to the Flash on the module
starts the Bluetooth server
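
The ‘appropriate checks’ could look something like the sketch below. It is not the script I actually use. It assumes a serial port object with write() and readline() methods (pyserial's Serial class has both, but any file-like object will do), that commands are terminated with a carriage return, and that the module is already in command mode. It doesn't send ‘+++’ and doesn't handle the bit rate changing underneath you after AT+BTURT.

```python
def send_command(port, command):
    """Send one AT command and collect reply lines up to OK or ERROR."""
    port.write((command + "\r").encode("ascii"))
    reply = []
    while True:
        raw = port.readline()
        if not raw:
            # An empty read is treated as a timeout.
            raise TimeoutError(f"no reply to {command!r}")
        line = raw.decode("ascii", errors="replace").strip()
        if line == "OK":
            return reply
        if line == "ERROR":
            raise RuntimeError(f"{command!r} was rejected")
        if line:
            reply.append(line)


def run_script(port, commands):
    """Run commands in order, stopping at the first failure."""
    return {command: send_command(port, command) for command in commands}
```

You would call run_script with the list of commands from the script above, reopening your UART at the new bit rate at the point where AT+BTURT takes effect.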

Tidbits

Any Bluetooth activity (querying, scanning, etc) seems to block the AT command interface.

If you’re typing commands by hand, don’t hit backspace. The command will be rejected. I recommend typing commands into a text editor and copy/pasting them into the terminal window so that you don’t make mistakes.

The module doesn’t appear to be case-sensitive to the AT commands, so you can type in lowercase and probably eliminate some errors that way.

Kinesis Advantage keyboard and learning Dvorak

ian@mutexlabs.com (Ian Howson) — Sun, 27 Jan 2008 00:00:00 +0000

I bought a Kinesis Advantage keyboard with the intention of reducing my finger pain associated with typing. Obviously, I spend a good portion of every day typing, and my livelihood basically depends on my being able to continue typing.

I also decided to learn the Dvorak layout while I was learning the Kinesis keyboard. I tried learning Dvorak a few years back but gave it up because I was working with a lot of other people’s keyboards as well – it was too inconvenient to keep switching layouts.

On the Kinesis/Dvorak learning process

Muscle memory is a huge factor in switching to a different keyboard or layout. Even now, when I’m typing on my laptop I instinctively reach for the Enter key with my right thumb, because that’s where it is on the Kinesis.
The Kinesis makes a lot of bad habits difficult, whether by accident or by design. You can’t really rest your hands on the pads while typing because then you can’t reach all of the keys. You can’t twist your hands around to move them around the layout because the keys are aligned to suit your hands in the home position. When you move your hands around the layout, suddenly they’re un-aligned and awkward.
I started out using a lot of ‘mental CPU’ time to handle the conversion. In the beginning, it took all of my concentration just to hit the right keys -- I had to separate my thinking from my typing.
While learning Dvorak, I noticed an interesting progression; I started out pressing just single keys at a time. Gradually, I started combining strings of keys into single motions (something I call ‘chording’, which I’ll come back to). This is similar to how a child learns to read -- they recognize single letters, expand out to sounds and eventually can string together words.
I made rapid progress for about three weeks. The first few days were difficult. I only used the Kinesis+Dvorak for a couple of hours each day because it was very frustrating to learn. I’ve had 20 years with a nice solid brain-keyboard link via the keyboard, and suddenly it’s horribly slow and error-prone. After the first few days things settled down a bit and I could manage an entire day’s work on the Kinesis.
After the first three weeks, progress slowed. I was still improving, but more in the areas of chording and accuracy. Some keys still gave me consistent problems on the Dvorak layout, particularly G and P.
You need to sit a little higher than with a normal keyboard. This is a problem for me -- I already find standard office chairs too short (even the Aeron!). When time allows, I’ll be buying a nicer chair and getting the gas-lift swapped for a taller one.
Why the A and S keys -- two of the most common characters in English text -- are on my two weakest fingers, I’ll never know. The pinkies get more of a workout than usual because they’re handling all of the keys on the edge of the keyboard, too. Placing these two common letters on already weak and overutilized fingers is probably the biggest flaw in the Dvorak keyboard that I’ve found.
Dvorak is lovely for English text. It’s just a great feeling to feel the letters whiz by with so little effort. However, a lot of my typing load is not English text. It’s Linux terminal navigation, C and Python source code, all of which intentionally discard vowels in exchange for brevity. This makes Dvorak’s plan to alternate hands work very poorly -- you’re just not typing anything on the left hand.
Some common Unix commands are absolutely worst-case scenarios on Dvorak. Take the command ‘ls -l’, which I would type dozens of times per day. On Dvorak, L, S and hyphen are all on the right pinkie. If you’re using a standard keyboard, so is the Enter key. It’s really, really unpleasant to type ‘correctly’.
Naturally, position-based key bindings don’t work: Vim’s HJKL, WASD for games, Ctrl-Z/X/C/V for word processing.

I really tried to make Dvorak work. I gave it a month. At the end of the month, I switched back to QWERTY. It’s a fantastic layout for cranking out lots of English text, but that’s not my use case.

On the Kinesis keyboard

The up and down arrows are backwards compared with the Vim convention. That’s relatively easy to remap on the keyboard.
The macros are rather buggy. Actually, scratch that. The keyboard’s firmware is rubbish. I suspect that either the authors had never programmed on a microcontroller before or were not trained as programmers to begin with. I can crash the keyboard in two fairly common scenarios. That said, the crap firmware is not a reason to not buy the keyboard. It’s perfectly usable despite its flaws.
I opened it up to look inside. It’s pretty typical of small-scale electronics manufacturing: no surface-mount parts, revisions hot-glued to the case, off-the-shelf components. It’s not badly made -- it feels very solid for its weight -- but it does have a few rough edges which might surprise you if you’re expecting a mass-produced product.
It’s ripe for hacking. The main controller is an Atmel AT89S series microcontroller. The macro RAM is on standard serial EEPROMs. There’s even a socket for a second one (the Advantage Pro upgrade). Apparently the firmware can be changed over the PS2 or USB port, but Kinesis didn’t seem too willing to send it to me when I mentioned I wanted to modify it.

Where to from here?

I still want to improve the keyboard layout. The Kinesis makes typing less painful, but some of my pains appear to be linked to the QWERTY layout. And the feeling of effortlessly flying through English text with Dvorak was just amazing.

To come up with a better keyboard layout, I want to log my keystrokes for a month. Each keystroke will be tagged with the time and the active process. The process lets me figure out whether the keystroke intent was a letter or a position. I can also detect errors by tracking Backspace presses. With that information, I can determine exactly which keystrokes or combinations are the most common for me.
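
I haven't written the logger yet, but the analysis end might look like this sketch. It assumes the log is a list of (timestamp, process, key) tuples in time order, with the Backspace key recorded as the string "Backspace"; both of those are made up for the example.

```python
from collections import Counter


def summarise(log):
    """Summarise a keystroke log of (timestamp, process, key) tuples."""
    keys = Counter(key for _time, _process, key in log)
    digraphs = Counter()
    for (_, proc_a, key_a), (_, proc_b, key_b) in zip(log, log[1:]):
        if proc_a == proc_b:  # ignore pairs that straddle a window switch
            digraphs[(key_a, key_b)] += 1
    # Crude error estimate: share of all keystrokes that were Backspace.
    error_rate = keys["Backspace"] / len(log) if log else 0.0
    return keys, digraphs, error_rate
```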

In addition, I want a ‘trainer’ – a program that will prompt me with an arbitrary series of keystrokes and time how long it takes me to hit them. This will give me information on how strong and fast my fingers are and if any of them are particularly error-prone. From that, I can generate a map of the keyboard, each key associated with a ‘performance’ score. Combining the two datasets, I can then come up with an ideal keymap for me, given my typical usage patterns and my own brain-keyboard performance data.
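
The crudest possible way to combine the two datasets is a greedy assignment: the most common character gets the best-scoring key position, and so on down the list. This sketch assumes a higher score means a better position and ignores digraphs entirely, so it's a starting point rather than an ideal keymap. The names are hypothetical.

```python
def greedy_layout(char_counts, position_scores):
    """Map each character to a key position, most frequent to best-scoring.

    Ties are broken alphabetically so the result is repeatable.
    """
    chars = sorted(char_counts, key=lambda c: (-char_counts[c], c))
    positions = sorted(position_scores, key=lambda p: (-position_scores[p], p))
    # zip stops at the shorter list, so spare characters stay unassigned.
    return dict(zip(chars, positions))
```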

I’d also like to integrate information on common digraphs, but I’m not sure how best to use them. I’m not sure about Dvorak’s assertion that alternating hands is the best thing to do. A common case for me is the ‘chording’ I mentioned previously, where a single hand can hit a sequence of keys very rapidly. The timing is simpler – I arrange my hand correctly on the keys, then use the individual fingers to press them in sequence. Of course, this sounds like the sort of thing that might cause tendon damage. But it’s fast.

There’s more discussion on chording and performance here.

Migrating your data between todo list programs

ian@mutexlabs.com (Ian Howson) — Sun, 27 Jan 2008 00:00:00 +0000

There are lots of great options for todo-list tracking these days. Unfortunately, most don’t make it easy to export or import your data. Here’s a list of scripts that I’ve found to perform conversions. Please comment if you know of more!

From Things

There isn’t currently any way to export repeating tasks automatically via AppleScript.

The Things AppleScript guide

To OmniFocus

To The Hit List

My Python script to convert from Things to The Hit List

To plain text

https://github.com/thepoch/ExportThings

From The Hit List

To OmniFocus

http://forums.omnigroup.com/archive/index.php/t-13565.html

Introduction

ian@mutexlabs.com (Ian Howson) — Mon, 20 Oct 2003 00:00:00 +0000

Motivation

The use of cryptography is growing rapidly with the adoption of computer technology. The design of cryptographic ciphers is still not well understood; we cannot prove the security of an algorithm. Currently, the only way to be sure of the security of an algorithm is to study it for a long period of time and use the absence of attacks as evidence confirming its security.

All ciphers are vulnerable to an exhaustive key search attack. An attacker can try every single possible key to check its correctness. This is time consuming, but feasible for several widely deployed ciphers.

An obvious way to conduct an exhaustive key search attack is to write software that will check each key in turn. Current microprocessors have clock rates in the gigahertz range and can execute several instructions per clock cycle. They are also cheap, highly available and easy to program.
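As a minimal sketch of the software approach, the loop below tries every key of a toy 16-bit cipher against one known plaintext/ciphertext pair. The cipher (`toy_encrypt`) is a made-up stand-in and not a real algorithm; only the structure of the search is the point.

```python
def toy_encrypt(key: int, block: int) -> int:
    """Hypothetical 16-bit toy cipher, for illustration only (not secure)."""
    x = block & 0xFFFF
    swapped = ((key >> 8) | (key << 8)) & 0xFFFF  # byte-swapped key
    for round_no in range(4):
        x = (x + key + round_no) & 0xFFFF         # add key material
        x = ((x << 3) | (x >> 13)) & 0xFFFF       # rotate left by 3
        x ^= swapped                              # xor with the swapped key
    return x


def exhaustive_search(plaintext: int, ciphertext: int, key_bits: int = 16):
    """Return every key that maps the known plaintext to the known ciphertext."""
    return [key for key in range(1 << key_bits)
            if toy_encrypt(key, plaintext) == ciphertext]
```

A single known pair may match more than one key, so the result is a list of candidates that would need checking against further data.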

Another possibility is to use a Field Programmable Gate Array (FPGA) device to conduct an exhaustive key search. FPGAs provide the functionality of a custom chip without the high up-front cost and lead time. They have much lower clock rates than general-purpose CPUs, but can be designed to perform one task exceptionally well. Parallelism can also be exploited to increase the overall search rate.

We thus have several questions requiring investigation with regard to FPGA technology in cryptanalysis:

Which cryptanalytic tasks can FPGAs complete more quickly than CPUs?
What are the price/performance benefits of FPGAs over CPUs?
What other technologies are there that might allow us to complete these tasks faster or cheaper?

Why exhaustive key search?

Exhaustive key search is guaranteed to be a possible attack for any cipher, but not necessarily feasible. Most new ciphers that are being deployed have key lengths of 128 bits or greater. A cipher with such a key length cannot be feasibly attacked with current technology. Nevertheless, there are many reasons why conducting research into exhaustive key search attacks is worthwhile.

A lot of currently deployed encryption is vulnerable to key search attacks. The default encryption used by GSM mobile phones and 802.11b wireless networks uses a key which is short enough to facilitate exhaustive key search. The DES cipher was widely deployed in the banking industry (amongst others) and is vulnerable. Many websites using SSL encryption are also vulnerable.

Export restrictions in many areas prevent the use of ciphers with long key lengths. The United States has a history of restricting the export of strong cryptography, often using key length as a deciding factor. The Wassenaar Arrangement stipulates similar limitations and is enforced by 33 countries around the world, including Australia.

Small embedded devices may not be able to support ciphers with long key lengths. Cheap smart card devices containing encryption software are becoming more widespread. In order to meet cost or size constraints, many of these devices use very short key lengths or known weak ciphers. The encrypted data transmitted from many of these devices can be attacked using exhaustive key search.

Many other attacks employ an exhaustive key search. Many attacks work by reducing the key space to an amount which can be feasibly searched or by removing large sections of the key space that can be proven to not contain the target key. Time/memory tradeoff attacks usually require a large preprocessing step which resembles key search. Both of these attack types require a key search to be conducted as part of their operation.

Key search machines can be useful research tools. Research into other attacks may require a cipher to perform particular operations or to generate plaintext or ciphertext with certain characteristics. Exhaustive key search can be used to achieve this.

Weak encryption has been used extensively in the past. Significant amounts of information have been encrypted with ciphers that are vulnerable to exhaustive key search or other attacks. Encrypted data could be stored until the technology or techniques to reveal that data become available. Key search machines may still be able to reveal valuable information that was encrypted in the past. Similarly, future technology may be able to reveal even today’s strongly encrypted data.

Exhaustive key search is highly parallelisable. This makes it a valuable application with which to experiment with parallel computing techniques.

Approach

In order to determine the utility of FPGAs when conducting exhaustive key search attacks, we need to consider their potential price and performance benefits over other technologies such as ASICs and CPUs. Pricing data can be obtained from suppliers, while performance data can be gathered from implementations. Performing implementations should also provide useful insights into the issues involved with cipher and key search machine design.

CPU pricing can be obtained from suppliers and performance measured with benchmark software. ASIC price and performance estimates can be obtained from suppliers.

The optimal family and device within each technology can be determined by computing the price for a certain search rate. Comparing price/performance ratios between technologies for different ciphers will help to determine which technology is best under what conditions.

From these analyses, it should be possible to recognise situations where FPGAs can be beneficial in key search applications.

Thesis organisation

Chapter 2 describes all of the past work, theory and knowledge that will be needed to understand the remainder of the thesis. It also sets the context for the new developments made by this thesis. Chapter 3 describes the design and implementation work that was performed in order to gather meaningful data. It allows the data analysis to use real-world data. Chapter 4 analyses the gathered data to form conclusions on a wide variety of areas, and forms the bulk of this thesis. Chapter 5 summarises the conclusions and provides directions for future work.

FPGA price/performance tables

ian@mutexlabs.com (Ian Howson) — Mon, 20 Oct 2003 00:00:00 +0000

FPGA pricing is specified in USD and was obtained from Avnet [45] on October 15th, 2003. In all cases, the cheapest package available was used; this was usually also the smallest package.

Spartan 3 devices only started shipping recently, and pricing is still highly unstable. No price could be obtained for the XC3S200 device.

RC4 performance figures use the same relative performance ratios as RC5; both are RAM-based cipher implementations. Resource figures were inferred from [21].

Family	Speed	RC5 MHz	RC5 SU slices	RC5 SU RAM	DES MHz	DES SU slices	DES SU RAM
Virtex II Pro	-5	138	666	1	238	1774	0
Virtex II	-4	120	663	1	209	1774	0
Spartan 3	-4	156	657	1	280	1774	0
Spartan IIE	-6	103	700	2	149	1806	0
Virtex E	-6	103	700	2	149	1806	0

FPGA	Speed	Package	Price	DES Mk/s	DES $/Mk/s	RC5 Mk/s	RC5 $/Mk/s	RC4 Mk/s	RC4 $/Mk/s
XC2VP2	-5	FG256C	$62	0	–	0.59	$104.59	1.02	$60.63
XC2VP4	-5	FG256C	$113	238	$0.48	1.18	$96.26	2.37	$47.83
XC2VP7	-5	FG456C	$176	476	$0.37	2.06	$85.45	3.72	$47.28
XC2VP20	-5	FG676C	$299	1190	$0.25	4.12	$72.58	7.44	$40.16
XC2VP30	-5	FG676C	$508	1666	$0.31	5.88	$86.36	11.51	$44.17
XC2VP40	-5	FG676C	$790	2380	$0.33	8.53	$92.56	16.24	$48.63
XC2VP50	-5	FF1152C	$1477	3094	$0.48	10.30	$143.45	19.63	$75.27
XC2VP70	-5	FF1517C	$2256	4284	$0.53	14.71	$153.35	27.75	$81.31
XC2VP100	-5	FF1696C	$5579	5712	$0.98	19.42	$287.29	37.56	$148.54
XC2V40	-4	CS144C	$22	0	–	0.00	–	0.29	$76.04
XC2V80	-4	CS144C	$28	0	–	0.00	–	0.59	$48.35
XC2V250	-4	FG256C	$79	0	–	0.51	$155.09	1.76	$45.16
XC2V500	-4	FG256C	$134	209	$0.64	1.02	$131.12	2.34	$57.27
XC2V1000	-4	FG256C	$195	418	$0.47	1.79	$108.71	2.93	$66.47
XC2V1500	-4	FG676C	$301	836	$0.36	2.81	$107.09	3.52	$85.74
XC2V2000	-4	FG676C	$428	1254	$0.34	4.09	$104.52	4.10	$104.34
XC2V3000	-4	FG676C	$658	1672	$0.39	5.37	$122.42	7.03	$93.57
XC2V4000	-4	FF1152C	$1552	2508	$0.62	8.70	$178.42	8.79	$176.62
XC2V6000	-4	FF1152C	$2936	3971	$0.74	13.05	$224.99	10.55	$278.40
XC2V8000	-4	FF1152C	$7446	5434	$1.37	17.91	$415.73	12.30	$605.21
XC3S50	-4	VQ100	$9	0	–	0.33	$27.06	0.38	$23.45
XC3S200	-4	VQ100	$16	280	$0.06	0.67	$24.05	1.15	$13.89
XC3S400	-4	TQ144	$24	560	$0.04	1.66	$14.43	1.54	$15.63
XC3S1000	-4	FT256	$67	1120	$0.06	3.66	$18.31	2.30	$29.09
XC2S50E	-6	TQ144C	$12	0	–	0.22	$56.35	0.51	$24.50
XC2S100E	-6	TQ144C	$15	0	–	0.22	$69.12	0.63	$24.05
XC2S150E	-6	PQ208C	$21	0	–	0.44	$46.96	0.76	$27.23
XC2S200E	-6	PQ208C	$25	149	$0.17	0.66	$37.65	0.88	$28.07
XC2S300E	-6	PQ208C	$39	149	$0.26	0.88	$44.45	1.01	$38.66
XC2S400E	-6	FT256C	$61	298	$0.20	1.32	$46.29	2.53	$24.15
XC2S600E	-6	FG456C	$153	447	$0.34	1.98	$77.36	4.55	$33.64
XCV50E	-6	CS144C	$33	0	–	0.22	$150.76	1.01	$32.78
XCV100E	-6	CS144C	$49	0	–	0.22	$224.62	1.26	$39.07
XCV200E	-6	CS144C	$87	149	$0.58	0.66	$131.98	1.77	$49.19
XCV300E	-6	PQ240C	$144	149	$0.97	0.88	$164.04	2.02	$71.33
XCV400E	-6	PQ240C	$222	298	$0.75	1.32	$168.63	2.53	$87.99
XCV600E	-6	HQ240C	$376	447	$0.84	1.98	$190.23	4.55	$82.72
XCV1000E	-6	HQ240C	$938	894	$1.05	3.73	$251.32	6.06	$154.82
XCV1600E	-6	BG560C	$1522	1192	$1.28	4.83	$315.10	9.09	$167.46
XCV2000E	-6	BG560C	$2142	1490	$1.44	5.93	$361.19	10.10	$212.03
XCV2600E	-6	FG1156C	$4620	2086	$2.21	7.91	$584.35	11.62	$397.72
XCV3200E	-6	CG1156CES	$6155	2533	$2.43	10.10	$609.21	13.13	$468.69
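The $/Mk/s columns are simply the device price divided by its search rate. The sketch below recomputes that ratio and, as an extra that is my own addition, estimates how many devices a target search rate would need. The function names are hypothetical. Published figures were presumably computed from unrounded rates, so recomputing from the rounded table values can differ slightly in the last digits.

```python
import math


def dollars_per_mkeys(price_usd: float, mkeys_per_sec: float):
    """Price per million keys/second; None when no search unit fits (rate 0)."""
    if mkeys_per_sec == 0:
        return None
    return price_usd / mkeys_per_sec


def cost_for_rate(price_usd: float, mkeys_per_sec: float, target_mkeys_per_sec: float):
    """Devices needed (rounded up) and their total price for a target rate."""
    if mkeys_per_sec <= 0:
        raise ValueError("device cannot search at all")
    devices = math.ceil(target_mkeys_per_sec / mkeys_per_sec)
    return devices, devices * price_usd


# XC2VP20 running DES: $299 at 1190 Mk/s
print(round(dollars_per_mkeys(299, 1190), 2))
```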

References

[21] K. L. K.H. Tsoi and P. Leong, “A massively parallel RC4 key search engine,” in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2002, pp. 13–21. [Online]. Available: http://www.cse.cuhk.edu.hk/~phwl/papers/vrvw_fccm02.pdf

[45] (2003, October) Avnet electronics marketing. [Online]. Available: http://em.avnet.com/

CPU benchmark results

ian@mutexlabs.com (Ian Howson) — Mon, 20 Oct 2003 00:00:00 +0000

Processor	Clock rate (MHz)	Core	DES speed (Mkeys/sec)
Pentium IV	2533	SolNET (BrydDES)	10.3
Athlon 2500+ (Barton)	1833	d.net (Byte Bryd)	10.2
Athlon XP 1900+	1600	SolNET (BrydDES)	9.0
Pentium IV	1800	SolNET (BrydDES)	7.5
Pentium IV-M	1700	SolNET (BrydDES)	7.5
Duron	1000	SolNET (BrydDES)	5.4
Pentium III-M	1000	SolNET (BrydDES)	4.3
Pentium MMX	200	d.net (MMX bitslice)	2.9
Pentium II	233	d.net (MMX bitslice)	2.6
Celeron-A	450	SolNET (BrydDES)	2.0
Pentium II	233	SolNET (BrydDES)	1.1
Pentium MMX	200	SolNET (BrydDES)	1.0

Processor	Clock rate (MHz)	Core	RC5 speed (Mkeys/sec)
Athlon XP 2500+ (Barton)	1833	SS 2-pipe	6.0
Athlon XP 1900+	1600	SS 2-pipe	5.3
Pentium IV HT	3060	DG 3-pipe	4.3
Pentium IV	2533	DG 3-pipe	3.5
Duron	1000	SS 2-pipe	3.1
Pentium IV-M	1700	DG 3-pipe	2.4
PowerPC 740/750 G3	900	MH 1-pipe	2.3
Pentium III-M	1000	SES 2-pipe	2.1
Pentium III	533	SES 2-pipe	1.1
Celeron-A	450	SES 2-pipe	0.9
Pentium II	233	SES 2-pipe	0.5

CPU price/performance tables

ian@mutexlabs.com (Ian Howson) — Mon, 20 Oct 2003 00:00:00 +0000

Family	Rating	MHz	Price	RC5 Mk/s	RC5 $/Mk/s	DES Mk/s	DES $/Mk/s
Athlon XP	1900+	1600		5.3		9.0
Athlon XP	2000+	1667	$101	5.5	$18.27	9.4	$10.76
Athlon XP	2200+	1800	$108	6.0	$18.14	10.1	$10.68
Athlon XP	2400+	2000	$125	6.6	$18.80	11.3	$11.07
Athlon XP (Barton)	2500+	1833	$135	6.0	$22.58	10.3	$13.14
Athlon XP (Barton)	2600+	2083	$156	6.8	$22.93	11.7	$13.34
Athlon XP (Barton)	2700+	2167	$212	7.1	$29.86	12.2	$17.38
Athlon XP (Barton)	2800+	2086	$275	6.8	$40.34	11.7	$23.48
Athlon XP (Barton)	3000+	2167	$397	7.1	$56.01	12.2	$32.59
Athlon XP (Barton)	3200+	2250	$687	7.4	$93.32	12.7	$54.30
Duron		1000		3.1		5.4
Duron		1400	$51	4.3	$11.73	7.6	$6.73
Duron		1600	$58	5.0	$11.73	8.6	$6.73
Celeron		2000	$94	2.8	$33.44	8.1	$11.51
Celeron		2200	$102	3.1	$32.85	8.9	$11.38
Celeron		2400	$115	3.4	$33.96	9.8	$11.83
Celeron		2500	$123	3.5	$35.07	10.2	$12.07
Celeron		2600	$130	3.6	$36.11	10.6	$12.30
Pentium 4		1800				7.5
Pentium 4		2400	$248	3.3	$74.84	9.8	$25.43
Pentium 4		2533		3.5		10.3
Pentium 4		2667	$287	3.7	$77.95	10.8	$26.49
Pentium 4		2800	$391	3.9	$101.04	11.4	$34.33
Pentium 4		3060	$586	4.2	$138.68	12.4	$47.12
Pentium 4 HT		2400	$265	3.4	$78.44	9.8	$27.11
Pentium 4 HT		2667	$320	3.7	$85.38	10.8	$29.51
Pentium 4 HT		2800	$405	3.9	$102.82	11.4	$35.53
Pentium 4 HT		3060	$603	4.3	$140.17	12.4	$48.44
Pentium 4 HT		3200	$920	4.5	$204.59	13.0	$70.70

Benchmark results that were directly gathered are shown in bold type. Benchmark results that were obtained from the distributed.net database are shown in italics.

Key search machine 2 interface

ian@mutexlabs.com (Ian Howson) — Mon, 20 Oct 2003 00:00:00 +0000

Registers

The registers available to the programmer are:


Address	Name	Description	Read/write
0	BUFFER	See text	Read/write
1	CTEXT	Sets the ciphertext to use	Write only
2	PTEXT	Sets the plaintext to use	Write only
3	IV	Sets the initialisation vector to use	Write only

The BUFFER register has the following format:

Bits	Name	Description
0–3	VERSION	Protocol version (2)
4	DVALID	Set when the machine is ready for a new command
5	unused
6	RW	Specifies whether this command describes a read or write operation
7	unused
8–15	ADDR	Specifies the target address for the write or read
16–63	DATA	See text

VERSION and DVALID function identically to the original key search machine. In this version of the machine the BUFFER register indirectly controls the search bus. A write is performed by setting RW to 1 and specifying the data in the DATA field. A read is performed by writing a word with RW set to 0 and then polling the BUFFER register until DVALID goes high. The data will be contained in the space allocated to the DATA field.
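The BUFFER layout can be sketched as a pair of pack/unpack helpers. This assumes bit 0 is the least significant bit of a 64 bit word, which the table does not state explicitly. The function names are hypothetical and not taken from the original driver software.

```python
VERSION_MASK = 0xF            # bits 0-3
DVALID_BIT = 1 << 4           # bit 4
RW_BIT = 1 << 6               # bit 6
ADDR_SHIFT, ADDR_MASK = 8, 0xFF               # bits 8-15
DATA_SHIFT, DATA_MASK = 16, (1 << 48) - 1     # bits 16-63


def make_command(addr: int, write: bool, data: int = 0) -> int:
    """Build the word written to BUFFER to start a read or write."""
    if not 0 <= addr <= ADDR_MASK:
        raise ValueError("ADDR is 8 bits wide")
    if not 0 <= data <= DATA_MASK:
        raise ValueError("DATA is 48 bits wide")
    word = (addr << ADDR_SHIFT) | (data << DATA_SHIFT)
    if write:
        word |= RW_BIT        # RW = 1 marks a write
    return word


def parse_buffer(word: int) -> dict:
    """Split a word read back from BUFFER into its named fields."""
    return {
        "version": word & VERSION_MASK,
        "dvalid": bool(word & DVALID_BIT),
        "rw": bool(word & RW_BIT),
        "addr": (word >> ADDR_SHIFT) & ADDR_MASK,
        "data": (word >> DATA_SHIFT) & DATA_MASK,
    }
```

A read would then be `make_command(addr, write=False)`, followed by repeated reads of BUFFER until `parse_buffer(word)["dvalid"]` is true.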

A read or write through the BUFFER register always sets or retrieves the key in use by a search unit. The exact interpretation of the DATA field depends on the key generator in use. The intended purpose is for DATA to be interpreted as a block number during a write, and treated as the key number (least significant 32 bits) during a read.

When reading through the BUFFER register, 32 bits are used by the key value. The other 8 bits are used by the search unit to report status information:

Bits	Name	Description
0	KEYVALID	Set when the search unit has a key block to search through
1	RUNNING	Set when the search unit is searching its key block
2–7	unused
8–47	KEY	The least significant 32 bits of the key value

Operation

Software checks presence and version of board by reading BUFFER register
If VERSION is 0, program complains that FPGA has not been programmed
If VERSION is not 2, program complains that software version does not match or FPGA is incorrectly programmed
For each address where a search unit is believed to exist
1. Software performs a read-through-BUFFER operation on the appropriate address
2. If the key returned is 1, a search unit exists at that address
Software writes CTEXT, PTEXT and IV registers
For each search unit:
1. Software writes initial key into search unit with a write-through-BUFFER operation
Until correct key is located:
1. Software polls the RUNNING bit of each known search unit in turn.
2. If RUNNING on a search unit is 0:
  1. Record the key value as a potential key
  2. Write the block number to the search unit so that it continues searching from the same point

The key that is read from the key buffer is the value that was in the key generator at the time the search unit was halted, not the key that caused the search unit to halt. The software must be aware of the number of clock cycles required to process a single key and subtract that value from the retrieved value. This value is algorithm dependent.
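The correction described above might look like the following sketch. It assumes a 32 bit counter within each block, and that the software has tracked which block the search unit was given. `latency_cycles` stands for the algorithm-dependent value mentioned in the text; the function name is hypothetical.

```python
def recover_candidate_key(block_number: int, key_read: int,
                          latency_cycles: int, counter_bits: int = 32) -> int:
    """Undo the key generator's head start and rebuild the full candidate key."""
    mask = (1 << counter_bits) - 1
    # The generator kept counting while the matching key was still being
    # processed, so step back; wrap within the block if the counter rolled over.
    low = (key_read - latency_cycles) & mask
    return (block_number << counter_bits) | low
```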

Key search engine 1 interface

ian@mutexlabs.com (Ian Howson) — Mon, 20 Oct 2003 00:00:00 +0000

Description

The interface allows the following operations to be performed:

retrieve the status of the board
retrieve potential keys
set the ciphertext, plaintext and IV to be used by all search units
access and detect all search units controlled by the machine
obtain the status of a single search unit
set or retrieve the next key that will be processed by a search unit

Registers

The controller provides a number of registers which allow the computer to access the machine’s resources.

Address	Name	Description	Read/write
0	STATUS	Status register	Read only
1	NEXTKEY	Retrieves the next potential key from the buffer	Read only
2	CTEXT	Sets the known ciphertext	Write only
3	PTEXT	Sets the known plaintext	Write only
4	SUSEL	Select a search unit	Write only
5	SUKEY	Set or retrieve the current value of the key generator	Read/write
6	IV	Sets the initialisation vector	Write only

Only the least significant 3 bits of the address are decoded. All transfers are 64 bits wide. This makes supporting key lengths greater than 64 bits difficult.

When reading the STATUS register, the least significant word contains the following bits:

Bit	Name	Description
0–3	VERSION	Described below
4	BUFFER_FULL	Set when the key buffer is full
5	DVALID	Set when the machine is ready for a new command
6	SU_PRESENT	Set when the currently selected search unit exists
7	SU_RUNNING	Set when the currently selected search unit is running
8	BUFFER_EMPTY	Set when the key buffer is empty

VERSION specifies the version of the communication protocol. For this iteration of the design, the version is “0001”. It is also used to detect whether the board is programmed and operating properly. As such, it should never be “0000”. This catches the case where the FPGA has not been correctly programmed.

DVALID is set when the machine is ready for a new command, and cleared when a command is currently executing. Results from a write command should not be read until DVALID is set.

When written to, the SUSEL register selects a search unit. Any subsequent commands that operate on a specific search unit operate on the search unit specified in SUSEL. Writing to SUSEL also updates the value of the SUKEY register and the SU_PRESENT and SU_RUNNING bits in the STATUS register. The SUSEL register must be repeatedly written to in order to keep this data up to date.

SU_RUNNING is set when the last selected search unit is running, and cleared when the search unit is halted. A search unit might be halted if it has found a key and is waiting to have the key read, or if no initial key has been set.
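A small decoder for the STATUS word, written as a sketch. It assumes bit 0 is the least significant bit. The names `Status`, `decode_status` and `version_problem` are hypothetical and not part of the original software.

```python
from typing import NamedTuple, Optional


class Status(NamedTuple):
    version: int
    buffer_full: bool
    dvalid: bool
    su_present: bool
    su_running: bool
    buffer_empty: bool


def decode_status(word: int) -> Status:
    """Split the least significant word of STATUS into its fields."""
    return Status(
        version=word & 0xF,
        buffer_full=bool(word & (1 << 4)),
        dvalid=bool(word & (1 << 5)),
        su_present=bool(word & (1 << 6)),
        su_running=bool(word & (1 << 7)),
        buffer_empty=bool(word & (1 << 8)),
    )


def version_problem(status: Status, expected: int = 1) -> Optional[str]:
    """Return a complaint string, or None when the version is as expected."""
    if status.version == 0:
        return "FPGA has not been programmed"
    if status.version != expected:
        return "software version mismatch or FPGA incorrectly programmed"
    return None
```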

Operation

Software checks presence and version of board by reading STATUS register
If VERSION is 0, program complains that FPGA has not been programmed
If VERSION is not 1, program complains that software version does not match or FPGA is incorrectly programmed
For each address where a search unit is believed to exist
1. Software writes the address to SUSEL
2. Software polls STATUS until DVALID goes high
3. If SU_PRESENT is 1, the search unit exists and can be used; if 0, search unit does not exist
Software writes CTEXT, PTEXT and IV registers
For each search unit:
1. Software waits for DVALID flag to go high
2. Software selects a search unit (sets SUSEL register)
3. Software writes initial key into search unit (writes SUKEY register)
Until correct key is located:
1. Software polls STATUS register to determine if any potential keys have been located
2. If there is a pending key, read it out of the buffer

The key that is written into the key buffer is always the key that was in the key generator at the time the search unit was halted, not the key that caused the search unit to halt. The software must be aware of the number of clock cycles required to process a single key, and subtract that value from the value stored in the key buffer. This value is algorithm dependent.

Design

ian@mutexlabs.com (Ian Howson) — Mon, 20 Oct 2003 00:00:00 +0000

Conceptual design

Most key search machines are designed around similar ideas. A controller operates a number of independent search units. This controller usually interfaces with a general purpose computer. Each search unit contains a key generator, decryptor (or encryptor) and comparator. The key generator produces trial keys that need to be checked. Some designs combine the key generator and decryptor modules to improve performance. The decryptor decrypts the known ciphertext with the trial key. An encryptor can also be used with some limitations. The comparator checks the plaintext that is generated by the decryptor to see if it is correct. If it is, the controller is signalled.

Due to its complexity, the cipher module is usually considered to be the bottleneck in the system. All other modules must be able to operate at least as quickly.

Conceptual key search machine design

Key generator

A counter is an obvious choice for a key generator. The count sequence is predictable, but performance is not adequate on all devices. Xilinx FPGAs provide a dedicated carry chain which improves performance significantly.

The EFF DES cracker [10] uses a counter where the 24 most significant bits are held constant and the 32 least significant bits counted. This technique is useful to reduce a counter’s propagation delay. The most significant bits must be counted and loaded externally. This scheme introduces the idea of a “block” of keys – a subset of the key space which can be searched in a short period of time. The 24 constant bits can be viewed as the block number. There must be a mechanism with which the controller can detect the end of block condition and start the key generator on a new block.
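The block idea can be sketched in a few lines: the controller supplies the constant high bits, and the search unit counts through the low bits. The 24/32 split follows the description above; the function names are hypothetical.

```python
def keys_in_block(block_number: int, counter_bits: int = 32):
    """Yield every trial key in one block, in counting order."""
    base = block_number << counter_bits      # constant high bits, set externally
    for low in range(1 << counter_bits):     # counted inside the search unit
        yield base | low


def split_key(key: int, counter_bits: int = 32):
    """Return (block number, position within block) for a key."""
    return key >> counter_bits, key & ((1 << counter_bits) - 1)
```

Exhausting the generator corresponds to the end of block condition, at which point the controller would hand out a new block number.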

One design [12] uses a single counter that is shared between all of the available search units. Each unit adds or concatenates a unique ID to the counter value to obtain its trial key. This scheme works well when the number of search units is a power of two since the ID can simply be concatenated, saving resources.
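One possible reading of this scheme is sketched below. Placing the ID in the low bits of the key is my assumption, since the description only says the ID is added or concatenated. In the additive form the shared counter steps by the number of search units; in the concatenated form it steps by one and no adder is needed.

```python
def trial_key_add(shared_count: int, unit_id: int) -> int:
    """General case: shared counter steps by the unit count, each unit adds its ID."""
    return shared_count + unit_id


def trial_key_concat(shared_count: int, unit_id: int, id_bits: int) -> int:
    """Power-of-two case: the ID is simply wired in as the low bits."""
    return (shared_count << id_bits) | unit_id
```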

A number of designs [16], [19], [15] use Linear Feedback Shift Registers (LFSRs) to generate trial keys. The main advantage of an LFSR over a counter is its high speed; propagation delays remain constant regardless of the length of the LFSR. One disadvantage of LFSRs is that their count sequence is nonlinear. Evenly breaking up a large key space between search units requires more effort than with a linear counter. One simple scheme is to use a shorter LFSR than usual and set the remainder of the key bits to a constant value. This works similarly to the block scheme for linear counters described above; the LFSR can be 32 bits long, and the remaining 24 bits set by the controller.
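A sketch of the short-LFSR block scheme follows, using a Galois LFSR. For brevity the demonstration uses the well-known 16 bit tap mask 0xB400, which has the maximal period of 65535; a 32 bit generator as described above would need a 32 bit maximal-length polynomial instead. The function names are hypothetical, and the tap mask is assumed to have its top bit set so that the sequence always returns to the seed.

```python
def lfsr_step(state: int, taps: int) -> int:
    """Advance a right-shifting Galois LFSR by one step."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= taps
    return state


def lfsr_keys(block_number: int, seed: int, taps: int, lfsr_bits: int):
    """Yield trial keys: constant block number above, LFSR state below."""
    if seed == 0:
        raise ValueError("an LFSR never leaves the all-zero state")
    state = seed
    while True:
        yield (block_number << lfsr_bits) | state
        state = lfsr_step(state, taps)
        if state == seed:     # sequence has wrapped: end of block
            return
```

Note that the all-zero state never appears in the sequence, so the key with all-zero low bits in each block has to be checked separately.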

Cipher module

Encryptor vs. decryptor

When performing a known plaintext attack, the choice of encryptor or decryptor is dependent on which has higher performance. Most ciphers (including all stream ciphers) have identical performance regardless of their mode. Some are more efficient when implemented in one direction than in the other. RC5 [6] is an example of this. An RC5 encryptor can operate more efficiently than a decryptor because the order in which the S array is used during key setup matches that used in the encryption stage, allowing the phases to overlap.

Most key search machines will use a decryptor. Ciphertext-only attacks require a decryptor. Known-plaintext attacks will also require a decryptor under some conditions. This is to allow a more flexible comparator scheme that can detect correct plaintext regardless of imperfect knowledge.

Iterated vs. pipelined

Two major approaches are used when implementing the cipher module: a long pipeline or a small iterative module. Most ciphers are composed of a number of round functions, making an iterative implementation natural. Pipelined approaches can achieve much greater speeds at the expense of FPGA resources. DES is frequently implemented as either a small iterated module or a long pipeline. The iterated version takes a multiple of 16 cycles (one for each application of the round function) to produce one block of output, while the pipelined version can produce one block of output every clock cycle. The resource savings made by using an iterated cipher implementation almost never outweigh the loss of speed. Resource constraints may nevertheless force a cipher to be implemented in iterative form.
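The throughput difference is simple arithmetic, sketched here. Using the same 149MHz clock for both variants is a simplifying assumption; in practice the two designs would close timing at different rates. The function names are hypothetical.

```python
def keys_per_second(clock_hz: float, cycles_per_key: int) -> float:
    """Search rate of one unit that needs cycles_per_key clocks per trial key."""
    return clock_hz / cycles_per_key


def search_time_seconds(key_bits: int, rate_keys_per_sec: float) -> float:
    """Time to sweep the whole key space at a given rate."""
    return (2 ** key_bits) / rate_keys_per_sec


pipelined = keys_per_second(149e6, 1)    # one key per clock
iterated = keys_per_second(149e6, 16)    # 16 DES rounds per key
print(pipelined / iterated)              # the pipeline is 16 times faster per unit
```

The iterated unit only wins overall if more than 16 of them fit in the resources taken by one pipeline.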

An FPGA-Based Performance Evaluation of the AES Block Cipher Candidate Algorithm Finalists [32] explores these issues in depth. It presents FPGA performance figures for the MARS, RC6, Rijndael, Serpent and Twofish ciphers. Loop unrolling, pipelining and sub-pipelining are investigated as architectural choices. In most cases, a pipelined implementation was fastest. Fast DES Implementation for FPGAs and Its Application to a Universal Key-Search Machine [18] explores pipelined, combinatorial and iterative approaches for the DES cipher.

Partially evaluated circuits

Some ciphers may benefit from having values precomputed during compilation time. This is usually used to achieve higher performance in systems that have very infrequent key changes. FPGAs fit well with this approach, allowing the programmed-in key to be changed with only a small period of downtime. The speed efficiency of an FPGA DES implementation was improved significantly using this technique [33]. The utility of this technique in a key search machine is dependent on the cipher. The plaintext or ciphertext would be compiled into the design instead of the key. This may yield improvements for some ciphers, but exploratory experiments only showed very small resource savings.

Comparator

The environment that the key search machine operates in determines the choice of comparator. If a perfect ciphertext/plaintext pair is known, simply checking for bit equality will be adequate. Ignoring certain bits in the trial plaintext may be a useful extension when only a portion of the sought plaintext is known.
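Both variants reduce to an XOR and a mask, as in this sketch for 64 bit blocks. The function name and the `mask` parameter are hypothetical: a set bit in the mask means "this bit of the plaintext is known and must match".

```python
FULL_MASK = 0xFFFFFFFFFFFFFFFF


def plaintext_matches(trial: int, known: int, mask: int = FULL_MASK) -> bool:
    """True when trial equals known on every bit selected by mask."""
    return ((trial ^ known) & mask) == 0
```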

If a ciphertext-only attack will be attempted or the plaintext is not precisely known, it may be necessary to implement a heuristic matching scheme. Such a scheme will generally flag a number of keys as potential matches and allow humans or software to check them further for correctness.

A simple scheme to detect ASCII text is to require that the most significant bit of each plaintext byte be 0. This can be further generalised into a statistical approach that scores each byte in the plaintext according to its probability of occurrence. A Programmable Plaintext Recognizer [34] uses similar ideas to extend Wiener’s theoretical key search machine [15]. Applying compression to a message before encrypting it causes their heuristics to fail. This is an effective countermeasure against any statistical comparator, since the compression makes the message “look like” random data.
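The most-significant-bit test is a single AND over a 64 bit block, as this sketch shows (the function name is hypothetical).

```python
HIGH_BITS = 0x8080808080808080   # top bit of each of the 8 bytes


def looks_like_ascii(block64: int) -> bool:
    """True when no byte in the block has its most significant bit set."""
    return (block64 & HIGH_BITS) == 0
```

A random 64 bit block passes this test with probability 1 in 256, since eight independent bits must all be zero, so it flags many false candidates for later checking.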

Some applications may also benefit from a specialist comparator. A machine designed to solve the Blaze Challenge [23] would need a specialist comparator that will find a match on any block that fits the form of the solution (in this case, when the plaintext is composed only of a single repeated byte).
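A comparator for the repeated-byte form could be sketched as follows for a 64 bit block. This illustrates only the matching condition described above, not a full solution to the challenge; the function name is hypothetical.

```python
def is_repeated_byte(block64: int) -> bool:
    """True when all eight bytes of the block are the same value."""
    byte = block64 & 0xFF
    return block64 == byte * 0x0101010101010101
```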

Returning matches

At some point in a key search machine’s operation it will be necessary to return potential keys to the host computer. Several schemes have been used to achieve this goal.

Most key search machines simply stop running when a match is found and wait for the computer to read out the key value. This is simple and flexible, but inefficient when many keys need to be returned – while a search unit is waiting to release the key it halted on, it cannot be used to search the key space.

A hardware buffer can be used to reduce the waiting time. When a key needs to be returned it is read into the hardware buffer, and the controller can read the keys out. This has the advantage of improved efficiency, but costs hardware resources.

One novel approach is to measure the amount of time needed to find the key. Using knowledge of how quickly the key space can be searched, an approximate trial key can be found. A number of keys need to be checked to account for timer inaccuracies. This method removes the need for key storage and retrieval hardware.
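The timing approach might be sketched like this. It assumes the search unit counts upward from a known starting key at a constant, known rate; the function and parameter names are hypothetical.

```python
import math


def candidate_keys(start_key: int, keys_per_second: float,
                   elapsed_seconds: float, timer_error_seconds: float) -> range:
    """Window of keys to re-check in software around the estimated match."""
    centre = start_key + round(keys_per_second * elapsed_seconds)
    slack = math.ceil(keys_per_second * timer_error_seconds)
    return range(max(start_key, centre - slack), centre + slack + 1)
```

The window grows with both the search rate and the timer error, so a fast machine with a coarse timer leaves a lot of keys for the host to re-check.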

A generic FPGA key search machine

The programming interface for this design is supplied in Key search engine 1 interface.

Goals

To produce an FPGA-based key search machine which can operate independently of cipher algorithm. It should communicate with a computer for instructions and data. It should be reasonably scalable for large key cracks, and be easily modifiable for ciphertext-only attacks. It should allow rapid prototyping of key search machines for different ciphers.

Top-level design

Initial key search machine top level design

The bus provided by the Pilchard interface runs synchronously at 100 or 133MHz. The remainder of the key search machine must operate at this speed.

The top-level Status register provides general status information for the entire key search machine.

The key buffer stores potentially correct keys for the computer to read out and check further. This prevents search units from being paused for very long when a potential key is located. It is particularly useful for ciphertext-only attacks, where there may be a large number of potentially correct keys. 256 keys can be stored; this figure fully utilises the four Block SelectRAM units that are needed to store a 64 bit word.

The controller operates the search bus. It relays commands from the computer to individual search units, stores the ciphertext and plaintext registers, and polls each search unit on the bus to see if there are any keys waiting. It uses a simple state machine. While there are no commands waiting to execute, it polls search units to see if there are any keys waiting. If a key is found, it is read into the key buffer. If a command arrives, it temporarily stops polling and executes the command.

Search unit design

Initial key search machine search unit design

Each search unit has its own status register which the controller uses to determine if a key has been located. The key generator provides a trial key to the decryptor, which uses the key and the supplied ciphertext to produce trial plaintext. This trial plaintext is compared with the known plaintext or has a set of heuristics applied to determine if it appears to be valid. If it is, the search unit is halted until it is instructed to restart by the controller.

Another generic FPGA key search machine

Motivation

Several problems were identified with the original key search machine that justified the design of a new one.

It was overly complicated. Modules within the design such as the key buffer provided very low returns on their cost. Removing the key buffer allowed the controller to be simplified, since it no longer needed to poll search units for keys. Other simplifications were made within the search units.
Timing closure was never achieved. The original design was implemented before a proper understanding of high-speed digital logic had been attained. The fact that it actually worked can be put down to luck and favourable operating conditions.
Minor bugs remained that complicated the software design.
The programming interface was more sophisticated than it needed to be, which complicated the software further.
The clock speed for search units was locked to that of the memory bus. This turned out to be a larger handicap than was originally predicted. The DES search unit could run at almost 180MHz according to the synthesis tools, but was still locked to the 100MHz of the memory bus. This could be increased to 133MHz by adjusting the motherboard jumpers, but this was not a satisfactory solution. Better support for slow ciphers was also needed.
It could not easily support key lengths over 64 bits.

The new design and its driver software took approximately four days to implement and debug. The programming interface is described in Key search machine 2 interface.

Design

Revised key search machine design

There are two controllers in this design: a master and a slave. The master handles all communication with the host computer and links to the slave with an asynchronous bus modelled closely on VME. The slave controller’s only purpose is to link the asynchronous bus and the search bus, which can run synchronously at any speed. This allows search units to run at any speed, simplifying cipher implementation.

Search unit design

The search unit is designed to use a block system for key allocation. Only the block number is transmitted to the search unit. When retrieving the key from the search unit, the least significant 32 bits of the key are returned. The software is expected to track which search unit is searching which block.

Search bus

The search bus runs at the same speed as the search units. The two clock domains (SDRAM clock and search unit clock) are linked with an asynchronous bus using similar protocols to VME.

High speed search units were later identified as a problem; it was found to be difficult to route a wide high speed bus over the entire FPGA and still meet timing constraints. A potential improvement to the machine would be to decouple the clock rate of the search bus from that of the search units or make the bus completely asynchronous. The latter was the original intent of the asynchronous bus, but the resources required to implement it made it unwieldy to use on every search unit. It remains the best solution for a large-scale machine (at least between FPGA devices).

Modules

Counters

Two linear counters were implemented. The first was a simple 64 bit counter. The second was designed to work with block schemes and added functionality to allow counting to be inhibited for ciphers that do not need a new key every clock cycle. It counts through 32 bits of range and has a further 48 bits of range that is set externally.

Bit equality comparator

A comparator that checks for exact bit equality was implemented. It is 64 bits wide and flags a match when its two inputs are identical. To ensure that it runs quickly enough at high clock speeds, it was implemented as a short pipeline. On the first clock cycle, four 16 bit segments of the trial plaintext are compared individually. On the second cycle, a match is flagged if all four comparisons were true.
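The two-stage behaviour can be modelled in software. The sketch below is a behavioural Python model only (the real module is hardware); the class name PipelinedComparator and its clock method are hypothetical, and the model assumes one trial word is presented per clock.

```python
MASK16 = 0xFFFF


def segment_matches(a, b):
    """Stage 1: compare four 16-bit segments of two 64-bit words."""
    return [((a >> s) & MASK16) == ((b >> s) & MASK16) for s in (0, 16, 32, 48)]


class PipelinedComparator:
    """Two-stage model: the match for an input appears one clock later."""

    def __init__(self, target):
        self.target = target
        self.stage1 = [False] * 4  # pipeline register between the stages

    def clock(self, trial):
        # Stage 2 combines the segment results registered on the previous cycle.
        match = all(self.stage1)
        # Stage 1 registers the segment comparisons for the current input.
        self.stage1 = segment_matches(trial, self.target)
        return match


target = 0x0123456789ABCDEF
comparator = PipelinedComparator(target)
outputs = [comparator.clock(word) for word in (0, target, 0)]
```

Here the matching word is presented on the second clock and the match flag is raised on the third, which is the one-cycle latency the pipeline trades for a shorter critical path.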

Statistical comparator

A simple statistical comparator was implemented using some of the ideas within [34]. Its purpose is to use the probabilities of different bytes within the produced plaintext to determine if the plaintext “looks right”. The definition of “looks right” varies depending on the attack scenario; English text would have different statistical properties to an executable file, for example.

The algorithm used is fairly simple. The comparator takes a 64 bit input and splits it into 8 bit bytes. Each byte value has an assigned “score” – higher scores correspond with more frequently occurring byte values. The scores for each byte are added and compared against a threshold value. If the threshold is exceeded, a match is flagged.
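A software model of this scoring step might look like the following. It is a sketch under stated assumptions: the score table and threshold are made-up illustrative values (printable ASCII scores high, everything else zero), and the function name statistical_match is hypothetical.

```python
def statistical_match(word, scores, threshold):
    """Return True if the summed byte scores of a 64-bit word exceed threshold."""
    data = word.to_bytes(8, "big")
    partial = [scores[b] for b in data]          # eight table lookups
    # Add in pairs, as an adder tree would: 8 -> 4 -> 2 -> 1 values.
    while len(partial) > 1:
        partial = [partial[i] + partial[i + 1] for i in range(0, len(partial), 2)]
    return partial[0] > threshold


# Hypothetical score table: printable ASCII scores 200, all other bytes 0.
scores = [200 if 0x20 <= b <= 0x7E else 0 for b in range(256)]
threshold = 1400  # illustrative: only exceeded when all eight bytes are printable

looks_like_text = statistical_match(int.from_bytes(b"plaintxt", "big"), scores, threshold)
looks_like_noise = statistical_match(0x00FF10807F019ACC, scores, threshold)
```

With a table trained on real data the scores would be graded rather than all-or-nothing, and the threshold would be tuned to trade false alarms against missed matches.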

Implementation

Implementation of the algorithm was more challenging, but still straightforward. The main design constraint was that the comparator be no slower than any decryption module – in this case, the DES module running at over 149MHz and producing one word of plaintext per cycle. In order to meet this timing requirement, steps of the algorithm were split up as much as possible.

The figure below shows the steps performed by the implementation. Four RAM blocks were used to store the byte value scores (8 bits each). Each RAM block has two ports, allowing a total of eight memory lookups every cycle. The scores are added in parallel in pairs to minimise delays on each cycle. Finally, the total score is compared with the threshold (which is set by the plaintext value). Splitting up the steps in this way produces a deep pipeline, but allows very high clock rates.

Statistical comparator design

The threshold comparison stage is the main timing bottleneck. Speed improvements can be made by reducing the comparison resolution. By only comparing the most significant four bits, synthesis reports a maximum speed of 181MHz. The required resolution depends on the statistical properties of the text being attacked.

A small C program was written to generate character scores from files. It counts the frequency of each character in the file. The scores are then normalised down to 8 bits and output in a format suitable for entry directly into the VHDL RAM initialisation code. This data could also be used to modify the bitstream after compilation if desired. This program allows the comparator to be “trained” on similar input data to what is expected.
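The training step can be sketched in a few lines. The original tool was written in C and emitted VHDL initialisation data; the version below is a Python illustration of the counting and normalisation only, and the particular normalisation (scaling so the most frequent byte scores 255) is an assumption, as is the name train_scores.

```python
from collections import Counter


def train_scores(sample: bytes):
    """Count byte frequencies and scale them to 8-bit scores (0..255)."""
    counts = Counter(sample)
    peak = max(counts.values(), default=0)
    if peak == 0:
        return [0] * 256
    # Assumed normalisation: the most frequent byte value scores 255.
    return [counts.get(b, 0) * 255 // peak for b in range(256)]


scores = train_scores(b"the quick brown fox jumps over the lazy dog")
```

In this sample the space character is the most frequent byte, so it receives the top score, while byte values that never occur score zero.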

DES cipher module

The DES implementation is a modified copy of the DES demonstration provided with the Pilchard board, which is itself a modified version of the Xilinx optimised DES implementation in [35]. The order of the round functions was reversed to convert the encryptor into a decryptor.

Registers were added to the key schedule logic, but later removed when the efficient keying system described in [19] was implemented. This scheme integrated an LFSR key generator with the DES key schedule logic. A 72 bit LFSR with taps suitable for a 56 bit LFSR was used. As the previous keys were shifted through the LFSR they remain available to the key schedule logic, which can generate the necessary subkeys with rotations. This saved approximately 500 slices that were previously used for subkey registers. Subkey generation was essentially free, although fanout on the LFSR bits did reduce the speed slightly.
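The property being exploited is that each LFSR state is the previous state shifted by one position with a single new bit, so recent states remain visible in a longer shift register. The sketch below illustrates this with a toy 4-bit Fibonacci LFSR (polynomial x^4 + x^3 + 1); it is not the 72 bit design described above, and the function names are hypothetical.

```python
def lfsr_step(state, width=4, taps=(4, 3)):
    """Advance a Fibonacci LFSR by one step; taps are 1-based bit positions."""
    feedback = 0
    for t in taps:
        feedback ^= (state >> (t - 1)) & 1
    return ((state << 1) | feedback) & ((1 << width) - 1)


def lfsr_sequence(seed=1, width=4, taps=(4, 3)):
    """Return the states visited until the LFSR returns to its seed."""
    states, state = [], seed
    while True:
        states.append(state)
        state = lfsr_step(state, width, taps)
        if state == seed:
            return states


keys = lfsr_sequence()
```

The toy generator visits all 15 non-zero 4-bit values exactly once before repeating, which is the behaviour a key generator needs: every key is tried, and none is tried twice.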

The possibility of attacking a key and its complement simultaneously was considered. This halves the search space, but not the search time. The decryption portion of DES comprises the bulk of the area requirement in a hardware implementation, and this improvement only saves key schedule logic. After implementing the LFSR keying scheme above, the performance improvement would be negligible.

Using the XCV1000E, a key search machine containing controllers and five search units was operated at 100MHz, giving a total search rate of 500Mkeys/sec.

A5/1 cipher module

The A5/1 implementation was produced from scratch using the algorithm description given in [7]. It aims to find the initial key state rather than the key itself. Time constraints did not allow the more efficient stream cipher attack in Stream Ciphers to be implemented, and so no further work was performed using this module.

The (already small) resources needed to implement the A5/1 module could be further reduced by configuring the Xilinx LUTs as shift registers [36]. This would complicate key loading; the entire key state could no longer be loaded in a single cycle.

RC5 cipher module

Introduction

The RC5 implementation was produced completely from scratch using the algorithm description given by Rivest [6]. It implemented RC5-32/12/9. It was intended to be used to complete the RSA Secret-Key Challenge contests [29]. The possibility of connecting the complete key search machine to distributed.net [26] was considered as an extension.

Few prior works in this area could be located. [37] claims to have schematics for a functional RC5 implementation on Xilinx FPGAs, but they are no longer available, and the author could not be contacted. [37] contains a Verilog model which was not found to be useful.

Pipelined design

A fully pipelined design similar to that used for DES was investigated. This possibility was considered to be impractical due to the large number of registers needed for the S array.

After implementing the iterative version, the possibility of implementing a pipelined version was considered again. This time, the number of LUTs required was identified as being excessive. A prototype implementation determined that each stage of the key mixing phase would require 256 LUTs, and each half-round of the decryption phase would require 192 LUTs. Given 78 mixing steps and 24 decryption half-rounds, the number of LUTs required is 24576 – coincidentally, the exact number of LUTs available on the Virtex 1000E. Many more would be required for state decoding, communication, key generation, comparisons, routing overhead and so on. This possibility was not investigated further, but would almost certainly be feasible given more hardware resources to work with. Such an implementation would be able to provide very high search rates on sufficiently large FPGA devices.
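The LUT total follows directly from the per-stage figures quoted above:

$$78 \times 256 + 24 \times 192 = 19968 + 4608 = 24576$$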

Iterative design

An iterative design for the RC5 implementation was used. Block SelectRAM memories within the FPGA were used to store the S array. The number of RAM blocks was anticipated to be the limiting factor, similar to the RC4 key search engine described in [21]. The L array was stored in three rotating registers; this eased timing constraints and prevented reads and writes to the RAM becoming a bottleneck.

The key mixing phase of RC5 took the bulk of the time needed to check a key. It required 78 iterations, each of which consists of a read and a write to the S and L arrays. To minimise the time required per cycle, the key mixing stage of the algorithm was set up to operate continuously on two separate regions of RAM. The initialisation and decryption stages were arranged to work on the opposite region of RAM. When a key mix phase completes, the decryption and initialisation phases begin on that region of RAM. In this way, the average time required to check a key would effectively be the time required to perform the key mixing phase.

RC5 RAM timing

The key mix phase needs to be completed as quickly as possible. The decryption and initialisation phases are not timing critical, and can be completed more slowly in order to save FPGA resources. The decryption module takes advantage of this by performing twice as many rounds and interchanging the A and B registers at the end of each round. In this way the subtract, shift, XOR and RAM lookup resources can be reused. The initialisation module actually performs the additions required to initialise the S array, even though these results could be trivially precomputed. This saves FPGA resources.
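The half-round trick is easier to see in software. The sketch below is a Python model of RC5-32/12/9 written from the published algorithm, not a transcription of the hardware: expand_key performs the initialisation and the 78 key mixing iterations, and decrypt_half_rounds performs 24 identical half-rounds with a register swap after each one. The function names are hypothetical.

```python
W, R, B = 32, 12, 9          # RC5-32/12/9: word size, rounds, key bytes
MASK = 0xFFFFFFFF
P32, Q32 = 0xB7E15163, 0x9E3779B9
T = 2 * (R + 1)              # 26 subkeys in the S array
C = (B + 3) // 4             # 3 words in the L array


def rotl(x, n):
    n %= W
    return ((x << n) | (x >> (W - n))) & MASK


def rotr(x, n):
    n %= W
    return ((x >> n) | (x << (W - n))) & MASK


def expand_key(key):
    """Initialisation phase followed by the 78-iteration key mixing phase."""
    L = [0] * C
    for i in range(B - 1, -1, -1):
        L[i // 4] = ((L[i // 4] << 8) + key[i]) & MASK
    S = [(P32 + i * Q32) & MASK for i in range(T)]   # initialisation by addition
    a = b = i = j = 0
    for _ in range(3 * max(T, C)):                   # 3 * 26 = 78 iterations
        a = S[i] = rotl((S[i] + a + b) & MASK, 3)
        b = L[j] = rotl((L[j] + a + b) & MASK, a + b)
        i, j = (i + 1) % T, (j + 1) % C
    return S


def encrypt(S, a, b):
    a, b = (a + S[0]) & MASK, (b + S[1]) & MASK
    for i in range(1, R + 1):
        a = (rotl(a ^ b, b) + S[2 * i]) & MASK
        b = (rotl(b ^ a, a) + S[2 * i + 1]) & MASK
    return a, b


def decrypt_half_rounds(S, a, b):
    """Decrypt with 2*R identical half-rounds, swapping registers after each."""
    x, y = b, a
    for k in range(2 * R + 1, 1, -1):        # subkeys S[25] down to S[2]
        x = rotr((x - S[k]) & MASK, y) ^ y   # one subtract, one rotate, one XOR
        x, y = y, x                          # interchange the two registers
    return (y - S[0]) & MASK, (x - S[1]) & MASK


S = expand_key(bytes(range(9)))
ciphertext = encrypt(S, 0x01234567, 0x89ABCDEF)
recovered = decrypt_half_rounds(S, *ciphertext)
```

Each pass through the loop body uses the same subtract, rotate and XOR operations, so a hardware version needs only one copy of each, at the cost of twice as many steps.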

The general goal for the key mix operation is to complete as quickly as possible. The general goal for the decryption and initialisation operations is to use as few resources as possible, so long as the time taken for these two operations does not exceed that needed by the key mix operation.

Implementation

One problematic area in the implementation was the 32 bit barrel shifter required by RC5. The initial naïve implementation required 352 slices; with the help of [39] this was improved to 80 slices. One shifter is required for the key mix stage and another for the decryption stage. These account for a significant amount of the resource usage. Some research and experimentation was conducted to find smaller or faster shifter designs, without success. Shrinking or speeding up the barrel shifters would provide large benefits to the overall performance of the design.

Running the module at 100MHz proved difficult. Routing delays introduced after the place and route stage were the cause of the problem; congestion was present at one of the RAM blocks. The delay at this point increased when the number of search units was increased, suggesting that floorplanning may be useful to reduce the delay or at least make it consistent. A brief unsuccessful attempt at floorplanning was made.

Two approaches were used to solve this problem. First, the RAM was duplicated. Originally, two RAM blocks were used to provide a 32 bit wide RAM, with one port used by the key schedule module and the other by the decryptor and initialisation module. The number of RAM blocks was doubled (the extra blocks were not otherwise being used) and writes were made to both pairs. Reads could then be made from either pair of RAM blocks, allowing unrelated logic to be moved to different areas of the FPGA by the place and route tools. This helped to reduce delays. Second, a wait state was added after RAM access, which allowed the module to meet its timing requirements at the cost of reduced performance.

The total time required to check an RC5 key is 469 clock cycles. Each iteration needs 6 clock cycles, and 78 iterations are required. One cycle is needed for initialisation. At the target clock speed of 100MHz, this gives a search rate of 213,220 keys/sec. 16 search units could be fit into an XCV1000E device, giving an aggregate search rate of 3.4Mkeys/sec.
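The figures above follow from a short calculation:

$$6 \times 78 + 1 = 469 \text{ cycles per key}$$

$$\frac{100 \times 10^{6} \text{ cycles/sec}}{469 \text{ cycles/key}} \approx 213{,}220 \text{ keys/sec}$$

$$16 \times 213{,}220 \approx 3.41 \times 10^{6} \text{ keys/sec}$$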

The RC5 cipher module consumed 595 slices. The implementation in [37] required 510 XC4000 CLBs; each XC4000 CLB [40] is roughly equivalent to a Virtex slice.

Optimisations

The possibility of increasing the clock speed of the RC5 module was investigated, but found to be counterproductive. The intent was to balance the time spent in each pipeline stage better, hopefully overcoming the increase in resource usage and number of stages required. Registers were inserted at locations responsible for timing limitations. These registers did not increase resource usage significantly due to the structure of the Virtex slice [41]. The number of cycles per round increased from 5 to 8 and the synthesis clock speed from 102MHz to 142MHz, which was not an effective tradeoff. Many previously trivial operations such as the comparison needed to be split into stages instead of being simple combinatorial operations, which greatly increased the complexity of the source code. The overall resource usage also increased.
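Taking the synthesis clock estimates at face value, the round throughput before and after the change can be compared directly:

$$\frac{102 \times 10^{6}}{5} = 20.4 \times 10^{6} \text{ rounds/sec} \qquad \frac{142 \times 10^{6}}{8} = 17.75 \times 10^{6} \text{ rounds/sec}$$

The higher clock rate is more than offset by the extra cycles, leaving the deeper pipeline roughly 13% slower per round.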

Replacing each bit in the three registers used to implement the L array with a short LUT shift register would reduce the resources allocated and potentially ease routing.

Some work was conducted to see if it was possible to take shortcuts in the key mixing operation; this was unsuccessful.

Including the ciphertext and IV at synthesis time reduced resource usage for the search unit to 539 slices. This would be a worthwhile approach for an attack where the ciphertext and IV are known in advance. It would generally not be suitable for an ASIC implementation.

This module was implemented before the second key search machine. Performance could be improved by running at a lower clock speed with fewer pipeline stages.

Software benchmarks

Methodology

Benchmarks were conducted on a number of different CPUs to measure how quickly they could perform key searches. Setting up and running the benchmarks was very rapid, so many different CPUs were tested to determine if any would provide significant price/performance advantages.

Pre-written benchmarks were used. These benchmarks were faster and more thoroughly tested than what could otherwise be produced in the available time.

Results

Each benchmark was run at least three times, and repeated until consistent results were achieved. Linux benchmarks were run as the root user, prefixing the benchmark command with nice -20 to ensure that the benchmark ran with the highest priority.

Tables containing the gathered results are given in CPU benchmark results. distributed.net maintains an online database [42] of search rates for each CPU, allowing some of the benchmark results to be verified.

DES

Two benchmark programs were used: the distributed.net client version 19991117 (which had to be compiled from source), and the SolNET DES client [43]. The distributed.net client gave far better benchmark results, but could only be run on Linux machines with appropriate compiler versions. Neither DES client had been optimised for modern CPUs.

The command dnetc -benchmark des was used to run the distributed.net benchmarks, and desclient-x86-linux -m for the SolNET benchmarks. The SolNET client’s benchmark results were unstable on faster CPUs, requiring them to be run a large number of times.

The DES benchmarks for newer CPUs could not be verified against the distributed.net online database [42] because the CPUs did not exist at the time that the online benchmarks were gathered. The results for older CPUs were far higher than those in the online database.

Benchmarks for Celeron, P4HT and Athlon XP (Barton) CPUs had to be inferred from others based on the same core. The Mkeys/sec/MHz ratios obtained for RC5 remained fairly constant under this assumption, and this is assumed to remain true for DES.

RC5-72

The distributed.net client version 03033120 was used to conduct RC5-72 benchmarks. Binaries from the distributed.net website were downloaded for the relevant platform, unpacked, and the benchmark executed from the command line with dnetc -benchmark rc5-72.

The RC5 benchmark results were verified against those in the distributed.net database. Confusion is apparent with the Athlon speed ratings; it is not obvious whether an entry marked “1900” refers to a 1900+ or a 1900MHz Athlon. Nevertheless, the RC5 benchmark results gathered were found to mesh well with those in the database.

No Celeron machines based on the Pentium 4 core were available to run benchmarks on, so the online benchmark results were used for analysis. These appeared internally consistent, so a Mkeys/sec/MHz rating was determined and averaged across the available benchmark results to reduce error. This rating was used to infer the missing benchmark results.

References

[6] R. L. Rivest, “The RC5 encryption algorithm,” in Practical Cryptography for Data Internetworks, W. Stallings, Ed. IEEE Computer Society Press, 1996.

[7] J. Keller and B. Seitz, “A hardware-based attack on the A5/1 stream cipher,” in APC 2001. VDE Verlag, 2001, pp. 155-158. [Online]. Available: http://www.informatik.fernuni-hagen.de/ti2/papers/apc2001-nal.pdf

[10] Electronic Frontier Foundation, Cracking DES. O’Reilly, 1998.

[12] P. Leong, M. Leong, O. Cheung, T. Tung, C. Kwok, M. Wong, and K. Lee, “Pilchard - a reconfigurable computing platform with memory slot interface,” in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), April 2001. [Online]. Available: http://www.cse.cuhk.edu.hk/~phwl/papers/pilchard_fccm01.pdf

[15] M. J. Wiener, “Efficient DES key search,” in Practical Cryptography for Data Internetworks, W. Stallings, Ed. IEEE Computer Society Press, 1996, pp. 31-79.

[16] I. Goldberg and D. Wagner, “Architectural considerations for cryptanalytic hardware,” CS252 Report, 1996. [Online]. Available: http://www.cs.berkeley.edu/~iang/isaac/hardware/paper.ps

[18] J.-P. Kaps and C. Paar, “Fast DES implementation for FPGAs and its application to a universal key-search machine,” in Selected Areas in Cryptography, 1998, pp. 234-247.

[19] I. Hamer and P. Chow, “DES cracking on the Transmogrifier 2a,” in Lecture Notes in Computer Science, ser. Cryptographic Hardware and Embedded Systems. Springer-Verlag, 1999, no. 1717, pp. 13-24. [Online]. Available: http://www.eecg.toronto.edu/~pc/research/publications/des.ches99.ps.gz

[23] M. Blaze. (1997, June) A better DES challenge. [Online]. Available: http://www.privacy.nb.ca/cryptography/archives/cryptography/html/1997-06/0127.html

[26] (2003, October) distributed.net: Node Zero. [Online]. Available: http://www.distributed.net/

[29] The RSA Laboratories Secret-Key Challenge. RSA Security. [Online]. Available: http://www.rsasecurity.com/rsalabs/challenges/secretkey/index.html

[32] A. Elbirt, W. Yip, B. Chetwynd, and C. Paar, “An FPGA-based performance evaluation of the AES block cipher candidate algorithm finalists,” IEEE Transactions on VLSI Systems, vol. 9, no. 4, August 2001.

[33] J. Leonard and W. H. Mangione-Smith, “A case study of partially evaluated hardware circuits: Key-specific DES,” in Field-Programmable Logic and Applications. 7th International Workshop, W. Luk, P. Y. K. Cheung, and M. Glesner, Eds., vol. 1304. London, U.K.: Springer-Verlag, 1997, pp. 151-160. [Online]. Available: http://citeseer.nj.nec.com/leonard97case.html

[34] D. Wagner and S. M. Bellovin, “A programmable plaintext recognizer,” 1994. [Online]. Available: ftp://ftp.research.att.com/dist/smb/recog.ps

[35] C. Eilbeck. My crypto page. [Online]. Available: http://www.yordas.demon.co.uk/crypto/

[36] Xilinx, Inc. SRL16 16-bit shift register look-up-table (LUT). [Online]. Available: http://toolbox.xilinx.com/docsan/xilinx5/data/docs/lib/lib0393_377.html

[37] E. Soha. (1998, May) RC5 on FPGAs. No longer available from original source. [Online]. Available: http://web.archive.org/web/19981205053422/http://www-inst.eecs.berkeley.edu/~barrel/rc5.html

[40] Xilinx, Inc., “XC4000E and XC4000X Series Field Programmable Gate Arrays,” May 1999. [Online]. Available: http://www.xilinx.com/bvdocs/publications/4000.pdf

[41] Xilinx Inc., “Virtex-E 1.8V Field Programmable Gate Arrays,” July 2002. [Online]. Available: http://direct.xilinx.com/bvdocs/publications/ds022.pdf

[42] distributed.net. (2003, October) distributed.net: Client Speed Comparisons. [Online]. Available: http://n0cgi.distributed.net/speed/

[43] (1997, May) SolNET DES Challenge Attack: Download Page. [Online]. Available: http://www.des.sollentuna.se/download.html

Background

ian@mutexlabs.com (Ian Howson) — Mon, 20 Oct 2003 00:00:00 +0000

Theory

The theory in this section is only covered briefly. The reader is encouraged to refer to Bruce Schneier’s Applied Cryptography for more details.

Symmetric ciphers

A symmetric cipher is characterised by the functions

$$ciphertext=E(plaintext,key)$$

and

$$plaintext=E^{-1}(ciphertext,key)$$

The intent of a cipher is that the function E be non-invertible without the key. This ensures that the plaintext remains secret to people without the key.

A block cipher is one where data is processed in discrete blocks. The input plaintext or ciphertext is broken up into blocks of the appropriate size. An example is DES, which processes data in 64 bit blocks. A stream cipher is one which works with much smaller units of data – often a single bit at a time. A5/1 is a common stream cipher. Stream ciphers are used to generate a key stream, which is then XORed with the plaintext to produce the ciphertext. XORing the ciphertext with the key stream again will decrypt the data.
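The symmetry of the XOR step can be shown in a few lines of Python. The key stream here is a fixed placeholder byte string chosen for illustration, not the output of any real stream cipher, and the function name is hypothetical.

```python
def xor_with_keystream(data: bytes, keystream: bytes) -> bytes:
    """XOR each data byte with the corresponding key stream byte."""
    if len(keystream) < len(data):
        raise ValueError("key stream is shorter than the data")
    return bytes(d ^ k for d, k in zip(data, keystream))


keystream = bytes([0x3A, 0x91, 0x5C, 0x07, 0xE2])    # placeholder key stream
ciphertext = xor_with_keystream(b"hello", keystream)
recovered = xor_with_keystream(ciphertext, keystream)
```

The same function serves for both encryption and decryption, because XORing with the same key stream twice cancels out.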

Attacks and security

In a known plaintext attack, the attacker possesses some ciphertext and the matching plaintext. The goal is to find the key. This is the attack method usually used in research; possessing or being able to infer part of the plaintext is a reasonably safe assumption. E-mail headers and IP packets always begin in the same way, for example.

In a ciphertext only attack, the attacker possesses some ciphertext. These attacks are more difficult to perform. Usually, the attacker relies on some properties of the plaintext to determine when they are successful (such as character distributions or language statistics).

There are two criteria for a symmetric cipher to be considered secure [2]:

There must be no shortcuts to attack the cipher; exhaustive key search must be the most feasible attack
The number of possible keys must be large enough to make an exhaustive key search attack infeasible

Key search attacks

In a key search attack, the attacker tries every possible key as input to the cipher. The known piece of ciphertext is also used as an input. In a known plaintext attack, the trial plaintext from the cipher output is compared to the known plaintext. In a ciphertext only attack, heuristics are used to determine if the output is valid plaintext.

Normal cipher usage

Most ciphers consist of a key setup phase and an operation phase. During key setup, the internal state is initialised. During operation, input ciphertext or plaintext is encrypted or decrypted. Key setup only needs to be conducted once for each key that is used.

When a cipher is used in practice, the key is usually kept constant for a long period while the plaintext or ciphertext input is varied frequently. Key setup is performed only once, and the cipher is designed to handle the rapid change in input.

Exhaustive key search reverses this by keeping the input constant while changing the key frequently. The main implication is that key setup must be performed very frequently. Many ciphers exploit this to improve resistance to key search attacks by having a very long key setup period. The key setup often uses the encryption algorithm itself. This greatly increases the time and resources needed to conduct a successful exhaustive key search.

Commercial chips that perform encryption or decryption may not be suitable for use in a key search machine if they are not designed to have the key changed frequently. Conversely, it may be possible to optimise a custom key search design by precomputing (partially evaluating) parts of the algorithm, since the ciphertext is known in advance. This technique has already been used to produce very fast and efficient cipher implementations by including the key in the design itself.

Block ciphers

When attacking a block cipher, one output block is usually tested for each key. If the output block matches the known plaintext, tests with more blocks are conducted to verify that the key is correct. The further checking step is important. There may be several keys that give the same plaintext output if the key size is longer than the block size and only a single block is checked.
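The two-stage test can be demonstrated with a deliberately weak toy cipher. Everything in the sketch below is an assumption made for illustration: the cipher (8 bit block, 16 bit key) is invented and offers no security, and the names toy_encrypt, toy_decrypt and key_search are hypothetical. The key is longer than the block so that the false-positive effect described above is visible.

```python
def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF


def rotr8(x, n):
    return ((x >> n) | (x << (8 - n))) & 0xFF


def toy_encrypt(block, key):
    """Toy cipher (NOT secure): 8-bit block, 16-bit key."""
    lo, hi = key & 0xFF, key >> 8
    return rotl8(((block ^ lo) + hi) & 0xFF, 3)


def toy_decrypt(block, key):
    lo, hi = key & 0xFF, key >> 8
    return ((rotr8(block, 3) - hi) & 0xFF) ^ lo


def key_search(plaintext, ciphertext):
    """Test one block per key; verify survivors against all known blocks."""
    first_pass, verified = [], []
    for key in range(1 << 16):
        if toy_decrypt(ciphertext[0], key) != plaintext[0]:
            continue                      # most keys are rejected here
        first_pass.append(key)
        if all(toy_decrypt(c, key) == p for p, c in zip(plaintext, ciphertext)):
            verified.append(key)
    return first_pass, verified


secret_key = 0xBEEF
plaintext = b"key search"
ciphertext = bytes(toy_encrypt(p, secret_key) for p in plaintext)
first_pass, verified = key_search(plaintext, ciphertext)
```

In this toy, one key passes the single-block test for every value of the low key byte, so 256 candidates survive the first stage. Checking the remaining blocks removes most of them while keeping the real key.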

Stream ciphers

Stream ciphers are often faster to conduct brute-force attacks against because incorrect keys can be quickly eliminated. A simple approach would be to generate a quantity of the key stream and XOR that with the ciphertext to generate the plaintext. The stream cipher can then be treated in exactly the same way as a block cipher. Efficiency can be slightly improved by ignoring the XOR stage and simply searching for the correct key stream. The amount of key stream to be generated must balance out the number of false alarms with the amount of time taken to check each key. Generating more of the key stream will cut down on false alarms, but take more time.

The main problem with this approach is that it is very inefficient. The entire block of key stream must be generated before it is checked for correctness. The first few bits to be generated may be enough to determine that a key is incorrect. A more efficient algorithm would then be:

Generate a single unit of the key stream (the smallest amount possible).
Check whether this unit matches the first unit of the desired key stream.
If it matches, continue checking with the next unit of the same key. If it doesn’t, start again with the next key.
If a sufficient number of units match, return the key as a potentially correct key.

With this algorithm, an average of about two units of key stream needs to be generated for each trial key, assuming that each unit is a single bit and that an incorrect key matches each bit with probability one half. This is far more efficient than the simple algorithm, which may need to generate a large amount of key stream to avoid returning an excessive number of potential keys.
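The early-abort steps above can be sketched and measured in Python. The key stream generator here is a stand-in (bits taken from SHA-256 of the key), chosen only because its output looks random; it is not a real stream cipher, and a real generator would produce each bit incrementally. The function names are hypothetical.

```python
import hashlib


def keystream_bits(key):
    """Stand-in key stream generator, yielding one bit at a time."""
    digest = hashlib.sha256(key.to_bytes(4, "big")).digest()
    for byte in digest:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1


def search(target_bits, keys):
    """Early-abort search: stop generating key stream at the first mismatch."""
    candidates, generated = [], 0
    for key in keys:
        stream = keystream_bits(key)
        for wanted in target_bits:
            generated += 1
            if next(stream) != wanted:
                break                      # wrong key: move to the next one
        else:
            candidates.append(key)         # every unit matched
    return candidates, generated


secret_key = 12345
target = [bit for bit, _ in zip(keystream_bits(secret_key), range(32))]
candidates, generated = search(target, range(20000))
average = generated / 20000
```

Over 20000 trial keys the measured average should sit close to two bits per key, in line with the estimate above, compared with 32 bits per key if the whole block were generated before checking.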

As with a block cipher, generating the first unit of key stream may require a lengthy key setup phase be carried out. Many key search attacks avoid this by searching for the initial state of the cipher after the key setup has been completed. This is not always feasible; some stream ciphers such as RC4 have very large internal states.

Common ciphers

Most ciphers make use of a number of common operations. These operations typically retain entropy to ensure random-looking output. They may also introduce nonlinearities in the output. Ciphers generally combine operations from a number of algebraic groups to improve their strength.

Name	Key length (bits)	Type	Operations	Ref
DES	56	Block	Bit permute, rotate, XOR, table lookup (6×4)	[3]
RC4	64	Stream	Add, table read and write (8×8), XOR	[1]
Rijndael	128/192/256	Block	Table lookup (8×8), rotate, multiply (GF(2⁸)), XOR	[4]
3DES	112/168	Block	Bit permute, rotate, XOR, table lookup (6×4)	[1]
IDEA	128	Block	XOR, add, multiply (16 bits), rotate	[4]
Blowfish	32-448	Block	XOR, add, table read and write (8×32)	[5]
RC5	0-2040	Block	XOR, add, variable rotate	[6]
A5/1	64	Stream	XOR, shift	[7]
Skipjack	80	Block	XOR, shift, add, table lookup (8×8)	[8]

Table sizes are specified by (x×y), where x is the number of bits in the index and y is the number of bits in the output. For example, 6×4 can be modelled by a RAM with 6 address bits and 4 data bits. Addition is considered to include subtraction as a trivial extension.

By testing the relative speed of each operation on each implementation technology, it should be possible to gain insights into which ciphers will run quickly on which technology.

DES

The Data Encryption Standard (DES) has been extensively used and studied for decades. Several linear and differential attacks against it have been discovered, but the most effective attack remains exhaustive key search. It works with 64 bit blocks and was originally designed for fast hardware implementations. It has been the subject of several contests.

DES has enjoyed widespread military, government and commercial use in the past, notably within the banking and finance sectors. Its key length is only 56 bits, which is considered far too weak nowadays. Many attacks on the key length of DES have been performed, some of which are described below.

DES remains in use through a variant called Triple DES (or 3DES). In 3DES, the DES cipher is applied three times with two or three different keys. This is an effective method of increasing the strength of the DES cipher, but care must be taken during implementation to ensure that all of the keys are different. Clayton and Bond successfully attacked a secure processor (formerly used in ATMs) that utilises 3DES [24]. They exploit protocol flaws to force the processor to use duplicate keys. They can then perform a key search attack on the reduced key space to determine a “master key”.

RC5

RC5 is a simple block cipher designed by Ronald Rivest [6]. Despite its simple structure, very few attacks better than exhaustive key search have been discovered. It is fully parameterised, so the block length, number of rounds and key length can be selected to suit the application. It is best known as the cipher being attacked by the RSA Secret Key Challenges [29]. The parameters for the challenges are selected to be efficient on modern CPUs. RC5 is not widely used for commercial applications and has been patented by RSA Data Security.

Implementation technologies

Software on general-purpose CPUs

Software is usually the first platform that a cipher will be implemented on. General-purpose CPUs are cheap, plentiful, and fast for most tasks. Many ciphers are designed for an efficient software implementation.

Parallelism

CPUs are designed to perform serial operations very quickly. Regardless of the amount of available chip area, the need to operate serially remains. This has led to modern CPUs expending a large amount of area on prediction and caching circuitry. A doubling in chip area for a CPU will not result in a doubling of performance.

Exhaustive key search is highly parallelisable; the task can be split perfectly between any number of processing units. This means that using multiple specialised processing units instead of a single CPU may give higher performance.

Recent CPUs have demonstrated a move towards increasing parallelism. Multiple execution units, deep pipelines, Symmetric Multiprocessing (SMP) and techniques such as Intel’s HyperThreading all serve to increase the level of parallelism.

Bitslicing

Eli Biham pioneered a technique which later became known as bitslicing [9]. The paper deals primarily with its application to the DES cipher, but it is applicable to many other algorithms. In it, each register of a CPU is viewed as a large number of single-bit registers. This allows a large number of single-bit operations to be performed in parallel. For an algorithm such as DES which is composed largely of single-bit operations, this provides a very large performance gain.
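A small example makes the idea concrete. The sketch below is not Biham's DES code; it applies the same principle to a one-bit full adder, computing the sum of three input bits for 64 independent inputs at once using only bitwise operations. Python integers stand in for 64 bit registers, and the function names are hypothetical.

```python
LANES = 64  # treat one 64-bit register as 64 independent one-bit registers


def to_slices(values, nbits):
    """Transpose: slice j holds bit j of every value, one value per lane."""
    return [sum(((v >> j) & 1) << lane for lane, v in enumerate(values))
            for j in range(nbits)]


def from_slices(slices, count):
    """Inverse transpose back to one integer per lane."""
    return [sum(((s >> lane) & 1) << j for j, s in enumerate(slices))
            for lane in range(count)]


def bitsliced_add3(slices):
    """Add the three input bits of every lane at once (a full adder)."""
    a, b, c = slices
    total = a ^ b ^ c                       # sum bit for all 64 lanes
    carry = (a & b) | (c & (a ^ b))         # carry bit for all 64 lanes
    return [total, carry]


inputs = [lane % 8 for lane in range(LANES)]            # one 3-bit input per lane
outputs = from_slices(bitsliced_add3(to_slices(inputs, 3)), LANES)
```

Each bitwise instruction processes all 64 lanes, so the cost of the transposition is repaid when the function being evaluated is a long chain of single-bit operations, as it is in DES.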

ASICs

An ASIC (Application Specific Integrated Circuit) is a chip that has been designed for a particular purpose. ASICs are usually very fast for the application they have been designed for, but cannot be modified after fabrication. Initial fabrication costs are very high, but can be amortised over many chips, so the unit price per chip is usually quite low. The development effort required is far greater than for software, and greater than for FPGAs. Gate array designs reduce the high cost of development significantly, but reduce achievable density. They work by placing gates over the entire silicon area of a device during fabrication and linking the gates with metal layers later.

FPGA designs can be converted to equivalent gate array ASIC designs at a relatively low cost. This technique was used for the machine described in [10]. Designs implemented in this way tend to be faster and cheaper than those on FPGAs, but not as fast as a dedicated ASIC design. It is an attractive option where development time and cost are important and the number of FPGAs needed makes the implementation cost prohibitive. Many of the issues affecting FPGA designs (such as timing) also apply to ASICs. Routing tends to be much less problematic on an ASIC compared with an FPGA.

FPGAs

FPGAs (Field Programmable Gate Arrays) combine software and hardware approaches. They are chips that can have their internal layout reconfigured at any time. This is usually achieved by loading a bitstream from a ROM or controller. An FPGA typically contains a large number of logic blocks. “Programming” an FPGA determines how the logic blocks are wired together. Modern FPGAs may contain tens of thousands of logic blocks, each of which contains latches, combinational functions and other logic. High-end FPGAs such as the Virtex II Pro [11] even integrate CPU cores within the normal FPGA fabric. There is a move to providing dedicated hardware within FPGAs, such as RAM and communication controllers.

FPGA performance approaches that of an ASIC, but their general structure makes them slower and less efficient. Much less can be done within the same amount of silicon area. They also generate more heat and use more power than an equivalent ASIC. FPGAs do not have the high up-front cost of an ASIC, but cost more per unit. They are ideal for prototyping and development, since they can be reprogrammed quickly at no cost.

Developing a design for an FPGA is generally more time consuming than writing normal software due to the low-level nature of the design. There are also far fewer competent FPGA designers than software programmers, increasing development cost and time.

Pilchard

The Pilchard development board

Pilchard is a low-cost FPGA development board [12]. It contains a Xilinx Virtex E device; the device used for this thesis was an XCV1000E-6HQ240. The Pilchard board and FPGA device used are shown above.

Pilchard interfaces to the RAM bus of certain motherboards. From the perspective of the FPGA designer, it provides a simple synchronous 100MHz or 133MHz bus with no interrupts or DMA. It appears as a memory range to the programmer, so registers within the FPGA can map directly to variables in the driver software. This combination allows easy system development, high performance and low FPGA resource requirements.

A number of external I/O pins are available that can be interfaced to other hardware. Space is also available on the circuit board for ROM chips that can load a bitstream into the FPGA device on power up.

Combining technologies

Software, FPGA and ASIC components can be profitably combined. All known FPGA and ASIC-based key search machines are controlled by a general purpose computer.

One useful idea is to use each technology for the task it excels at. For example, a machine using all three technologies might use a computer as a primary controller, FPGAs as lower-level controllers, and ASICs for each search unit. The computer is useful for human interface and easy reconfigurability. The FPGAs are useful for their high speed, I/O capabilities and reconfigurability. ASICs have the advantages of very high speed and low price in very large quantities. This scheme is particularly desirable for ciphers where suitable ASICs are available commercially, reducing fabrication costs.

Another possibility is to perform hardware-fast operations on ASICs and FPGAs, and software-fast operations on computers. This is generally infeasible due to the “I/O gap” – latencies between the ASIC/FPGA and the CPU far outweigh any speed benefit. Pilchard interfaces with the memory bus of a computer and can thus provide a very high bandwidth and low latency connection to the CPU. The integrated CPUs in Virtex II Pro FPGAs also provide similar benefits.

The integrated CPUs on Virtex II Pro FPGAs provide another option for ciphers favouring software implementations. Each Virtex II Pro FPGA contains up to four PowerPC 405 cores [11]. These CPUs could be used to perform the bulk of the cipher operations while the surrounding FPGA fabric handles control, communication and testing of results. A very large number of CPUs could be integrated into a small space using this technique.

Previous work

Most previous hardware key search machines have been designed to locate DES keys. This is because DES is very fast in hardware, widely deployed and has a dangerously short key length. There are also political issues involved with DES and its selection as a standard.

Hardware key search machines

Many hardware key search machines have been produced in the past. Most of these are not practical machines. They are used to gather performance estimates with a certain technology or technique. More key search machines are known to exist; only those with notable features have been presented in this section.

Name	Cipher	Year	Level	Technology	Keys/sec/chip	Ref.
Diffie/Hellman	DES	1977	Theoretical	ASIC	1M	[13]
McLaughlin	DES	1992	Theoretical	ASIC	2k	[14]
Wiener	DES	1993	Designed	ASIC	50M	[15]
Goldberg/Wagner	DES	1996	Built	CPLD	0.5M	[16]
Various	DES	1996	Estimated	FPGA	30M	[2]
Various	DES	1996	Estimated	ASIC	200M	[2]
Wiener	DES	1997	Estimated	ASIC	300M	[17]
Kaps/Paar	DES	1998	Built	FPGA	6.29M	[18]
EFF	DES	1998	Built	ASIC	60M	[10]
Hamer/Chow	DES	1999	Built	FPGA	25M	[19]
Goldberg/Wagner	RC4	1996	Built	CPLD	8.4k	[16]
Kundarewich/Wilton/Hu	RC4	1999	Built	CPLD	40k	[20]
Tsoi/Lee/Leong	RC4	2002	Built	FPGA	6.06M	[21]
Goldberg/Wagner	A5	1996	Built	CPLD	4M	[16]

Most of these performance figures have caveats; they may be estimates, approximations, or based on implementations which were not completely carried out.

Performance estimates

A number of papers provide theoretical estimates of the cost of breaking ciphers with a hardware key search engine. Minimal Key Lengths for Symmetric Ciphers to Provide Adequate Commercial Security [2] is a prime example of this. It lacks practical grounding, but still contains useful estimates and background. Written in 1996, it predicts that a $200 FPGA (AT&T ORCA) can test 30 million DES keys per second, and that a $10 ASIC can test 200 million DES keys per second. For $300,000, an FPGA-based machine should be able to crack a DES key every 19 days, and an ASIC-based machine every three hours.

Wiener’s Efficient DES Key Search [15] describes a theoretical hardware DES key search machine based around a custom ASIC. The machine was designed, but not built. Just about every detail of the machine was described, including the schematics, interfaces and physical requirements. Each ASIC in the design is estimated to be able to check 50 million keys per second. Pipelined search units and an LFSR key generator are used. In 1993, a machine costing $100,000 is estimated to be able to crack a DES key every 35 hours, on average. These estimates were updated in 1997 to take newer technology and further analysis into account [17]. In this paper, a $100,000 machine should be capable of cracking a DES key in six hours. The speed estimates given in this paper are the basis of those presented in [2].

McLaughlin presented a high-level design for a DES key search machine [14]. The paper ignores low-level issues and focuses on the high-level functionality of the machine. Its main features are the use of a fuzzy comparer and specialist key generators.

Diffie and Hellman produced a paper in 1977 that counters objections to the possibility of a key search machine [13]. Objections to the reliability, size, speed, power and cost of a key search machine were countered, and a system architecture based around a million search chips was presented. Despite the (comparatively) primitive technology available at the time, a key search machine is still believed to be feasible.

The EFF DES Cracker

In 1998, the Electronic Frontier Foundation (EFF) published a book [10] describing a large scale DES cracker that they built. Paul Kocher later elaborated on the book in [22].

The machine was based around a very large number of search units. Each search unit takes 16 clock cycles to check a DES key. 24 search units were built into a custom ASIC design that ran at 40MHz. 64 ASICs were placed on each circuit board and 27 boards constructed. Taking into account faulty search units, the entire machine was capable of a search rate of 92.6 billion keys per second, or an average search time of 4.5 days. The machine was built with a budget of $250,000.
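
These figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that, on average, half of the $2^{56}$ DES key space must be searched before the key is found; the variable names are illustrative only.

```python
CLOCK_HZ = 40e6          # ASIC clock rate
CYCLES_PER_KEY = 16      # clock cycles a search unit needs per DES key
UNITS_PER_ASIC = 24
ASICS_PER_BOARD = 64
BOARDS = 27

units = UNITS_PER_ASIC * ASICS_PER_BOARD * BOARDS        # search units in the machine
rate_per_unit = CLOCK_HZ / CYCLES_PER_KEY                # keys/sec for one search unit
rate_per_asic = UNITS_PER_ASIC * rate_per_unit           # keys/sec for one chip
peak_rate = units * rate_per_unit                        # keys/sec if every unit worked

REPORTED_RATE = 92.6e9           # quoted rate after allowing for faulty search units
AVERAGE_KEYS = 2 ** 55           # assumption: half of the 2^56 key space on average
average_days = AVERAGE_KEYS / REPORTED_RATE / 86400

print(f"{units} search units, {rate_per_asic / 1e6:.0f}M keys/sec per ASIC")
print(f"peak {peak_rate / 1e9:.2f}G keys/sec, average search {average_days:.2f} days")
```

The per-ASIC rate of 60M keys/sec agrees with the EFF entry in the table of hardware key search machines, and the gap between the peak rate and the quoted 92.6 billion keys/sec reflects the faulty search units.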

A flexible plaintext recognition scheme was used that allows selective matching against certain characters, as well as specialised modes for the Blaze Challenge [23]. This allowed the machine to conduct ciphertext-only attacks.

Kocher further elaborated on the technical problems inherent with building such a large machine. Power and heat issues were the main ones dealt with. The ASICs used had to be produced successfully with a single attempt, leading to a number of design compromises. Had this requirement been lifted, both the performance and correctness of the design could have been improved significantly.

Many of the political issues involved with DES were also covered. These focused primarily on the disparity between what government officials report and what the DES cracking machine is capable of.

Small-scale key search machines

Many small key search engines have been produced in an attempt to gauge how processing power has changed with time. These are all based on reconfigurable hardware (FPGAs or CPLDs).

Hamer and Chow implemented a DES key search machine on the Transmogrifier 2a, a system containing 32 linked Altera FPGAs [19]. Their design features a long DES pipeline and an LFSR key generator design that minimises the need for key schedule logic. Each FPGA ran at 25MHz, giving an aggregate search rate of 800Mkeys/sec.
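
To show why an LFSR is attractive as a key generator, the sketch below steps a toy 4-bit LFSR through its states. The register width, tap positions and function names are assumptions made for illustration and are not taken from [15] or [19]; a DES search would use a 56-bit register. Each step needs only a shift and one XOR, and a maximal-length LFSR visits every non-zero value exactly once.

```python
def lfsr4_step(state):
    """Advance a 4-bit Fibonacci LFSR by one step (feedback = bit 0 XOR bit 1)."""
    feedback = (state ^ (state >> 1)) & 1
    return (state >> 1) | (feedback << 3)

def lfsr_keys(seed=1):
    """Yield every state the LFSR visits, starting at seed, until it would repeat."""
    state = seed
    while True:
        yield state
        state = lfsr4_step(state)
        if state == seed:
            return

keys = list(lfsr_keys())

# All 15 non-zero 4-bit values appear exactly once, in a scrambled order.
assert sorted(keys) == list(range(1, 16))
# The all-zero value is a fixed point of the LFSR and has to be tested separately.
assert lfsr4_step(0) == 0
```

The order of the candidate keys does not matter for exhaustive search, so the scrambled sequence costs nothing.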

Tsoi, Lee and Leong implemented an RC4 key search machine [21] using a Pilchard board. Their design used 96 search units running at 50MHz to achieve a total rate of 6.06Mkeys/sec. Not all of the FPGA resources were used; the number of search units was limited by the number of RAM blocks available. The FPGA implementation ran approximately 58 times faster than a software implementation running on a Pentium 4 1500MHz. Kundarewich, Wilton and Hu also implemented an RC4 key search machine using Altera CPLDs, and obtained a search rate of 40kkeys/sec [20]. Brief cost and performance comparisons were carried out.

Deeper investigation into architectural decisions was made by Kaps and Paar [18]. They explored the idea of an algorithm independent key search machine on an FPGA, focusing on DES. Several architectural options for DES were investigated and implemented on Xilinx FPGAs. Their key search design would be capable of 6.29Mkeys/sec.

Goldberg and Wagner performed an analysis of RC4, A5, DES and CDMF implementations on a CPLD board [16]. The performance of a variety of CPLDs was compared with their cost, noting that low end CPLDs generally give the best price/performance ratio. Comparisons were also made against equivalent software implementations. The RC4 cipher was found to be faster in software than hardware, the opposite result to that of [21] and [20].

Specialised key search machines

Clayton and Bond exploited a variety of protocol flaws to successfully attack a security module that was previously used in ATMs [24]. They were able to successfully recover 3DES keys from the device with the assistance of a cheap FPGA board. By implementing a more practical attack they were able to learn more about the difficulties and benefits of working within a real environment.

Pornin and Stern attacked A5/1 using a combination of software and hardware approaches [25]. Software was used to reduce the search space of initial states, while hardware was used to conduct an exhaustive search over this subspace. A board containing four Xilinx 4010E FPGAs was used in conjunction with a 500MHz Alpha workstation. Each FPGA contains 12 search units, each checking one initial state every 65 cycles. The FPGAs were clocked at 50MHz, giving a total search rate of 37 million initial states per second. Using two boards with one workstation allowed an initial state to be determined in 2.5 days on average, far faster than exhaustive key search alone. Keller and Seitz took a more analytical approach by using backtracking to reduce the search space [7]. Their implementation was performed on a Xilinx XC4062 FPGA.

Software key search

Distributed computing

Several organisations have implemented software to conduct distributed key search attacks against ciphers using network-connected hosts. distributed.net [26] is the largest and most well-known of these. Others include DESCHALL [27] and SolNET [28]. The basic idea is the same: run a piece of software on many hosts and coordinate their efforts with a central server. The software is configured to run during idle time on the hosts. Buffering schemes allow hosts to continue working on their part of the task when not connected to a network. Regardless of the precise task being performed, work is usually divided into “blocks” which take a (relatively) short period of time to complete. A client connects to the server to be allocated a number of blocks and does not communicate again until those blocks are complete.
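
The block allocation scheme can be sketched as follows. This is a toy model under stated assumptions, not the protocol of any of the projects above: the `BlockServer` class, the `client_work` function, the 16-bit key space and the block size are all hypothetical, and the `is_correct_key` callback stands in for a real trial decryption.

```python
from collections import deque

class BlockServer:
    """Hypothetical central server that hands out blocks of the key space."""

    def __init__(self, key_bits, block_bits):
        self.block_size = 1 << block_bits
        # Block indices that have not yet been issued to any client.
        self.pending = deque(range(1 << (key_bits - block_bits)))
        self.completed = set()

    def allocate(self, count):
        """Give a client up to `count` blocks to work on while disconnected."""
        batch = []
        while self.pending and len(batch) < count:
            batch.append(self.pending.popleft())
        return batch

    def report(self, block_index):
        """Record that a client has finished a block."""
        self.completed.add(block_index)

    def key_range(self, block_index):
        start = block_index * self.block_size
        return range(start, start + self.block_size)

def client_work(server, blocks, is_correct_key):
    """Search the allocated blocks, then report them back in one go."""
    found = None
    for block in blocks:
        for key in server.key_range(block):
            if is_correct_key(key):
                found = key
        server.report(block)
    return found

server = BlockServer(key_bits=16, block_bits=8)    # toy key space: 256 blocks of 256 keys
secret = 0xBEEF                                    # stand-in for the unknown key
found = None
while found is None:
    batch = server.allocate(4)
    if not batch:
        break
    found = client_work(server, batch, lambda k: k == secret)

print(hex(found), len(server.completed), "blocks completed")
```

A real deployment would also need to reissue blocks that are never reported back, which this sketch omits.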

These efforts have been quite successful so far. distributed.net has successfully completed RSA Data Security’s RC5-64, RC5-56 and DES II-1 challenges [29], as well as a similar challenge from CS Communications & Systems [30]. They completed the DES-III challenge with the help of the EFF DES cracker. DESCHALL completed the DES-I challenge. A group headed by Germano Caronni and containing 3500 computers completed the RC5-48 challenge. Ian Goldberg used 250 computers to complete RC5-40 [31].

References

[1] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, 2nd ed. John Wiley & Sons, Inc., January 1996.

[2] M. Blaze, W. Diffie, R. L. Rivest, B. Schneier, T. Shimomura, E. Thompson, and M. Wiener, “Minimal key lengths for symmetric ciphers to provide adequate commercial security,” A Report by an Ad Hoc Group of Cryptographers and Computer Scientists, January 1996. [Online]. Available: http://www.schneier.com/paper-keylength.pdf

[4] J. J. G. Savard. A cryptographic compendium. [Online]. Available: http://home.ecn.ab.ca/~jsavard/crypto/intro.htm

[5] B. Schneier, “Description of a new variable-length key, 64-bit block cipher (Blowfish),” in Lecture Notes in Computer Science, no. 809. Springer-Verlag, 1994, pp. 191-204. [Online]. Available: http://www.schneier.com/paper-blowfish-fse.html

[6] R. L. Rivest, “The RC5 encryption algorithm,” in Practical Cryptography for Data Internetworks, W. Stallings, Ed. IEEE Computer Society Press, 1996.

[7] J. Keller and B. Seitz, “A hardware-based attack on the A5/1 stream cipher,” in APC 2001. VDE Verlag, 2001, pp. 155-158. [Online]. Available: http://www.informatik.fernuni-hagen.de/ti2/papers/apc2001-final.pdf

[8] National Security Agency. (1998, May) Skipjack and KEA algorithm specifications. [Online]. Available: http://csrc.nist.gov/encryption/skipjack/skipjack.pdf

[9] E. Biham, “A fast new DES implementation in software,” Lecture Notes in Computer Science, vol. 1267, pp. 260 ??, 1997. [Online]. Available: http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-get.cgi/1997/CS/CS08%91.ps.gz

[10] Electronic Frontier Foundation, Cracking DES. O’Reilly, 1998.

[11] Xilinx, Inc., “Virtex-II Pro complete data sheet,” September 2003, http://direct.xilinx.com/bvdocs/publications/ds083.pdf

[13] W. Diffie and M. E. Hellman, “Exhaustive cryptanalysis of the NBS data encryption standard,” in Computer, June 1977, vol. 10, no. 6, pp. 74-84.

[14] R. McLaughlin, “Yet another machine to break DES,” Cryptologia, vol. 16, no. 2, pp. 136-144, April 1992.

[15] M. J. Wiener, “Efficient DES key search,” in Practical Cryptography for Data Internetworks, W. Stallings, Ed. IEEE Computer Society Press, 1996, pp. 31-79.

[16] I. Goldberg and D. Wagner, “Architectural considerations for cryptanalytic hardware,” CS252 Report, 1996. [Online]. Available: http://www.cs.berkeley.edu/~iang/isaac/hardware/paper.ps

[17] M. J. Wiener, “Efficient DES key search - an update,” in Cryptobytes, RSA Laboratories, Ed., 1997, vol. 3, no. 2, pp. 6-8. [Online]. Available: ftp://ftp.rsasecurity.com/pub/cryptobytes/crypto3n2.pdf

[18] J.-P. Kaps and C. Paar, “Fast DES implementation for FPGAs and its application to a universal key-search machine,” in Selected Areas in Cryptography, 1998, pp. 234-247.

[20] P. D. Kundarewich, S. J. Wilton, and A. J. Hu, “A CPLD-based RC4 cracking system,” in Canadian Conference on Electrical and Computer Engineering, 1999. [Online]. Available: http://www.ee.ubc.ca/~stevew/papers/pdf/ccece99.pdf

[22] P. Kocher, “Breaking DES,” in Cryptobytes, RSA Laboratories, Ed., 1999, vol. 4, no. 2, pp. 1-5. [Online]. Available: ftp://ftp.rsasecurity.com/pub/cryptobytes/crypto4n2.pdf

[23] M. Blaze. (1997, June) A better DES challenge. [Online]. Available: http://www.privacy.nb.ca/cryptography/archives/cryptography/html/1997-0%6/0127.html

[24] R. Clayton and M. Bond. Experience using a low-cost FPGA design to crack DES keys. [Online]. Available: http://www.cl.cam.ac.uk/users/rnc1/descrack/DEScracker.html

[25] T. Pornin and J. Stern, “Software-hardware trade-offs; application to A5/1 cryptanalysis,” in Lecture Notes in Computer Science, ser. CHES 99. Springer-Verlag, 2000, pp. 318-327. [Online]. Available: http://www.di.ens.fr/~stern/data/St91.pdf

[26] (2003, October) distributed.net: Node Zero. [Online]. Available: http://www.distributed.net/

[27] C. M. Curtin. (1998, June) DESCHALL. [Online]. Available: http://www.interhack.net/projects/deschall/

[29] The RSA Laboratories Secret-Key Challenge. RSA Security. [Online]. Available: http://www.rsasecurity.com/rsalabs/challenges/secretkey/index.html

[30] (2003, October) distributed.net: Project CSC. [Online]. Available: http://www.distributed.net/csc/

[31] (1997, January) 40-bit crypto proves no problem. [Online]. Available: http://news.com.com/2100-1017-266268.html?legacy=cnet

Analysis

ian@mutexlabs.com (Ian Howson) — Mon, 20 Oct 2003 00:00:00 +0000

Factors affecting exhaustive key search

Many factors affect the time taken to conduct an exhaustive key search attack:

Key length. Increasing the key length used by a cipher will dramatically increase the size of the key space and hence the time required. Using long keys is by far the most effective countermeasure to an exhaustive key search attack.
Available resources. An attacker with more resources (money, computational power, people) can sweep the key space more rapidly.
Cipher design. The design of a cipher has a strong influence on how long an exhaustive key search will take. Several factors contribute to this:
- Key setup time. In normal use, a cipher’s key setup time is usually negligible. When conducting a key search attack, the key setup phase must be performed for every key. Long key setup times can frustrate key search attacks.
- Cipher operations. Some operations will take longer to perform than others. Some ciphers are designed to use operations that are fast on one particular technology and slow on another.
- Cipher speed. On identical implementation technologies, some ciphers can encrypt or decrypt more quickly than others.
- Frequency of register access. Pipelined designs can achieve significant resource savings if registers are accessed infrequently (discussed in Estimating FPGA resource usage for pipelined cipher implementations). Iterative designs often have fewer constraints on their RAM or register access.
Availability of accurate plaintext and ciphertext pairs. Possessing only ciphertext or incomplete ciphertext and plaintext may reduce the speed and accuracy of the exhaustive key search attack. Ideally, several perfect pairs of ciphertext and plaintext should be available. If many pairs are available, time-memory tradeoff attacks may be more appropriate for some ciphers.

Software factors

The most significant factor that affects the time required to conduct a key search using CPUs is the word length of the cipher. The speed of the cipher tends to be highest on a CPU whose word length matches that of the cipher. All processing must be performed in units of this word length.

Ciphers that have very small states may be able to operate completely within CPU registers, which will improve performance. Smaller speed benefits will arise if the state is small enough to fit inside the highest-level cache. Very few ciphers have state sizes over a few thousand bytes.

Some cipher implementations may exchange storage space for processing power by precomputing values. One software implementation of A5/1 [44] exploits this by precomputing all possible values of the individual LFSRs. Instead of performing the normal shift and XOR operations, output is generated by modifying pointers to the precomputed values. This requires almost 128MB of memory, but greatly improves the performance of A5/1 in software.

FPGA factors

The most significant factor affecting the speed of exhaustive key search on FPGAs is whether the cipher can be implemented as a long pipeline as opposed to an iterative structure. Pipelined structures make far more efficient use of the FPGA resources and can achieve higher clock speeds. Any cipher can be implemented as a long pipeline, but the required FPGA resources are often prohibitive. Generally, a long pipeline will check one key every clock cycle.

Operation performance

FPGAs

Operation performance on FPGAs is influenced by the time required to complete the operation and the resource usage. Many operations can be completed more quickly by using more resources. Conversely, using fewer resources allows a greater level of parallelism, increasing overall performance. Finding the correct balance is difficult and may require several attempts at implementation.

The time and space requirements of a number of operations are given in the table LUTs and time required per bit for cipher operations. Not all operations are covered, and optimisations using RAM or other FPGA features are ignored. In particular, large data-dependent table lookups will be more efficiently performed using the RAM blocks contained within most Xilinx FPGAs. Similarly, multiplications may be better performed using onboard Virtex II, Virtex II Pro or Spartan 3 multiplier resources.

Addition and subtraction are dependent on the width of the data being added, due to the Xilinx carry chain. Variable rotation is achieved using the scheme in [39], and assumes the use of Xilinx slice multiplexers. Data-dependent table lookup does not assume this because each Xilinx family has different multiplexer resources available which will affect the structures used for implementation.

Operation	LUTs	Time (LUT depth)
Bit permute	0	0
Fixed rotate or shift	0	0
XOR (up to 4 inputs)	1	1
Data-dependent table read or write (1-4 address bits)	1	1
Data-dependent table lookup ($w$ address bits, $w>4$)	$2^{w-3}-1$	$w-3$
Add or subtract (two inputs)	1	Depends on data width
Variable rotate or shift over $w$ bits	$\log_{2}w$	$\frac{\log_{2}w}{2}$
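
The LUT column of this table can be written as a small lookup function, sketched below. The function name and the operation labels are hypothetical; the costs are those listed in the table. As a usage example, the per-round LUT count for the RC5 key mixing round analysed later (four adds, a fixed rotation and a variable rotation over 32 bits) is computed.

```python
import math

def luts_per_bit(op, w=None):
    """LUTs needed per bit of result for one operation, following the table above.

    `w` is the number of address bits for a table lookup, or the word width
    for a variable rotate or shift.
    """
    if op in ("permute", "fixed_rotate"):
        return 0                      # pure wiring, no logic
    if op in ("xor", "add"):
        return 1                      # XOR of up to 4 inputs, or a two-input add
    if op == "table":
        return 1 if w <= 4 else 2 ** (w - 3) - 1
    if op == "variable_rotate":
        return math.log2(w)
    raise ValueError(f"unknown operation: {op}")

# RC5 key mixing round: four adds, one fixed rotate, one variable rotate, all 32 bits wide.
round_ops = [("add", None)] * 4 + [("fixed_rotate", None), ("variable_rotate", 32)]
r = 32 * sum(luts_per_bit(op, w) for op, w in round_ops)
print(r)
```

The time column could be added in the same way, although the depth of an add depends on the data width and would need a separate model of the carry chain.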

CPUs

In contrast with FPGAs, storage space is not a concern for software implementations. When the word size of the operation inputs is equal to or less than that of the CPU, most modern CPUs can complete any operation very quickly. The operations being performed thus become less important.

Word size is the major factor when determining operation speed on CPUs. If the word size of the operation is greater than that of the CPU, performance will generally be halved (or slightly worse).

If the word size of the operation is less than that of the CPU, processing resources are being wasted. We then need to examine the algorithm to determine if any parallelism can be applied to make the best use of the resources that are available. For example, it may be possible to pack a number of short-word operands into one word and perform an operation over a number of words simultaneously. Bitslicing [9] is another approach that is used to accelerate single-bit operations on CPUs with a wide native word length.
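
As a minimal sketch of operand packing, the code below adds four independent 8-bit values with a single 32-bit addition. The lane width, the masks and the function names are assumptions chosen for illustration. Python integers are unbounded, so a mask is used to emulate a 32-bit register.

```python
import random

WORD_MASK = 0xFFFFFFFF   # emulate a 32-bit CPU register
LOW7 = 0x7F7F7F7F        # low seven bits of each 8-bit lane
HIGH1 = 0x80808080       # top bit of each 8-bit lane

def pack_bytes(values):
    """Pack four 8-bit operands into one 32-bit word (operand i in byte i)."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word

def unpack_bytes(word):
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def packed_add8(a, b):
    """Four independent additions modulo 256 using one full-width addition."""
    # Adding only the low 7 bits of each lane means a carry can reach bit 7
    # of its own lane but can never spill into the neighbouring lane.
    low = (a & LOW7) + (b & LOW7)
    # The top bit of each lane is then the XOR of both top bits and that carry.
    return (low ^ ((a ^ b) & HIGH1)) & WORD_MASK

xs = [250, 3, 128, 255]
ys = [10, 4, 128, 1]
result = unpack_bytes(packed_add8(pack_bytes(xs), pack_bytes(ys)))
assert result == [(x + y) % 256 for x, y in zip(xs, ys)]
```

Operations without carries, such as XOR, need no masking at all and can be applied to the packed word directly.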

Estimating FPGA resource usage for pipelined cipher implementations

Introduction

It is useful to determine whether a pipelined cipher implementation is feasible before beginning work on the implementation itself. This section describes a method that can be used to estimate the resource usage of a cipher given details of its algorithm.

Assumptions

A number of assumptions are made to facilitate this analysis:

FPGAs are composed of many logic cells (LCs). A logic cell is composed of a four-input look-up table (LUT) and a flip-flop (FF). This is similar to the LC layout used in most Xilinx and Altera FPGAs, and is shown in the figure Simplified FPGA logic cell.
The state size remains constant throughout the key setup and decryption phases. This is rarely true, but does not significantly affect the results. The analysis can be extended to handle changing state sizes.
The key setup and decryption phases are composed of a number of rounds. Each round performs the same operations. The logic used to perform each round maps the state from round $r$ to round $r+1$.

Simplified FPGA logic cell

Factors

To estimate the quantity of resources required, we consider:

The size of the state during key setup and decryption
The operations that are performed during key setup and decryption
What quantity of the state is modified during each step or round of key setup and decryption

The final result will specify the number of LCs required for a fully pipelined implementation of the cipher. The mapping from LC to real FPGA resources is generally trivial. A Virtex, Spartan IIE, Virtex II or Virtex II Pro slice maps to two LCs. Spartan 3 devices have fewer effective LCs because half of the LUTs on the device have reduced functionality and do not support usage as a RAM or a shift register.

Method

We define a number of variables, shown in the table below.

$s$	State size in bits
$n$	Number of rounds needed to complete a phase
$r$	Number of LUTs required to perform a round
$m$	Number of bits of state modified during a round
$c$	Number of LCs required to perform a phase

The state size $s$ is the number of bits that need to be stored between rounds. This includes all registers and modifiable lookup tables used in a phase. Usually this figure can be determined by summing the total number of bits of storage used by the cipher.

The number of rounds $n$ is usually specified by the cipher. Modelling the cipher in this way forces each round to complete in one clock cycle, which may result in very long combinational delays for some ciphers. This will be discussed further below.

The number of LUTs $r$ required to perform a round is determined by examining the operations performed during that round and the number of bits that these operations are performed on. These results are summarised in the table LUTs and time required per bit for cipher operations. Another strategy would be to implement the round function and use synthesis results to estimate the resource usage.

If a state bit is modified during a round, its result can be stored in the same LC as the LUT that modified it. A dedicated LC is needed otherwise. This gives us value $m$.

The number of LCs needed to complete a cipher phase is thus:

$$c=n(r+s-m) \text{ (equation 4.1)}$$

We can then compare the number of LCs in the result to the number of LCs in a device to determine whether a pipelined implementation is feasible. Alternatively, we can use the LC estimate to determine what the smallest device needed is. The LC figure can also be used as a “difficulty rating”; ciphers with high LC counts will generally take more FPGA resources and time to attack.
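
Equation 4.1 and the feasibility check can be sketched in a few lines. The function names are illustrative, and the device capacity used below is a hypothetical figure rather than the size of any particular part. The input values are those of the RC5 key mixing phase from the example later in this section.

```python
def lc_count_pipelined(n, r, s, m):
    """Equation 4.1: c = n(r + s - m).

    n: rounds in the phase, r: LUTs per round,
    s: state size in bits, m: state bits modified per round.
    """
    if m > s:
        raise ValueError("cannot modify more state bits than exist")
    return n * (r + s - m)

def fits(lc_estimate, device_lcs):
    """True if a fully pipelined implementation fits in the device."""
    return lc_estimate <= device_lcs

# RC5 key mixing phase: n = 78, r = 288, s = 928, m = 64.
c = lc_count_pipelined(n=78, r=288, s=928, m=64)
print(c)

HYPOTHETICAL_DEVICE_LCS = 50_000     # assumed capacity, for illustration only
print(fits(c, HYPOTHETICAL_DEVICE_LCS))
```

Each round pays for its logic ($r$) plus one LC for every state bit that is only carried forward ($s-m$), which is why ciphers with a large, rarely modified state are expensive to pipeline.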

Optimisations

Multiplexers

Each Xilinx slice contains a number of additional multiplexers which can reduce the number of LCs needed for large table lookups.

Shift registers

Xilinx LUTs can also operate as shift registers [36]. This is very useful when a state bit is not modified for one or more clock cycles. Instead of chaining FFs together, a single LUT can replace up to 16 FFs. Analysis of the data dependencies in the cipher algorithm can provide justification to reduce the effective $s$ value significantly. Synthesis tools will sometimes perform this optimisation automatically. We can then derive an alternative equation for the number of LCs required:

$$c=S+nr \text{ (equation 4.2)}$$

where $S$ is the number of LCs used to store the state over the entire pipeline. Implementations using shift registers for storage cannot pack the shift registers into the same LC as the LUT performing the calculation, since the shift register uses the LUT. All state that is modified can be latched in the same LC as the LUT that performed the calculation. We thus do not require variable $m$.

Short pipeline stages

Pipelined implementations on FPGAs can be sped up by making each pipeline stage as short as possible. No additional cost is incurred when using the latches included in LCs, making very deep pipelines at high speeds a good design approach. Instead of completing an entire round in a single clock cycle, it is usually more efficient to break up the round and complete it over more clock cycles (at a higher clock rate). The design will still check an average of one key per clock cycle at the higher clock rate.

Splitting up the pipeline stages requires analysis of the data dependencies within each round, and would significantly complicate this analysis.

Using short pipeline stages will increase the number of LUTs that will need to be used as shift registers, and hence increase overall resource usage slightly. At worst, doubling the clock rate of a design will double the number of LCs used only for data storage. Smaller penalties will be incurred if shift registers are less than half full.

Some operations (particularly boolean operations like XOR) can often be collapsed into other operations, particularly if there are spare LUT input lines. This is highly dependent on the cipher algorithm.

RC5 example

Rivest describes the RC5 algorithm in [6]. It consists of three phases: initialisation, key mixing and decryption. The initialisation phase is ignored because it can be trivially precomputed and will not consume FPGA resources when implemented in this way.

Key mixing

The key mixing phase of RC5-32/12/9 uses an S array of 26 words and an L array of 3 words. Each word is 32 bits wide, giving a total state size ($s$) of 928 bits. 78 rounds ($n$) are needed. Each round modifies 64 bits of state ($m$).

To determine $r$, we examine the operations being performed. All of the table lookups operate in a predictable order, removing the need for additional multiplexers. Each round contains five adds, a fixed rotation and a variable rotation, all over 32 bits. One of the adds ($A+B$ in the second line) is performed twice, and can be ignored. This gives an $r$ value of $32(4\times1+0+\log_{2}32)=288$ and a final $c$ value of $78(288+928-64)=89856$. This LC count, equivalent to 44928 slices, can only be achieved in very large FPGAs. The estimated $r$ value is close to the value of 256 obtained in the trial implementation.

Significant resource savings can be achieved by recognising that each bit in the S array is only accessed once every 26 rounds, and each bit in the L array is only accessed once every 3 rounds. We can thus collapse a significant number of the LCs used purely for storage into shift register LUTs. 7488 LCs are used to store the L array as it travels through the pipeline; this can be reduced to 2496. 64896 total FFs are used to store the S array. Two shift registers are needed to provide the 25 cycle delay on each bit, and each bit is accessed three times. The total number of LCs needed is thus $26\times32\times2\times3=4992$. This demonstrates that the frequency of register access has a significant bearing on the efficiency of cipher implementations on FPGAs. The S array stores almost 9 times as much data as the L array, but requires only twice the resources.
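
The storage figures above can be reproduced with the arithmetic below. The text gives only the result of 2496 LCs for the L array; the factorisation used here (one shift register LUT per bit for each of the 26 accesses to an L word) is an assumption that happens to reproduce it. The constant names are illustrative.

```python
import math

WORD_BITS = 32
S_WORDS, L_WORDS = 26, 3
ROUNDS = 78
SRL_DEPTH = 16            # one LUT used as a shift register replaces up to 16 FFs

# Storage if every state bit is held in a flip-flop at every pipeline stage.
l_array_ffs = L_WORDS * WORD_BITS * ROUNDS
s_array_ffs = S_WORDS * WORD_BITS * ROUNDS

# S array: each bit waits 25 cycles between accesses and is accessed three times.
s_luts_per_wait = math.ceil(25 / SRL_DEPTH)
s_array_lcs = S_WORDS * WORD_BITS * s_luts_per_wait * (ROUNDS // S_WORDS)

# L array (assumed breakdown): each bit waits 2 cycles between its 26 accesses.
l_luts_per_wait = math.ceil(2 / SRL_DEPTH)
l_array_lcs = L_WORDS * WORD_BITS * l_luts_per_wait * (ROUNDS // L_WORDS)

print(l_array_ffs, "->", l_array_lcs)
print(s_array_ffs, "->", s_array_lcs)
print(f"S holds {S_WORDS / L_WORDS:.2f}x the data in {s_array_lcs / l_array_lcs:.0f}x the LCs")
```

The long 25-cycle waits in the S array fill the shift registers far better than the 2-cycle waits in the L array, which is where the difference in efficiency comes from.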

Using Equation 4.2, we obtain an LC count of 22464 for the key mixing phase of RC5.

The resource usage can be further optimised by noticing that elements of the S array are initialised sequentially, and so not all values need to be stored until 26 rounds have been completed.

Decryption

The decryption phase of RC5 uses the same S array as the key mixing phase, but does not require the L array. Each round consists of two half-rounds which are identical except for the source and destination of the results. We can use this property to double the number of rounds and halve the number of LCs required per round. It consists of 22 half-rounds, each of which contains a subtraction, a variable rotation and an XOR. All operations are performed on 32 bit words. Two additional subtractions are performed after the half-rounds. We thus have $s=26\times32=832$, $n=22$, $r=32(1+5+1)=192$ (which matches the trial implementation) and $m=64$. This gives $c=22(192+832-64)=21120$.

Again, significant resource savings can be achieved by using shift registers for storage. Each word in the S array is only accessed once and is never written to, so we need only provide storage for the period between the start of the decryption phase and the point where it is accessed. At worst, this will be 26 rounds, requiring two shift registers per bit. There are 26 words storing 32 bits each, giving an LC count of $26\times32\times2=1664$. Applying Equation 4.2 again gives $c=1664+24\times192=6272$. The final subtractions require an additional 64 LCs, giving a total of 6336 LCs.

Results

Using the figures determined from Equation 4.1, we obtain a total LC count of 110976. This translates to 55488 slices, barely fitting within the largest Virtex II Pro device that is planned for production (and which is not yet even shipping).

Using the Xilinx-optimised figures from Equation 4.2 gives a total LC count of 28800. This converts to a far more practical 14400 slices, within the capacities of many larger devices.
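The totals above can be checked with a short sketch. It assumes that Equation 4.1 has the form $c=n(r+s-m)$, which is the form implied by the worked figures for both phases, and that one slice holds two LCs, as implied by the conversions quoted. The Equation 4.2 figures are taken directly from the text rather than recomputed.

```python
LCS_PER_SLICE = 2  # assumed from the 110976 LC -> 55488 slice conversion


def lc_estimate(n: int, r: int, s: int, m: int) -> int:
    """Assumed form of Equation 4.1: c = n(r + s - m)."""
    return n * (r + s - m)


key_mixing = lc_estimate(n=78, r=288, s=928, m=64)
decryption = lc_estimate(n=22, r=192, s=832, m=64)

total_eq41 = key_mixing + decryption
total_eq42 = 22464 + 6336  # optimised figures as quoted in the text

print(key_mixing, decryption)                    # 89856 21120
print(total_eq41, total_eq41 // LCS_PER_SLICE)   # 110976 55488
print(total_eq42, total_eq42 // LCS_PER_SLICE)   # 28800 14400
```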

It should be noted that the figures generated by this analysis technique tend to be conservative and ignore many potential resource optimisations. It also ignores issues of pipelining within the cipher rounds which are difficult to deal with in such a general sense. Both of these areas can provide significant resource and speed advantages in an actual implementation.

FPGA price/performance comparison

Pricing data for Virtex E, Spartan IIE, Virtex II and Virtex II Pro FPGAs in quantities of 25-99 was obtained from Avnet’s website [45], and is current as of 15 October, 2003. The XC2V40 and XC2V80 device pricing is for a quantity of 100 or more.

Pricing data for Spartan 3 devices was obtained from Ernest Peltzer [46] of Sensory Networks, and consists of projected prices for Q1 2004 in quantities of 100 or more. Spartan 3 devices only started shipping very recently and so pricing data is both difficult to obtain and very likely to change.

This analysis assumes that the cipher module is the slowest part of the total key search machine. It ignores resources that would be dedicated to the PC interface, but includes those that are required for each search unit. Interface overheads are ignored because in a large-scale design each FPGA does not need its own PC interface and controller; the search bus can be connected directly to FPGA pads.

Speed grades and packaging

Each FPGA is available in a variety of speed grades and several packaging options. In general, the slowest speed grade and the smallest package give the best price/performance ratio. Low FPGA speeds can be compensated for by using more FPGAs, and packaging is not important since only a few I/O lines are needed. This greatly simplifies the analysis by allowing a large percentage of the FPGA devices available to be ignored.

Families

The maximum attainable speed and resource usage are the same within each FPGA family. This allows performance estimates to be generated with far less effort. Synthesis estimates for the DES and RC5 search units are used. These are less accurate than those obtained after place and route, but remain valid across different capacity FPGAs within a family. These results are summarised in FPGA price/performance tables. Search unit resource costs are estimated in the same way.

Relative clock speed across FPGA families

The maximum clock speed for a given design varies greatly depending on the FPGA family used. Interestingly, the current “budget” family (Spartan 3) achieves the highest clock rates. This can be explained by their 90nm manufacturing process; the Virtex II and Virtex II Pro use 150nm and 120nm processes.

RC5 requires half as many RAM blocks on a Virtex II, Virtex II Pro or Spartan 3 as it does on a Virtex E or Spartan IIE. This is because the RC5 implementation needs a 32 bit wide RAM, and the Virtex E and Spartan IIE RAM is only 16 bits wide.

DES

FPGA price/performance for DES

FPGA price/performance for DES, showing low-end detail

The above figures show the price and performance of each device in each Xilinx FPGA family for the DES cipher. The second shows the same data, but zoomed to show detail for the low-end FPGAs. “Kinks” in the graph appear where moving to the next largest device does not improve performance (since there are not enough available resources to add another search unit). Smaller kinks are visible when moving to the next largest package.

We can see that Spartan 3 FPGAs give by far the best performance for a given price. This is not surprising; they are positioned as a budget FPGA and can achieve higher clock rates than the other families being considered. Again, it should be pointed out that they have only recently started shipping and pricing will likely be volatile for some time. The pricing figures used for the analysis were also projected figures for Q1 2004 and for a larger quantity than the other families.

Of the mature families, Spartan IIE devices give the best price/performance ratio. Their performance is quite limited, however. Virtex II Pro devices can achieve spectacularly high search rates, but at a high cost per device.

The figure below shows the price/performance ratio achieved by each device in the Virtex II Pro family. The device with the best ratio is the XC2VP20, with the XC2VP30 and XC2VP40 close behind. In a real system where PCB costs and assembly have to be taken into account, it may be worthwhile purchasing a smaller number of faster FPGAs with a worse price/performance ratio. These three devices use the FG676 package; the jump in price to the XC2VP50 can be explained by the larger package (FF1152). The XC2VP100 has a far worse ratio than the others, and a much larger package (FF1696). Like the Spartan 3 devices, it has only recently started shipping and may still have unstable pricing.

Virtex II Pro price/performance ratio by device for DES

RC5

FPGA price/performance for RC5

Relative FPGA price and performance for RC5 is shown above. It is similar to that for DES, but with a smaller gap between the Virtex E and Virtex II/Virtex II Pro families. Detail near the low end is also very similar. Relative pricing and performance within each family remain the same as for DES.

RC4

FPGA price/performance for RC4

FPGA price/performance for RC4, showing low-end detail

RC4’s performance is entirely constrained by the number of RAM blocks available on the FPGA. This gives quite different price/performance results. From the figures, we can see that the Virtex E and Virtex II families are quite similarly placed. Virtex II Pro devices perform much better when cost is taken into account. Examining the low-end detail shows that the Spartan IIE family remains competitive for far longer than the Spartan 3, in contrast with the other results. Again, the XC2VP20 remains the most cost-effective choice in the Virtex II Pro family.

CPU price/performance comparison

CPU pricing data was obtained from Sastradi Satria of OnLine Centre [47]. This is for quantities of 10 and is specified in AUD without GST. This is useful for comparing CPUs, but makes comparing price/performance ratios between CPUs and FPGAs difficult. Several other pricing sources were located but not used due to accuracy or quantity issues.

Benchmark results are listed in CPU benchmark results, and should be interpreted taking into account the problems noted in Software benchmark results. These results were scaled by the clock speed to obtain performance estimates for each CPU that is currently being sold. This assumes that performance will scale linearly with CPU speed, which is generally true for exhaustive key search.

When comparing performance between CPUs the SolNET benchmarks were used because they have been performed on a wider variety of CPUs. They are not directly applicable to CPU to FPGA comparisons.

No pricing data was available for mobile Intel CPUs (PIII-M, P4-M, Centrino). This would be useful when considering a very large-scale key search machine based on CPUs; these CPUs use much less power and generate far less heat. PIII-M, Centrino and G3 are particularly interesting due to their high benchmark results at comparatively low clock rates.

These comparisons ignore the cost of support hardware, which can be expected to be several times that of the CPU device itself in some cases.

DES

The figures below show the price and performance of each CPU family for the DES cipher. We can see that the CPUs within a family that achieve the highest search rates are disproportionately expensive. It is scarcely worth trying to achieve a search rate over 12Mkeys/sec with an Athlon XP or 11Mkeys/sec with a Pentium 4 because the price increases so steeply.

CPU price/performance by family for DES

CPU price/performance by device for DES

The Athlon XP curve contains a number of kinks; these occur because pricing increases with their performance rating. This performance rating is not in line with actual performance, however – Barton core Athlons have a higher performance rating than their clock speed (and measured performance) would suggest. We can see that the Duron line appears to fit reasonably well with the Athlon XP line. Celeron CPUs achieve higher performance than their price would otherwise suggest.

Examination of the data shows that the two Duron data points have a linear price/performance relationship. In most practical systems, it would thus be best to choose the faster of the two in order to save on auxiliary costs (support hardware, space, etc.)

The second figure above shows the price/performance ratio for each device under consideration. We can see that the slowest device in each family generally gives the best ratio. The Durons have exactly the same price/performance ratio. The Athlon XP 2200+ and Celeron 2200 provide a marginally better ratio than their neighbours. All Pentium 4 devices are quite expensive for the performance that they give. The exact ratio needed for a large-scale machine will depend on the price of the support hardware, but in general any Duron, Celeron or Athlon XP up to 2600+ will provide a good price/performance ratio.

RC5

CPU price/performance by family for RC5

CPU price/performance by device for RC5

The top figure shows the price/performance ratios of each CPU family for RC5. We can see that the Celeron and Pentium 4 families are far less competitive for RC5; the most expensive Pentium 4 HT device barely outperforms the cheapest Duron! The Duron and Athlon XP families remain similarly positioned relative to each other. The same kinks in the Athlon XP curve are apparent.

The bottom figure shows the price/performance ratios of each device for RC5. As with DES, the cheapest device in each family provides the best ratio (with the minor exceptions noted for DES). Unlike DES, however, the Celeron family is no longer as competitive. Key search machine designers would do best to select the Duron 1600 or a low-end Athlon XP. Both Pentium 4 varieties remain very expensive for their performance.

Technology comparison

CPUs and FPGAs

Ciphers

The pricing and performance data for CPUs and FPGAs is not directly comparable. CPU prices are given in AUD for quantities of 10; FPGA prices are given in USD for quantities of 25-99. The CPU performance is based on benchmark results, while the FPGA performance is based on synthesis estimates.

Nonetheless, we can scale the CPU pricing based on the current exchange rate, and scale the FPGA performance based on measured performance. At the time of writing, one Australian dollar is worth 0.700639 U.S. dollars. The predicted performance for the XCV1000E running DES was 894Mkeys/sec; achieved performance was 500Mkeys/sec. The DES CPU performance also needs to be scaled up by approximately 2.5 to account for the low speeds achieved by the SolNET client compared with the distributed.net client. The predicted FPGA performance for RC5 matched quite closely with the achieved performance (predictions were 1.0625 times faster.) Scaling with these figures ignores many factors but will suffice for this analysis.
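A minimal sketch of this normalisation is given below. The constants are the ones quoted above; the function names are hypothetical, and applying a single scale factor uniformly across every device in a family is an assumption of the sketch, not something the measurements establish.

```python
AUD_TO_USD = 0.700639          # exchange rate quoted at the time of writing
DES_FPGA_SCALE = 500 / 894     # achieved / predicted rate on the XCV1000E
RC5_FPGA_OVERESTIMATE = 1.0625 # predictions were this many times too fast
DES_CPU_SCALE = 2.5            # SolNET -> distributed.net client speed-up


def cpu_price_usd(price_aud: float) -> float:
    """Convert an AUD CPU price to USD."""
    return price_aud * AUD_TO_USD


def fpga_des_rate(predicted_mkeys: float) -> float:
    """Scale a DES synthesis estimate down to the expected achieved rate."""
    return predicted_mkeys * DES_FPGA_SCALE


def fpga_rc5_rate(predicted_mkeys: float) -> float:
    """Scale an RC5 synthesis estimate down to the expected achieved rate."""
    return predicted_mkeys / RC5_FPGA_OVERESTIMATE


def cpu_des_rate(solnet_mkeys: float) -> float:
    """Scale a SolNET DES benchmark up to the distributed.net scale."""
    return solnet_mkeys * DES_CPU_SCALE


print(round(cpu_price_usd(100.0), 2))  # 70.06
print(round(fpga_des_rate(894.0)))     # 500
```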

CPU and FPGA family comparison for DES

The figure above shows the price/performance ratios for each CPU and FPGA family. The entire CPU range is compressed into the left-hand side of the graph; even at the high end, they do not come anywhere near the search rate of a low-end FPGA. It can be seen that searching DES on general purpose CPUs is very costly compared to searching with FPGAs.

CPU and FPGA family comparison for RC5

The figure above shows the same comparison for the RC5 cipher with the less competitive FPGA families removed. CPUs now perform better than FPGAs at the same price. They still cannot match the performance offered by high-end FPGAs.

These two comparisons show that the choice of implementation technology can greatly affect the time and cost to perform an exhaustive key search. In a practical key search machine, the technology must be selected to match the cipher being attacked.

In general

Over time, FPGAs will most likely become more efficient for key search than CPUs. This is because CPU performance does not scale linearly with available silicon area; it is limited by bus speeds, interactions between instructions and limited parallelism. FPGA performance for key search will scale linearly; if twice as much silicon area is available, twice as many search units can be implemented. FPGAs will thus become more important in future cryptanalysis. Already, improved CPU performance is becoming dependent on increasing parallelism; SIMD techniques and HyperThreading are examples of this.

CPUs are, of course, far easier to obtain than FPGAs. Many organisations already have a large computing infrastructure that could be used to perform key searches. distributed.net and other software RC5 efforts have demonstrated the feasibility of this approach. FPGA hardware is very rare in comparison, especially in the quantities that would be needed to conduct key searches.

CPUs need a large amount of support hardware (heatsinks, RAM, chipsets, multi-voltage power supplies and so on) which drives up the cost of a CPU-based key search machine. All of this hardware is very cheap and available. Storage space for it becomes more of a concern. FPGAs have a clear advantage here; many FPGAs can be mounted on a card that will fit within a computer case. It may be possible to design a motherboard for commodity CPUs that has some of these advantages.

Ultimately, neither CPUs nor FPGAs are very efficient for conducting key searches compared with ASICs. CPUs are inefficient due to their support hardware and program-based operation; FPGAs are inefficient because of their generic hardware structure. ASICs have custom hardware and a low unit price, but very high initial price.

Extrapolation for ASICs

In [48], Craig Ulmer reports that ASIC implementations can achieve three times the speed of an FPGA implementation, and ten times the density. This is useful as a general guide, but not in this analysis since the area required by FPGA dice is not easily obtained.

Instead, we can estimate ASIC costs based on the gate count required and infer the cost of a gate array device. During the Map phase of FPGA compilation, ISE reports the ‘equivalent gate count’ for an ASIC implementation of the design. This is based partly on the data contained within [49], and can be used to determine an approximate ASIC cost. The DES implementation on the XCV1000E uses 453,968 gates, and the RC5 implementation uses 1,353,397 gates. RC5’s large gate count is due to the amount of RAM used, including the additional RAM blocks used to reduce routing delays on the FPGA implementation. Both of these figures include interface and controller logic. It would also be possible to find a tradeoff between die size and final cost.

According to [50], a Virtex E design should be implementable with a CMOS-10HD gate array. This is designed around a 250nm process, which seems reasonable for a 1:1 speed and density conversion; the Virtex E family uses a 180nm process. No measures on the physical size of a CMOS-10HD die were available, but [51] claims 15k gates/mm² for a CMOS-9HD die. Assuming that the number of gates per mm² scales linearly with feature size, we get approximately 30k gates/mm². Assuming a 50% gate utilisation ratio gives us approximately 900kgates for DES, or 30mm².
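The die area estimate can be reproduced with a few lines. This sketch simply takes the 30k gates/mm² density and 50% utilisation ratio as given above; the function name is hypothetical.

```python
def die_area(design_gates: int, utilisation: float, gates_per_mm2: float):
    """Return (gates that must be provided, die area in mm^2)."""
    required_gates = design_gates / utilisation
    return required_gates, required_gates / gates_per_mm2


# DES design: 453,968 equivalent gates as reported by ISE.
des_gates, des_area = die_area(453_968, utilisation=0.5, gates_per_mm2=30_000)

print(round(des_gates))    # 907936, i.e. roughly 900 kgates
print(round(des_area, 1))  # 30.3
```

The same function could be applied to the RC5 gate count, although the large amount of RAM in that design makes a simple gates-per-area figure less reliable.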

MOSIS [52] provides small-quantity ASIC fabrication. They also provide an online price list [53]. We select the TSMC 250nm process (CL025) as one that should be suitable; other 250nm processes are approximately the same price. This gives a fabrication cost of $44,200 for 40 parts. $2,500 more will be required for packaging [54]. This is a high average price per part, but not a great deal more than the price of the XCV1000E. No pricing data was available for larger quantities.

Comparison of CPUs, FPGAs and ASICs for DES

Without further pricing data, it is difficult to perform an intelligent cost comparison involving ASICs. We can, however, use the price of packaging as a bare minimum cost per device to determine a price point at which ASICs become a viable option. The figure shows ASICs, CPUs and FPGAs compared at their best price points for the DES cipher. We can see that to assemble a machine equivalent in power to the EFF DES Cracker, FPGAs remain the most cost-effective choice. ASICs do not become price-effective until the machine performance reaches almost 400Gkeys/sec – over four times the performance of the EFF machine.

Spartan 3 FPGAs were not included in this comparison due to their uncertain pricing and performance. In addition, their $/Mkeys/sec ratio is below that of the ASIC design given, meaning that the two curves would never converge (as they would if all fabrication options had been considered.) It will be interesting to update this analysis when Spartan 3 pricing stabilises.

Comparison with other DES FPGA results

The figure below shows known FPGA DES key search machines and the performance that was predicted by Blaze et al. in [2]. Extrapolating their estimates with Moore’s Law gives an estimate of 1120Mkeys/sec for a $200 FPGA today. Performance estimates for the FPGAs priced around $200 are also shown.

Previous FPGA DES key search machines and performance estimates

The graph shows that no implementation has matched the performance predicted in 1996 for FPGA devices, regardless of the price of FPGA device used. The implementation presented in this thesis moves closer to predictions (as a percentage of expected performance) but still falls short. It also uses an FPGA that costs $938 today, well in excess of the $200 quote given. Other $200 FPGA devices are predicted to achieve similar performance.

The XC3S1000 is interesting; it has a very high capacity for its price. It still falls short of the estimate, but not by much. Its predicted price is also well under $200. It will be interesting to see if this price remains accurate in Q1 2004. No larger Spartan 3 devices are shipping yet, so a device with a price closer to $200 could not be selected.

Large-scale key search machines

CPUs

CPUs are more suited to RC5 key search than FPGAs. A large-scale machine to complete the RSA RC5-72 challenge in one year might be considered. This requires an average search rate of almost 150Tkeys/sec to conduct a complete sweep of the key space (half a year on average to find the key). The most cost-effective CPU is the Duron 1600, achieving 5.0Mkeys/sec at a price of $58. To reach the target search rate, almost 30 million CPUs will be needed, costing over $1.7 billion. This is before considering extras such as RAM, motherboards, power, heat removal and storage space.

At present, distributed.net is achieving a key rate of approximately 120Gkeys/sec [55]. At this rate, the RC5-72 challenge can be expected to be solved in about 624 years (half the time needed to sweep the entire key space).

A more feasible machine might attempt to match the performance of the EFF DES Cracker, which achieved 92.6Gkeys/sec using a large number of gate array ASICs. The most cost-effective CPU is again the Duron 1600, achieving 21.5Mkeys/sec (distributed.net scale) for $58. Over 4300 CPUs would be needed at a cost of approximately $250,000. This is not excessively expensive, but again ignores support hardware and other extras.
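The CPU figures in this section follow from simple arithmetic on the key space size, sketched below. A 365 day year is assumed, and the helper names are hypothetical; device rates and prices are the ones quoted in the text.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumed: 365 day year


def sweep_rate(key_bits: int, seconds: float) -> float:
    """Keys/sec needed to cover the whole key space in `seconds`."""
    return 2 ** key_bits / seconds


def devices_needed(target_rate: float, device_rate: float) -> int:
    """Whole number of devices needed to reach `target_rate`."""
    return math.ceil(target_rate / device_rate)


# RC5-72 swept in one year using Duron 1600 CPUs (5.0 Mkeys/sec, $58 each).
rc5_rate = sweep_rate(72, SECONDS_PER_YEAR)   # about 1.5e14, i.e. ~150 Tkeys/sec
rc5_cpus = devices_needed(rc5_rate, 5.0e6)    # just under 30 million
rc5_cost = rc5_cpus * 58                      # a little over $1.7 billion

# distributed.net at 120 Gkeys/sec: average time is half of a full sweep.
dnet_years = 2 ** 72 / 120e9 / SECONDS_PER_YEAR / 2

# Matching the EFF DES Cracker (92.6 Gkeys/sec) at 21.5 Mkeys/sec per CPU.
eff_cpus = devices_needed(92.6e9, 21.5e6)
eff_cost = eff_cpus * 58

print(rc5_cpus, rc5_cost)
print(round(dnet_years))   # 624
print(eff_cpus, eff_cost)  # 4307 249806
```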

FPGAs

FPGAs perform extremely well for DES key search. A machine matching the speed of the EFF DES Cracker could be constructed from XC2S200E devices ($25, 149Mkeys/sec). Spartan 3 devices were not considered due to their unstable pricing. 622 devices would be needed, at a total cost of $15,540. The EFF machine spent $130,000 on materials; it is not clear how much of this was spent on ASIC fabrication.

Alternatively, XC2VP20 devices could be used. They are slightly more expensive at a given search rate, but far fewer devices would be needed. This would reduce auxiliary costs significantly. To match the performance of the EFF DES Cracker, a mere 78 devices would be needed. Each device costs $299, giving a total component cost of $23,322. In contrast, the EFF machine used 1536 devices spanning many circuit boards and several physical cabinets.

At the top end of the FPGA spectrum, XC2VP100 devices could be used. These are the largest Xilinx devices that are currently shipping. Only 16 devices would be required. The total device cost would be $89,264, but the physical space consumed by the machine would be very small – less than that of a single board in the EFF DES Cracker.

FPGAs are generally more expensive than CPUs when performing RC5 key searches, and so will not be considered. Using the theoretical pipelined design may be profitable; a single (expensive) FPGA could search 100–200Mkeys/sec.

Accounting for hardware costs

The device cost of a large-scale key search machine is not the only factor affecting a machine’s cost. All technologies require circuit boards, controllers, assembly, testing, power, storage and cooling considerations to be addressed.

ASICs and FPGAs

ASIC and FPGA implementations can use the estimates provided by Wiener [15]. A circuit board that can support 120 small package ICs is reported to cost $300. The devices used by Wiener are 18mm square. FG676 packages (as used on the XC2VP20) are 27mm square, fitting approximately 35 devices per board. PQ208 packages (as used on the XC2S200E) are approximately the same size (28mm). The microcontrollers used by Wiener are not needed since controllers can be integrated into the FPGAs. Assuming that only one FG676 or PQ208 can fit into the space occupied by four of Wiener’s ASICs allows 35 devices per board.

From this we can see the value of high-density devices. One XC2VP20 has eight times the performance of an XC2S200E. A machine capable of 100Gkeys/sec would take just over four days on average to find a key. Taking into account circuit board costs, a machine using XC2S200E devices would span 671 devices, 20 boards and cost $22,641. A similar machine using XC2VP20 devices would span 84 devices, three boards and cost $26,016. Factoring in controllers, power supplies and mechanical concerns according to Wiener’s figures gives a total of $33,741 for the XC2S200E machine and $33,816 for the XC2VP20 machine. Power consumption, heat generation and storage space for the XC2VP20 machine would be significantly lower at only a very small increase in total price.
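The board counts and the XC2VP20 device-plus-board cost can be reproduced as below. The sketch takes the device counts quoted above as inputs and uses the $299 XC2VP20 price, the $300 board cost and 35 devices per board from the text. The controller, power supply and mechanical costs from Wiener’s figures are not modelled, and the helper names are hypothetical.

```python
import math

DEVICES_PER_BOARD = 35  # figure used in the text for FG676/PQ208 packages
BOARD_COST = 300        # USD per board, after Wiener


def board_count(devices: int) -> int:
    """Boards needed to carry `devices` FPGAs."""
    return math.ceil(devices / DEVICES_PER_BOARD)


def device_and_board_cost(devices: int, unit_price: int) -> int:
    """Cost of the FPGAs plus the boards that hold them."""
    return devices * unit_price + board_count(devices) * BOARD_COST


def average_search_days(key_bits: int, keys_per_sec: float) -> float:
    """Expected days to find a key, searching half the key space on average."""
    return 2 ** key_bits / keys_per_sec / 2 / 86_400


print(board_count(671), board_count(84), board_count(89))  # 20 3 3
print(device_and_board_cost(84, 299))                      # 26016
print(round(average_search_days(56, 100e9), 2))            # 4.17
```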

Using the preliminary pricing for XC3S1000 devices gives 89 devices, three boards and $13,663. Most of the cost in this machine is devoted to auxiliary hardware, meaning that higher speed machines would cost less in relation to the key rate achieved.

Cost of key search machines and their expected search time

Figure “Cost of key search machines” infers the cost to find a DES key in a given amount of time. Wiener’s estimates, the EFF DES Cracker and the estimates proposed by this work for XC2VP20 devices are shown. We can see that to achieve very short search times a lot of money must be spent, and vice-versa.

CPUs

Commodity computer hardware pricing is needed to infer the remainder of the hardware costs. Assuming that computers can boot from a network, each CPU will need a heatsink, motherboard, case, power supply, network adapter and a small amount of RAM. Using an all-in-one motherboard and low quality case reduces costs significantly. Each CPU will need approximately AUD$175 in support hardware.

Key lengths

It has been shown time and time again that using a long key is the best way to protect a cipher from an exhaustive key search attack. Assuming a cipher with a 90 bit key length (as recommended in [22]) that can be attacked at the same speed as DES, 132 billion XC2S200E devices would be needed to find a key within a year on average (that is, to search half the key space), at a price of $3.3 trillion. Even using the best available Spartan 3 device (XC3S400) will cost $850 billion and need over 35 billion devices. Of course, attacks become even more infeasible if the key length is increased only slightly.
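A quick check of the XC2S200E figures is sketched below. It assumes a 365 day year, a rate of 149Mkeys/sec and a price of $25 per device, and that the quoted device count corresponds to searching half of the key space in the year; that last assumption is the interpretation under which the arithmetic reproduces the figures.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumed: 365 day year


def devices_for_average_case(key_bits: int, device_rate: float, seconds: float) -> float:
    """Devices needed to search half of the key space within `seconds`."""
    keys_to_search = 2 ** key_bits / 2
    return keys_to_search / (device_rate * seconds)


devices = devices_for_average_case(90, device_rate=149e6, seconds=SECONDS_PER_YEAR)
cost = devices * 25  # USD, XC2S200E unit price

print(f"{devices:.3g} devices, ${cost:.3g}")  # about 1.32e+11 devices, $3.29e+12
```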

Capabilities

Well-resourced entities can feasibly attack ciphers that use long key lengths. Assuming the performance of the XC2VP20 machine is maintained regardless of key length, we can see the cost to obtain a key within a year in the table below. If an attacker is prepared to wait for a year, 56 bit ciphers like DES are trivial to defeat using current technology. Each additional bit added to the key doubles the cost to break it within one year.

Key length (bits)	Cost	Potential attacker
56	$305	Bored teenager
60	$4,900	Employed adult
64	$78,000	Business department
68	$1.25 million	Large business
72	$20 million	Small government
76	$320 million	Large government entity
80	$5.1 billion	Significant inter-government collaboration
92	$21 trillion	Infeasible?
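The costs in the table follow from doubling the 56 bit figure once for each additional key bit. A minimal sketch, taking the $305 base cost from the table and using a hypothetical function name:

```python
BASE_BITS = 56
BASE_COST = 305  # USD, cost to obtain a 56 bit key within a year (from the table)


def cost_for_key_length(bits: int) -> int:
    """Cost in USD to obtain a key within a year, doubling per extra key bit."""
    return BASE_COST * 2 ** (bits - BASE_BITS)


for bits in (56, 60, 64, 68, 72, 76, 80, 92):
    print(bits, f"${cost_for_key_length(bits):,}")
```

The printed values round to the table entries; for example, 60 bits gives $4,880 and 92 bits gives roughly $21 trillion.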

Minimum key lengths

The minimum key length required for a system depends on who the potential attackers will be. Assuming that we want messages encrypted today to be completely undecipherable to all known attackers, a 92 bit minimum key length seems to be appropriate.

Future data security must also be considered. Moore’s Law is currently the de facto method of predicting future computing capabilities. Over an 18 month period, Moore’s Law states that transistors per IC (or computing power) will double. If we apply this to a 20 year period, messages must be encrypted with 14 bits of additional key length to remain secure against all known potential attackers. 106 bits appears to be a suitable minimum key length to keep data secure for the next 20 years.

Data that needs to be kept secure over a longer period of time (such as census data) will need an even longer key. To keep data secure for the next hundred years, a 159 bit key seems appropriate.
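The key length margins above can be derived as follows, assuming an 18 month doubling period and rounding the number of doublings up to a whole bit. The function names are hypothetical.

```python
import math

DOUBLING_PERIOD_YEARS = 1.5  # Moore's Law, as stated in the text
BASE_KEY_BITS = 92           # minimum key length considered secure today


def extra_bits(years: float) -> int:
    """Additional key bits needed to offset `years` of Moore's Law growth."""
    return math.ceil(years / DOUBLING_PERIOD_YEARS)


def minimum_key_length(years: float) -> int:
    """Minimum key length to keep data secure for `years` years."""
    return BASE_KEY_BITS + extra_bits(years)


print(extra_bits(20), minimum_key_length(20))    # 14 106
print(extra_bits(100), minimum_key_length(100))  # 67 159
```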

All of these estimates ignore future computing technologies that may become available or new cryptanalytic techniques which may decrease the strength of a given cipher. Predicting suitable key lengths can be likened to telling the future. Given the low additional processing cost, using a key length of 192 or 256 bits should protect data from all known potential attackers in the foreseeable future.

Alternatives

When attacking a well-designed system that uses cryptography, it is rarely profitable to attack the ciphers themselves. Normally there are weaker areas of the system, such as software bugs, inadequate security policies, weak passwords and the people involved in the system. It will almost certainly be easier to exploit one of these areas (particularly people) to complete an attack against a system.

Regulatory issues

This research is particularly valuable in the context of legal restrictions on the use and export of strong cryptography. With it we can evaluate different ciphers in the face of key length restrictions, as well as the availability of the technology required to perform an exhaustive key search attack.

Cryptography controls

Unless otherwise referenced, information in this section was assimilated from [56], [57] and [58]. They should be consulted for more details.

Local regulations on cryptography have changed significantly in recent years. Australia is a party to the Wassenaar Arrangement, which restricts the export of “dual-use goods”, including encryption. It is vague in parts (particularly as to what constitutes an “export”), but sufficiently restrictive to raise concerns. Australia’s regulations are more restrictive in that any export requires approval from the Defence Signals Directorate (DSD), which deals with Australia’s signals intelligence and information security [59]. The Defence and Strategic Goods List [60] describes goods which may be subject to export controls, including encryption. Many different products are covered by the legislation, including nuclear, biological, optical, semiconductor and other technology goods. It is frequently updated.

Obtaining export approval usually involves submitting an export application [61] with the DSD. To date no applications have been rejected, although some companies have been informed that their applications will be rejected without having applied. An early assessment of cryptographic goods can be performed to determine if export approval will be required for that good [62].

There are no restrictions on the use of cryptography within Australia or the importation of cryptography into Australia.

The United States has comparatively tight controls on cryptographic exports. Currently, symmetric cryptography using up to 56 bit keys can be exported once it has undergone a one-time review. Export of any cryptography is not permitted to seven “terrorist countries” (also known as “Tier 4”) [63]. As can be seen in the results from this thesis, 56 bit symmetric cryptography is not very resistant to brute force attacks. Someone operating under these constraints would be advised to select a cipher that is relatively slow and expensive to attack, such as RC5.

The legal export status of this thesis and its accompanying CD can be questioned. The thesis itself is probably safe to export without approval, since it does not contain any cryptographic algorithms. The CD can almost certainly not be exported without approval, since it contains cryptographic source code.

Computing controls

Regulatory issues exist for exports of high performance computers to certain countries. This was most visible when exports of the Playstation II gaming console to China were denied [64] for fears that they may enhance China’s military capability. The regulations have recently been updated to increase the allowed performance of exported devices [65]. Computer exports are controlled for “Tier 3” countries, which generally includes any countries that are not allied with the United States. Exemptions can be obtained to bypass these controls.

It is not clear whether FPGA devices are subject to export controls, but it would almost certainly be easy to force designs to fall under various performance classifications.

References

[36] Xilinx, Inc. SRL16 16-bit shift register look-up-table (LUT). [Online]. Available: http://toolbox.xilinx.com/docsan/xilinx5/data/docs/lib/lib0393_377.html

[39] P. Alfke and B. New, “Multiplexers and barrel shifters in XC3000/XC3100,” Xilinx, Inc., Tech. Rep. [Online]. Available: http://direct.xilinx.com/bvdocs/appnotes/xapp026.pdf

[44] A. Biryukov, A. Shamir, and D. Wagner, “Real time cryptanalysis of A5/1 on a PC,” Lecture Notes in Computer Science, vol. 1978, pp. 1+, 2001.

[45] (2003, October) Avnet electronics marketing. [Online]. Available: http://em.avnet.com/

[46] E. Peltzer, October 2003, private communication.

[47] S. Satria, October 2003, private communication.

[48] C. D. Ulmer, “Configurable Computing: Practical Use of Field Programmable Gate Arrays,” Ph.D. dissertation, School of Electrical and Computer Engineering, Georgia Institute of Technology, January 1999. [Online]. Available: http://users.ece.gatech.edu/~grimace/research/reports/qual_report.pdf

[49] Gate count capacity metrics for FPGAs, Feb 1997. [Online]. Available: http://www.xilinx.com/bvdocs/appnotes/xapp059.pdf

[50] NEC Electronics America, Inc. FPGA to ASIC Conversion. [Online]. Available: http://www.necelam.com/asic/conversion.cfm

[51] NEC Electronics. NEC: Gate array information. [Online]. Available: http://www.necgatearray.com/content.nsf/webpages/gatearrayinfo

[52] The MOSIS Service. MOSIS Integrated Circuit Fabrication Service. [Online]. Available: http://www.mosis.org/

[53] The MOSIS Service. Domestic Price List for MOSIS IC Prototyping Service. [Online]. Available: http://www.mosis.org/Orders/Prices/price-list-domestic.html#tsmc25_logi%c

[54] The MOSIS Service. MOSIS Domestic Price List for ASAT Plastic Packages. [Online]. Available: http://www.mosis.org/products/assembly/plastic/price_domestic_asat.html

[55] distributed.net. RC5-72 Live Stats. [Online]. Available: http://www1.distributed.net/~pstadt/rc5-72/

[56] G. Pure and G. Taylor. The Australian cryptography FAQ. [Online]. Available: http://www.efa.org.au/Issues/Crypto/cryptfaq.html

[57] B.-J. Koops. Crypto law survey. [Online]. Available: http://rechten.uvt.nl/koops/cryptolaw/cls2.htm

[58] Electronic Frontiers Australia Inc. Crypto politics. [Online]. Available: http://www.efa.org.au/Issues/Crypto/crypto2.html

[59] Defence Signals Directorate. Defence Signals Directorate. [Online]. Available: http://www.dsd.gov.au/

[60] Department of Defence. Defence and strategic goods list. [Online]. Available: http://www.defence.gov.au/dmo/id/export/DSGL_2003.pdf

[61] Department of Defence. Export application. [Online]. Available: http://www.defence.gov.au/dmo/id/export/dsec/AC717_Oct_03.pdf

[62] Department of Defence. Application for one time review. [Online]. Available: http://www.defence.gov.au/dmo/id/export/dsec/Onetime.pdf

[63] U. S. Bureau of Industry and Security. HPC - CTP Chart. [Online]. Available: http://www.bxa.doc.gov/HPCs/ctpchart.htm

[64] D. E. Sanger. Letting the Chips Fall Where They May. [Online]. Available: http://www.nytimes.com/library/review/061399china-chips-review.html

[65] Deputy Press Secretary. President changes export controls on computers. [Online]. Available: http://www.whitehouse.gov/news/releases/2002/01/20020102-3.html

Conclusion

ian@mutexlabs.com (Ian Howson) — Mon, 20 Oct 2003 00:00:00 +0000

From the analyses presented in Chapter 4, we can see that:

Key length, available resources and cipher design are the three main factors that influence the time taken to conduct an exhaustive key search attack.
Different implementation technologies favour different ciphers. DES key searches are best performed with FPGAs, while RC5 key searches are best performed with CPUs.
The resource usage of a pipelined FPGA cipher implementation is dependent on the frequency of register access, state size, number of rounds and complexity of the round function.
Frequency of register access is one of the biggest factors affecting resource usage for pipelined FPGA cipher implementations.
If sufficient FPGA resources are available, pipelined cipher implementations will perform far better than iterative cipher implementations.
Based on preliminary pricing, Spartan 3 FPGAs (particularly the XC3S400) have the best price/performance ratio.
Based on stable pricing, Spartan IIE FPGAs (particularly the XC2S200E) have the best price/performance ratio, followed closely by the XC2VP20 (which has a much higher density).
Duron and low-end Athlon XP CPUs provide the best price/performance ratio.
The performance estimates in [2] are most likely optimistic. This view is shared by Goldberg and Wagner [16].
The cost of conducting a ciphertext-only attack with FPGAs depends on the cipher. The additional resource cost is quite small for DES, but significant for RC5. Ciphertext-only attacks favour large, fast search units over small, slow search units.
A machine similar to the EFF DES cracker [10] could be built from FPGAs for approximately $34,000, a fraction of the price of the original machine.
CPUs are more cost-effective for RC5 key searches than FPGAs, although it remains to be seen whether this holds for a pipelined RC5 implementation.
Cryptography that is restricted to a 56 bit key length by export controls provides little protection against a well-funded or patient adversary.
FPGAs will play a greater part in cryptanalysis in the future.
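
The first conclusion above can be made concrete with a rough calculation. The following is a minimal sketch in Python, not the method used in the thesis: it assumes that all search units run at the same rate and that, on average, half of the keyspace must be tested before the key is found. The function name and the rate and unit figures are hypothetical and chosen only for illustration.

```python
def expected_search_seconds(key_bits: int,
                            keys_per_second_per_unit: float,
                            units: int) -> float:
    """Expected time for an exhaustive key search.

    Assumes every unit tests keys at the same rate and that the key
    is found, on average, after half of the keyspace has been tested.
    """
    keyspace = 2 ** key_bits
    total_rate = keys_per_second_per_unit * units
    return (keyspace / 2) / total_rate


# Hypothetical figures for illustration only; they are not measurements
# taken from the thesis.
ASSUMED_RATE = 1e8   # keys per second per search unit
ASSUMED_UNITS = 100  # number of search units

for bits in (40, 56, 64):
    seconds = expected_search_seconds(bits, ASSUMED_RATE, ASSUMED_UNITS)
    print(f"{bits}-bit key: {seconds / 86400:.3g} days expected")
```

Each extra key bit doubles the expected time, while doubling the number of search units halves it. Cipher design enters through the per-unit rate, which differs for each cipher and implementation technology.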

From these conclusions, we can see that in the right situations FPGAs are very useful cryptanalytic tools. Their low price and high performance allow key search attacks to be conducted at very low cost. If physical space is a concern, they can achieve much higher search rates per device than CPUs, even for ciphers that are designed for CPUs.

The EFF DES cracker can be reproduced now using FPGAs at a cost of about $34,000. At a price this low, DES should not be used for anything remotely secure. Government concessions to allow the export of 56 bit cryptography completely destroy the purpose of using cryptography.

FPGAs will play an increasing role in future cryptanalysis as the gap between CPU and FPGA performance for a given price widens.

Future work

Possible extensions to this work include:

Update the price/performance analyses as time progresses. This would allow the security of ciphers to be continually tracked and give a general idea of the rate of improvement in FPGA and CPU technology.
Analyse more ciphers and determine their resistance to exhaustive key search using various technologies.
Examine DSPs and CPLDs as possible low-priced technology alternatives.
Improve the DES benchmark software. The programs used for benchmarks were not designed with modern CPUs in mind, and may be able to achieve very high performance by taking advantage of available features. In particular, SIMD architectures such as Altivec and SSE2 may prove useful.
Implement RC5 as a long pipeline. Estimates show that this may result in very high search rates. High capacity FPGA devices would be needed to attempt this.
Examine different FPGA families. No Altera devices were considered for this thesis. Actel produces a gate array family called the Axcelerator [66] which is one-time programmable and is reported to have very low routing overheads and a low price. Spartan 3 devices should also be re-examined once better pricing data becomes available.
Improve the accuracy of the price/performance estimates for ASIC devices. Different fabrication processes may provide better price/performance ratios.
Extend the FPGA resource estimation techniques to include timing data. With careful analysis, it should be possible to approximate overall performance given a cipher algorithm.
Consider heat generation and power usage for FPGAs. One of the problems encountered with the EFF machine was its high power and cooling requirement. FPGA devices are reported to be inefficient in this regard, which may prove a stumbling block for large-scale key search machines.
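
The first extension in the list above, updating the price/performance analyses over time, amounts to recomputing a simple ratio whenever prices or benchmark results change. The sketch below assumes that price/performance is measured as keys searched per second per dollar, which may not match the exact metric used in the thesis. The `Device` class, the helper functions and the device figures are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    keys_per_second: float  # measured or estimated search rate
    unit_price: float       # price per device in dollars

    @property
    def keys_per_second_per_dollar(self) -> float:
        """Assumed price/performance metric: search rate per dollar."""
        return self.keys_per_second / self.unit_price


def rank_by_price_performance(devices):
    """Return the devices sorted from best to worst price/performance."""
    return sorted(devices,
                  key=lambda d: d.keys_per_second_per_dollar,
                  reverse=True)


def aggregate_rate(device: Device, budget: float) -> float:
    """Total search rate from buying as many whole devices as the budget allows."""
    count = int(budget // device.unit_price)
    return count * device.keys_per_second


# Placeholder figures for illustration only; not data from the thesis.
candidates = [
    Device("device A", keys_per_second=1e6, unit_price=10.0),
    Device("device B", keys_per_second=5e6, unit_price=100.0),
]

for d in rank_by_price_performance(candidates):
    print(f"{d.name}: {d.keys_per_second_per_dollar:.3g} keys/s per dollar")
```

Rerunning the ranking with current prices shows how the best-value device shifts over time. The ratio ignores supporting hardware, power and cooling, so it is only a first approximation of total cost.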

References

[10] Electronic Frontier Foundation, Cracking DES. O’Reilly, 1998.

[16] I. Goldberg and D. Wagner, “Architectural considerations for cryptanalytic hardware,” CS252 Report, 1996. [Online]. Available: http://www.cs.berkeley.edu/~iang/isaac/hardware/paper.ps

[66] Actel Corporation. Actel: Products & Services: Antifuse Devices: Axcelerator. [Online]. Available: http://www.actel.com/products/axcelerator/index.html

\(s\)	State size in bits
\(n\)	Number of rounds needed to complete a phase
\(r\)	Number of LUTs required to perform a round
\(m\)	Number of bits of state modified during a round
\(c\)	Number of LCs required to perform a phase