“The Rules of PERF” at Dropbox

Source: Dropbox

A more detailed explanation:

  1. Avoid needless processing. This breaks down two ways:
    1. Feature design: Think hard before adding features that come with significant performance impacts — do you really need this feature? Is there a simpler way to do it which achieves most of your goals? Can you do it a simple way 90% of the time and only fall back to something more complex if needed? Can you skip several intermediate steps to get to the end result faster? (e.g. avoiding sorting a list)
    2. Optimize execution by taking advantage of short-circuit evaluation and lazy fetching/evaluation. For conditionals: if you sometimes need to do an expensive check, but usually don’t, see if there’s a way to skip that check. Laziness: if you often won’t need it, don’t fetch extra data from the filesystem until it’s requested.
      • Practical example: I optimized a routine (in Python) at work last month. We were processing text files a line at a time and removing control characters. To remove control characters we used a regex on each line (not the most efficient approach, fairly expensive). I added a quick check that iterated through the line of text, checked whether any of the characters were within the control-character range, and just returned the original string if not. Not as efficient as rolling a non-regex implementation, but since control characters are rare it avoids ~90% of the performance cost and was much simpler and safer to implement.
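The fast-path check described above might look something like this (a sketch, not the actual Dropbox code; the names and the exact control-character range are assumptions):

```python
import re

# Control characters: the C0 range (0x00-0x1F) plus DEL (0x7F).
_CONTROL_RE = re.compile(r"[\x00-\x1f\x7f]")

def strip_control_chars(line: str) -> str:
    """Remove control characters, skipping the regex when none are present."""
    # Cheap scan first: most lines contain no control characters,
    # so we usually return the original string untouched.
    if not any(c < " " or c == "\x7f" for c in line):
        return line
    return _CONTROL_RE.sub("", line)
```

The cheap `any()` scan costs almost nothing on clean lines, which are the overwhelming majority, so the regex only runs on the rare dirty line.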
  2. Cache results of expensive operations to avoid repeating them unnecessarily. If you’re fetching info from the filesystem, cache it in memory if you are likely to reuse it (works well with lazy evaluation).
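Combining caching with lazy evaluation can be as simple as memoizing the fetch function. A minimal sketch (the function and its counter are invented for illustration; `fetch_metadata` stands in for a real filesystem or network call):

```python
import functools

calls = {"n": 0}

@functools.lru_cache(maxsize=None)
def fetch_metadata(path: str) -> str:
    """Stand-in for an expensive filesystem or network fetch."""
    calls["n"] += 1  # count how many real fetches actually happen
    return f"metadata for {path}"

fetch_metadata("/tmp/a")
fetch_metadata("/tmp/a")  # served from the in-memory cache; no second fetch
```

Because the decorated function is only called on demand, nothing is fetched until first use — and once fetched, repeats are free.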
  3. Batch it: if you’re applying the same operation to many items, try gathering the items up and processing them in large groups. This is often more efficient because it makes better use of caches (CPU/disk) and lets you write much tighter processing loops. It permits reusing buffers, connections, SQL prepared statements, etc. It can improve branch prediction, permit use of SIMD instructions, etc. where they would not work otherwise.
    • Batching also makes it easier to fall back to something like the multiprocessing library to parallelize work.
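A common instance of batching is database writes: one prepared statement reused across a whole batch instead of a round-trip per row. A sketch with the standard-library `sqlite3` module (table and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, name TEXT)")

rows = [(i, f"event-{i}") for i in range(1000)]

# One prepared statement, reused for the entire batch, instead of
# issuing a separate INSERT per row.
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The same shape — collect, then hand off the whole list — is also what makes it easy to push the batch to a `multiprocessing` pool instead.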
  4. Use software pipelining. This is kin to batching: rewrite loops that run items through a series of steps so you first apply the same step to every item, then move on to the next step. These loops can sometimes be evaluated much more efficiently by compilers/interpreters because the structure allows using SIMD instructions, avoids branch-prediction misses, etc.
    • May also mean using Unix/Linux pipelining as well: use a bunch of smaller utilities that pipe input from one to another. This is another application of the same principle, but has the extra advantage of being generally very efficient, and spreading work across multiple processors.
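The loop restructuring described in point 4 can be sketched in plain Python (the steps here are trivial stand-ins for real per-item work):

```python
def process_per_item(items):
    # One item at a time through every step: the loop body alternates
    # between steps, which is harder on caches and branch predictors.
    out = []
    for x in items:
        y = x * 2        # step 1
        z = y + 1        # step 2
        out.append(z)
    return out

def process_staged(items):
    # Same work, restructured: each step runs over the whole batch
    # in its own tight loop before the next step begins.
    doubled = [x * 2 for x in items]   # step 1 over everything
    return [y + 1 for y in doubled]    # step 2 over everything
```

In pure CPython the win is modest, but the staged form is exactly what lets vectorizing compilers (or libraries like NumPy) turn each stage into one fast bulk operation.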
  5. Use a lower-level language than Python to optimize the most performance-sensitive parts of the code, i.e. fall back to C bindings for intensive number crunching, crypto, etc. Optimized C can be several times faster than Python (or sometimes much more!). In general Pareto’s principle applies: 80% of your execution time comes from 20% of the code, so doubling the speed of that hot 20% of the code cuts your total runtime by about 40%.
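You can see the flavor of this without writing any C yourself, because CPython’s builtins are already implemented in C. A small sketch comparing a pure-Python loop against the C-implemented `sum()` (same result; the builtin is typically several times faster because the loop runs in C rather than in the bytecode interpreter):

```python
def py_sum(nums):
    # Pure-Python reduction: every iteration goes through the
    # bytecode interpreter.
    total = 0
    for n in nums:
        total += n
    return total

data = list(range(100_000))
assert py_sum(data) == sum(data)  # same answer, very different speed
```

The same principle scales up to hand-written C extensions, Cython, or ctypes bindings for the genuinely hot 20% of the code.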

To Summarize

If you have a performance issue, you should try the following fixes in order (i.e. try one, and if that doesn’t solve it, go to the next possible fix):

  1. Just don’t do whatever you’re trying to do. In other words, ask yourself if it’s really necessary/useful/something you might lose a client over.
  2. Cache the results of previous calls. Maybe you can reuse them as-is, or partially.
  3. Do a large number of calls in a batch, maybe in advance or later, outside peak operating hours. Or perhaps you need to set up a network connection to do what you’re doing; if that setup takes a while, don’t make a new connection for each request: bundle up a dozen requests and set up once.
  4. Don’t add your stuff to an existing program, create a new and separate one that will take the output of a previous process. Decoupling, in a way.
  5. If none of that works, only then should you look for a totally different way of doing it.

Further Explanation

There are multiple facets to each of these, which makes them deeper than they seem, and a couple more points are worth spelling out:

  1. Not doing work may include several ways of avoiding extra computation — lazily running expensive operations if not always needed, adding conditional checks before complex work rather than throwing exceptions, using short-circuit evaluation, or using more efficient algorithms / cutting out intermediate steps if you can get a result without them.
  2. Yep.
  3. Yep, but it doesn’t necessarily have to wait hours — batches of work can be handed off to other processes or utilities to process (makes better use of cores), and often you can write tighter loops that make better use of caches and reuse resources (connections, buffers, etc.).
  4. That’s Unix pipelining, and it’s good shit, but software pipelining is a more general version of the technique. Depending on your architecture one or the other may be more efficient — goes well with batching above though.
  5. No, this is a reference to falling back on C bindings invoked from Python, and writing the really tricky bits highly optimized in a lower-level language. C can be several times as fast as Python (or more, with good use of SIMD instructions) if written efficiently.
    • They didn’t do this often at Dropbox, because Python is faster to write and easier to maintain, but when they did this they got huge speedups.
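The guard-before-expensive-work and short-circuit ideas from point 1 can be sketched together (all names here are invented; `expensive_validate` stands in for a genuinely costly check):

```python
def expensive_validate(payload: str) -> bool:
    # Stand-in for a costly check (e.g. a full parse or a regex pass).
    return payload.isprintable()

def is_probably_valid(record: dict) -> bool:
    # Cheap guards run first; Python's `and` short-circuits, so the
    # expensive validation only executes when the cheap checks pass.
    return (
        "payload" in record
        and len(record["payload"]) < 1_000_000
        and expensive_validate(record["payload"])
    )
```

Most records fail or pass on the cheap checks alone, so the expensive path is rarely taken — the same shape as the control-character example earlier.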

Software pipelining

In computer science, software pipelining is a technique used to optimize loops, in a manner that parallels hardware pipelining. Software pipelining is a type of out-of-order execution, except that the reordering is done by a compiler (or in the case of hand written assembly code, by the programmer) instead of the processor. Some computer architectures have explicit support for software pipelining, notably Intel’s IA-64 architecture.

It is important to distinguish software pipelining, which is a target code technique for overlapping loop iterations, from modulo scheduling, the currently most effective known compiler technique for generating software pipelined loops.

Source:  https://www.reddit.com/r/Python/comments/eip48b/the_rules_of_perf_at_dropbox/