Top “Non-Browser” User Agents

A simple question came to mind the other day: “What are the top HTTP libraries or applications that are used to access web content?”

What I mean by that is: what are the top user agents that are not browsers (IE, Firefox, Chrome, etc.) that web developers use to access content via HTTP? A few come to mind: libwww for Perl, libcurl for PHP, and CFNetwork for iOS.

Is this information useful to anyone? Usually these libraries are used by applications to fetch content that is handed off to another layer for presentation. Developers do not need to worry about which “non-browser” user agent to tailor their output for, the way they do for IE vs. Chrome, for example.

I decided to take a small sample of traffic to see if there were any interesting patterns to be found.


I took a sampling of approximately one million hits from each of four geographical locations (Los Angeles, New York, Amsterdam, and Sydney), for a total of 3,147,379 samples.

All hits with user agents starting with “Mozilla” were filtered out (using the regular expression ^User-Agent: Mozilla.*) at capture time. I later realized this didn’t filter out requests from the Opera browser, which I removed in post-processing.
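The two-stage filter can be sketched in a few lines of Python. This is a minimal illustration, not the actual capture pipeline, and the log lines are hypothetical stand-ins for the real captured headers:

```python
import re

# Hypothetical captured header lines.
hits = [
    "User-Agent: Mozilla/5.0 (Windows NT 6.1) Chrome/35.0.1916.114",
    "User-Agent: Opera/9.80 (Windows NT 6.1) Presto/2.12.388",
    "User-Agent: CFNetwork/672.1.14 Darwin/14.0.0",
    "User-Agent: Dalvik/1.6.0 (Linux; U; Android 4.4.2)",
]

# Capture-time filter: drop anything identifying itself as Mozilla.
non_mozilla = [h for h in hits if not re.match(r"^User-Agent: Mozilla", h)]

# Post-processing: pre-Blink Opera announced itself as "Opera/...",
# so it slipped past the Mozilla filter and is removed separately.
non_browser = [h for h in non_mozilla if not re.match(r"^User-Agent: Opera", h)]

print(non_browser)  # only the CFNetwork and Dalvik hits remain
```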


The user agent field is a free text field. This means that any value can be used in this field by the application or web developer. I’m assuming that most libraries don’t pretend to be something they’re not by spoofing the header.

Data was captured at the same time across all geographic locations, regardless of local time. For browsers, there is a direct correlation between time of day (or day of week) and user agent usage; I’m assuming that is not the case for “non-browser” user agents.

There may be a bias as the data was sampled from Fastly customers. 


It’s no surprise that the top two user agents are CFNetwork and Dalvik. These are the agents used by iOS and Android applications respectively. Does this count as a “non-browser” user agent? I would argue yes. However, there is no way to determine whether the requests came from the app itself or from an in-app browser.

BTWebClient comes in third place. This user agent is used by BitTorrent clients to download torrent files from websites as well as to update trackers over HTTP.

libwww-perl comes in fourth place, followed by Java. 

Facebook External Hit is the next most popular user agent. Whenever a user shares a link on Facebook, the site fetches a copy of that link to display on the user’s wall.


Rank  User Agent
1     CFNetwork
2     Dalvik
3     BTWebClient
4     libwww-perl
5     Java
6     facebookexternalhit
7     Apache-HttpClient
8     HttpComponents
9     Parsoid
10    Python-urllib

I’m surprised that cURL didn’t make the top ten.


There was a significant location bias. For example, the Sydney data contained many requests from the “FoxTel Guide” user agent, which was not found in any of the US sample data. This makes sense, as users in the US have no use for a service that isn’t available in their country.

Although this information is interesting, I can’t see any useful need to generate recurring reports or analysis over time. It’s certainly not as interesting as the ongoing browser wars.

The CDN Manifesto

I wasn’t able to attend this years Velocity conference. So I’m catching up now by watching videos that are available online.

A lot of people ask me: “Why did you want to work at Fastly?” It’s an innocent but complex question. My answer usually varies based on the audience. The explanation I give non-technical people is usually: “Because I want to make the internet faster”.

However, Fastly is doing much more than just making websites faster. They are much more than a CDN, they are an extension of your application or website. No longer is the CDN a black box that you just place between your origin and your audience.

A friend asked me the other day: “Give me the 10 second elevator pitch on why Fastly is better than any other CDN”. I thought for a split second and answered with:

“There are three main differentiating factors that set Fastly apart from the rest of the CDN field:

  • Real Time Log Delivery
  • Instant purging / invalidation
  • Full programmatic API interface

Other CDNs don’t have all three capabilities”.

These three items are very powerful for web developers. They allow you to fully control your content and gain visibility into what your users are doing.

An excellent talk by my co-worker, Hooman Beheshti, touches on these very points. His entertaining and informative talk at Velocity is a manifesto of what every CDN should be moving forward.

Don’t let the fact that this is a sponsored talk turn you off. It’s not a sales pitch at all.

Getting Lost in 302s

Web properties that have been around for a while probably have a lot of old links, dead ends, and redirects. There is a fear amongst content owners that users are not going to be able to find their site if a URL changes.

“What about everyone’s bookmarks?!” cries the content owner. The bookmark is something from the 90s web (web 1.0 if you will). Nobody uses them anymore.

This was a challenge I was up against at my previous job. It wasn’t until I illustrated the complexity and unscalability of keeping every URL around forever that change happened.

The mobile landscape changes quickly. This shaped the URL structure and was the main cause of the many redirects on the CBC’s mobile website. Over the course of a week, after digging through Apache configurations, Akamai config files, and “meta refresh” HTML files, the following flow chart was born.


Thankfully there were no redirect loops! However, there were some pretty serious issues: in the worst case, you would be redirected three times before reaching the final URL. Not ideal, especially if you are on a mobile device!

A lot of these have since been removed. However, it wasn’t until this diagram was presented to the web developers and management that everyone realized the gravity of the situation.

It’s true what they say: a picture is worth a thousand words. In this case, a Visio diagram improved web performance!

IPv6 and Web Performance

After reading the first 60 or so pages of Ilya’s excellent book, my mind started racing. How does IPv6 (v6) affect packet size, round trip times, and overall web performance versus IPv4 (v4)?

I decided to set up a simple test from my Windows desktop and my personal webserver hosted by Linode. Both have native IPv6 connectivity. No tunneling.

The setup:

  • Client: Windows 7 & Chrome 35.0.1916.114
  • Server: CentOS 6.5 Linux (Kernel: 2.13.7) & Apache 2.2.15

I wanted to keep things as simple as possible to better understand the low-level effects of IPv6 on a typical HTTP transaction. As such, I tried to keep the conditions the same for both v6 and v4 requests: the test hostnames were the same number of characters, and the fetched object was exactly the same. In order to ensure that only v6 or only v4 packets were being sent, I disabled support for each one in Windows before running the test.

You can follow along with the packet streams on CloudShark if you like.


My v6 packets take a different route than my v4 packets. This results in lower latency for v4 traffic than for v6 (82.1ms vs. 87.4ms respectively, averaged over twenty packets).

Windows doesn’t have mtr, so I used my Macbook Pro on the same network instead. [Edit: Thanks to @jpaulellis for pointing me to:]

v6 routing:

Blakes-mbp:~ bcrosby$ sudo /usr/local/sbin/mtr -n -c 20 -r
HOST: Blakes-mbp                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 2601:9:8480:1196:76d0:2bf  0.0%    20    1.1   1.2   1.0   1.9   0.2
  2.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
  3.|-- 2001:558:82:213b::1        0.0%    20    9.7  18.6   8.9 184.2  39.0
  4.|-- 2001:558:80:14a::1         0.0%    20   12.7  12.2  10.2  14.3   1.1
  5.|-- 2001:558:80:cf::2          0.0%    20   15.7  12.9  10.4  20.6   2.5
  6.|-- 2001:558:0:f6cb::1         0.0%    20   14.4  14.4  11.8  23.3   2.3
  7.|-- 2001:558:0:f5e8::2         0.0%    20   18.1  17.1  15.4  18.2   0.8
  8.|-- 2001:559::502              0.0%    20   20.3  17.0  14.7  21.8   1.9
  9.|-- 2001:590::4516:8f75        0.0%    20   14.2  15.7  13.9  40.4   5.8
 10.|-- 2001:590::4516:8fa6        0.0%    20   14.7  15.1  14.4  16.3   0.4
 11.|-- 2001:590::4516:8e00        0.0%    20   33.0  45.6  31.5 185.6  40.2
 12.|-- 2001:590::4516:8e3b        0.0%    20   34.6  36.4  31.0  66.4   9.6
 13.|-- 2001:590::4516:8e65        0.0%    20   64.5  66.8  64.3  99.9   7.8
 14.|-- 2001:590::4516:8e4b        0.0%    20   85.5  85.0  82.0 109.0   5.9
 15.|-- 2001:590::451f:22b2        0.0%    20   84.5  87.1  82.7  95.8   4.1
 16.|-- 2001:518:1001:1::2         0.0%    20   93.5  88.0  82.9  94.5   4.5
 17.|-- 2001:518:2800:3::2         0.0%    20   84.2  84.2  82.8  86.5   1.0
 18.|-- 2600:3c03::f03c:91ff:fe6e  5.0%    20   83.0  87.4  82.8 121.1   9.7

v4 routing:

Blakes-mbp:~ bcrosby$ sudo /usr/local/sbin/mtr -n -c 20 -r
HOST: Blakes-mbp                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|--                0.0%    20    0.7   1.3   0.7   2.5   0.6
  2.|--                0.0%    20    9.2   9.3   8.3  10.5   0.8
  3.|--              0.0%    20   10.8  10.0   8.7  14.2   1.5
  4.|--               0.0%    20   12.2  13.5  10.3  38.0   6.0
  5.|--               0.0%    20   11.3  12.7   9.7  20.7   2.6
  6.|--               0.0%    20   14.2  14.1  11.0  25.9   3.1
  7.|--              55.0%    20   38.6  57.6  36.8 203.4  54.8
  8.|--               0.0%    20   39.9  43.9  38.2 102.2  13.8
  9.|--               0.0%    20   64.9  64.2  61.9  69.4   1.8
 10.|--               0.0%    20   90.9  83.2  79.9  90.9   2.4
 11.|--              0.0%    20   80.6  81.8  80.5  88.4   1.9
 12.|--             0.0%    20   84.7  85.9  81.7  93.6   4.2
 13.|--               0.0%    20   82.4  82.8  81.0  91.6   2.3
 14.|--             5.0%    20   83.0  82.1  80.9  84.0   0.9

DNS Requests

The main difference between looking up a v4 address and a v6 address is the record type: v4 addresses use an “A” record, while v6 addresses use an “AAAA” record. v6 addresses are also much larger at 16 bytes (versus 4), so the response will always be larger than the equivalent v4 response.
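That size difference falls directly out of the address encodings themselves. A quick sanity check with Python’s ipaddress module (the two addresses below are just illustrative examples):

```python
import ipaddress

v4 = ipaddress.ip_address("93.184.216.34")                       # an IPv4 address
v6 = ipaddress.ip_address("2606:2800:220:1:248:1893:25c8:1946")  # an IPv6 address

# A records carry the 4-byte IPv4 address; AAAA records carry 16 bytes.
print(len(v4.packed), len(v6.packed))  # 4 16
```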

In this particular test, both the v4 and v6 responses fit into a single packet, so the number of round trips is the same. The client resolves DNS via my router using UDP, so there is no TCP handshake overhead.


Version  # of Packets  Total Size  RTT
v4       2             176 bytes   0.092ms
v6       2             228 bytes   0.089ms



The v6 exchange is ~30% larger than the v4 exchange. However, the round trip time is effectively the same (under my test conditions).

TCP Handshake

All v6 packets will be larger due to the increased size of the IP header: v6 headers are a fixed 40 bytes, while v4 headers are only 20 bytes.
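That fixed 20-byte difference per packet can be checked against the handshake numbers measured below. A back-of-the-envelope calculation (the header sizes come from the IPv4 and IPv6 specifications; the 186- and 246-byte totals are the measurements from this test):

```python
IPV4_HEADER = 20  # bytes: minimum IPv4 header, no options
IPV6_HEADER = 40  # bytes: fixed IPv6 header size

extra_per_packet = IPV6_HEADER - IPV4_HEADER  # 20 bytes

# The three-packet TCP handshake measured 186 bytes over v4 and 246
# bytes over v6; the 60-byte difference is entirely IP header overhead.
print(extra_per_packet * 3)  # 60
```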

One thing I did notice was that the MSS was different in the initial SYN packet from the client to the server: the v4 MSS was set to 1260 bytes, while the v6 MSS was set to 1460 bytes.

The v4 SYN/ACK response from the server reset the MSS to 1460 and the same v6 SYN/ACK response reset it to 1420.


Version  # of Packets  Total Size  RTT
v4       3             186 bytes   0.081ms
v6       3             246 bytes   0.085ms




There is no difference between v4 and v6 when fetching the object itself: the HTTP header and body sizes are exactly the same.


IPv6 is a new version of the Internet Protocol. It doesn’t change the way TCP or HTTP behaves, and your packets (although a little larger with v6) are routed the same way. This means that it’s latency, not available bandwidth, that affects v6 performance, just like with v4.

Overall you will be pushing more bits over the wire; however, the number of round trips needed to make a simple HTTP request with v6 is the same as with v4.


Version  # of Packets  Total Size  RTT
v4       12            2236 bytes  0.337ms
v6       15            2848 bytes  0.342ms



What’s this? Three extra packets in the v6 conversation? Yes. For some reason the server decided to change the MSS from 1460 to 1420 before returning the HTTP response. After the response was sent, another MSS change from 1420 back to 1460 was made. I had no idea why, but this accounts for the three extra packets. [Edit: My coworker pointed out that this additional TCP handshake was Chrome being proactive and setting up a new TCP session. The browser does this expecting to download more data when you click on a link or perform another action.]

v6 will eventually be the de facto IP version used on the internet. The good news is that all of the advancements in web performance such as front end optimization and HTTP/2 won’t have to change when v6 becomes ubiquitous.

v6 should result in better web performance overall. Mainly due to the fact that:

  • v6 routers don’t perform packet fragmentation.
  • Routers don’t need to perform checksums on v6 packets (like they do on v4)
  • Routers aren’t required to compute packet time in queues

The above points may be moot with today’s fast routers and specialized CPUs. However, when it comes to web performance, every little bit counts.

Who’s using a CDN?

The HTTP Archive is a great resource for keeping track of trends in the way web sites are built. It’s shown the steady decline of Flash on websites over the years, for example. 

I decided to use the dataset to track which are the most popular CDNs. Below are my findings using the May 15, 2014 run.


HTTP Archive has recorded a total of thirty different CDNs. The top five used are: Cloudflare, Google, Akamai, ChinaNet, and Edgecast. Keep in mind that only 10% of all sites tracked by HTTP Archive are using a CDN at all.

The first question that came to mind was: “Google is a CDN?” The answer: yes. These would be sites hosted by Google Sites or Google’s own properties (like YouTube).

Both Cloudflare and Google are free, so it’s no surprise that they are the two most popular CDNs.

One thing to note about the data: the HTTP Archive only tests where the front-page HTML is hosted, so it’s not a definitive way of knowing whether a particular site uses a CDN. For example, the HTML could be hosted at origin while all of the images are hosted by a CDN.

The HTTP Archive also keeps track of the Alexa rank of each site it tests, so we can use that to determine which CDN powers the most popular pages.


Google takes the cake for hosting the top 100 most popular sites (a little hard to see in the graph above). Akamai takes a commanding lead in hosting the remaining top 5,500 sites, followed by Cloudflare.

A breakdown of the most popular site hosted by each CDN:


[Table: Alexa rank of the most popular site hosted by each CDN, including Level 3; table data not preserved]


Keep in mind these results are for the hosting of the front-page HTML file only. A lot of sites take a multi-CDN approach, spreading requests over more than one CDN.

So what about sites that decided to not use a CDN? You might be surprised at some of the results:

[Table: Alexa rank of popular sites that are not served by a CDN; table data not preserved]

Some of these sites use CDNs to host site assets (like Facebook and Twitter).

What’s the quickest and easiest way to see whether a particular site is hosted by a CDN? Look at how the hostname resolves. A dig on the site’s hostname returns an answer section whose CNAME records point at the CDN’s domain before the final A records; in this case, the CNAME chain shows the site uses Akamai, as shown above.
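That suffix check is easy to script. Here is a minimal sketch in Python; the suffix-to-CDN mapping covers a few well-known CDN CNAME target domains and is illustrative rather than exhaustive:

```python
# A few well-known CDN CNAME target domains (illustrative, not exhaustive).
CDN_SUFFIXES = {
    "edgekey.net": "Akamai",
    "akamaiedge.net": "Akamai",
    "cdn.cloudflare.net": "Cloudflare",
    "edgecastcdn.net": "Edgecast",
}

def cdn_from_cname(cname):
    """Return the CDN name if the CNAME target matches a known suffix."""
    cname = cname.rstrip(".").lower()
    for suffix, cdn in CDN_SUFFIXES.items():
        if cname.endswith(suffix):
            return cdn
    return None

print(cdn_from_cname("e1234.b.akamaiedge.net."))  # Akamai
print(cdn_from_cname("www.example.com"))          # None
```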

This is just the tip of the iceberg. I encourage you to take a look at the data yourself. You can access it for free using Google BigQuery.

O’Reilly Velocity Wrapup

Barbara and I had the pleasure of speaking at the O’Reilly Velocity conference in Santa Clara, California last week.

This was one of the best conferences that I’ve attended. It was great to see so many smart people sharing ideas in one place.

Office Hours

Answering questions in the one-on-one “Office Hours” session

One of the best features of this conference was what the organizers called “Office Hours”. It gave attendees the opportunity to talk to the speakers one-on-one about anything and to pick our brains about ideas they may have. Barbara and I also took advantage of this time to get to know the attendees better.

Blake Crosby and Barbara Bermes


A copy of the slides is available on SlideShare, or you can view the slides below.

I’m planning on proposing another talk for next year’s event. However, I think I’ll target the East Coast this time, in New York.

FITC Web Performance and Optimization

A colleague and I presented a 50-minute talk at my first weekend event yesterday.

I counted approximately 70 people in attendance before we got talking. The talk was well received, and we even spent another 20 minutes or so chatting with individuals who wanted more information.

The talk was about how CBC takes web performance seriously and the tools we use to improve our websites’ performance on both the front and back ends. Slides are available in PDF format.

Barbara and I will be giving this talk again, this time at the O’Reilly Conference in June.

What Your CDN Won’t Tell You

Julian and I had the honour of having our paper accepted by USENIX. In fact, Julian is at this year’s LISA conference presenting it (I’m unable to attend due to a scheduling conflict).

Our years together working at CBC have taught us a lot about running a news website, specifically around dealing with our CDN (Akamai).

How do you manage that fine line between having fresh content appear on the site quickly (what News wants) and protecting the origin from the load of a breaking news event (what the SysAdmins want)?

This paper answers that question and gives you a glimpse at how we do things at CBC.

You can read the paper here.