Twitter and Gnip: taking control of the tweets?

So, Twitter has bought Gnip.

This is the kind of story that won't mean much to many, because Gnip aren't a business with a lot of customers. What Gnip does is collect Twitter's data (ie. the 'firehose' of tweets) and sell it on.

Twitter do make some data available themselves. Obviously, as a user, you can do a simple search through the website to find tweets mentioning particular keywords, or historical tweets from a particular account. But there is only so much data you can pull that way, and it's a bit of work if you want to count how many people have tweeted about something over a period of time, etc.

Alternatively, if you're comfortable with a bit of coding, you can start digging around the search API – more powerful, and easier to automate – but still limited (to something like 1,500 tweets per search, and a capped number of requests per hour). That's useful if you want to analyse and compare a handful of Twitter accounts, for example – but very limiting in terms of how many you can look at.
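To make that concrete, here's a rough sketch of what pulling search results looks like in code – an illustrative example rather than anything official. It assumes the v1.1 search/tweets endpoint, the requests and requests_oauthlib libraries, and placeholder credentials you'd replace after registering an app. The 15-request budget at 100 tweets per request is roughly where that ~1,500-tweet ceiling comes from:

```python
import requests
from requests_oauthlib import OAuth1

# Placeholder credentials -- you'd get real ones by registering an app.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def search_tweets(query, max_requests=15):
    """Page backwards through recent results, up to 100 tweets per
    request, stopping when the index runs dry (roughly the last week
    of tweets) or we hit our self-imposed request budget."""
    tweets, max_id = [], None
    for _ in range(max_requests):
        params = {"q": query, "count": 100}
        if max_id is not None:
            params["max_id"] = max_id
        resp = requests.get(SEARCH_URL, auth=auth, params=params)
        statuses = resp.json().get("statuses", [])
        if not statuses:
            break
        tweets.extend(statuses)
        max_id = statuses[-1]["id"] - 1  # continue past the oldest tweet seen
    return tweets

tweets = search_tweets("#gnip")
print(len(tweets), "tweets retrieved")
```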

Then there is the streaming API, which lets you do live tracking of tweets, and gives you a bit more volume – capturing a sample of up to something like 50 tweets per second that mention particular keywords (for example), plus a count of how many tweets in total are using those words. I had some fun using this to track the volumes of tweets around major events a couple of years ago – if you want to track the volume of tweets, minute-by-minute (or even second-by-second), then this is useful. But where it falls down is if you want to do any real analysis of what is being said, and who it's being said by. (In other words, using this I could tell you that there were 327,452 tweets in a single minute when Obama was re-elected, but I could only accurately tell you what less than 0.1% of them actually said, or who tweeted them.)
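A sketch of that minute-by-minute volume tracking might look like the following – again an assumption-laden example, not gospel. It uses the v1.1 statuses/filter endpoint with the same placeholder credentials, and leans on the stream's 'limit' notices – which carry a running count of matching tweets withheld from your sample – to count total volume beyond what you actually receive:

```python
import json
import time

import requests
from requests_oauthlib import OAuth1

# Placeholder credentials again.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

STREAM_URL = "https://stream.twitter.com/1.1/statuses/filter.json"

received = {}   # minute -> tweets actually delivered to us
withheld = 0    # running total of matching tweets dropped from the sample

resp = requests.post(STREAM_URL, auth=auth, stream=True,
                     data={"track": "obama"})
for line in resp.iter_lines():
    if not line:
        continue  # blank keep-alive lines
    msg = json.loads(line)
    minute = time.strftime("%H:%M")
    if "limit" in msg:
        # A 'limit' notice reports the cumulative number of matching
        # tweets withheld since the connection opened -- this is how
        # you can still count total volume beyond the sample.
        withheld = msg["limit"]["track"]
    elif "text" in msg:
        received[minute] = received.get(minute, 0) + 1
        print(minute, received[minute], "delivered;", withheld, "withheld so far")
```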

So, for that kind of thing, you need access to more data – the 'firehose' of every single tweet. And not many companies have direct access to this.

How 5 became 2.

At the end of 2013, there were 5 companies with direct access to the full firehose of every single tweet: Gnip, DataSift, PeopleBrowsr, Topsy and NTT Data.

PeopleBrowsr were the first to go. I'm not quite sure why, but they ended up in a protracted court case with Twitter, and lost direct access at the end of last year. (According to the link, they were paying Twitter $1m a year for access – no clue what the cost for the same data through a 3rd party would be.) So now they have to buy their Twitter data from a different 3rd party. PeopleBrowsr were a little unusual, in that they weren't selling the data so much as processing it to build a product – they spotted the value early on, and then built their product (Kred) around it. It looks like they had a special relationship because they started collecting data early on, but Twitter wanted to narrow the list of people they were providing the full firehose to. (The catch with having access to the full firehose is that you don't necessarily get to provide the full firehose – ie. I can access any tweet through a 'firehose' partner, but that isn't quite the same as accessing every tweet.)

Topsy also have direct access to Twitter – or at least they used to. They were acquired by Apple at the end of last year for a rumoured ~$200 million, which led to the expectation that they would shut down while Apple did whatever it was they were planning to do with them. I don't know what has been happening since. Apple (obviously) haven't said anything (maybe something to do with Siri?), but Topsy themselves haven't been tweeting for a while – and it seems very unlikely that Apple's plan is to run a B2B service selling Twitter data to marketers. The point here is that Apple seems to have bought Topsy's data processing capabilities, rather than the data they were processing – the ability to deal with huge amounts of data (both collecting and analysing it) is clearly of significant value, even outside of the Twitter ecosystem. Maybe they still have access to the data, maybe they don't – but it's unlikely that they will be providing it to anyone else (at least for long.)

Apparently there is a Japanese company called NTT Data who also has access, but I don't really know anything about them, other than that they specialise in the Japanese language/market. (I presume that analysing Japanese tweets is a very different challenge from analysing tweets in other languages.)

Which leaves DataSift and Gnip. Both of their businesses are built around processing and selling data from Twitter (along with some other sources.) But Gnip is now a part of Twitter itself – raising questions about DataSift's future, if they are going to be in competition with Twitter themselves. To which they responded:

Our core technology means that the majority of our revenue comes from Data Processing and not licensing.
The platform brings in hundreds of data sources and deals with them in a single manner[, which] has meant we have dominated the market.

So – at the end of 2013, outside of Twitter there were 5 companies with access to all of the Twitter data. Four months later, there are two — one serving the western market, and one serving the Japanese market.

What are Twitter doing?

The fact that these changes have all happened within a few months of each other would certainly seem to imply a broader plan by Twitter. So, the first question this raises for me is what Twitter plan to do next.

Is it just a case of making sure that they have some control over the resellers market and the data they have access to? One theory is that Apple's purchase of Topsy worried Twitter – assuming that Apple's interest was in the data processing technology/techniques/experts rather than the data itself, what if other companies took an interest in Gnip and DataSift? Could the whole Twitter monitoring industry be closed down, if one or two data processing firms wanted to acquire their technology? What would the implications be for Twitter if that were to happen?

Or alternatively, what would happen if someone took an interest in the data, rather than the processing? For example, what if it had been Facebook rather than Apple? Or if Facebook had decided to buy a data company like Gnip and/or DataSift, instead of Oculus/WhatsApp etc.?

Historically, Facebook have been keen to keep what gets talked about on Facebook out of sight of the rest of the world (due to privacy/PR concerns), but a partnership with SecondSync (a company tracking tweets around TV programmes in the UK, who I have mentioned before) led to a study and report about TV-related conversations that are happening on Facebook. It showed that there was huge volume (as I think we had all expected — but been unable to quantify), that it was driving brand interactions (ie. Likes and comments), that the majority of activity was during the TV programme itself (which I think was less expected), and that it was predominantly (80%) coming from mobile devices.

All interesting stuff – and potentially challenging the idea that Twitter rules the "second screen" experience.

Shortly afterwards, Twitter bought SecondSync. You won't find that Facebook study on their website or blog any more (or at least, I couldn't.) And that had followed their acquisition of Bluefin Labs, a company doing similar[1] analysis of TV-related tweets, at the beginning of last year, and Trendrr in the middle of the year – not long after Trendrr pointed out that 5x as many conversations around TV happen on Facebook as on Twitter.

Conspiracy theories aside, it's pretty clear that Twitter are very interested in the analysis side of all the data they are holding. So why haven't they been doing it themselves already?

The early story – at a time when the 'fail whale' was still a regular sight during busy times – was that it was better all around for companies wanting to access huge volumes of Twitter data to get it through a 3rd party, rather than Twitter having to deal with an additional layer of high volume network traffic. Farming out that job to 3rd parties made sense, while Twitter was struggling to deal with providing a reliable service to its users (and develop a sustainable business model.)

Now, 'social listening' is a big business. I don't know how much money is being made by the whole industry, but there are a lot of businesses who I'm pretty sure would be struggling without it (or would at least suddenly find themselves with unprofitable products). For example:

  • Radian6
  • Sysomos
  • NetBase
  • Crimson Hexagon
  • Simply Measured
  • Nielsen
  • Lithium
  • Mass Relevance
  • Brandwatch
  • Synthesio
  • Alterian SM2

…and probably plenty more. In other words, these businesses are building products and profits on top of the data being provided by Twitter and sold on by Gnip and Datasift.

Why wouldn't Twitter want a part of that?

Last year, data licensing "and other" accounted for 11% of Twitter's revenue — $70 million. The rest came from advertising. I don't know how much is being spent with the list of companies and products above, but I would bet that there is more than $70m flying around — especially once you start stripping out the costs of the middlemen, the duplicated data processing & storage, the sales teams, etc.

It seems hard to imagine that Twitter will want to keep running Gnip as it currently operates – selling data from other social networks alongside Twitter data. (Perhaps it's even harder to imagine the likes of Facebook selling that data to Twitter.) So in that sense, it's probably good news for DataSift.

On the other hand, the fact that everyone was buying the data from someone other than Twitter made it easy to explain away the data that you couldn't access. For example, Twitter knows how many people are seeing tweets — but we are still basing 'reach' figures on 'how many people followed other people' (ie. maximum potential reach – sketched below.) With the t.co URL wrapper launched in 2010, they know exactly how many links are being clicked — but that isn't being reported. (Leading to the frustrating experience of link-shorteners-around-link-shorteners, as people try to count how many people clicked on their shared links to someone else's shared links…) And what's the time dimension — if 10,000 people actually saw a particular tweet, how many of them saw it within a minute? An hour? A day?

Or, how about seeing what people who follow particular accounts are tweeting about? Or, out of everyone who tweeted about a particular topic, who are they most likely to be following?

When you're buying data from a 3rd party reseller, you can't negotiate for data they don't have. But Twitter have all of that data. So, I would guess that there are plans to make more of it available.
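To make the reach point concrete, here's a minimal sketch (plain Python, with invented follower numbers) of how 'reach' tends to be estimated from the data that is available – summing the follower counts of the unique accounts that tweeted. It produces a ceiling on possible impressions, not a count of actual ones; that is precisely the gap only Twitter can close:

```python
# Estimating 'reach' the way the tools have to today: sum follower
# counts for every unique account that tweeted. This is maximum
# POTENTIAL reach -- nobody outside Twitter knows actual impressions.
def potential_reach(tweets):
    followers_by_account = {}
    for tweet in tweets:
        user = tweet["user"]  # standard v1.1 tweet payload fields
        followers_by_account[user["screen_name"]] = user["followers_count"]
    return sum(followers_by_account.values())

# Invented example: two accounts tweet about a topic (one of them twice).
tweets = [
    {"user": {"screen_name": "alice", "followers_count": 12000}},
    {"user": {"screen_name": "bob",   "followers_count": 300}},
    {"user": {"screen_name": "alice", "followers_count": 12000}},
]
print(potential_reach(tweets))  # 12300 -- however many people actually looked
```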

Is this Tweetie/Tweetdeck all over again?

In its early days, there was a small industry of developers building and selling Twitter clients for smartphones. In 2010, Twitter bought one of them – Tweetie – and turned it into the "official" Twitter client. In 2011, Twitter warned 3rd party client developers that they shouldn't "build client apps that mimic or reproduce the mainstream Twitter consumer client experience". Soon afterwards, they bought TweetDeck, and the following year started tightening the rules to limit 3rd party clients' growth.

What had grown into a rich set of experiments and innovations – many of them leading to better ways of using Twitter – was effectively squashed, resulting in a "like it or lump it" approach, which in turn paved the way for Twitter's mobile advertising products (which would probably have been impossible if 3rd party clients were able to add their own adverts, or sell ad-blocking features.) There are still a few Tweetbots and Twitterrifics and other 3rd party clients around, but they certainly seem to be a lot thinner on the ground than they used to be.

In a way, this feels kind of similar — Twitter taking control of what used to be an independent ecosystem, thinning out the herd etc. But to me, it feels like a move to improve what is being offered, rather than to standardise it for the sake of a different part of the product.


For now, for most people involved in the social media/insight/data world, I don't think things are going to change quickly. The dashboard tools we are using might have to rethink their data supply chains, but I wouldn't expect to see any significant change. (The real differentiation in social listening tools, in my experience, lies at the user end – the UI, the natural language processing technology, the pricing models etc. – as opposed to what data they have access to. It pretty much all revolves around Twitter for the most part anyway.)

But what I'm hoping is that this is the start of Twitter opening up the streams, and gearing their APIs and data towards people who want to be analysing it — which is mainly the same people that account for the other 89% of Twitter's revenue: the advertisers.

  [1] Broadly similar, anyway.