Thursday, September 21, 2017

The Python application TwitterCroatia


During the winter semester I developed a Python application called TwitterCroatia.

Since the Twitter API only allows 150 requests per hour, and Twitter generally does not give users access to tweets older than about a week (sometimes less), the purpose of the code was to build a database of tweets for later analysis. Using python-twitter, a Python wrapper for the Twitter API, I wrote a set of modules. The most important ones are described here:
  • The first and easiest to explain is the twitterDB module. It stores data in files and retrieves it using pickle. Its main functions are addTweet, addUser, getTweets and getUsers. It can also fetch one or more tweets from a specified user, or check whether a file already contains a given user id. These functions are mostly used to avoid storing duplicates.
  • The twitterCommunication module talks to the Twitter API through the python-twitter wrapper. Its functions mostly mirror those of the wrapper itself, with added handling for the most common errors (such as network failures and exceeded rate limits). Its biggest advantage is automated cursor handling. When a large enough amount of data is requested, the Twitter API splits it into pieces and sends only the first piece, together with a cursor that refers to the next piece. The functions in this module expect this and call the API recursively until all pieces have been delivered. Note that a single call to one of these functions may therefore result in more than one request to the API, and each request counts against the rate limit.
  • The third module, getAllFollowersAsObjects, combines the two modules above: it downloads the user objects of all followers of a specified user with twitterCommunication and stores them in a specified file with twitterDB.
  • The last module uses all the other modules and is the end purpose of my code. It is currently called twitterCroatia, although that name is too specific and potentially misleading, so it may change. It is the only module with an open-ended while loop, whose job is the continuous collection of data. It starts by requesting a single user object, then requests that user's timeline and followers, and proceeds to do the same for each of the followers, if there are any, ad infinitum. The loop could in principle terminate, for example in a closed circle of friends who follow each other and nobody else, but this is unlikely to happen.
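The cursor handling described for twitterCommunication can be illustrated with a minimal sketch. It is not the module's actual code: fetch_all, make_fake_api and the fake pages are hypothetical names invented for this example, and the sketch uses a loop where the module is described as recursive. It does assume Twitter's cursor convention, in which -1 requests the first page and a next cursor of 0 means there are no more pages.

```python
from typing import Callable, List, Tuple

# Hypothetical page fetcher: takes a cursor and returns (next_cursor, items).
FetchPage = Callable[[int], Tuple[int, List[int]]]


def fetch_all(fetch_page: FetchPage, cursor: int = -1) -> List[int]:
    """Follow cursors until the API reports that no pages remain."""
    items: List[int] = []
    while cursor != 0:  # 0 means "no more pages"
        cursor, page = fetch_page(cursor)  # one API request per page
        items.extend(page)
    return items


def make_fake_api(pages: List[List[int]]):
    """Stand-in for the real API: serves `pages` in order and counts requests."""
    calls = {"count": 0}

    def fetch_page(cursor: int) -> Tuple[int, List[int]]:
        calls["count"] += 1
        index = 0 if cursor == -1 else cursor  # -1 asks for the first page
        next_cursor = index + 1 if index + 1 < len(pages) else 0
        return next_cursor, pages[index]

    return fetch_page, calls


fetch_page, calls = make_fake_api([[1, 2], [3, 4], [5]])
followers = fetch_all(fetch_page)
print(followers, calls["count"])
```

One call to fetch_all here costs three requests, which is exactly why a single function call can use up more of the hourly rate limit than expected.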
TwitterCroatia can be modified to start with any user, but the default is VladaRH, the official Twitter account of the Croatian Government. Starting there by no means guarantees that the users whose data and tweets are fetched will really be Croatian, and some almost certainly will not be. Filtering them requires additional information, for example the Location field of the user profile. Since most users do not share their location publicly, natural language processing might be required.
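The collection loop, starting from a configurable seed user, might look roughly like the sketch below. The crawl function, the get_followers callback and the toy graph are all hypothetical, and the breadth-first order, the set of already seen users and the optional max_users cap are assumptions of this example rather than details of the real module.

```python
from collections import deque


def crawl(seed, get_followers, max_users=None):
    """Walk the follower graph breadth-first, visiting each user once."""
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue and (max_users is None or len(visited) < max_users):
        user = queue.popleft()
        visited.append(user)  # the real code would fetch and store the timeline here
        for follower in get_followers(user):
            if follower not in seen:  # skip users that are already queued or visited
                seen.add(follower)
                queue.append(follower)
    return visited


# Toy follower graph: a closed circle of three accounts (made-up data).
graph = {"VladaRH": ["a", "b"], "a": ["b", "VladaRH"], "b": ["a"]}
order = crawl("VladaRH", lambda user: graph.get(user, []))
print(order)
```

With the toy graph the loop ends after three users, which is the closed-circle case mentioned above. On the real follower graph the queue would keep growing, so the loop is effectively endless unless a cap is set.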

Also, with the csvToData module I have added support for the CSV format, in which Twitter recently began offering users a download of their entire tweet history (available in Croatia as well, as of 2013).
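As a rough illustration of what such an import involves, here is a minimal sketch using Python's csv module. It is not the csvToData code: read_tweets is a hypothetical name, the column names tweet_id, timestamp and text are assumptions about the archive layout, and the sample rows are made up.

```python
import csv
import io


def read_tweets(csv_file):
    """Yield one dict per CSV row, keeping only the fields used here."""
    for row in csv.DictReader(csv_file):
        yield {
            "id": int(row["tweet_id"]),  # assumed column name
            "timestamp": row["timestamp"],  # assumed column name
            "text": row["text"],  # assumed column name
        }


# Made-up sample standing in for a downloaded archive file.
sample = io.StringIO(
    "tweet_id,timestamp,text\n"
    '1,2013-01-01 10:00:00 +0000,"first, with a comma"\n'
    "2,2013-01-02 11:30:00 +0000,second\n"
)
tweets = list(read_tweets(sample))
print(tweets)
```

Using csv.DictReader rather than splitting lines by hand matters because tweet text can contain commas and quotes.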
The next steps are:
  • moving from pickle-based serialization to an SQLite database.
  • making use of tools that might prove useful in future development, such as the Python graph-tool library and the Natural Language Toolkit (NLTK).
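To give an idea of what the move to SQLite could look like, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and the open_db, add_tweet and get_tweets functions are assumptions made for this example, not the planned schema. The primary key on the tweet id takes over the duplicate check that twitterDB currently does by hand.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the tweets table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, text TEXT NOT NULL)"
    )
    return conn


def add_tweet(conn, tweet_id, user_id, text):
    """Insert a tweet; return True if it was new, False if already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO tweets (id, user_id, text) VALUES (?, ?, ?)",
        (tweet_id, user_id, text),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed means the id already existed


def get_tweets(conn, user_id):
    """Return (id, text) pairs for one user, oldest id first."""
    rows = conn.execute(
        "SELECT id, text FROM tweets WHERE user_id = ? ORDER BY id", (user_id,)
    )
    return rows.fetchall()


conn = open_db()
first = add_tweet(conn, 1, 42, "hello")
second = add_tweet(conn, 1, 42, "hello")  # duplicate, silently ignored
tweets = get_tweets(conn, 42)
print(first, second, tweets)
```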

