The Sync function was down, including account creation, login, and the sync itself. The fix was overdue by several months, and it came too late to prevent the outage.

Background: The Alkitab app has a backend service on GCP. The project name is alkitab-host-hrd, and its web address is https://alkitab-host-hrd.appspot.com, aliased to https://www.alkitab.app and https://www.bibleforandroid.com/.

The server has run on Google App Engine since 2012. In 2012, App Engine had only one kind of environment, which is now called the Standard Environment. It is very simple to deploy applications there: just upload Python 2.7 files and all library dependencies, and one or more server instances will be spawned and auto-scaled according to load. It has out-of-the-box services (“App Engine APIs”) such as Datastore, Task Queues, Mail, Cron, and a Text Search Engine that can be accessed just by calling library functions, without any configuration.

In 2014 I built the Sync function in the backend. It lets users keep the same bookmarks, highlights, notes, reading plan progress, pins, and history across multiple devices. It also lets users change devices without losing data. However, the Sync code requires a lot of computational power and memory, so the fixed limits of Standard App Engine (128 MB RAM, 30-second HTTP timeout) were soon reached.

So it was inevitable that we needed to move to Flexible App Engine. Flexible is like renting a machine in Google’s datacenter, sized by the CPU power and RAM needed. The lowest option they provide has almost 4 GB of RAM, much more powerful than Standard. This seemed like a great choice, despite the high price. However, there are some drawbacks:

  1. Deployment and configuration are much more complicated. I need to manage threads myself, updating code takes 10 to 15 minutes, and the logs are very hard to read, since they no longer show any severity level and are no longer separated by request. When there is an error on the server, I need to wait at least 15 minutes for each patch to be deployed and come up, repeating this as many times as needed.
  2. Flexible App Engine previously still provided the out-of-the-box services mentioned above, by declaring the “runtime” to be “python-compat” and enabling the App Engine APIs. Google announced that this ability would be removed around 2018, but they kept extending the deadline until the final notice:

“We’re writing to let you know that App Engine Flexible Compat mode (beta feature) will be shut down on January 31, 2020.”

They extended it once more, to Feb 28. Then March came, and around March 6 suddenly the App Engine Flexible instance I had was turned off completely.

Panic ensued. Dozens of users sent emails saying that they couldn’t sync or couldn’t create a new account. I needed to fix this problem immediately.

So these were the actions I took:

  1. I tried to redeploy the app without enabling App Engine APIs, to see what would happen. The App Engine deployment server correctly said that I couldn’t use python-compat any more, and that enabling App Engine APIs was no longer allowed.
  2. The web server needs to run as an independent program, no longer managed by the App Engine runtime. So I figured out how that works. It needs gunicorn as the web server (I’m not sure I got this right), an entrypoint that knows how to route paths to their corresponding handlers, and so on.
  3. I figured out what needed to be reimplemented. Luckily, Google did not deprecate App Engine APIs for Standard App Engine, only for Flexible App Engine. (I didn’t mention earlier that not all endpoints run on Flexible. These functions run on Standard: Sharing verses, Bible translation list and download, Song book and songs, Announcements, Feedback.)
  4. There are 3 services that are used by Sync: Datastore, Task Queues, and Mail.
    • Mail is for sending password recovery emails. I just reroute the request to the Standard instance, which sends the email.
    • Datastore is for storing and retrieving user data. Google had NDB for Standard App Engine, and (super super lucky for me) they reimplemented it for Flexible App Engine and for other servers. (It does not use memcache any more.) I needed to vendor “webapp2” and patch it myself, because webapp2 2.5.2 (the latest stable version) still uses the old NDB. Phew. None of the @ndb.transactional decorators worked, so I had to replace them with ndb.transaction(function) calls instead. I could not find any other solution for that.
    • Task Queues, which were previously super magically easy to use thanks to their “deferred” wrapper, no longer have such a thing, so scheduling tasks has to be implemented manually. This is used to send push notifications to the other devices when user data is updated. It is also used to unpack cached data (I compressed some of the user data). Google has a service called Cloud Tasks that I used to implement this. However, every time I added a task that executes a request to the App Engine instance, it failed with a 500 error “Instance Unavailable”. I changed it so that it executes an HTTP request instead, albeit one pointing to the same App Engine service.
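
To give an idea of what “implemented manually” can look like for the Task Queues part, here is a minimal sketch, not the actual server code. It assumes each task is a plain function registered by name, and that the call is serialized as JSON into the body of the HTTP request that the queue later sends to a handler. The names task, encode_task, run_task, and notify_other_devices are all hypothetical, and the step that hands the body to Cloud Tasks is left out. Note that the old “deferred” wrapper pickled the function itself; using a name registry with JSON is a substitution made for this sketch.

```python
import json

# Hypothetical registry mapping task names to plain functions.
_TASKS = {}


def task(fn):
    """Register fn so that a queued payload can refer to it by name."""
    _TASKS[fn.__name__] = fn
    return fn


def encode_task(name, *args, **kwargs):
    """Serialize a call into bytes suitable for an HTTP request body."""
    if name not in _TASKS:
        raise KeyError("unknown task: %s" % name)
    payload = {"task": name, "args": list(args), "kwargs": kwargs}
    return json.dumps(payload).encode("utf-8")


def run_task(body):
    """What the task handler endpoint would do with the received body."""
    payload = json.loads(body.decode("utf-8"))
    fn = _TASKS[payload["task"]]
    return fn(*payload["args"], **payload["kwargs"])


@task
def notify_other_devices(user_id, source_device):
    # Placeholder: a real task would send the push notifications here.
    return "notify user %s except device %s" % (user_id, source_device)


# Enqueue side: build the body that would be attached to the queued request.
body = encode_task("notify_other_devices", 42, source_device="phone-a")

# Handler side: decode the body and run the registered function.
result = run_task(body)
```

Only JSON-serializable arguments survive the round trip, so in a setup like this the tasks would take IDs and strings rather than entity objects.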

So after 3 days of very uninteresting and opaque work, the Sync function is now up again. I am sorry that you had to experience the downtime. I expect some bugs to appear, though. But at least I am thankful that this is done, and I think God gave me the strength to do this by not giving up easily and by having some people cheer me on, like Tika and Fernando. Thank you.
