A bug that made me a better developer.

Bugs are evil.
No developer likes working on bugs because it's boring and not rewarding.

Not all bugs are equal. While majority of them don't involve special skills, some will require true engineering mindset and creativity.

Here is a little story about such a special bug I recently experienced myself.

I built a minimalistic anonymous photo sharing app on my spare time -- https://www.wisaw.com
The app is dead simple -- take a photo with a mobile device, no registration required, the photo gets uploaded to the cloud automagically, and everyone is able to see it right away. The unique aspect of this app -- it's crowd moderated. Anyone can delete any photo they don't like at any time.

The first MVP was built in 7 days. The stack I used for the MVP:

  • iOS and Android native clients
  • REST API on AWS Beanstack with autoscaling (KOA2 backed)
  • RDS PostgreSQL data storage

Took me another couple of years to optimize performance, and streamline the UX.

  • Moved KOA2 to AWS Lambda
  • Migrated native clients to single codebase react-native solution
  • Migrated HOC Class Components to Functional Hooks, along the way making tons of performance optimizations
  • And eventually started to use Expo managed workflow

My little app was nice, and simple, and beautiful. Keeping things simple always takes some extra effort.

Everything was going great. A little slow at times, which is typical for a pet project built in spare time. In Jan 2021, I started to notice something strange -- my iPhone would randomly crash while I was using my little app! There were no Crash reports in AppStoreConnect. There weren’t any exceptions in the logs -- the screen would simply turn black and show a spinner for 30 seconds. Then it would bring you to a locked screen asking to enter the PIN to unlock the device. Meanwhile, the Android app worked just fine. And on top of that, it only seemed to affect prod devices -- I was never able to reproduce the issue in dev.

Hard to say when exactly it happened -- I started to notice it after upgrading React-Native stack to Expo, and the first thought was that there is a bug in Expo.

I also implemented my own Image Caching solution, because react-native-fast-image does not work with expo managed workflow. My caching solution worked extremely well for me, which I open-sourced https://www.npmjs.com/package/expo-cached-image .

It would take between couple of days and couple of weeks for the issue to start appearing, and the only way to make it go away was to delete the app from the device and install it fresh from the app store. Restarting the app, or rebooting the device would not help.

All this lead me to believe -- there is some state that is accumulating on the file system, which eventually causes the device to crash. And I was indeed accumulating a lot of state in Image Cache, which was persisting to Storage.

I reached out to Expo for advice and a new forum topic was created: My expo app is crashing in prod, how do I troubleshoot it?
The Expo team was super helpful and explained how to get logs from a production device. Unfortunately, these logs were cryptic and not very useful to me -- I'm not an Operating System engineer, I'm an app developer:

Jun720:29:29kernel[0]<Notice>:1805.088 memorystatus:killing_top_processpid604 [securityd] (vm-pageshortage1)6480KB - memorystatus_available_pages:7069
Jun720:29:29kernel[0]<Notice>:1805.094 memorystatus:killing_top_processpid616 [trustd] (vm-pageshortage1)3184KB - memorystatus_available_pages:6715
Jun720:29:29wifid[353]<Notice>:__WiFiServerClientTerminationCallback:Clientcarkitdterminated,cleanupstate
Jun720:29:29kernel[0]<Notice>:1805.096 memorystatus:killing_top_processpid355 [assistantd] (vm-pageshortage1)9696KB - memorystatus_available_pages:5276
Jun720:29:29kernel[0]<Notice>:1805.100 memorystatus:killing_top_processpid391 [biometrickitd] (vm-pageshortage1)2512KB - memorystatus_available_pages:5013
Jun720:29:29kernel[0]<Notice>:1805.102 memorystatus:killing_top_processpid324 [mediaremoted] (vm-pageshortage1)2976KB - memorystatus_available_pages:5042
Jun720:29:29kernel[0]<Notice>:1805.103 memorystatus:killing_top_processpid383 [cloudpaird] (vm-pageshortage1)3760KB - memorystatus_available_pages:5038
Jun720:29:29kernel[0]<Notice>:1805.104 memorystatus:killing_top_processpid483 [suggestd] (vm-pageshortage1)11616KB - memorystatus_available_pages:5079
Jun720:29:29kernel[0]<Notice>:1805.106 memorystatus:killing_top_processpid384 [searchpartyd] (vm-pageshortage1)5952KB - memorystatus_available_pages:5065
Jun720:29:29kernel[0]<Notice>:1805.109 memorystatus:killing_top_processpid331 [nanomediaremotelinkagent] (vm-pageshortage3)2752KB - memorystatus_available_pages:5078

Basically, this log indicated that at the moment of crash, iOS thought some application is using too much memory, and silently killed it.

I went back and forth with the expo team, insisting that it has to be something with the Storage, while they were pushing back that there is a difference between RAM and Storage, and in my case the app is using too much RAM, and that's why iOS kills it.

It turns out we were all correct in our own ways -- the issue was related to both, the RAM and the Storage (continue reading to the end).

But before the mystery was solved, I had to take few extra steps.

  • I thought, perhaps there is some issue with "fetch" method for calling my REST backend -- I switched to Axios. Felt that Axios handles the networking response codes much better, but this made no difference -- the issue creeped in again 2 weeks later.
  • The REST API's were beginning to feel little antiquated -- so I started migrating the backend to GraphQL (something I wanted to learn for a long time and this was a perfect opportunity). While overall application responsiveness drastically improved -- the crash was still happening intermittently.
  • Next step: start using modern image formats -- Webp instead of PNG. This suggestion came from Google Search Console for optimizing WebVitals for my website, so I thought there is no harm doing that on the mobile, and hopefully it will address my crash issue -- still no luck.

I stored cached images on the file system, so I tried to use Expo's

FileSystem.documentDirectory

Instead of

FileSystem.cacheDirectory

Strange thing about FileSystem.cacheDirectory -- you never know how much Storage it utilizes. It's another one of those mysterious iOS things (like RAM) which is handled automagically. I even went on a rant with the Expo team trying to convince them that there is some issue with how FileSystem.cacheDirectory utilizes resources -- you never know how much storage it uses per app. The iOS can clean up the files in this folder as needed, but you never know when that's going to happen, and the amount of storage utilized by the FileSystem.cacheDirectory per different app is never reflected anywhere in the device runtime stats. Of course the Expo guys pushed back again and said -- everything is fine with how FileSystem.cacheDirectory is implemented.

  • I tried to implement different cache cleanup strategies. While doing it I've learned how to batch Promise.all, which was super awesome technique, but never solved my issue.
  • This RAM vs Storage debate made me realize, that Apple does not advertise RAM size they put in their devices either. Every personal computer, going back to 80s, always specified the size of RAM and the Disk Storage separately. These days the only number Mobile Vendors show on their specs is the Storage. When you buy a new 128Gb iPhone, it means it will have 128Gb storage and you will never be able to find in any of the specs how much Memory (RAM) they put in your devise.

  • Continued digging around, I stumbled upon the following video (which talked about a life cycle of an image on iOS, how it renders it and how it allocates memory while rendering):

  • That video was the turning point for me. While my app shows only 1 full size image at a time on a details screen, it will also render lots of thumbnails all at once on the infinite scroll screen. To implement infinite thumbnails scroll, I'm using a solution based on FlatList, which is super efficient and discards resources that go off the screen. There are still easily 20-30 thumbnails that can be rendered on a screen at any given time.

All this made me question if there is something wrong with how I render the thumbnails -- and surely that's exactly where the problem was.

I always thought that thumbnail were invented back in the web era to conserve the network bandwidth. Here was my excuse for being lazy with the mobile app: I thought, if I have a full size version of the image already available locally, I can simply stick it into the thumbnail view. Adding the full size image to the local cache for thumbnail URL would also save an extra trip to the server next time. The only problem with this approach was that on iOS, rendering an image on the screen will take an amount of memory proportional to the size of the underline image file, regardless of the dimensions of the image on the screen. In other words, in order to render an image, iOS has to bring it into the memory (RAM), and it will load the whole image file, regardless of how small the image appears on the screen. And since the memory is a scarce resource -- iOS reserves the right to silently kill the app that uses too much memory.

This is what I thought was happening:
Thumbnails rendering

But this is what was really going on:
Full size images rendering

Finally I was able to consistently reproduce the issue.
Here is the sequence that would cause the crash:

  1. Take a photo with my app
  2. Send a full size version of the photo to the server, where it gets processed and stored. The server side thumbnail is also generated during this step.
  3. Stick the local image into the cache for the full size image URL as well as for the thumbnail URL, so that it would not re-download these from the server the next time they need to be rendered.
  4. Repeat steps 1 through 3 fifteen or more times.
  5. End up with 15 images on the infinite scroll screen rendering full size files in place of thumbnails
  6. See iOS killing your app because it runs out of memory.

So, the expo team was right after all -- it was a memory issue. I was also correct, because the state (the image cache) was accumulating in the Storage.
This issue was particularly difficult to troubleshoot, because it would only impact devices of most active users -- someone who takes a lot of photos frequently enough to make the thumbs with underline full-size images dominate the screen on local device. If you end up mixing these large file thumbs with the other users thumbs which have to be downloaded from the server before they cached -- the use of the memory would go up, but it would not go up high enough for iOS to kill the app.

The solution -- if you do not have an appropriate thumb version available, always resize it to the dimensions of the image on the screen before rendering.

Lessons learned:

# 1 -- Never give up. When this issue first happened, I had no idea where to start. I tried so many different things, which lead to drastically improving the application performance and the UX. If I knew exactly what's causing my issue in the first place -- I may have never put the same amount of effort into my app, as it was already good enough.

# 2 -- If not you, then who? It's tempting at times to push back -- redirect the blame to 3rd party or someone else's code. I'm convinced once again -- if there is a problem, I can always find a solution. Never go it alone, always reach out for help, ask lots of questions, even if you do not understand every answer right away -- eventually the light bulb will go off. But it's always up to you to keep pushing forward. It's simply a matter of time. Sometimes it may take you 6 months or longer, but then apply rule #1 and never give up.

27