At FarmHedge we recently got to the point where the data we store has become an important asset for us and our customers. Our users ask us for reports on their transactions on our system, or for statistics about their customers that are stored only in our databases. In the past we struggled to provide this type of information because the data was not easily accessible.
It was not possible to access key information stored in our database without the intervention of a developer.
Our data is stored in a NoSQL DB (MongoDB) and, in the past, it was accessible only via MongoDB queries or by exporting the data to Excel/Google Sheets for further analysis. These queries and data extractions were typically run during working hours and generated additional load on the production database. We were abusing our production database and degrading the user experience of our customers.
We realised we needed a way to access our data more efficiently, via queries in a standard language (SQL), without interfering with the everyday usage of our products. We needed to set up a data warehouse.
Choosing a data warehouse platform
There is plenty of material online comparing the performance, pricing and integration capabilities of data warehouse platforms. The ones we considered initially at FarmHedge were AWS Redshift, Google BigQuery and Snowflake. They all have their pros and cons, and I spent quite some time reading articles and watching introduction videos. For a startup, pricing and time-saving capabilities are the most important decision-making factors. In the end, we went for Google BigQuery for a few reasons:
- Pricing is proportional to usage, such as the amount of data scanned by queries;
- It provides limited-but-sufficient access to Google Data Studio (a simplified alternative to Tableau);
- It inherits the sharing capabilities of the standard Google Docs family of products.
Keeping the data warehouse up to date
The next challenge we had to solve was how to keep the data in our database synchronised with Google BigQuery. We considered three different possible solutions.
Solution 1. Manual updates
This solution consists of manually exporting the data from MongoDB using mongoexport and then uploading it to BigQuery in CSV format. The procedure is described in detail in this StackOverflow answer. It is very simple, but it is not automated and is prone to error in any of the manual steps.
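As an illustration, the manual steps look roughly like the following. The database, collection, field list and BigQuery dataset/table names are hypothetical placeholders, not our actual schema.

```
# Export a collection to CSV (hypothetical db/collection/field names)
mongoexport --db=farmhedge --collection=transactions \
  --type=csv --fields=_id,amount,customer,updatedAt \
  --out=transactions.csv

# Load the CSV into a BigQuery table (hypothetical dataset.table)
bq load --source_format=CSV --skip_leading_rows=1 --autodetect \
  warehouse.transactions transactions.csv
```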
Solution 2. Automated with paid ETL services
Examples of paid ETL services supporting MongoDB-to-BigQuery replication are Stitch, Hevo and Fivetran. None of them offers MongoDB integration for free. Prices vary, but on average they start at around €100 per month for the simplest type of integration. These services promise a solution set up in minutes, but they also require full access to your DB.
Solution 3. Automated with ad-hoc solution
Reluctant to give 3rd-party tools access to our DB, and considering that setting up those tools would have cost time too, I decided to also explore an ad-hoc automated solution. It is possible to stream data to BigQuery by using its streaming APIs. Note that this is a valid approach as long as the insert rate respects the (pretty generous) BigQuery streaming quota.
Our solution consists of a scheduled job that reads the latest changes to specific documents in our DB and then streams the data to BigQuery. The procedure is implemented in JavaScript (our API is based on NodeJS) and it is similar to this one.
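For illustration, here is a minimal sketch of such a job using the official MongoDB and @google-cloud/bigquery Node.js clients. The database, collection, field and table names, and the updatedAt change marker, are assumptions made for the example, not our actual implementation.

```
// Minimal sketch of a scheduled MongoDB-to-BigQuery sync job.
// All names below (database, collection, dataset, table, fields)
// are hypothetical placeholders.
const { MongoClient } = require('mongodb');
const { BigQuery } = require('@google-cloud/bigquery');

const MONGO_URI = process.env.MONGO_URI;
const DB_NAME = 'farmhedge';        // hypothetical database name
const COLLECTION = 'transactions';  // hypothetical collection
const DATASET = 'warehouse';        // hypothetical BigQuery dataset
const TABLE = 'transactions';       // hypothetical BigQuery table

async function syncToBigQuery(lastSyncedAt) {
  const mongo = new MongoClient(MONGO_URI);
  const bigquery = new BigQuery(); // uses application default credentials

  try {
    await mongo.connect();

    // Read only the documents changed since the previous run,
    // assuming each document carries an updatedAt timestamp.
    const changedDocs = await mongo
      .db(DB_NAME)
      .collection(COLLECTION)
      .find({ updatedAt: { $gt: lastSyncedAt } })
      .toArray();

    if (changedDocs.length === 0) {
      return;
    }

    // Flatten the documents into rows matching the BigQuery table schema.
    const rows = changedDocs.map((doc) => ({
      id: doc._id.toString(),
      amount: doc.amount,
      customer: doc.customer,
      updated_at: doc.updatedAt ? doc.updatedAt.toISOString() : null,
    }));

    // Stream the rows to BigQuery via the streaming insert API.
    // The insert rate must stay within the BigQuery streaming quota.
    await bigquery.dataset(DATASET).table(TABLE).insert(rows);
    console.log(`Streamed ${rows.length} rows to ${DATASET}.${TABLE}`);
  } finally {
    await mongo.close();
  }
}
```

In practice, a scheduler (a cron job or a library such as node-cron) would call syncToBigQuery with the timestamp of the last successful run and persist the new timestamp afterwards.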
At the time of writing, we are using solution #3, as the implementation was not too hard, but when the complexity of the data model increases we will probably switch to one of the paid ETL services listed under solution #2.
Bernardino Frola, CTO @ FarmHedge