Data has become a major growth driver for many companies.
While collecting data now feels natural, processing it in a relevant way raises real issues.
Which workflow should you adopt? Which tools? How do you fit them into an existing architecture? How do you aggregate the data?
These questions may sound generic, but in practice every project has its own specificities, and there is no one-size-fits-all answer, even if appearances suggest otherwise.
At Poool, data analysis is a priority because it’s at the heart of our knowledge and our business.
Knowing how to detect behaviors and propose the best way to engage and monetize a visitor on a publisher's website requires a relevant and scalable workflow.
Finding adequate solutions has been both a major concern and an important field of research for us. The way we collect and analyze data could not be left to chance: the goal is ultimately to gain in relevance and performance.
The purpose of this article is not to claim that one solution is better than another, but rather to share our experience.
“Why not use Redash?”
Redash works well and is easy to install with Docker.
The app allows you to fetch data from several different sources and make dashboards in three clicks.
The downside: these are MongoDB queries (in our case), and depending on the volume to process, performance can drop sharply.
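For context, Redash's MongoDB data source takes its queries as JSON documents rather than SQL. A simple aggregation might look roughly like this (collection and field names are invented for illustration):

```json
{
    "collection": "events",
    "aggregate": [
        {"$match": {"action": "view"}},
        {"$group": {"_id": "$page", "views": {"$sum": 1}}}
    ]
}
```

Convenient for dashboards, but the query still runs against MongoDB itself, which is where the performance ceiling comes from.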
We were only using it temporarily, for basic queries, because we had moved from a standard architecture to a sharded architecture, where we could no longer use aggregations.
“Fine then, we just write the queries with Python and PyMongo, right?”
Of course, for ease and simplicity, a plain Python script with one method per KPI seemed, naively, like the simplest solution.
And it is indeed the simplest, but not the most effective in terms of processing time. With 100k entries, performance was acceptable; with 1M, things got more complicated.
When crossing a collection of 15M entries with another of 5M, it blew up (processing time reached 30 minutes). From there, everything becomes exponential; even with low-level libraries such as Monary, the processing time kept increasing.
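To give an idea, the naive approach looked roughly like the following sketch (collection and field names are invented for illustration; this is not Poool's actual code):

```python
# Hypothetical "one method per KPI" script. The KPI logic is kept pure
# (it takes any iterable of documents) so it is easy to test.
from collections import Counter

def conversion_rate(events):
    """Share of 'subscribe' actions among all paywall events."""
    counts = Counter(e.get("action") for e in events)
    total = sum(counts.values())
    return counts["subscribe"] / total if total else 0.0

if __name__ == "__main__":
    # Streaming the cursor keeps memory flat, but every document still
    # travels from MongoDB to Python -- which is what stops scaling.
    from pymongo import MongoClient  # pip install pymongo
    cursor = MongoClient()["analytics"]["events"].find({}, {"action": 1})
    print(conversion_rate(cursor))
```

Each KPI pulls the documents it needs over the wire, so processing time grows with the raw volume rather than with the size of the answer.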
A weekly or monthly report would not have worried us, but in our case it is about reacting quickly to current activity. So we had to find a different solution, and keep this one as a fallback.
“ElasticSearch looks cool”
Solr, Elasticsearch, Hadoop… different solutions with a common goal: making it easier to process large datasets.
Even if at the time we weren't doing “Big Data” (nor even medium data, for that matter), it was clear that we were going to get there.
So here came our first approach: take a copy of the production database and synchronize it locally with ElasticSearch through a connector.
“Mongo-connector looks great!”
After searching for a solution, we quickly came across mongo-connector, a well-designed tool.
The purpose of this tool is to provide an “all in one” bridge between a MongoDB database and ElasticSearch, Solr or another MongoDB database. Our first test was conclusive, and the first ES indexes were filled with production data.
Mongo-connector was doing the job and the queries were extremely fast. We also discovered Kibana.
“Did you say scalable?”
At that point, we were working with a reasonable volume: the index took about two hours to build, and the idea of running it with a nightly CRON job seemed pretty cool.
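The nightly sync amounted to something like this (hosts, paths and the doc-manager name depend on your versions; shown as an illustration, not our exact command):

```shell
# Crontab entry: rebuild the analytics copy every night at 2am.
# mongo-connector tails the MongoDB oplog and pushes changes to the
# target (-t); the doc manager must match your Elasticsearch version.
0 2 * * * mongo-connector -m localhost:27017 -t localhost:9200 \
          -d elastic2_doc_manager >> /var/log/mongo-connector.log 2>&1
```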
But after a new stage of our deployment and a change of architecture, the data volume increased significantly and mongorestore required more parameters than before.
After a little work on it, we restarted the sync with mongo-connector, and its behavior became… quite random.
Random as in: “I index all the data into ElasticSearch, but I delete everything once it's done.”
Yes, it makes no sense. No similar cases, open issues on GitHub left unanswered… If you have ever faced this kind of situation, you know how bad it feels.
Finally, we tried to connect it directly to the production server. Two problems with this: the scalability issues it could cause, and a compatibility problem between our architecture and the way mongo-connector works.
We did not go further.
“Well, OK, it's probably not suitable. What about Logstash?”
Next step: Logstash. The tool looked really cool and could have solved the problems we were facing.
The software fetches, parses and stores data for analysis. The operation remains simple: a configuration file specifies the “input” (where the data comes from) and the “output” (where it goes).
We therefore needed to find an “input plugin” for MongoDB, as this input format is not handled natively by Logstash.
The most downloaded one had received no support or updates for 11 months.
After some research, we managed to almost make it work: it was impossible to identify which collection (the equivalent of a table in SQL) belonged to which index, because of a specific issue in the processing.
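The pipeline we were aiming for looked roughly like this (the `mongodb` input comes from a community plugin, so option names vary by version and should be treated as assumptions; hosts and names are invented):

```conf
# Sketch of a Logstash pipeline: MongoDB in, Elasticsearch out.
input {
  mongodb {
    uri        => "mongodb://localhost:27017/analytics"
    collection => "events"
    placeholder_db_dir => "/opt/logstash-mongodb"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "events"
  }
}
```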
Many tests, quite a bit of hard work, and no 100% conclusive results. We could have settled for that, but given the importance of the subject, we wanted to try to do things properly.
“Well, that’s good, I’ve got something.”
If you are a developer, you probably know the rubber duck theory: the principle is to explain your problem out loud to a third party so that the solution becomes obvious.
At that moment, a new element appeared to us: maybe we were not taking the right approach. The dominant logic was to plug some solution X into an existing infrastructure. In reality, maybe our infrastructure needed to change?
It was then, after reading a lot of documentation, that we came across this:
The infrastructure is complex, but the logic is simple: send the data to two different places at the same time.
The implementation was not extremely expensive, and it allows us to reach the performance objectives we had set ourselves.
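The principle can be sketched in a few lines (illustrative only, not our production code): each incoming event is fanned out to the primary store and to the analytics pipeline at write time, so the analytics side never has to be rebuilt from production afterwards.

```python
# Minimal fan-out sketch: one event, two destinations.
def fan_out(event, sinks):
    """Deliver an event to every sink; one failing sink must not block the rest."""
    errors = []
    for sink in sinks:
        try:
            sink(event)
        except Exception as exc:  # e.g. the analytics cluster is down
            errors.append(exc)
    return errors

# Example: the "primary database" and the "analytics index" are plain
# lists here; in practice they would be MongoDB and Elasticsearch writers.
primary, analytics = [], []
fan_out({"action": "view", "page": "/article"}, [primary.append, analytics.append])
```

The design choice is that the two stores are decoupled: the analytics copy can lag or fail without ever touching the production write path.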
All approaches are possible
Every method we experimented with could have worked: we could have parallelized the Python script across several threads, distributed the load via RabbitMQ… The reality is that we wanted to avoid being “specific”.
Staying generic in our methods allows two things: avoiding a solution that is too complex to maintain, and guaranteeing our customers that we use proven, viable methods. On a topic as critical as data, we did not want to rely on DIY.