Currently used for tracking various statistics in /stats
- In the future it will be used for showing statistics of various streams
- Minimal impact on scalability and service complexity
- Well tested
- Efficient to query, so that the data can be displayed easily (for example on a stream page) without performance issues.
- Storage size should be smaller than the main Message/UserMessage tables, so we don't need any additional database.
Has 3 main components:
- The *Count tables (UserCount, StreamCount, RealmCount, InstallationCount), which store the collected data.
  - Rows are of the form (property, subgroup, end_time, value), e.g. property = messages_sent:client:day.
  - subgroup: e.g. the client id, for stats that are broken down by client.
  - Foreign key fields to UserProfile, Stream, Realm, etc., depending on the table.
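The row shape described above can be sketched as a plain dataclass (field names are illustrative, not the actual Django model):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class CountRow:
    """Illustrative shape of a row in one of the *Count tables."""
    property: str            # e.g. "messages_sent:client:day"
    subgroup: Optional[str]  # e.g. a client id; None for stats without subgroups
    end_time: datetime       # end of the hour/day the count covers
    value: int               # the count itself; rows with value 0 are never stored

# A UserCount-style row would additionally carry a user_profile foreign key,
# a RealmCount-style row a realm foreign key, and so on.
row = CountRow(
    property="messages_sent:client:day",
    subgroup="website",
    end_time=datetime(2024, 1, 2, tzinfo=timezone.utc),
    value=17,
)
```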
- CountStat objects, which define which stats should be generated and which tables they are stored in.
- The FillState table, which keeps track of when the cron job last updated each stat.
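A minimal sketch of how these two pieces relate (the class shapes here are hypothetical simplifications, not Zulip's actual definitions):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class CountStat:
    # What to compute and where the result lands.
    property: str
    output_table: str
    frequency: str  # "hour" or "day"

@dataclass
class FillState:
    # Last end_time for which this stat has been fully computed.
    property: str
    end_time: datetime

stat = CountStat("messages_sent:client:day", "UserCount", "day")
fill = FillState(stat.property, datetime(2024, 1, 1, tzinfo=timezone.utc))

# The cron job only needs to fill the gap between fill.end_time and now.
now = datetime(2024, 1, 4, tzinfo=timezone.utc)
days_to_fill = (now - fill.end_time) // timedelta(days=1)
```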
Counts are initially collected into UserCount or StreamCount.
- UserCount and StreamCount rows are then aggregated into RealmCount, and RealmCount rows into InstallationCount.
- For example, messages_sent:client:day has rows in UserCount keyed by UserProfile, client, and end_time.
  - When they are aggregated into RealmCount, the rows are keyed by Realm, client, and end_time.
  - RealmCount rows are aggregated into InstallationCount, keyed by client and end_time only.
- Rows with value 0 are not stored.
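The aggregation step can be sketched with SQLite standing in for the real database (table and column names are simplified for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE usercount (
    user_id INTEGER, realm_id INTEGER, property TEXT,
    subgroup TEXT, end_time TEXT, value INTEGER)""")
cur.execute("""CREATE TABLE realmcount (
    realm_id INTEGER, property TEXT,
    subgroup TEXT, end_time TEXT, value INTEGER)""")

# Per-user counts for one day of messages_sent:client:day.
cur.executemany(
    "INSERT INTO usercount VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "messages_sent:client:day", "website", "2024-01-02", 5),
        (2, 1, "messages_sent:client:day", "website", "2024-01-02", 3),
        (3, 1, "messages_sent:client:day", "mobile",  "2024-01-02", 2),
    ],
)

# Aggregate inside the database: one INSERT ... SELECT, no round-trips
# through Python, and grouping by realm/client/end_time drops the user_id.
cur.execute("""
    INSERT INTO realmcount (realm_id, property, subgroup, end_time, value)
    SELECT realm_id, property, subgroup, end_time, SUM(value)
    FROM usercount
    GROUP BY realm_id, property, subgroup, end_time
    HAVING SUM(value) > 0
""")
rows = cur.execute(
    "SELECT subgroup, value FROM realmcount ORDER BY subgroup").fetchall()
# rows == [("mobile", 2), ("website", 8)]
```

The `HAVING SUM(value) > 0` clause mirrors the rule that rows with value 0 are never stored.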
- An analytics system will end up requiring something like Hadoop if it's not managed carefully.
- The Zulip analytics system is therefore designed so that the data can be processed on a single machine.
- We also need to make sure that expensive queries are not made repeatedly against the Message and UserMessage tables, since they are really big.
- Some key design principles:
  - Don't repeat work: FillState records what has already been computed.
  - Storing data in the *Count tables ensures that we don't hit large tables like Message and UserMessage when generating analytics data.
  - Do expensive operations inside the database rather than pulling the data into Python and writing it back.
    - e.g. INSERT INTO table2 SELECT * FROM table1;
  - Aggregate existing count tables instead of making additional queries to the Message and UserMessage tables.
    - e.g. make one query to generate UserCount, then generate RealmCount from UserCount rather than querying Message again.
  - Don't store rows with value 0, since they take up too much space.
    - An hourly stat collecting .5MB of data per user each hour would come to 24 * 365 * .5MB ≈ 4GB per user per year.
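The "don't repeat work" principle can be sketched as computing only the hour boundaries missing since the last recorded FillState (the helper name is made up for illustration):

```python
from datetime import datetime, timedelta, timezone

def hours_to_fill(last_filled: datetime, now: datetime) -> list[datetime]:
    """Return the hour-end boundaries that still need to be computed.

    Re-running this after a successful fill returns an empty list, so the
    cron job never recomputes data it has already stored.
    """
    # Truncate `now` down to the most recent completed hour boundary.
    latest_complete = now.replace(minute=0, second=0, microsecond=0)
    hours = []
    t = last_filled + timedelta(hours=1)
    while t <= latest_complete:
        hours.append(t)
        t += timedelta(hours=1)
    return hours

last = datetime(2024, 1, 1, 10, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 13, 30, tzinfo=timezone.utc)
pending = hours_to_fill(last, now)  # the 11:00, 12:00, and 13:00 boundaries
```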
LoggingCountStat is used for stats whose underlying data isn't worth storing every data point for; the counts are logged directly at event time instead of being computed later from other tables.
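A LoggingCountStat-style stat can be sketched as incrementing a (property, subgroup, end_time) bucket directly when the event happens, rather than querying big tables later (this is an illustrative counter, not Zulip's implementation; the property name is made up):

```python
from collections import defaultdict
from datetime import datetime, timezone

# (property, subgroup, end_time) -> running count
counts: dict[tuple, int] = defaultdict(int)

def log_count(prop: str, subgroup: str, event_time: datetime, incr: int = 1) -> None:
    """Bump the bucket for the day containing event_time."""
    end_time = event_time.replace(hour=0, minute=0, second=0, microsecond=0)
    counts[(prop, subgroup, end_time)] += incr

when = datetime(2024, 1, 2, 15, 30, tzinfo=timezone.utc)
log_count("invites_sent:day", "none", when)
log_count("invites_sent:day", "none", when, incr=2)
```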
When adding a new table, make sure to edit and run populate_analytics_db so that fake data is generated in the dev environment.