This is a guest post by Candelario A. Gutierrez. Thanks for sharing your CrateDB experience with us.
My name is Candelario A. Gutierrez, and in this post I will talk about how we integrated CrateDB into our platform for analyzing social-platform data. We have reported on this work in our paper “Analyzing and visualizing Twitter conversations” (see https://dl.acm.org/doi/abs/10.5555/3507788.3507791). Here, I provide more detail on the technical aspects of our platform, of interest to software developers who deal with similar data and want to explore CrateDB.
Conversations about public interest issues have shifted from mediated traditional media sources to unmediated online media, blogs, and webpages.
- The meaning, value, and emotions of online conversations can have lasting financial consequences (Lefsrud, Westbury, Keith & Hollis, 2015).
- Online stakeholder activism can create an existential crisis for organizations that use purely technical methods for identifying and evaluating their risks (Mechler, 2016).
- Online conversations about any single topic are inter-related with many other conversations on other related topics (Oberg, Lefsrud & Meyer, 2021).
As a result, there is substantial interest in developing computational methods for analyzing social-platform data (Sapountzi & Psannis, 2018). Our own work has been motivated by the need for a robust, scalable platform that is able (a) to deal with multiple large data sets and (b) to easily integrate new analyses. In its current version, the platform analyzes data at the lexical level (extracting words, hashtags, URLs, emojis, mentions, and media); integrates three dictionary-based semantic analyses, namely personal values, sentiment (of the tweet text and images), and humor; extracts domain-specific aspects (specified as phrases) and their associated sentiments; and quantifies engagement in terms of a variety of alternative metrics.
Social media is a prime example of semi-structured data. It consists primarily of free-form texts, called posts, containing references to users (e.g., "at mentions" on Twitter), topics of interest (e.g., hashtags), or other posts (e.g., "retweets" on Twitter). This type of data can easily grow to terabytes. Therefore, for a system to be capable of processing, analyzing, and visualizing medium-to-large data sets, it needs to fulfill a set of requirements:
- Efficient data processing.
- Distributed and easily parallelizable architecture, including a powerful query interface.
- Low latency database access.
Thanks to open-source platforms and tools, we managed to fulfill all those requirements as pictured below.
There are three key qualities that this platform brings to the table: automation, scalability, and responsiveness. Let me explain how they are reflected in our components:
- The pipeline starts with a client that collects Twitter data in one of two ways: through the API, or through a scraper. From the result, we extract URLs and images from the dataset, so that URLs can be expanded if needed and images can be downloaded asynchronously. A parsed CSV, containing only the posts’ and users’ data, is created for the subsequent steps.
- The CSV file is ingested through an Apache Kafka topic and consumed by Apache Spark Structured Streaming. Once Spark detects new data arriving on the Kafka topic, our custom functions transform the data and perform a number of calculations.
- Once processed, each record is transferred to its respective table in CrateDB. When all the tweets are available in the database, a corresponding users table is constructed through a CrateDB view that selects unique users based on the id field.
- URL expansion and image processing run in a separate pipeline that makes asynchronous API calls; the results are later written to their respective tables in the database.
- Finally, we developed a set of custom visualizations that fetch data from CrateDB through SQLAlchemy. A Socket.IO client listens for and establishes connections, transferring data to our front-end with less overhead. The visualizations perform aggregations and counts using the ZingChart library, and a table component built with the FancyGrid library lets users explore the full datasets.
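The URL-expansion step in the pipeline above relies on firing many requests concurrently. Here is a minimal sketch of that pattern using `asyncio`, with a stubbed `expand_url` coroutine (and a made-up URL mapping) standing in for the real HTTP client that follows redirects:

```python
import asyncio

# Stub standing in for a real HTTP request that follows redirects;
# this mapping is purely illustrative.
SHORT_TO_LONG = {
    "https://t.co/abc": "https://example.com/article-1",
    "https://t.co/xyz": "https://example.com/article-2",
}

async def expand_url(short_url: str) -> str:
    """Pretend to resolve a shortened URL (a network call in the real pipeline)."""
    await asyncio.sleep(0)  # yield control, as a real request would
    return SHORT_TO_LONG.get(short_url, short_url)

async def expand_all(urls: list[str], concurrency: int = 50) -> list[str]:
    """Expand many URLs concurrently, bounded by a semaphore."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> str:
        async with sem:
            return await expand_url(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

expanded = asyncio.run(expand_all([
    "https://t.co/abc", "https://t.co/xyz", "https://example.com/direct",
]))
```

The semaphore bound keeps the number of in-flight requests manageable, which matters when the external resolver throttles heavy traffic.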
In case external services need to consult the processed data (or any other data), we developed a set of FastAPI endpoints that support different types of SELECT queries against CrateDB:
- /select/: Retrieve paginated results of data.
- /select_offset/: Retrieve data based on an offset.
- /select_ids/: Retrieve data based on id(s).
- /select_date_range/: Retrieve data based on a time range.
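These endpoints map naturally onto SQL LIMIT/OFFSET and WHERE clauses. Below is a hypothetical sketch of how the underlying statements might be parameterized; the table and column names are assumptions for illustration, not the project’s actual schema:

```python
def select_page(table: str, page: int, page_size: int = 2500) -> str:
    """Paginated SELECT; page numbers start at 1.

    The table name comes from trusted application code; user-supplied
    values should always go through placeholders instead.
    """
    offset = (page - 1) * page_size
    return f"SELECT * FROM {table} ORDER BY id LIMIT {page_size} OFFSET {offset}"

def select_ids(table: str, ids: list[str]) -> tuple[str, list[str]]:
    """SELECT by id(s), using qmark placeholders so values are bound separately."""
    placeholders = ", ".join("?" for _ in ids)
    return f"SELECT * FROM {table} WHERE id IN ({placeholders})", ids

def select_date_range(table: str) -> str:
    """SELECT over a time range; the bounds are bound as parameters at execution."""
    return f"SELECT * FROM {table} WHERE created_at >= ? AND created_at < ?"

stmt = select_page("tweets", page=2)
```

Each FastAPI endpoint would then bind its query parameters and execute the statement through the database client.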
Now, let me introduce you to the internal components of the data processing and analysis section. As depicted in the image below, we have a set of different analysis pipelines.
Text and Metadata Analysis
Dictionary-Based Lexicon Look-Up
- Preprocessing: We developed two custom functions that work with text: one looks for words associated with personal values, humor, and sentiment; the other supports aspect-based analysis. Both apply standard Natural Language Processing (NLP) rules during tokenization and normalization.
- Elements: We extract and quantify different elements of the post that can be useful to support quantitative analyses, such as handles, emojis, hashtags, URL(s) and image(s).
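To illustrate both steps, here is a deliberately simplified, pure-Python sketch of a tokenizer/normalizer feeding a dictionary look-up, plus regex-based element extraction. The toy lexicon is made up for this example; the real pipeline uses established dictionaries for personal values, sentiment, and humor:

```python
import re

# Toy lexicon for illustration only; the actual pipeline relies on
# published dictionaries rather than this hand-picked word list.
SENTIMENT_LEXICON = {"great": 1, "love": 1, "terrible": -1, "hate": -1}

HANDLE_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
URL_RE = re.compile(r"https?://\S+")

def normalize(text: str) -> list[str]:
    """Strip URLs, handles, and hashtags, then lowercase and tokenize."""
    text = URL_RE.sub(" ", text)
    text = HANDLE_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    return re.findall(r"[a-z']+", text.lower())

def lexicon_score(tokens: list[str]) -> int:
    """Sum the lexicon values of the tokens found in the dictionary."""
    return sum(SENTIMENT_LEXICON.get(t, 0) for t in tokens)

def extract_elements(text: str) -> dict:
    """Quantify handles, hashtags, and URLs in a post."""
    return {
        "handles": HANDLE_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
        "urls": URL_RE.findall(text),
    }

tweet = "I love this pipeline, @alice! Details at https://example.com #energyeast"
tokens = normalize(tweet)
score = lexicon_score(tokens)
elements = extract_elements(tweet)
```

Note that elements are extracted from the raw text, while the lexicon look-up runs on the normalized tokens, so handles and hashtags never pollute the dictionary matches.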
Aspect-Based Sentiment Analysis
We apply dependency parsing to reveal the relationships between words that modify the meaning of other words. When multiple aspects are found in the text, they are separated in the dependency structure and their specific modifiers can later be analyzed for sentiment independently.
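In our pipeline this relies on a real dependency parser; as a rough illustration of the idea only, the sketch below attaches each sentiment-bearing modifier to the nearest aspect term by token distance, a crude stand-in for walking the dependency structure. The modifier polarities are invented for the example:

```python
# Toy modifier polarities; a real pipeline derives modifier-aspect links
# from the dependency parse, not from token distance.
MODIFIERS = {"cheap": 1, "reliable": 1, "dirty": -1, "slow": -1}

def aspect_sentiments(tokens: list[str], aspects: set[str]) -> dict[str, int]:
    """Attach each known modifier to the nearest aspect term and sum polarities."""
    positions = [(i, t) for i, t in enumerate(tokens) if t in aspects]
    scores = {t: 0 for _, t in positions}
    for i, tok in enumerate(tokens):
        if tok in MODIFIERS and positions:
            nearest = min(positions, key=lambda p: abs(p[0] - i))[1]
            scores[nearest] += MODIFIERS[tok]
    return scores

tokens = "the pipeline is reliable but the refinery is dirty".split()
result = aspect_sentiments(tokens, {"pipeline", "refinery"})
```

Even in this toy form, the key property holds: each aspect receives its own sentiment, so one tweet can be positive about one aspect and negative about another.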
Engagement Calculation (Public Metrics)
We apply standard formulas to compute engagement and/or influence, extended reach, and posts’ impressions based on the metadata that we can extract from the query search.
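As an example, one common variant of these formulas computes engagement rate as interactions over follower count and extended reach as the author’s audience plus the audiences of retweeters; the exact metrics used in the platform may differ:

```python
def engagement_rate(likes: int, retweets: int, replies: int, followers: int) -> float:
    """Interactions per follower; one common engagement formula among several variants."""
    if followers == 0:
        return 0.0
    return (likes + retweets + replies) / followers

def extended_reach(followers: int, retweeters_followers: list[int]) -> int:
    """Author's followers plus the followers of everyone who retweeted the post."""
    return followers + sum(retweeters_followers)

rate = engagement_rate(likes=120, retweets=30, replies=10, followers=8000)
reach = extended_reach(followers=8000, retweeters_followers=[500, 1200])
```

All inputs come straight from the public metrics returned with the query search, so these calculations add negligible cost to the pipeline.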
We have recently incorporated an API into our system that is capable of processing individual images or batches of images. Once an image is received, a number of analyses are applied to it, including:
- Color scheme analysis: Color degrees as RGB values.
- Object detection: Mark objects with bounding boxes.
- Image classification: Confidence value of objects.
- Sentiment analysis: Sentiment score within 3 degrees.
- OCR: Recognition of text.
- Face analysis: Recognition of faces.
Such analyses are later transferred into CrateDB as a single table with their respective post ID.
Custom curated visualizations were created to interpret all of the text and metadata analyses. Two groups of visualizations are available, contextualizing (a) the frequency of personal values, sentiment, and humor in tweets, and (b) the mention of terms associated with an aspect and which combinations of them are popular in tweets. The added value of these visualizations lies in the dictionary multi-select, the aspect filter, and the time-control bar: functionalities that allow the user to get a more in-depth view of the data.
The Energy East Case Study
To validate the architecture of our data platform and the usability of our analysis pipelines, we performed a case study based on the “#energyeast” hashtag query.
The characteristics of this dataset were the following:
- The raw CSV file:
- Collected between June 7, 2013 and June 16, 2021.
- Contained 28,693 tweets, 111,091 retweets, 3,753 tweets with quotes and 4,780 replies with a total of 148,317 tweets.
- Had a size of 68.2 MB.
- Contained 23 columns.
- Number, string, alphanumeric and timestamp data types were used throughout the file.
- The URLs file was composed of:
- 56,970 URLs (including duplicates).
- 13,428 unique URLs.
- 4,879 short URLs.
- The images file was composed of:
- 5,696 images.
The pipelines for text, engagement, metadata, and URL processing were executed on a Linux PC with 16 cores and 56 GB of RAM, which also hosted a dockerized three-node deployment of CrateDB v4.6.3. For image processing, we used a Linux PC with 4 cores, 22 GB of RAM, and an NVIDIA vGPU with 8 GB of memory. The approximate wall-clock execution times of our pipelines were:
- URL expansion: 56 minutes for processing and 5 seconds for writing to a two-column table with the schema:
- id - text.
- expanded_urls - array(text).
- Image processing: 19 hours for processing and 5 seconds for writing to a two-column table with the schema:
- id - text.
- image_analyses - array(text).
- Dictionary-based lexicon look-up, aspect-based sentiment, metadata extraction and quantification, and engagement calculations: 2 minutes in total for processing and writing to a table of 42 columns.
Based on the final results, we identified two bottlenecks: URL expansion and image processing. Both make asynchronous API calls to fetch data from external resources, and therefore depend on two factors:
- The network throughput.
- The external resource's capacity to handle the volume of requests, and its processing power.
To conclude: regardless of the number of columns our tables needed, we observed very good write performance. Moreover, serving our visualizations by selecting all 148,317 rows in batches of 2,500 takes approximately 17 seconds. We therefore found CrateDB, with its low latency, to be a good fit for the requirements of this study and for this type of data.
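The batched selection mentioned above can be expressed as a simple loop over LIMIT/OFFSET windows. Here is a sketch, with a stubbed `fake_execute` over an in-memory list standing in for a real database cursor:

```python
def fetch_in_batches(execute, table: str, batch_size: int = 2500):
    """Yield rows batch by batch using LIMIT/OFFSET until the result set is exhausted."""
    offset = 0
    while True:
        rows = execute(f"SELECT * FROM {table} ORDER BY id LIMIT {batch_size} OFFSET {offset}")
        if not rows:
            return
        yield from rows
        offset += batch_size

# Stubbed executor over an in-memory "table", for illustration only.
DATA = list(range(6001))  # 6001 fake rows

def fake_execute(sql: str) -> list[int]:
    limit = int(sql.split("LIMIT")[1].split("OFFSET")[0])
    offset = int(sql.split("OFFSET")[1])
    return DATA[offset:offset + limit]

fetched = list(fetch_in_batches(fake_execute, "tweets"))
```

Keeping a stable ORDER BY between batches matters here; without it, rows can be skipped or duplicated as the window advances.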
If you wish to take a look at the project, you can visit https://ai4buzz.ca/. Further development of this platform will include topic and language modeling, evolution analysis, and more visualizations.
This project was supported by:
- Ai4Society: https://ai4society.ca/
- The Canadian Energy and Climate Nexus (CECN): https://thececn.ca/
- ISAIC: https://isaic.ece.ualberta.ca/
- Kule Institute for Advanced Study (KIAS): https://www.ualberta.ca/kule-institute/index.html
- Mitacs: https://www.mitacs.ca/en
Lefsrud, L., Westbury, C., Keith, J., & Hollis, G. (2015). A basis for genuine dialogue: Developing a science-based understanding of public/industry communication. Phase I Report Prepared for the Alberta Chamber of Resources.
Mechler, R. (2016). Reviewing estimates of the economic efficiency of disaster risk management: opportunities and limitations of using risk-based cost–benefit analysis. Natural Hazards, 81(3), 2121-2147.
Oberg, A., Lefsrud, L., & Meyer, R. E. (2021). Organizational (issue) field perspective on climate change. Economic Sociology: The European Electronic Newsletter, 22(3), 21-29.
Sapountzi, A., & Psannis, K. E. (2018). Social networking data analysis tools & challenges. Future Generation Computer Systems, 86, 893-913.