
A new Rutgers-led study has found that geographic relations expressed in social media differ significantly from those described in prior theoretical frameworks, and that the choice of extraction method strongly affects which types of relations are identified. Designed to address a gap in how geographic information is extracted from unstructured social media text, the study also found that large language models perform better at geographic relation extraction when trained with domain-specific data.
"More accurate extraction of geographic relations from social media can improve the precision of location identification during crises such as natural disasters or public health emergencies, when official location tags or coordinates are often missing,” said SC&I Assistant Professor of Library and Information Science Jessica Yi-Yun Cheng, lead author of the study. “This can help emergency responders locate affected areas faster and with greater accuracy, potentially saving lives and resources."
Co-authored by Assistant Professor Ly Dinh, at the School of Information, University of South Florida, the study was published in the International Journal of Geographical Information Science.
"While existing geographic extraction approaches focus mainly on identifying place names," Cheng said, "they often miss the relations between locations, e.g., information like 'north of,' 'located in,' or 'between.' These relations can be critical for disambiguating incomplete location information, especially in urgent situations like disaster response. Our goal was to adapt and evaluate relation extraction models specifically for geographic contexts and to capture the variety of geographic relations found in everyday social media language."
Cheng and Dinh collected 163,037 English-language tweets, posted between 2022 and 2023, that mention U.S. county and state names. After cleaning and preprocessing the data, they annotated a sample of tweets for geographic entities and relations to serve as ground truth.
They then applied and compared five relation extraction methods: a heuristics-based model, OpenIE, BERT, GPT-3.5-turbo, and Mistral, Cheng said.
"We adapted these models for the geographic relation extraction task. Model outputs were evaluated against the ground truth using precision, recall, and F1-scores, and we analyzed the types and frequencies of relations each model produced."
The study’s additional significant findings include:
- Cheng and Dinh created a categorized list of common geographic relations found in social media, including meronymic (e.g., is-a, is-part-of), prepositional (e.g., on, at, between), "include" and "locate" variations, geographic noun phrases, and other spatial relations.
- They showed that certain relation categories, especially "include" and "locate" variants, are more prevalent in natural language than the set relations (e.g., RCC-5, RCC-8) favored by earlier GIS research.
- Their error analysis revealed that many current models tend to generate false positives, highlighting the need for targeted filtering when working with noisy social media data.
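The relation categories listed above can be illustrated with a small rule-based sketch that tallies how often each category appears in a set of extracted relation phrases. It is a simplified assumption of how such a tally might work, not the study's taxonomy or code: the word lists contain only the examples named in this article, the function name `categorize_relation` is hypothetical, and geographic noun phrases are left out because detecting them would require part-of-speech tagging.

```python
import re
from collections import Counter

# Word lists limited to the examples given in the article (an assumption).
MERONYMIC = {"is a", "is part of"}
PREPOSITIONAL = {"on", "at", "between"}
INCLUDE_LOCATE = re.compile(r"\b(?:includ|locat)\w*")


def categorize_relation(phrase: str) -> str:
    """Assign a relation phrase to one coarse category."""
    # Normalize case, hyphens, and repeated whitespace before matching.
    text = " ".join(phrase.lower().replace("-", " ").split())
    if text in MERONYMIC:
        return "meronymic"
    # Checked before prepositions so "located in" is not reduced to "in".
    if INCLUDE_LOCATE.search(text):
        return "include/locate"
    if text in PREPOSITIONAL:
        return "prepositional"
    return "other spatial"


# Hypothetical extracted relation phrases, for illustration only.
phrases = ["located in", "is-part-of", "between", "includes", "north of", "Located at"]
counts = Counter(categorize_relation(p) for p in phrases)
print(counts)
```

Counting categories this way makes it easy to compare which kinds of relations each extraction method tends to produce.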
"Our findings can guide the design of better geographic information systems, enhance event monitoring platforms, and inform future natural language processing research," Cheng said.
Learn more about the Library and Information Science Department at the Rutgers School of Communication and Information on its website.