https://github.com/compnet/transpolosearch

Web-based information extraction for political science

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.0%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: CompNet
  • License: gpl-2.0
  • Language: Java
  • Default Branch: master
  • Homepage:
  • Size: 58.6 MB
Statistics
  • Stars: 1
  • Watchers: 3
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
information-retrieval political-science web-search
Created about 11 years ago · Last pushed almost 7 years ago

README.md

TranspoloSearch v2

Web-based information extraction for political science

  • Copyright 2015-18 Vincent Labatut

TranspoloSearch is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation. For source availability and license information see licence.txt

  • Lab site: http://lia.univ-avignon.fr/
  • GitHub repo: https://github.com/CompNet/TranspoloSearch
  • Contact: vincent.labatut@univ-avignon.fr

Description

This software takes the name of a public person and a period, and retrieves all events available online that involve this person during this period. It first performs a Web search using various engines, then retrieves the corresponding Web pages, performs NER (named entity recognition), uses the detected entities to cluster the articles, and treats each cluster as the description of a specific event. It is designed to handle Web pages in French, but should also work for English. It has been used in references [MLE'15] and [ML'17].
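The clustering step described above can be illustrated with a minimal sketch. It is not the TranspoloSearch implementation (the tool relies on JSTAT for clustering, see the Dependencies section): it swaps in a simple greedy grouping based on Jaccard similarity between entity sets, and every name in it (`EntityClusteringSketch`, `entities`, `jaccard`, `cluster`, the 0.4 threshold) is a hypothetical choice made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only: none of these names come from the TranspoloSearch code base. */
class EntityClusteringSketch {

    /** Convenience helper building the entity set of one article. */
    static Set<String> entities(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    /** Jaccard similarity between two entity sets: |A ∩ B| / |A ∪ B|. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) inter.size() / union.size();
    }

    /**
     * Greedy single-pass grouping (an assumption, not the tool's algorithm):
     * an article joins the first cluster that contains an article whose
     * similarity reaches the threshold, otherwise it starts a new cluster.
     * Clusters are returned as lists of article indices.
     */
    static List<List<Integer>> cluster(List<Set<String>> articles, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < articles.size(); i++) {
            List<Integer> target = null;
            for (List<Integer> candidate : clusters) {
                for (int j : candidate) {
                    if (jaccard(articles.get(i), articles.get(j)) >= threshold) {
                        target = candidate;
                        break;
                    }
                }
                if (target != null) {
                    break;
                }
            }
            if (target == null) {
                target = new ArrayList<>();
                clusters.add(target);
            }
            target.add(i);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Made-up entity sets standing in for the NER output of three articles.
        List<Set<String>> articles = Arrays.asList(
            entities("Avignon", "2015-03-12", "conseil municipal"),
            entities("Avignon", "2015-03-12", "budget"),
            entities("Paris", "2015-06-01", "congrès"));
        // Each resulting cluster is read as one event.
        System.out.println(cluster(articles, 0.4));
    }
}
```

In this toy input, the first two articles share a place and a date, so they end up in the same cluster, while the third forms its own event. A real run would need a threshold tuned on annotated data.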

If you use this software, please cite reference [MLE'15]:

    @InProceedings{Marrel2015,
      author    = {Marrel, Guillaume and Labatut, Vincent and El Bèze, Marc},
      title     = {Le {Web} comme miroir du travail politique quotidien~? Reconstituer l'écho médiatique en ligne des événements d'un agenda d'élu},
      booktitle = {13ème Congrès de l'Association Française de Science Politique},
      year      = {2015},
      pages     = {25},
      address   = {Aix-en-Provence, FR},
      url       = {[hal-01904338](http://www.congres-afsp.fr/st/st7/st7marrellabatutelbeze.pdf)},
    }

Organization

The source code takes the form of an Eclipse project. It is organized as follows:

  • Package data contains all the classes used to represent data: articles, entities, etc.
  • Package evaluation contains the classes used to measure the performance of the retrieval tool.
  • Package processing contains the classes related to named entity recognition (NER).
  • Package retrieval contains the classes used to get the Web pages.
  • Package search contains the classes used to perform the Web search.
  • Package tools contains various classes used throughout the software.

The rest of the files are resources:

  • Folder lib contains the external libraries, especially the NER-related ones (cf. the Dependencies section).
  • Folder log contains the log generated during the processing.
  • Folder out contains the articles and the files generated during the process.
  • Folder res contains the XML schemas (XSD files), as well as the configuration files required by certain NER tools.

Installation

First, get the latest version of the project. Second, download the additional data files it requires.

Most of the data files are too large for GitHub's size constraints. For this reason, they are hosted on FigShare. Before using TranspoloSearch, you need to retrieve these archives and unzip them in the Eclipse project.

  1. Go to our FigShare page.
  2. You need the data related to the different NER tools (models, dictionaries, etc.), and you can ignore the corpus files (used for another project).
    • Download all 4 Zip files containing the NER data,
    • Extract the res folder,
    • Put it in the Eclipse project, in place of the existing res folder. Do not remove the existing folder, just overwrite it (we need the existing folders and files).

Finally, some of the NER tools integrated in Nerwip require a key or password to work. This is the case of:

  • Subee: our Wikipedia/Freebase-based NER tool requires a Freebase key to work correctly.
  • OpenCalais: this NER tool takes the form of a Web service.

All keys are set up in the dedicated XML file keys.xml, which is located in res/misc.

Use

For now, there is no interface, not even a command-line one. All the processes need to be launched programmatically, as illustrated by class fr.univavignon.transpolosearch.Test. I advise importing the project in Eclipse and directly editing the source code of this class. A more appropriate interface will be added once the software is more stable. The output folder is out.

Dependencies

Here are the dependencies for TranspoloSearch:

  • Misc.:
    • A bunch of JARs from the Google APIs Client Library for Java
    • jsoup to handle HTML files
    • Apache Commons Codec
    • JSON.simple to parse JSON documents
    • JSTAT to cluster events
  • NER Tools:
    • Libraries:
      • alias-i LingPipe
      • HeidelTime
      • Nero
      • TagEN
      • Certain classes were taken from our own tool Nerwip (and sometimes modified)
    • Web services:
      • Thomson Reuters OpenCalais
      • OpeNER
  • Libraries required by certain NER tools:
    • TreeTagger, needed by HeidelTime.
  • Non-included libraries: some libraries are not included and must be installed manually.
    • OpenFST, needed by Nero (see its README file in folder res, for instructions).
    • Wapiti, needed by Nero (again, see its README file in folder res, for instructions).

Todo

  • Define a black list corresponding to satirical journals (Gorafi, Infos du monde, Nordpresse, Sud ou Est, etc.)
  • Article filtering: once the content has been retrieved, filter out the articles not published during the targeted period. Articles published after the period could still describe events taking place during it, so maybe only filter out the articles published before the period?
  • Article retrieval:
  • Search engines:
    • Add the Duck Duck Go search engine. As of 2017/04/21, the Instant Answer API is too restricted to return results we could use in TranspoloSearch. See this page for a description of the API. Also, it is powered by other search engines already integrated in TranspoloSearch.
    • Add the Yahoo search engine. Apparently, Yahoo is powered by Bing since 2011, so not worth it since we already have Bing.
    • Add the Baidu search engine. As of 2017/04/22, the documentation is in Chinese only (https://www.programmableweb.com/api/baidu).
    • Add the Orange search engine, which focuses on French (http://www.lemoteur.fr/).
    • Add the BoardReader search engine, which focuses on Q/A and Forum websites (http://boardreader.com/).
    • Add the Exalead search engine, originally designed for intranets (https://www.exalead.com/search/web/).
    • Add Twitter support.
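The first two items of this list (satirical black list and date-based article filtering) could be sketched as follows. This is only a possible approach, not existing TranspoloSearch code: the class `ArticleFilterSketch`, its nested `Article` holder, the method names and the placeholder host `satire.example.org` are all assumptions made for illustration. The sketch follows the conservative variant mentioned above, dropping only articles published before the period starts, and it keeps articles whose publication date is unknown.

```java
import java.net.URI;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch only: none of these names come from the TranspoloSearch code base. */
class ArticleFilterSketch {

    /** Minimal, hypothetical article holder: URL plus publication date (null if unknown). */
    static class Article {
        final String url;
        final LocalDate published;

        Article(String url, LocalDate published) {
            this.url = url;
            this.published = published;
        }
    }

    /** Placeholder host; a real list would contain the satirical sites to exclude. */
    static final Set<String> BLACKLIST =
        new HashSet<>(Arrays.asList("satire.example.org"));

    /** True if the host part of the URL is in the black list. */
    static boolean isBlacklisted(String url) {
        String host = URI.create(url).getHost();
        return host != null && BLACKLIST.contains(host.toLowerCase(Locale.ROOT));
    }

    /**
     * Keeps the articles that are not black-listed and were not published
     * before the start of the targeted period. Articles without a known
     * date are kept, since they cannot be ruled out.
     */
    static List<Article> filter(List<Article> articles, LocalDate periodStart) {
        List<Article> kept = new ArrayList<>();
        for (Article a : articles) {
            boolean tooOld = a.published != null && a.published.isBefore(periodStart);
            if (!isBlacklisted(a.url) && !tooOld) {
                kept.add(a);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Article> articles = Arrays.asList(
            new Article("https://news.example.com/a", LocalDate.of(2015, 3, 12)),
            new Article("https://news.example.com/b", LocalDate.of(2014, 12, 31)),
            new Article("https://satire.example.org/c", LocalDate.of(2015, 3, 12)));
        for (Article a : filter(articles, LocalDate.of(2015, 1, 1))) {
            System.out.println(a.url);
        }
    }
}
```

Matching on the exact host keeps the check simple; handling subdomains such as a `www.` prefix would need an extra normalization step.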

References

  • [ML'17] V. Labatut & G. Marrel. La visibilité politique en ligne : Contribution à la mesure de l’e-reputation politique d’un maire urbain, Big Data et visibilité en ligne - Un enjeu pluridisciplinaire de l’économie numérique, 32p, 2017. ⟨hal-01904352⟩
  • [MLE'15] G. Marrel, V. Labatut & M. El Bèze. Le Web comme miroir du travail politique quotidien ? : Reconstituer l'écho médiatique en ligne des événements d'un agenda d'élu, 13ème Congrès de l'Association Française de Science Politique (AFSP), 25p, 2015. ⟨hal-01904338⟩

Owner

  • Name: Complex Networks
  • Login: CompNet
  • Kind: organization
  • Location: Avignon, France
