https://github.com/compnet/transpolosearch

Web-based information extraction for political science

Keywords

information-retrieval political-science web-search

Basic Info

Host: GitHub
Owner: CompNet
License: gpl-2.0
Language: Java
Default Branch: master
Homepage:
Size: 58.6 MB

Statistics

Stars: 1
Watchers: 3
Forks: 0
Open Issues: 0
Releases: 0

Created about 11 years ago · Last pushed about 7 years ago

TranspoloSearch v2

Web-based information extraction for political science

Copyright 2015-18 Vincent Labatut

TranspoloSearch is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation. For source availability and license information see licence.txt

Lab site: http://lia.univ-avignon.fr/
GitHub repo: https://github.com/CompNet/TranspoloSearch
Contact: vincent.labatut@univ-avignon.fr

Description

This software takes the name of a public person and a period, and retrieves all events available online involving this person during this period. It first performs a web search using various engines, then retrieves the corresponding Web pages, performs NER (named entity recognition), uses the detected entities to cluster the articles, and considers each cluster as the description of a specific event. It is designed to handle Web pages in French, but should also work for English. It has been used in references [MLE'15] and [ML'17].
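
To make the clustering step more concrete, here is a minimal sketch of how articles could be grouped by the entities they share. It is not the method implemented in TranspoloSearch (which relies on JSTAT for clustering): the class name, the Jaccard similarity measure, the greedy assignment and the threshold value are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: not a class from the TranspoloSearch code base.
public class EntityClusteringSketch {

    /** Jaccard similarity between two entity sets: |A intersect B| / |A union B|. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    /**
     * Greedy single-pass clustering: each article joins the first cluster whose
     * first member is similar enough, otherwise it starts a new cluster.
     * Clusters are returned as lists of article indices.
     */
    static List<List<Integer>> cluster(List<Set<String>> articleEntities, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < articleEntities.size(); i++) {
            List<Integer> target = null;
            for (List<Integer> c : clusters) {
                Set<String> representative = articleEntities.get(c.get(0));
                if (jaccard(representative, articleEntities.get(i)) >= threshold) {
                    target = c;
                    break;
                }
            }
            if (target == null) {
                target = new ArrayList<>();
                clusters.add(target);
            }
            target.add(i);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Made-up entity sets standing in for the NER output of three articles.
        List<Set<String>> articles = List.of(
            Set.of("Mairie", "Avignon", "2014-03-30"),
            Set.of("Avignon", "2014-03-30", "Conseil municipal"),
            Set.of("Festival", "Paris"));
        System.out.println(cluster(articles, 0.4)); // prints [[0, 1], [2]]
    }
}
```

In this toy run, the first two articles share two of their four distinct entities (similarity 0.5) and end up in the same cluster, which would then be read as one event; the third article shares nothing with them and forms its own cluster.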

If you use this software, please cite reference [MLE'15]:

    @InProceedings{Marrel2015,
      author = {Marrel, Guillaume and Labatut, Vincent and El Bèze, Marc},
      title = {Le {Web} comme miroir du travail politique quotidien~? Reconstituer l'écho médiatique en ligne des événements d'un agenda d'élu},
      booktitle = {13ème Congrès de l'Association Française de Science Politique},
      year = {2015},
      pages = {25},
      address = {Aix-en-Provence, FR},
      url = {http://www.congres-afsp.fr/st/st7/st7marrellabatutelbeze.pdf},
    }

Organization

The source code takes the form of an Eclipse project. It is organized as follows:
* Package data contains all the classes used to represent data: articles, entities, etc.
* Package evaluation contains classes used to measure the performance of the retrieval tool.
* Package processing contains classes related to named entity recognition (NER).
* Package retrieval contains classes used to get the web pages.
* Package search contains classes used to perform the web search.
* Package tools contains various classes used throughout the software.

The rest of the files are resources:
* Folder lib contains the external libraries, especially the NER-related ones (cf. the Dependencies section).
* Folder log contains the log generated during the processing.
* Folder out contains the articles and the files generated during the process.
* Folder res contains the XML schemas (XSD files), as well as the configuration files required by certain NER tools.

Installation

First, get the latest version of the project. Second, download the additional data files it requires.

Most of the data files are too large to be compatible with GitHub constraints. For this reason, they are hosted on FigShare. Before using TranspoloSearch, you need to retrieve these archives and unzip them in the Eclipse project.

Go to our FigShare page.
You need the data related to the different NER tools (models, dictionaries, etc.), and you can ignore the corpus files (used for another project).
- Download all 4 Zip files containing the NER data,
- Extract the res folder,
- Put it in the Eclipse project, in place of the existing res folder. Do not remove the existing folder, just overwrite it (we need the existing folders and files).

Finally, some of the NER tools integrated in Nerwip require some key or password to work. This is the case of:
* Subee: our Wikipedia/Freebase-based NER tool requires a Freebase key to work correctly.
* OpenCalais: this NER tool takes the form of a Web service.

All keys are set up in the dedicated XML file keys.xml, which is located in res/misc.

Use

For now, there is no interface, not even a command-line one. All the processes need to be launched programmatically, as illustrated by class fr.univavignon.transpolosearch.Test. I advise importing the project into Eclipse and directly editing the source code of this class. A more appropriate interface will be added once the software is more stable. The output folder is out.

Dependencies

Here are the dependencies for TranspoloSearch:
* Misc.:
  * A bunch of JARs from the Google APIs Client Library for Java
  * jsoup to handle HTML files
  * Apache Commons Codec
  * JSON.simple to parse JSON documents
  * JSTAT to cluster events
* NER Tools:
  * Libraries:
    * alias-i LingPipe
    * HeidelTime
    * Nero
    * TagEN
    * Certain classes were taken from our own tool Nerwip (and sometimes modified)
  * Web services:
    * Thomson Reuters OpenCalais
    * OpeNER
* Libraries required by certain NER tools:
  * TreeTagger, needed by HeidelTime.
* Non-included libraries: some libraries are not included and must be installed manually.
  * OpenFST, needed by Nero (see its README file in folder res, for instructions).
  * Wapiti, needed by Nero (again, see its README file in folder res, for instructions).

Todo

Define a black list corresponding to satirical journals (Gorafi, Infos du monde, Nordpresse, Sud ou Est, etc.)
Article filtering: once the content has been retrieved, filter out articles not published during the targeted period (they could still describe events taking place during this period, though, so maybe filter out only the articles published before the period?)
Article retrieval:
- Fine-tune the generic reader by considering all the articles in a given corpus which are too short (less than 1,000 characters) or too long (more than 3,000 characters?). One identified problem is the case of pages containing not one article, but rather a list of articles (sometimes with the first paragraph of each article).
- Maybe use the Boilerpipe API instead of our custom tool? (see https://code.google.com/archive/p/boilerpipe and https://github.com/kohlschutter/boilerpipe)
- For some sources, only a part of certain articles is available (restricted access, requiring some sort of registration). We could set up a reader-specific option to either: 1) give up the retrieval in this case (interesting if the restriction is only temporary, e.g. you can access the article next month); or 2) get what we can, i.e. generally the first paragraphs (this is the current behavior, which is appropriate if the rest of the article will never be available).
- Add the specifically defined readers for the following information sites:
- 20 Minutes
- Arrêt sur Images
- Atlantico
- Au Féminin
- BFM TV
- Capital
- Closer
- Dernières Nouvelles d'Alsace
- Europe 1
- France Culture
- France TV
- France TV Info
- Huffington Post France
- JeuxVidéos.com
- L'Equipe
- L'Est Républicain
- L'Opinion
- La Croix
- La Croix du Nord
- La Dépêche du Midi
- La Tribune
- LCP
- Le Dauphiné Libéré
- Le Petit Journal
- Les Echos
- Marianne
- Mediapart
- Nord-Eclair
- Ouest France
- Paris Match
- Paris Normandie
- RFI
- RTBF
- RTL
- Rue 89 (and regional variants, e.g. www.rue89strasbourg.com)
- Sciences et Avenir
- Sud Ouest
- TF1
- Valeurs Actuelles
- Voici
- For these journal-specific readers, we could define a generic process that looks for an HTML element with a predefined class for authors, another for the title, etc. One would just have to define the appropriate classes (or other HTML info), which would make such a reader easier to update.
Search engines:
- Add the Duck Duck Go search engine. As of 2017/04/21, the Instant Answer API is too restricted to return results we could use in TranspoloSearch. See this page for a description of the API. Also, it is powered by other search engines already integrated in TranspoloSearch.
- Add the Yahoo search engine. Apparently, Yahoo is powered by Bing since 2011, so not worth it since we already have Bing.
- Add the Baidu search engine. As of 2017/04/22, the documentation is in Chinese only (https://www.programmableweb.com/api/baidu).
- Add the Orange search engine, which focuses on French (http://www.lemoteur.fr/).
- Add the BoardReader search engine, which focuses on Q/A and Forum websites (http://boardreader.com/).
- Add the Exalead search engine, originally designed for intranets (https://www.exalead.com/search/web/).
- Add Twitter support.
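
Two of the items above (the blacklist of satirical sites and the publication-date filter) could be prototyped along the following lines. This is only a sketch under stated assumptions: the Article class, the host names and the method names are hypothetical and do not come from the TranspoloSearch code base, and the date rule implements just one reading of the Todo item, namely dropping only the articles published before the targeted period.

```java
import java.net.URI;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: not a class from the TranspoloSearch code base.
public class ArticleFilterSketch {

    /** Hypothetical minimal article record: just a URL and a publication date. */
    static final class Article {
        final String url;
        final LocalDate published;

        Article(String url, LocalDate published) {
            this.url = url;
            this.published = published;
        }
    }

    /** Placeholder host names, chosen for illustration only. */
    static final Set<String> BLACKLISTED_HOSTS =
        Set.of("satire.example.org", "parody.example.com");

    /** True if the URL's host is blacklisted (a leading "www." is ignored). */
    static boolean isBlacklisted(String url) {
        // URI.create throws IllegalArgumentException on a malformed URL.
        String host = URI.create(url).getHost();
        if (host == null) {
            return false;
        }
        host = host.toLowerCase(Locale.ROOT);
        if (host.startsWith("www.")) {
            host = host.substring(4);
        }
        return BLACKLISTED_HOSTS.contains(host);
    }

    /** Keeps articles from non-blacklisted hosts published on or after periodStart. */
    static List<Article> filter(List<Article> articles, LocalDate periodStart) {
        List<Article> kept = new ArrayList<>();
        for (Article a : articles) {
            if (!isBlacklisted(a.url) && !a.published.isBefore(periodStart)) {
                kept.add(a);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Article> articles = List.of(
            new Article("https://news.example.net/a1", LocalDate.of(2017, 4, 21)),
            new Article("https://www.satire.example.org/a2", LocalDate.of(2017, 4, 22)),
            new Article("https://news.example.net/a3", LocalDate.of(2016, 12, 1)));
        for (Article a : filter(articles, LocalDate.of(2017, 1, 1))) {
            System.out.println(a.url); // prints https://news.example.net/a1
        }
    }
}
```

Matching on the host name rather than on the full URL keeps the blacklist short, at the cost of not handling sites that publish satire under a sub-path of an otherwise regular domain.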

References

[ML'17] V. Labatut & G. Marrel. La visibilité politique en ligne : Contribution à la mesure de l’e-reputation politique d’un maire urbain, Big Data et visibilité en ligne - Un enjeu pluridisciplinaire de l’économie numérique, 32p, 2017. ⟨hal-01904352⟩
[MLE'15] G. Marrel, V. Labatut & M. El Bèze. Le Web comme miroir du travail politique quotidien ? : Reconstituer l'écho médiatique en ligne des événements d'un agenda d'élu, 13ème Congrès de l'Association Française de Science Politique (AFSP), 25p, 2015. ⟨hal-01904338⟩
