MARCH 2024

USING BIG DATA TECHNIQUES TO MONITOR CORRUPTION RISKS IN PUBLICLY AVAILABLE INFORMATION: A TECHNICAL GUIDE

TRAFFIC REPORT

Antony Bagott
Gabriel Po

ABOUT US
TRAFFIC is a leading non-governmental organisation working globally to ensure that trade in wild species is legal and sustainable, for the benefit of the planet and people.

Reproduction of material appearing in this report requires written permission from the publisher. The designations of geographical entities in this publication, and the presentation of the material, do not imply the expression of any opinion whatsoever on the part of TRAFFIC or its supporting organisations concerning the legal status of any country, territory, or area, or of its authorities, or concerning the delimitation of its frontiers or boundaries.

The research on which this Guide is founded was made possible by the generous support of the American people through the United States Agency for International Development (USAID). The contents are the responsibility of the authors and do not necessarily reflect the views of USAID, the United States Government, or individual Targeting Natural Resource Corruption (TNRC) consortium members.

ABOUT TARGETING NATURAL RESOURCE CORRUPTION
The Targeting Natural Resource Corruption (TNRC) project is working to improve biodiversity outcomes by helping practitioners to address the threats posed by corruption to wildlife, fisheries and forests. TNRC harnesses existing knowledge, generates new evidence, and supports innovative policy and practice for more effective anti-corruption programming. Learn more at tnrcproject.org
Published by: TRAFFIC International, Cambridge, United Kingdom.

SUGGESTED CITATION
Bagott, A. and Po, G., TRAFFIC (2024). Using big data techniques to monitor corruption risks in publicly available information: a technical guide. TRAFFIC International, Cambridge, United Kingdom.

Copyright of material published in this report is vested in TRAFFIC.
UK Registered Charity No. 1076722

DISCLAIMER
This publication is made possible by the generous support of the American people through the United States Agency for International Development (USAID). The contents are the responsibility of the author(s) and do not necessarily reflect the views of USAID, the United States Government, or individual Targeting Natural Resource Corruption project consortium members.

CONTENTS
Introduction
Our Approach: Data Processing Model
Risk Factors
Challenges and Lessons Learned
Results and Conclusions
Appendices: Endnotes, Image Credits
INTRODUCTION

THE ANALYSIS OF LARGE DATA SETS TO UNCOVER TRENDS AND RISKS HAS BECOME COMMONPLACE WITH INCREASES IN COMPUTING POWER AND THE AVAILABILITY OF LARGE DATA REPOSITORIES

Having piloted such approaches to explore corruption risks in forestry concessions in 2022, we have sought to apply our learnings to additional regions and countries in which forestry data are widely available. By using existing technology in an innovative way, TRAFFIC aims to determine whether "big data"¹ approaches to analysing publicly available information can provide credible evidence for monitoring and investigation where corruption in the forestry sector may have occurred. Our model can also be implemented as a preventative tool: when managed well, increasing the visibility of conflicts of interest in contract procurement can serve to decrease instances of corruption.

In this report, we offer technical guidance on using tools to analyse large datasets and reveal the potential corruption risk of individuals or companies involved in the forestry sector. We also outline indicators of corruption risk and describe the challenges faced when working with this kind of information, to assist those seeking to apply similar techniques in the fields of conservation and anti-corruption.
OUR APPROACH

DATA PROCESSING MODEL

The data processing model describes the steps taken to collect, store, sort, analyse and enrich the data. It outlines the major actions, decisions, data types and analytical tools required to reach each result. The model is presented in the flowchart in Figure 1, followed by a full description of each stage in the process.

FIGURE 1: A flowchart outlining the data processing model used. This model assumes that all data is obtained from online sources.

Identification of core data sources
The first step is to identify the core data sources, i.e., the publicly available sources of information that will provide us with the names of entities (such as people and companies) and details of their involvement in forestry (such as their roles and their harvesting rights). Key data sources from which to obtain this information include:

Lists of politically exposed persons (PEPs)²
We search for PEPs within the focal country, aiming to identify those who are linked (even indirectly) to forestry or who have allocated logging permits/concessions to companies. In some cases, PEP lists will contain information about income and expenditure, assets, liabilities, gifts, real estate, vehicles, and more.

Registries of forestry concessions
We search for information relating to the allocation of logging rights within the focal country. As a minimum, this should include the names of the companies involved, the responsible person(s) within each company, and the PEP(s) who allocated the rights. Where available, we also include additional information such as the location allocated, the permitted amount of harvestable timber, and the purpose for harvesting.
FIGURE 2: A website listing forestry concessions granted in a country over the last decade. Some sections with potentially sensitive information have been blurred.

Data collection
Where large volumes of data are involved, automated data collection processes such as web scraping are typically far more efficient than manual processes. The efficiency of these processes depends on the available functionality on the site, the consistency of the site structure, the complexity of the webpage design, the presence of images and/or text, and the technical ability of the person performing the scraping. Depending on the permissions and structure of the source website, we can either download the data via the native site functionality, scrape the website using a pre-built visual web scraper, or scrape the website using a custom web scraper.
Native site functionality
If the site permits users to download the data, for example through an API (Application Programming Interface) or simply through an "Export" button, we use this method.

Pre-built visual web scrapers
If the required data cannot be downloaded using native site functionality, we use pre-built visual web scraping software such as ParseHub to allow for efficient data collection. Visual web scrapers work best in situations where the site structure is consistent, the webpages are laid out in a coherent way, and the data consists of text rather than images.

Custom web scrapers
In cases where a visual web scraper is unable to collect the data, it may be necessary to build a custom web scraper in-house. These would typically be set up for each website. Due to the high cost and time implications of this method, it should be reserved for situations in which no alternative method will work, no other source contains the same information, and there is a high chance that it will yield a large quantity of valuable data.
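To make the custom-scraper option concrete, here is a minimal sketch using Python's requests and BeautifulSoup libraries. The URL, pagination scheme, CSS selectors, and column layout are hypothetical placeholders; a real scraper must be fitted to the target site's structure and respect its permissions and rate limits.

```python
# A minimal custom scraper sketch. The URL, pagination scheme, and CSS
# selectors below are hypothetical placeholders; a real site would need its
# own selectors, permission checks, and rate limits.
import csv
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.org/concessions?page={}"  # placeholder

def scrape_page(page_number: int) -> list[dict]:
    """Fetch one listing page and return one dict per concession row."""
    response = requests.get(BASE_URL.format(page_number), timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    rows = []
    for row in soup.select("table.concessions tr")[1:]:  # skip the header row
        cells = [cell.get_text(strip=True) for cell in row.select("td")]
        if len(cells) >= 3:
            rows.append({"company": cells[0], "official": cells[1], "volume_m3": cells[2]})
    return rows

all_rows = []
for page in range(1, 6):   # scrape the first five listing pages
    all_rows.extend(scrape_page(page))
    time.sleep(2)          # be polite to the server

with open("concessions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["company", "official", "volume_m3"])
    writer.writeheader()
    writer.writerows(all_rows)
```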
Data storage and processing
The way the data is initially stored depends on its format following collection. If the collected data is structured, in tabular format for example, it can be stored in a format such as CSV, XML or JSON for later analysis and import into other software. If the data is unstructured, in narrative format as a PDF for example, we need to perform additional processing on it before it can be understood by most analytical software. The processing is as follows:

FIGURE 3: ParseHub, a visual web scraping tool that provides a simple "point and click" interface to select and scrape elements of a webpage. Some sections with potentially sensitive information have been blurred.

FIGURE 4: Rosoka Text Analytics, which works as a plugin to IBM i2 Analyst's Notebook to visually display connections between people, objects, locations, and events. This image has been blurred to remove potentially sensitive references to people and/or places.

Contextualisation of data
Browsing the contents of a large number of files with different file types can be a time-consuming process within the native Windows file explorer. Digital investigation tools such as Nuix Workstation allow users to examine thousands of unstructured data types, pick out relevant information, and identify entities within the text through regular expressions (regex). Nuix Workstation also performs Optical Character Recognition (OCR) on text documents, ensuring that documents are machine-readable.
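As a simple illustration of the regex step (a sketch, not Nuix's actual implementation), the following pulls candidate entities out of a snippet of OCR'd text. The patterns and the snippet are illustrative only; real documents would need locale-specific patterns for company suffixes, permit references, and so on.

```python
# Sketch of regex-based entity spotting over OCR'd text. The patterns are
# illustrative; real documents need locale-specific variants.
import re

text = """Concession 2021/14 was allocated to Example Timber LLC by
J. Smith on 12/03/2021 for 4,500 m3 of sanitary logging."""

patterns = {
    "company":    re.compile(r"\b[A-Z][\w&. ]+?(?:LLC|Ltd|GmbH|S\.A\.)\b"),
    "date":       re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "volume_m3":  re.compile(r"\b[\d,]+\s*m3\b"),
    "permit_ref": re.compile(r"\b\d{4}/\d+\b"),
}

for label, pattern in patterns.items():
    for match in pattern.finditer(text):
        print(f"{label:>10}: {match.group()}")
```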
Natural Language Processing (NLP)
NLP tools such as Rosoka Text Analytics can find links between entities in the text. When paired with visual analysis software such as IBM i2 Analyst's Notebook, it is possible to view a visual representation of the relationships between entities (people, objects, locations and events) within the text.
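Commercial NLP tools infer typed relationships from the grammar of the text. As a rough stand-in for that link-finding step, the sketch below treats entities that co-occur in the same document as candidate links; the document names and entities are made up, and co-occurrence is a much weaker signal than true relationship extraction.

```python
# A crude stand-in for NLP link extraction: treat entities that appear in the
# same document as potentially related. Edges seen in several documents are
# stronger candidates for a real link.
from collections import Counter
from itertools import combinations

# Hypothetical output of the entity-extraction step: document -> entities found.
doc_entities = {
    "permit_001.pdf": {"Example Timber LLC", "J. Smith"},
    "permit_002.pdf": {"Example Timber LLC", "J. Smith", "Acme Wood Ltd"},
    "permit_003.pdf": {"Acme Wood Ltd", "J. Smith"},
}

edge_counts = Counter()
for entities in doc_entities.values():
    for a, b in combinations(sorted(entities), 2):
        edge_counts[(a, b)] += 1

for (a, b), count in edge_counts.most_common():
    print(f"{a} -- {b}: co-occurs in {count} document(s)")
```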
Identification of entities of interest
Once the data is in a format that can be transferred into other analytical software, it can be queried and analysed to identify entities of interest, i.e., key names and companies.

Cross-referencing data points
A simple way to cross-reference datapoints across two datasets is by using the Fuzzy Lookup Add-In for Excel.³ This ensures that spelling variations, alterations, and typos of the same entity are accounted for. It also detects matches when forename(s), middle name(s), and surname(s) are presented in different orders or are omitted altogether, which is particularly useful when a name is arranged differently according to the naming convention of the language used, or when one part of the name is retained in one source but omitted in another. Users can adjust the level of similarity that causes a match.
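The same idea can be sketched outside Excel. The snippet below scores name pairs with token-level Jaccard similarity, the coefficient the Fuzzy Lookup Add-In is built on (see endnote 3). Tokenising the names makes the comparison order-insensitive, and the threshold plays the role of the add-in's adjustable similarity level; the names are invented.

```python
# Cross-referencing two name lists with token-level Jaccard similarity.
# Tokenising makes "Smith, John" match "John A. Smith" despite reordering.
import re

def tokens(name: str) -> set[str]:
    return set(re.findall(r"\w+", name.lower()))

def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

pep_list = ["John A. Smith", "Maria Fernanda Lopez"]
concession_holders = ["Smith, John", "Maria Lopez", "Acme Wood Ltd"]

THRESHOLD = 0.5  # adjustable, like the add-in's similarity setting
for pep in pep_list:
    for holder in concession_holders:
        score = jaccard(pep, holder)
        if score >= THRESHOLD:
            print(f"possible match ({score:.2f}): {pep!r} ~ {holder!r}")
```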
Producing dashboard data visualisations and social network charts
Dashboard visualisations of data can be produced using a range of programs, including Power BI and IBM i2 Analyst's Notebook. These visualisations can be queried manually, but the programs also provide a number of automated features that facilitate the process. i2 Analyst's Notebook allows users to create a network chart of the entities involved in the datasets and the relationships between them: for example, a professional relationship between an employee and an employer, a financial relationship between a company and a politician, a familial relationship between a father and son, and so on. These analytical tools can help users find entities with multiple degrees of separation, or those with a statistically significant number of relationships with other entities.
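For readers without access to such tools, the sketch below hand-rolls two of those checks on a toy graph: counting each entity's relationships and walking out N degrees of separation. The entities and edges are hypothetical.

```python
# Two network checks on a hypothetical relationship graph: relationship
# counts, and a breadth-first walk out to N degrees of separation.
from collections import deque

edges = [
    ("Politician A", "Company X"),   # financial relationship
    ("Company X", "Employee B"),     # professional relationship
    ("Politician A", "Relative C"),  # familial relationship
    ("Relative C", "Company Y"),
]

graph: dict[str, set[str]] = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

# Unusually well-connected entities may merit a closer look.
for entity, neighbours in sorted(graph.items(), key=lambda kv: -len(kv[1])):
    print(f"{entity}: {len(neighbours)} relationship(s)")

def within_degrees(start: str, max_hops: int) -> set[str]:
    """Return all entities reachable within max_hops degrees of separation."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return seen - {start}

print(within_degrees("Politician A", 2))  # Company Y sits two hops away
```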
Applying a scoring system
In cases where a range of different corruption risk factors from various datasets can be exported on a large scale, it may help to apply a scoring system. We relied on anti-corruption experts inside and outside of TRAFFIC to identify corruption risk factors; more information about the specific factors we explored can be found in the Risk Factors section of this report. The scoring system combines these various risk factors, applies weighting, scales them accordingly, and produces a final risk score.⁴ This score can then be used to assess the relative corruption risk of a large number of different individuals and companies.
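A minimal sketch of such a scoring system follows. The factor names, weights, and multiplier values are illustrative placeholders, not TRAFFIC's actual weighting (which was set with anti-corruption experts); the split between additive scores and multipliers mirrors the Risk Factors section.

```python
# Weighted risk scoring sketch: factor scores are weighted and summed, then
# the total is adjusted by multipliers. All names and values are illustrative.
WEIGHTS = {
    "sanitary_logging": 10,
    "auction_win_rate": 5,
    "unexplained_wealth": 10,
    "price_by_volume": 5,
}

def risk_score(scores: dict[str, float], multipliers: list[float]) -> float:
    base = sum(WEIGHTS[name] * value for name, value in scores.items())
    for m in multipliers:
        base *= m
    return base

entity = {
    "scores": {                 # each factor pre-scaled to the range 0..1
        "sanitary_logging": 0.8,
        "auction_win_rate": 0.6,
        "unexplained_wealth": 0.0,
        "price_by_volume": 0.3,
    },
    "multipliers": [1.5, 1.2],  # e.g. many contracts, PEP status
}

print(risk_score(entity["scores"], entity["multipliers"]))  # prints 22.5
```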
Data enrichment
Once entities of interest have been identified, the data we hold can be enriched using social media sites, company registries, and leaked documents. This process usually requires some level of manual open-source investigation, but it can be partially automated through online investigation platforms such as Videris and through data repositories such as OCCRP Aleph. Additional potential insights include evidence of beneficial ownership, political financing, sanctions, and any held assets or wealth. Lists of company officers and evidence of social and professional associations may also prove useful.

FIGURE 5: Videris, an online investigation platform that allows users to search for up-to-date data about individuals and companies from a range of online sources, visualise the information in a relationship chart, and enrich it further as needed.

RISK FACTORS

Risk factors are used to provide context to the relative corruption risk of each forestry actor. They are particularly important when applying a scoring system. In this case, risk factors should be split into "scores" and "multipliers". "Scores" are used for factors that indicate an increased risk of corruption; these are added together to create an initial risk score. "Multipliers" are used for factors that do not necessarily indicate additional corruption risk by themselves, but could increase the impact of any existing risk(s); these are used to multiply the initial risk score to create the final score.

Risk factor "scores" that have been used effectively include the following (a sketch after this list shows how such factor values might be derived from raw contract records):
Sanitary logging
Concessions listed as "sanitary" logging could be based on documentation with false information, i.e., falsely stating that the trees are infected or dead. According to an expert we consulted, between 3% and 6% sanitary clearance is standard. Therefore, anything over 10% could indicate a level of risk, particularly if a large volume of sanitary wood is permitted to be logged within a certain concession area.

Percentage of auctions won
A single person or company being consistently or regularly successful in their bids for forestry contracts could indicate a greater risk of corruption, bribery, or nepotism in the bidding process, especially for large contracts (i.e., those with a higher revenue).

Unexplained wealth
If PEPs have unexplained wealth, such as undeclared property or assets, monetary assets that exceed their salary, luxury cars, and so on, there is a higher potential risk of corruption.

Price by volume
Contracts that were exceptionally cheap (i.e., with a low price per cubic metre of wood) or exceptionally expensive (with a high price per cubic metre) could indicate corruption: someone receiving a suspiciously good deal could point to nepotism or bribery, while someone paying an excessive amount for a contract could indicate a risk of money laundering.
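As referenced above, here is a sketch of deriving two of these factor values, the sanitary logging share and the auction win rate, from raw contract records. The records and field names are hypothetical; the 10% flag follows the expert baseline quoted above.

```python
# Deriving factor values from hypothetical contract records. The 10% sanitary
# threshold follows the expert guidance cited in the text.
contracts = [
    {"actor": "Acme Wood Ltd", "type": "sanitary", "volume_m3": 900,  "won": True},
    {"actor": "Acme Wood Ltd", "type": "standard", "volume_m3": 4000, "won": True},
    {"actor": "Acme Wood Ltd", "type": "standard", "volume_m3": 2000, "won": False},
]

def sanitary_share(actor: str) -> float:
    own = [c for c in contracts if c["actor"] == actor]
    total = sum(c["volume_m3"] for c in own)
    sanitary = sum(c["volume_m3"] for c in own if c["type"] == "sanitary")
    return sanitary / total if total else 0.0

def win_rate(actor: str) -> float:
    own = [c for c in contracts if c["actor"] == actor]
    return sum(c["won"] for c in own) / len(own) if own else 0.0

share = sanitary_share("Acme Wood Ltd")
print(f"sanitary share: {share:.1%}")            # ~13.0%, above the 10% mark
print(f"auction win rate: {win_rate('Acme Wood Ltd'):.1%}")
if share > 0.10:
    print("flag: sanitary share exceeds the 3-6% expert baseline")
```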
Risk factor "multipliers" that have been used effectively include:

Number of contracts
In cases where forestry actors are corrupt, those with a larger number of contracts are more likely to have a negative impact on forestry areas (and are more likely to be involved in higher-level corruption) than forestry actors with a smaller number of contracts. However, contract number alone is not an accurate indicator of risk.

PEP status
Forestry actors who are also PEPs, or are related to or affiliated with PEPs, are more at risk of having a conflict of interest when dealing with the allocation of timber harvest rights. However, in countries where forestry areas are often state-owned, PEP status alone is not an accurate indicator of risk.

Other factors that we hope to include in future methodologies include:

Tree cover loss data
By comparing forestry concession harvesting limits with satellite data on tree cover loss, using the interactive map provided by Global Forest Watch, we could determine whether certain forestry actors appear to have exceeded their permitted harvesting over a set period of time. This risk factor is complicated by the irregularity of tree sizes due to variations in species and age, meaning it could be difficult to determine the volume of wood that equates to a certain area of tree cover loss. Before undertaking this kind of analysis, it is recommended to research local laws regarding the minimum size of trees that are permitted to be harvested; basing estimates on that figure would allow for at least a ballpark figure of the volume of extraction taking place (a rough arithmetic sketch follows this list).

Political cycle dynamics
Undertaking a comparative analysis of the political cycle (i.e., changes in government) against the concession data could highlight cases where political influence has had a statistical impact on where and to whom harvest rights are allocated.

Natural calamity data
Although not a risk factor in itself, natural calamity data can help to distinguish justified sanitary logging from illegal sanitary logging.
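A back-of-the-envelope sketch of the tree cover comparison described above follows. Every number here is a hypothetical placeholder: real values would come from Global Forest Watch data and from local rules on minimum harvestable tree sizes.

```python
# Ballpark check of extraction volume implied by tree cover loss against the
# permitted volume. All inputs are hypothetical placeholders.
loss_hectares = 120.0           # tree cover loss within the concession (satellite)
est_m3_per_hectare = 150.0      # rough yield assumed from local minimum-size rules
permitted_volume_m3 = 10_000.0  # volume allowed under the concession

estimated_extraction = loss_hectares * est_m3_per_hectare  # 18,000 m3
overshoot = estimated_extraction / permitted_volume_m3     # 1.8x the permit

if overshoot > 1.0:
    print(f"estimated extraction is {overshoot:.1f}x the permitted volume; "
          "worth a closer look, given the uncertainty in the estimate")
```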
CHALLENGES AND LESSONS LEARNED

Identification of core data sources
We have encountered challenges with the automated collection of data from some sources due to the way the data is presented. These challenges are listed below.

Website design
It is not always easy to access information on multiple concessions or permits at once (e.g., through an API or a bulk download to Excel). In these cases, using a web scraper is necessary to avoid an inefficient data extraction process.

Sources with imagery
Sources that present their concession data through an image, such as a map, are useful for exploring individual concessions but prevent the use of a standard visual web scraper due to their unstructured nature. In these cases, a custom web scraper may be required.

Poor quality photocopies
Some sources provide data as photocopies of text documents in PDF format. In some cases, these photocopies are of poor quality and have not yet undergone OCR to extract the text they contain; poor scan quality can prove a problem for OCR software. Quality improvements can be performed on PDFs through image manipulation software, including changes to contrast, brightness, sharpening, black/white balance, and more. These improvements are limited by the initial quality of the original source. In all cases, visual improvements constitute a manual task, so they should only be performed when there is a high chance that the target document will yield valuable data.
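As a sketch of those quality improvements, the following uses the Pillow imaging library in Python. The file name and tuning values are placeholders; the right settings depend on the original scan.

```python
# Pre-OCR clean-up of a poor photocopy using Pillow. The input file and the
# enhancement factors are placeholders to be tuned per document.
from PIL import Image, ImageEnhance

page = Image.open("photocopy_page.png").convert("L")  # greyscale

page = ImageEnhance.Contrast(page).enhance(1.8)       # boost contrast
page = ImageEnhance.Brightness(page).enhance(1.1)     # lighten slightly
page = ImageEnhance.Sharpness(page).enhance(2.0)      # sharpen edges

# A simple black/white threshold can also help OCR on noisy photocopies.
page = page.point(lambda value: 255 if value > 140 else 0)

page.save("photocopy_page_cleaned.png")               # then run OCR on this
```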
Variation in data availability
Different websites make available different levels of data. For example, some concession sources might only cover a specific region or a certain type of contract. This makes it difficult to make comparisons between countries, and requires that the user fully understands which data are made available or omitted. Where there is a high level of variability in the availability of datasets, using risk factors that are relative rather than absolute (i.e., looking at the ratio rather than the actual difference between separate datapoints) can help to soften the impact of any missing information. For example, focusing on the proportion, rather than the absolute number, of sanitary logging contracts per forestry actor can ensure that comparisons can continue to be made even when certain types of contracts are omitted. However, this should be done on a case-by-case basis, and care should be taken to avoid misrepresenting the data.

Lack of clear metadata and guidance
Understanding the exact nature of the data being made available is often not straightforward, and often requires the user to undertake significant research to avoid drawing ill-informed conclusions. Where sufficient metadata or guidance does not exist, speaking with, or reading reports by, other researchers who have used the same data can help to improve a user's understanding of the data and the efficiency of their analysis. Having a user with an understanding of forestry is beneficial, as they can understand any nuances in the data and have experience of finding third-party resources (such as NGO reports) to inform their work.

Language barriers
The language skills of the user undertaking the research can affect the speed and accuracy of the analysis. Translation tools such as Google Translate and DeepL can help to an extent, but care should be taken to avoid mistranslations. For example, when translating proper nouns (e.g., family names) that have an equivalent literal meaning (e.g., "Smith", "Miller", "Hunter"), online translation tools regularly provide the literal translation rather than the family name itself.
Data collection: Web scraping
Visual web scrapers present several technical challenges. These challenges and their solutions are listed below; a short sketch after this list shows the encoding and file-format fixes in code.

Data scraping produces no results
Certain scraping projects produce no results in web scraping platforms. This issue may have a variety of causes, including the following.

Elements fail to be selected on the page: Visual web scrapers aim to intelligently select the appropriate elements (i.e., the images, text, etc.) on each web page, based on the pattern of elements selected on previous pages. Sometimes this does not proceed as planned due to inconsistent page design. In these instances, the web scraper needs to be "trained" using additional pages.

Authentication required: Some websites require a login before data can be scraped. In these instances, the action of logging into the website needs to be added as the first step of the project.

Long load times: Web scrapers will occasionally run too quickly for slower websites. In these instances, the project can be set to wait a specified amount of time for elements to appear. This will decrease the speed of the scrape, depending on the wait time set per element.

Web scraper is blocked
Some sites deploy technology that prevents web scrapers from accessing the data. In these instances, the IP Rotation feature can be turned on to avoid the block. However, doing so will significantly decrease the speed of the scrape.

Data appears with unreadable characters
When exported to CSV and viewed in Excel, datasets that contain non-Latin script (e.g., Cyrillic) do not appear as expected. This is because Excel does not treat the exported CSV as UTF-8 encoded by default. To view the script, we need to tell Excel to treat the file as UTF-8 encoded: we import the file into Excel instead of opening it directly and, when doing so, choose the appropriate encoding (UTF-8) so that the data is imported correctly.

Document data is in another language
To allow for consistency between datasets and ease of use, it may be necessary to translate datasets from their original language into English. For large datasets, the entire document can be translated through third-party translation tools available online.

Datasets do not convert into CSV
Certain files are too large to be converted into CSV; attempting the conversion will simply cause the software to crash. In these cases, the data should be downloaded in another format such as JSON, and any analysis should be run using the JSON file.
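As referenced above, here is a short sketch of the encoding and file-format fixes in code, assuming hypothetical export files named concessions.csv and concessions.json.

```python
# Reading scraped exports safely: an explicit UTF-8 encoding preserves
# non-Latin script (the equivalent of Excel's import-with-UTF-8 step), and
# oversized exports can be analysed as JSON instead of converting to CSV.
import csv
import json

# "utf-8-sig" also tolerates a byte-order mark, which some exports include.
with open("concessions.csv", encoding="utf-8-sig", newline="") as f:
    rows = list(csv.DictReader(f))
print(rows[0])  # Cyrillic and other non-Latin names now appear correctly

# For files too large to convert to CSV, work with the JSON directly.
with open("concessions.json", encoding="utf-8") as f:
    records = json.load(f)
print(len(records), "records loaded from JSON")
```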
Identification of entities of interest: Applying a scoring system

Wide diversity of parameters and datapoints results in an unbalanced range of scores
When corruption risk factor scores are based on a diverse set of parameters that span a wide range with a handful of anomalous datapoints at one or both ends, the scores can be skewed: the few datapoints at either end of the range garner very large or very small scores, while most datapoints remain in a homogenous group in one section of the range. This reduces the effectiveness of the final risk score. Converting scores to a logarithmic form (e.g., base 10) solves this problem by changing the curve in the "line" of datapoints and improving the spread of scores across the range. This can be performed easily using the built-in functionality of a wide range of platforms, e.g., the LOG function in Excel or the log10() function in R. Scaling these scores down to manageable numbers (10, 5, 1, etc.) improves the readability of the data while also allowing us to add a "weighting" to each score based on our determination of the severity of the risk factor.
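The sketch below walks through both steps on a handful of illustrative raw scores: a base-10 log transform, followed by the rescaling formula given in endnote 4.

```python
# Log-converting and rescaling skewed risk factor scores. The raw scores are
# illustrative; the rescale mirrors the formula in endnote 4.
import math

raw_scores = [2, 15, 120, 980, 24000]  # wide range with outliers

# Step 1: log10 compresses the extremes and spreads out the middle.
logged = [math.log10(s) for s in raw_scores]

# Step 2: rescale into a target range (here 0-10) so weighting can be applied.
def rescale(r, r_min, r_max, t_min=0.0, t_max=10.0):
    return t_min + (r - r_min) * (t_max - t_min) / (r_max - r_min)

lo, hi = min(logged), max(logged)
scaled = [round(rescale(v, lo, hi), 2) for v in logged]
print(scaled)  # [0.0, 2.15, 4.36, 6.59, 10.0]
```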
FIGURE 6: A comparison of charts of sanitary logging scores (based on the percentage and volume of sanitary logging in the analysed forestry contracts) before and after conversion to logarithmic form, showing the improvement in the spread of scores across the range.

RESULTS

Due to the sensitive nature of the data, we cannot share the specific results of our analysis here. Instead, we can outline the type of results that we produced. The results of our data processing model include the following outputs.

Scoring Compilation
The Scoring Compilation is a list of all individuals and companies involved, allowing for a large-scale comparison of the relative potential corruption risk between entities. Our Scoring Compilation includes all datasets used, as well as the formulas employed to reach our final scores.

FIGURE 7: An anonymised version of TRAFFIC's Scoring Compilation, showing each of the score categories that contribute to the total risk score for the forestry performers/PEPs/bidders in the list.
Risk Profiles
Based on the information provided in the Scoring Compilation, we can create risk profiles of specific entities to allow for deeper insight into high-risk individuals or companies. The Risk Profiles created by TRAFFIC consist of a personal profile detailing the main information known about the PEP, an overview of the total risk rating and the contributing risk factors, a list of suggested actions, a typology chart, and a catalogue of the sources used.

Our results will be shared with relevant financial institutions to follow up on the high-risk individuals and companies that we've identified, to allow further research into their financial and business relationships and to take action where needed. By sharing the methodology and the process that we've undertaken, we also hope to enable others to apply similar techniques to uncover potential corruption in our focal countries and elsewhere.

It should be noted that although the data processing model outlined here can help to identify potential instances of corruption (indeed, in one case we independently identified a PEP as a high corruption risk a few months before they were convicted of embezzling public funds), we as an NGO cannot accurately verify the data on our own. It is with the support of financial institutions that we can take our insights further and find verifiable proof of corruption through their existing systems and processes. A suggestion for taking this work further would be to collate this proof and verify whether the risk ratings in our Scoring Compilation match up with real-world corruption.

CONCLUSION

In sectors with a large number of repeated contracts, such as forestry, the use of big data techniques can be a fruitful method of monitoring and investigating corruption risks. As long as the data are made publicly available, this control system can be used not only by internal government agencies but also by the media or civil society, helping to ensure accountability even when supply chains are suspected of being mired in corruption. Furthermore, if data from timber supply chains, including cross-border trade, were made available in other countries, big data approaches such as these could bring further opportunities for ensuring that trade in natural resources is legal and sustainable.

FIGURE 8: An anonymised typology chart detailing the connections between a PEP, their associates, logging concessions, and any corruption-related incidents.
APPENDICES

ENDNOTES
1. Throughout the project, we processed types of information that are defined as big data (e.g., social media interactions) and employed approaches that are typically used on big data (e.g., automated collection, processing, and analysis). However, it should be noted that not all the datasets we encountered can be defined as "big data", i.e., extremely high-volume, high-velocity, and/or high-variety datasets that require automated analysis to reveal patterns or trends.
2. In some countries, there exist official PEP lists consisting of information that the PEPs themselves have submitted, because public officials in these countries are required to declare information about their assets. However, there is massive variability in these regulations across the globe. Where these regulations do not exist, independent transparency organisations have in certain cases compiled similar lists. In other cases, these lists don't exist at all, in which case it is more difficult to undertake this kind of work.
3. The Fuzzy Lookup Add-In uses the Jaccard similarity coefficient to measure the similarity between finite sets of objects. This is defined as the size of the intersection of the two sets divided by the size of the union of the two sets, as follows (where J is the Jaccard similarity coefficient, A is set 1, and B is set 2):

   J(A, B) = |A ∩ B| / |A ∪ B|

4. Converting the initial scores to a logarithmic scale can help to even out any excessively large ranges. The following formula can then be used to "scale" the scores, i.e., to convert the measurement r from its original range (from r_min to r_max) into a weighted target range (from t_min to t_max):

   t = t_min + (r − r_min)(t_max − t_min) / (r_max − r_min)

IMAGE CREDITS
Cover: jensenartofficial / Pixabay; p.2: A. Walmsley / TRAFFIC; p.4: TheDigitalArtist / Pixabay, edited with Carla McMahon / iStock; p.6: franganillo / Pixabay; p.12: TRAFFIC / A. Walmsley; p.14: geranimo / Unsplash; p.17: BrianPenny / Pixabay

JANUARY 2024

TRAFFIC
+44 (0)1223 331 997
traffic.org

WORKING TO ENSURE THAT TRADE IN WILD SPECIES IS LEGAL AND SUSTAINABLE, FOR THE BENEFIT OF THE PLANET AND PEOPLE