Company: Indigitise
Location: Canberra, ACT
Category: Management

Job description

Requirements

We are seeking a Web Scraping Services professional with the following requirements.
  1. Solution Requirements (Use case)
     a) The solution should be able to manage scraping tasks across more than one hundred websites efficiently, scaling up and down as needed to accommodate changes in demand.
     b) It should have mechanisms in place to handle varying website structures and layouts, adapting to changes without service interruptions.
     c) The solution should support different data formats (HTML, XML, JSON) and be able to extract structured data from them.
     d) Ability to manage multiple scraping tasks simultaneously to improve efficiency without losing performance or reliability.
     e) Capability to rotate IP addresses and use proxies effectively to avoid being blocked by websites and maintain privacy (see the sketch following this list).
     f) Comprehensive monitoring capabilities to track scraping jobs, detect errors, and manage them, including automatic retry mechanisms.
     g) Configuration options to customize scraping behavior according to specific requirements, such as defining scraping intervals or specifying data extraction rules.
     h) Adherence to legal and ethical guidelines, including respecting website terms of service and robots.txt rules, to ensure compliance with data usage policies within the AER.
     i) Integration with the iManage (internal storage) solution to securely store scraped data.
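The proxy rotation, retry and robots.txt items above (e, f and h) are common enough in practice that a brief illustrative sketch may help clarify the intent. The Python snippet below is a minimal, hypothetical example only; the proxy endpoints, user-agent string and retry settings are placeholders and not part of this requirement.

```python
import itertools
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests

# Hypothetical proxy pool; a real service would manage and rotate these automatically.
PROXIES = itertools.cycle([
    "http://proxy-a.example.com:8080",
    "http://proxy-b.example.com:8080",
])

def allowed_by_robots(url: str, user_agent: str = "aer-scraper") -> bool:
    """Check the target site's robots.txt before fetching (requirement h)."""
    parsed = urlparse(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    try:
        robots.read()
    except OSError:
        return True  # robots.txt unreachable; fall back to the site's terms of service
    return robots.can_fetch(user_agent, url)

def fetch(url: str, max_retries: int = 3) -> str | None:
    """Fetch a page through a rotating proxy with simple retry and backoff (requirements e and f)."""
    if not allowed_by_robots(url):
        return None
    for attempt in range(max_retries):
        proxy = next(PROXIES)
        try:
            response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            time.sleep(2 ** attempt)  # exponential backoff before the automatic retry
    return None
```

A production-grade solution would expose the proxy pool, retry policy and monitoring hooks as configuration rather than code.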
  
  2. Functional Requirements
     a) Data extraction
        i) Ability to scrape data from distinct types of websites, including static, dynamic, and AJAX-heavy sites.
        ii) Ability to adapt to changes in website structures and layouts with minimal or no user intervention.
        iii) Support for extracting data from PDFs and other document formats accessible through web pages.
        iv) Capability to parse and extract information from structured formats such as HTML, XML, and JSON.
        v) Capability to rotate IP addresses and use proxies effectively to avoid being blocked by websites and maintain privacy.
        vi) Capability to parse and extract structured data (e.g., tables) from websites and ingest it into structured datasets.
        vii) Customizable data extraction patterns (using XPath, CSS selectors, regular expressions, etc.) and scheduled scraping (cron jobs or similar scheduling systems) to define scraping intervals or other extraction rules (see the sketch following these Functional Requirements).
        viii) Support for incremental crawling to update only new or changed content.
        ix) Automated retry logic for handling failed requests caused by network issues or server errors.
        x) Adherence to legal and ethical guidelines, including respecting website terms of service and robots.txt rules, to ensure compliance with data usage policies within the AER.
     b) Data processing
        i) Text cleaning and preprocessing (removing HTML tags, decoding HTML entities, etc.).
        ii) Data transformation capabilities (format conversion, date normalization, numerical formatting).
        iii) Support for multi-language content extraction and processing.
     c) Data storage
        i) Options for storing scraped data in various formats (CSV, JSON, XML, databases).
        ii) Integration capabilities with databases (SQL and NoSQL) and cloud storage services (Azure Blob Storage, AWS S3).
        iii) Mechanisms for data deduplication and conflict resolution.
     d) User experience
        i) Intuitive, user-friendly, web-based interface for configuring scraping tasks and visualizing progress. The interface should meet the Web Content Accessibility Guidelines (WCAG) 2.1 standard.
        ii) The system should enable users to navigate the system and perform activities with minimal clicks and wait times. After completing initial training, users should be able to reliably complete basic tasks without further instruction (ideally no-code or low-code), and the interface should facilitate these tasks.
        iii) Access to documentation, community forums and support for user assistance.
        iv) Reporting tools for generating analytics and insights from scraped data.
        v) Alerting mechanisms for errors, completion notifications and system health status.
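As a purely illustrative aid for items a) vii) and viii) above (selector-based extraction and incremental crawling), the sketch below shows one common approach in Python; the URL, CSS selector and in-memory hash store are hypothetical stand-ins for configurable settings and a persistent data store.

```python
import hashlib

import requests
from bs4 import BeautifulSoup

# Hash of each page from the previous run; a real system would persist this in a database.
SEEN_HASHES: dict[str, str] = {}

def extract_table(url: str, selector: str = "table#results tr") -> list[list[str]]:
    """Extract a structured table using a configurable CSS selector."""
    html = requests.get(url, timeout=30).text

    # Incremental crawling: skip pages whose content has not changed since the last run.
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if SEEN_HASHES.get(url) == digest:
        return []
    SEEN_HASHES[url] = digest

    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select(selector):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if cells:
            rows.append(cells)
    return rows

# The extracted rows can then be cleaned, deduplicated and exported as CSV/JSON or
# written to a database, as described under Data processing and Data storage above.
```

In a compliant tool the selector, schedule and output format would be exposed as configuration rather than code, consistent with the no-code/low-code expectation under User experience.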
  
  3. Non-functional Requirements
     a) Performance
        i) High throughput to ensure data is extracted and processed quickly.
        ii) Efficient use of resources to minimize the cost of operation on hardware and cloud services.
        iii) Services should be able to cater for growth and changing preferences without impacting performance, functionality, or stability.
     b) Reliability
        i) Fault tolerance to manage system failures gracefully.
        ii) The system should be able to meet an SLA of 99.9% availability.
        iii) Data integrity checks to ensure the accuracy and completeness of the scraped data.
        iv) The system shall ensure the integrity of the data; unintentional alterations must be prevented.
     c) Scalability
        i) Ability to scale horizontally to manage increases in workload.
        ii) Flexible architecture to add new features or integrate with other services without significant overhead.
     d) Integration
        i) Ability to integrate easily with other tools using standard API integration, preferably REST/SOAP APIs.
        ii) Where necessary, a library of pre-built integration adapters should be available to minimize development costs, preferably supplied by the seller rather than by other third-party providers.
     e) Security
        i) All data in the system must be hosted in Australia and subject to Australian law.
        ii) Cloud infrastructure should be Infosec Registered Assessor Program (IRAP) certified.
        iii) The systems and data must be protected in accordance with the Government Information Security Manual (ISM) and the Protective Security Policy Framework (PSPF).
        iv) All data in the system must be owned by the AER, including data in any non-production environments, backups, and archives.
        v) The system must enable Single Sign-On (SSO).
        vi) The system must enable Role-Based Access Control (RBAC).
        vii) Data must be encrypted at all times using the Advanced Encryption Standard or a similar algorithm; alternative approaches that provide equivalent security will be considered. Refer to the ISM for guidance on encryption requirements (see the sketch at the end of these requirements).
        viii) The system must keep an audit trail of actions in a format accessible to authorized users. Audit logs must not be able to be changed or deleted after creation.
  4. Services Requirements
     a) Implementation services that are offered directly by the seller.
     b) Maintenance and support services post-integration, including troubleshooting and best-practice guidance.
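As a minimal sketch of how the encryption requirement e) vii) is often met, the Python cryptography library's Fernet recipe (AES-128 in CBC mode with HMAC authentication) can encrypt scraped records before they are written to storage. The inline key generation and sample record below are placeholders; a compliant deployment would retrieve keys from a managed key store in line with the ISM.

```python
from cryptography.fernet import Fernet

# Placeholder key: a compliant deployment would retrieve this from a managed key store.
key = Fernet.generate_key()
fernet = Fernet(key)

record = b'{"site": "example.com", "price": "42.50"}'

# Encrypt before writing to storage (Fernet uses AES-128-CBC with HMAC authentication).
ciphertext = fernet.encrypt(record)

# Decrypt only when an authorized user or downstream system needs the data.
assert fernet.decrypt(ciphertext) == record
```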

Please feel free to call Manoj on 0468 492 *** or simply click the Apply Now button.
Refer code: 2244336.
